# Pre-requisites

**Required:** The dataset JSON file `parrot-qa.json`, generated using the `parrot-qa/dataset` repository.

Upload it to a `data` directory in the notebook's working directory; the cells below read it from `data/parrot-qa.json`.



In [None]:
# Install packages

!pip install --upgrade pip

!pip install datasets
!pip install nltk rouge_score

#!pip install farm-haystack[colab,faiss]
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,faiss]

In [None]:
# Make sure you have a GPU running
!nvidia-smi

# Step 1: Dense Passage Retrieval

We will use the DPR model introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). 

Original Code: https://fburl.com/qa-dpr

The original reference notebook is [here](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial6_Better_Retrieval_via_DPR.ipynb).
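
For orientation, DPR as described by Karpukhin et al. uses two separate encoders, one for questions ($E_Q$) and one for passages ($E_P$), and scores a question-passage pair by the dot product of the two embeddings:

$$
\mathrm{sim}(q, p) = E_Q(q)^{\top} E_P(p)
$$

Retrieval then amounts to returning the passages with the highest $\mathrm{sim}(q, p)$ for a given question $q$.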


In [None]:
# Constants

# Split documents into passages of at most this many units (words, assuming
# PreProcessor's default split_by); the module respects sentence boundaries.
PREPROC_SPLIT_LEN_DOC = 100

MAX_SEQ_LEN_QUERY = 256
MAX_SEQ_LEN_PASSAGE = 128
RETRIEVER_BATCH_SIZE = 16

RETRIEVER_TOP_K = 5
READER_TOP_K = 5
USE_CONTEXT_FROM = 'retriever'  # 'retriever' or 'reader'

### Cleaning & Indexing

We group documents by course and index them into the DocumentStore.
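
As a minimal sketch of the grouping and title clean-up step, the snippet below runs on a few made-up records. The sample data is hypothetical; only the field names (`course`, `article_title`, `passage_text`) are taken from the code in this notebook.

```python
import re
from collections import defaultdict

# Hypothetical sample records; only the field names mirror the real dataset.
sample_documents = [
    {"course": "cs101", "article_title": "Lecture 1: Intro (v2.0)",
     "passage_text": "Variables store values."},
    {"course": "cs101", "article_title": "Lab #3",
     "passage_text": "Loops repeat work."},
    {"course": "ml201", "article_title": "Week-1 notes",
     "passage_text": "Gradients point uphill."},
]


def format_title(title):
    # Keep only runs of letters, digits, dots and hyphens, joined by single spaces.
    return " ".join(re.findall(r"[a-z0-9.-]+", title, re.IGNORECASE))


def group_by_course(documents):
    # Build one list of {'content', 'meta'} dicts per course.
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc["course"]].append({
            "content": doc["passage_text"],
            "meta": {"name": format_title(doc["article_title"])},
        })
    return dict(grouped)


grouped = group_by_course(sample_documents)
for course, docs in grouped.items():
    print(course, [d["meta"]["name"] for d in docs])
```

Punctuation such as `:`, `#` and parentheses is dropped from the titles, while dots and hyphens inside tokens (`v2.0`, `Week-1`) are kept.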

In [None]:
import re
import json

from haystack.nodes import PreProcessor


def _format_title(title):
    title = ' '.join(re.findall(r'[a-z0-9.-]+', title, re.IGNORECASE))
    return title


def _get_answer(answers):
    max_val = max(answers['score'])
    max_idx = answers['score'].index(max_val)
    return answers['text'][max_idx]


def extract_docs(dataset):
    # Store one list of documents per course
    docs_db = {}

    for doc in dataset['documents']:
        course = doc['course']
        if course not in docs_db:
            docs_db[course] = []
        docs_db[course].append({
            'content': doc['passage_text'],
            'meta': {'name': _format_title(doc['article_title'])},
        })

    preproc = PreProcessor(split_length=PREPROC_SPLIT_LEN_DOC)
    for course in docs_db.keys():
        docs_db[course] = preproc.process(docs_db[course])

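    # It seems preproc sometimes ends up with duplicate IDs, so reassign them manually.
    # IDs are only unique within a course, which is fine because each course
    # gets its own document store below.
    for docs in docs_db.values():
        for idx, doc in enumerate(docs):
            doc.id = f'd{idx}'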

    return docs_db


def extract_qa_pairs(dataset):
    # Store one list of QA pairs per course
    qa_db = {}

    for qa in dataset['qa_pairs']:
        course = qa['course']
        if course not in qa_db:
            qa_db[course] = []
        if not qa['is_answerable']:
            continue
        qa_db[course].append({
            'question': qa['title'],
            'answer': _get_answer(qa['answers'])})

    return qa_db


with open("data/parrot-qa.json") as f:
    dataset = json.load(f)

docs_db = extract_docs(dataset)
qa_db = extract_qa_pairs(dataset)

### Document Store & Retriever

#### FAISS

FAISS is a library for efficient similarity search and clustering of dense vectors.
The `FAISSDocumentStore` uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried when searching for answers.
The default flavour of `FAISSDocumentStore` is "Flat", but it can also be set to "HNSW" for
faster search at the expense of some accuracy. Just set the `faiss_index_factory_str` argument in the constructor.
For more info on which index suits your use case, see https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
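
To make the "Flat" option concrete, here is a rough pure-Python sketch of what an exhaustive index does conceptually: score every stored vector against the query and keep the best `top_k`. This is an illustration only, not FAISS code; the vectors are made up, and dot-product scoring is assumed because that is what DPR embeddings are compared with.

```python
import heapq


def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))


def flat_search(query_vec, passage_vecs, top_k):
    # Exhaustive search: every stored vector is scored, so the result is exact.
    scored = [(dot(query_vec, vec), idx) for idx, vec in enumerate(passage_vecs)]
    # Keep the top_k highest-scoring entries, best first.
    return [(idx, score) for score, idx in heapq.nlargest(top_k, scored)]


# Made-up 3-dimensional "embeddings" for four passages.
passages = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.0, 0.1, 0.9],
]
query = [1.0, 0.0, 0.0]

results = flat_search(query, passages, top_k=2)
print(results)  # list of (passage index, score) pairs, best match first
```

The cost grows linearly with the number of stored vectors. Approximate indexes such as HNSW avoid scoring every vector, which is where the speed-up and the loss in accuracy both come from.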

#### Retriever

**Here:** We use a `DensePassageRetriever`

**Alternatives:**

- The `ElasticsearchRetriever` with custom queries (e.g. boosting) and filters
- Use `EmbeddingRetriever` to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
- Use `TfidfRetriever` in combination with a SQL or InMemory Document store for simple prototyping and debugging

In [None]:
import os

from haystack.nodes import DensePassageRetriever
from haystack.document_stores import FAISSDocumentStore

In [None]:
# For each course, embed the pool of documents and create retrievers

dpr_db = {}

for course, docs in docs_db.items():
    db_file = f'data/faiss_document_store_{course}.db'
    if os.path.isfile(db_file):
        os.remove(db_file)
    document_store = FAISSDocumentStore(
        sql_url=f"sqlite:///{db_file}",
        faiss_index_factory_str="Flat",
    )
    document_store.write_documents(docs, duplicate_documents='fail')

    retriever = DensePassageRetriever(
        document_store=document_store,
        query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
        passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
        max_seq_len_query=MAX_SEQ_LEN_QUERY,
        max_seq_len_passage=MAX_SEQ_LEN_PASSAGE,
        batch_size=RETRIEVER_BATCH_SIZE,
        use_gpu=True,
        embed_title=True,
    )
    document_store.update_embeddings(retriever)

    dpr_db[course] = retriever


### Reader

Here we use a `FARMReader` with the *deepset/roberta-base-squad2* model (see https://huggingface.co/deepset/roberta-base-squad2).


In [None]:
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)

### Pipeline

With a Haystack `Pipeline` you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
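
The idea of chaining nodes and passing per-node parameters can be illustrated with a toy stand-in. Everything in this sketch (`ToyPipeline`, `make_retriever`, `first_sentence_reader`, the corpus) is hypothetical and is not Haystack's implementation; it only models a linear chain, the simplest case of a DAG, and mimics the `params={"NodeName": {...}}` calling style used in the cells below.

```python
class ToyPipeline:
    """Runs named nodes in insertion order, threading a state dict through them."""

    def __init__(self):
        self.nodes = []  # list of (name, callable) in execution order

    def add_node(self, component, name):
        self.nodes.append((name, component))

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        for name, component in self.nodes:
            # Each node receives the state plus its own keyword parameters.
            state = component(state, **params.get(name, {}))
        return state


def make_retriever(corpus):
    def retrieve(state, top_k=3):
        # Toy ranking: count whitespace-separated words shared with the query.
        query_words = set(state["query"].lower().split())
        ranked = sorted(
            corpus,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,
        )
        return {**state, "documents": ranked[:top_k]}
    return retrieve


def first_sentence_reader(state, top_k=1):
    # Toy "reader": treat the first sentence of each top document as the answer.
    answers = [doc.split(".")[0] for doc in state["documents"][:top_k]]
    return {**state, "answers": answers}


corpus = [
    "Gradient descent updates weights. It uses the gradient.",
    "A stack is a LIFO structure. Push and pop are its operations.",
    "Backpropagation computes the gradient. It applies the chain rule.",
]

pipe = ToyPipeline()
pipe.add_node(make_retriever(corpus), name="Retriever")
pipe.add_node(first_sentence_reader, name="Reader")

result = pipe.run(
    query="which method computes the gradient via chain rule",
    params={"Retriever": {"top_k": 2}, "Reader": {"top_k": 1}},
)
print(result["answers"])
```

The retriever narrows the corpus to `top_k` candidates and the reader only looks at those, which is the same division of labour as in the retriever-reader pipeline used here.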

In [None]:
from haystack.pipelines import Pipeline, ExtractiveQAPipeline


def attach_context_retriever(qa_db, dpr_db):
    for course, pairs in qa_db.items():
        pipe = Pipeline()
        pipe.add_node(component=dpr_db[course], name='Retriever', inputs=['Query'])
        for qa in pairs:
            context = pipe.run(
                query=qa['question'],
                params={"Retriever": {"top_k": RETRIEVER_TOP_K}}
            )
            qa['contexts'] = [doc.content for doc in context['documents']]


def attach_context_reader(qa_db, dpr_db):
    for course, pairs in qa_db.items():
        pipe = ExtractiveQAPipeline(retriever=dpr_db[course], reader=reader)
        for qa in pairs:
            prediction = pipe.run(
                query=qa['question'],
                params={"Retriever": {"top_k": RETRIEVER_TOP_K}, "Reader": {"top_k": READER_TOP_K}}
            )
            qa['contexts'] = [ans.context for ans in prediction['answers']]


if USE_CONTEXT_FROM == 'retriever':
    attach_context_retriever(qa_db, dpr_db)
elif USE_CONTEXT_FROM == 'reader':
    attach_context_reader(qa_db, dpr_db)
else:
    raise RuntimeError('Invalid configuration for selecting context.')


In [None]:
# Statistics

lengths = []
for course, pairs in qa_db.items():
    for qa in pairs:
        for context in qa['contexts']:
            lengths.append(len(context))

print('Average context length (characters):', round(sum(lengths) / len(lengths)))

### Export

In [None]:
# Write contextualized QA pairs to JSON

qa_export = []
for course, pairs in qa_db.items():
    for qa in pairs:
        item = {'course': course}
        item.update(qa)
        qa_export.append(item)

with open('data/parrot-qa-ctx.json', 'w') as f:
    json.dump(qa_export, f, indent=4)
