<a href="https://colab.research.google.com/github/leomaurodesenv/big-qa-architecture/blob/main/jupyter/2_Document_Retriever_Experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Retriever Experiments

In Question Answering (QA), a system searches a collection of documents to answer a user's question. It works in two main steps: (1) Document Retriever: retrieves the documents most likely to contain the answer to a given question; (2) Document Reader: a machine reader examines the retrieved documents and frames an answer.

This Jupyter Notebook focuses on Document Retriever experiments, motivated by the fact that a retriever with higher recall leads to higher end-to-end question answering performance.

Blog post: [Automatic Question Answering — Document Retriever (Machine Learning)](https://medium.com/wearesinch/automatic-question-answering-document-retriever-machine-learning-f4f473387739)


---
## Setup

Package installation and setup.

### Run Configuration

Choose the dataset and the Document Retriever algorithm.

In [1]:
import enum

class Dataset(enum.Enum):
    '''Dataset options'''
    SQuAD = 1
    AdvQA = 2
    DuoRC = 3

class DocRetriever(enum.Enum):
    '''Document Retriever options'''
    BM25  = 1
    TFIDF = 2
    DPR   = 3

In [2]:
# run configuration
NUM_K         = 3 # 3, 10, 20
DATASET       = Dataset.SQuAD
DOC_RETRIEVER = DocRetriever.BM25

### Package Installation

Install Haystack and HuggingFace packages.

In [3]:
# Check whether a GPU is available
# The code also runs on CPU
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [4]:
# %%capture
# Install Haystack
!pip install pip==22.2.2 --quiet
!pip install farm-haystack[colab]==1.8.0 --quiet
# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]

# Install the Hugging Face packages
!pip install datasets==2.4.0 --quiet
!pip install transformers==4.20.1 --quiet
!pip install sentence-transformers==2.2.2 --quiet
!echo "Silent installation with success!"


### Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.

In [5]:
import logging

# Setup Haystack logging format
logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

---
## Document Store

We use Elasticsearch as the Document Store. It supports [full-text queries](https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html), [BM25 retrieval](https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25), and [dense vectors for text embeddings](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/dense-vector.html).

### Starting Elasticsearch
We manually download and run the Elasticsearch server.

In [6]:
# In Colab / environments without Docker: start Elasticsearch from the downloaded archive
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
!tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
!sleep 30
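
A fixed 30-second sleep can be too short on a slow machine and needlessly long on a fast one. As a hedged alternative, the sketch below polls an HTTP endpoint until it responds. The helper name `wait_for_http`, its parameters, and the injectable `probe` argument are assumptions made for this illustration; they are not part of Haystack or Elasticsearch.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0, probe=None):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse.

    Hypothetical helper for illustration. `probe` can be replaced with any
    callable that takes the URL and returns True once the service is ready.
    """
    def default_probe(target):
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: not ready yet.
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(url):
            return True
        time.sleep(interval)
    return False
```

Assuming Elasticsearch listens on its default port, a call such as `wait_for_http("http://localhost:9200")` could stand in for the fixed sleep; it returns `False` if the server did not answer in time.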

In [7]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://haystack.deepset.ai/guides/telemetry


---
## Dataset

Import and download the selected dataset.

### Abstract Dataset

In [8]:
import pandas as pd
from abc import ABCMeta, abstractmethod

class AbstactDataset(metaclass = ABCMeta):
    '''Abstract dataset class'''

    def __init__(self):
        self.raw_dataset = self.download()
        self.df_dataset = self._transform_df()
        print(f"## {self.name} ##")
        print(self.raw_dataset)

    def _transform_df(self):
        '''Transform the dataset into a pd.DataFrame'''
        return pd.DataFrame(self.raw_dataset)

    @property
    @abstractmethod
    def name(self):
        '''Dataset name'''
        pass

    @abstractmethod
    def download(self):
        '''Download the dataset'''
        pass

    @abstractmethod
    def get_documents(self):
        '''Get the unique documents to store into the Document Store'''
        pass


    @abstractmethod
    def get_validation(self):
        '''Get the validation set'''
        pass

### SQuAD Dataset

https://huggingface.co/datasets/squad

In [9]:
import mmh3
from datasets import load_dataset
from haystack.schema import Label, Document, Answer
from haystack.schema import EvaluationResult, MultiLabel

class SQuadDataset(AbstactDataset):
    '''SQuaD Dataset'''
    name = "SQuaD Dataset"
    _columns = {
        "title": "title",
        "document": "context",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "id"
    }

    def download(self):
        dataset = load_dataset("squad", split="validation")
        return dataset

    def get_documents(self):
        # Remove duplicated contents
        cc = self._columns
        dataset_name = f"{self.name}"
        df = self.df_dataset
        df = df.drop_duplicates(subset=[cc["title"], cc["document"]], keep="first")

        # Create Haystack Document objects
        list_docs = []
        for _, row in df.iterrows():
            document_id = mmh3.hash128(row[cc["document"]], signed=False)
            doc_metadata = {k: row[v] for k,v in self._metadata.items()}
            doc_metadata["title"] = row[cc["title"]]
            doc_metadata["dataset_name"] = dataset_name
            doc = Document(
                id=document_id,
                content_type="text",
                content=row[cc["document"]],
                meta=doc_metadata
            )
            list_docs.append(doc)
        return list_docs

    def _get_answer(self, data):
        # Get question answer
        return data["answers"]["text"][0]

    def get_validation(self):
        # Get dataset info
        cc = self._columns
        df = self.df_dataset

        # Create one Haystack MultiLabel per question
        eval_labels = []
        for _, row in df.iterrows():
            # Same content hash as in get_documents, so labels match stored documents
            document_id = mmh3.hash128(row[cc["document"]], signed=False)
            doc_label = MultiLabel(labels=[
                Label(
                    query=row[cc["question"]],
                    answer=Answer(
                        answer=self._get_answer(row),
                        type="extractive",
                    ),
                    document=Document(
                        id=document_id,
                        content_type="text",
                        content=row[cc["document"]],
                    ),
                    is_correct_answer=True,
                    is_correct_document=True,
                    origin="gold-label",
                )
            ])
            eval_labels.append(doc_label)
        return eval_labels
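
Both `get_documents` and `get_validation` derive the document ID from the passage text, so each label points at the same ID as the stored document, and questions that share a passage map to a single document. The sketch below illustrates that idea on toy data. It uses `hashlib.md5` from the standard library as a stand-in for `mmh3.hash128`; the helper name `content_id` and the toy rows are assumptions for illustration only.

```python
import hashlib


def content_id(text):
    """Return a deterministic ID derived only from the text content."""
    # Stand-in for mmh3.hash128: any stable hash of the content would do here.
    return hashlib.md5(text.encode("utf-8")).hexdigest()


toy_rows = [
    {"question": "Q1", "context": "Passage A"},
    {"question": "Q2", "context": "Passage A"},  # same passage, new question
    {"question": "Q3", "context": "Passage B"},
]

# Unique documents: one entry per distinct passage.
toy_documents = {content_id(row["context"]): row["context"] for row in toy_rows}

# Labels: every question maps back to the ID of its passage.
toy_labels = [(row["question"], content_id(row["context"])) for row in toy_rows]
```

Here the three questions yield two documents, and every label ID can be looked up in `toy_documents`.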

### AdversarialQA Dataset

https://huggingface.co/datasets/adversarial_qa

In [10]:
class AdversarialQADataset(SQuadDataset):
    '''AdversarialQA Dataset'''
    name = "AdversarialQA Dataset"

    def download(self):
        dataset = load_dataset("adversarial_qa", "adversarialQA", split="validation")
        return dataset

### DuoRC Dataset

https://huggingface.co/datasets/duorc

In [11]:
class DuoRCDataset(SQuadDataset):
    '''DuoRC Dataset'''
    name = "DuoRC Dataset"
    _columns = {
        "title": "title",
        "document": "plot",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "question_id"
    }

    def download(self):
        dataset = load_dataset("duorc", "SelfRC", split="validation")
        return dataset

    def _transform_df(self):
        '''Transform the dataset into a pd.DataFrame'''
        df = pd.DataFrame(self.raw_dataset)
        # Get questions with answer
        return df[~df["no_answer"]]

    def _get_answer(self, data):
        # Get question answer
        return data["answers"][0]

### Download the dataset

Get the dataset and store its documents into the Document Store.

In [12]:
def dataset_switch(choice):
    '''Get dataset class'''

    if choice == Dataset.SQuAD:
        return SQuadDataset()
    elif choice == Dataset.AdvQA:
        return AdversarialQADataset()
    elif choice == Dataset.DuoRC:
        return DuoRCDataset()
    else:
        # Fail early instead of returning a string that breaks later calls
        raise ValueError(f"Invalid dataset: {choice}")

# Get the dataset
dataset = dataset_switch(DATASET)

## SQuaD Dataset ##
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10570
})


In [13]:
# Store documents in the Document Store
docs = dataset.get_documents()
document_store.write_documents(docs)

---
## Document Retriever

In this experiment, we explore three retrievers: BM25, TF-IDF, and Dense Passage Retrieval (DPR).

* https://docs.haystack.deepset.ai/docs/retriever
* https://github.com/facebookresearch/DPR
* https://www.elastic.co/pt/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables
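
To give an intuition for how BM25 ranks documents, here is a simplified sketch of the scoring formula in plain Python. It is not the Elasticsearch or Haystack implementation: the function name `bm25_scores`, the whitespace tokenization, and the toy documents are assumptions for illustration. The values `k1=1.2` and `b=0.75` match the Elasticsearch defaults.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    # Simplification: lowercase and split on whitespace instead of a real analyzer.
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / num_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply today",
]
toy_scores = bm25_scores("cat mat", toy_docs)
```

In this toy example the first document matches both query terms and ranks highest, the second matches only `cat`, and the third matches nothing and scores zero.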

### Get the Retriever

In [14]:
from haystack.nodes import BM25Retriever
from haystack.nodes import TfidfRetriever
from haystack.nodes import DensePassageRetriever

def retriever_switch(choice, document_store):
    '''Get Retriever object'''

    if choice == DocRetriever.BM25:
        retriever = BM25Retriever(document_store=document_store)
        return retriever
    elif choice == DocRetriever.TFIDF:
        retriever = TfidfRetriever(document_store=document_store)
        return retriever
    elif choice == DocRetriever.DPR:
        retriever = DensePassageRetriever(
            document_store=document_store,
            query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
            passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
            use_fast_tokenizers=True
        )
        document_store.update_embeddings(retriever)
        return retriever
    else:
        # Fail early instead of returning a string that breaks the pipeline
        raise ValueError(f"Invalid retriever: {choice}")

# Get the retriever
retriever = retriever_switch(DOC_RETRIEVER, document_store)
retriever

<haystack.nodes.retriever.sparse.BM25Retriever at 0x7f0019f94370>

### Build the Pipeline

In [15]:
from haystack.pipelines import DocumentSearchPipeline

pipe = DocumentSearchPipeline(retriever=retriever)

In [16]:
# Testing the pipeline
from haystack.utils import print_documents

# Querying documents
question = "What is your name?"
prediction = pipe.run(query=question, params={"Retriever": {"top_k": 1}})

# Print the retrieved documents
print_documents(prediction)


Query: What is your name?

{   'content': 'The negotiations were successfully concluded on 17 February '
               '1546. After 8 a.m., he experienced chest pains. When he went '
               'to his bed, he prayed, "Into your hand I commit my spirit; you '
               'have redeemed me, O Lord, faithful God" (Ps. 31:5), the common '
               'prayer of the dying. At 1 a.m. he awoke with more chest pain '
               'and was warmed with hot towels. He thanked God for revealing '
               'his Son to him in whom he had believed. His companions, Justus '
               'Jonas and Michael Coelius, shouted loudly, "Reverend father, '
               'are you ready to die trusting in your Lord Jesus Christ and to '
               'confess the doctrine which you have taught in his name?" A '
               'distinct "Yes" was Luther\'s reply.',
    'name': None}



---
## Evaluation

For details on the metrics, see the Haystack [evaluation](https://docs.haystack.deepset.ai/docs/evaluation) documentation.

In [17]:
%%time

# For quick testing, evaluate only the first 100 labels
# For a full evaluation, remove the [0:100] slice
eval_labels = dataset.get_validation()[0:100]
eval_result = pipe.eval(labels=eval_labels, params={"Retriever": {"top_k": NUM_K}})

CPU times: user 15.8 s, sys: 143 ms, total: 15.9 s
Wall time: 20.2 s


In [18]:
from pprint import pprint

# Get and print the metrics
metrics = eval_result.calculate_metrics()
pprint(metrics)

{'Retriever': {'map': 0.7150630011454754,
               'mrr': 0.7405498281786943,
               'ndcg': 0.7645699430915053,
               'precision': 0.4467353951890034,
               'recall_multi_hit': 0.8668384879725085,
               'recall_single_hit': 0.8762886597938144}}
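
As a rough guide to reading two of these numbers, the sketch below computes single-hit recall and mean reciprocal rank (MRR) on toy data, assuming the usual definitions: single-hit recall is the share of queries with at least one relevant document in the top-k results, and MRR averages the reciprocal rank of the first relevant document. The function `retrieval_metrics` and the toy inputs are hypothetical, and Haystack's own implementation may differ in details.

```python
def retrieval_metrics(ranked_ids_per_query, relevant_ids_per_query):
    """Compute single-hit recall and MRR over a list of queries."""
    hits = 0
    reciprocal_ranks = []
    for ranked_ids, relevant_ids in zip(ranked_ids_per_query, relevant_ids_per_query):
        # 1-based rank of the first relevant document, or None if none was retrieved.
        first_rank = next(
            (rank for rank, doc_id in enumerate(ranked_ids, start=1) if doc_id in relevant_ids),
            None,
        )
        if first_rank is not None:
            hits += 1
            reciprocal_ranks.append(1.0 / first_rank)
        else:
            reciprocal_ranks.append(0.0)
    num_queries = len(reciprocal_ranks)
    return {
        "recall_single_hit": hits / num_queries,
        "mrr": sum(reciprocal_ranks) / num_queries,
    }


# Three toy queries with top-3 results each.
toy_ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
toy_relevant = [{"d1"}, {"d5"}, {"d0"}]
toy_metrics = retrieval_metrics(toy_ranked, toy_relevant)
```

In the toy data, the relevant document is found at rank 1 for the first query, at rank 2 for the second, and not at all for the third, so single-hit recall is 2/3 and MRR is (1 + 1/2 + 0) / 3 = 0.5.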


In [21]:
# Print a detailed report
# pipe.print_eval_report(eval_result)