# Document Reader Experiments

In Question Answering (QA), a query is run over a collection of documents to extract an answer to the user's question. The process has two main steps: (1) Document Retriever: retrieves the documents most likely to contain the answer to a given question; (2) Document Reader: a machine reader that carefully examines the retrieved documents and frames an answer.

In this Jupyter Notebook, we focus on the Document Reader experiments, motivated by the fact that a better Reader (higher F1) produces better and more concise answers.
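To make the two steps concrete, here is a minimal sketch of a retrieve-then-read loop. It is a toy stand-in, not the method used in this notebook: both the retriever and the reader simply count shared tokens, whereas the experiments below use Transformer-based readers. The function names (`tokenize`, `retrieve`, `read`) and the sample passages are made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, top_k=1):
    """Toy Document Retriever: rank documents by token overlap with the question."""
    q_tokens = set(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: len(q_tokens & set(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]

def read(question, document):
    """Toy Document Reader: return the sentence sharing the most tokens with the question."""
    q_tokens = set(tokenize(question))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return max(sentences, key=lambda s: len(q_tokens & set(tokenize(s))))

# Made-up sample passages (illustrative only)
documents = [
    "The Raptors faced the Washington Wizards in the first round of the 2015 Playoffs. They lost four straight games.",
    "Brand was born in Cortlandt Manor, New York. He enrolled in Peekskill High School.",
]
question = "Who did the Raptors face in the first round of the 2015 Playoffs?"

best_doc = retrieve(question, documents, top_k=1)[0]   # step 1: Document Retriever
answer_sentence = read(question, best_doc)             # step 2: Document Reader
print(answer_sentence)
```

A real Reader returns a short answer span rather than a whole sentence, which is why its quality (F1) directly affects how concise the final response is.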

---
## Setup

Package installation and setup.

### Run Configuration

Choose the dataset and the Document Reader algorithm.

In [None]:
import enum

class Dataset(enum.Enum):
    '''Dataset options'''
    SQuAD = 1
    AdvQA = 2
    DuoRC = 3
    QASports = 4

class DocReader:
    '''Document Reader options'''
    BERT    = "deepset/bert-base-uncased-squad2"
    RoBERTa = "deepset/roberta-base-squad2"
    MiniLM  = "deepset/minilm-uncased-squad2"
    DistilBERT = "distilbert-base-uncased-distilled-squad"
    ELECTRA = "deepset/electra-base-squad2"
    SmallDistilBERT = "laurafcamargos/distilbert-qasports-basket-small"

class Sports:
    BASKETBALL = "basketball"
    FOOTBALL = "football"
    SOCCER = "soccer"
    ALL = ""


In [None]:
# run configuration
NUM_K      = 1 # Reader top_k used during evaluation (always 1)
DATASET    = Dataset.QASports
DOC_READER = DocReader.SmallDistilBERT
SPORT = Sports.BASKETBALL

### Package Installation

Install Haystack and HuggingFace packages.

In [None]:
# Check whether a GPU is available
# The code also runs on CPU
!nvidia-smi

Sat Jul 13 13:43:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# %%capture
# Install Haystack
!pip install --upgrade pip --quiet
!pip install farm-haystack[colab]
!pip install farm-haystack[metrics]
# !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]
!pip install rapidfuzz
# Install Hugging Face packages
!pip install datasets
!pip install transformers==4.20.1 --quiet
!pip install sentence-transformers==2.2.2 --quiet
!echo "Silent installation with success!"

Collecting farm-haystack[colab]
  Downloading farm_haystack-1.26.2-py3-none-any.whl (763 kB)
Collecting boilerpy3 (from farm-haystack[colab])
  Downloading boilerpy3-1.0.7-py3-none-any.whl (22 kB)
Collecting events (from farm-haystack[colab])
  Downloading Events-0.5-py3-none-any.whl (6.8 kB)
Collecting httpx (from farm-haystack[colab])
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting lazy-imports==0.3.1 (from farm-haystack[colab])
  Downloading lazy_imports-0.3.1-py3-none-any.whl (12 kB)
Collecting posthog (from farm-haystack[colab])
  Downloading posthog-3.5.0-py2.py3-none-any.whl (41 kB)
Collecting pr

Collecting mlflow (from farm-haystack[metrics])
  Downloading mlflow-2.14.3-py3-none-any.whl (25.8 MB)
Collecting rapidfuzz<2.8.0,>=2.0.15 (from farm-haystack[metrics])
  Downloading rapidfuzz-2.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting seqeval (from farm-haystack[metrics])
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting jarowinkler<2.0.0,>=1.2.0 (from rapidfuzz<2.8.0,>=2.0.15->farm-haystack[metrics])
  Downloading jarowinkler-1.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)

### Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.

In [None]:
import logging

# Setup Haystack logging format
logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

In [None]:
!pip install mmh3

Collecting mmh3
  Downloading mmh3-4.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (67 kB)
Installing collected packages: mmh3
Successfully installed mmh3-4.1.0


---
## Dataset

Import and download the selected dataset.

### Abstract Dataset

In [None]:
import pandas as pd
from abc import ABCMeta, abstractmethod

class AbstactDataset(metaclass = ABCMeta):
    '''Abstract dataset class'''

    def __init__(self):
        self.raw_dataset = self.download()
        self.df_dataset = self._transform_df()
        print(f"## {self.name} ##")
        print(self.raw_dataset)


    def _transform_df(self):
        '''Transform dataset in a pd.DataFrame'''
        return pd.DataFrame(self.raw_dataset)

    @property
    @abstractmethod
    def name(self):
        '''Dataset name'''
        pass

    @abstractmethod
    def download(self):
        '''Download the dataset'''
        pass

    @abstractmethod
    def get_documents(self):
        '''Get the unique documents to store in the Document Store'''
        pass


    @abstractmethod
    def get_validation(self):
        '''Get the validation set'''
        pass
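The contract defined by the abstract class can be illustrated with a small, self-contained sketch. `BaseDataset`, `ToyDataset`, and `IncompleteDataset` are hypothetical names, and the base class here is a trimmed stand-in (no pandas, no printing) rather than the class above. It shows that a plain class attribute is enough to satisfy the abstract `name` property, and that a subclass missing any abstract method cannot be instantiated.

```python
from abc import ABCMeta, abstractmethod

class BaseDataset(metaclass=ABCMeta):
    """Hypothetical, trimmed stand-in for the abstract class above (no pandas)."""

    def __init__(self):
        self.raw_dataset = self.download()

    @property
    @abstractmethod
    def name(self):
        """Dataset name"""

    @abstractmethod
    def download(self):
        """Return the raw rows"""

    @abstractmethod
    def get_documents(self):
        """Return the unique documents"""

    @abstractmethod
    def get_validation(self):
        """Return (question, answers) pairs"""

class ToyDataset(BaseDataset):
    """Hypothetical in-memory dataset that fulfils the contract."""
    name = "Toy Dataset"  # a class attribute overrides the abstract property

    def download(self):
        return [
            {"context": "Paris is the capital of France.",
             "question": "What is the capital of France?", "answers": ["Paris"]},
            {"context": "Paris is the capital of France.",
             "question": "Which country has Paris as its capital?", "answers": ["France"]},
        ]

    def get_documents(self):
        # Keep the first occurrence of each context, preserving order
        return list(dict.fromkeys(row["context"] for row in self.raw_dataset))

    def get_validation(self):
        return [(row["question"], row["answers"]) for row in self.raw_dataset]

toy = ToyDataset()
print(toy.name, len(toy.get_documents()), len(toy.get_validation()))

class IncompleteDataset(BaseDataset):
    """Leaves download/get_documents/get_validation unimplemented."""
    name = "Incomplete"

try:
    IncompleteDataset()
except TypeError as error:
    print("TypeError:", error)
```

Because `download()` is called from the base constructor, any attribute it relies on must be set before `super().__init__()` runs in a subclass.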

### SQuAD Dataset

https://huggingface.co/datasets/squad

In [None]:
import mmh3
from datasets import load_dataset
from haystack.schema import Label, Document, Answer
from haystack.schema import EvaluationResult, MultiLabel

class SQuadDataset(AbstactDataset):
    '''SQuaD Dataset'''
    name = "SQuaD Dataset"
    _columns = {
        "title": "title",
        "document": "context",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "id"
    }

    def download(self):
        dataset = load_dataset("squad", split="validation")
        return dataset

    def get_documents(self):
        # Remove duplicated contents
        cc = self._columns
        dataset_name = f"{self.name}"
        df = self.df_dataset
        df = df.drop_duplicates(subset=[cc["title"], cc["document"]], keep="first")

        # Create Haystack Document objects
        list_docs = []
        skipped_count = 0  # number of skipped rows
        for _, row in df.iterrows():
            document_value = row[cc["document"]]
            if document_value is not None:
                document_value_bytes = str(document_value).encode('utf-8')  # Convert to bytes
                document_id = mmh3.hash128(document_value_bytes, signed=False)
                doc_metadata = {k: row[v] for k, v in self._metadata.items()}
                doc_metadata["title"] = row[cc["title"]]
                doc_metadata["dataset_name"] = dataset_name
                doc = Document(
                    id=document_id,
                    content_type="text",
                    content=document_value,
                    meta=doc_metadata
                )
                list_docs.append(doc)
            else:
                # Print the full DataFrame row for the skipped document
                print("Warning: 'document' is None for this row. Skipping document:")
                print(row)
                skipped_count += 1

        print(f"Total rows skipped due to 'document' being None: {skipped_count}")

        return list_docs

    def _get_answers(self, data):
        # Get question answer
        return data["answers"]["text"]

    def get_validation(self):
        # Get dataset info
        cc = self._columns
        df = self.df_dataset
        _self = self

        # Create Haystack labels
        eval_labels = []
        for _, row in df.iterrows():
            document_value = row[cc["document"]]
            if document_value is not None:
                # Hash the same UTF-8 bytes as get_documents so the IDs match
                document_id = mmh3.hash128(str(document_value).encode('utf-8'), signed=False)
                doc_label = MultiLabel(labels=[
                    Label(
                        query=row[cc["question"]],
                        answer=Answer(answer=answer, type="extractive"),
                        document=Document(
                            id=document_id,
                            content_type="text",
                            content=document_value,
                        ),
                        is_correct_answer=True,
                        is_correct_document=True,
                        origin="gold-label",
                    )
                    for answer in _self._get_answers(row)
                ])
                eval_labels.append(doc_label)
            else:
                # Handle the case where the value is None
                print("Warning: 'document' is None for this row in get_validation. Skipping...")

        return eval_labels


INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems in the [documentation page](https://docs.haystack.deepset.ai/docs/telemetry#how-can-i-opt-out). More information at [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry).
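Both `get_documents` and `get_validation` derive the document ID from the passage text, which is what lets each label point back to its deduplicated document. The sketch below illustrates that idea under stated assumptions: it uses `hashlib.blake2b` from the standard library as a stand-in for `mmh3.hash128`, deduplicates on the passage text only (the class above also includes the title), and the helper name `content_id` and the sample rows are hypothetical.

```python
import hashlib

def content_id(text):
    """Derive a deterministic 128-bit ID from the text (stand-in for mmh3.hash128)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
    return str(int.from_bytes(digest, "big"))

# Made-up rows: two questions share the same passage
rows = [
    {"title": "A", "context": "First passage.", "question": "Q1"},
    {"title": "A", "context": "First passage.", "question": "Q2"},
    {"title": "B", "context": "Second passage.", "question": "Q3"},
]

# Documents: one entry per unique passage, keyed by its content hash
documents = {}
for row in rows:
    documents.setdefault(content_id(row["context"]), row["context"])

# Labels: every question references its passage through the same hash
labels = [(row["question"], content_id(row["context"])) for row in rows]

print(len(documents))
print(all(doc_id in documents for _, doc_id in labels))
```

The key property is determinism: hashing the same text always yields the same ID, so no lookup table between questions and documents is needed.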


### AdversarialQA Dataset

https://huggingface.co/datasets/adversarial_qa

In [None]:
class AdversarialQADataset(SQuadDataset):
    '''AdversarialQA Dataset'''
    name = "AdversarialQA Dataset"

    def download(self):
        dataset = load_dataset("adversarial_qa", "adversarialQA", split="validation")
        return dataset

### DuoRC Dataset

https://huggingface.co/datasets/duorc

In [None]:
class DuoRCDataset(SQuadDataset):
    '''DuoRC Dataset'''
    name = "DuoRC Dataset"
    _columns = {
        "title": "title",
        "document": "plot",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "question_id"
    }

    def download(self):
        dataset = load_dataset("duorc", "SelfRC", split="validation")
        return dataset

    def _transform_df(self):
        '''Transform dataset in a pd.DataFrame'''
        df = pd.DataFrame(self.raw_dataset)
        # Get questions with answer
        return df[~df["no_answer"]]

    def _get_answers(self, data):
        # Get question answer
        print(data)
        return data["answers"]

### QASports Dataset

https://huggingface.co/datasets/PedroCJardim/QASports

In [None]:
import ast

class QASportsDataset(SQuadDataset):
    '''QASports Dataset'''

    name = "QASports Dataset"

    _columns = {
        "title": "context_title",
        "document": "context",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "id_qa"
    }

    def __init__(self, sport=None):
        self.sport = sport
        super().__init__()  # call the base-class constructor

    def download(self):
        # Treat None and the empty string (Sports.ALL) alike: pass no configuration name
        if self.sport:
            dataset = load_dataset("PedroCJardim/QASports", self.sport, split="validation")
            return dataset
        else:
            dataset = load_dataset("PedroCJardim/QASports", split="validation")
            return dataset

    def _get_answers(self, data):
        # Convert the string representation of a dictionary into a Python dict
        answer_dict = ast.literal_eval(data["answer"])

        # Return a list containing the answer text
        return [answer_dict["text"]]
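The `_get_answers` override relies on `ast.literal_eval` to turn the stringified `answer` column into a dictionary. Here is a minimal sketch of that step. The sample string is hypothetical: only the `text` key is confirmed by the code above, and the `offset` key is an assumption added for illustration.

```python
import ast

# Hypothetical value of the "answer" column: a dict serialized as a string
raw_answer = "{'text': 'Washington Wizards', 'offset': [232, 250]}"

# literal_eval parses Python literals only; it never executes arbitrary code
answer_dict = ast.literal_eval(raw_answer)
print(type(answer_dict).__name__)
print([answer_dict["text"]])

# Malformed input raises an exception instead of returning a partial result
try:
    ast.literal_eval("{'text': 'unterminated")
except (ValueError, SyntaxError) as error:
    print("could not parse:", type(error).__name__)
```

`ast.literal_eval` is preferable to `eval` here because dataset fields are untrusted input, and `json.loads` would reject the single-quoted keys.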

### Download the dataset

Get the dataset and store the documents in the Document Store.

In [None]:
def dataset_switch(choice):
    '''Get dataset class'''

    if choice == Dataset.SQuAD:
        return SQuadDataset()
    elif choice == Dataset.AdvQA:
        return AdversarialQADataset()
    elif choice == Dataset.DuoRC:
        return DuoRCDataset()
    elif choice == Dataset.QASports:
        return QASportsDataset(SPORT)
    else:
        raise ValueError(f"Invalid dataset: {choice}")

# Get the dataset
dataset = dataset_switch(DATASET)
docs = dataset.get_documents()

Downloading data:   0%|          | 0.00/222M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/27.7M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/185934 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/23242 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23242 [00:00<?, ? examples/s]

## QASports Dataset ##
Dataset({
    features: ['id_qa', 'context_id', 'context', 'question', 'answer', 'context_title', 'context_categories', 'url'],
    num_rows: 23242
})
Total rows skipped due to 'document' being None: 0


---
## Document Reader

In this experiment, we explored three Transformer-based models for extractive Question Answering using the [FARM framework](https://github.com/deepset-ai/FARM).
* [BERT paper](https://arxiv.org/abs/1810.04805), [implementation](https://huggingface.co/deepset/bert-base-uncased-squad2)
* [RoBERTa paper](https://arxiv.org/abs/1907.11692), [implementation](https://huggingface.co/deepset/roberta-base-squad2)
* [MiniLM paper](https://arxiv.org/abs/2002.10957), [implementation](https://huggingface.co/deepset/minilm-uncased-squad2)


In [None]:
from haystack.nodes import FARMReader

# Get the reader
reader = FARMReader(DOC_READER, use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading:   0%|          | 0.00/477 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/minilm-uncased-squad2' (Bert)


Downloading:   0%|          | 0.00/127M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/minilm-uncased-squad2' (Bert model) from model hub.


Downloading:   0%|          | 0.00/107 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


In [None]:
from haystack import Pipeline

# Build the pipeline
pipe = Pipeline()
pipe.add_node(component=reader, name='Reader', inputs=['Query'])

In [None]:
# Testing the pipeline
from haystack.utils import print_answers

# Querying documents
question = "Who did the Raptors face in the first round of the 2015 Playoffs?"
prediction = pipe.run(query=question, documents=docs[0:10], params={"Reader": {"top_k": 3}})

# Print answer
print_answers(prediction)

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 12.71 Batches/s]

'Query: Who did the Raptors face in the first round of the 2015 Playoffs?'
'Answers:'
[   <Answer {'answer': 'Washington Wizards', 'type': 'extractive', 'score': 0.8629469871520996, 'context': ' Award, becoming the first Raptor to do so. The Raptors faced the Washington Wizards in the first round of the 2015 Playoffs and lost four straight ga', 'offsets_in_document': [{'start': 232, 'end': 250}], 'offsets_in_context': [{'start': 66, 'end': 84}], 'document_ids': ['168073128841507576969410138308837846391'], 'meta': {'dataset_id': '49377324229723915348616814546446752473', 'title': 'Toronto Raptors | Basketball Wiki | Fandom', 'dataset_name': 'QASports Dataset'}}>,
    <Answer {'answer': 'West Virginia', 'type': 'extractive', 'score': 0.022460104897618294, 'context': '1991, 1993 South Florida 0 8 Syracuse 5 1981, 1988, 1992, 2005, 2006 West Virginia 1 2010 4 Villanova 1 1995 1 Virginia Tech 0 5,6 Notes: 1 Villanova ', 'offsets_in_document': [{'start': 109, 'end': 122}], 'offsets_in_context




---
## Evaluation

For details on the metrics, see the Haystack [evaluation](https://docs.haystack.deepset.ai/docs/evaluation) documentation page.
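As a rough guide to what the two reported Reader metrics measure, the sketch below computes exact match and token-level F1 in the SQuAD style. It is an approximation under stated assumptions: the normalization (lowercasing, stripping punctuation and English articles) follows the common SQuAD convention, and Haystack's own implementation may differ in details. All function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_golds(metric, prediction, golds):
    """With several gold answers, score against the best-matching one."""
    return max(metric(prediction, gold) for gold in golds)

print(exact_match("The Washington Wizards", "Washington Wizards"))
print(f1_score("Washington", "Washington Wizards"))
print(best_over_golds(f1_score, "Wizards", ["Washington Wizards", "Wizards"]))
```

Exact match rewards only a perfect span, while F1 gives partial credit for overlapping tokens, which is why F1 is never lower than exact match on the same predictions.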

In [None]:
%%time

# For testing purposes, run on the first 100 labels
# For the full evaluation, remove the [0:100] slice
eval_labels = dataset.get_validation()[0:100]

eval_docs = [[label.document for label in multi_label.labels] for multi_label in eval_labels]

eval_result = pipe.eval(labels=eval_labels, documents=eval_docs, params={"Reader": {"top_k": NUM_K}})

Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 59.23 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 59.75 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 56.85 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 67.00 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 63.60 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 63.04 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 61.44 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 63.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 64.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 65.08 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 64.43 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 34.34 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 66.45 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00

CPU times: user 16 s, sys: 596 ms, total: 16.6 s
Wall time: 16.8 s


In [None]:
from pprint import pprint

pprint(eval_labels[0])

<MultiLabel: {'labels': [{'id': '207f53e0-70f3-4185-9ec2-de7a56b5fe15', 'query': 'In what year did he become a New York State Basketball Player?', 'document': {'id': '162155919273417084454839941680164668430', 'content': ' Early life Brand was born in Cortlandt Manor, New York. At the age of thirteen, he enrolled in Peekskill High School, where he was immediately added to the varsity basketball roster. He averaged 40 points and 20 rebounds per game, played AAU basketball with future NBA players Lamar Odom and Ron Artest, and by his senior year he was consistently ranked among the top high school basketball players in the country and was selected as New York State Mr.', 'content_type': 'text', 'meta': {}, 'id_hash_keys': ['content'], 'score': None, 'embedding': None}, 'is_correct_answer': True, 'is_correct_document': True, 'origin': 'gold-label', 'answer': {'answer': 'senior', 'type': 'extractive', 'score': None, 'context': None, 'offsets_in_document': None, 'offsets_in_context': None, '

In [None]:
from pprint import pprint

# Get and print the metrics
metrics = eval_result.calculate_metrics()
pprint(metrics)

{'Reader': {'exact_match': 0.59,
            'f1': 0.6043181818181818,
            'num_examples_for_eval': 100.0}}


In [None]:
# Print a detailed report
pipe.print_eval_report(eval_result)

                   Pipeline Overview
                      Query
                        |
                        |
                      Reader
                        |
                        | exact_match:  0.43
                        | exact_match_top_1:  0.43
                        | f1: 0.464
                        | f1_top_1: 0.464
                        | num_examples_for_eval: 1e+02
                        | num_examples_for_eval_top_1: 1e+02
                        |
                      Output

                Wrong Reader Examples
Query: 
 	What is the name of the football team that is currently in the Championship League?
Gold Document Ids: 
 	
Metrics: 
 	exact_match: 0.0
 	f1: 0.0
Answers: 
 	answer: Farsley Celtic 
 	context: ford Park Avenue · Chester · Chorley · Curzon Ashton · Darlington · Farsley Celtic · Gateshead · Gloucester City · Guiseley · Hereford · Kettering Tow 
_______________________________________________________
Query: 
 	Where was Coutinho?
Gol