<a href="https://colab.research.google.com/github/leomaurodesenv/big-qa-architecture/blob/main/jupyter/3_Document_Reader_Experiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Reader Experiments

In Question Answering (QA), a system searches several documents to extract an answer to a user's question. It consists of two main steps: (1) Document Retriever: retrieves the documents most likely to contain the answer to a given question; (2) Document Reader: a machine reader examines the retrieved documents and frames an answer.

In this Jupyter Notebook, we focus on the Document Reader experiments, motivated by the fact that a better Reader (higher F1) produces more accurate and more concise answers.

---
## Setup

Package installation and setup.

### Run Configuration

Choose the dataset and the Document Reader algorithm.

In [None]:
import enum

class Dataset(enum.Enum):
    '''Dataset options'''
    SQuAD = 1
    AdvQA = 2
    DuoRC = 3
    QASports = 4

class DocReader:
    '''Document Reader options'''
    BERT    = "deepset/bert-base-uncased-squad2"
    RoBERTa = "deepset/roberta-base-squad2"
    MiniLM  = "deepset/minilm-uncased-squad2"
    DistilBERT = "distilbert-base-uncased-distilled-squad"
    FineDistilBERT = "laurafcamargos/distilbert-qasports-basket-small"
    ELECTRA = "deepset/electra-base-squad2"

class Sports:
    BASKETBALL = "basketball"
    FOOTBALL = "football"
    SOCCER = "soccer"
    ALL = None

In [None]:
# run configuration
NUM_K      = 1 # always = 1
DATASET    = Dataset.QASports
DOC_READER = DocReader.RoBERTa
SPORT      = Sports.SOCCER

### Package Installation

Install Haystack and HuggingFace packages.

In [None]:
# Check whether a GPU is available
# The code also runs on CPU
!nvidia-smi

Thu Feb 27 11:26:06 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   39C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# %%capture
# Install Haystack
!pip install farm-haystack==1.26.2 --quiet

# Install the Hugging Face packages
!pip install transformers==4.39.3 --quiet
!pip install sentence-transformers==2.2.2 --quiet
!pip install huggingface_hub==0.25.0
!echo "Huggingface installation with success!"
# Extra dependencies
!pip install mmh3
!pip install datasets
!pip install rapidfuzz
!echo "Extra installation with success!"

### Logging

We configure how logging messages should be displayed and which log level should be used before importing Haystack.

In [None]:
import logging

# Setup Haystack logging format
logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

---
## Dataset

Import and download the selected dataset.

### Abstract Dataset

In [None]:
import pandas as pd
from abc import ABCMeta, abstractmethod

class AbstactDataset(metaclass = ABCMeta):
    '''Abstract dataset class'''

    def __init__(self):
        self.raw_dataset = self.download()
        self.df_dataset = self._transform_df()
        print(f"## {self.name} ##")
        print(self.raw_dataset)

    def _transform_df(self):
        '''Transform dataset in a pd.DataFrame'''
        return pd.DataFrame(self.raw_dataset)

    @property
    @abstractmethod
    def name(self):
        '''Dataset name'''
        pass

    @abstractmethod
    def download(self):
        '''Download the dataset'''
        pass

    @abstractmethod
    def get_documents(self):
        '''Get the unique documents to store in the Document Store'''
        pass


    @abstractmethod
    def get_validation(self):
        '''Get the validation set'''
        pass

### SQuAD Dataset

https://huggingface.co/datasets/squad

In [None]:
import mmh3
from datasets import load_dataset
from haystack.schema import Label, Document, Answer
from haystack.schema import EvaluationResult, MultiLabel

class SQuadDataset(AbstactDataset):
    '''SQuaD Dataset'''
    name = "SQuaD Dataset"
    _columns = {
        "title": "title",
        "document": "context",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "id"
    }

    def download(self):
        dataset = load_dataset("squad", split="validation")
        return dataset

    def get_documents(self):
        # Remove duplicated contents
        cc = self._columns
        dataset_name = f"{self.name}"
        df = self.df_dataset
        df = df.drop_duplicates(subset=[cc["title"], cc["document"]], keep="first")

        # Create Haystack Document objects
        list_docs = []
        for _, row in df.iterrows():
            document_id = mmh3.hash128(row[cc["document"]], signed=False)
            doc_metadata = {k: row[v] for k,v in self._metadata.items()}
            doc_metadata["title"] = row[cc["title"]]
            doc_metadata["dataset_name"] = dataset_name
            doc = Document(
                id=document_id,
                content_type="text",
                content=row[cc["document"]],
                meta=doc_metadata
            )
            list_docs.append(doc)
        return list_docs

    def _get_answers(self, data):
        # Get question answer
        return data["answers"]["text"]

    def get_validation(self):
        # Get dataset info
        cc = self._columns
        df = self.df_dataset
        _self = self

        # Create Haystack labels
        eval_labels = []
        for _, row in df.iterrows():
            document_id = mmh3.hash128(row[cc["document"]], signed=False)
            doc_label = MultiLabel(labels=[
                Label(
                    query = row[cc["question"]],
                    answer = Answer(answer = answer, type = "extractive"),
                    document = Document(
                        id=document_id,
                        content_type="text",
                        content=row[cc["document"]],
                    ),
                    is_correct_answer=True,
                    is_correct_document=True,
                    origin="gold-label",
                )
                for answer in _self._get_answers(row)
            ])
            eval_labels.append(doc_label)
        return eval_labels
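
The document IDs above are derived from the passage content, so the same passage receives the same ID in both `get_documents` and `get_validation`. A minimal sketch of that idea, assuming `hashlib.md5` as a stand-in for `mmh3.hash128` (so the numeric IDs differ from the notebook's) and using made-up rows:

```python
import hashlib

def content_id(text: str) -> int:
    """Stand-in for mmh3.hash128: a 128-bit unsigned integer derived from the text."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

# Hypothetical rows: two questions share one passage
rows = [
    {"context": "Passage A", "question": "Q1"},
    {"context": "Passage A", "question": "Q2"},
    {"context": "Passage B", "question": "Q3"},
]

# Keep one entry per distinct passage (get_documents gets a similar effect
# by calling drop_duplicates on the title and document columns)
unique_docs = {content_id(r["context"]): r["context"] for r in rows}

# Each question points to its passage through the same content hash
question_to_doc = {r["question"]: content_id(r["context"]) for r in rows}
```

Because the ID depends only on the text, a label built from any row can be matched to its passage without keeping a separate lookup table.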


### AdversarialQA Dataset

https://huggingface.co/datasets/adversarial_qa

In [None]:
class AdversarialQADataset(SQuadDataset):
    '''AdversarialQA Dataset'''
    name = "AdversarialQA Dataset"

    def download(self):
        dataset = load_dataset("adversarial_qa", "adversarialQA", split="validation")
        return dataset

### DuoRC Dataset

https://huggingface.co/datasets/duorc

In [None]:
class DuoRCDataset(SQuadDataset):
    '''DuoRC  Dataset'''
    name = "DuoRC Dataset"
    _columns = {
        "title": "title",
        "document": "plot",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "question_id"
    }

    def download(self):
        dataset = load_dataset("duorc", "SelfRC", split="validation")
        return dataset

    def _transform_df(self):
        '''Transform dataset in a pd.DataFrame'''
        df = pd.DataFrame(self.raw_dataset)
        # Get questions with answer
        return df[~df["no_answer"]]

    def _get_answers(self, data):
        # Get question answer
        return data["answers"]

### QASports Dataset

https://huggingface.co/datasets/PedroCJardim/QASports

In [None]:
import ast

class QASportsDataset(SQuadDataset):
    '''QASports  Dataset'''
    name = "QASports Dataset"
    _columns = {
        "title": "context_title",
        "document": "context",
        "question": "question",
    }
    _metadata = {
        "dataset_id": "id_qa"
    }

    def __init__(self, sport=None):
        self.sport = sport
        super().__init__()

    def download(self):
        dataset = load_dataset("PedroCJardim/QASports", self.sport, split="validation") if self.sport is not None \
                  else load_dataset("PedroCJardim/QASports", split="validation")
        return dataset

    def _transform_df(self):
        '''Transform dataset in a pd.DataFrame'''
        df = pd.DataFrame(self.raw_dataset)
        # The answer column holds stringified dicts; parse them safely
        # (ast.literal_eval only accepts Python literals, unlike eval)
        df["answer"] = df["answer"].astype(str).apply(ast.literal_eval)
        # Keep only the questions that have a non-empty answer text
        mask = df["answer"].apply(lambda x: isinstance(x, dict) and x.get("text", "") != "")
        return df[mask]

    def _get_answers(self, data):
        # Get question answer
        return [data["answer"]["text"]]
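
The `answer` column is stored as a string and has to be parsed into a dict before rows without an answer can be filtered out. A minimal sketch of that step on made-up values, using `ast.literal_eval`, which only accepts Python literals. The `offset` field and the sample strings are assumptions for illustration; only the `text` key is taken from the code above.

```python
import ast

# Hypothetical raw values, shaped like a stringified `answer` column
raw_answers = [
    "{'text': 'Real Madrid', 'offset': [10, 21]}",
    "{'text': '', 'offset': [0, 0]}",
]

def parse_answer(raw: str) -> dict:
    """Parse a stringified dict without executing arbitrary code."""
    return ast.literal_eval(raw)

parsed = [parse_answer(raw) for raw in raw_answers]

# Keep only the answers with a non-empty text, mirroring the mask in _transform_df
answerable = [a for a in parsed if isinstance(a, dict) and a.get("text", "") != ""]
```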

### Download the dataset

Load the selected dataset and build its list of unique documents.

In [None]:
def dataset_switch(choice):
    '''Get dataset class'''

    if choice == Dataset.SQuAD:
        return SQuadDataset()
    elif choice == Dataset.AdvQA:
        return AdversarialQADataset()
    elif choice == Dataset.DuoRC:
        return DuoRCDataset()
    elif choice == Dataset.QASports:
        return QASportsDataset(SPORT)
    else:
        raise ValueError(f"Invalid dataset: {choice}")

# Get the dataset
dataset = dataset_switch(DATASET)
docs = dataset.get_documents()

## QASports Dataset ##
Dataset({
    features: ['id_qa', 'context_id', 'context', 'question', 'answer', 'context_title', 'context_categories', 'url'],
    num_rows: 61421
})


---
## Document Reader

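In this experiment, we explore Transformer-based models for extractive Question Answering using the [FARM framework](https://github.com/deepset-ai/FARM). The `DocReader` options above also include DistilBERT and ELECTRA variants; the papers and implementations for three of the models are: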
* [BERT paper](https://arxiv.org/abs/1810.04805), [implementation](https://huggingface.co/deepset/bert-base-uncased-squad2)
* [RoBERTa paper](https://arxiv.org/abs/1907.11692), [implementation](https://huggingface.co/deepset/roberta-base-squad2)
* [MiniLM paper](https://arxiv.org/abs/2002.10957), [implementation](https://huggingface.co/deepset/minilm-uncased-squad2)
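
Extractive readers of this kind typically score every token as a possible answer start and answer end, then return the span with the best combined score. The sketch below illustrates that general idea in plain Python; it is not FARM's implementation, and the function name, the `max_answer_len` parameter, the toy sentence, and the scores are all made up for illustration.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end, score) of the highest-scoring valid span."""
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Toy input: one score per token (hypothetical values)
tokens = ["The", "Raptors", "faced", "the", "Wizards", "in", "2015"]
start_scores = [0.1, 0.2, 0.0, 1.5, 2.5, 0.0, 0.3]
end_scores = [0.0, 0.4, 0.1, 0.2, 3.0, 0.1, 0.5]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
```

The answer is always a contiguous slice of the input passage, which is why the predictions shown later carry `offsets_in_document` values.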


In [None]:
from haystack.nodes import FARMReader

# Get the reader
reader = FARMReader(DOC_READER, use_gpu=True)

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)
INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


In [None]:
from haystack import Pipeline

# Build the pipeline
pipe = Pipeline()
pipe.add_node(component=reader, name='Reader', inputs=['Query'])

In [None]:
# Testing the pipeline
from haystack.utils import print_answers

# Querying documents
question = "Who did the Raptors face in the first round of the 2015 Playoffs?"
prediction = pipe.run(query=question, documents=docs[0:10], params={"Reader": {"top_k": 3}})

# Print answer
print_answers(prediction)


'Query: Who did the Raptors face in the first round of the 2015 Playoffs?'
'Answers:'
[   <Answer {'answer': 'Algeciras v Novelda', 'type': 'extractive', 'score': 0.13129448890686035, 'context': 'sión B or 2013-14 Tercera División, winner from First round match Algeciras v Novelda, received a bye. Teams from 2013-14 Segunda División gained entr', 'offsets_in_document': [{'start': 114, 'end': 133}], 'offsets_in_context': [{'start': 66, 'end': 85}], 'document_ids': ['337129102467665701580299539122507990947'], 'meta': {'dataset_id': '37677575712103584199147929147757751150', 'title': '2013–14 Copa del Rey | Football Wiki | Fandom', 'dataset_name': 'QASports Dataset'}}>,
    <Answer {'answer': 'Rugby Town', 'type': 'extractive', 'score': 0.06147240847349167, 'context': '1 1-1 0-1 2-2 3-1 1-2 4-1 5-1 2-4 2-1 3-0 0-3 2-3 1-1 1-2 4-2 0-1 1-1 Rugby Town 1-1 1-1 3-1 0-1 0-1 2-2 1-2 1-3 1-0 1-2 0-1 0-3 2-0 0-2 1-2 1-2 2-3 1', 'offsets_in_document': [{'start': 1920, 'end': 1930}], 'offsets_in_cont

---
## Evaluation

For details on the metrics, see the Haystack [evaluation](https://docs.haystack.deepset.ai/docs/evaluation) documentation.
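
As a rough guide to the two Reader metrics reported below, here is a minimal sketch of exact match and token-level F1 in the SQuAD style. The normalization steps (lowercasing, dropping punctuation and English articles) are an assumption; Haystack's own implementation may differ in detail, and the example strings are made up.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the Washington Wizards", "Washington Wizards")
f1 = f1_score("Wizards", "Washington Wizards")
```

Exact match rewards only a fully correct span, while F1 gives partial credit when the prediction overlaps the gold answer.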

In [None]:
%%time

# This runs on the full validation set
# For a quick test, slice the labels first (e.g. eval_labels[0:100])
eval_labels = dataset.get_validation()
eval_docs = [[label.document for label in multi_label.labels] for multi_label in eval_labels]

eval_result = pipe.eval(labels=eval_labels, documents=eval_docs, params={"Reader": {"top_k": NUM_K}})

CPU times: user 47min 44s, sys: 24.5 s, total: 48min 8s
Wall time: 53min 11s


In [None]:
from pprint import pprint

# Get and print the metrics
metrics = eval_result.calculate_metrics()
pprint(metrics)

{'Reader': {'exact_match': 0.9953243016424889,
            'f1': 0.9953243016424889,
            'num_examples_for_eval': 25023.0}}


In [None]:
# Print a detailed report
# pipe.print_eval_report(eval_result)