
# Evaluation of a Pipeline and its Components

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepset-ai/haystack/blob/master/tutorials/Tutorial5_Evaluation.ipynb)

To judge the quality of the results produced by a question-answering pipeline, or by any other pipeline in Haystack, you need to evaluate it. Evaluation also helps you determine which components of the pipeline can be improved.
The results of the evaluation can be saved as CSV files, which contain all the information needed to calculate additional metrics later on or to inspect individual predictions.

In [None]:
# Make sure you have a GPU running
!nvidia-smi

/bin/bash: nvidia-smi: command not found


In [None]:
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


In [None]:
from haystack.modeling.utils import initialize_device_settings

devices, n_gpu = initialize_device_settings(use_cuda=True)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0


## Start an Elasticsearch server
You can start Elasticsearch on your local machine using Docker. If Docker is not readily available in your environment (e.g., in Colab notebooks), you can manually download the Elasticsearch archive and run it directly.

In [None]:
# If Docker is available: Start Elasticsearch as docker container
# from haystack.utils import launch_es
# launch_es()

# Alternative in Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
import os 
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ['elasticsearch-7.9.2/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, 
    preexec_fn=lambda: os.setuid(1) # as daemon
)

# Wait until ES has started
! sleep 30
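
A fixed 30-second sleep can be too short on a slow machine and wasteful on a fast one. As a minimal sketch, one could poll until the server answers instead. The helpers `http_probe` and `wait_until_ready` are hypothetical names, not part of Haystack, and the default probe assumes Elasticsearch listens on `http://localhost:9200`.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP GET on `url` succeeds (assumed ES address)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def wait_until_ready(probe=http_probe, max_wait=60.0, interval=1.0, sleep=time.sleep):
    """Call `probe` repeatedly until it returns True or `max_wait` seconds pass."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

In the notebook, a call such as `wait_until_ready()` could take the place of `! sleep 30`; a return value of `False` would indicate that the server did not come up within the time limit.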

## Fetch, Store And Preprocess the Evaluation Dataset



In [None]:
from haystack.utils import fetch_archive_from_http

# Download evaluation data, which is a subset of Natural Questions development 
# set containing 50 documents with one question per document and
# multiple annotated answers

doc_dir = '../data/nq'
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

INFO - haystack.utils.import_utils -  Fetching from https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/nq_dev_subset_v2.json.zip to `../data/nq`


True

In [None]:
# Make sure these indices do not collide with existing ones: they will be wiped clean before data is inserted.
doc_index = "tutorial5_docs"
label_index = "tutorial5_labels"

In [None]:
# Connect to Elasticsearch
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(
    host='localhost',
    username='',
    password='',
    index=doc_index,
    label_index=label_index,
    embedding_field='emb',
    embedding_dim=768,
    excluded_meta_data=['emb'],
)

In [None]:
from haystack.nodes import PreProcessor

# Add evaluation data to Elasticsearch Document Store
# We first delete the custom tutorial indices to avoid duplicate elements
# and also split our documents into shorter passages using the PreProcessor
preprocessor = PreProcessor(
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)

document_store.delete_documents(index=doc_index)
document_store.delete_documents(index=label_index)

# The add_eval_data() method converts the given dataset in json format into 
# Haystack document and label objects. Those objects are then indexed in their
# respective document and label index in the document store. The method
# can be used with any dataset in SQuAD format.

document_store.add_eval_data(
    filename='../data/nq/nq_dev_subset_v2.json', doc_index=doc_index,
    label_index=label_index, preprocessor=preprocessor
)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
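
To make the effect of `split_length=200` and `split_overlap=0` concrete, here is a simplified sketch of word-based splitting. It is not Haystack's implementation: `split_into_passages` is a hypothetical helper, it assumes that length is counted in whitespace-separated words, and it ignores cleaning and sentence boundaries.

```python
def split_into_passages(text, split_length=200, split_overlap=0):
    """Split `text` into passages of at most `split_length` words.

    Consecutive passages share `split_overlap` words.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")

    words = text.split()
    step = split_length - split_overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return passages


# A 450-word toy document yields passages of 200, 200 and 50 words.
toy_document = " ".join(f"w{i}" for i in range(450))
passage_lengths = [len(p.split()) for p in split_into_passages(toy_document)]
print(passage_lengths)  # [200, 200, 50]
```

Shorter passages give the retriever finer-grained units to rank, at the cost of more documents in the index.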


## Initialize the Two Components of an ExtractiveQAPipeline: Retriever and Reader

In [None]:
# Initialize Retriever
from haystack.nodes import ElasticsearchRetriever

retriever = ElasticsearchRetriever(document_store=document_store)
# Alternative: Evaluate dense retrievers (DensePassageRetriever or EmbeddingRetriever)
# DensePassageRetriever uses two separate transformer-based encoders for query
# and document.
# In contrast, EmbeddingRetriever uses a single encoder for both.
# Please make sure the "embedding_dim" parameter in the DocumentStore above
# matches the output dimension of your models!
# Please also take care that the PreProcessor splits your files into chunks
# that fit completely within the max_seq_len limits of the Transformer models.

# The SentenceTransformer model "all-mpnet-base-v2" generally works well with
# the EmbeddingRetriever on any kind of English text.
# For more information check out the documentation at:
# https://www.sbert.net/docs/pretrained_models.html
# from haystack.nodes import DensePassageRetriever, EmbeddingRetriever
# retriever = DensePassageRetriever(document_store=document_store,
#                                   query_embedding_model='facebook/dpr-question_encoder-single-nq-base',
#                                   passage_embedding_model='facebook/dpr-ctx_encoder-single-nq-base',
#                                   use_gpu=True, max_seq_len_passage=256,
#                                   embed_title=True)
# retriever = EmbeddingRetriever(document_store=document_store,
#                                model_format='sentence_transformers',
#                                embedding_model='all-mpnet-base-v2')
# document_store.update_embeddings(retriever, index=doc_index)

In [None]:
# Initialize Reader
from haystack.nodes import FARMReader

reader = FARMReader("deepset/roberta-base-squad2", top_k=4, return_no_answer=True)

# Define a pipeline consisting of the initialized retriever and reader
from haystack.pipelines import ExtractiveQAPipeline

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

# The evaluation also works with any other pipeline.
# For example you could use a DocumentSearchPipeline as an alternative:
# from haystack.pipelines import DocumentSearchPipeline
# pipeline = DocumentSearchPipeline(retriever=retriever)

INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2
INFO - haystack.modeling.logger -  ML Logging is turned off. No parameters, metrics or artifacts will be logged to MLFlow.
INFO - haystack.modeling.utils -  Using devices: CPU
INFO - haystack.modeling.utils -  Number of GPUs: 0
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...


## Evaluation of an ExtractiveQAPipeline
Here we evaluate retriever and reader in open-domain fashion on the full corpus of documents, i.e., a document is considered
correctly retrieved if it contains the gold answer string. The reader is evaluated purely on the
predicted answer string, regardless of which document it came from and the position of the extracted span.

The generation of predictions is separated from the calculation of metrics. This allows you to run the computation-heavy model predictions only once and then iterate flexibly on the metrics or reports you want to generate.
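
The open-domain retrieval criterion can be sketched as a plain substring check. This is an illustration only: `contains_gold_answer` and `first_hit_rank` are hypothetical helpers, and exact, case-sensitive matching is an assumption rather than a description of Haystack's internals.

```python
def contains_gold_answer(document_text, gold_answers):
    """Return True if any non-empty gold answer string occurs in the document text."""
    return any(answer in document_text for answer in gold_answers if answer)


def first_hit_rank(retrieved_texts, gold_answers):
    """Return the 1-based rank of the first retrieved text containing a gold answer.

    Returns None if no retrieved text contains one.
    """
    for rank, text in enumerate(retrieved_texts, start=1):
        if contains_gold_answer(text, gold_answers):
            return rank
    return None


gold = ["every person who is destined for Heaven or the World to Come"]
retrieved = [
    "According to the Talmud it is open on Rosh Hashanah.",
    "the book in which God records the names of every person who is "
    "destined for Heaven or the World to Come.",
]
print(first_hit_rank(retrieved, gold))  # 2
```

Because only the answer string matters here, a passage from a different document than the annotated one can still count as a correct retrieval.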


In [None]:
from haystack.schema import EvaluationResult, MultiLabel

# We can load evaluation labels from the document store
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True,
                                                       drop_no_answers=False)

# Alternative: Define queries and labels directly
# from haystack.schema import Answer, Document, Label, Span
# eval_labels = [
#                MultiLabel(labels=[Label(query='who is written in the book of life',
#                                         answer=Answer(answer='every person who is destined for Heaven or the World to Come',
#                                                       offsets_in_context=[Span(374, 434)]),
#                                         document=Document(id='1b090aec7dbd1af6739c4c80f8995877-0',
#                                                           content_type='text',
#                                                           content='Book of Life - wikipedia Book of Life Jump to: navigation, search This article is about the book mentioned in Christian and Jewish religious teachings. For other uses, see The Book of Life. In Christianity and Judaism, the Book of Life (Hebrew: ספר החיים, transliterated Sefer HaChaim; Greek: βιβλίον τῆς ζωῆς Biblíon tēs Zōēs) is the book in which God records the names of every person who is destined for Heaven or the World to Come. According to the Talmud it is open on Rosh Hashanah, as is its analog for the wicked, the Book of the Dead. For this reason extra mention is made for the Book of Life during Amidah recitations during the Days of Awe, the ten days between Rosh Hashanah, the Jewish new year, and Yom Kippur, the day of atonement (the two High Holidays, particularly in the prayer Unetaneh Tokef). Contents (hide) 1 In the Hebrew Bible 2 Book of Jubilees 3 References in the New Testament 4 The eschatological or annual roll-call 5 Fundraising 6 See also 7 Notes 8 References In the Hebrew Bible(edit) In the Hebrew Bible the Book of Life - the book or muster-roll of God - records forever all people considered righteous before God'),
#                                         is_correct_answer=True,
#                                         is_correct_document=True,
#                                         origin='gold-label'
#                                                           )])
# ]

# Similar to pipeline.run() we can execute pipeline.eval()
eval_result = pipeline.eval(labels=eval_labels,
                            params={'Retriever': {'top_k': 5}})


In [None]:
# The EvaluationResult contains a pandas dataframe for each pipeline node.
# That's why there are two dataframes in the EvaluationResult of an
# ExtractiveQAPipeline.

retriever_result = eval_result['Retriever']
retriever_result.head()

Unnamed: 0,multilabel_id,query,filters,gold_document_contents,content,gold_id_match,answer_match,gold_id_or_answer_match,rank,document_id,gold_document_ids,type,node,eval_mode
0,7221391243398055910,who is written in the book of life,b'null',"[Book of Life - wikipedia Book of Life Jump to: navigation, search This arti...","people considered righteous before God. God has such a book, and to be blott...",0.0,0.0,0.0,1.0,1b090aec7dbd1af6739c4c80f8995877-1,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]",document,Retriever,integrated
1,7221391243398055910,who is written in the book of life,b'null',"[Book of Life - wikipedia Book of Life Jump to: navigation, search This arti...","as adversaries (of God). Also, according to ib. xxxvi. 10, one who contrives...",0.0,0.0,0.0,2.0,1b090aec7dbd1af6739c4c80f8995877-2,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]",document,Retriever,integrated
2,7221391243398055910,who is written in the book of life,b'null',"[Book of Life - wikipedia Book of Life Jump to: navigation, search This arti...",the citizens' registers. The life which the righteous participate in is to b...,0.0,0.0,0.0,3.0,1b090aec7dbd1af6739c4c80f8995877-6,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]",document,Retriever,integrated
3,7221391243398055910,who is written in the book of life,b'null',"[Book of Life - wikipedia Book of Life Jump to: navigation, search This arti...","apostles' names are ``written in heaven'' (Luke x. 20), or ``the fellow-work...",0.0,0.0,0.0,4.0,1b090aec7dbd1af6739c4c80f8995877-3,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]",document,Retriever,integrated
4,7221391243398055910,who is written in the book of life,b'null',"[Book of Life - wikipedia Book of Life Jump to: navigation, search This arti...",The Absolutely True Diary of a Part-Time Indian - wikipedia The Absolutely T...,0.0,0.0,0.0,5.0,e9260cbbc129f4246ee8fcfbbe385822-0,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]",document,Retriever,integrated


In [None]:
reader_result = eval_result['Reader']
reader_result.head()

Unnamed: 0,multilabel_id,query,filters,gold_answers,answer,context,exact_match,f1,rank,document_id,gold_document_ids,offsets_in_document,gold_offsets_in_documents,type,node,eval_mode
0,7221391243398055910,who is written in the book of life,b'null',"[all people considered righteous before God, every person who is destined fo...",,,0.0,0.0,1.0,,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]","[{'start': 0, 'end': 0}]","[{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}]",answer,Reader,integrated
1,7221391243398055910,who is written in the book of life,b'null',"[all people considered righteous before God, every person who is destined fo...",those whose names are written in the Book of Life from the foundation of the...,"ohn of Patmos. As described, only those whose names are written in the Book ...",0.0,0.083333,2.0,1b090aec7dbd1af6739c4c80f8995877-3,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]","[{'start': 576, 'end': 658}]","[{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}]",answer,Reader,integrated
2,7221391243398055910,who is written in the book of life,b'null',"[all people considered righteous before God, every person who is destined fo...",only the names of the righteous,. The Psalmist likewise speaks of the Book of Life in which only the names o...,0.0,0.2,3.0,1b090aec7dbd1af6739c4c80f8995877-1,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]","[{'start': 498, 'end': 529}]","[{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}]",answer,Reader,integrated
3,7221391243398055910,who is written in the book of life,b'null',"[all people considered righteous before God, every person who is destined fo...",those who are found written in the book and who shall escape the troubles pr...,those who are found written in the book and who shall escape the troubles pr...,0.0,0.111111,4.0,1b090aec7dbd1af6739c4c80f8995877-6,"[1b090aec7dbd1af6739c4c80f8995877-0, 1b090aec7dbd1af6739c4c80f8995877-0]","[{'start': 135, 'end': 305}]","[{'start': 1107, 'end': 1149}, {'start': 374, 'end': 434}]",answer,Reader,integrated
0,-5738387693046012108,who was the girl in the video brenda got a baby,b'null',[Ethel ``Edy'' Proctor],her cousin,ng a story in the newspaper of a 12-year-old girl getting pregnant by her co...,0.0,0.0,1.0,965a125f65658579529b39f8e4344969-3,[965a125f65658579529b39f8e4344969-3],"[{'start': 423, 'end': 433}]","[{'start': 181, 'end': 202}]",answer,Reader,integrated
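
The `exact_match` and `f1` columns in the Reader dataframe can be approximated with SQuAD-style answer scoring. The sketch below assumes that normalization (lowercasing, stripping punctuation and the articles a/an/the) and keeps the best score over all gold answers. The helper names are hypothetical and this is not Haystack's own code. Under these assumptions, the rank-3 answer for the first query scores F1 = 0.2, which agrees with the table.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(metric, prediction, gold_answers):
    """Score the prediction against every gold answer and keep the maximum."""
    return max(metric(prediction, gold) for gold in gold_answers)


gold_answers = [
    "all people considered righteous before God",
    "every person who is destined for Heaven or the World to Come",
]
best_f1 = best_over_golds(f1_score, "only the names of the righteous", gold_answers)
print(round(best_f1, 3))  # 0.2
```

Only one token ("righteous") overlaps with the first gold answer, giving precision 1/4 and recall 1/6, hence F1 = 0.2.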


In [None]:
# We can filter for all documents retrieved for a given query
retriever_book_of_life = retriever_result[retriever_result['query'] == "who is written in the book of life"]


In [None]:
# We can also filter for all answers predicted for a given query
reader_book_of_life = reader_result[reader_result['query'] == 'who is written in the book of life']

In [None]:
# Save the evaluation results so that we can reload them later and calculate
# evaluation metrics without running the pipeline again.
eval_result.save('../')

## Calculating Evaluation Metrics
Load an EvaluationResult to quickly calculate standard evaluation metrics over all predictions, such as the F1 score of each individual Reader prediction or the recall of the Retriever.

In [None]:
saved_eval_result = EvaluationResult.load('../')
metrics = saved_eval_result.calculate_metrics()
print(f'Retriever - Recall (single relevant documents): {metrics["Retriever"]["recall_single_hit"]}')
print(f'Retriever - Recall (multiple relevant documents): {metrics["Retriever"]["recall_multi_hit"]}')
print(f'Retriever - Mean Reciprocal Rank: {metrics["Retriever"]["mrr"]}')
print(f'Retriever - Precision: {metrics["Retriever"]["precision"]}')
print(f'Retriever - Mean Average Precision: {metrics["Retriever"]["map"]}')

print(f'Reader - F1-Score: {metrics["Reader"]["f1"]}')
print(f'Reader - Exact match: {metrics["Reader"]["exact_match"]}')

Retriever - Recall (single relevant documents): 0.38
Retriever - Recall (multiple relevant documents): 0.38
Retriever - Mean Reciprocal Rank: 0.23433333333333334
Retriever - Precision: 0.07600000000000003
Retriever - Mean Average Precision: 0.23433333333333334
Reader - F1-Score: 0.7578128538128537
Reader - Exact match: 0.72
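
To show what the retriever numbers measure, here is a minimal sketch using common textbook definitions. The function names are hypothetical, the definitions are assumed rather than taken from Haystack's source, and the input is a toy list of per-query relevance flags ordered by rank (`1` means the document at that rank is relevant).

```python
def recall_single_hit(relevance_per_query):
    """Fraction of queries with at least one relevant document retrieved."""
    hits = sum(1 for flags in relevance_per_query if any(flags))
    return hits / len(relevance_per_query)


def mean_reciprocal_rank(relevance_per_query):
    """Mean of 1/rank of the first relevant document (0 if there is none)."""
    total = 0.0
    for flags in relevance_per_query:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_per_query)


def mean_precision(relevance_per_query):
    """Mean fraction of retrieved documents that are relevant."""
    per_query = [sum(flags) / len(flags) if flags else 0.0 for flags in relevance_per_query]
    return sum(per_query) / len(per_query)


# Four toy queries, three retrieved documents each.
toy_relevance = [
    [0, 1, 0],  # first relevant document at rank 2
    [0, 0, 0],  # nothing relevant retrieved
    [1, 0, 0],  # first relevant document at rank 1
    [0, 0, 1],  # first relevant document at rank 3
]
print(recall_single_hit(toy_relevance))               # 0.75
print(round(mean_reciprocal_rank(toy_relevance), 4))  # 0.4583
print(mean_precision(toy_relevance))                  # 0.25
```

Under these definitions, the precision of 0.076 printed above equals 0.38 / 5, which is the value obtained with `top_k=5` when every successfully answered query has exactly one relevant document among its five results.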


## Generating an Evaluation Report
A summary of the evaluation results can be printed to get a quick overview. It includes some aggregated metrics and also shows a few wrongly predicted examples.

In [None]:
pipeline.print_eval_report(saved_eval_result)

                   Pipeline Overview
                      Query
                        |
                        |
                      Retriever
                        |
                        | recall_single_hit:  0.42
                        | recall_single_hit_top_1:  0.42
                        |
                      Reader
                        |
                        | exact_match:  0.72
                        | exact_match_top_1:  0.48
                        | f1: 0.758
                        | f1_top_1:   0.5
                        |
                      Output

                Wrong Reader Examples
Query: 
 	who was the girl in the video brenda got a baby
Gold Answers: 
 	Ethel ``Edy'' Proctor
Gold Document Ids: 
 	965a125f65658579529b39f8e4344969-3
Metrics: 
 	f1: 0.0
 	exact_match: 0.0
Answers: 
 	multilabel_id: -5738387693046012108
 	filters: b'null'
 	answer: her cousin
 	context: ng a story in the newspaper of a 12-year-old girl getting pregnant by her co