<a href="https://colab.research.google.com/github/patrickfleith/datapipes/blob/main/SQuADv2_Metric.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install evaluate --quiet
# The dependency-resolver ERROR that pip prints below (e.g. about fsspec==2024.10.0) can be safely ignored.

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible

In [2]:
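from evaluate import load
# Note: this loads the standalone `exact_match` metric; the SQuAD v2 metric itself is loaded further down.
exact_match_metric = load("exact_match")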

## SQuAD v2 Metric
This metric wraps the official scoring script for version 2 of the Stanford Question Answering Dataset (SQuAD).

SQuAD is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is either a segment of text (a span) from the corresponding reading passage, or the question is unanswerable.

SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

<hr>

**How to use?**

The metric takes two files or two lists: one representing the model predictions and the other the references to compare them to.

*Predictions*: a list of dictionaries, one per question-answer pair to score, each with the following three key-value pairs:

- `id`: the identifier of the question-answer pair
- `prediction_text`: the text of the predicted answer
- `no_answer_probability`: the probability that the question has no answer

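*References*: a list of dictionaries, one per question-answer pair, each with the following key-value pairs:
- `id`: the identifier of the question-answer pair (see above)
- `answers`: the gold answers. In the examples below this is a dictionary with a `text` list of answer strings and an `answer_start` list of character offsets; an empty `text` list marks the question as unanswerable

`no_answer_threshold` is not a key of the reference dictionaries but an optional argument of `compute` (default 1.0): the probability threshold used to decide that a question has no answer.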
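The input format described above can be sketched in plain Python. `make_prediction` and `make_reference` are hypothetical helpers, not part of `evaluate`; the layout of `answers` follows the examples used in this notebook, and the placeholder offsets rest on the assumption that `answer_start` is not used for scoring.

```python
def make_prediction(qid, text, no_answer_probability=0.0):
    """Build one prediction dictionary (hypothetical helper)."""
    return {
        "id": qid,
        "prediction_text": text,
        "no_answer_probability": float(no_answer_probability),
    }


def make_reference(qid, answer_texts, answer_starts=None):
    """Build one reference dictionary (hypothetical helper).

    An empty `answer_texts` list marks the question as unanswerable.
    """
    if answer_starts is None:
        # Assumption: character offsets are not used for scoring,
        # so 0 serves as a placeholder for every answer.
        answer_starts = [0] * len(answer_texts)
    return {
        "id": qid,
        "answers": {"text": list(answer_texts), "answer_start": list(answer_starts)},
    }


predictions = [make_prediction("0", "hello"), make_prediction("1", "", 1.0)]
references = [make_reference("0", ["hello"]), make_reference("1", [])]

# Every prediction should have a reference with the same id.
assert {p["id"] for p in predictions} == {r["id"] for r in references}
```

Lists built this way have the same shape as the `predictions` and `references` passed to `compute` in the cells that follow.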

In [3]:
from evaluate import load
squad_metric = load("squad_v2")

predictions = [
    {'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 1.},
    {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.},
    {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.},
]

references = [
    {'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'},
    {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'},
    {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'},
]

results = squad_metric.compute(predictions=predictions, references=references)
results

{'exact': 66.66666666666667,
 'f1': 66.66666666666667,
 'total': 3,
 'HasAns_exact': 66.66666666666667,
 'HasAns_f1': 66.66666666666667,
 'HasAns_total': 3,
 'best_exact': 66.66666666666667,
 'best_exact_thresh': 1.0,
 'best_f1': 66.66666666666667,
 'best_f1_thresh': 1.0}

This metric outputs a dictionary with up to 13 values. Scores are reported as percentages. The `HasAns_*` keys appear only when at least one question has an answer, and the `NoAns_*` keys only when at least one question is unanswerable:

- `exact`: exact match, i.e. the normalized prediction exactly matches a gold answer (see the `exact_match` metric)
- `f1`: the average F1 score of predicted tokens versus the gold answer (see the F1 score metric)
- `total`: the number of questions scored
- `HasAns_exact`: exact match, restricted to questions that have an answer
- `HasAns_f1`: F1 score of predicted tokens versus the gold answer, restricted to questions that have an answer
- `HasAns_total`: the number of questions that have an answer
- `NoAns_exact`: exact match, restricted to questions that have no answer
- `NoAns_f1`: F1 score, restricted to questions that have no answer
- `NoAns_total`: the number of questions that have no answer
- `best_exact`: the best exact match obtained by varying the no-answer threshold
- `best_exact_thresh`: the no-answer probability threshold associated with the best exact match
- `best_f1`: the best F1 score obtained by varying the no-answer threshold
- `best_f1_thresh`: the no-answer probability threshold associated with the best F1 score
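The exact-match and F1 definitions above can be sketched in plain Python. This is a minimal sketch that assumes the normalization of the official SQuAD evaluation script (lowercasing, removing punctuation and the English articles a/an/the, collapsing whitespace). The function names are illustrative and are not part of the `evaluate` API.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in PUNCTUATION)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_score(gold, pred):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(gold) == normalize_answer(pred))


def f1_score(gold, pred):
    """Token-level F1 between the normalized gold answer and prediction."""
    gold_toks = normalize_answer(gold).split()
    pred_toks = normalize_answer(pred).split()
    if not gold_toks or not pred_toks:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(gold_toks == pred_toks)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def best_over_golds(score_fn, golds, pred):
    """Score the prediction against every gold answer and keep the best."""
    return max(score_fn(g, pred) for g in golds)


print(exact_score("Facebook", "  facebook"))                         # case and spacing are ignored
print(f1_score("Beyoncé and Bruno Mars", "Beyonce"))                 # 'beyonce' != 'beyoncé', no shared token
print(best_over_golds(exact_score, ["bonjour", "cats"], "bonjour"))  # one matching gold answer is enough
```

Under these assumptions, a prediction of `Beyonce` shares no token with `Beyoncé and Bruno Mars` because the accented and unaccented spellings are different tokens, which is consistent with the 66.67 scores shown in the first example.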

In [9]:
# Unanswerable question (empty `text` list) answered with a non-empty prediction.
predictions = [{'prediction_text': 'hello', 'no_answer_probability': 1, 'id': '0'}]

references = [{'answers': {'answer_start': [0], 'text': []}, 'id': '0'}]

results = squad_metric.compute(predictions=predictions, references=references)
results

{'exact': 0.0,
 'f1': 0.0,
 'total': 1,
 'NoAns_exact': 0.0,
 'NoAns_f1': 0.0,
 'NoAns_total': 1,
 'best_exact': 100.0,
 'best_exact_thresh': 0.0,
 'best_f1': 100.0,
 'best_f1_thresh': 0.0}
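The result above can look surprising: the prediction carries `no_answer_probability` 1 for an unanswerable question, yet `exact` is 0.0 while `best_exact` is 100.0 at threshold 0.0. A likely explanation, assuming the metric mirrors the official SQuAD 2.0 evaluation script, is that a prediction counts as "no answer" only when its probability is strictly greater than the threshold, and the default threshold is 1.0. The sketch below illustrates that rule; `apply_no_answer_threshold` is a hypothetical helper, not a function of `evaluate`.

```python
def apply_no_answer_threshold(raw_score, no_answer_probability, has_answer, threshold=1.0):
    """Return the score of one question after thresholding (hypothetical helper)."""
    # Assumption: the comparison is strict, as in the official script.
    predicted_no_answer = no_answer_probability > threshold
    if predicted_no_answer:
        # Abstaining is correct only when the question really has no answer.
        return float(not has_answer)
    # Otherwise the raw score of `prediction_text` is kept.
    return float(raw_score)


# The case from the cell above: 'hello' predicted for an unanswerable question.
raw = 0.0  # 'hello' does not match the empty gold answer

default_score = apply_no_answer_threshold(raw, 1.0, has_answer=False)
lowered_score = apply_no_answer_threshold(raw, 1.0, has_answer=False, threshold=0.0)

print(default_score)  # 1.0 > 1.0 is False, so the raw score of 'hello' is kept
print(lowered_score)  # 1.0 > 0.0 is True, so the model is treated as abstaining
```

With the real metric, the same effect should be obtainable by passing `no_answer_threshold` to `squad_metric.compute`; treat that keyword as an assumption to check against the installed version of `evaluate`.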

In [36]:
# Scoring uses normalized text: '  facebook' matches 'Facebook', and a prediction
# only has to match one of the gold answers ('bonjour' vs ['bonjour', 'cats']).
predictions = [
    {'prediction_text': 'hello', 'no_answer_probability': 0, 'id': '0'},
    {'prediction_text': 'bonjour', 'no_answer_probability': 0, 'id': '1'},
    {'prediction_text': '  facebook', 'no_answer_probability': 0, 'id': '2'},
]

references = [
    {'answers': {'answer_start': [0], 'text': ['hello']}, 'id': '0'},
    {'answers': {'answer_start': [0], 'text': ['bonjour', 'cats']}, 'id': '1'},
    {'answers': {'answer_start': [0], 'text': ['Facebook']}, 'id': '2'},
]

results = squad_metric.compute(predictions=predictions, references=references)
results

{'exact': 100.0,
 'f1': 100.0,
 'total': 3,
 'HasAns_exact': 100.0,
 'HasAns_f1': 100.0,
 'HasAns_total': 3,
 'best_exact': 100.0,
 'best_exact_thresh': 0.0,
 'best_f1': 100.0,
 'best_f1_thresh': 0.0}

In [41]:
# Question '0' is unanswerable (empty `text` list) and the model abstains
# with an empty prediction; questions '1' and '2' have answers.
predictions = [
    {'prediction_text': '', 'no_answer_probability': 1, 'id': '0'},
    {'prediction_text': 'bonjour', 'no_answer_probability': 0, 'id': '1'},
    {'prediction_text': '  facebook', 'no_answer_probability': 0, 'id': '2'},
]

references = [
    {'answers': {'answer_start': [0], 'text': []}, 'id': '0'},
    {'answers': {'answer_start': [0], 'text': ['bonjour', 'cats']}, 'id': '1'},
    {'answers': {'answer_start': [0], 'text': ['Facebook']}, 'id': '2'},
]

results = squad_metric.compute(predictions=predictions, references=references)
results

{'exact': 100.0,
 'f1': 100.0,
 'total': 3,
 'HasAns_exact': 100.0,
 'HasAns_f1': 100.0,
 'HasAns_total': 2,
 'NoAns_exact': 100.0,
 'NoAns_f1': 100.0,
 'NoAns_total': 1,
 'best_exact': 100.0,
 'best_exact_thresh': 0.0,
 'best_f1': 100.0,
 'best_f1_thresh': 0.0}
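The `HasAns_*` and `NoAns_*` breakdown in the output above can be sketched as follows. This is a simplified illustration, assuming per-question scores are averaged over each subset and reported as percentages, and that an empty subset contributes no keys. `make_eval_dict` and `summarize` are illustrative names, and the `best_*` entries are left out.

```python
def make_eval_dict(exact_scores, f1_scores, qids=None):
    """Average per-question scores over `qids` and report them as percentages."""
    if qids is None:
        qids = list(exact_scores)
    total = len(qids)
    return {
        "exact": 100.0 * sum(exact_scores[q] for q in qids) / total,
        "f1": 100.0 * sum(f1_scores[q] for q in qids) / total,
        "total": total,
    }


def summarize(exact_scores, f1_scores, has_answer):
    """Overall scores plus HasAns_* / NoAns_* breakdowns for non-empty subsets."""
    report = make_eval_dict(exact_scores, f1_scores)
    subsets = {
        "HasAns": [q for q, h in has_answer.items() if h],
        "NoAns": [q for q, h in has_answer.items() if not h],
    }
    for prefix, qids in subsets.items():
        if qids:  # an empty subset contributes no keys
            for key, value in make_eval_dict(exact_scores, f1_scores, qids).items():
                report[f"{prefix}_{key}"] = value
    return report


# Per-question scores matching the last cell: one unanswerable and two answerable questions.
exact = {"0": 1.0, "1": 1.0, "2": 1.0}
f1 = {"0": 1.0, "1": 1.0, "2": 1.0}
has_answer = {"0": False, "1": True, "2": True}

report = summarize(exact, f1, has_answer)
print(report)
```

This matches the pattern seen across the cells: `NoAns_*` keys are absent when every question has an answer, and `HasAns_*` keys are absent when none does.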