# Model Metrics

We've put together our first Q&A model, and explored some of the metrics we can use to measure Q&A performance. In this notebook we're going to merge both of these and measure our Q&A model performance on the SQuAD 2.0 validation set as a whole.

First, we load our SQuAD validation data.

In [1]:
import json
from transformers import BertTokenizer, BertForQuestionAnswering
from transformers import pipeline

with open('../../data/squad/dev.json', 'r') as f:
    squad = json.load(f)

modelname = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelname)
model = BertForQuestionAnswering.from_pretrained(modelname)


qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

# initialize a list for answers
answers = []

for pair in squad[:5]:
    # pass in our question and context to return an answer
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append predicted answer and real to answers list
    answers.append({
        'predicted': ans['answer'],
        'true': pair['answer']
    })


Next, let's set up the QA pipeline again using the `deepset/bert-base-cased-squad2` model.

Now we build a list of predicted answers, `model_out`, and a list of true answers, `reference`, and calculate the ROUGE score from them.

In [16]:
from rouge import Rouge
from tqdm import tqdm

rouge = Rouge()

model_out = []
reference = []

In [17]:
for pair in tqdm(squad[:100], leave=True):
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append the prediction and reference to the respective lists
    model_out.append(ans['answer'])
    reference.append(pair['answer'])

100%|██████████| 100/100 [00:20<00:00,  4.78it/s]


This may take some time to process. The processing speed of our models will improve as we begin using more efficient implementations over the next few sections.

Once that has finished processing, we can calculate our ROUGE scores just like we did before.

In [18]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'r': 0.4621428571428571,
  'p': 0.4565714285714286,
  'f': 0.4363357728149752},
 'rouge-2': {'r': 0.2314404761904762,
  'p': 0.24515151515151515,
  'f': 0.2218395480658946},
 'rouge-l': {'r': 0.4621428571428571,
  'p': 0.4565714285714286,
  'f': 0.4363357728149752}}

That doesn't seem to be scoring as high as we would expect. If we print some of the results, we can see why:

In [19]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

In [21]:
print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])

Rollo,  |  Rollo  |  0.0


In [22]:
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

"Norseman, Viking".  |  Viking  |  0.0


Clearly the punctuation differences are causing our ROUGE score to view these words as not matching. To fix this, we'll import `re` and remove any characters that are not spaces, letters, or numbers.

In [23]:
import re

clean = re.compile('(?i)[^0-9a-z ]')

# apply this to both lists
model_out = [clean.sub('', text) for text in model_out]
reference = [clean.sub('', text) for text in reference]
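
To see why stripping punctuation helps, the sketch below compares the two examples token by token, before and after cleaning. It assumes the scorer splits text on whitespace, which is an assumption about the `rouge` package rather than something shown above; `shared_tokens` is a hypothetical helper written only for this illustration.

```python
import re

# same pattern as above: drop anything that is not a digit, letter, or space
clean = re.compile('(?i)[^0-9a-z ]')

def shared_tokens(predicted, true):
    """Hypothetical helper: tokens two strings share after a whitespace split."""
    return set(predicted.split()) & set(true.split())

# the two prediction/reference pairs printed earlier
examples = [('Rollo,', 'Rollo'), ('"Norseman, Viking".', 'Viking')]

for predicted, true in examples:
    before = shared_tokens(predicted, true)
    after = shared_tokens(clean.sub('', predicted), clean.sub('', true))
    print(before, '->', after)
```

Before cleaning, neither pair shares a token, because `Rollo,` and `Viking".` carry punctuation; after cleaning, each pair shares one.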

In [24]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

In [25]:
print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])

Rollo  |  Rollo  |  0.999999995


In [26]:
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

Norseman Viking  |  Viking  |  0.6666666622222223
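
These two values can be checked by hand. For `Norseman Viking` against `Viking`, one of the two predicted unigrams matches (precision 0.5) and the single reference unigram is found (recall 1.0). The sketch below assumes the F-score adds a small constant (taken here to be `1e-8`) to the denominator to avoid division by zero; that constant is an assumption, chosen because it is consistent with the printed outputs, and `rouge_f1` is a hypothetical helper rather than part of the library.

```python
def rouge_f1(precision, recall, eps=1e-8):
    """Hypothetical helper: F1 with a small stabilising constant (assumed 1e-8)."""
    return 2.0 * precision * recall / (precision + recall + eps)

# 'Rollo' vs 'Rollo': the single unigram matches in both directions
print(rouge_f1(precision=1.0, recall=1.0))

# 'Norseman Viking' vs 'Viking': 1 of 2 predicted tokens matches,
# and 1 of 1 reference tokens is found
print(rouge_f1(precision=0.5, recall=1.0))
```

Under that assumption the results land just below 1.0 and 2/3, which would explain the trailing digits in the outputs above.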


These scores are looking better now. Let's calculate the average again:

In [27]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'r': 0.6263095238095236,
  'p': 0.6139090909090908,
  'f': 0.5770818038114618},
 'rouge-2': {'r': 0.2764404761904762,
  'p': 0.2815800865800866,
  'f': 0.25933954789401953},
 'rouge-l': {'r': 0.6263095238095236,
  'p': 0.6139090909090908,
  'f': 0.5770818038114618}}

Now we are seeing much more realistic scores.

# Recall, Precision, F1 and ROUGE-L (Longest Common Subsequence, LCS)

## Recall
Recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference. It looks like this:

<div>
<img src="../../assets/images/rouge_recall.png" width="500"/>
</div>

This is great for ensuring our model is **capturing all of the information** contained in the reference — but this isn’t so great at ensuring our model isn’t just pushing out a huge number of words to game the recall score:

<div>
<img src="../../assets/images/rouge_gaming_recall.png" width="500"/>
</div>
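
A minimal sketch of n-gram recall, and of how padding the output games it, is shown below. The sentences are made up for illustration and are not necessarily those in the figures; `ngrams` and `ngram_recall` are hypothetical helpers, and the clipped-count overlap used here is one common formulation that may differ in detail from what the `rouge` package does internally.

```python
from collections import Counter

def ngrams(text, n=1):
    """Count the n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(model_out, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the reference."""
    out, ref = ngrams(model_out, n), ngrams(reference, n)
    overlap = sum((out & ref).values())  # clipped overlap counts
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = 'the fox jumps'                # illustrative reference
concise = 'the fox jumps'                  # an exact answer
padded = 'the hello a cat dog fox jumps'   # the same words buried in noise

print(ngram_recall(concise, reference))  # 1.0
print(ngram_recall(padded, reference))   # 1.0
```

Both outputs reach a unigram recall of 1.0, even though the padded one is mostly irrelevant words.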


## Precision

To avoid this we use the precision metric — which is calculated in almost the exact same way, but rather than dividing by the reference n-gram count, we divide by the model n-gram count.

<div>
<img src="../../assets/images/rouge_precision_calc.png" width="500"/>
</div>

So if we apply this to our previous example, we get a precision score of just 43%:

<div>
<img src="../../assets/images/rouge_precision_fixes_recall.png" width="500"/>
</div>
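
The same idea can be sketched for precision. The sentences below are illustrative and chosen so that 3 of the 7 output unigrams match, which gives roughly the 43% mentioned above; they are not necessarily the sentences in the figure, and `ngram_precision` is a hypothetical helper using clipped-count overlap.

```python
from collections import Counter

def ngrams(text, n=1):
    """Count the n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(model_out, reference, n=1):
    """Overlapping n-grams divided by the number of n-grams in the model output."""
    out, ref = ngrams(model_out, n), ngrams(reference, n)
    overlap = sum((out & ref).values())  # clipped overlap counts
    total = sum(out.values())
    return overlap / total if total else 0.0

reference = 'the fox jumps'               # illustrative reference
padded = 'the hello a cat dog fox jumps'  # illustrative padded output

# 3 matching unigrams out of 7 produced
print(round(ngram_precision(padded, reference), 2))  # 0.43
```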




## F1-Score

Now that we have both the recall and precision values, we can use them to calculate our ROUGE F1 score like so:

<div>
<img src="../../assets/images/rouge_f1_calc.png" width="500"/>
</div>


Let's apply that again to our previous example:

<div>
<img src="../../assets/images/rouge_f1.png" width="500"/>
</div>


That gives us a reliable measure of our model performance that relies not only on the model capturing as many words as possible (recall) but doing so without outputting irrelevant words (precision).
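
Putting the three quantities together, a small sketch for unigrams might look like this. The sentences are illustrative, `unigram_scores` is a hypothetical helper, and the F1 here is the plain harmonic mean with no smoothing term.

```python
from collections import Counter

def unigram_scores(model_out, reference):
    """Return (recall, precision, f1) for whitespace-tokenized unigrams."""
    out, ref = Counter(model_out.split()), Counter(reference.split())
    overlap = sum((out & ref).values())  # clipped overlap counts
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(out.values()) if out else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1

# illustrative padded output scored against a short reference
recall, precision, f1 = unigram_scores('the hello a cat dog fox jumps', 'the fox jumps')
print(round(recall, 2), round(precision, 2), round(f1, 2))  # 1.0 0.43 0.6
```

The perfect recall is pulled down by the low precision, so padding the output no longer pays off.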


## ROUGE-L

ROUGE-L measures the **Longest Common Subsequence (LCS)** between our model output and reference. All this means is that we find the longest sequence of tokens that appears in the same order in both (the tokens do not have to be adjacent) and count its length:

<div>
<img src="../../assets/images/rouge_l.png" width="500"/>
</div>



The idea here is that a longer shared sequence would indicate more similarity between the two sequences. We can apply our recall and precision calculations just like before — but this time we replace the match with LCS.

First, we calculate the LCS recall:

### LCS Recall = length of the longest common subsequence / number of tokens in the reference text

<div>
<img src="../../assets/images/rouge_l_recall.png" width="500"/>
</div>

Precision is calculated in the same way; we just switch the total token count from the reference to the model output:

### LCS Precision = length of the longest common subsequence / number of tokens in the predicted text

<div>
<img src="../../assets/images/rouge_l_precision.png" width="500"/>
</div>

And finally, we calculate the F1 score just like we did before:

### LCS F1-Score = 2 * (LCS Recall * LCS Precision) / (LCS Recall + LCS Precision)

<div>
<img src="../../assets/images/rouge_l_f1.png" width="500"/>
</div>
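
The three LCS formulas can be sketched with a standard dynamic-programming LCS over whitespace tokens. This is an illustration under stated assumptions, not the `rouge` package's implementation: `lcs_length` and `rouge_l` are hypothetical helpers, the sentences are made up, and no smoothing constant is applied.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(model_out, reference):
    """Return (recall, precision, f1) based on the LCS length."""
    out, ref = model_out.split(), reference.split()
    lcs = lcs_length(out, ref)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(out) if out else 0.0
    denom = recall + precision
    f1 = 2 * recall * precision / denom if denom else 0.0
    return recall, precision, f1

# LCS is "the brown fox jumps" (4 tokens); both texts have 5 tokens
recall, precision, f1 = rouge_l('the brown fox quickly jumps',
                                'the quick brown fox jumps')
print(round(recall, 2), round(precision, 2), round(f1, 2))  # 0.8 0.8 0.8
```

Note that the matched tokens keep their order but skip over `quick` and `quickly`, which is what distinguishes a subsequence from a contiguous n-gram match.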



In [31]:
# ### ROUGE-S

# The final ROUGE metric we will look at is the ROUGE-S — or skip-gram concurrence metric.

# Now, this metric is much less popular than ROUGE-N and ROUGE-L covered already — but it’s worth being aware of what it does.

# Using the skip-gram metric allows us to search for consecutive words from the reference text, that appear in the model output but are separated by one-or-more other words.

# So, if we took the bigram “the fox”, our original ROUGE-2 metric would only match this if this exact sequence was found in the model output. If the model instead outputs “the brown fox” — no match would be found.

# ROUGE-S allows us to add a degree of leniency to our n-gram matching. For our bigram example we could match by using a skip-bigram measure:

# ![ROUGE-S recall](../../assets/images/rouge_s_recall.png)

# The same logic applies to our precision metric too:

# ![ROUGE-S precision](../../assets/images/rouge_s_precision.png)

# After calculating our recall and precision, we can calculate the F1 score too just as we did before.

# ### Cons

# ROUGE is a great evaluation metric but comes with some drawbacks. In particular, ROUGE does not cater for different words that have the same meaning, as it measures lexical overlap rather than semantics.

# So, if we had two sequences that had the same meaning — but used different words to express that meaning — they could be assigned a low ROUGE score.

# This can be offset slightly by using several references and taking the average score, but this will not solve the problem entirely.

# Nonetheless, it’s a good metric which is very popular for assessing the performance of several NLP tasks, including machine translation, automatic summarization, and *for us*, question answering.

# ## In Python

# We've worked through the theory of the ROUGE metrics and how they work. Fortunately, implementing these metrics in Python is incredibly easy thanks to the Python rouge library.

# We can install the library through pip:

# ```
# pip install rouge
# ```

# And scoring our model output against a reference is as easy as this: