# Model Metrics

We've put together our first Q&A model and explored some of the metrics we can use to measure Q&A performance. In this notebook we combine the two and measure our Q&A model's performance across many samples from the SQuAD 2.0 validation set, rather than on a single example.

First, we load our SQuAD validation data.

In [1]:
import json

with open('../../data/squad/dev.json', 'r') as f:
    squad = json.load(f)

# we will limit it to the first 100 samples in the interest of time
squad = squad[:100]

Next, let's set up the QA pipeline again using the `deepset/bert-base-cased-squad2` model.

In [2]:
from transformers import BertTokenizer, BertForQuestionAnswering, pipeline

modelname = 'deepset/bert-base-cased-squad2'

tokenizer = BertTokenizer.from_pretrained(modelname)
model = BertForQuestionAnswering.from_pretrained(modelname)

qa = pipeline('question-answering', model=model, tokenizer=tokenizer)

And now we build a list of predicted answers `model_out` and true answers `reference` and calculate the ROUGE score based on these.

In [14]:
from tqdm import tqdm

model_out = []
reference = []

for pair in tqdm(squad, leave=True):
    ans = qa({
        'question': pair['question'],
        'context': pair['context']
    })
    # append the prediction and reference to the respective lists
    model_out.append(ans['answer'])
    reference.append(pair['answer'])

100%|████████████████████████████████████████████████████████████████████████████████| 100/100 [04:32<00:00,  2.73s/it]


This may take some time to process. The processing speed of our models will improve as we begin using more efficient implementations over the next few sections.

Once that has finished processing, we can calculate our ROUGE scores just like we did before.

In [15]:
from rouge import Rouge

# initialize
rouge = Rouge()

# get scores
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'f': 0.43488860887574143,
  'p': 0.45555952380952375,
  'r': 0.4615277777777777},
 'rouge-2': {'f': 0.2218395480658946,
  'p': 0.24515151515151515,
  'r': 0.2314404761904762},
 'rouge-l': {'f': 0.4363357728149752,
  'p': 0.4565714285714286,
  'r': 0.4621428571428571}}
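
With `avg=True` we get one aggregate per metric instead of a list of per-sample dictionaries. As a rough sketch of that aggregation, assuming it is a plain arithmetic mean over samples (an assumption about the `rouge` package, not something shown in this notebook), the hypothetical helper `mean_scores` below averages per-sample score dictionaries of the same shape. The sample values are made up for illustration.

```python
def mean_scores(per_sample):
    """Hypothetical helper: average a list of {'metric': {'f', 'p', 'r'}} dicts."""
    if not per_sample:
        return {}
    totals = {}
    for sample in per_sample:
        for metric, values in sample.items():
            # one running total per metric, keyed by f / p / r
            bucket = totals.setdefault(metric, {'f': 0.0, 'p': 0.0, 'r': 0.0})
            for key in bucket:
                bucket[key] += values[key]
    n = len(per_sample)
    return {m: {k: v / n for k, v in vals.items()} for m, vals in totals.items()}

# made-up per-sample scores, shaped like the output of get_scores without avg=True
toy = [
    {'rouge-1': {'f': 1.0, 'p': 1.0, 'r': 1.0}},
    {'rouge-1': {'f': 0.0, 'p': 0.0, 'r': 0.0}},
]
averaged = mean_scores(toy)
print(averaged)
```

Under that assumption, every sample that scores 0.0 pulls the average down in proportion to its share of the samples.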

That isn't scoring as high as we would expect. If we print some of the results, we can see why:

In [16]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

Rollo,  |  Rollo  |  0.0
"Norseman, Viking".  |  Norseman, Viking  |  0.0


Clearly, the punctuation differences are causing ROUGE to treat these answers as non-matching. To fix this, we'll import `re` and remove any characters that are not spaces, letters, or numbers.

In [17]:
import re

clean = re.compile('(?i)[^0-9a-z ]')

# apply this to both lists
model_out = [clean.sub('', text) for text in model_out]
reference = [clean.sub('', text) for text in reference]
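
To see why a single trailing comma is enough to drop a score to zero, here is a minimal sketch of a unigram-overlap F1 over whitespace-split tokens. It is a simplified stand-in for ROUGE-1, not the `rouge` library's implementation, and `unigram_f1` is a hypothetical helper introduced only for illustration.

```python
import re
from collections import Counter

# same pattern as above: drop everything except digits, letters, and spaces
clean = re.compile('(?i)[^0-9a-z ]')

def unigram_f1(hypothesis, reference):
    """Hypothetical helper: F1 over whitespace-split unigrams (simplified ROUGE-1)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # count tokens shared by both strings, respecting multiplicity
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# 'Rollo,' and 'Rollo' are different tokens until the comma is stripped
raw_f1 = unigram_f1('Rollo,', 'Rollo')
clean_f1 = unigram_f1(clean.sub('', 'Rollo,'), clean.sub('', 'Rollo'))
print(raw_f1, clean_f1)
```

Because matching is done on whole tokens, `Rollo,` and `Rollo` share nothing until the punctuation is removed.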

In [18]:
# recalculate individual scores
scores = rouge.get_scores(model_out, reference)

print(model_out[4], ' | ', reference[4], ' | ', scores[4]['rouge-1']['f'])
print(model_out[22], ' | ', reference[22], ' | ', scores[22]['rouge-1']['f'])

Rollo  |  Rollo  |  0.999999995
Norseman Viking  |  Norseman Viking  |  0.999999995


These scores look better now. Let's calculate the average again:

In [19]:
rouge.get_scores(model_out, reference, avg=True)

{'rouge-1': {'f': 0.5754124176500057,
  'p': 0.6127186147186147,
  'r': 0.6256944444444443},
 'rouge-2': {'f': 0.25933954789401953,
  'p': 0.2815800865800866,
  'r': 0.2764404761904762},
 'rouge-l': {'f': 0.5770818038114618,
  'p': 0.6139090909090908,
  'r': 0.6263095238095236}}

Now we are seeing much more realistic scores: the average ROUGE-1 F1 has risen from roughly 0.43 to 0.58 once punctuation is stripped.