# Evaluation of a model using 🤗nlp

*This notebook shows how `nlp` can be leveraged to evaluate **Longformer** on **TriviaQA**.*

- The [`nlp`](https://github.com/huggingface/nlp) library allows simple and intuitive access to nlp datasets and metrics.

- **Longformer** is a transformer-based model for long-range sequence modeling introduced by *Iz Beltagy, Matthew E. Peters, Arman Cohan* (see paper [here](https://arxiv.org/abs/2004.05150)). It is now available in Transformers (see the [docs](https://huggingface.co/transformers/model_doc/longformer.html)).

- **TriviaQA** is a reading comprehension dataset containing question-answer-evidence triplets (see paper [here](https://homes.cs.washington.edu/~eunsol/papers/acl17jcwz.pdf)).

We will evaluate a pretrained `LongformerForQuestionAnswering` model on the *validation* dataset of **TriviaQA**. Along the way, this notebook will show you how `nlp` can be used for effortless preprocessing of data and analysis of the results.

Alright! Let's start by installing the `nlp` library and loading *TriviaQA*. 

### Installs and Imports

In [None]:
# install nlp
!pip install -qq nlp==0.2.1

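# Make sure that we have a recent version of pyarrow (>= 0.16) in the session before we continue - otherwise restart the Colab runtime to activate it
import pyarrow
# compare (major, minor) so that versions such as 1.0 are not mistaken for old ones
major, minor = (int(part) for part in pyarrow.__version__.split(".")[:2])
if (major, minor) < (0, 16):
    import os
    os.kill(os.getpid(), 9)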

!pip install -qq transformers==2.11.0

import nlp
import torch

  Building wheel for nlp (setup.py) ... done


In [None]:
# ATTENTION: rerunning this command removes the entire cache directory, including the cached TriviaQA dataset
!rm -rf /root/.cache/

### Data cleaning and preprocessing 

The total *TriviaQA* dataset has a size of 17 GB once processed.
Downloading and preprocessing the dataset will take around *15 minutes*. ☕
Afterwards the data is serialized in *Arrow* format for quick reloading.



In [None]:
validation_dataset = nlp.load_dataset("trivia_qa", "rc", split="validation[:5%]")

Downloading and preparing dataset trivia_qa/rc (download: 2.48 GiB, generated: 14.92 GiB, total: 17.40 GiB) to /root/.cache/huggingface/datasets/trivia_qa/rc/1.1.0...


Dataset trivia_qa downloaded and prepared to /root/.cache/huggingface/datasets/trivia_qa/rc/1.1.0. Subsequent calls will reuse this data.


First, let's get an overview of the dataset 🧐

In [None]:
validation_dataset

Dataset(schema: {'question': 'string', 'question_id': 'string', 'question_source': 'string', 'entity_pages': 'struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>>', 'search_results': 'struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>>', 'answer': 'struct<aliases: list<item: string>, normalized_aliases: list<item: string>, matched_wiki_entity_name: string, normalized_matched_wiki_entity_name: string, normalized_value: string, type: string, value: string>'}, num_rows: 933)

5% of the validation data corresponds to 933 examples, which is a good snapshot of the actual dataset and enough to get familiar with it.

Let's check out the dataset's structure.

In [None]:
# check out schema
validation_dataset.schema

question: string not null
question_id: string not null
question_source: string not null
entity_pages: struct<doc_source: list<item: string>, filename: list<item: string>, title: list<item: string>, wiki_context: list<item: string>> not null
  child 0, doc_source: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, title: list<item: string>
      child 0, item: string
  child 3, wiki_context: list<item: string>
      child 0, item: string
search_results: struct<description: list<item: string>, filename: list<item: string>, rank: list<item: int32>, title: list<item: string>, url: list<item: string>, search_context: list<item: string>> not null
  child 0, description: list<item: string>
      child 0, item: string
  child 1, filename: list<item: string>
      child 0, item: string
  child 2, rank: list<item: int32>
      child 0, item: int32
  child 3, title: list<item: string>
      child 0, item: string
  child 4,

Alright, quite a lot of entries here! For question answering, all we need is the *question*, the *context* and the *answer*.

The **question** is a single entry, so we keep it.

Because *Longformer* was trained on the Wikipedia part of *TriviaQA*, we will use `validation_dataset["entity_pages"]["wiki_context"]` as our **context**.

We can also see that there are multiple entries for the **answer**. In this use case, we count a model output as correct if it is one of the answer aliases in `validation_dataset["answer"]["aliases"]`. Lastly, we also keep `validation_dataset["answer"]["normalized_value"]` as an additional reference. All other columns can be disregarded.

We apply the `.map()` function to map the dataset into the format as defined above.

In [None]:
# define the mapping function
def format_dataset(example):
    # the context might consist of multiple documents => we merge them here and replace newlines with spaces
    example["context"] = " ".join(example["entity_pages"]["wiki_context"]).replace("\n", " ")
    example["targets"] = example["answer"]["aliases"]
    example["norm_target"] = example["answer"]["normalized_value"]
    return example

# map the dataset and throw out all unnecessary columns
validation_dataset = validation_dataset.map(format_dataset, remove_columns=["search_results", "question_source", "entity_pages", "answer", "question_id"])

933it [00:01, 777.09it/s]
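
To make the effect of the mapping function concrete, here is a minimal sketch that applies the same steps to a single hand-made record. The record contents and the helper name `format_example` are invented for illustration; only the field names come from the schema shown above, and no `nlp` install is needed to run it.

```python
def format_example(example):
    # Same steps as the mapping function above, on a plain dict
    pages = example["entity_pages"]["wiki_context"]
    return {
        "question": example["question"],
        # merge all Wikipedia pages into one string without newlines
        "context": " ".join(pages).replace("\n", " "),
        "targets": example["answer"]["aliases"],
        "norm_target": example["answer"]["normalized_value"],
    }


# Hypothetical record, made up for illustration only
toy_example = {
    "question": "Which sport uses a hoop?",
    "entity_pages": {
        "wiki_context": [
            "Basketball is a team sport.\nIt uses a hoop.",
            "A hoop is a ring.",
        ]
    },
    "answer": {
        "aliases": ["Basketball", "B-ball"],
        "normalized_value": "basketball",
    },
}

formatted = format_example(toy_example)
print(formatted["context"])
```

A record whose `wiki_context` list is empty ends up with an empty string as context, which is the case handled by the filtering step that follows.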


Now, we can check out a first example of the dataset.

In [None]:
validation_dataset[8]

{'context': '',
 'norm_target': 'basket ball',
 'question': 'The Naismith Award is presented in which sport?',
 'targets': ['Basketball',
  'Basketball gear',
  'Bball',
  "Boy's Basketball",
  'B Ball',
  'Shoot hoops',
  'Basketball parity worldwide',
  "Men's Basketball",
  'High school basketball',
  'Basketball Worldwide',
  'Basketball club',
  'B-ball',
  'Basket-ball',
  'Basketball team',
  '🏀',
  'Basketball rim',
  'Basketballer',
  'Rim (basketball)',
  'Basket ball',
  'Basketball net',
  'Baksetball',
  'Basketball player',
  'Basket-Ball',
  "Women's hoops",
  "Men's basketball",
  'BasketBall',
  'Basketball Parity Worldwide',
  'Basket Ball',
  'Baketball',
  'Basketball Player',
  'B ball',
  'Unicycle basketball']}

Great 🙂. That's exactly the structure we wanted! Some examples have an empty context (like the one above), so we will filter those examples out.
For this we can use the convenient `.filter()` function of `nlp`.

In [None]:
validation_dataset = validation_dataset.filter(lambda x: len(x["context"]) > 0)
# check out how many samples are left
validation_dataset

100%|██████████| 1/1 [00:00<00:00,  8.11it/s]


Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 531)

Looks like more or less half of our examples have no context and are now filtered out. Let's think about the evaluation on *Longformer* now. 

*Longformer* is able to process inputs of up to **4096** tokens. As a rule of thumb, a word piece is about 4 characters long on average. Therefore, it is a good idea to check for how many examples the *question* + *context* exceeds 4 * 4096 characters.
Again we can apply the convenient `.map()` function.

In [None]:
print("\n\nLength for each example")
print(30 * "=")

# length for each example
validation_dataset.map(lambda x, i: print(f"Id: {i} - Question Length: {len(x['question'])} - context Length: {len(x['context'])}"), with_indices=True)
print(30 * "=")

print("\n")
print("Num examples shorter than 4 * 4096 characters: ")
# keep only the examples shorter than 4 * 4096 characters
short_validation_dataset = validation_dataset.filter(lambda x: (len(x['question']) + len(x['context'])) < 4 * 4096)
short_validation_dataset


531it [00:00, 3139.12it/s]
  0%|          | 0/1 [00:00<?, ?it/s]



Length for each example
Id: 0 - Question Length: 48 - context Length: 87872
Id: 0 - Question Length: 48 - context Length: 87872
Id: 0 - Question Length: 48 - context Length: 87872
Id: 1 - Question Length: 85 - context Length: 39997
Id: 2 - Question Length: 146 - context Length: 22353
Id: 3 - Question Length: 114 - context Length: 73891
Id: 4 - Question Length: 58 - context Length: 345
Id: 5 - Question Length: 80 - context Length: 36373
Id: 6 - Question Length: 68 - context Length: 1410
Id: 7 - Question Length: 68 - context Length: 115858
Id: 8 - Question Length: 81 - context Length: 1404
Id: 9 - Question Length: 83 - context Length: 65529
Id: 10 - Question Length: 75 - context Length: 32034
Id: 11 - Question Length: 59 - context Length: 24511
Id: 12 - Question Length: 111 - context Length: 46362
Id: 13 - Question Length: 79 - context Length: 26408
Id: 14 - Question Length: 190 - context Length: 52829
Id: 15 - Question Length: 130 - context Length: 28293
Id: 16 - Question Length: 70 -

100%|██████████| 1/1 [00:00<00:00, 19.04it/s]


Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string'}, num_rows: 127)

Interesting! We can see that only 127 of the 531 examples have fewer than 4 * 4096 = 16384 characters...

Most examples seem to have a very long context which will have to be cut to Longformer's maximum length.
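
The character budget can be checked by hand on two of the lengths printed above. This is only a sketch of the rule of thumb: the factor of 4 characters per token is the estimate used in this notebook, not a measured value, and the helper name `estimated_tokens` is made up for illustration.

```python
MAX_TOKENS = 4096      # Longformer's maximum input length in tokens
CHARS_PER_TOKEN = 4    # rule of thumb from the text, not a measured value
CHAR_BUDGET = MAX_TOKENS * CHARS_PER_TOKEN


def estimated_tokens(question_len, context_len):
    """Rough token estimate derived from character counts."""
    return (question_len + context_len) / CHARS_PER_TOKEN


# (question length, context length) pairs copied from the printed output above
lengths = {0: (48, 87872), 4: (58, 345)}

for idx, (q_len, c_len) in lengths.items():
    fits = q_len + c_len < CHAR_BUDGET
    print(f"Id {idx}: ~{estimated_tokens(q_len, c_len):.0f} tokens, fits: {fits}")
```

Under this estimate, example 0 is several times longer than the model can read, so its answer can only be found if it appears in the part of the context that survives truncation.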

### Evaluation

It's time to evaluate *Longformer* on *TriviaQA* 🚀.

Let's write our evaluation function and import the pretrained `LongformerForQuestionAnswering` model. For more details on `LongformerForQuestionAnswering`, see [here](https://huggingface.co/transformers/model_doc/longformer.html?highlight=longformerforquestionanswering#transformers.LongformerForQuestionAnswering).

In [None]:
from transformers import LongformerTokenizerFast, LongformerForQuestionAnswering

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")

# download the 1.7 GB pretrained model. It might take ~1min
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model.to("cuda")

def evaluate(example):
    def get_answer(question, context):
        # encode question and context so that they are separated by a tokenizer.sep_token and cut at max_length
        encoding = tokenizer.encode_plus(question, context, return_tensors="pt", max_length=4096)
        input_ids = encoding["input_ids"].to("cuda")
        attention_mask = encoding["attention_mask"].to("cuda")

        # the forward method will automatically set global attention on question tokens
        # The scores for the possible start token and end token of the answer are retrieved
        # wrap the function in torch.no_grad() to save memory
        with torch.no_grad():
            start_scores, end_scores = model(input_ids=input_ids, attention_mask=attention_mask)

        # Let's take the most likely token using `argmax` and retrieve the answer
        all_tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0].tolist())
        answer_tokens = all_tokens[torch.argmax(start_scores): torch.argmax(end_scores)+1]
        answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens))[1:].replace('"', '')  # drop the leading space of the decoded answer and remove unnecessary '"'
        
        return answer

    # save the model's output here
    example["output"] = get_answer(example["question"], example["context"])

    # save if it's a match or not
    example["match"] = (example["output"] in example["targets"]) or (example["output"] == example["norm_target"])

    return example



We are interested in the performance of the model on short and long samples.
First we evaluate the model on `short_validation_dataset`, which comprises only the samples that are shorter than 4 * 4096 characters.

In [None]:
results_short = short_validation_dataset.map(evaluate)

127it [02:49,  1.33s/it]


Now let's check how many questions the model answered correctly!

In [None]:
print(f"\nNum Correct examples: {sum(results_short['match'])}/{len(results_short)}")
wrong_results = results_short.filter(lambda x: x['match'] is False)
print(f"\nWrong examples: ")
wrong_results.map(lambda x, i: print(f"{i} - Output: {x['output']} - Target: {x['norm_target']}"), with_indices=True)

100%|██████████| 1/1 [00:00<00:00, 134.97it/s]
48it [00:00, 2939.25it/s]


Num Correct examples: 79/127

Wrong examples: 
0 - Output: Doctor Finlay - Target: dr finlay
0 - Output: Doctor Finlay - Target: dr finlay
0 - Output: Doctor Finlay - Target: dr finlay
1 - Output: film industry - Target: film making
2 - Output: kaleidoscopes - Target: kaleidoscope
3 - Output: fruit and vegetables - Target: fruit
4 - Output:  - Target: motel 6
5 - Output: Collapsible baby buggy - Target: baby buggy
6 - Output: 2012–13 - Target: 3
7 - Output: Chymosin - Target: rennet
8 - Output: *Baby It's Cold Outside (Rainbow City Recordings − 2012) *Hullabaloo (Rainbow City Recordings – 2013) *Dylan Thomas : A Child's Christmas, Poems and Tiger Eggs (Marvels of the Universe – 2014)  Compilation albums  *Brand New Boots and Panties!! (2001) – contributed If I Was With a Woman *Listen to Bob Dylan: A Tribute (2005) – contributed I Believe in You, a Bob Dylan song from Slow Train Coming *Hands Across the Water (2006) – contributed An Occasional Song *Songs for the Young at Heart (2007)




Dataset(schema: {'question': 'string', 'context': 'string', 'targets': 'list<item: string>', 'norm_target': 'string', 'output': 'string', 'match': 'bool'}, num_rows: 48)

79/127 - not bad 🔥. We can also see that many of the wrong outputs are very close to the correct solution, or are the correct solution written differently (numbers 0, 9, ...). A better post-processing step, added as a couple of lines in the `evaluate` function, could make sure that outputs like 0 and 9 are counted as correct.
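
As a rough illustration of such a post-processing step, the sketch below normalizes both the output and the references before comparing them. The normalization rules (lowercasing, dropping punctuation and English articles, collapsing whitespace) are an assumption in the style of common QA evaluation scripts, not the official TriviaQA scoring, and the names `normalize_answer` and `is_match` are made up for illustration.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_match(output, targets, norm_target):
    """Compare the normalized output against all normalized references."""
    references = {normalize_answer(target) for target in targets}
    references.add(normalize_answer(norm_target))
    return normalize_answer(output) in references


# differences in casing, punctuation and articles no longer matter
print(is_match("the Basket-Ball.", ["Basketball", "B-ball"], "basket ball"))

# abbreviations are not resolved by these rules
# (the aliases of this example are not shown above, so the list is left empty)
print(is_match("Doctor Finlay", [], "dr finlay"))
```

Cases such as "Doctor" versus "dr" or plural forms would need additional rules, so this sketch only recovers part of the near misses listed above.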

**Note**: *Longformer reached a new SOTA on TriviaQA - see Table 9 in [paper](https://arxiv.org/abs/2004.05150). In order to reproduce the exact results, please refer to the following [instructions](https://github.com/allenai/longformer/blob/master/scripts/cheatsheet.txt).*
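
Two of the wrong outputs above are an empty string (number 4) and a very long passage (number 8). One possible contributor, stated here as an assumption rather than a verified diagnosis, is that the start and end positions are chosen by two independent `argmax` calls, so the end can land before the start or far behind it. The sketch below searches for the best-scoring span with `start <= end` and a bounded length instead. It works on plain Python lists; the function name `best_span` and the default of 30 tokens are hypothetical choices.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score,
    subject to start <= end and a span of at most max_answer_len tokens."""
    best_score = float("-inf")
    best = (0, 0)
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores, made up for illustration: the independent argmax picks
# start=2 and end=0, which yields an empty slice of tokens
toy_start_scores = [0.1, 0.2, 5.0, 0.3]
toy_end_scores = [4.0, 0.1, 0.2, 3.0]

print(best_span(toy_start_scores, toy_end_scores))
```

To try this inside `get_answer`, the score tensors would first have to be converted to lists, for example with `start_scores[0].tolist()`, assuming a batch size of one as in this notebook.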

Second, we evaluate `LongformerForQuestionAnswering` on all of the examples.

In [None]:
results = validation_dataset.map(evaluate)

531it [19:46,  2.23s/it]


In [None]:
print(f"Correct examples: {sum(results['match'])}/{len(results)}")

Correct examples: 269/531


Here, we see a degradation compared to the short examples: only a bit more than half of the samples are correct. It is still a very good score though, given that most contexts had to be cut to Longformer's maximum length.
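
The two reported scores can be turned into accuracies, and into a score for the long examples alone, with a few lines of arithmetic. This sketch assumes that the model gives the same outputs for the short examples in both runs, which should hold for a deterministic forward pass but is not checked in this notebook.

```python
short_correct, short_total = 79, 127   # reported for results_short
all_correct, all_total = 269, 531      # reported for results

# the short examples are a subset of all examples, so the rest are the long ones
long_correct = all_correct - short_correct
long_total = all_total - short_total

scores = [
    ("short", short_correct, short_total),
    ("all", all_correct, all_total),
    ("long only", long_correct, long_total),
]
for name, correct, total in scores:
    print(f"{name}: {correct}/{total} = {correct / total:.1%}")
```

Under that assumption, the long examples on their own score lower than the overall number suggests, which fits the observation that their contexts are truncated.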

Now you should have all the tools necessary to preprocess your data and evaluate your model with 🤗nlp in no time!

🤗 🤗 **Finish** 🤗🤗

Thanks go out to Iz Beltagy for proofreading the notebook!