# TP7: Fine-tuning BERT on Q&A tasks

**Authors:** 
- julien.denize@centralesupelec.fr
- tom.dupuis@centralesupelec.fr


If you have questions or suggestions, contact us and we will gladly answer and take into account your remarks.

For this TP, you need a basic understanding of PyTorch. An introduction is available [here](https://pytorch.org/tutorials/beginner/basics/intro.html).



## Objective

In this TP, we will implement a solution for the Question Answering (Q&A) task by fine-tuning a pretrained DistilBERT model.

This TP is built on the [HuggingFace Q&A tutorial](https://huggingface.co/docs/transformers/tasks/question_answering), so it relies on the HuggingFace libraries.

Question answering tasks return an answer given a question. There are two common forms of question answering:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.

In this TP, we will show you how to fine-tune [Distilbert](https://huggingface.co/docs/transformers/model_doc/distilbert) on the SQuAD dataset for extractive question answering.

DistilBERT is a smaller transformer architecture than BERT. It was trained by [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation) of BERT to provide a lightweight, faster NLP model that retains high performance.





## Your task

Fill in the missing parts of the code (the parts between `# --- START CODE HERE` and `# --- END CODE HERE`).

In [1]:
import torch
import numpy as np
import random

# Seed everything
seed=42
torch.manual_seed(seed)
random.seed(seed)
np.random.seed(seed)

## Install the required libraries

We need to install the following pip packages:
- [datasets](https://pypi.org/project/datasets/): to load datasets available on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). 
- [transformers](https://pypi.org/project/transformers/): to load thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio in PyTorch, TensorFlow or JAX.

In [2]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Load the dataset

We will use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

>Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

We will instantiate the train and validation splits via the [load_dataset](https://huggingface.co/docs/datasets/v1.11.0/splits.html) function.

There are 87,599 elements in the train split and 10,570 elements in the validation split. We will only request 1% of each (876 and 106 samples respectively) to keep the training time short.

In [3]:
from datasets import load_dataset

# --- START CODE HERE (01)
# Load 1% of the train and validation splits of the SQuAD dataset.
dataset = load_dataset("squad", split=['train[:1%]', 'validation[:1%]'])

train_dataset = dataset[0]
validation_dataset = dataset[1]
# --- END CODE HERE
dataset




[Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 876
 }), Dataset({
     features: ['id', 'title', 'context', 'question', 'answers'],
     num_rows: 106
 })]

Now we can access the samples in the datasets.

Each sample contains the following information:
- the id of the question.
- the title of the Wikipedia article.
- the context that contains the answer to the question.
- the question.
- the answers, along with the character index at which each answer starts in the context.

In [4]:
train_dataset[0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}
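
To make the role of `answer_start` concrete, here is a minimal sketch on a made-up sample that only mimics the SQuAD format (the `sample` dictionary and the `extract_answer` helper are hypothetical, not part of the dataset or of any library). It illustrates that, in extractive question answering, the answer is literally a slice of the context.

```python
# Hypothetical sample that mimics the SQuAD format (not taken from the dataset).
context = "The Grotto is a replica of the grotto at Lourdes, France."
answer_text = "Lourdes, France"

sample = {
    "context": context,
    "question": "Where is the original grotto located?",
    # `answer_start` is a character offset into the context.
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}


def extract_answer(sample, answer_idx=0):
    """Recover the answer by slicing the context with the character offsets."""
    start = sample["answers"]["answer_start"][answer_idx]
    end = start + len(sample["answers"]["text"][answer_idx])
    return sample["context"][start:end]


print(extract_answer(sample))
```

The same slicing, `context[answer_start : answer_start + len(text)]`, is what the preprocessing relies on when it converts character positions into token positions.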

HuggingFace provides a nice function to better show what the data looks like.

In [5]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [6]:
show_random_elements(train_dataset, 3)

Unnamed: 0,id,title,context,question,answers
0,56beb0683aeaaa14008c9213,Beyoncé,"In 2015 Beyoncé signed an open letter which the ONE Campaign had been collecting signatures for; the letter was addressed to Angela Merkel and Nkosazana Dlamini-Zuma, urging them to focus on women as they serve as the head of the G7 in Germany and the AU in South Africa respectively, which will start to set the priorities in development funding before a main UN summit in September 2015 that will establish new development goals for the generation.",An important UN summit took place when?,"{'text': ['September 2015'], 'answer_start': [374]}"
1,5733926d4776f41900660d8f,University_of_Notre_Dame,"The Rev. John J. Cavanaugh, C.S.C. served as president from 1946 to 1952. Cavanaugh's legacy at Notre Dame in the post-war years was devoted to raising academic standards and reshaping the university administration to suit it to an enlarged educational mission and an expanded student body and stressing advanced studies and research at a time when Notre Dame quadrupled in student census, undergraduate enrollment increased by more than half, and graduate student enrollment grew fivefold. Cavanaugh also established the Lobund Institute for Animal Studies and Notre Dame's Medieval Institute. Cavanaugh also presided over the construction of the Nieuwland Science Hall, Fisher Hall, and the Morris Inn, as well as the Hall of Liberal Arts (now O'Shaughnessy Hall), made possible by a donation from I.A. O'Shaughnessy, at the time the largest ever made to an American Catholic university. Cavanaugh also established a system of advisory councils at the university, which continue today and are vital to the university's governance and development",Which institute involving animal life did Cavanaugh create at Notre Dame?,"{'text': ['Lobund Institute for Animal Studies'], 'answer_start': [522]}"
2,5733a7bd4776f41900660f6c,University_of_Notre_Dame,"The university first offered graduate degrees, in the form of a Master of Arts (MA), in the 1854–1855 academic year. The program expanded to include Master of Laws (LL.M.) and Master of Civil Engineering in its early stages of growth, before a formal graduate school education was developed with a thesis not required to receive the degrees. This changed in 1924 with formal requirements developed for graduate degrees, including offering Doctorate (PhD) degrees. Today each of the five colleges offer graduate education. Most of the departments from the College of Arts and Letters offer PhD programs, while a professional Master of Divinity (M.Div.) program also exists. All of the departments in the College of Science offer PhD programs, except for the Department of Pre-Professional Studies. The School of Architecture offers a Master of Architecture, while each of the departments of the College of Engineering offer PhD programs. The College of Business offers multiple professional programs including MBA and Master of Science in Accountancy programs. It also operates facilities in Chicago and Cincinnati for its executive MBA program. Additionally, the Alliance for Catholic Education program offers a Master of Education program where students study at the university during the summer and teach in Catholic elementary schools, middle schools, and high schools across the Southern United States for two school years.",What type of degree is an M.Div.?,"{'text': ['Master of Divinity'], 'answer_start': [624]}"


## Preprocess the training data



Now that we have access to the data, we need to preprocess it to feed it to our neural networks.

In NLP, this step consists of tokenizing the data, that is, splitting the text into tokens and converting each token into a unique ID.
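
As an illustration of what "converting tokens into IDs" means, here is a minimal sketch of greedy longest-match subword tokenization in the style of WordPiece, the scheme used by BERT-like tokenizers. The vocabulary, its IDs, and the helper names below are hypothetical and far smaller than the real ones; the real tokenizer also handles punctuation and special tokens, which this sketch ignores.

```python
# Toy vocabulary with made-up IDs (the real vocabulary has tens of thousands of entries).
vocab = {"[UNK]": 0, "test": 1, "the": 2, "token": 3, "##izer": 4}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest sub-words found in the vocabulary."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, then map every sub-word to its ID."""
    tokens = [p for word in text.lower().split() for p in wordpiece(word, vocab)]
    return tokens, [vocab[t] for t in tokens]


tokens, ids = encode("Test the TOKENIZER", vocab)
print(tokens, ids)
```

Lowercasing before the lookup is what makes such a tokenizer "uncased": `TOKENIZER` and `tokenizer` end up with the same IDs.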

Our model requires the following as input:
- A first sequence that is the question.
- A separator token [SEP].
- A second sequence that may contain the answer.

The label is given by the start and end indices of the tokens that compose the answer.

![](https://miro.medium.com/max/1400/1*QhIXsDBEnANLXMA0yONxxA.png)

### Instantiate the tokenizer

We will use the [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto) class provided by HuggingFace, which ensures that we load the tokenizer that was used to train the DistilBERT model. For that, we need the right model checkpoint. The list of checkpoints is available [here](https://huggingface.co/models); you need to retrieve the base DistilBERT checkpoint that ignores case (ENGLISH = english).

In [7]:
# --- START CODE HERE (02)
# Import the auto tokenizer class and instantiate it
from transformers import AutoTokenizer
    
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# --- END CODE HERE


You can try the tokenizer on custom strings or on our data. The tokenizer accepts a pair of sequences as input and returns a single concatenated tokenized output, which starts with the [CLS] token and has a separator token [SEP] after each sequence.

In [8]:
custom_string = "Hi, I would love to test the tokenizer."
custom_string_2 = "Sure, go ahead and verify that english = ENGLISH = ENglISh."
tokenized_custom_strings = tokenizer(custom_string, custom_string_2)
tokenized_custom_strings

{'input_ids': [101, 7632, 1010, 1045, 2052, 2293, 2000, 3231, 1996, 19204, 17629, 1012, 102, 2469, 1010, 2175, 3805, 1998, 20410, 2008, 2394, 1027, 2394, 1027, 2394, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

You can now decode the sequence and retrieve the initial string together with the special tokens. You can also verify that the case has been removed from the string, since our model does not distinguish between lower and upper case.

In [9]:
tokenizer.decode(tokenized_custom_strings["input_ids"])

'[CLS] hi, i would love to test the tokenizer. [SEP] sure, go ahead and verify that english = english = english. [SEP]'

Here we can apply the tokenizer on one question from our training set.

In [10]:
print(train_dataset[0]["question"])
print(tokenizer(train_dataset[0]["question"]))
print(tokenizer.decode(tokenizer(train_dataset[0]["question"])["input_ids"]))

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
{'input_ids': [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP]


### Deal with long contexts

Our model accepts only a limited number of tokens per input. Our input is composed of both the question and the context, separated by the special token [SEP].

However, some samples in our dataset may have a question plus context that is longer than this maximum number of tokens. We cannot simply truncate the input, as is done for some other tasks, because the answer to the question might be located in the part that is cut.

Instead, a long context will be split into several input features, each shorter than the maximum length of the model. To prevent the answer from being cut at a splitting point, we will make the input features overlap.

In [11]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The allowed overlap (in tokens) between two consecutive parts of the context when it has to be split.
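
The splitting strategy described above can be sketched in plain Python. This is a simplified model of a sliding window with overlap, not the tokenizer's actual implementation: `split_with_overlap` is a hypothetical helper, and the numbers assume a 379-token context and 17 positions taken by the question and the special tokens.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items that share `overlap` items."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far the window moves each time
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


# Assumed numbers: 384 positions minus 17 used by the question and special tokens.
context_tokens = list(range(379))
chunks = split_with_overlap(context_tokens, window=384 - 17, overlap=128)
print([len(c) for c in chunks])
```

Under these assumptions the context is cut into two chunks of 367 and 140 tokens, and the last 128 tokens of the first chunk are repeated at the start of the second one, so an answer sitting near the cut still appears whole in at least one chunk.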

Below is the code to find the first example with a long input:

In [12]:
for i, example in enumerate(train_dataset):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        long_context_idx = i
        break
long_example = train_dataset[long_context_idx]
long_example, len(tokenizer(long_example["question"], long_example["context"])["input_ids"])

({'id': '5733caf74776f4190066124c',
  'title': 'University_of_Notre_Dame',
  'context': "The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at No

To split the input into several features, we need to configure our tokenizer correctly and feed it inputs that follow these requirements:
- Pass the question and the context to the tokenizer as a pair. It will automatically add the [SEP] token between the two.
- Force the tokenizer to split the input if it is too long:
  - only the second part (the context) can be truncated, so that the question is shared by all new inputs.
  - allow consecutive inputs to overlap.

All these requirements can be met through the [tokenizer utilities](https://huggingface.co/docs/transformers/v4.23.1/en/internal/tokenization_utils) by passing the right parameters.

In [13]:
# --- START CODE HERE (03)
# Tokenize the long example with the requirements defined above.
tokenized_long_example = tokenizer(
    long_example["question"],
    long_example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    stride=doc_stride
)
# --- END CODE HERE

print(f"The long example now has {len(tokenized_long_example['input_ids'])} inputs with length {[len(x) for x in tokenized_long_example['input_ids']]}.") # Should have 2 inputs with length [384, 157]
for sequence in tokenized_long_example['input_ids']:
  print(tokenizer.decode(sequence))

The long example now has 2 inputs with length [384, 157].
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike br

The problem with the above solution is that we lose track of where the answer is located: we need to know its position in each feature.

The model requires the start and end positions of the answer in tokens, so we also need to map parts of the original context to tokens.

For each token of a feature, we need the start and end characters of the original text that produced it, in the format (`start_char`, `end_char`). The first token (`[CLS]`) has (0, 0) because it is a special token that was added and is not present in the original text.

This can be done using the tokenizer utilities.
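
Before running the real tokenizer, the idea of an offset mapping can be illustrated with a toy whitespace tokenizer. This is only a sketch: `tokenize_with_offsets` is a hypothetical helper that splits on whitespace, whereas the real tokenizer works on sub-words, but the (`start_char`, `end_char`) convention is the same.

```python
import re


def tokenize_with_offsets(text):
    """Toy tokenizer: lowercase whitespace-separated tokens with their character spans."""
    return [(m.group().lower(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


text = "Notre Dame has over 1,600 wins"
pairs = tokenize_with_offsets(text)
tokens = [token for token, _ in pairs]
offsets = [span for _, span in pairs]

# A special token such as [CLS] has no source text, hence the (0, 0) placeholder.
tokens_with_cls = ["[CLS]"] + tokens
offsets_with_cls = [(0, 0)] + offsets

for token, (start, end) in zip(tokens_with_cls, offsets_with_cls):
    print(f"{token!r:10} -> ({start}, {end}) -> {text[start:end]!r}")
```

Slicing the original text with an offset pair gives back the characters that produced the token, which is what lets us translate a character-level answer position into a token-level one.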

In [14]:
# --- START CODE HERE (04)
# Tokenize the long example with the new requirement defined above.
tokenized_long_example = tokenizer(
    long_example["question"],
    long_example["context"],
    max_length=max_length,
    truncation="only_second",
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    stride=doc_stride
)
# --- END CODE HERE

print(f"The long example now has {len(tokenized_long_example['input_ids'])} inputs with length {[len(x) for x in tokenized_long_example['input_ids']]}.") # Should have 2 inputs with length [384, 157]
for sequence, mapping in zip(tokenized_long_example['input_ids'], tokenized_long_example["offset_mapping"]):
  print("\n")
  print(tokenizer.decode(sequence))
  print(sequence)
  print(mapping)

The long example now has 2 inputs with length [384, 157].


[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( within the edmund p. joyce center ), which reopened for the beginning of the 2009 – 2010 season. the team is coached by mike 

The mapping can be used to find the positions of the start and end tokens of our answer in a feature. To skip the question part, we can use the `sequence_ids` method of the tokenizer output, which tells us which tokens belong to the first sequence (the question) and which belong to the second sequence (the context, or part of the context).

For each token, it returns the sequence ID (0 for the question, 1 for the context), or None for special tokens.

In [15]:
sequence_ids = tokenized_long_example.sequence_ids()
print(sequence_ids)

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

Now, we can retrieve the answer from our features.

In [16]:
answer = long_example["answers"] # Retrieve the answer from the example
start_char = answer["answer_start"][0] # Retrieve the index of the start character of the answer
end_char = start_char + len(answer["text"][0]) # Retrieve the index of the end character of the answer

# Iterate over the features
for i in range(len(tokenized_long_example["input_ids"])):
  print(f"Looking for the answer `{answer['text'][0]}` to the question `{long_example['question']}` in feature {i+1}.")
  print(f"The feature contains the following decoded sequence:\n{tokenizer.decode(tokenized_long_example['input_ids'][i])}")
  
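  # Sequence IDs of the current feature (the earlier call only covers the first feature).
  sequence_ids = tokenized_long_example.sequence_ids(i)

  # Start token index of the current span in the text.
  token_start_index = 0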

  # --- START CODE HERE (05)
  # Find where the context sequence starts and store it in the variable token_start_index.
  while sequence_ids[token_start_index] != 1:
      token_start_index += 1
  # --- END CODE HERE

  # --- START CODE HERE (06)
  # Find where the context sequence ends and store it in the variable token_end_index.
  token_end_index = len(tokenized_long_example["input_ids"][i]) - 1
  while sequence_ids[token_end_index] != 1:
      token_end_index -= 1
  # --- END CODE HERE

  offsets = tokenized_long_example["offset_mapping"][i]
  # Check whether the answer lies entirely inside the span.
  if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):

      # --- START CODE HERE (07)
      # Find the start_position and end_position of the answer.
      # Move token_start_index and token_end_index to the two ends of the answer.
      while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
          token_start_index += 1
      start_position = token_start_index - 1
      while offsets[token_end_index][1] >= end_char:
          token_end_index -= 1
      end_position = token_end_index + 1
      # --- END CODE HERE

      print(f"Answer found by the tokenizer at the token positions: {start_position}, {end_position}")
      print(f"{tokenizer.decode(tokenized_long_example['input_ids'][i][start_position: end_position+1])}")
  else:
      print("The answer is not in this feature.")
  print("\n")



Looking for the answer `over 1,600` to the question `How many wins does the Notre Dame men's basketball team have?` in feature 1.
The feature contains the following decoded sequence:
[CLS] how many wins does the notre dame men's basketball team have? [SEP] the men's basketball team has over 1, 600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 ncaa tournaments. former player austin carr holds the record for most points scored in a single game of the tournament with 61. although the team has never won the ncaa tournament, they were named by the helms athletic foundation as national champions twice. the team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending ucla's record 88 - game winning streak in 1974. the team has beaten an additional eight number - one teams, and those nine wins rank second, to ucla's 10, all - time in wins against the top team. the team plays in newly renovated purcell pavilion ( wi
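
The search performed in the cell above can be isolated in a small function and checked on a toy feature. This is a sketch under simplifying assumptions: `find_answer_span` and `word_offsets` are hypothetical helpers, tokens are whitespace-separated words rather than sub-words, and the feature follows the `[CLS] question [SEP] context [SEP]` layout.

```python
import re


def word_offsets(text):
    """Character spans of whitespace-separated words (toy stand-in for a tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def find_answer_span(offsets, sequence_ids, start_char, end_char, context_id=1):
    """Return (start_token, end_token), or None if the answer is not fully in the feature."""
    # First and last token positions that belong to the context.
    ctx_start = sequence_ids.index(context_id)
    ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(context_id)

    # The answer must lie entirely inside this part of the context.
    if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
        return None

    # Walk forward past the answer start, then step back by one token.
    tok = ctx_start
    while tok <= ctx_end and offsets[tok][0] <= start_char:
        tok += 1
    start_token = tok - 1

    # Walk backward past the answer end, then step forward by one token.
    tok = ctx_end
    while tok >= ctx_start and offsets[tok][1] >= end_char:
        tok -= 1
    return start_token, tok + 1


question = "How many wins?"
context = "The team has over 1,600 wins."
answer = "over 1,600"
start_char = context.index(answer)
end_char = start_char + len(answer)

# Layout: [CLS] question [SEP] context [SEP]; special tokens get (0, 0) and None.
q_off, c_off = word_offsets(question), word_offsets(context)
offsets = [(0, 0)] + q_off + [(0, 0)] + c_off + [(0, 0)]
sequence_ids = [None] + [0] * len(q_off) + [None] + [1] * len(c_off) + [None]

span = find_answer_span(offsets, sequence_ids, start_char, end_char)
print(span)
```

When the answer falls outside the window, the function returns `None`; in the training features, that case is labeled with the index of the [CLS] token instead.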

### Tokenize the whole dataset

Now we can implement a function that will prepare the whole dataset following the above process.

In [17]:
def prepare_train_features(examples, tokenizer, max_length: int = 384, doc_stride: int = 128):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # For this notebook to work with any kind of models, we need to account for the special case where the model
    # expects padding on the left (in which case we switch the order of the question and the context)
    pad_on_right = tokenizer.padding_side == "right"

    # --- START CODE HERE (08)
    # Apply the tokenizer as before, but take care to set the order of the question and the context
    # according to the value of the boolean pad_on_right.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # --- END CODE HERE

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # Iterate over the offset mapping from the features.
    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # --- START CODE HERE (09)
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        # --- END CODE HERE
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # --- START CODE HERE (10)
            # Find where the context sequence starts and ends as before. Be careful about the pad_on_right boolean.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1
            # --- END CODE HERE

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                # --- START CODE HERE (11)
                # Label impossible answers with the index of the CLS token.
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
                # --- END CODE HERE
            else:
                # --- START CODE HERE (12)
                # Find the start_position and end_position of the answer as before.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)
                # --- END CODE HERE
            

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [18]:
features = prepare_train_features(train_dataset[:5], tokenizer)
len(features["input_ids"]), len(features["input_ids"][0]) # should return (5, 384)

(5, 384)
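
When a batch contains a long context, the number of features exceeds the number of examples, which is why `prepare_train_features` needs the feature-to-example mapping. Here is a minimal sketch of how such a mapping could be built, assuming the sliding-window behaviour described earlier; `count_windows` and `build_sample_mapping` are hypothetical helpers and the context lengths are made up.

```python
import math


def count_windows(n_tokens, window, overlap):
    """Number of overlapping windows needed to cover `n_tokens` context tokens."""
    if n_tokens <= window:
        return 1
    step = window - overlap
    return 1 + math.ceil((n_tokens - window) / step)


def build_sample_mapping(context_lengths, window, overlap):
    """For every feature, the index of the example it was cut from."""
    mapping = []
    for sample_idx, n_tokens in enumerate(context_lengths):
        mapping.extend([sample_idx] * count_windows(n_tokens, window, overlap))
    return mapping


# Three made-up examples; only the second context is too long for a single window.
mapping = build_sample_mapping([100, 379, 50], window=367, overlap=128)
print(mapping)
```

In this toy batch, the second example produces two features, so the answers of example 1 have to be looked up for both of them.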

Now, we can use the [`.map`](https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.Dataset.map) method from datasets to apply this tokenization process to our whole training dataset.

We will apply the same function to the validation dataset to evaluate our model during training.

In [19]:
num_indices_to_keep_data = round(0.01 * len(train_dataset)) # 1% of data to keep.
indices = np.random.choice(range(len(train_dataset)), num_indices_to_keep_data, replace=False)
print(len(indices))
print(type(train_dataset))
subsample_train_dataset = train_dataset[indices]

# --- START CODE HERE (13)
# Apply prepare_train_features to train_dataset and validation_dataset. Provide the tokenizer to the function and batch the data.
# Finally, remove the original columns from the datasets.
tokenized_train_dataset = train_dataset.map(prepare_train_features, fn_kwargs={"tokenizer": tokenizer}, batched=True, remove_columns=train_dataset.column_names)
tokenized_validation_dataset = validation_dataset.map(prepare_train_features, fn_kwargs={"tokenizer": tokenizer}, batched=True, remove_columns=validation_dataset.column_names)
tokenized_train_dataset[0]
# --- END CODE HERE



9
<class 'datasets.arrow_dataset.Dataset'>




{'input_ids': [101,
  2000,
  3183,
  2106,
  1996,
  6261,
  2984,
  9382,
  3711,
  1999,
  8517,
  1999,
  10223,
  26371,
  2605,
  1029,
  102,
  6549,
  2135,
  1010,
  1996,
  2082,
  2038,
  1037,
  3234,
  2839,
  1012,
  10234,
  1996,
  2364,
  2311,
  1005,
  1055,
  2751,
  8514,
  2003,
  1037,
  3585,
  6231,
  1997,
  1996,
  6261,
  2984,
  1012,
  3202,
  1999,
  2392,
  1997,
  1996,
  2364,
  2311,
  1998,
  5307,
  2009,
  1010,
  2003,
  1037,
  6967,
  6231,
  1997,
  4828,
  2007,
  2608,
  2039,
  14995,
  6924,
  2007,
  1996,
  5722,
  1000,
  2310,
  3490,
  2618,
  4748,
  2033,
  18168,
  5267,
  1000,
  1012,
  2279,
  2000,
  1996,
  2364,
  2311,
  2003,
  1996,
  13546,
  1997,
  1996,
  6730,
  2540,
  1012,
  3202,
  2369,
  1996,
  13546,
  2003,
  1996,
  24665,
  23052,
  1010,
  1037,
  14042,
  2173,
  1997,
  7083,
  1998,
  9185,
  1012,
  2009,
  2003,
  1037,
  15059,
  1997,
  1996,
  24665,
  23052,
  2012,
  10223,
  26371,
  1010,
  2605

## Fine-tune the model

Now that we have transformed the dataset to feed the model, we will instantiate our model and train it.


First, we will retrieve the model using the [auto model for question answering](https://huggingface.co/docs/transformers/model_doc/auto) class from HuggingFace.

In [20]:
# --- START CODE HERE (14)
# Import the correct auto model class and instantiate the model.
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)
# --- END CODE HERE

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.weight', 'qa_outputs.bias']
You should probably TRAIN this mode

To train a model, HuggingFace provides a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) class, which we instantiate with the following parameters:
- the model
- the arguments to configure the trainer
- the train dataset
- the eval dataset
- the default [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator)
- the tokenizer

First, we will instantiate a [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) object with the following parameters:
- evaluation at each epoch
- learning rate of value 2e-5
- batch size of 16 to train
- batch size of 16 to evaluate
- train for 5 epochs
- weight decay of 0.01

In [21]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

# --- START CODE HERE (14)
# Instantiate the Training Arguments.
args = TrainingArguments(
    f"{model_name}-finetuned-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    push_to_hub=False,
)
# --- END CODE HERE

Now that we have the training arguments and the model, we can instantiate the trainer.

In [22]:
# --- START CODE HERE (15)
# Import the trainer and the data collator.
from transformers import Trainer, default_data_collator
# --- END CODE HERE

# --- START CODE HERE (16)
# Instantiate the Trainer.
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
# --- END CODE HERE

Finally, we can launch the training, which should take around 2 to 3 minutes.

In [23]:
trainer.train()

***** Running training *****
  Num examples = 908
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 285


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 3.42934 |
| 2 | No log | 3.054243 |
| 3 | No log | 2.945534 |
| 4 | No log | 2.785733 |
| 5 | No log | 2.726093 |


***** Running Evaluation *****
  Num examples = 106
  Batch size = 16
***** Running Evaluation *****
  Num examples = 106
  Batch size = 16
***** Running Evaluation *****
  Num examples = 106
  Batch size = 16
***** Running Evaluation *****
  Num examples = 106
  Batch size = 16
***** Running Evaluation *****
  Num examples = 106
  Batch size = 16


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=285, training_loss=3.307274748149671, metrics={'train_runtime': 166.8391, 'train_samples_per_second': 27.212, 'train_steps_per_second': 1.708, 'total_flos': 444873805608960.0, 'train_loss': 3.307274748149671, 'epoch': 5.0})
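
As a sanity check on the log above, the number of optimization steps can be recomputed from the dataset size, the batch size and the number of epochs. This is a minimal sketch that assumes a single device and no gradient accumulation, as reported in the log; the variable names are ours.

```python
import math

# Values copied from the training log above.
num_examples = 908
batch_size = 16
num_epochs = 5
train_runtime = 166.8391  # seconds

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = total_steps / train_runtime

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
print(f"{steps_per_second:.3f} steps per second")
```

The `No log` entries in the training-loss column most likely mean that the run finished before the first logging step was reached (assuming the default logging interval of the `Trainer` was left unchanged).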

### Evaluate our model

After training our model, we can start evaluating it.

To do so, we need to retrieve the predictions of our model. The following code gives us the keys returned by our model for a validation batch.

In [24]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

Our model predicts two vectors of unnormalized scores (logits) over the tokens, which become probability distributions once a softmax is applied:
- the start token scores, called the `start_logits`.
- the end token scores, called the `end_logits`.
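
To make the link between logits and probabilities concrete, here is a small pure-Python sketch. The `softmax` helper and the toy values are illustrative assumptions, not part of the TP code. It shows that the softmax turns logits into a distribution summing to 1 without changing the position of the maximum, which is why we can work directly on the logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy start logits for a 4-token sequence (hypothetical values).
toy_start_logits = [1.0, 3.0, 0.5, 2.0]
toy_start_probs = softmax(toy_start_logits)

argmax_logits = max(range(len(toy_start_logits)), key=toy_start_logits.__getitem__)
argmax_probs = max(range(len(toy_start_probs)), key=toy_start_probs.__getitem__)

print(sum(toy_start_probs))         # close to 1.0
print(argmax_logits, argmax_probs)  # same index for both
```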

In [25]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

To output the actual hard prediction, we take the argmax of each vector of logits. We can observe several issues:
- sometimes the predicted end token comes before the predicted start token, which is impossible.
- the predicted tokens could be inside the question.
- if our context is too long, the tokenizer splits it into several features, so we will have several predictions for a single example.

Therefore, we need a procedure to select the best predictions.

In [26]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 46,  46, 161, 161, 167, 162,  72,  42, 162,  41,  73, 159,  80, 163,
         170,  46], device='cuda:0'),
 tensor([ 47,  47,  44,  44,  50,  45,  44,  43,  45,  42,  13,  42,  46,  46,
         158,  47], device='cuda:0'))

In the following cell, we will make our logits follow this pipeline:
- keep the 20 best candidate indices for `start_logits` and for `end_logits` (the indices with the highest scores).
- pair each candidate start index with each candidate end index, keeping a pair only if the end index is not before the start index.

The idea is that every candidate start and end token should be taken into account, not only the argmax pair, especially when the top-scoring pair is not valid.

In [27]:
import numpy as np

n_best_size = 20
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# --- START CODE HERE (17)
# Only keep the best propositions index.
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
# --- END CODE HERE
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # --- START CODE HERE (18)
        # Only keep the valid pairs.
        if start_index <= end_index:
        # --- END CODE HERE
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # Later we will find a way to get back the original substring corresponding to the answer in the context
                }
            )
print(f"We kept only {len(valid_answers)} valid pairs from {len(start_indexes) * len(end_indexes)} best pair propositions from {start_logits.size * end_logits.size} possible pairs.")

We kept only 218 valid pairs from 400 best pair propositions from 147456 possible pairs.
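
The slice `[-1 : -n_best_size - 1 : -1]` used above can look cryptic. Here is a pure-Python sketch of the same idea on toy values: sort the indices by increasing score, then read the last `n_best` of them backwards. The `sorted(range(...))` call stands in for `np.argsort`, and the names `toy_logits` and `n_best` are illustrative.

```python
toy_logits = [0.1, 2.5, -1.0, 3.2, 0.7]
n_best = 3

# Indices sorted by increasing score (what an argsort returns).
ascending = sorted(range(len(toy_logits)), key=toy_logits.__getitem__)

# Start at the last element, stop before position -n_best - 1, step backwards.
best_indexes = ascending[-1 : -n_best - 1 : -1]

print(ascending)     # [2, 0, 4, 1, 3]
print(best_indexes)  # [3, 1, 4]
```

The result lists the indices of the `n_best` highest scores, best first.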


To post-process all validation features, we need to add two things to our validation pipeline:
- verify that our pairs are inside the context and not the question.
- retrieve the actual answer text from the context instead of the token indices.

We need to tokenize all our validation data. We will implement a processing pipeline slightly different from the `prepare_train_features` function that we implemented before.

In [28]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # For this notebook to work with any kind of models, we need to account for the special case where the model
    # expects padding on the left (in which case we switch the order of the question and the context)
    pad_on_right = tokenizer.padding_side == "right"

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    # --- START CODE HERE (19)
    # Apply the tokenizer as for prepare_train_features
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
    # --- END CODE HERE

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # --- START CODE HERE (20)
        # Grab the sequence ids of that feature (they tell which tokens belong to the question and which to the context).
        sequence_ids = tokenized_examples.sequence_ids(i)
        # --- END CODE HERE

        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # --- START CODE HERE (21)
        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]
        # --- END CODE HERE


    return tokenized_examples
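
To see what the offset masking in `prepare_validation_features` buys us, here is a toy sketch. It uses a hypothetical whitespace tokenizer (`toy_offsets`), not the HuggingFace tokenizer, and ignores special tokens. Offsets of question tokens are replaced by `None`, and the offsets of context tokens let us slice the answer text directly out of the context string.

```python
import re

def toy_offsets(text):
    """(start, end) character offsets of whitespace-separated tokens (toy tokenizer)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

question = "Who won?"
context = "The Denver Broncos won the game."

# Sequence id 0 marks question tokens and 1 marks context tokens.
question_offsets = toy_offsets(question)
context_offsets = toy_offsets(context)
sequence_ids = [0] * len(question_offsets) + [1] * len(context_offsets)
offsets = question_offsets + context_offsets

# Same masking rule as in prepare_validation_features.
context_index = 1
masked_offsets = [o if s == context_index else None for s, o in zip(sequence_ids, offsets)]

# Suppose the model predicts tokens 3 ("Denver") and 4 ("Broncos").
start_token, end_token = 3, 4
start_char = masked_offsets[start_token][0]
end_char = masked_offsets[end_token][1]
answer = context[start_char:end_char]

print(masked_offsets[:3])  # [None, None, (0, 3)]
print(answer)              # Denver Broncos
```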

Now, we can apply this function to our dataset using the [`.map`](https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.Dataset.map) operator from datasets to apply this tokenization process on our whole validation dataset. 

In [29]:
# --- START CODE HERE (22)
# Apply prepare_validation_features to the validation_dataset in batched mode (the function uses the global tokenizer).
# Finally, remove the original columns from the dataset.
validation_features = validation_dataset.map(
    prepare_validation_features,
    batched=True,
    remove_columns=validation_dataset.column_names
)
# --- END CODE HERE
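
Because long contexts are split into overlapping features, one example can produce several rows in `validation_features`. The sketch below illustrates the overlap on toy data. It assumes that the stride is the number of tokens shared by two consecutive windows. The helper `sliding_windows` is hypothetical and ignores the question and special tokens that the real tokenizer also fits into each feature.

```python
def sliding_windows(tokens, max_len, stride):
    """Split tokens into windows of at most max_len tokens, overlapping by `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows

toy_tokens = list(range(10))
windows = sliding_windows(toy_tokens, max_len=4, stride=2)
for w in windows:
    print(w)
```

Each window shares its first two tokens with the end of the previous one, so an answer that straddles a window boundary is still fully contained in at least one feature, provided it is not longer than the overlap.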



With the `validation_features`, we will make predictions thanks to the [trainer](https://huggingface.co/docs/transformers/main_classes/trainer).

In [30]:
raw_predictions = trainer.predict(validation_features)

The following columns in the test set don't have a corresponding argument in `DistilBertForQuestionAnswering.forward` and have been ignored: offset_mapping, example_id. If offset_mapping, example_id are not expected by `DistilBertForQuestionAnswering.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 106
  Batch size = 16


The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [31]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

To refine our validation pipeline and eliminate irrelevant answers we will filter:
- the answers containing `None` in the offset mappings as it corresponds to a part of the question
- the answers longer than the hyper-parameter `max_answer_length`
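
The two filters can be summarized as a single predicate. The helper `is_valid_span` below is a hypothetical name used only for illustration. It is a sketch of the rules listed above, run on a toy offset mapping whose first three entries play the role of question tokens.

```python
def is_valid_span(start_index, end_index, offset_mapping, max_answer_length):
    """Return True if (start_index, end_index) is an acceptable answer span."""
    # Indices must exist in the feature.
    if start_index >= len(offset_mapping) or end_index >= len(offset_mapping):
        return False
    # Both ends must lie in the context (question offsets were set to None).
    if offset_mapping[start_index] is None or offset_mapping[end_index] is None:
        return False
    # The span must not be reversed and must not exceed the maximum length.
    if end_index < start_index:
        return False
    return end_index - start_index + 1 <= max_answer_length

toy_offset_mapping = [None, None, None, (0, 3), (4, 10), (11, 18), (19, 22)]

print(is_valid_span(4, 5, toy_offset_mapping, max_answer_length=30))  # True
print(is_valid_span(1, 4, toy_offset_mapping, max_answer_length=30))  # False: starts in the question
print(is_valid_span(3, 6, toy_offset_mapping, max_answer_length=3))   # False: too long
```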

In [32]:
max_answer_length = 30
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]

# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index.
context = validation_dataset[0]["context"]

# --- START CODE HERE (23)
# Only keep the best propositions index as before.
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
# --- END CODE HERE
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # --- START CODE HERE (24)
        # Filter out-of-scope answers: indices out of bounds or in the question.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
        # --- END CODE HERE
            continue

        # --- START CODE HERE (25)
        # Consider answers that are valid and shorter than max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
        # --- END CODE HERE
            continue

        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 5.36228, 'text': 'Denver Broncos'},
 {'score': 4.343049,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 3.748499,
  'text': 'Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016'},
 {'score': 3.233025, 'text': 'Super Bowl L'},
 {'score': 3.0383232,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 3.0111775, 'text': 'Broncos'},
 {'score': 3.0016599,
  'text': '2015 season. The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 3.0003338,
  'text': 'National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 2.9524784, 'text': '2016'},
 {'score': 2.93812, 'text': 'February 7, 2016'},
 {'score': 2.9262085,
  'text': 'National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was playe

We can compare to the actual ground-truth answer:

In [33]:
validation_dataset[0]["answers"]

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

As mentioned in the code above, this was easy for the first feature because we knew it came from the first example.

For the other features, we need a mapping between examples and their corresponding features. Since one example can give several features, we will gather all the answers from all the features generated by a given example, then pick the best one. The following code builds a map from each example index to its corresponding feature indices:

In [34]:
import collections

features = validation_features

example_id_to_index = {k: i for i, k in enumerate(validation_dataset["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
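
To visualize what this map contains, here is the same construction on toy identifiers (the ids `"a"`, `"b"`, `"c"` are made up). Example `"a"` was split into two features and example `"c"` into three.

```python
import collections

# Toy data: three examples, six features.
toy_example_ids = ["a", "b", "c"]
toy_feature_example_ids = ["a", "a", "b", "c", "c", "c"]

toy_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}
toy_features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(toy_feature_example_ids):
    toy_features_per_example[toy_id_to_index[example_id]].append(i)

print(dict(toy_features_per_example))  # {0: [0, 1], 1: [2], 2: [3, 4, 5]}
```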

All combined together, this gives us this post-processing function:

In [35]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Null-answer score, only relevant for SQuAD v2 (not used in this TP).
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # --- START CODE HERE (26)
            # Only keep the best propositions index as before.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            # --- END CODE HERE
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # --- START CODE HERE (27)
                    # Filter out-of-scope answers: indices out of bounds or in the question as before.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                    # --- END CODE HERE
                        continue
                    # --- START CODE HERE (28)
                    # Consider valid answers and the ones that have length shorter than max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                    # --- END CODE HERE
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions

Now, we can apply the post-processing function to our predictions:

In [36]:
# --- START CODE HERE (29)
# Apply our postprocess_qa_predictions to our predictions.
final_predictions = postprocess_qa_predictions(validation_dataset, validation_features, raw_predictions.predictions)
# --- END CODE HERE

Post-processing 106 example predictions split into 106 features.



Then we can load the metric from the datasets library.

In [37]:
from datasets import load_metric

metric = load_metric("squad")



In [38]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in validation_dataset]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 39.62264150943396, 'f1': 42.6284067085954}
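
To give an intuition of what these two numbers measure, here is a simplified sketch of both scores for a single prediction. It is not the official SQuAD implementation, which additionally strips punctuation and articles and takes the best score over all reference answers. The helper names are ours.

```python
import collections

def exact_match(prediction, truth):
    """1 if the two strings are equal after lowercasing and trimming, else 0."""
    return int(prediction.strip().lower() == truth.strip().lower())

def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    truth_tokens = truth.lower().split()
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

em_exact = exact_match("Denver Broncos", "Denver Broncos")
em_partial = exact_match("champion Denver Broncos", "Denver Broncos")
f1_partial = token_f1("champion Denver Broncos", "Denver Broncos")

print(em_exact, em_partial, round(f1_partial, 3))  # 1 0 0.8
```

Exact match is all-or-nothing, while F1 gives partial credit to predictions that overlap the reference. The reported `exact_match` of about 39.62 is consistent with 42 exact matches out of the 106 validation examples.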

You can improve this result by using a larger dataset or training longer!

## What to do now?

If you want, you can look for datasets in your own language and see whether distilbert performs correctly. Generally, a model that was trained on the same language as your dataset will work better than a general model trained on several languages or, obviously, on a totally different language.

For example, for French, CamemBERT is a BERT-like model trained on French data that obtains very good performance on French NLP tasks.

You can take a look at other HuggingFace tutorials that cover other tasks to see how the tokenization process and the model differ for such tasks:
- [translation](https://huggingface.co/docs/transformers/tasks/translation)
- [summarization](https://huggingface.co/docs/transformers/tasks/summarization)
- ...

