# TP7: Fine-tuning BERT on Q&A tasks

**Authors:** 
- julien.denize@centralesupelec.fr
- tom.dupuis@centralesupelec.fr


If you have questions or suggestions, contact us and we will gladly answer and take into account your remarks.

For this TP, you need a basic understanding of PyTorch. An introduction is available [here](https://pytorch.org/tutorials/beginner/basics/intro.html).



## Objective

In this TP, we will tackle the Question Answering (Q&A) task by fine-tuning a pretrained DistilBERT model.

This TP is built on the [HuggingFace Q&A tutorial](https://huggingface.co/docs/transformers/tasks/question_answering), so it relies on the HuggingFace libraries.

Question answering tasks return an answer given a question. There are two common forms of question answering:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.

In this TP, we will show you how to fine-tune [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) on the SQuAD dataset for extractive question answering.

DistilBERT is a smaller transformer architecture than BERT. It was trained by [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation) of BERT to provide a lightweight, faster NLP model that still reaches high performance.





## Your task

Fill in the missing parts of the code (the parts between `# --- START CODE HERE` and `# --- END CODE HERE`).

In [None]:
import torch
import numpy as np
import random

# Seed everything
seed=42
torch.manual_seed(seed)
random.seed(seed)
np.random.seed(seed)

## Install the required libraries

We need to install the following pip packages:
- [datasets](https://pypi.org/project/datasets/): to load datasets available on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). 
- [transformers](https://pypi.org/project/transformers/): to load thousands of pretrained models that perform tasks on different modalities such as text, vision, and audio in PyTorch, TensorFlow or JAX.

In [None]:
! pip install datasets transformers

## Load the dataset

We will use the [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/).

>Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

We will instantiate the train and validation splits via the [load_dataset](https://huggingface.co/docs/datasets/v1.11.0/splits.html) function.

There are 87,599 elements in the train split and 10,570 elements in the validation split. We will only load 1% of each split to keep the training time short.

In [None]:
from datasets import load_dataset

# --- START CODE HERE (01)
# Load the SQUAD dataset with 1% of each different splits.
dataset =

train_dataset =
validation_dataset =
# --- END CODE HERE
dataset

Now we can access the samples in the datasets.

Each sample contains the following information:
- the id of the sample (one id per question).
- the title of the Wikipedia article.
- the context that contains the answer to the question.
- the question.
- the answer, along with the character index at which it starts in the context.

In [None]:
train_dataset[0]

HuggingFace provides a nice function to better show what the data looks like.

In [None]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(train_dataset, 3)

## Preprocess the training data



Now that we have access to the data, we need to preprocess it to feed it to our neural networks.

In NLP, this step consists of tokenizing the data, meaning that the text is split into tokens that are converted into unique IDs.

Our model requires the following as input:
- A first sequence that is the question.
- A separator token [SEP].
- A second sequence that may contain the answer.

The label is given by the start and end indices of the tokens that compose the answer.

![](https://miro.medium.com/max/1400/1*QhIXsDBEnANLXMA0yONxxA.png)
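As a rough illustration of this layout, the sketch below builds an input from whitespace-split words instead of real subword tokens. This is a simplification: the helper `build_input` and the example strings are hypothetical and are not part of the TP. It shows how the position of an answer inside the context shifts once the `[CLS]` token, the question and the first `[SEP]` token are placed in front of it.

```python
def build_input(question_tokens, context_tokens):
    """Concatenate question and context the way a BERT-style model expects them."""
    return ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]


# Toy "tokens": plain whitespace-separated words (a real tokenizer uses subwords).
question = "who wrote hamlet ?".split()
context = "hamlet was written by william shakespeare around 1600".split()

tokens = build_input(question, context)

# The answer "william shakespeare" covers context words 4 and 5 (0-based, inclusive).
answer_start_in_context, answer_end_in_context = 4, 5

# The context starts after [CLS], the question tokens and the first [SEP].
context_offset = 1 + len(question) + 1
start_position = context_offset + answer_start_in_context
end_position = context_offset + answer_end_in_context

print(start_position, end_position)
print(tokens[start_position:end_position + 1])
```

The pair (`start_position`, `end_position`) is the kind of label the model is trained to predict.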

### Instantiate the tokenizer

We will use the [AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto) class provided by HuggingFace, as it ensures that we use the same tokenizer that was used to train the DistilBERT model. For that, we need the right model checkpoint. The list of checkpoints is available [here](https://huggingface.co/models); retrieve the base DistilBERT checkpoint that ignores case (ENGLISH = english).

In [None]:
# --- START CODE HERE (02)
# Import the auto tokenizer class and instantiate it

    
model_checkpoint =
tokenizer =
# --- END CODE HERE

You can try the tokenizer on custom strings or on our data. The tokenizer accepts a pair of strings as input and returns a single tokenized output: it begins with the [CLS] token, and the two sequences are separated by the [SEP] token.

In [None]:
custom_string = "Hi, I would love to test the tokenizer."
custom_string_2 = "Sure, go ahead and verify that english = ENGLISH = ENglISh."
tokenized_custom_strings = tokenizer(custom_string, custom_string_2)
tokenized_custom_strings

You can now decode the sequence and retrieve the initial string together with the special tokens. You can also verify that the case is no longer present in the string, as our model does not distinguish between lower and upper case.

In [None]:
tokenizer.decode(tokenized_custom_strings["input_ids"])

Here we can apply the tokenizer on one question from our training set.

In [None]:
print(train_dataset[0]["question"])
print(tokenizer(train_dataset[0]["question"]))
print(tokenizer.decode(tokenizer(train_dataset[0]["question"])["input_ids"]))

### Deal with long contexts

Our model can only take a limited number of tokens per input. Our input is composed of both the question and the context, separated by the special token [SEP].

However, in our dataset, some samples may have a question plus context that is longer than this maximum number of tokens. We cannot simply truncate the input, as is done for some other tasks, because the answer to the question might be located in the part that is cut.

Instead, a long context will be split into several input features, each shorter than the maximum length of the model. To handle the case where the answer lies at the splitting point, we will make the input features overlap.

In [None]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
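The tokenizer handles this splitting internally. The following is only a toy sketch of the idea on a plain list of integers, where `split_with_overlap` is a hypothetical helper and `overlap` plays the role of `doc_stride`. It ignores the question and the special tokens that the real tokenizer repeats in every feature.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items.

    Consecutive chunks share `overlap` items, so a span that is cut at the
    end of one chunk can still appear in full in the next one.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


toy_tokens = list(range(10))
chunks = split_with_overlap(toy_tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With a window of 4 and an overlap of 2, each chunk repeats the last two items of the previous one.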

Below is the code to find the first example with a long input:

In [None]:
for i, example in enumerate(train_dataset):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > max_length:
        long_context_idx = i
        break
long_example = train_dataset[long_context_idx]
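# Pass the question and the context as a pair (not concatenated) so that the length matches the check above.
long_example, len(tokenizer(long_example["question"], long_example["context"])["input_ids"])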

To split the input into several features, we need to configure our tokenizer correctly and feed it inputs that follow these requirements:
- Pass the question and the context to the tokenizer as a pair. It will automatically add the [SEP] token between the two.
- Force the tokenizer to split the input if it is too long:
  - only the second part (the context) can be truncated, so that the question is shared by all new inputs.
  - allow consecutive features to overlap.

All these requirements can be met with the [tokenizer utilities](https://huggingface.co/docs/transformers/v4.23.1/en/internal/tokenization_utils) by setting the correct parameters.

In [None]:
# --- START CODE HERE (03)
# Tokenize the long example with the requirements defined above.
tokenized_long_example = 
# --- END CODE HERE

print(f"The long example now has {len(tokenized_long_example['input_ids'])} inputs with length {[len(x) for x in tokenized_long_example['input_ids']]}.") # Should have 2 inputs with length [384, 157]
for sequence in tokenized_long_example['input_ids']:
  print(tokenizer.decode(sequence))

The problem with the above solution is that we lose track of where the answer is located: we need to know its position in each feature.

The model requires the start and end positions of the answer in the tokens, so we also need to map parts of the original context to tokens.

For each token of a feature, we need the corresponding start and end characters in the original text, in the format (`start_char`, `end_char`). The first token (`[CLS]`) has (0, 0) because it is a special added token that was not present in the original sentence.

This can be done using the tokenizer utilities.
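To make the (`start_char`, `end_char`) format concrete, here is a minimal sketch that computes offsets for whitespace-separated words. It is not how the HuggingFace tokenizer works internally (it produces subword tokens), and `whitespace_offsets` is a hypothetical helper used only for illustration.

```python
import re


def whitespace_offsets(text):
    """Return (token, (start_char, end_char)) pairs for whitespace-separated words."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


text = "Paris is the capital of France"
pairs = whitespace_offsets(text)

for token, (start_char, end_char) in pairs:
    # Slicing the original text with the offsets gives back the token.
    print(token, (start_char, end_char), repr(text[start_char:end_char]))
```

As with the real offset mapping, `end_char` is exclusive, so `text[start_char:end_char]` recovers the token text.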

In [None]:
# --- START CODE HERE (04)
# Tokenize the long example with the new requirement defined above.
tokenized_long_example = 
# --- END CODE HERE

print(f"The long example now has {len(tokenized_long_example['input_ids'])} inputs with length {[len(x) for x in tokenized_long_example['input_ids']]}.") # Should have 2 inputs with length [384, 157]
for sequence, mapping in zip(tokenized_long_example['input_ids'], tokenized_long_example["offset_mapping"]):
  print("\n")
  print(tokenizer.decode(sequence))
  print(sequence)
  print(mapping)

The mapping can be used to find the positions of the start and end tokens of our answer in a feature. To skip the question part, we can use the `sequence_ids` method provided by the tokenizer output, which tells us which tokens belong to the first sequence (the question) and which belong to the second sequence (the context, or part of the context).

For each token, it returns the sequence ID (0 for the question, 1 for the context), or None for special tokens.

In [None]:
sequence_ids = tokenized_long_example.sequence_ids()
print(sequence_ids)
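To get a feel for this output, the sketch below uses a hand-written list in the same format (the values are made up, not produced by the tokenizer) and locates the first and last context tokens with a list comprehension. This is only one possible way to read the list.

```python
# Hand-written stand-in for the output of sequence_ids():
# None = special token, 0 = question token, 1 = context token.
toy_sequence_ids = [None, 0, 0, 0, None, 1, 1, 1, 1, None]

# Indices of all tokens that belong to the context.
context_positions = [i for i, seq_id in enumerate(toy_sequence_ids) if seq_id == 1]

context_start = context_positions[0]
context_end = context_positions[-1]
print(context_start, context_end)
```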

Now, we can retrieve the answer from our features.

In [None]:
answer = long_example["answers"] # Retrieve the answer from the example
start_char = answer["answer_start"][0] # Retrieve the index of the start character of the answer
end_char = start_char + len(answer["text"][0]) # Retrieve the index of the end character of the answer

# Iterate over the features
for i in range(len(tokenized_long_example["input_ids"])):
  print(f"Looking for the answer `{answer['text'][0]}` to the question `{long_example['question']}` in feature {i+1}.")
  print(f"The feature contains the following decoded sequence:\n{tokenizer.decode(tokenized_long_example['input_ids'][i])}")
  
  # Start token index of the current span in the text.
  token_start_index = 0

  # --- START CODE HERE (05)
  # Find where the context sequence starts and store it in the variable token_start_index.
  while :

  # --- END CODE HERE

  # --- START CODE HERE (06)
  # Find where the context sequence ends and store it in the variable token_end_index.
  token_end_index =
  while:

  # --- END CODE HERE

  offsets = tokenized_long_example["offset_mapping"][i]
  # Detect if the answer is out of the span.
  if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):

      # --- START CODE HERE (07)
      # Find where are the start_position and end_position of the answer.
      # Move the token_start_index and token_end_index to the two ends of the answer.
      while:
          token_start_index += 1
      start_position = token_start_index - 1
      while :
          token_end_index -= 1
      end_position = token_end_index + 1
      # --- END CODE HERE

      print(f"Answer found by the tokenizer at the token positions: {start_position}, {end_position}")
      print(f"{tokenizer.decode(tokenized_long_example['input_ids'][i][start_position: end_position+1])}")
  else:
      print("The answer is not in this feature.")
  print("\n")



### Tokenize the whole dataset

Now we can implement a function that will prepare the whole dataset following the above process.

In [None]:
def prepare_train_features(examples, tokenizer, max_length: int = 384, doc_stride: int = 128):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # For this notebook to work with any kind of models, we need to account for the special case where the model
    # expects padding on the left (in which case we switch the order of the question and the context)
    pad_on_right = tokenizer.padding_side == "right"

    # --- START CODE HERE (08)
    # Apply the tokenizer as before, but be careful to set the order of the question and the context
    # correctly according to the value of the boolean pad_on_right.
    tokenized_examples =
    # --- END CODE HERE

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    # Iterate over the offset mapping from the features.
    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # --- START CODE HERE (09)
        # If no answers are given, set the cls_index as answer.
        if :

        # --- END CODE HERE
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            token_start_index = 0
           
            # --- START CODE HERE (10)
            # Find where the context sequence starts and ends as before. Be careful about the pad_on_right boolean.
            while :
                token_start_index += 1

            token_end_index =
            while :
                token_end_index -= 1
            # --- END CODE HERE

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                # --- START CODE HERE (11)
                # Label impossible answers with the index of the CLS token.
                tokenized_examples["start_positions"].
                tokenized_examples["end_positions"].
                # --- END CODE HERE
            else:
                # --- START CODE HERE (12)
                # Find where are the start_position and end_position of the answer as before.
                while :
                    token_start_index += 1
                tokenized_examples["start_positions"].
                while :
                    token_end_index -= 1
                tokenized_examples["end_positions"].
                # --- END CODE HERE
            

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key:

In [None]:
features = prepare_train_features(train_dataset[:5], tokenizer)
len(features["input_ids"]), len(features["input_ids"][0]) # should return (5, 384)

Now, we can use the [`.map`](https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.Dataset.map) operator from datasets to run this tokenization process on our whole training dataset.

We will apply the same function to the validation dataset to evaluate our model during training.

In [None]:
num_indices_to_keep_data = round(0.01 * len(train_dataset)) # 1% of data to keep.
indices = np.random.choice(range(len(train_dataset)), num_indices_to_keep_data, replace=False)
print(len(indices))
print(type(train_dataset))
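# select() returns a Dataset, whereas indexing with a list of indices returns a plain dict of columns.
subsample_train_dataset = train_dataset.select(indices)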

# --- START CODE HERE (13)
# Apply the prepare_train_features to the train_dataset and validation_dataset. Provide the tokenizer to the function and batch the data.
# Finally, remove the column names from the dataset.
tokenized_train_dataset = 
tokenized_validation_dataset = 
# --- END CODE HERE
tokenized_train_dataset[0]

## Fine-tune the model

Now that we have transformed the dataset to feed the model, we will instantiate our model and train it.


First, we will retrieve the model thanks to the [auto model for question answering](https://huggingface.co/docs/transformers/model_doc/auto) from HuggingFace.

In [None]:
# --- START CODE HERE (14)
# Import the correct auto model class and instantiate the model.

model = 
# --- END CODE HERE

To train a model with HuggingFace, we instantiate a [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) that takes the following parameters:
- the model
- the arguments to configure the trainer
- the train dataset
- the eval dataset
- the default [data collator](https://huggingface.co/docs/transformers/main_classes/data_collator)
- the tokenizer

First, we will instantiate a [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) object with the following parameters:
- evaluation at each epoch
- learning rate of value 2e-5
- batch size of 16 to train
- batch size of 16 to evaluate
- train for 5 epochs
- weight decay of 0.01

In [None]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

# --- START CODE HERE (14)
# Instantiate the Training Arguments.
args = 
# --- END CODE HERE

Now that we have the training arguments and the model, we can instantiate the trainer.

In [None]:
# --- START CODE HERE (15)
# Import the trainer and the data collator.

# --- END CODE HERE

# --- START CODE HERE (16)
# Instantiate the Trainer.
trainer = 
# --- END CODE HERE

Finally, we can launch the training, which should take around 2 minutes.

In [None]:
trainer.train()

### Evaluate our model

After training our model, we can start evaluating it.

For that, we need to retrieve the predictions of our model. The following code shows the keys returned by our model for a validation batch.

In [None]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

For each token, our model predicts two scores. These are unnormalized logits, which a softmax would turn into probability distributions over the tokens:
- the start token scores, called `start_logits`.
- the end token scores, called `end_logits`.

In [None]:
output.start_logits.shape, output.end_logits.shape
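Strictly speaking, logits are unnormalized scores, and a softmax turns them into a probability distribution. The sketch below shows this on a made-up vector of four scores, written in plain Python so that it does not depend on the trained model. The values and the helper name `softmax` are illustrative only.

```python
import math


def softmax(logits):
    """Convert a list of unnormalized scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_start_logits = [2.0, -1.0, 0.5, 4.0]
toy_probs = softmax(toy_start_logits)
print(toy_probs)
print(sum(toy_probs))
```

The softmax preserves the ordering of the scores, so taking the argmax of the logits or of the probabilities selects the same token.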

To output the actual hard prediction, we take the argmax of each set of logits. We can observe several issues:
- sometimes the predicted end token is before the predicted start token, which is impossible.
- the predicted token could be inside the question.
- if our context is too long, it is split into several features and we get one prediction per feature.

Therefore, we need a procedure to select the best predictions.

In [None]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)
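The first issue is easy to reproduce on made-up numbers. In the following sketch (toy values, not model outputs), the start and end scores are maximized independently and the resulting pair is not a valid span.

```python
# Made-up scores for a 5-token input.
toy_start_logits = [0.1, 0.3, 2.5, 0.2, 1.9]
toy_end_logits = [0.2, 3.1, 0.4, 2.8, 0.1]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


best_start = argmax(toy_start_logits)
best_end = argmax(toy_end_logits)

# A span is only valid if it does not end before it starts.
is_valid_span = best_start <= best_end
print(best_start, best_end, is_valid_span)
```

Here the best start is token 2 and the best end is token 1, so the top-scoring pair has to be discarded and lower-ranked candidates must be considered.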

In the following cell, we will process our logits with this pipeline:
- keep the 20 best candidate indices for each of `start_logits` and `end_logits` (the indices with the highest scores).
- pair each candidate start index with each candidate end index, keeping a pair only if the end index is at or after the start index.

The idea is that every candidate for both the start and the end should be taken into account, not only the top-scoring pair, especially when that pair is not a valid span.

In [None]:
import numpy as np

n_best_size = 20
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# --- START CODE HERE (17)
# Only keep the best propositions index.
start_indexes = 
end_indexes = 
# --- END CODE HERE
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # --- START CODE HERE (18)
        # Only keep the valid pairs.
        if :
        # --- END CODE HERE
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # Later we will find a way to get back the original substring corresponding to the answer in the context
                }
            )
print(f"We kept only {len(valid_answers)} valid pairs from {len(start_indexes) * len(end_indexes)} best pair propositions from {sum(start_logits.shape) * sum(end_logits.shape)} possible pairs.")

To handle all validation features, we need to add two things to our validation pipeline:
- verify that our pairs are inside the context and not the question.
- retrieve the actual text of the answer instead of the tokens.

We need to tokenize all our validation data. We will implement a processing pipeline that is slightly different from the `prepare_train_features` function we implemented before.

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # For this notebook to work with any kind of models, we need to account for the special case where the model
    # expects padding on the left (in which case we switch the order of the question and the context)
    pad_on_right = tokenizer.padding_side == "right"

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possible giving several features when a context is long, each of those features having a
    # context that overlaps a bit the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # --- START CODE HERE (19)
    # Apply the tokenizer as for prepare_train_features
    tokenized_examples = 
    # --- END CODE HERE

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # --- START CODE HERE (20)
        # Grab the text sequence corresponding to that feature.
        sequence_ids = 
        # --- END CODE HERE

        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # --- START CODE HERE (21)
        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = 
        # --- END CODE HERE


    return tokenized_examples

Now, we can use the [`.map`](https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/main_classes#datasets.Dataset.map) operator from datasets to run this tokenization process on our whole validation dataset.

In [None]:
# --- START CODE HERE (22)
# Apply prepare_validation_features to the validation_dataset and batch the data (this function uses the global tokenizer).
# Finally, remove the column names from the dataset.
validation_features = 
# --- END CODE HERE

With the `validation_features`, we will make predictions thanks to the [trainer](https://huggingface.co/docs/transformers/main_classes/trainer).

In [None]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

To refine our validation pipeline and eliminate irrelevant answers, we will filter out:
- the answers whose offset mapping is `None`, as they correspond to a part of the question.
- the answers longer than the hyper-parameter `max_answer_length`.

In [None]:
max_answer_length = 30
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]

# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index.
context = validation_dataset[0]["context"]

# --- START CODE HERE (23)
# Only keep the best propositions index as before.
start_indexes = 
end_indexes = 
# --- END CODE HERE
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # --- START CODE HERE (24)
        # Filter out-of-scope answers: indices out of bounds or in the question.
        if :
        # --- END CODE HERE
            continue

        # --- START CODE HERE (25)
        # Consider answers that are valid and shorter than max_answer_length.
        if :
        # --- END CODE HERE
            continue

        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

We can compare to the actual ground-truth answer:

In [None]:
validation_dataset[0]["answers"]

As mentioned in the code above, this was easy for the first feature because we knew it came from the first example.

For the other features, we need a map between examples and their corresponding features. Since one example can give several features, we will gather all the answers from all the features generated by a given example, then pick the best one. The following code builds a map from an example index to the indices of its corresponding features:

In [None]:
import collections

features = validation_features

example_id_to_index = {k: i for i, k in enumerate(validation_dataset["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
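The same grouping can be traced by hand on toy data. In the sketch below, the ids are made up: example `"a"` was split into two features, `"b"` into one and `"c"` into three.

```python
import collections

# Made-up ids: one entry per example, and one entry per feature.
toy_example_ids = ["a", "b", "c"]
toy_feature_example_ids = ["a", "a", "b", "c", "c", "c"]

toy_example_id_to_index = {k: i for i, k in enumerate(toy_example_ids)}
toy_features_per_example = collections.defaultdict(list)
for feature_index, example_id in enumerate(toy_feature_example_ids):
    # Append the feature index to the list of the example it came from.
    toy_features_per_example[toy_example_id_to_index[example_id]].append(feature_index)

print(dict(toy_features_per_example))
```

Each key is an example index and each value lists the features whose answers must be pooled before picking the best one.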

All combined together, this gives us this post-processing function:

In [None]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # --- START CODE HERE (26)
            # Only keep the best propositions index as before.
            start_indexes = 
            end_indexes = 
            # --- END CODE HERE
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # --- START CODE HERE (27)
                    # Filter out-of-scope answers: indices out of bounds or in the question as before.
                    if :
                    # --- END CODE HERE
                        continue
                    # --- START CODE HERE (28)
                    # Consider valid answers and the ones that have length shorter than max_answer_length.
                    if :
                    # --- END CODE HERE
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one
        predictions[example["id"]] = best_answer["text"]

    return predictions

Now, we can apply the post-processing function to our predictions:

In [None]:
# --- START CODE HERE (29)
# Apply our postprocess_qa_predictions to our predictions.
final_predictions = 
# --- END CODE HERE

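Then we can load the metric from the datasets library. Note that `load_metric` was deprecated and then removed in recent versions of datasets; with those versions, use `evaluate.load("squad")` from the separate [evaluate](https://pypi.org/project/evaluate/) package instead.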

In [None]:
from datasets import load_metric

metric = load_metric("squad")

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in validation_dataset]
metric.compute(predictions=formatted_predictions, references=references)
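The SQuAD metric reports two numbers, `exact_match` and `f1`. The sketch below is a simplified re-implementation for a single prediction, meant only to show what these scores measure. It assumes a normalization limited to lowercasing and whitespace splitting, whereas the official evaluation also strips punctuation and articles and takes the best score over all reference answers. The function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized prediction equals the normalized reference, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Token-level F1 between a prediction and a single reference answer."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    # Multiset intersection: tokens shared by both answers, with multiplicity.
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("William Shakespeare", "william shakespeare")
f1 = f1_score("by William Shakespeare", "William Shakespeare")
print(em, f1)
```

A prediction with an extra word is not an exact match, but still receives partial credit through the F1 score.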

You can improve this result by using a larger dataset or by training longer!

## What to do now?

If you want, you can look for datasets in your own language and see whether DistilBERT performs correctly. Generally, a model trained on the same language as your dataset will work better than a general model trained on several languages or, obviously, on a totally different language.

For example, for French, CamemBERT is a BERT-like model trained on French data that obtains very good performance on French NLP tasks.

You can take a look at other HuggingFace tutorials that cover other tasks to see how the tokenization process and the model differ for such tasks:
- [translation](https://huggingface.co/docs/transformers/tasks/translation)
- [summarization](https://huggingface.co/docs/transformers/tasks/summarization)
- ...

