If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Uncomment the following cell and run it.

In [36]:
# !pip install datasets transformers
# !pip install transformers[torch]
# !pip install accelerate -U
# !pip install datasets

# Needs a Hugging Face credential. HF_TOKEN is a placeholder for your own token; never commit a real one.
!huggingface-cli login --token $HF_TOKEN

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [40]:
from transformers.utils import send_example_telemetry
send_example_telemetry("question_answering_notebook", framework="pytorch")

# Fine-tuning a model on a question-answering task

This notebook is built to run on any question answering task with the same format as SQuAD (version 1 or 2), with any model checkpoint from the [Model Hub](https://huggingface.co/models) as long as that model has a version with a question answering head and a fast tokenizer (check on [this table](https://huggingface.co/transformers/index.html#bigtable) if this is the case). It might need some small adjustments if you decide to use a different dataset than the one used here. Depending on your model and the GPU you are using, you might need to adjust the batch size to avoid out-of-memory errors. Set those three parameters, then the rest of the notebook should run smoothly:

In [41]:
squad_v2 = False
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.  

In [204]:
from datasets import load_dataset, load_metric, Dataset, DatasetDict

In [414]:
# load squad
squad = load_dataset("squad", split="train[:80000]")
squad = squad.train_test_split(test_size=0.2)
squad['validation'] = squad['test']
del squad['test']

elist = ['Kanye_West', 'Beyoncé', 'American_Idol', 'PlayStation_3']
slist = ['Antibiotics', 'Genome', 'Solar_energy', 'Brain', 'Mammal', 'Diarrhea', 'Incandescent_light_bulb', 'Apollo', 'Neptune', 'On_the_Origin_of_Species']

def filter_total(example):
    return example['title'] in elist or example['title'] in slist
def filter_entertainment(example):
    return example['title'] in elist
def filter_science(example):
    return example['title'] in slist

train_filtered = squad['train'].filter(filter_total)
validation_filtered = squad['validation'].filter(filter_science)


## Data Augmentation

In [415]:
# !pip install nlpaug
from datasets import DatasetDict, concatenate_datasets
import nlpaug.augmenter.word as naw

# Initialize augmenters
aug_synonym = naw.SynonymAug(aug_src='wordnet')  # Synonym replacement
aug_insertion = naw.RandomWordAug(action="insert")  # Random insertion
aug_swap = naw.RandomWordAug(action="swap")  # Random swap
aug_deletion = naw.RandomWordAug(action="delete")  # Random deletion

def aug_sym(example):
    example['context'] = aug_synonym.augment(example["context"])[0]
    return example

def aug_ins(example):
    try:
        example['context'] = aug_insertion.augment(example["context"])[0]
    except NameError:
        pass
    return example

# Named `aug_swp` so the function does not shadow the `aug_swap` augmenter defined above
def aug_swp(example):
    example['context'] = aug_swap.augment(example["context"])[0]
    return example

def aug_del(example):
    example['context'] = aug_deletion.augment(example["context"])[0]
    return example

def aug_total(data):
    augm1 = data.map(aug_sym)
    # augm2 = data.map(aug_ins)
    # augm3 = data.map(aug_swp)
    augm4 = data.map(aug_del)

    # return concatenate_datasets([augm1, augm2, augm3, augm4])
    return concatenate_datasets([data, augm1, augm4])

train_aug = aug_total(train_filtered)
# val_aug = aug_total(validation_filtered)


In [416]:
print(train_filtered.shape)
print(train_aug.shape)

(3120, 5)
(9360, 5)
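
Word-level augmentation rewrites `context` but leaves `answers` untouched, so a stored `answer_start` may no longer point at the answer text (for instance, an augmented context can read "increase availability" while the stored answer is still "increasing availability"). The following is a minimal sketch of a check for this, assuming SQuAD-style examples; the helper name `answer_is_aligned` is hypothetical, and the two toy examples are made up for illustration.

```python
def answer_is_aligned(example):
    """Return True if every answer text still sits at its recorded character offset."""
    context = example["context"]
    answers = example["answers"]
    for text, start in zip(answers["text"], answers["answer_start"]):
        if context[start:start + len(text)] != text:
            return False
    return True


# Toy SQuAD-style example (made up for illustration).
ctx = "GE produced them in the US until 1913."
original = {
    "context": ctx,
    "answers": {"text": ["1913"], "answer_start": [ctx.index("1913")]},
}

# Simulate a random deletion: the word "them" is dropped, which shifts the answer left,
# while the stored answer_start keeps its old value.
augmented = dict(original, context="GE produced in the US until 1913.")

print(answer_is_aligned(original))   # the offset matches the untouched context
print(answer_is_aligned(augmented))  # the offset is stale after the deletion
```

Under these assumptions, a call such as `train_aug.filter(answer_is_aligned)` would drop the misaligned rows; re-locating the answer in the augmented context is another option.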


## Defining the train and validation data

In [417]:
# Create a new DatasetDict with the filtered data
datasets = DatasetDict({
    # 'train': train_filtered,
    'train': train_aug,

    'validation': validation_filtered
})

datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 9360
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 338
    })
})

In [418]:
from datasets import ClassLabel, Sequence
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
        elif isinstance(typ, Sequence) and isinstance(typ.feature, ClassLabel):
            df[column] = df[column].transform(lambda x: [typ.feature.names[i] for i in x])
    display(HTML(df.to_html()))

In [419]:
show_random_elements(datasets["train"])

Unnamed: 0,id,title,context,question,answers
0,572844763acd2414000df816,PlayStation_3,"Developers also found the machine difficult to program for. In 2007, Gabe Newell of Valve said "" The PS3 is a total disaster on so many levels, I think it ' s really clear that Sony lost track of what customers and what developers wanted "". He "" I ' d say, even at this late date, they should just cancel it and do a do over. Just say, ' This was a horrible disaster and we ' re sorry and we ' re going to stop selling this and stop to convince people develop for it ' "". Doug Lombardi VP of Marketing for Valve has since stated that they are interested in developing for the and are looking to hire talented PS3 programmers for future projects. He later restated Valve ' s position, "" Until we have the to get a PS3 team together, until we find the people who want to come to Valve or who are at Valve who want to work on that, I don ' t really see us moving to that platform "". At Sony ' E3 2010 press conference, Newell made appearance to recant his previous statements, citing Sony ' s move to make the system more developer friendly, and to announce that Valve would be developing Portal 2 for the system. He also claimed that the inclusion of Steamworks (Valve ' s system to automatically update their software independently) would help to make the PS3 version of Portal 2 the version on the market.",Who is Valve's VP of Marketing who says they want to hire programmers for a PS3 team?,"{'text': ['Doug Lombardi'], 'answer_start': [478]}"
1,57262d97ec44d21400f3dbb1,Incandescent_light_bulb,"In 1902, the Mho company developed a tantalum lamp filament. These lamps were more efficient than even graphitized c filaments and could operate at higher temperatures. Since tantalum metal has a lower resistivity than carbon, the tantalum lamp filament was quite long and required multiple internal supports. The metal filament had the property of gradually shortening in economic consumption; the filaments be installed with large loops that tightened in use. This made lamps in use for several hundred hours quite fragile. Metal filaments had the property of breaking and re - welding, though this would commonly decrease resistance and shorten the life of the filament. General Electric bought the rights to apply tantalum filaments and produced them in the US until 1913.",When did GE cease production of the tantalum light filament?,"{'text': ['1913'], 'answer_start': [760]}"
2,56bf8fc1a10cfb1400551177,Beyoncé,"Beyoncé ' s first solo recording was a feature on Jay Z ' entropy "" ' 03 Bonnie & Clyde "" that was released in October 2002, top out at number four on the U. S. Billboard Hot 100 chart. Her first solo album Dangerously in Love was released on June 24, 2003, after Michelle Williams and Kelly Rowland had released their solo efforts. The album sold 317, 000 copies in its first hebdomad, debuted atop the Billboard 200, and has since sold 11 million copies worldwide. The album ' s lead single, "" Crazy in Love "", featuring John jay Z, became Beyoncé ' s first number - one single as a solo artist in the US. The single "" Baby Boy "" also reached number one, and singles, "" Me, Myself and I "" and "" Naughty Girl "", both reached the top - five. The album garner Beyoncé a then record - tying five awards at the 46th Annual Grammy Awards; Best Contemporary R & B Album, Charles herbert best Female R & B Vocal Performance for "" Dangerously in Love 2 "", Charles herbert best R & B Song and Best Rap / Sung Collaboration for "" Crazy in Love "", and Best R & B Performance by a Duo or Group with Vocals for "" The Closer I Get to You "" with Martin luther Vandross.","The album, Dangerously in Love achieved what spot on the Billboard Top 100 chart?","{'text': ['number four'], 'answer_start': [123]}"
3,56bf9b57a10cfb14005511b1,Beyoncé,"At the 52nd Annual Grammy Awards, Beyoncé received ten nominations, including Album of the Year for I Be. .. Sasha Fierce, Platter of the Year for "" Halo "", and Song of the Year for "" Single Ladies (Put a Ring on It) "", among others. She tied with Lauryn Benny hill for most Grammy nominations in a single year by a female artist. In 2010, Beyoncé was featured on Lady Gaga ' s single "" Telephone "" and it music video. The song topped the US Pop Songs chart, becoming the sixth number - one for both Beyoncé and Gaga, tying them with Mariah Carey for most number - ones since the Nielsen Top 40 airplay chart launched in 1992. "" Telephony "" received a Grammy Award nomination for Best Pop Collaboration with Vocals.",Beyonce received how many nominations at the 52nd Annual Grammy Awards?,"{'text': ['ten nominations'], 'answer_start': [51]}"
4,56daebe7e7c41114004b4b19,American_Idol,"American Idol is an American singing competition series created by Simon Fuller and produced by 19 Entertainment, and is distributed by FremantleMedia Frederick north America. It began airing on Fox on June 11, 2002, as an addition to the Idols format based on the British series Pop Idol and has since become one of the most successful show in the history of American television. The concept of the series is to find unexampled solo transcription artists, with the winner be determined by the viewers in America. Winners chosen by viewers through telephone, Internet, and SMS text voting were Kelly Clarkson, Ruben Studdard, Fantasia Barrino, Carrie Undergrowth, Taylor Hicks, Jordin Sparks, David Cook, Creese Allen, Lee DeWyze, Scotty McCreery, Phillip Phillips, Candice Glover, Caleb Johnson, and Nick Fradiani.",What British show was American Idol based on?,"{'text': ['Pop Idol'], 'answer_start': [270]}"
5,56d38a1559d6e414001466b4,American_Idol,"Some in the entertainment industry were critical of the star - make aspect of the show. Usher, a mentor on the show, bemoaning the loss of the "" true art form of music "", thought that shows like American Idol relieve oneself it seem "" so easy that everyone can do it, and that it can happen overnight "", and that "" television is a lie "". Musician Michael Feinstein, while acknowledging that the appearance had uncovered promising performers, said that American Idol "" isn ' t really about music. It ' s about all the bad aspects of the music business – the arrogance of commerce, this sensory faculty of ' I know what will make this person a star; artists themselves don ' t know. ' "" That American Idol is seen to be a fast track to success for its contestants has been a cause of resentment for some in the industry. LeAnn Rimes, commenting on Carrie Underwood winning Best Female Artist in Country Music Awards over Faith Hill in 2006, said that "" Carrie has not paid her dues long enough to fully deserve that award "". It is a common theme that has been echoed by many others. Elton John, who had appeared as a mentor in the show but turned down an offer to embody a judge on American Idol, commenting on talent shows in general, said that "" there have been some good acts but the only way to sustain a career is to pay your dues in small clubs "".",What year did Carrie Underwood win a Country Music Award for Best Female Artist?,"{'text': ['2006'], 'answer_start': [891]}"
6,5733d91e4776f4190066134d,Antibiotics,"Possible improvements include clarification of clinical trial regulations by FDA. Furthermore, appropriate economic incentives could persuade pharmaceutical companies to invest in this endeavor. Antibiotic Development to Advance Patient Treatment (ADAPT) Act aims to fast track the drug development to combat the growing threat of 'superbugs'. Under this Act, FDA can approve antibiotics and antifungals treating life-threatening infections based on smaller clinical trials. The CDC will monitor the use of antibiotics and the emerging resistance, and publish the data. The FDA antibiotics labeling process, 'Susceptibility Test Interpretive Criteria for Microbial Organisms' or 'breakpoints', will provide accurate data to healthcare professionals. According to Allan Coukell, senior director for health programs at The Pew Charitable Trusts, ""By allowing drug developers to rely on smaller datasets, and clarifying FDA's authority to tolerate a higher level of uncertainty for these drugs when making a risk/benefit calculation, ADAPT would make the clinical trials more feasible.""",Who is a director at the Pew Charitable Trusts?,"{'text': ['Allan Coukell,'], 'answer_start': [763]}"
7,5733b6a2d058e614000b6122,Antibiotics,"With progress in medicinal chemistry, most modern antibacterials are semisynthetic modifications of various natural chemical compound. These include, for example, the beta - lactam antibiotics, which include the penicillins (produced by fungus kingdom in the genus Penicillium ), the cephalosporins, and the carbapenems. Compounds that are still isolated from living organisms are the aminoglycosides, whereas other bactericide — for example, the sulfonamides, the quinolones, and the oxazolidinones — are develop solely by chemical synthesis. Many antibacterial compounds exist relatively small molecules with a molecular weight of less than 2000 atomic mass units. [citation need ]",What are antibiotics in chemical terms?,"{'text': ['semisynthetic modifications'], 'answer_start': [69]}"
8,572849443acd2414000df898,PlayStation_3,"The PlayStation 3 Slim received extremely positive reviews as well as a boost in sales; less than 24 hours after its announcement, PS3 Slim took the number-one bestseller spot on Amazon.com in the video games section for fifteen consecutive days. It regained the number-one position again one day later. PS3 Slim also received praise from PC World giving it a 90 out of 100 praising its new repackaging and the new value it brings at a lower price as well as praising its quietness and the reduction in its power consumption. This is in stark contrast to the original PS3's launch in which it was given position number-eight on their ""The Top 21 Tech Screwups of 2006"" list.",For how many consecutive days did the PS3 Slim hold the number-one spot on Amazon.com?,"{'text': ['fifteen'], 'answer_start': [221]}"
9,56d09354234ae51400d9c3ab,Solar_energy,"Beginning with the surge in coal purpose which accompanied the Industrial Rotation, energy consumption has steadily transitioned from wood and biomass to fossil fuels. The early development of solar technology starting in the 1860s was beat back by an expectation that coal would soon suit scarce. However, development of solar technologies stagnated in the early twentieth c in the face of the increase availability, economy, and utility of coal and petroleum.",What slowed the development of solar technologies in the early 20th century?,"{'text': ['increasing availability, economy, and utility of coal and petroleum'], 'answer_start': [395]}"


## Preprocessing the training data

In [420]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [421]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

In [422]:
max_length = 384 # The maximum length of a feature (question and context)
doc_stride = 128 # The authorized overlap between two parts of the context when splitting is needed.
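
To make the role of `doc_stride` concrete, here is a minimal sketch of overlapping windows over a token list. It is a simplification: it ignores the question tokens and special tokens that the real tokenizer also has to fit into `max_length`. The helper `split_with_overlap` is hypothetical and not part of 🤗 Transformers.

```python
def split_with_overlap(tokens, window, overlap):
    """Split `tokens` into chunks of at most `window` items that share `overlap` items."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    step = window - overlap  # how far each new chunk advances
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


# Ten toy "tokens", windows of 4 tokens, 2 tokens shared between neighbours.
chunks = split_with_overlap(list(range(10)), window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap makes it less likely that an answer is cut in half by a window boundary, at the cost of producing more features per example.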

In [423]:
pad_on_right = tokenizer.padding_side == "right"

In [424]:
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

This function works with one or several examples. In the case of several examples, the tokenizer will return a list of lists for each key.
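
The labelling step inside `prepare_train_features` boils down to mapping a character span onto token indices using the offsets. Here is a toy sketch of that idea, assuming whitespace-separated tokens instead of a real tokenizer; `char_span_to_token_span` is a hypothetical helper, not a library function.

```python
import re


def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span onto (start_token, end_token) using per-token (start, end) offsets.

    Returns None when the span is not fully covered by the offsets, which is the case
    the notebook labels with the CLS index.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    start_tok = 0
    while start_tok < len(offsets) and offsets[start_tok][0] <= start_char:
        start_tok += 1
    end_tok = len(offsets) - 1
    while end_tok >= 0 and offsets[end_tok][1] >= end_char:
        end_tok -= 1
    return start_tok - 1, end_tok + 1


context = "produced them until 1913 ."
# Whitespace "tokenizer": one (start, end) character offset pair per word.
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]

answer = "1913"
start_char = context.index(answer)
end_char = start_char + len(answer)

span = char_span_to_token_span(offsets, start_char, end_char)
print(span)  # token indices of the first and last answer token
```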

## Fine-tuning the model

In [425]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To instantiate a `Trainer`, we will need to define three more things. The most important is the [`TrainingArguments`](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [426]:
model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-finetuned-squad-newsqa",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

Here we set the evaluation to be done at the end of each epoch, tweak the learning rate, use the `batch_size` defined at the top of the notebook and customize the number of epochs for training, as well as the weight decay.

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your own namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-squad"` or `"huggingface/bert-finetuned-squad"`).

Then we will need a data collator that will batch our processed examples together; here the default one will work:

In [427]:
from transformers import default_data_collator

data_collator = default_data_collator

We will evaluate our model and compute metrics in the next section (this is a very long operation, so we will only compute the evaluation loss during training).

Then we just need to pass all of this along with our datasets to the `Trainer`:

In [428]:
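# `tokenized_datasets` is not defined in the cells shown above, so build it here from `prepare_train_features`.
tokenized_datasets = datasets.map(
    prepare_train_features,
    batched=True,
    remove_columns=datasets["train"].column_names,
)

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)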

We can now finetune our model by just calling the `train` method:

In [429]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,3.347259
2,No log,2.640569
3,2.956100,2.556108


TrainOutput(global_step=597, training_loss=2.7813381278135467, metrics={'train_runtime': 112.558, 'train_samples_per_second': 84.463, 'train_steps_per_second': 5.304, 'total_flos': 931589266503168.0, 'train_loss': 2.7813381278135467, 'epoch': 3.0})

Since this training is particularly long, let's save the model just in case we need to restart.

In [430]:
trainer.save_model("test-squad-trained")


## Evaluation

Evaluating our model will require a bit more work, as we will need to map the predictions of our model back to parts of the context. The model itself predicts logits for the start and end positions of our answers: if we take a batch from our validation dataloader, here is the output our model gives us:

In [431]:
import torch

for batch in trainer.get_eval_dataloader():
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}
with torch.no_grad():
    output = trainer.model(**batch)
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

The output of the model is a dict-like object that contains the loss (since we provided labels) and the start and end logits. We won't need the loss for our predictions, so let's have a look at the logits:

In [432]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([16, 384]), torch.Size([16, 384]))

We have one logit for each feature and each token. The most obvious way to predict an answer for each feature is to take the index of the maximum of the start logits as the start position and the index of the maximum of the end logits as the end position.

In [433]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)

(tensor([ 35,  61,  45,  74, 144,  94,  46, 105, 162,  37, 103,  14,  20, 124,
          48, 103], device='cuda:0'),
 tensor([ 35,  63,  51,  36,  70,  96,  47, 107, 163,  39,  49,  21,  22, 127,
          52, 101], device='cuda:0'))
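
In the output above, some features get an end index smaller than their start index (for example 74 and 36 for the fourth feature), so taking the two argmaxes independently does not always give a usable span. Here is a toy sketch of the alternative: score pairs of top start and end candidates by the sum of their logits and keep only valid pairs. The function `best_span` and the logit values are made up for illustration.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_length=30):
    """Return (score, start, end) for the best valid span among the top-n candidates, or None."""

    def top_indices(values):
        # Indices of the n_best largest values, best first.
        return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:n_best]

    candidates = []
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip spans that end before they start or that are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            candidates.append((start_logits[start] + end_logits[end], start, end))
    return max(candidates) if candidates else None


# Toy logits: the independent argmaxes are start=2 and end=1, which is not a valid span.
start_logits = [0.1, 0.2, 3.0, 2.5]
end_logits = [0.3, 4.0, 0.1, 3.5]

print(best_span(start_logits, end_logits))
```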

In [434]:
n_best_size = 20

In [435]:
import numpy as np

start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index: # We need to refine that test to check the answer is inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": "" # We need to find a way to get back the original substring corresponding to the answer in the context
                }
            )

In [436]:
def prepare_validation_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

And like before, we can apply that function to our validation set easily:

In [437]:
validation_features = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)


Now we can grab the predictions for all features by using the `Trainer.predict` method:

In [438]:
raw_predictions = trainer.predict(validation_features)

The `Trainer` *hides* the columns that are not used by the model (here `example_id` and `offset_mapping` which we will need for our post-processing), so we set them back:

In [439]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

We can now refine the test we had before: since we set the offset mapping to `None` for tokens that belong to the question, it's easy to check if an answer is fully inside the context. We also eliminate very long answers from our considerations (with a hyperparameter we can tune).
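
The `None` trick can be illustrated on a toy feature. This is a sketch under simplified assumptions: the sequence ids and offsets below are made up, and `mask_non_context` and `span_in_context` are hypothetical helpers that mirror what `prepare_validation_features` and the check below do.

```python
def mask_non_context(offsets, sequence_ids, context_index=1):
    """Keep offsets only for tokens that belong to the context; set the rest to None."""
    return [o if sid == context_index else None for o, sid in zip(offsets, sequence_ids)]


def span_in_context(masked_offsets, start_index, end_index):
    """True if both ends of the span are context tokens and the span is not reversed."""
    return (
        start_index < len(masked_offsets)
        and end_index < len(masked_offsets)
        and masked_offsets[start_index] is not None
        and masked_offsets[end_index] is not None
        and start_index <= end_index
    )


# Toy feature: [CLS] q q [SEP] c c c [SEP]; None marks special tokens, 0 the question, 1 the context.
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 4), (5, 8), (0, 0), (0, 5), (6, 11), (12, 15), (0, 0)]

masked = mask_non_context(offsets, sequence_ids)
print(masked)
print(span_in_context(masked, 4, 6))  # both ends are context tokens
print(span_in_context(masked, 1, 5))  # starts inside the question
```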

In [440]:
max_answer_length = 30

In [441]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features[0]["offset_mapping"]
# The first feature comes from the first example. For the more general case, we will need to match the example_id to
# an example index
context = datasets["validation"][0]["context"]

# Gather the indices of the best start/end logits:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
        # to part of the input_ids that are not in the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Don't consider answers with a length that is either < 0 or > max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        # Both checks above passed, so the span lies inside the context: map it back to characters.
        start_char = offset_mapping[start_index][0]
        end_char = offset_mapping[end_index][1]
        valid_answers.append(
            {
                "score": start_logits[start_index] + end_logits[end_index],
                "text": context[start_char: end_char]
            }
        )

valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 7.744294, 'text': 'G'},
 {'score': 7.7277107, 'text': 'cells. G'},
 {'score': 5.976891, 'text': '. G'},
 {'score': 5.7757683,
  'text': 'Glial cells (also known as glia or neuroglia) come in several types, and perform a number of critical functions,'},
 {'score': 5.759185,
  'text': 'cells. Glial cells (also known as glia or neuroglia) come in several types, and perform a number of critical functions,'},
 {'score': 4.22836, 'text': 'functions,'},
 {'score': 4.008365,
  'text': '. Glial cells (also known as glia or neuroglia) come in several types, and perform a number of critical functions,'},
 {'score': 3.671344, 'text': 'cells.'},
 {'score': 3.6209111,
  'text': 'functions, including structural support, metabolic support, insulation, and guidance of development. Neurons, however, are usually considered the most important cells in'},
 {'score': 2.9378724, 'text': 'cells'},
 {'score': 2.5089068,
  'text': 'two broad classes of cells: neurons and glial cells. G'},
 {'score': 
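
The slice `[-1 : -n_best_size - 1 : -1]` applied to the result of `np.argsort` may be easier to read on a small example. The sketch below uses a pure-Python stand-in for `np.argsort` (the helper `argsort` is hypothetical and not part of the notebook); list slicing follows the same rules as the NumPy slice used above.

```python
def argsort(values):
    """Pure-Python stand-in for np.argsort: indices that sort values in ascending order."""
    return sorted(range(len(values)), key=lambda i: values[i])


logits = [0.1, 2.5, -1.0, 4.2, 3.3]
n_best_size = 3

ascending = argsort(logits)
print(ascending)    # [2, 0, 1, 4, 3]

# Start at the last element, step backwards, and stop before position -n_best_size - 1,
# which yields the indices of the n_best_size largest logits, best first.
top_indexes = ascending[-1 : -n_best_size - 1 : -1]
print(top_indexes)  # [3, 4, 1]
```

If `n_best_size` exceeds the number of logits, the slice returns every index in descending order of logit rather than raising an error.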

We can compare to the actual ground-truth answer:

In [442]:
datasets["validation"][0]["answers"]

{'text': ['glia or neuroglia'], 'answer_start': [132]}

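In this run, the top-scoring candidate ('G') does not match the ground-truth answer ('glia or neuroglia'), so the model did not pick the right span for this example.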

As we mentioned in the code above, this was easy on the first feature because we knew it comes from the first example. For the other features, we will need a map between examples and their corresponding features. Also, since one example can give several features, we will need to gather all the answers from all the features generated by a given example, then pick the best one. The following code builds a map from each example index to the indices of its corresponding features:

In [443]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)
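
To see what this map looks like, here is the same logic on made-up ids (the names `example_ids` and `feature_example_ids` are illustrative assumptions, not columns of the real dataset). Example `"b"` stands for a long context that was split into two features.

```python
import collections

# Hypothetical ids: one entry per example, and one entry per feature
# giving the id of the example that feature was cut from.
example_ids = ["a", "b", "c"]
feature_example_ids = ["a", "b", "b", "c"]

example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(feature_example_ids):
    features_per_example[example_id_to_index[example_id]].append(i)

print(dict(features_per_example))  # {0: [0], 1: [1, 2], 2: [3]}
```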

In [444]:
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions
    # Build a map from each example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # These are the indices of the features associated with the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []

        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the original
            # context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greatest start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )

        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case where there is not a single non-null prediction, we create a fake
            # prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
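
The final decision step of the function can be isolated in a few lines. The following is a toy sketch, not a replacement for the function above: the helper `pick_answer` and the candidate scores are invented for illustration, and it uses `max` where the function sorts and takes the first element.

```python
def pick_answer(valid_answers, null_score, squad_v2):
    """Choose the final text from the pooled candidates of all features of one example."""
    if valid_answers:
        best = max(valid_answers, key=lambda a: a["score"])
    else:
        # Fallback when no candidate survived the filtering.
        best = {"text": "", "score": 0.0}
    if not squad_v2:
        return best["text"]
    # With unanswerable questions, keep the span only if it beats the null score.
    return best["text"] if best["score"] > null_score else ""


candidates = [
    {"text": "Paris", "score": 6.5},     # e.g. found in the first feature
    {"text": "in Paris", "score": 5.0},  # e.g. found in an overlapping feature
]
print(pick_answer(candidates, null_score=8.0, squad_v2=False))       # Paris
print(repr(pick_answer(candidates, null_score=8.0, squad_v2=True)))  # ''
print(pick_answer(candidates, null_score=2.0, squad_v2=True))        # Paris
```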

And we can apply our post-processing function to our raw predictions:

In [445]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features, raw_predictions.predictions)

Post-processing 338 example predictions split into 350 features.



Next, we load the metric from the 🤗 Datasets library.

In [446]:
metric = load_metric("squad")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


Then we can call `compute` on it. We just need to format predictions and labels a bit, as it expects a list of dictionaries and not one big dictionary. In the case of squad_v2, each prediction would also need a `no_answer_probability` field (which would be set to 0.0, as we have already set the answer to empty if we picked it); since `squad_v2` is `False` here, the code below omits it.

In [447]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 39.349112426035504, 'f1': 52.593878900107285}
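
To give an intuition for the two numbers, here is a simplified sketch of SQuAD-style exact match and token-level F1 for a single prediction and a single reference. It is an approximation written for illustration, not the code the metric actually runs; the function names are assumptions. The real metric also takes the best score over all reference answers for a question and reports averages on a 0-100 scale.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Glia or neuroglia.", "glia or neuroglia"))  # 1.0
print(exact_match("G", "glia or neuroglia"))                   # 0.0
print(f1("glia", "glia or neuroglia"))                         # 0.5
```

A partially correct span such as "glia" gets no credit under exact match but some credit under F1, which is why the F1 score above is higher than the exact-match score.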

You can now upload the result of the training to the Hub by executing this instruction:

In [448]:
trainer.push_to_hub()

'https://huggingface.co/hyunjerry/distilbert-base-uncased-finetuned-squad-newsqa/tree/main/'