<a href="https://colab.research.google.com/github/maruseppe/Extractive-question-answering/blob/main/distilbert_finetuned_squad_accelerate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extractive question answering

This involves posing questions about a document and identifying the answers as spans of text in the document itself.

We will fine-tune a DistilBERT model on the SQuAD dataset.
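To make the idea of "answers as spans of text" concrete, here is a minimal sketch on a made-up context and question (they are illustrative only and not taken from SQuAD). The annotation follows the SQuAD layout of an answer text plus its character offset.

```python
# Toy illustration: in extractive QA the answer is a slice of the context.
context = "The Eiffel Tower was completed in 1889 and is located in Paris."
question = "When was the Eiffel Tower completed?"

# SQuAD-style annotation: the answer text plus its character offset in the context.
answer = {"text": ["1889"], "answer_start": [context.index("1889")]}

start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])

# Slicing the context with these offsets recovers the answer exactly.
extracted = context[start_char:end_char]
print(extracted)
```

The model never generates text: it only has to predict where the span starts and where it ends.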

# Push to the Hub 

To save the model weights and configuration during training, we push the model to the Hub. To do this, we’ll need to log in to Hugging Face. If you’re running this code in a notebook, you can do so with the following utility function, which displays a widget where you can enter your login credentials. Make sure you have created an access token with write permissions in your Hugging Face account.

In [1]:
!pip install huggingface-hub

from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


We’ll determine the repository name from the model ID we want to give our model (feel free to replace repo_name with your own choice; it just needs to contain your username, which is what the function get_full_repo_name() adds):

In [2]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "distilbert-finetuned-squad-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'maruseppe/distilbert-finetuned-squad-accelerate'

The following code is needed to make sure the package for Git LFS is installed.

In [4]:
!sudo apt-get install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  git-lfs
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Need to get 2,129 kB of archives.
After this operation, 7,662 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 git-lfs amd64 2.3.4-1 [2,129 kB]
Fetched 2,129 kB in 2s (888 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package git-lfs.
(Reading database ... 155335 files and directories cur

Then we can clone that repository in a local folder. If it already exists, this local folder should be a clone of the repository we are working with:

In [5]:
output_dir = "distilbert-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

Cloning https://huggingface.co/maruseppe/distilbert-finetuned-squad-accelerate into local empty directory.


# Loading pretrained model 

In our fine-tuning task we use the pretrained DistilBERT model.  
DistilBERT is a Transformer model, a distilled version of the BERT base model that is smaller and faster than BERT. It was pretrained on the same corpus in a self-supervised fashion, using the BERT base model as a teacher.

BERT is a Transformer model pretrained on a large corpus of English data in a self-supervised fashion. The model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).

Compared with BERT, DistilBERT is about 4 times faster at the cost of roughly 3% lower performance, while RoBERTa is 4-5 times slower but improves performance by 2-20%.

In [6]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed pyyaml-6.0 sacremoses-0.0.49 tokenizers-0.11.6 transf

In [8]:
from transformers import AutoModelForQuestionAnswering
model_checkpoint = "distilbert-base-uncased"
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

# Loading dataset for fine-tuning 
We use the SQuAD dataset, which consists of questions posed by crowdworkers on a set of Wikipedia articles.

In [9]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.0.0-py3-none-any.whl (325 kB)

In [10]:
from datasets import load_dataset 

In [11]:
raw_datasets = load_dataset("squad")

Downloading builder script:   0%|          | 0.00/1.97k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/1.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/8.12M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.05M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

A good practice when doing any sort of data analysis is to grab a small random sample to get a quick feel for the type of data we’re working with. In 🤗 Datasets, we can create a random sample by chaining the Dataset.shuffle() and Dataset.select() functions together.

To enable the conversion between various third-party libraries, 🤗 Datasets provides a Dataset.set_format() function. This function only changes the output format of the dataset, so you can easily switch to another format without affecting the underlying data format, which is Apache Arrow. The formatting is done in place. To demonstrate, let’s convert our dataset to Pandas.



In [12]:
sample = raw_datasets["train"].shuffle(seed=42).select(range(1000))
sample.set_format("pandas")
# Peek at the first few examples
sample[:3]

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 573173d8497a881900248f0c | Egypt | The Pew Forum on Religion & Public Life ranks ... | What percentage of Egyptians polled support de... | {'text': ['84%'], 'answer_start': [468]} |
| 1 | 57277e815951b619008f8b52 | Ann_Arbor,_Michigan | The Ann Arbor Hands-On Museum is located in a ... | Ann Arbor ranks 1st among what goods sold? | {'text': ['books'], 'answer_start': [402]} |
| 2 | 5727e2483acd2414000deef0 | Rule_of_law | One important aspect of the rule-of-law initia... | In developing countries, who makes most of the... | {'text': ['the executive'], 'answer_start': [6... |


Let’s verify that every example in the training set has exactly one possible answer.
In the validation set, however, there are several possible answers for each sample, which may be the same or different. We won’t dive into the evaluation script, as it will all be wrapped up by a 🤗 Datasets metric for us; the short version is that the script compares a predicted answer to all the acceptable answers and takes the best score.

In [13]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) > 1)

  0%|          | 0/88 [00:00<?, ?ba/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

In [14]:
raw_datasets["validation"].filter(lambda x: len(x["answers"]["text"]) > 1)

  0%|          | 0/11 [00:00<?, ?ba/s]

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10567
})
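The "take the best score over all acceptable answers" behaviour can be sketched in plain Python. This is a simplified stand-in written for illustration, not the evaluation script that the 🤗 Datasets metric wraps; the helper names and the reference list below are made up for the example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))


def f1(prediction, reference):
    """Token-level F1 between a prediction and one reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(score_fn, prediction, references):
    """Score the prediction against every reference and keep the best."""
    return max(score_fn(prediction, ref) for ref in references)


# Hypothetical references for a single validation question.
references = ["the executive", "executive branch", "the executive"]
prediction = "executive"

em = best_over_references(exact_match, prediction, references)
f1_score = best_over_references(f1, prediction, references)
print(em, f1_score)
```

Here the prediction only partially overlaps with "executive branch", but it matches "the executive" once articles are stripped, so the best score over the references is the one that counts.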

## Preprocessing the train data 
As we deal with long contexts, each example may be split into several chunks, each containing the question and part of the context; consecutive chunks overlap by stride tokens. Consequently, a given chunk may or may not contain the answer. The preprocessing step therefore assigns each chunk the tuple of indices (0, 0) if the answer is not in the chunk and (start_answer_index, end_answer_index) if it is.
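The way a long context is cut into overlapping chunks can be sketched without a tokenizer. The function below is a hypothetical helper, not part of 🤗 Transformers; it assumes that consecutive windows share `stride` tokens, which is our understanding of how the tokenizer's stride argument behaves, and it ignores the question and special tokens that take up part of max_length in practice.

```python
def sliding_windows(context_tokens, window_size, stride):
    """Split tokens into overlapping windows; consecutive windows share `stride` tokens."""
    if not 0 <= stride < window_size:
        raise ValueError("stride must be non-negative and smaller than the window size")
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start : start + window_size])
        if start + window_size >= len(context_tokens):
            break  # this window reached the end of the context
        start += window_size - stride
    return windows


tokens = list(range(10))  # stand-in for 10 context tokens
windows = sliding_windows(tokens, window_size=4, stride=2)
for window in windows:
    print(window)
```

Because of the overlap, an answer that is cut at the end of one window has a chance of appearing whole in the next one.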

We first need to find the indices that start and end the context in the input IDs. We could use the token type IDs to do this, but since those do not necessarily exist for all models (DistilBERT does not require them, for instance), we’ll instead use the sequence_ids() method of the BatchEncoding our tokenizer returns.

Once we have the start and end context token indices, we look at the corresponding offsets, which are tuples of two integers representing the span of characters each token covers in the original context.
At this point, we assign start and end position labels of 0 (corresponding to the [CLS] token) to those chunks where the answer is not included in the context. The same assignment is made for chunks where the answer has been truncated, so that only its start (or end) is present. Finally, for the chunks where the answer is fully in the context, the labels are the index of the token where the answer starts and the index of the token where the answer ends.
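The labeling rule can be tried on a toy example before applying it to real tokenizer output. The sketch below uses whitespace-separated words as stand-ins for tokens; `token_offsets`, `label_chunk` and the sentence are all hypothetical, and position 0 of each chunk is assumed to hold [CLS].

```python
import re


def token_offsets(text):
    """Character (start, end) span of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def label_chunk(offsets, start_char, end_char, first_position=1):
    """Label one chunk given the offsets of its context tokens.

    `first_position` is where the first context token sits in the model
    input (position 0 is assumed to hold [CLS]).
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return (0, 0)  # answer missing or truncated: point at [CLS]
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return (first_position + start_token, first_position + end_token)


context = "the quick brown fox jumps over the lazy dog near the river bank"
answer_text = "lazy dog"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

offsets = token_offsets(context)
chunk_a = offsets[:7]   # ends before the answer starts
chunk_b = offsets[5:]   # contains the whole answer
chunk_c = offsets[:8]   # contains "lazy" but not "dog"

print(label_chunk(chunk_a, start_char, end_char))
print(label_chunk(chunk_b, start_char, end_char))
print(label_chunk(chunk_c, start_char, end_char))
```

Only the chunk that contains the whole answer receives real token positions; the other two fall back to (0, 0).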

In [15]:
from transformers import AutoTokenizer 

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.is_fast

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

True

In [16]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    # Remove extra whitespace present in some questions of the SQuAD dataset
    questions = [q.strip() for q in examples["question"]]

    # Tokenize question/context pairs, splitting long contexts into overlapping chunks
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

To apply this function to the whole training set, we use the Dataset.map() method with the batched=True flag. It’s necessary here as we are changing the length of the dataset (since one example can give several training features):

In [17]:
train_dataset = raw_datasets["train"].map(
    preprocess_training_examples, 
    batched = True,
    remove_columns = raw_datasets["train"].column_names,
)

  0%|          | 0/88 [00:00<?, ?ba/s]

In [18]:
print(f"After tokenization the dataset increased from  {len( raw_datasets['train'] ) } to {len(train_dataset)} examples ")

After tokenization the dataset increased from  87599 to 88524 examples 
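Why a batched function is allowed to change the number of rows can be seen with a plain-Python stand-in. With batched=True the function receives a dict of lists and returns a dict of lists, which may be longer than the input. The function and column names below are hypothetical and 🤗 Datasets is not used.

```python
def split_into_chunks(batch, max_words=3):
    """Batched function: receives a dict of lists, returns a dict of (longer) lists."""
    out = {"chunk": [], "source_row": []}
    for row, text in enumerate(batch["text"]):
        words = text.split()
        for i in range(0, len(words), max_words):
            out["chunk"].append(" ".join(words[i : i + max_words]))
            out["source_row"].append(row)  # remember which input row each chunk came from
    return out


batch = {"text": ["a b c d e", "f g"]}
result = split_into_chunks(batch)
print(result)
```

Two input rows become three output rows, so the original columns no longer line up with the new ones. That is why the call above also passes remove_columns.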


## Preprocessing the validation data
After training, we will validate our model by converting its predictions on the validation dataset into spans of the original context. For this, we just need to store the offset mappings and a way to match each created feature to the original example it comes from (to this end, we will use the ID column of the original dataset).

Note that we don’t need to generate labels (unless we want to compute a validation loss, but that number won’t really help us understand how good the model is). 

The only thing we’ll add here is a tiny bit of cleanup of the offset mappings. They will contain offsets for the question and the context, but once we’re in the post-processing stage we won’t have any way to know which part of the input IDs corresponded to the context and which part was the question (the sequence_ids() method we used is available for the output of the tokenizer only). So, we’ll set the offsets corresponding to the question to None:



In [19]:
def preprocess_validation_examples(examples):
    
    # Strip extra whitespace from each question
    questions = [q.strip() for q in examples["question"]]
    
    # Tokenize question/context pairs, splitting long contexts into overlapping chunks
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_map = inputs.pop("overflow_to_sample_mapping")
    example_ids = []

    for i in range(len(inputs["input_ids"])):
        
        # Assign the example ID to the chunk
        sample_idx = sample_map[i]
        example_ids.append(examples["id"][sample_idx])
        
        # Set the offsets of tokens that belong to the question (or are special tokens) to None
        sequence_ids = inputs.sequence_ids(i)
        offset = inputs["offset_mapping"][i]
        inputs["offset_mapping"][i] = [
            o if sequence_ids[k] == 1 else None for k, o in enumerate(offset)
        ]

    inputs["example_id"] = example_ids
    return inputs

In [20]:
validation_dataset = raw_datasets["validation"].map(
    preprocess_validation_examples,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)
print(f"After tokenization the dataset increased from  {len( raw_datasets['validation'] ) } to {len(validation_dataset)} examples ")

  0%|          | 0/11 [00:00<?, ?ba/s]

After tokenization the dataset increased from  10570 to 10784 examples 


## Metrics Function
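During training we will use a compute_metrics() function to measure how well the model generalizes to the validation data. This function receives the start and end logits, the dataset of features (which provides the offsets and example IDs), and the dataset of original examples (which provides the contexts and reference answers). For each example it scores the candidate spans built from the n_best highest start and end logits, discards spans that fall outside the context, have negative length, or are longer than max_answer_length, and keeps the highest-scoring one as the prediction.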


In [21]:
from datasets import load_metric

metric = load_metric("squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

In [22]:
from tqdm.auto import tqdm
import collections
import numpy as np

n_best = 20
max_answer_length = 30

def compute_metrics(start_logits, end_logits, features, examples):
    example_to_features = collections.defaultdict(list)
    for idx, feature in enumerate(features):
        example_to_features[feature["example_id"]].append(idx)

    predicted_answers = []
    for example in tqdm(examples):
        example_id = example["id"]
        context = example["context"]
        answers = []

        # Loop through all features associated with that example
        for feature_index in example_to_features[example_id]:
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]
            offsets = features[feature_index]["offset_mapping"]

            start_indexes = np.argsort(start_logit)[-1 : -n_best - 1 : -1].tolist()
            end_indexes = np.argsort(end_logit)[-1 : -n_best - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip answers that are not fully in the context
                    if offsets[start_index] is None or offsets[end_index] is None:
                        continue
                    # Skip answers with a length that is either < 0 or > max_answer_length
                    if (
                        end_index < start_index
                        or end_index - start_index + 1 > max_answer_length
                    ):
                        continue

                    answer = {
                        "text": context[offsets[start_index][0] : offsets[end_index][1]],
                        "logit_score": start_logit[start_index] + end_logit[end_index],
                    }
                    answers.append(answer)

        # Select the answer with the best score
        if len(answers) > 0:
            best_answer = max(answers, key=lambda x: x["logit_score"])
            predicted_answers.append(
                {"id": example_id, "prediction_text": best_answer["text"]}
            )
        else:
            predicted_answers.append({"id": example_id, "prediction_text": ""})

    theoretical_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return metric.compute(predictions=predicted_answers, references=theoretical_answers)
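The span selection at the heart of compute_metrics() can be traced by hand on a toy input. The sketch below is a stripped-down version that works on plain lists for a single feature; the function name, the logits and the small n_best and max_answer_length values are made up for illustration.

```python
def best_span(start_logits, end_logits, n_best=2, max_answer_length=3):
    """Return (score, start, end) for the highest-scoring valid span, or None."""

    def top(logits):
        # Indices of the n_best largest logits, best first
        return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]

    candidates = []
    for start in top(start_logits):
        for end in top(end_logits):
            # Skip spans with negative length or longer than max_answer_length
            if end < start or end - start + 1 > max_answer_length:
                continue
            candidates.append((start_logits[start] + end_logits[end], start, end))
    return max(candidates) if candidates else None


start_logits = [0.1, 2.0, 0.3, 1.5, 0.2]
end_logits = [0.0, 0.4, 2.5, 0.1, 1.0]
print(best_span(start_logits, end_logits))
```

The candidate starts are tokens 1 and 3 and the candidate ends are tokens 2 and 4. The span (1, 4) is too long and (3, 2) has negative length, so the choice is between (1, 2) and (3, 4).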

# Custom training loop

With a custom training loop we will be able to evaluate the model regularly since we’re not constrained by the Trainer class.

First we need to build the DataLoaders from our datasets. We set the format of those datasets to "torch", and remove the columns in the validation set that are not used by the model. Then, we can use the default_data_collator provided by Transformers as a collate_fn and shuffle the training set, but not the validation set.

Then we will need an optimizer. As usual we use the classic AdamW, which is like Adam, but with a fix in the way weight decay is applied:

In [23]:
import torch

from torch.utils.data import DataLoader
from transformers import default_data_collator

train_dataset.set_format("torch")
validation_set = validation_dataset.remove_columns(["example_id", "offset_mapping"])
validation_set.set_format("torch")

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    validation_set, collate_fn=default_data_collator, batch_size=8
)


from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)
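The difference in the way weight decay is applied can be seen on a single scalar weight. The sketch below is written from the usual description of the two update rules, not taken from PyTorch's source; the function name and hyperparameter values are chosen for illustration only.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.1, wd=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One update of a single scalar weight; returns (new_w, m, v)."""
    if not decoupled:
        grad = grad + wd * w          # Adam + L2: penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)      # bias correction
    v_hat = v / (1 - beta2 ** t)
    if decoupled:
        w = w - lr * wd * w           # AdamW: decay applied directly to the weight
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First step from w = 1.0 with a zero loss gradient, so only weight decay acts.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
print(w_l2, w_adamw)
```

With the L2 variant the decay term passes through Adam's normalization, so the weight moves by about lr (to roughly 0.9) whatever the value of wd. With the decoupled variant the weight shrinks by exactly lr * wd (to 0.99).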

## Accelerator

Once we have all those objects, we can send them to the accelerator.prepare() method. Remember that if you want to train on TPUs in a Colab notebook, you will need to move all of this code into a training function, and you should not execute any cell that instantiates an Accelerator outside that function. We can force mixed-precision training by passing fp16=True to the Accelerator (more recent releases of 🤗 Accelerate use mixed_precision="fp16" instead):

In [24]:
!pip install accelerate
from accelerate import Accelerator

accelerator = Accelerator(fp16=True)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Collecting accelerate
  Downloading accelerate-0.5.1-py3-none-any.whl (58 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.5.1




The length of train_dataloader can only be used to compute the number of training steps after it has gone through the accelerator.prepare() method, since that method may change its length. For the learning rate we use a linear schedule:

In [25]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
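What the "linear" schedule does to the learning rate can be written as a small function. This is a sketch of the usual linear warmup and decay rule, not the library's implementation, and the function name is hypothetical. The step count uses numbers from this notebook and assumes a single process, so that the dataloader is not sharded.

```python
import math


def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# 88524 training features, batch size 8, 3 epochs
steps_per_epoch = math.ceil(88524 / 8)
num_training_steps = 3 * steps_per_epoch
print(num_training_steps)

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_schedule_lr(step, 2e-5, num_training_steps))
```

With no warmup, the learning rate starts at 2e-5, is halved by the midpoint of training and reaches 0 at the last step.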

## Training loop

We are now ready to write the full training loop. After defining a progress bar to follow how training goes, the loop has three parts:

- The training itself, which is the classic iteration over the train_dataloader: a forward pass through the model, then a backward pass and an optimizer step.

- The evaluation, in which we gather all the values for start_logits and end_logits and convert them to NumPy arrays. Once the evaluation loop is finished, we concatenate all the results. Note that we need to truncate them because the Accelerator may have added a few samples at the end to ensure we have the same number of examples in each process.

- Saving and uploading. The first line, accelerator.wait_for_everyone(), tells all the processes to wait until everyone is at that stage before continuing, to make sure we have the same model in every process before saving. Then we grab the unwrapped_model, which is the base model we defined: the accelerator.prepare() method changes the model to work in distributed training, so it won’t have the save_pretrained() method anymore, and the accelerator.unwrap_model() method undoes that step. We call save_pretrained(), telling that method to use accelerator.save() instead of torch.save(), then save the tokenizer and call repo.push_to_hub(). The argument blocking=False tells the 🤗 Hub library to push in an asynchronous process, so training continues normally while this (long) instruction is executed in the background.
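The truncation in the evaluation step can be pictured with a toy model. This is a deliberately simplified sketch and not Accelerate's actual sharding logic: the sizes are hypothetical, and it assumes the extra samples are repeats from the start of the dataset that end up at the tail of the gathered results.

```python
import math

num_examples = 10      # hypothetical evaluation set size
num_processes = 4      # hypothetical number of devices

# Every process must receive the same number of samples, so the total is rounded up.
per_process = math.ceil(num_examples / num_processes)
padded_total = per_process * num_processes

# Toy assumption: the extra slots repeat samples from the start of the dataset.
example_ids = list(range(num_examples))
gathered = example_ids + example_ids[: padded_total - num_examples]

# Slicing back to the true dataset length drops the padded duplicates.
truncated = gathered[:num_examples]
print(len(gathered), len(truncated))
```

On a single device nothing is added, so the slice is a no-op there.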

In [26]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    start_logits = []
    end_logits = []
    accelerator.print("Evaluation!")
    for batch in tqdm(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        start_logits.append(accelerator.gather(outputs.start_logits).cpu().numpy())
        end_logits.append(accelerator.gather(outputs.end_logits).cpu().numpy())

    start_logits = np.concatenate(start_logits)
    end_logits = np.concatenate(end_logits)
    start_logits = start_logits[: len(validation_dataset)]
    end_logits = end_logits[: len(validation_dataset)]

    metrics = compute_metrics(
        start_logits, end_logits, validation_dataset, raw_datasets["validation"]
    )
    print(f"epoch {epoch}:", metrics)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/33198 [00:00<?, ?it/s]

KeyboardInterrupt: ignored

## Results of the fine-tuned model
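Once the training is complete, we can finally evaluate our model. Since we trained with a custom loop rather than the Trainer, we run the model over the eval_dataloader, gather the start and end logits as in the evaluation step of the training loop, and send them to our compute_metrics() function:

In [None]:
model.eval()
with torch.no_grad():
    outputs = [model(**batch) for batch in tqdm(eval_dataloader)]
start_logits = np.concatenate([accelerator.gather(o.start_logits).cpu().numpy() for o in outputs])
end_logits = np.concatenate([accelerator.gather(o.end_logits).cpu().numpy() for o in outputs])
compute_metrics(start_logits[: len(validation_dataset)], end_logits[: len(validation_dataset)], validation_dataset, raw_datasets["validation"])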

As a comparison, the baseline scores reported in the BERT article for the BERT base model are 80.8 (exact match) and 88.5 (F1). DistilBERT is a smaller model, so scores slightly below these are to be expected.

# Inference using the fine-tuned model
To use the fine-tuned model for inference on an example of your own, you can run it locally in a pipeline by specifying the model identifier:

In [None]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "huggingface-course/bert-finetuned-squad"
question_answerer = pipeline("question-answering", model=model_checkpoint)

context = """
🤗 Transformers is backed by the three most popular deep learning libraries — Jax, PyTorch and TensorFlow — with a seamless integration
between them. It's straightforward to train your models with one before loading them for inference with the other.
"""
question = "Which deep learning libraries back 🤗 Transformers?"
question_answerer(question=question, context=context)