# Using pre-trained BERT for building a Q&A system

##### **😉 Hello there!**

Before starting to explore and run this notebook, here are a few things you should know:

<div class="alert alert-info">
<b>🧐 What is BERT?</b>

BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model. A pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. See more details in the official paper [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805).
</div>

<div class="alert alert-info">
<b>🧐 What is transformers?</b>

[Transformers](https://huggingface.co/docs/transformers/index) provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

- Natural Language Processing
- Computer Vision
- Audio
- Multimodal
</div>

<div class="alert alert-info">
<b>🧐 What is HuggingFace?</b>

[Hugging Face, Inc.](https://huggingface.co/) is a French company that develops tools for building applications using machine learning. It is most notable for its Transformers library, built for natural language processing applications, and for its platform that allows users to share machine learning models and datasets. The [HuggingFace Hub](https://huggingface.co/docs/hub/index) is an online platform with over 120k models, 20k datasets, and 50k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together. The Hub works as a central place where anyone can explore, experiment, collaborate and build technology with Machine Learning.
</div>

# Setup

In [2]:
# These are all the libraries and frameworks we're going to use
from datasets import load_dataset, load_metric
from datetime import datetime
import mlflow
import mlflow.pytorch
import torch
from tqdm.autonotebook import tqdm
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer, pipeline



In [3]:
start_time_all_execution = datetime.now() # Start time, used to measure how long the whole notebook takes to run

# Dataset

## <b>🧐 What is the SQuAD dataset?</b>

The Stanford Question Answering Dataset ([SQuAD](https://rajpurkar.github.io/SQuAD-explorer/)) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage. SQuAD 2.0 also adds questions that are unanswerable; the `squad` dataset loaded below is SQuAD 1.1, where every question has an answer.

In this case, we're going to download this dataset from the [HuggingFace datasets](https://huggingface.co/datasets) repository.

In [4]:
squad_dataset = load_dataset("squad") # Downloading the dataset
squad_dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

By printing the squad_dataset variable, we can see it is a dict made up of two key-value pairs: the 'train' key, whose value is the whole train dataset, and the 'validation' key, whose value is the whole validation dataset. In this first part, we're using just the train dataset. Let's explore it a little bit, shall we?

## **The train dataset**

As we saw, we have a dict with two key-value pairs. Let's access the "train" key.

In [5]:
squad_train_dataset = squad_dataset['train']
squad_train_dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

We can see that squad_train_dataset is another type of object, a Dataset (it behaves much like a dict), described by two fields: 'features' and 'num_rows'.
This Dataset has 87599 rows, one per index, and each index corresponds to one question-answer input for our model, like this:



In [6]:
index_input = 1

squad_train_dataset[index_input]

{'id': '5733be284776f4190066117f',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'What is in front of the Notre Dame Main Building?',
 'answers': {'text': ['a copper statue of Christ'], 'answer_start': [188]}}

Indexing into this Dataset object gives us a single input.

Each input is a dict whose keys come from the 'features' list we have just seen, so we have:

- id: A unique id for each input
- title: The title of the article the context comes from
- context: The text in which the model searches for the answer to the question
- question: The question about the context
- answers: The answer text(s) and their start position(s) in the context

Each of these features can be accessed individually, like this:

In [7]:
print(squad_train_dataset[index_input]['title'])
print(squad_train_dataset[index_input]['context'])
print(squad_train_dataset[index_input]['question'])
print(squad_train_dataset[index_input]['answers'])

University_of_Notre_Dame
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
What is in front of the Notre Dame Main Building?
{'text': ['a copper statue of Christ'], 'answer_start': [188]}


Something important to notice here is that the value of the 'answers' key is itself a dict with two keys: 'text' and 'answer_start'.
The 'text' key corresponds to a list with the text answers, so yes! We can have more than one answer for each question in the dataset! The 'answer_start' key holds the character position in the context where each answer begins.
In this case, the input has just one answer. But we should check the rest of the inputs. Let's do this using the filter method.

In [8]:
# Checking in the train dataset if we have just one answer for each question
squad_train_dataset.filter(lambda x: len(x['answers']['text']) != 1)

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

Great, the filter returned no rows, so every question in the train dataset has exactly one answer.
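Since 'answer_start' is a character position in the context, slicing the context at that position should give back the answer text. Here is a minimal check in plain Python. The context is copied by hand and cut right after the answer, and the variable names are only illustrative:

```python
# Beginning of the sample's context, copied by hand and cut right after the answer
context_prefix = (
    "Architecturally, the school has a Catholic character. "
    "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
    "Immediately in front of the Main Building and facing it, "
    "is a copper statue of Christ"
)
sample_answer = {"text": ["a copper statue of Christ"], "answer_start": [188]}

start_char = sample_answer["answer_start"][0]
end_char = start_char + len(sample_answer["text"][0])  # exclusive end position

# Slicing the context with the character positions recovers the answer text
recovered = context_prefix[start_char:end_char]
print(recovered)
```

These start and end character positions come back later, when we build the targets for the model.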

Well, we're done exploring the train dataset. Let's go to the next part, shall we?

# Tokenizer

<b>🧐 What is a tokenizer?</b>

A tokenizer is in charge of preparing the inputs for a model. There are a range of tokenizers, so let's get to know the BERT tokenizer and what it does with the text.

## **BERT Tokenizer**

The first thing we have to do is load the tokenizer of a pre-trained [BERT](https://huggingface.co/docs/transformers/model_doc/bert) model from the HuggingFace Hub. In this case, we're using the [distilbert-base-cased](https://huggingface.co/distilbert-base-cased) model, a distilled (smaller and faster) version of BERT.

In [9]:
model_checkpoint_bbc = "distilbert-base-cased" #"bert-base-cased" is a larger option if you want to test!
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_bbc) # getting the model's tokenizer 

To explore it, let's take the same sample we used when exploring the train dataset.

In [10]:
# getting just context and the question
#print(index_input) # uncomment to remember the index

context = squad_train_dataset[index_input]['context'] 
question = squad_train_dataset[index_input]['question']
print(context)
print(question)

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
What is in front of the Notre Dame Main Building?


Let's pass the question and the context to the tokenizer, in this order, since that is how BERT expects its inputs.

In [11]:
inputs = tokenizer(question, context) # Encoding the inputs
inputs

{'input_ids': [101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 170, 14789, 1282, 1104, 8070, 1105, 9284, 119, 1135, 1110, 170, 16498, 1104, 1103, 176, 10595, 2430, 1120, 10111, 20500, 117, 1699, 1187, 1103, 6567, 2090, 25153, 1193, 1691, 1106, 2216, 17666, 6397, 3786, 1573, 25422, 13149, 1107, 8109, 119, 1335, 1103, 1322, 1104, 1103, 1514, 2797, 113, 1105, 1107, 170, 2904, 1413, 1115, 8200, 1194, 124, 11739, 1105, 1103, 3487, 17917, 114, 117

We can see that it returned a big dict. Let's check its keys.

In [12]:
inputs.keys()

dict_keys(['input_ids', 'attention_mask'])

We have 2 keys here, but in this experiment, we're just interested in the 'input_ids', okay? Let's understand it.

### **input_ids**

Let's get the first 27 items.

In [13]:
print(inputs['input_ids'][0:27])

[101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304]


You may be wondering what those numbers are. Let's decode them with the same tokenizer we've just used.

In [14]:
tokenizer.decode(inputs['input_ids'][0:27])

'[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main'

Just like magic, right? 😌

What the tokenizer did was concatenate the question and the context into one long sequence and return a dict with two keys, where input_ids holds a numerical representation of each token of the original question and context. For example, the token "What" corresponds to the input_id 1327.

Ok, and how about the tokens [CLS] and [SEP]?

Don't worry, I got you! These are called special tokens. [CLS] is the classification token and comes before the question. The [SEP] token is a separator: it marks the end of the question and the end of the context. So, the format that BERT expects is:

**[CLS] 'question' [SEP] 'context' [SEP]**
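To make this layout concrete, here is a small plain-Python sketch that splits the ids printed earlier at the first [SEP]. It assumes that 101 and 102 are the ids of [CLS] and [SEP] in this vocabulary, which is what the decoded output above suggests. `split_at_sep` is a hypothetical helper written for this example, not part of the transformers library.

```python
CLS_ID, SEP_ID = 101, 102  # assumed from the decoded output above

# The first 27 input_ids printed earlier in the notebook
ids = [101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102,
       22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304]

def split_at_sep(ids):
    """Split '[CLS] question [SEP] context ...' into (question_ids, context_ids)."""
    if ids[0] != CLS_ID:
        raise ValueError("expected the sequence to start with [CLS]")
    first_sep = ids.index(SEP_ID)  # position of the [SEP] that closes the question
    question_ids = ids[1:first_sep]
    context_ids = ids[first_sep + 1:]
    # Drop the closing [SEP] if the sequence includes it
    if context_ids and context_ids[-1] == SEP_ID:
        context_ids = context_ids[:-1]
    return question_ids, context_ids

question_ids, context_ids = split_at_sep(ids)
print(len(question_ids), len(context_ids))
```

Everything between [CLS] and the first [SEP] belongs to the question, and everything after it belongs to the context.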

### **BERT and (way too) long contexts**

We saw that the context of the inputs can be very long. Let's get the context from the previous part.

In [15]:
# Count the periods as a rough estimate of the number of sentences
sentences = context.count('.')

print(context)
print(f"Total number of characters: {len(context)}") # len() on a string counts characters, not words
print(f"Total number of sentences: {sentences}")

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Total number of characters: 695
Total number of sentences: 7


This used to be a concern in NLP, since it's quite different from other common applications, such as next sentence prediction, where the inputs are just single sentences. And that's not all: BERT can only handle a limited number of tokens! (At the time of writing, in 2023, the limit is 512 tokens.) You might think, "Why don't we just truncate the context?" Well, this is a terrible option, since the answer could be cut off.

The solution is to split the context into **multiple context windows**!
This means that one data sample turns into multiple data samples, and at least one of them will contain the answer.

But what if part of the answer falls at the end of one window, gets cut off, and the rest lands in the next window?

Well, in this case, we use **overlapping windows**!
All of these concepts are easier to understand and visualize when we use the tokenizer again. Let's go!
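Before we do, here is a rough plain-Python sketch of the overlapping-window idea on its own. It is only an illustration of the concept, not the tokenizer's actual code, and `make_windows`, `window_size` and `overlap` are made-up names.

```python
def make_windows(tokens, window_size, overlap):
    """Split `tokens` into windows of at most `window_size` items,
    where consecutive windows share `overlap` items."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be non-negative and smaller than window_size")
    step = window_size - overlap  # how far each new window moves forward
    windows = []
    start = 0
    while True:
        end = min(start + window_size, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):  # this window reaches the end of the text
            break
        start += step
    return windows

tokens = list(range(20))  # stand-in for 20 context tokens
for window in make_windows(tokens, window_size=8, overlap=3):
    print(window)
```

With 20 tokens, a window size of 8 and an overlap of 3, each window starts 5 tokens after the previous one, so the last 3 tokens of a window are repeated at the start of the next.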

### **Understanding the model's tokenizer**

As before, we're using the same question and context from the previous part, passed to the tokenizer in this specific order, along with a few new arguments:

- max_length: the maximum length of the entire input (including the question, the context and the special tokens [CLS] and [SEP]).
- truncation: "only_second" means that only the second input, the context, is truncated.
- stride: how many tokens consecutive context windows share, i.e. the overlap between them when the context is split up.
- return_overflowing_tokens: tells the tokenizer to return the tokens that overflow max_length as additional windows instead of discarding them.

In [16]:
inputs = tokenizer(
    question,
    context, 
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True
)

Let's see what we've got in the inputs and its keys.

In [17]:
print(inputs)
print(inputs.keys())

{'input_ids': [[101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 22182, 1193, 117, 1103, 1278, 1144, 170, 2336, 1959, 119, 1335, 4184, 1103, 4304, 4334, 112, 188, 2284, 10945, 1110, 170, 5404, 5921, 1104, 1103, 6567, 2090, 119, 13301, 1107, 1524, 1104, 1103, 4304, 4334, 1105, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 102], [101, 1327, 1110, 1107, 1524, 1104, 1103, 10360, 8022, 4304, 4334, 136, 102, 4749, 1122, 117, 1110, 170, 7335, 5921, 1104, 4028, 1114, 1739, 1146, 14089, 5591, 1114, 1103, 7051, 107, 159, 21462, 1566, 24930, 2508, 152, 1306, 3965, 107, 119, 5893, 1106, 1103, 4304, 4334, 1110, 1103, 19349, 1104, 1103, 11373, 4641, 119, 13301, 1481, 1103, 171, 17506, 9538, 1110, 1103, 144, 10595, 2430, 117, 1

Although the 'attention_mask' key appears again, we're only interested in the other 2 keys: 'input_ids' and 'overflow_to_sample_mapping'. Let's dive in!

#### **input_ids**

We can see that 'input_ids' is now a list of lists.

In [18]:
print(f'Total number of lists: {len(inputs["input_ids"])}')

Total number of lists: 4


Let's decode those lists and see what we have.

In [19]:
for ids in inputs["input_ids"]: # `ids` avoids shadowing the built-in id()
    print(tokenizer.decode(ids))

[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the G [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernade [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] of the Sacred Heart. Immediately behind the basilica is the Grotto, 

Wonderful! The tokenizer has split the context into 4 different inputs and has kept the question and the special tokens in each one of them.

#### **overflow_to_sample_mapping**

To better demonstrate what this new key is, let's pass more than one input sample to the tokenizer.

In [20]:
question_samples = squad_train_dataset[:3]["question"] # Getting the first 3 questions
context_samples = squad_train_dataset[:3]["context"] #  and contexts

for i in question_samples:
    print(i)

To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
What is in front of the Notre Dame Main Building?
The Basilica of the Sacred heart at Notre Dame is beside to which structure?


In [21]:
print(context) # All 3 questions share the same context, so printing it once is enough

Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.


Let's set up the tokenizer with these new samples and one more argument to understand the overflow mapping:

- return_offsets_mapping: this returns the start and end character for each token (it will be explained after this part)

In [22]:
inputs = tokenizer(
    question_samples, 
    context_samples,
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

Let's see how many windows we have now with 3 inputs:

In [23]:
print(len(inputs["input_ids"]))

12


Great! But how do we know how many windows each input was split into? That's what overflow_to_sample_mapping tells us! Look at this:

In [24]:
inputs["overflow_to_sample_mapping"]

[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

The first sample corresponds to 0, so it was split into 4 windows. The second sample corresponds to 1, and it was also split into 4 windows. The same goes for the third sample, which is number 2. Since they come from the same context, it's usual for the samples to have the same number of windows.
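The same reading of the mapping can be done in code. This is a small sketch in plain Python that works on the list printed above; the variable names are only illustrative.

```python
from collections import Counter

# The overflow_to_sample_mapping output shown above
overflow_to_sample_mapping = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]

# How many windows each original sample produced
windows_per_sample = Counter(overflow_to_sample_mapping)
print(dict(windows_per_sample))

# Which window indices belong to each original sample
windows_by_sample = {}
for window_idx, sample_idx in enumerate(overflow_to_sample_mapping):
    windows_by_sample.setdefault(sample_idx, []).append(window_idx)
print(windows_by_sample)
```

The second dict is the lookup we need later: for any window, it tells us which original sample (and therefore which answer) it belongs to.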

In [25]:
for ids in inputs["input_ids"]: # `ids` avoids shadowing the built-in id()
    print(tokenizer.decode(ids))

[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basi [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin [SEP]
[CLS] To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? [SEP] Next to the Main Building is the B

#### **offset_mapping**

For a clearer demonstration, let's go back to the single input and pass it to the same tokenizer from the previous part.

In [26]:
# print(question, "\n", context) # uncomment this line if you don't remember them

In [27]:
inputs = tokenizer(
    question,
    context, 
    max_length=100,
    truncation="only_second",
    stride=50,
    return_overflowing_tokens=True,
    return_offsets_mapping=True
)

inputs.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping'])

We now have a new key, offset_mapping. Let's take a look at it.

In [28]:
# inputs['offset_mapping']

We can see that it is a list composed of lists of tuples.

In [29]:
print(f"Total number of lists: {len(inputs['offset_mapping'])}")

Total number of lists: 4


Basically, the offset_mapping tells us the location of the start and the end of each token in the **ORIGINAL** samples! Take a look at this:

In [30]:
print(tokenizer.decode(inputs['input_ids'][0])) # Taking the first window
print(inputs['offset_mapping'][0])

[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the G [SEP]
[(0, 0), (0, 4), (5, 7), (8, 10), (11, 16), (17, 19), (20, 23), (24, 29), (30, 34), (35, 39), (40, 48), (48, 49), (0, 0), (0, 13), (13, 15), (15, 16), (17, 20), (21, 27), (28, 31), (32, 33), (34, 42), (43, 52), (52, 53), (54, 56), (56, 58), (59, 62), (63, 67), (68, 76), (76, 77), (77, 78), (79, 83), (84, 88), (89, 91), (92, 93), (94, 100), (101, 107), (108, 110), (111, 114), (115, 121), (122, 126), (126, 127), (128, 139), (140, 142), (143, 148), (149, 151), (152, 155), (156, 160), (161, 169), (170, 173), (174, 180), (181, 183), (183, 184), (185, 187

Looking at the sentence and the offset_mapping, (0, 0) is used for the special tokens. (0, 4) is the location of the token "What", which is made up of 4 chars: it starts at position 0 and ends just before position 4 (the end position is exclusive). Next, (5, 7) is the location of the token "is", which starts at position 5 and ends before position 7, and so on. After the first [SEP], the positions start again from 0 because they now refer to the context instead of the question.
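Because each pair is just a (start, end) slice of the original string, we can rebuild the tokens by hand. A minimal sketch in plain Python, using the question and the question offsets copied from the output above:

```python
question_text = "What is in front of the Notre Dame Main Building?"

# Offsets of the question tokens, copied from the output above
# (the (0, 0) entries of the special tokens are left out)
question_offsets = [(0, 4), (5, 7), (8, 10), (11, 16), (17, 19), (20, 23),
                    (24, 29), (30, 34), (35, 39), (40, 48), (48, 49)]

# Each (start, end) pair is a slice of the original string
pieces = [question_text[start:end] for start, end in question_offsets]
print(pieces)
```

Note that the spaces between words are not covered by any pair, which is why the positions jump from 4 to 5, from 7 to 8, and so on.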

Notice that the word 'original' from the previous text is in bold and upper case. Let's understand why this is important in the next step.

# Targets

We have just seen that we can work with long contexts by splitting them into multiple windows, right?

In [31]:
for ids in inputs["input_ids"]: # `ids` avoids shadowing the built-in id()
    print(tokenizer.decode(ids))

[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the G [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernade [SEP]
[CLS] What is in front of the Notre Dame Main Building? [SEP] of the Sacred Heart. Immediately behind the basilica is the Grotto, 


Well, the thing is that the answer in the dataset comes with a start position, remember?

In [32]:
answer = squad_train_dataset[index_input]['answers']
print(answer)
#print(context) # uncomment here to remember the original context

{'text': ['a copper statue of Christ'], 'answer_start': [188]}


But this start position is a character position within the original context, before it was split. After splitting the context into windows, that position is no longer valid, and what the model expects as a target is exactly the position of the answer inside each window, expressed as token indices. So we need to align these targets within the windows, keeping in mind that the answer may be missing from a given window or only partly present in it.

There's a useful method for this, sequence_ids, which we can call on the object returned by the tokenizer (it is available with the fast tokenizers of models such as DistilBERT or bert-base-uncased). Let's take a look.

### **sequence_ids method**

In [33]:
print(inputs.sequence_ids(0)) # Getting the first window
print(tokenizer.decode(inputs['input_ids'][0])) # first window context

[None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, None]
[CLS] What is in front of the Notre Dame Main Building? [SEP] Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend " Venite Ad Me Omnes ". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the G [SEP]


The meaning of each item is what follows:

- None: special tokens like [CLS] and [SEP]
- 0: is part of the question sentence
- 1: is part of the context sentence

So, with this encoding, we can now compare the start position of the answer in the original context with the start position of the answer in the windows!

### **Finding the answer: Window contexts and original context**

Let's see now what is the start index of the answer within the windows.

In [34]:
sequence_ids = inputs.sequence_ids(0) # Getting the first window

wind_ctx_start = sequence_ids.index(1) # Index of the first 1, which is where the context begins
wind_ctx_end = (len(sequence_ids) - sequence_ids[::-1].index(1) - 1) # Index of the last 1, which is where the context ends

wind_ctx_start, wind_ctx_end

(13, 98)
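The same index arithmetic can be tried on a small hand-made list. The list below is made up for illustration. It also ends with extra None entries, which is how padded positions are expected to show up once padding is switched on (an assumption worth checking against your own tokenizer output).

```python
# Hand-made sequence ids: [CLS], 2 question tokens, [SEP], 3 context tokens, [SEP], 2 pads
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None, None, None]

# Index of the first 1: where the context begins
toy_ctx_start = toy_sequence_ids.index(1)

# Find the first 1 in the reversed list, then convert it back to a forward index
toy_ctx_end = len(toy_sequence_ids) - toy_sequence_ids[::-1].index(1) - 1

print(toy_ctx_start, toy_ctx_end)
```

Searching the reversed list is what makes the end index robust to whatever comes after the context, whether that is a single [SEP] or a run of padding.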

And here is the original answer position (within the context), as start and end chars:

In [35]:
print(answer)
ans_start_char = answer['answer_start'][0]
ans_end_char = ans_start_char + len(answer['text'][0]) # the start char plus the length of the answer text gives the end char (exclusive)

print((ans_start_char, ans_end_char))

{'text': ['a copper statue of Christ'], 'answer_start': [188]}
(188, 213)


OK! Let's now use the offset mapping since it tells us about the char positions within the context.

In [36]:
offset = inputs['offset_mapping'][0] # First window
# print(offset) # uncomment to remember the offset
# print(tokenizer.decode(inputs['input_ids'][0])) # and how they correspond to the original sentences

The offset holds the original start and end chars of every token, plus (0, 0) for the special tokens (here we care about the [SEP] tokens, which tell us where the context starts and ends). So we can check whether the answer's original char positions fall inside this window and, if they do, find the token indices that match them.

In [37]:
start_idx = 0
end_idx = 0

if offset[wind_ctx_start][0] > ans_start_char or offset[wind_ctx_end][1] < ans_end_char:
    print("target is (0,0)")
else:
    i = wind_ctx_start
    for start_end_char in offset[wind_ctx_start:]:
        start, end = start_end_char
        if start == ans_start_char:
            start_idx = i

        if end == ans_end_char:
            end_idx = i 
            break

        i += 1
    
start_idx, end_idx

(53, 57)

We need to take these indexes, use them to slice the input_ids of the first window, and then decode the slice to see if the answers match.

In [38]:
input_ids = inputs['input_ids'][0]
# tokenizer.decode(input_ids) # uncomment to visualize

In [39]:
# Placing the start_idx and end_idx and decoding.
print(input_ids[start_idx:end_idx+1])
print(tokenizer.decode(input_ids[start_idx : end_idx + 1]))

[170, 7335, 5921, 1104, 4028]
a copper statue of Christ


And the real answer:

In [40]:
answer['text']

['a copper statue of Christ']

Yay! It matches!

Now, we just have to turn this into a function.

In [41]:
def find_asnwer_token_idx(
        ctx_start,
        ctx_end,
        ans_start_char,
        ans_end_char,
        offset
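):
    """Find the token indices of the answer inside one context window.

    Returns (0, 0) when the answer is not fully inside the window.
    """
    start_idx = 0
    end_idx = 0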

    if offset[ctx_start][0] > ans_start_char or offset[ctx_end][1] < ans_end_char:
        pass #answer does not exist
    else:
        i = ctx_start
        # aligning the indices of the answers within the context windows
        for start_end_char in offset[ctx_start:]:
            start, end = start_end_char
            if start == ans_start_char:
                start_idx = i

            if end == ans_end_char:
                end_idx = i 
                break

            i += 1
    return start_idx, end_idx
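The function above only finds the answer when its first and last chars fall exactly on token boundaries. As an alternative sketch (a different technique, not the method used in this notebook), the comparison can be relaxed so that the answer maps to the tokens that cover it. The name `find_answer_token_span` and the toy offsets below are made up for illustration.

```python
def find_answer_token_span(ctx_start, ctx_end, ans_start_char, ans_end_char, offset):
    """Return (start_idx, end_idx) of the tokens covering the answer,
    or (0, 0) if the answer is not fully inside this window."""
    if offset[ctx_start][0] > ans_start_char or offset[ctx_end][1] < ans_end_char:
        return 0, 0

    # Last context token that starts at or before the answer's first char
    idx = ctx_start
    while idx <= ctx_end and offset[idx][0] <= ans_start_char:
        idx += 1
    start_idx = idx - 1

    # First context token that ends at or after the answer's end char
    idx = ctx_end
    while idx >= ctx_start and offset[idx][1] >= ans_end_char:
        idx -= 1
    end_idx = idx + 1
    return start_idx, end_idx

# Toy window: [CLS] Who ? [SEP] a copper statue of Christ stands [SEP]
toy_offset = [(0, 0), (0, 3), (3, 4), (0, 0),
              (0, 1), (2, 8), (9, 15), (16, 18), (19, 25), (26, 32), (0, 0)]

exact = find_answer_token_span(4, 9, 2, 15, toy_offset)      # "copper statue"
mid_token = find_answer_token_span(4, 9, 3, 15, toy_offset)  # starts inside "copper"
outside = find_answer_token_span(4, 9, 40, 45, toy_offset)   # beyond the window
print(exact, mid_token, outside)
```

For answers that line up with token boundaries, as in the sample we've been using, both approaches should give the same indices. The relaxed version only matters when an annotated answer starts or ends in the middle of a token.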
        

In [42]:
# now applying it to every window of this sample
start_idxs = []
end_idxs = []

for i, offset in enumerate(inputs["offset_mapping"]):
    sequence_ids = inputs.sequence_ids(i)

    ctx_start = sequence_ids.index(1)
    ctx_end = len(sequence_ids) - sequence_ids[::-1].index(1) - 1

    start_idx, end_idx = find_asnwer_token_idx(
        ctx_start,
        ctx_end,
        ans_start_char,
        ans_end_char,
        offset
    )

    start_idxs.append(start_idx)
    end_idxs.append(end_idx)

start_idxs, end_idxs

([53, 17, 0, 0], [57, 21, 0, 0])

We get one (start, end) pair per window because of the overlapping windows, remember?
In this input we have 4 windows. In the first window, the answer starts at index 53 and ends at index 57. In the second window, it starts at index 17 and ends at index 21. In the third and fourth windows, the answer does not appear, so both indices are 0. 😉
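The shift between the first two windows can be checked with a bit of arithmetic. This is a sketch based on the numbers printed in this notebook: a max_length of 100, a stride of 50, and the 11 question tokens plus 3 special tokens visible in the sequence_ids output. The variable names are only illustrative.

```python
window_max_length = 100   # max_length used in this section
window_stride = 50        # overlap between consecutive windows
question_tokens = 11      # number of 0s in the sequence_ids output
special_tokens = 3        # [CLS] and the two [SEP] tokens

# Room left for context tokens in each window
context_capacity = window_max_length - question_tokens - special_tokens

# Each new window starts this many context tokens after the previous one
step = context_capacity - window_stride

start_in_first_window = 53
start_in_second_window = start_in_first_window - step
print(context_capacity, step, start_in_second_window)
```

The question and special tokens take the same positions in every window, so the same answer token simply sits `step` positions earlier in the second window.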

### Applying the tokenizer

One common issue in this dataset is that some questions are badly formatted and have extra white spaces at the beginning or at the end.

In [43]:
for q in squad_dataset["train"]["question"][:1000]:
    if q.strip() != q:
        print(q)

In what city and state did Beyonce  grow up? 
 The album, Dangerously in Love  achieved what spot on the Billboard Top 100 chart?
Which song did Beyonce sing at the first couple's inaugural ball? 
What event did Beyoncé perform at one month after Obama's inauguration? 
Where was the album released? 
What movie influenced Beyonce towards empowerment themes? 


So, let's define our tokenizer function and add this particular part for dealing with extra white space.

In [44]:
# Defining some fixed args
max_length = 384 # Indicated by Google
stride = 128
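
To get a feeling for what `stride` does, here is a rough sketch of how a long context could be cut into overlapping windows. It is an approximation of the idea, not the tokenizer's exact algorithm: the helper `window_spans` is hypothetical, and it ignores the room that the question and the special tokens take inside `max_length`.

```python
def window_spans(n_context_tokens, window_size, overlap):
    """Split n_context_tokens into windows of at most window_size tokens,
    where consecutive windows share `overlap` tokens."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    step = window_size - overlap
    spans, start = [], 0
    while True:
        end = min(start + window_size, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break
        start += step
    return spans


print(window_spans(10, 4, 2))            # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(len(window_spans(700, 350, 128)))  # 3
```

The overlap is what lets an answer that sits on a window boundary still appear complete in at least one window.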

In [45]:
def tokenize_fn_train(batch):
    questions = [q.strip() for q in batch['question']]

    inputs = tokenizer(
        questions, 
        batch['context'],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    # Pop these two so they are not passed to the model; we still use them below to align the answers
    offset_mapping = inputs.pop("offset_mapping")
    orig_sample_idxs = inputs.pop("overflow_to_sample_mapping")

    # From the original dataset
    answers = batch['answers']
    start_idxs, end_idxs = [], []

    # Loops we just saw
    for i, offset in enumerate(offset_mapping):
        sample_idx = orig_sample_idxs[i]
        answer = answers[sample_idx]

        ans_start_char = answer['answer_start'][0]
        ans_end_char = ans_start_char + len(answer['text'][0])

        sequence_ids = inputs.sequence_ids(i)

        # Aligning the indexes
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - sequence_ids[::-1].index(1) - 1

        start_idx, end_idx = find_asnwer_token_idx(
            ctx_start,
            ctx_end,
            ans_start_char,
            ans_end_char,
            offset
        )

        start_idxs.append(start_idx)
        end_idxs.append(end_idx)

    inputs["start_positions"] = start_idxs
    inputs["end_positions"] = end_idxs

    return inputs

#### Tokenizing the train dataset

In [46]:
train_dataset = squad_train_dataset.map(
    tokenize_fn_train,
    batched=True,
    remove_columns=squad_train_dataset.column_names
)

In [47]:
# The actual train dataset is a little bit bigger than the original
# because we've split the long contexts into windows
print(f'Processed dataset: {len(train_dataset)}\nOriginal dataset: {len(squad_dataset["train"])}')

Processed dataset: 88729
Original dataset: 87599


Creating the same function for the validation dataset

In [48]:
def tokenize_fn_validation(batch):
    questions = [q.strip() for q in batch['question']]

    inputs = tokenizer(
        questions, 
        batch['context'],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length"
    )

    orig_sample_idxs = inputs.pop("overflow_to_sample_mapping")
    sample_ids = []

    for i in range(len(inputs["input_ids"])):
        # Getting the corresponding ID from the original samples (they identify the questions and contexts, remember?)
        sample_idx = orig_sample_idxs[i]
        sample_ids.append(batch["id"][sample_idx])

        sequence_ids = inputs.sequence_ids(i) # 1: context | 0: question | None: special tokens
        offset = inputs["offset_mapping"][i] # getting the offset mapping for this window

        # Modifying the original offset_mapping:
        # keep the offsets of the context tokens (sequence id 1)
        # and replace everything else (question and special tokens) with None
        inputs["offset_mapping"][i] = [
            x if sequence_ids[j] == 1 else None for j, x in enumerate(offset)
        ]

    inputs["sample_id"] = sample_ids
    return inputs
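
As a small illustration of the masking step in the function above, here is a toy example. The `sequence_ids` and `offset` values are invented; only the list comprehension is the same as in the notebook.

```python
# Toy values: None marks special tokens, 0 the question, 1 the context.
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offset = [(0, 0), (0, 4), (5, 9), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]

# Keep the offsets of context tokens only; everything else becomes None.
masked = [x if sequence_ids[j] == 1 else None for j, x in enumerate(offset)]
print(masked)
# [None, None, None, None, (0, 3), (4, 7), (8, 11), None]
```

Note that the question offsets also start at character 0, because they index into the question string. Without the masking, a question token could be mistaken for a position in the context.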

In [49]:
validation_dataset = squad_dataset["validation"].map(
    tokenize_fn_validation,
    batched=True,
    remove_columns=squad_dataset["validation"].column_names
)

print(f'Processed dataset: {len(validation_dataset)}\nOriginal dataset: {len(squad_dataset["validation"])}')

Processed dataset: 10822
Original dataset: 10570


## Metrics and Logits

We can load a metric called "squad" and use it in our problem! Let's see how it works.

In [50]:
metric = load_metric("squad")

  metric = load_metric("squad")
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


In [51]:
# making some examples

pred_answers = [
    {'id': '1', 'prediction_text': 'Strawberry'},
    {'id': '2', 'prediction_text': 'Agriculture industry'},
    {'id': '3', 'prediction_text': 'Red'}
]

true_answers = [
    {'id': '1', 'answers': {'text': ['Strawberry'], 'answer_start': [80]}},
    {'id': '2', 'answers': {'text': ['Agroindustry'], 'answer_start': [65]}},
    {'id': '3', 'answers': {'text': ['Red'], 'answer_start': [100]}}
]

# checking the metrics

metric.compute(predictions=pred_answers, references=true_answers)

{'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
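
To see where these two numbers come from, here is a simplified sketch of exact match and token-level F1. It is not the code behind the "squad" metric: the official evaluation also strips punctuation and articles and takes the best score over several reference answers. The helper names `exact_match` and `token_f1` are made up for this example.

```python
from collections import Counter


def exact_match(prediction, truth):
    # 1.0 when both strings are equal after lowercasing and trimming.
    return float(prediction.strip().lower() == truth.strip().lower())


def token_f1(prediction, truth):
    # Harmonic mean of precision and recall over whitespace-separated tokens.
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


pairs = [
    ("Strawberry", "Strawberry"),
    ("Agriculture industry", "Agroindustry"),
    ("Red", "Red"),
]
em = 100 * sum(exact_match(p, t) for p, t in pairs) / len(pairs)
f1 = 100 * sum(token_f1(p, t) for p, t in pairs) / len(pairs)
print(round(em, 2), round(f1, 2))  # 66.67 66.67
```

Two of the three predictions match exactly and the second one shares no token with its reference, which is why both scores land at two thirds.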

But, before heading to metrics, let's remember that the model outputs are logits, which are just numbers! We have to turn them back into text.

For that, we're downloading a pretrained question-answering model to get predictions that are not random, and we'll use those predictions to learn how to convert the logits into answer strings.

With it, we won't need the whole dataset, just a part of it for learning how to turn the logits into strings!

Let's dive in!

##### Learning how to transform logits into answers

In [52]:
small_validation_dataset = squad_dataset["validation"].select(range(100)) # Getting just the first 100 samples from the validation set 
trained_checkpoint = "distilbert-base-cased-distilled-squad" # model trained in q&a

tokenizer2 = AutoTokenizer.from_pretrained(trained_checkpoint) # new tokenizer from distilbert-base-cased-distilled-squad

# Here, since the tokenizer is a global variable
# and we want to tokenize for another model (already fine-tuned on Q&A),
# we're temporarily swapping this global variable for tokenizer2
original_tokenizer = tokenizer
tokenizer = tokenizer2

Now, let's process this small validation dataset

In [53]:
small_validation_processed = small_validation_dataset.map( # Now, we can use this new tokenizer from distilbert-base-cased-distilled-squad
    tokenize_fn_validation,                                 # and map it into our small validation dataset using the function tokenize_fn_validation
    batched=True,
    remove_columns=squad_dataset["validation"].column_names
)

Once this cell is done, let's switch back to the original tokenizer from the distilbert-base-cased model.

In [54]:
tokenizer = original_tokenizer
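
Swapping a global variable works, but it is easy to forget to swap it back. One possible alternative, sketched here with toy stand-ins, is to pass the tokenizer as an argument and bind it with `functools.partial`. The names `tokenize_with`, `toy_tokenizer_a` and `toy_tokenizer_b` are hypothetical and only show the pattern; with the real data, the `fn_kwargs` argument of `Dataset.map` could probably play the same role.

```python
from functools import partial


def tokenize_with(batch, tokenizer):
    # Same clean-up as in the notebook, but the tokenizer is a parameter.
    questions = [q.strip() for q in batch["question"]]
    return tokenizer(questions, batch["context"])


def toy_tokenizer_a(questions, contexts):
    # Stand-in tokenizer: counts whitespace-separated tokens.
    return {"n_tokens": [len((q + " " + c).split()) for q, c in zip(questions, contexts)]}


def toy_tokenizer_b(questions, contexts):
    # Stand-in tokenizer: counts characters.
    return {"n_tokens": [len(q + c) for q, c in zip(questions, contexts)]}


batch = {"question": [" Who? "], "context": ["Nobody knows."]}
fn_a = partial(tokenize_with, tokenizer=toy_tokenizer_a)
fn_b = partial(tokenize_with, tokenizer=toy_tokenizer_b)
print(fn_a(batch))  # {'n_tokens': [3]}
print(fn_b(batch))  # {'n_tokens': [17]}
```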

Now, it's time to change a few things in our small dataset: remove the unused columns and switch to the torch format, so we can pass the inputs to the GPU.

In [55]:
small_model_inputs =  small_validation_processed.remove_columns(['sample_id', 'offset_mapping']) # unused columns
small_model_inputs.set_format("torch")

#### Setting the GPU

Once the previous step is done, it's time to set the GPU as our device (falling back to the CPU if no GPU is available) and move the inputs (now tensors) there.

In [56]:
# Setting the GPU as current device 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [57]:
small_model_inputs_gpu = {
    k: small_model_inputs[k].to(device) for k in small_model_inputs.column_names
}
# All the data will come from the GPU now

Downloading the distilbert-base-cased-distilled-squad model and moving it to the GPU

In [58]:
trained_model =  AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)

Getting the model's output

In [59]:
with torch.no_grad(): # Disable gradient computation, since we're only running inference (not training)
    outputs = trained_model(**small_model_inputs_gpu) # passing the inputs to distilbert-base-cased-distilled-squad and getting the outputs

Great! Let's see those outputs

In [60]:
outputs

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[ -2.2607,  -5.1783,  -5.2709,  ...,  -9.5243,  -9.5183,  -9.5288],
        [ -2.5961,  -5.5482,  -5.5313,  ...,  -9.9598,  -9.9533,  -9.9860],
        [ -3.7127,  -7.1848,  -8.5388,  ..., -11.6557, -11.6571, -11.6505],
        ...,
        [ -2.0260,  -4.4167,  -4.4980,  ...,  -8.1479,  -8.1530,  -8.1760],
        [ -4.1553,  -5.8304,  -7.1643,  ..., -10.5255, -10.5251, -10.4890],
        [ -3.2000,  -5.8162,  -6.7249,  ...,  -9.4935,  -9.5038,  -9.4871]],
       device='cuda:0'), end_logits=tensor([[ -0.7353,  -4.9236,  -5.1048,  ...,  -8.8734,  -8.8916,  -8.8550],
        [ -1.3056,  -5.3870,  -5.4945,  ...,  -9.4895,  -9.5039,  -9.4959],
        [ -2.7649,  -7.2201,  -9.0916,  ..., -11.3106, -11.3414, -11.2702],
        ...,
        [ -0.0768,  -4.8210,  -4.4374,  ...,  -8.0483,  -8.0502,  -7.9903],
        [ -2.7347,  -5.3650,  -7.2549,  ..., -10.0498, -10.0661,  -9.9886],
        [ -1.0991,  -4.2569,  -6.1267,  ...,  -8

A QuestionAnsweringModelOutput object holds the start_logits and the end_logits. The loss is None here because we did not pass any labels.

#### Turning the logits into IDs

In [61]:
# Here, we're getting the logits, moving back to CPU and formatting as a numpy array (we don't need them in the tensor format anymore)
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

Let's remember what the IDs look like in small_validation_processed:

In [62]:
small_validation_processed["sample_id"][:3] # remember that small_validation_processed was processed by distilbert-base-cased-distilled-squad tokenizer

['56be4db0acb8001400a502ec',
 '56be4db0acb8001400a502ed',
 '56be4db0acb8001400a502ee']

And what they look like in our validation dataset:

In [63]:
validation_dataset["sample_id"][:3]

['56be4db0acb8001400a502ec',
 '56be4db0acb8001400a502ed',
 '56be4db0acb8001400a502ee']

Also, they're not unique! Remember that one ID can appear in more than one question-context input because of the windows? One input can be split into 2, 3 or more windows, but they all still belong to the same original sample.

In [64]:
print(f"Total ID's from validation: {len(validation_dataset['sample_id'])},\nTotal unique ID's from validation: {len(set(validation_dataset['sample_id']))}")

Total ID's from validation: 10822,
Total unique ID's from validation: 10570


So, to handle this case, we're building a dict where the key is the ID and the value is a list of the indices that this ID corresponds to in small_validation_processed. If we have an input that was split into 3 parts, for example, we want something like this:

{'56be4db0acb8001400a502ef': [5, 6, 7]}

In [65]:
sample_id2idxs = {}

for i, id_ in enumerate(small_validation_processed['sample_id']): # looping through all the IDs and enumerating to get the index
    if id_ not in sample_id2idxs: # Checking if this ID already exists
        sample_id2idxs[id_] = [i] # If not, we create an entry with the format we just saw above.
    else:
        print("here") # If it exists (this print just flags a sample that was split into several windows),
        sample_id2idxs[id_].append(i) # we just append to the existing list

In [66]:
# sample_id2idxs # uncomment to see the result
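
The same grouping can also be written with `collections.defaultdict`, which removes the need for the `if`/`else`. This is only a sketch of an equivalent pattern on made-up IDs, not a change to the cell above.

```python
from collections import defaultdict

# Made-up IDs: "b" was split into 2 windows and "c" into 3.
sample_ids = ["a", "b", "b", "c", "c", "c"]

id2idxs = defaultdict(list)
for i, id_ in enumerate(sample_ids):
    id2idxs[id_].append(i)  # a missing key starts as an empty list

print(dict(id2idxs))  # {'a': [0], 'b': [1, 2], 'c': [3, 4, 5]}
```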

Great! Now, let's understand how we're turning this into strings.

Let's check the shape of our logits. We expect them to be in this shape:

(number_of_samples, max_length)

In [67]:
start_logits.shape, end_logits.shape # remember that they come from the outputs we got from distilbert-base-cased-distilled-squad

((100, 384), (100, 384))

We need to sort the indices of the logits so that we know at which positions the largest values are stored.
First, we negate the values by placing a '-' in front of the array, so that the largest logits become the smallest numbers.

In [68]:
# uncomment to see
#print(start_logits[0])
#print()
#print(-start_logits[0])

Then, when we call the argsort method, which sorts in ascending order, the indices of the largest original values come first. We get this result:

In [69]:
indices = (-start_logits[0]).argsort() # here, we are taking just the first position for example.
indices

array([ 46,  57,  47,  38,  39,  58,  50,  43,  45,  54,  56,  49,  13,
        42,  40,  35,  27,  31,  48,  41,  53,  44,  37,  59,  78,  15,
         0,  52,  24,  65,  81,  70,  18,  51,  55,  26,  69,  29,  28,
        75,  61,  64,  23,  36,  32,  11, 101,  62,  66,  34,  95,  30,
        63,  21,  19,  20,  17,  14,  22,  33,  68,  87, 171,  12,  76,
        71,  73,  92, 110,  84, 151,   1,  74,   2,   6,  16,  80,  79,
       105,  98,  10,  96, 136, 169, 106, 100,  93, 165,  67, 109,   8,
        90,   3, 115,  60,   5,  97,   7, 103, 102,  86,  72, 111,  89,
       108,   4,  88,  25, 132,  77, 123, 150, 124, 153,  83, 118,  82,
        85, 107, 114, 143, 164, 137, 130, 166, 159, 131,  91,   9, 144,
       139, 160,  94, 141, 128, 112, 134, 152, 170, 154, 117, 127, 104,
       140, 157, 155, 133, 145, 119, 162, 138, 135, 156, 167, 168, 126,
       148, 163, 161, 116,  99, 120, 142, 158, 125, 146, 113, 121, 147,
       149, 129, 122, 311, 312, 304, 309, 313, 310, 300, 307, 31

If we use those indices in the original array, we get this result:

In [70]:
start_logits[0][indices]

array([10.69444   ,  9.803681  ,  4.4599767 ,  4.400482  ,  2.9437776 ,
        2.701735  ,  2.012642  ,  1.5780758 ,  0.52236927,  0.02073596,
       -0.02802782, -0.04971706, -0.3857315 , -0.6945391 , -0.7979498 ,
       -0.8678062 , -0.872207  , -1.3516879 , -1.370372  , -1.3878838 ,
       -1.5135087 , -1.7355462 , -1.8827081 , -1.8932881 , -1.9078954 ,
       -1.9304959 , -2.2607315 , -2.2983875 , -2.306936  , -2.5027428 ,
       -2.5100663 , -2.5308392 , -2.5399976 , -2.671815  , -2.7323549 ,
       -2.7710226 , -2.7713668 , -2.952134  , -3.0604637 , -3.1706042 ,
       -3.2045438 , -3.5693393 , -3.5798075 , -3.666883  , -3.725064  ,
       -3.7498558 , -3.7632174 , -3.9968169 , -4.0113277 , -4.0688004 ,
       -4.0944843 , -4.195477  , -4.2383127 , -4.3323617 , -4.352419  ,
       -4.387961  , -4.3886123 , -4.3966126 , -4.6790543 , -4.703028  ,
       -4.7757573 , -4.777813  , -4.7882166 , -4.788251  , -4.822125  ,
       -4.872537  , -4.884937  , -4.898152  , -5.0720987 , -5.10

We have the start_logits showing in descending order!
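
If the negate-then-argsort trick still feels abstract, this tiny standard-library sketch reproduces it on four made-up logits. With a NumPy array the equivalent call is `(-logits).argsort()`, as used above.

```python
logits = [0.5, 3.2, -1.0, 2.1]  # toy values

# Indices ordered from the largest logit to the smallest one.
order = sorted(range(len(logits)), key=lambda i: -logits[i])
print(order)                       # [1, 3, 0, 2]
print([logits[i] for i in order])  # [3.2, 2.1, 0.5, -1.0]

# Keeping only the n largest candidates is then a simple slice.
top2 = order[:2]
print(top2)                        # [1, 3]
```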


Now, let's really transform the logits into strings and you'll understand everything.

In [71]:
n_largest = 20 # Number of start and end logits we want to search
max_answer_length = 30 # Max answer length we want to allow
predict_answers = [] # List of predicted answers will be stored 

for sample in small_validation_dataset: # For each sample in the NON-processed (not tokenized!) small validation dataset
    sample_id = sample["id"] # Get the id from this sample
    context = sample["context"] # and the context

    # Initializing best_score and best_answer (they'll be updated in the loop below)
    best_score = float("-inf")
    best_answer = None

    for idx in sample_id2idxs[sample_id]: # For each window index stored under this sample_id (remember it is a dict; the windows are tokenized!)
        # Grabbing the start and end logits for this index
        start_logit = start_logits[idx]
        end_logit = end_logits[idx]
        # And also get the offset mapping for this index
        offsets = small_validation_processed[idx]["offset_mapping"] # note that this offset mapping is the processed one, containing None for any position
                                                                    # that is not in the context
        # Sorting the logits as we saw
        start_indices = (-start_logit).argsort()
        end_indices = (-end_logit).argsort()

        # Next step is to loop through the n_largest start and end logits
        for start_idx in start_indices[:n_largest]:
            for end_idx in end_indices[:n_largest]:
                # Skipping the cases where the answer:
                if offsets[start_idx] is None or offsets[end_idx] is None: # is not in the context
                    continue
                if end_idx < start_idx: # does not exist (since it has negative length)
                    continue
                if (end_idx - start_idx + 1) > max_answer_length: # is longer than allowed
                    continue

                # If we have a valid answer,
                score = start_logit[start_idx] + end_logit[end_idx] # compute the score for this answer
                if score > best_score: # Checking if score is better than the current best_score
                    best_score = score # If yes, update it

                    # Getting the position of the first character and of the last character
                    first_ch = offsets[start_idx][0]
                    last_ch = offsets[end_idx][1]
                    # Retrieving the answer as actual text, using them as indices into the context
                    best_answer = context[first_ch:last_ch]
    # And finally append one prediction per sample (after all of its windows were checked)
    predict_answers.append({"id": sample_id, "prediction_text": best_answer})
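
The search over start and end candidates can be tried out on a toy window without any model. In this sketch the logits, offsets and the helper name `best_span` are invented, and plain lists replace the NumPy arrays. The highest logit sits on the first (special) token on purpose, to show that such a candidate gets skipped.

```python
def best_span(start_logit, end_logit, offsets, context, n_largest=2, max_answer_length=3):
    """Return (answer_text, score) for the best valid start/end pair."""
    start_order = sorted(range(len(start_logit)), key=lambda i: -start_logit[i])
    end_order = sorted(range(len(end_logit)), key=lambda i: -end_logit[i])

    best_score, best_answer = float("-inf"), None
    for s in start_order[:n_largest]:
        for e in end_order[:n_largest]:
            if offsets[s] is None or offsets[e] is None:
                continue  # not a context token
            if e < s or (e - s + 1) > max_answer_length:
                continue  # negative length or too long
            score = start_logit[s] + end_logit[e]
            if score > best_score:
                best_score = score
                best_answer = context[offsets[s][0]:offsets[e][1]]
    return best_answer, best_score


context = "the red fox jumps"
offsets = [None, (0, 3), (4, 7), (8, 11), (12, 17)]  # token 0 is a special token
start_logit = [5.0, 0.1, 4.0, 1.0, 0.2]
end_logit = [5.0, 0.3, 0.5, 4.5, 1.0]

print(best_span(start_logit, end_logit, offsets, context))  # ('red fox', 8.5)
```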

Once this is done, we just need to put the true answers in the right format for computing the metrics.

To remember the format, go back to the beginning of the "Metrics and Logits" section.

In [72]:
true_answers = [
    {
    "id": x["id"],
    "answers": x["answers"]
    }
    for x in small_validation_dataset
]

In [73]:
#true_answers # uncomment to see the result

Yay! We can now turn the logits into strings and finally compute the metrics!

In [74]:
metric.compute(predictions=predict_answers, references=true_answers)

{'exact_match': 83.0, 'f1': 88.25000000000004}

Let's turn the whole process into a function called compute_metrics

#### Computing metrics

In [75]:
def compute_metrics(start_logits, end_logits, processed_dataset, orig_dataset):
    sample_id2idxs = {}

    for i, id_ in enumerate(processed_dataset["sample_id"]):
        if id_ not in sample_id2idxs:
            sample_id2idxs[id_] = [i]
        else:
            sample_id2idxs[id_].append(i)

    predicted_answers = []
    for sample in tqdm(orig_dataset):

        sample_id = sample["id"]
        context = sample['context']

        best_score = float("-inf")
        best_answer = None

        for idx in sample_id2idxs[sample_id]:
            start_logit = start_logits[idx]
            end_logit = end_logits[idx]

            offsets = processed_dataset[idx]["offset_mapping"]

            start_indices = (-start_logit).argsort()
            end_indices = (-end_logit).argsort()

            for start_idx in start_indices[:n_largest]:
                for end_idx in end_indices[:n_largest]:
                    if offsets[start_idx] is None or offsets[end_idx] is None:
                        continue

                    if end_idx < start_idx:
                        continue

                    if (end_idx - start_idx + 1) > max_answer_length:
                        continue

                    score = start_logit[start_idx] + end_logit[end_idx]
                    if score > best_score:
                        best_score = score

                        first_ch = offsets[start_idx][0] 
                        last_ch = offsets[end_idx][1]
                        best_answer = context[first_ch:last_ch]
                
        predicted_answers.append({"id": sample_id, "prediction_text": best_answer})
    true_answers = [{"id": x["id"], "answers": x["answers"]} for x in orig_dataset]
    y = metric.compute(predictions=predicted_answers, references=true_answers)
    return y
    

Let's run the function on the small datasets we used earlier

In [76]:
compute_metrics(
    start_logits,
    end_logits,
    small_validation_processed,
    small_validation_dataset
)

  0%|          | 0/100 [00:00<?, ?it/s]

{'exact_match': 83.0, 'f1': 88.25000000000004}

Great!
This function will be used after our training step is done!

# Training

In [77]:
mlflow.end_run()
mlflow.set_experiment("BERT Q&A - distilbert-base-cased")

2024/03/22 14:15:36 INFO mlflow.tracking.fluent: Experiment with name 'BERT Q&A - distilbert-base-cased' does not exist. Creating a new experiment.


<Experiment: artifact_location='/phoenix/mlflow/151682566930780428', creation_time=1711116936652, experiment_id='151682566930780428', last_update_time=1711116936652, lifecycle_stage='active', name='BERT Q&A - distilbert-base-cased', tags={}>

In [78]:
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint_bbc) # Loading the model we want to fine-tune (distilbert-base-cased)

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now it is time to create our TrainingArguments object with all the necessary arguments for the training step.

In [79]:
args = TrainingArguments(
    "finetuned-squad", # output directory where the checkpoints are saved
    evaluation_strategy="no", # No, because we'll compute metrics manually
    save_strategy="epoch", # saving a checkpoint at the end of each epoch (you can use "steps" as well)
    learning_rate=2e-5, # learning rate value
    num_train_epochs=3, # 3 epochs in total (more than 4 is not recommended, since our inputs are very large)
    weight_decay=0.01, # regularization technique
    fp16=True # mixed-precision training to speed up the process
)
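
As a rough sanity check on how long training will run, the number of optimizer steps can be estimated from the size of the processed train dataset. This sketch assumes a single device, no gradient accumulation and the default per-device batch size of 8, since the arguments above do not set one. If any of these differ in your setup, the numbers change.

```python
import math

n_train_windows = 88729    # size of the processed train dataset printed earlier
per_device_batch_size = 8  # assumed: the TrainingArguments default
num_epochs = 3             # num_train_epochs used above

steps_per_epoch = math.ceil(n_train_windows / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 11092 33276
```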

Now, let's instantiate a trainer object

In [80]:
trainer = Trainer(
    model=model, # our model
    args=args, # our args
    train_dataset=train_dataset, # our datasets
    eval_dataset=validation_dataset,
    tokenizer=tokenizer # and our tokenizer
)

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


Checking if the GPU is available

In [81]:
torch.cuda.is_available()

True

The time has come! Let's train!

In [82]:
mlflow.end_run()
start_time_training = datetime.now() # this is for computing the time the training takes
with mlflow.start_run():
    trainer.train() 
print(f'Total time for training: {datetime.now() - start_time_training}')

| Step | Training Loss |
|------|---------------|
| 500  | 3.3492 |
| 1000 | 2.2978 |
| 1500 | 2.0159 |
| 2000 | 1.7748 |
| 2500 | 1.6831 |
| 3000 | 1.5824 |
| 3500 | 1.5159 |
| 4000 | 1.4333 |
| 4500 | 1.3863 |
| 5000 | 1.422  |


Total time for training: 1:45:05.599867


## Prediction

Now, let's do the evaluation.

In [83]:
trainer_prediction = trainer.predict(validation_dataset) # getting the predictions for the validation set
trainer_prediction

PredictionOutput(predictions=(array([[ -8.2265625, -10.53125  , -10.6640625, ..., -11.2578125,
        -11.265625 , -11.25     ],
       [ -8.296875 , -10.5859375, -10.734375 , ..., -11.265625 ,
        -11.2734375, -11.2578125],
       [ -8.328125 , -10.7421875, -10.5625   , ..., -11.2734375,
        -11.2734375, -11.2734375],
       ...,
       [ -5.1367188, -10.7109375, -10.75     , ..., -11.359375 ,
        -11.34375  , -11.3203125],
       [ -5.0039062, -10.3203125, -10.375    , ..., -11.4296875,
        -11.4140625, -11.390625 ],
       [ -3.6503906, -10.8125   , -10.921875 , ..., -11.375    ,
        -11.3671875, -11.3359375]], dtype=float32), array([[ -7.6289062, -11.1484375, -10.859375 , ..., -11.4296875,
        -11.4296875, -11.4453125],
       [ -7.578125 , -11.109375 , -10.8359375, ..., -11.40625  ,
        -11.40625  , -11.421875 ],
       [ -7.78125  , -11.34375  , -11.5625   , ..., -11.5078125,
        -11.515625 , -11.5      ],
       ...,
       [ -4.546875 , -11.0156

And grab just the prediction values from this object:

In [84]:
predictions, _, _ = trainer_prediction
predictions

(array([[ -8.2265625, -10.53125  , -10.6640625, ..., -11.2578125,
         -11.265625 , -11.25     ],
        [ -8.296875 , -10.5859375, -10.734375 , ..., -11.265625 ,
         -11.2734375, -11.2578125],
        [ -8.328125 , -10.7421875, -10.5625   , ..., -11.2734375,
         -11.2734375, -11.2734375],
        ...,
        [ -5.1367188, -10.7109375, -10.75     , ..., -11.359375 ,
         -11.34375  , -11.3203125],
        [ -5.0039062, -10.3203125, -10.375    , ..., -11.4296875,
         -11.4140625, -11.390625 ],
        [ -3.6503906, -10.8125   , -10.921875 , ..., -11.375    ,
         -11.3671875, -11.3359375]], dtype=float32),
 array([[ -7.6289062, -11.1484375, -10.859375 , ..., -11.4296875,
         -11.4296875, -11.4453125],
        [ -7.578125 , -11.109375 , -10.8359375, ..., -11.40625  ,
         -11.40625  , -11.421875 ],
        [ -7.78125  , -11.34375  , -11.5625   , ..., -11.5078125,
         -11.515625 , -11.5      ],
        ...,
        [ -4.546875 , -11.015625 , -10.

We have a tuple with two arrays, the start_logits and end_logits!

In [85]:
start_logits, end_logits = predictions

##### Computing the metrics

In [86]:
compute_metrics(
    start_logits,
    end_logits,
    validation_dataset,
    squad_dataset['validation']
)

  0%|          | 0/10570 [00:00<?, ?it/s]

{'exact_match': 77.1050141911069, 'f1': 85.14177757622853}

Saving the model for further usage

In [87]:
trainer.save_model('distilbert_bertqa')

# Inference

We can create a question-answering pipeline from transformers and pass our model to it.

In [88]:
qa = pipeline(
    'question-answering',
    model='distilbert_bertqa',
    device=0 #GPU
)

Testing the pipeline

In [89]:
context = "Tomorrow the Atlântico is going to have a delicious team lunch!"
question = "What did the Atlântico is going to have tomorrow?"
qa(context=context, question=question)

{'score': 0.5906658172607422,
 'start': 40,
 'end': 62,
 'answer': 'a delicious team lunch'}
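
The `start` and `end` values in this result are character positions in the context, with `end` being exclusive, so the answer can be recovered by slicing. A quick check using the output shown above:

```python
context = "Tomorrow the Atlântico is going to have a delicious team lunch!"
result = {
    "score": 0.5906658172607422,
    "start": 40,
    "end": 62,
    "answer": "a delicious team lunch",
}

# Slicing the context with the returned offsets gives back the answer text.
span = context[result["start"]:result["end"]]
print(span)  # a delicious team lunch
```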

In [90]:
print(f' {datetime.now() - start_time_all_execution}')

 1:47:23.731507
