# Introduction:

* In the previous chapter we learned how to use tokenizers and pretrained models to make predictions.
* In this chapter we will see how to **fine-tune** a model on our own **dataset** by learning:
   - How to prepare a large dataset for the fine-tuning process
   - How to use the high-level `Trainer` API to fine-tune a model
   - How to leverage the HuggingFace Accelerate library to run a custom training loop on any distributed setup
* But first, let's do the usual: pick an architecture/model/tokenizer, then train it on some sample data:

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
mdl_ckp = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "This course is amazing!"]
batch = tokenizer(sequences, truncation=True, padding=True, return_tensors='pt')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# will be explained later
batch['labels'] = torch.tensor([1, 1])

In [5]:
# training
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

* Of course, training a model on 2 sentences will not yield good results.
* So we need to train it on a larger dataset.
* In this chapter we will work with the [**MRPC**](https://aclanthology.org/I05-5002.pdf) (Microsoft Research Paraphrase Corpus) dataset as an example.
    - The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing)
    - This is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks

## Loading Datasets From The Hub:

* We can easily download a dataset from the Hub just like we did with models before:

In [6]:
# load dataset
from datasets import load_dataset
dataset = load_dataset('glue', 'mrpc')



In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

* Datasets are returned as a **DatasetDict**, a dictionary-like object that organizes our dataset by split.
   - Here we have a training set, a validation set and a test set.
   - Each set shows 2 fields: `features` and `num_rows`.
   - The features are: `sentence1`, `sentence2`, `label`, `idx`
   - `sentence1` and `sentence2` form the pair the model is trained on, to predict whether one is a paraphrase of the other.
   

* We can access each pair of sentences in our `dataset` object by indexing, like with a dictionary:

In [8]:
# training set
train_ds = dataset['train']
train_ds[22]

{'sentence1': 'A BMI of 25 or above is considered overweight ; 30 or above is considered obese .',
 'sentence2': 'A BMI between 18.5 and 24.9 is considered normal , over 25 is considered overweight and 30 or greater is defined as obese .',
 'label': 0,
 'idx': 24}

* Here we see the pair of sentences, the label and the index of that pair.
* Labels are already `int` values, so we won't need to preprocess them.
* What does `label: 0` mean?

In [9]:
# what each label means
train_ds.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

* **0** for `not_equivalent` and **1** for `equivalent`

In [10]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mdl_ckp)
train_seq1 = tokenizer(train_ds['sentence1'])
train_seq2 = tokenizer(train_ds['sentence2'])

## Preprocessing the dataset:

* We can't just pass the two sequences to the model separately and expect a proper prediction about whether they are paraphrases of each other.
* We need to prepare the data so that the model receives pairs of sequences instead of 2 separate sentences.
* This is done with the tokenizer: we pass it a pair of sentences and it encodes them the way **BERT** expects:

In [11]:
#example
input = tokenizer('this is the first sentence', 'this is number 2')
input

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 102, 2023, 2003, 2193, 1016, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

* The tokenizer outputs `input_ids` and `attention_mask`, but also `token_type_ids`.
* This last field shows that the tokenizer is aware we are dealing with two sentences: each token is marked `0` (first sentence) or `1` (second sentence).

In [12]:
input.token_type_ids

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

* If we convert the `input_ids` back to tokens we can get an idea of what happened:

In [13]:
tokenizer.convert_ids_to_tokens(input['input_ids'])


['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '[SEP]',
 'this',
 'is',
 'number',
 '2',
 '[SEP]']
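
The layout shown above can be reproduced by hand. The following is only a minimal sketch, not the tokenizer's actual implementation: `build_pair` is a hypothetical helper, and the inputs are assumed to be already split into tokens.

```python
def build_pair(tokens_a, tokens_b):
    """Hypothetical helper: assemble a BERT-style sentence pair by hand."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence and its [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair(
    ["this", "is", "the", "first", "sentence"],
    ["this", "is", "number", "2"],
)
print(tokens)
print(token_type_ids)
```

The segment ids produced this way line up with the `token_type_ids` returned by the tokenizer in the cell above.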

* So we see the model expects the inputs to be of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`.
* Note that not every model's tokenizer returns `token_type_ids`; it depends on how the model was pretrained. Here `BERT` has seen sentence pairs during pretraining and knows how to deal with them.
* We can then pass pairs of sentences to the tokenizer like this:

In [14]:
tokenized_dataset = tokenizer(train_ds['sentence1'], train_ds['sentence2'], truncation=True, padding=True)


* Tokenizing the whole dataset this way is not ideal, since it requires enough RAM to hold the entire tokenized dataset in memory at once.
* It also returns a plain dictionary (keys `attention_mask`, `input_ids`, `token_type_ids`, with lists as values) rather than a dataset.
* To work around this we will use the `map()` method, which keeps the data as a dataset and gives us more flexibility if we need preprocessing beyond tokenization.
* `map()` works by applying a function to each element of the dataset, so let's create a function that tokenizes pairs of sentences and let `map()` apply it over the whole dataset:

In [15]:
def func_tokenize(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

* This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`.
* We didn't include `padding` here because padding the whole dataset to the longest sentence is inefficient when we can pad at the batch level instead.
* We can pass `batched=True` to `map()` so the function is applied to many examples at once, which speeds up tokenization.
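
To make concrete what a batched call hands to the mapped function, here is a pure-Python sketch. It is not how the `datasets` library is implemented: `toy_batched_map` and `add_pair_length` are hypothetical names, and the tiny batch size is chosen only for illustration. The point is that, with batching, the function receives a dictionary of lists (one list per column) and returns a dictionary of lists.

```python
def add_pair_length(batch):
    """With batching on, every value in `batch` is a list covering many rows."""
    lengths = [len(s1) + len(s2)
               for s1, s2 in zip(batch["sentence1"], batch["sentence2"])]
    return {"pair_chars": lengths}


def toy_batched_map(columns, fn, batch_size=2):
    """Hypothetical stand-in for a batched map over a column-oriented table."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch (a dict of lists).
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Existing columns are kept; returned columns are added (or overwrite).
    return {**columns, **new_columns}


toy = {
    "sentence1": ["a b", "c", "d e f"],
    "sentence2": ["x", "y z", "w"],
}
mapped = toy_batched_map(toy, add_pair_length)
print(mapped["pair_chars"])
```

Because a tokenizer accepts lists of sentences, a function like `func_tokenize` works unchanged whether it is given one example or a batch of them.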

In [16]:
tokenized_datasets = dataset.map(func_tokenize, batched=True)


* Let's take a look at one example pair from the training set.

* We get what we expected: the 3 keys produced by tokenization, plus the keys we already had (`sentence1`, `sentence2`, `label` and `idx`):

In [17]:
tokenized_datasets['train'][55].keys()

dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])

* Now we have to deal with padding. Since we decided to apply it at the batch level, each batch is padded to its own **longest sequence**.
* This process is called **Dynamic Padding**.

### Dynamic Padding:

* Putting the samples together in a single batch is done through a function called a **collate function**.
* By default, the collate function converts our samples to PyTorch tensors and concatenates them.
* But this can't be done without padding, because the samples have different lengths and the tensors would have different shapes.
* As we said before, padding should be done at the batch level: each batch has its samples padded to the longest sequence in that batch, otherwise we end up with samples carrying a lot of padding.
* In practice we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together.

In [18]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


* Here we test this collate function on some samples from training set.
* We need first to remove columns `idx`, `sentence1`, `sentence2` since we don't need them.
* Let's have a look at the length of each entry in the batch:

In [19]:
samples = tokenized_datasets['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}
[len(x) for x in samples['input_ids']]



[50, 59, 47, 67, 59, 50, 62, 32]

In [20]:
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])


* These samples vary in length between `32` and `67`, so dynamic padding should pad every sequence in this batch to the longest one, `67`.

In [21]:
sample_batch = data_collator(samples)
{k:v.shape for k, v in sample_batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

* Let's check again if our `input_ids` have the same length:

In [22]:
[len(i) for i in sample_batch['input_ids']]

[67, 67, 67, 67, 67, 67, 67, 67]
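
As a rough illustration of what dynamic padding does and saves, the sketch below pads a batch by hand. It is not the `DataCollatorWithPadding` implementation: `pad_batch` is a hypothetical helper, the sequences are dummies with the same lengths as the eight samples above, `pad_id=0` is assumed to be the padding id, and the comparison against 512 assumes a worst case where everything is padded to BERT's maximum input length.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq))
                      for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Dummy sequences with the same lengths as the eight samples above.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
toy_batch = pad_batch([[1] * n for n in lengths])

# Padding tokens added when padding to the batch maximum vs. to 512.
padded_to_batch = sum(row.count(0) for row in toy_batch["attention_mask"])
padded_to_512 = sum(512 - n for n in lengths)
print(padded_to_batch, padded_to_512)
```

Padding to the batch maximum adds far fewer padding tokens than padding to a fixed global length, which is the motivation for doing it per batch.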

# Fine-tuning a model with the Trainer API:

* We can use the `Trainer` class to fine-tune any pretrained model on our dataset.
* Now we need to prepare the environment for `Trainer.train()`, which we will run on a **GPU**.
* But first we have to define `TrainingArguments`, which contains all the *hyperparameters* the **`Trainer`** will use for training and evaluation.
* We only need to provide the directory where the model and its checkpoints will be saved; everything else keeps its default value, which is fine for learning purposes.

In [23]:
output_dir = 'my_folder'

In [24]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir=output_dir)

* For this dataset we will use `AutoModelForSequenceClassification` class with 2 labels:

In [25]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(mdl_ckp, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


* The pretrained BERT model serves as the backbone: its pretraining head is discarded and `AutoModelForSequenceClassification` adds a new sequence classification head that fits our task.
* The weights of this new head are initialized randomly (hence the warning above), so they have to be trained from scratch, which is exactly what we will do:

In [26]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

* When the tokenizer is passed as an argument like this, we usually don't need to pass `data_collator` explicitly: the `Trainer` then defaults to a `DataCollatorWithPadding` built from that tokenizer.
* Now we fine-tune the model on our dataset:

In [27]:
trainer.train()

Step,Training Loss
500,0.5568
1000,0.3745


TrainOutput(global_step=1377, training_loss=0.4070802964243061, metrics={'train_runtime': 198.5461, 'train_samples_per_second': 55.423, 'train_steps_per_second': 6.935, 'total_flos': 405324636337200.0, 'train_loss': 0.4070802964243061, 'epoch': 3.0})
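
The `global_step=1377` reported above can be checked with a little arithmetic. This sketch assumes the default `TrainingArguments` values (a per-device batch size of 8 and 3 epochs) and a single device, with no gradient accumulation.

```python
import math

num_examples = 3668   # size of the MRPC training split
batch_size = 8        # assumed default per-device batch size, one device
num_epochs = 3        # assumed default number of epochs

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

The training loss table shows rows at steps 500 and 1000 because, under the same default settings, the loss is logged every 500 steps.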

* One thing we didn't include in our Trainer is the evaluation strategy.
* We have no idea how good or bad our model is because:
    - We didn’t tell the `Trainer` to evaluate during training by setting `evaluation_strategy` to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
    - We didn’t provide the `Trainer` with a `compute_metrics()` function to **calculate a metric during said evaluation** (otherwise the evaluation would just have printed the loss, which is **not** a very intuitive number).

## Evaluation:

* First we need to build a **`compute_metrics()`** function in order to use it in the next training.
* The function takes an `EvalPrediction` object as argument, which is basically a named tuple with:
    - `predictions` field
    - `label_ids` field
* Here we get some predictions from our model:    

In [28]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


* The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`.
* The `metrics` field contains the loss on the dataset, as well as time metrics (how long the prediction took, in total and on average).
* As we see here, `predictions` is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used).
* It holds the logits for each element of the dataset we passed to `predict()`.
* To transform them into predictions that we can compare to our labels, we take the index of the maximum value along the second axis:

In [33]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
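
Building on the `argmax` step, a `compute_metrics()` function could look roughly like the sketch below. It is an assumption-laden illustration rather than the notebook's own code: metrics are computed by hand with NumPy instead of a metrics library, the argument is assumed to unpack into `(logits, labels)`, and the toy logits are made up, not real model output.

```python
import numpy as np


def compute_metrics(eval_preds):
    """Sketch: accuracy and F1 for a binary task, computed by hand."""
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)
    accuracy = float((preds == labels).mean())
    # F1 for the positive class (label 1, "equivalent").
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy logits for four sentence pairs; not real model output.
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2]])
toy_labels = np.array([0, 1, 0, 1])
metrics = compute_metrics((toy_logits, toy_labels))
print(metrics)
```

A function of this shape can be passed to the `Trainer` through its `compute_metrics` argument so that the metrics are reported whenever evaluation runs.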