# Introduction:

* In the previous chapter we learned how to use tokenizers and pretrained models to make predictions.
* In this chapter we will see how to **Fine-tune** a model on our **Dataset** by learning:
   - How to prepare a large dataset for the fine-tuning process
   - How to use the high-level `Trainer` API to fine-tune a model
   - How to leverage the HuggingFace Accelerate library to easily run a custom training loop on any distributed setup
* But first, let's do the usual: pick an architecture/model/tokenizer, then train it on some sample data:

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [4]:
mdl_ckpt = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
model = AutoModelForSequenceClassification.from_pretrained(mdl_ckpt)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "This course is amazing!"]
batch = tokenizer(sequences, truncation=True, padding=True, return_tensors='pt')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# will be explained later
batch['labels'] = torch.tensor([1, 1])

In [6]:
# training
optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

* Of course, training a model on 2 sentences will not yield good results.
* So we need to train it on a larger dataset.
* In this chapter we will work with the [**MRPC**](https://aclanthology.org/I05-5002.pdf) (Microsoft Research Paraphrase Corpus) dataset as an example.
    - The dataset consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing)
    - This is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks

## Loading Datasets From The Hub:

* We can easily download a dataset from the Hub just like we did with models before:

In [7]:
# load dataset
from datasets import load_dataset
raw_ds = load_dataset('glue', 'mrpc')


In [9]:
raw_ds

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

* The dataset is returned as a **DatasetDict**, a dictionary-like object that organizes our dataset by split.
   - Here we have a training set, a validation set and a test set.
   - Each split shows two attributes: `features` and `num_rows`.
   - The features are: `sentence1`, `sentence2`, `label`, `idx`
   - `sentence1` and `sentence2` form the pair we train our model on, to predict whether the two sentences are paraphrases or not.
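
To make the structure concrete, here is a minimal sketch that mimics it with plain Python dictionaries. It is only a mental model, not how the `datasets` library stores data, and the rows and the helper names `num_rows` and `get_row` are invented for illustration.

```python
# Toy stand-in for a DatasetDict: split name -> column name -> list of values.
# All rows below are made up for illustration; they are not MRPC data.
toy_ds = {
    "train": {
        "sentence1": ["The cat sat.", "It rained today."],
        "sentence2": ["A cat was sitting.", "The sun was out."],
        "label": [1, 0],
        "idx": [0, 1],
    },
    "validation": {
        "sentence1": ["He left early."],
        "sentence2": ["He departed early."],
        "label": [1],
        "idx": [2],
    },
}


def num_rows(split):
    """Number of rows in a split (every column has the same length)."""
    return len(split["label"])


def get_row(split, i):
    """Gather the i-th value of every column into one example dictionary."""
    return {column: values[i] for column, values in split.items()}


for name, split in toy_ds.items():
    print(name, list(split), num_rows(split))

print(get_row(toy_ds["train"], 1))
```

Selecting a split by name and then a row by integer position is the same access pattern used on the real dataset in the next cell.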
   

* We can access each pair of sentences in our `raw_ds` object by indexing, like with a dictionary:

In [10]:
# training set
raw_train_ds = raw_ds['train']
raw_train_ds[22]

{'sentence1': 'A BMI of 25 or above is considered overweight ; 30 or above is considered obese .',
 'sentence2': 'A BMI between 18.5 and 24.9 is considered normal , over 25 is considered overweight and 30 or greater is defined as obese .',
 'label': 0,
 'idx': 24}

* Here we see the pair of sentences, the label and the index of that pair.
* Labels are already `int` values, so we won't need to preprocess them.
* What does `label: 0` mean?

In [11]:
# what each label means
raw_train_ds.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

* **0** for `not_equivalent` and **1** for `equivalent`
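
The mapping is simply the position of each name in the `names` list. A minimal sketch in plain Python follows; the helper names `int_to_name` and `name_to_int` are made up here. The `ClassLabel` feature offers comparable helpers (`int2str` and `str2int`).

```python
# Mirrors the `names` list shown in the ClassLabel feature above.
names = ["not_equivalent", "equivalent"]


def int_to_name(label_id):
    """Integer label -> human-readable class name."""
    return names[label_id]


def name_to_int(name):
    """Class name -> integer label (its position in the list)."""
    return names.index(name)


print(int_to_name(0))
print(name_to_int("equivalent"))
```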

In [12]:
# tokenize each sentence column on its own (not yet as pairs)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(mdl_ckpt)
train_seq1 = tokenizer(raw_train_ds['sentence1'])
train_seq2 = tokenizer(raw_train_ds['sentence2'])

## Preprocessing the dataset:

* We can't just pass two sequences to the model separately and expect a proper prediction about whether they are paraphrases or not.
* We need to prepare the data properly so that we feed the model pairs of sequences instead of 2 separate sentences.
* This can be done with the tokenizer first: we pass it a pair of sentences and it encodes them the way **BERT** expects:

In [13]:
#example
input = tokenizer('this is the first sentence', 'this is number 2')
input

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 102, 2023, 2003, 2193, 1016, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

* The tokenizer outputs `input_ids` and `attention_mask`, but also `token_type_ids`.
* This feature tells us that the tokenizer is aware that we are dealing with two sentences: each token is marked with `0` (first sentence) or `1` (second sentence).

In [14]:
input.token_type_ids

[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

* If we convert the `input_ids` back to tokens, we can get an idea of what happened:

In [None]:
tokenizer.convert_ids_to_tokens(input['input_ids'])


['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '[SEP]',
 'this',
 'is',
 'number',
 '2',
 '[SEP]']
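
The layout in this output can be reproduced by hand. The sketch below is a simplified illustration, not the tokenizer's actual implementation: it assumes the sentences are already split into tokens, and `build_pair` is a hypothetical helper name.

```python
def build_pair(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark the segment of each token."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # No padding here, so every position is a real token.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair(
    ["this", "is", "the", "first", "sentence"],
    ["this", "is", "number", "2"],
)
print(tokens)
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The seven zeros and five ones match the `token_type_ids` returned by the tokenizer for the same pair.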

* So we see that the model expects inputs of the form `[CLS] sentence1 [SEP] sentence2 [SEP]`.
* Note that not every model's tokenizer returns `token_type_ids`; it depends on how the model was pretrained. Here, `BERT` has seen sentence pairs during pretraining and knows how to deal with them.
* We can then pass pairs of sentences to the tokenizer like this:

In [15]:
tokenized_dataset = tokenizer(raw_train_ds['sentence1'], raw_train_ds['sentence2'], truncation=True, padding=True)


* Tokenizing the whole dataset this way is not ideal, since it requires enough RAM to hold the entire dataset in memory while we process it.
* It also returns a plain dictionary (with the keys `attention_mask`, `input_ids` and `token_type_ids` and their values) rather than a dataset.
* To work around this problem we will use the `map()` method, which keeps the data as a dataset and also gives us more flexibility if we need preprocessing beyond tokenization.
* `map()` works by applying a function to each element of the dataset, so let's create a function that tokenizes pairs of sentences for `map()` to apply over the whole dataset:

In [16]:
def func_tokenize(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

* This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`.
* We didn't include `padding` here, because it's not efficient to pad the whole dataset to the longest sentence when we can pad at the batch level instead.
* We can enable batching by passing `batched=True` to the `map()` method, so the function is applied to several elements at once:

In [17]:
tokenized_dataset = raw_ds.map(func_tokenize, batched=True)
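
As a rough mental model of what `batched=True` changes, the sketch below re-creates the idea in plain Python: the function receives a dictionary of lists (one slice per column) instead of a single example, and the columns it returns are merged into the existing ones. `toy_map` and `count_words_batched` are hypothetical names, the whitespace split stands in for a real tokenizer, and the real `map()` does considerably more (caching, Arrow storage, multiprocessing).

```python
def toy_map(columns, func, batch_size=1000):
    """Apply func to slices of a column dict and merge the new columns in.

    columns: dict of column name -> list of values (all the same length).
    """
    n = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n, batch_size):
        # func sees a dict of lists: one slice of every column.
        batch = {k: v[start:start + batch_size] for k, v in columns.items()}
        for key, values in func(batch).items():
            new_columns.setdefault(key, []).extend(values)
    # Existing columns are kept; returned columns are added alongside them.
    return {**columns, **new_columns}


def count_words_batched(batch):
    # Stand-in for func_tokenize: whitespace split instead of a tokenizer.
    return {
        "n_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


toy = {
    "sentence1": ["a b c", "d e"],
    "sentence2": ["f g", "h"],
    "label": [1, 0],
}
result = toy_map(toy, count_words_batched, batch_size=2)
print(result["n_words"])  # [5, 3]
print(list(result))
```

The original columns survive and the new one is added, which is why the tokenized dataset keeps `sentence1`, `sentence2`, `label` and `idx` next to the tokenizer outputs.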

* Let's take a look at an example pair from the training dataset.

* We get what we expected: the 3 keys produced by tokenization, plus the columns we already had (`label`, `idx`, `sentence1` and `sentence2`):

In [18]:
tokenized_dataset['train'][55].keys()

dict_keys(['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'])

* Now we have to deal with padding. Since we decided to apply it at the batch level, each batch will have its own **longest sequence** to pad to.
* This process is called **Dynamic Padding**.

### Dynamic Padding:

* Putting the samples together in a single batch is done through a function called a **`collate function`**.
* The collate function converts our samples to PyTorch tensors and concatenates them.
* But this can't be done without padding, otherwise the tensors would have different shapes.
* As we said before, padding should be done at the batch level: each batch has its samples padded to the longest sequence in that batch, otherwise we would get samples with a lot of padding.
* In practice we have to define a collate function that applies the correct amount of padding to the items of the dataset we want to batch together.
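
As a rough illustration of what such a collate function does, here is a minimal sketch in plain Python. It pads with plain lists and skips the tensor conversion; `pad_batch` is a hypothetical helper, and the padding id `0` is an assumption that matches `[PAD]` in the `bert-base-uncased` vocabulary.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padding positions get 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 2003, 1996, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is computed per call, two different batches can end up with two different padded lengths.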

In [19]:
from transformers import DataCollatorWithPadding
data_colator = DataCollatorWithPadding(tokenizer=tokenizer)


* Here we test this collate function on a few samples from the training set.
* We first need to remove the columns `idx`, `sentence1` and `sentence2`, since we don't need them and the sentence columns contain strings, which can't be converted to tensors.
* Let's have a look at the length of each entry in the batch:

In [20]:
samples = tokenized_dataset['train'][:8]
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}
[len(x) for x in samples['input_ids']]



[50, 59, 47, 67, 59, 50, 62, 32]

In [22]:
samples.keys()

dict_keys(['label', 'input_ids', 'token_type_ids', 'attention_mask'])


* These samples vary in length from `32` to `67`, so our job here is to pad every other sequence in this particular batch up to the longest length, `67`.

In [23]:
sample_batch = data_colator(samples)
{k:v.shape for k, v in sample_batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}

* Let's check again if our `input_ids` have the same length:

In [25]:
[len(i) for i in sample_batch['input_ids']]

[67, 67, 67, 67, 67, 67, 67, 67]
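
To see what dynamic padding saves, we can count the padding tokens for this batch. The sequence lengths are the ones printed earlier; the fixed length of `128` is a hypothetical alternative chosen only for comparison.

```python
lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # sequence lengths of the batch above

real_tokens = sum(lengths)
dynamic_total = len(lengths) * max(lengths)  # pad to the batch maximum (67)
fixed_len = 128                              # hypothetical fixed padding length
fixed_total = len(lengths) * fixed_len

print(real_tokens)                  # 426
print(dynamic_total - real_tokens)  # 110 padding tokens with dynamic padding
print(fixed_total - real_tokens)    # 598 padding tokens with the fixed length
```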