# FINE-TUNE A PRETRAINED MODEL

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

# Processing the data

To train a sequence classifier on one batch in PyTorch:

In [None]:
import torch
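# the AdamW shipped with transformers is deprecated; use the PyTorch implementation
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification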

# same as before
checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

In [3]:
# sequence data
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]

# batch data
batch = tokenizer(sequences,
                  padding=True,
                  truncation=True,
                  return_tensors='pt')
# ADD labels
batch['labels'] = torch.tensor([1, 1])
print(batch)

{'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,
          2607,  2026,  2878,  2166,  1012,   102],
        [  101,  2023,  2607,  2003,  6429,   999,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'labels': tensor([1, 1])}


In [None]:
# run a single optimization step on this one batch
optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

## Loading a dataset from the Hub

In [None]:
from datasets import load_dataset

# download the MRPC (Microsoft Research Paraphrase Corpus) dataset
raw_datasets = load_dataset('glue', 'mrpc')
raw_datasets

The dataset is cached by default in *~/.cache/huggingface/datasets*. We can customize the cache folder by setting the `HF_HOME` environment variable.

In [6]:
# access the training set
raw_train_dataset = raw_datasets['train']
# first example
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

To know which integer corresponds to which label, we can inspect the `features` of the `raw_train_dataset`:

In [7]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

So 0 corresponds to `not_equivalent` and 1 corresponds to `equivalent`.

## Preprocessing a dataset

To preprocess the dataset, we need to convert the text to numbers the model can make sense of.

In [8]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenized_sentences_1 = tokenizer(raw_train_dataset['sentence1'])
tokenized_sentences_2 = tokenizer(raw_train_dataset['sentence2'])



However, we CANNOT just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not.

We need to handle the two sequences as a pair, and apply the appropriate preprocessing.

The tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

In [9]:
inputs = tokenizer(
    'This is the first sentence.',
    'This is the second sentence.',
)
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 6251, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The `token_type_ids` field tells the model which part of the input is the first sentence and which is the second sentence.

In [10]:
tokenizer(
    raw_train_dataset['sentence1'][15],
    raw_train_dataset['sentence2'][15]
)

{'input_ids': [101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [11]:
tokenizer(raw_train_dataset['sentence1'][15]), tokenizer(raw_train_dataset['sentence2'][15])

({'input_ids': [101, 24049, 2001, 2087, 3728, 3026, 3580, 2343, 2005, 1996, 9722, 1004, 4132, 9340, 12439, 2964, 2449, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 {'input_ids': [101, 3026, 3580, 2343, 4388, 24049, 1010, 3839, 2132, 1997, 1996, 9722, 1998, 4132, 9340, 12439, 2964, 3131, 1010, 2097, 2599, 1996, 2047, 9178, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

If we decode the IDs inside `input_ids` back to words:

In [12]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'sentence',
 '.',
 '[SEP]']

The model expects the inputs to be of the form `[CLS] sentence1 [SEP] sentence2 [SEP]` when there are two sentences. So the parts of the input corresponding to `[CLS] sentence1 [SEP]` all have a token type ID of 0, while the other parts, corresponding to `sentence2 [SEP]`, all have a token type ID of 1.
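
The layout described above can be reproduced by hand. The following is a minimal sketch that assumes the sentences are already split into tokens; `build_pair_inputs` is a hypothetical helper written for illustration, not part of the Transformers API.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] sentence1 [SEP] sentence2 [SEP].

    Illustrative helper only: the real tokenizer also maps tokens to IDs.
    """
    first = ["[CLS]"] + tokens_a + ["[SEP]"]   # segment 0
    second = tokens_b + ["[SEP]"]              # segment 1
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "sentence", "."],
)
print(tokens)
print(token_type_ids)
```

The resulting `token_type_ids` (eight 0s followed by seven 1s) match the tokenizer output shown earlier for the same pair of sentences.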

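If we select a different checkpoint, the tokenized inputs will not necessarily contain `token_type_ids`: they are only returned when the model knows what to do with them because it saw them during pretraining (DistilBERT, for example, does not use them).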

We can use our tokenizer to tokenize the whole dataset: feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences:

In [13]:
tokenized_dataset = tokenizer(
    raw_train_dataset['sentence1'],
    raw_train_dataset['sentence2'],
    padding=True,
    truncation=True,
)

This is okay, but it has the disadvantage of returning a dictionary (with our keys, `input_ids`, `attention_mask` and `token_type_ids`, and values that are lists of lists). It will also only work if we have enough RAM to store the whole dataset during the tokenization.

To keep the data as a dataset, we need to use the `Dataset.map()` method. The `map()` method works by applying a function on each element of the dataset:

In [14]:
def tokenize_function(example):
    res = tokenizer(
        example['sentence1'],
        example['sentence2'],
        truncation=True,
    )

    return res

We have left the `padding` out in our tokenization function for now because padding all the samples to the maximum length is not efficient: it is better to pad the samples when we are building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset.

Pass `batched=True` to `map()` so that the function is applied to multiple elements of our dataset at once, and not to each element separately.
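
To make the difference between the two calling modes concrete, here is a toy sketch in plain Python. `toy_map` and `count_words` are hypothetical stand-ins written for this note (they are not the Datasets implementation); the default `batch_size=1000` mirrors the default of `Dataset.map()`.

```python
def toy_map(rows, function, batched=False, batch_size=1000):
    """Toy stand-in for `Dataset.map()`; returns (new_rows, number_of_calls)."""
    new_rows, calls = [], 0
    if not batched:
        # One call per example: `function` receives a single row (dict of values).
        for row in rows:
            new_rows.append({**row, **function(row)})
            calls += 1
        return new_rows, calls
    # One call per chunk: `function` receives a dict of lists (one list per column).
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        added = function(columns)
        calls += 1
        for i, row in enumerate(chunk):
            new_rows.append({**row, **{key: values[i] for key, values in added.items()}})
    return new_rows, calls


def count_words(example):
    """Accepts one sentence or a list of sentences, like the tokenizer does."""
    text = example["sentence1"]
    if isinstance(text, list):
        return {"n_words": [len(t.split()) for t in text]}
    return {"n_words": len(text.split())}


rows = [{"sentence1": s} for s in ["a b c", "d e", "f", "g h i j", "k l"]]
one_by_one, calls_single = toy_map(rows, count_words)
in_chunks, calls_batched = toy_map(rows, count_words, batched=True, batch_size=2)
print(calls_single, calls_batched)
```

Both modes produce the same rows, but the batched mode makes fewer calls. In the real library the benefit comes from the tokenizer processing a whole list of texts per call; the call count here is only a proxy for that.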

In [15]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

The `map()` call adds the new fields (`input_ids`, `token_type_ids`, and `attention_mask`) to every split of the `DatasetDict`.

The last thing to do is pad all the examples to the length of the longest element when we batch elements together, a technique called *dynamic padding*.

## Dynamic padding

The *collate function* is the function responsible for putting samples together inside a batch.

The Transformers library provides such a function via `DataCollatorWithPadding`, which takes a tokenizer when we instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything we need:

In [16]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

In [17]:
# take a few samples
samples = tokenized_datasets['train'][:10]

# only remove columns we do not need
samples = {
    k: v for k,v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']
}

[len(x) for x in samples['input_ids']]

[50, 59, 47, 67, 59, 50, 62, 32, 45, 60]

We can see the lengths of selected samples are different.

Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside this batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept.
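
The saving can be estimated with a small sketch. It assumes right padding with a pad ID of 0 (as in the BERT batch printed at the top of these notes) and uses placeholder token IDs; `pad_batch` is a hypothetical helper, not the `DataCollatorWithPadding` implementation.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence on the right to the longest length in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Lengths of the ten samples selected above; the token IDs are placeholders.
lengths = [50, 59, 47, 67, 59, 50, 62, 32, 45, 60]
fake_batch = [[1] * n for n in lengths]

padded = pad_batch(fake_batch)
padded_lengths = [len(row) for row in padded["input_ids"]]

# Cells the model has to process with dynamic padding versus padding
# to 512, the maximum sequence length of bert-base-uncased.
dynamic_cells = len(lengths) * max(lengths)
fixed_cells = len(lengths) * 512
print(padded_lengths, dynamic_cells, fixed_cells)
```

For this batch, dynamic padding yields 670 cells against 5120 when padding to the model maximum.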

In [18]:
batch = data_collator(samples)

{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([10, 67]),
 'token_type_ids': torch.Size([10, 67]),
 'attention_mask': torch.Size([10, 67]),
 'labels': torch.Size([10])}

# Fine-tuning a model with the Trainer API

Data preparation and preprocessing:

In [19]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset('glue', 'mrpc')
checkpoint = 'bert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    res = tokenizer(
        example['sentence1'],
        example['sentence2'],
        truncation=True,
    )

    return res

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Training

The first step is to create a `TrainingArguments` object that contains all the hyperparameters the `Trainer` will use for training and evaluation.

In [20]:
from transformers import TrainingArguments

training_args = TrainingArguments('test-trainer')

The trained model and checkpoints will be saved in the `test-trainer` folder.

The second step is to define the model.

In [21]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning message indicates that the default BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead.

The warnings also indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head).

Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now - the `model`, the `training_args`, the training and validation datasets, our `data_collator`, and our `tokenizer`:

In [22]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

To fine-tune the model on our dataset, we have to call the `train()` method:

In [None]:
trainer.train()

This will start the fine-tuning and report the training loss every 500 steps.

It will not tell us how well (or badly) our model is performing. This is because:
1. We did not tell the `Trainer` to evaluate during training by setting `evaluation_strategy` to either `steps` (evaluate every `eval_steps`) or `epoch` (evaluate at the end of each epoch).
2. We did not provide the `Trainer` with a `compute_metrics()` function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss).

## Evaluation

The `compute_metrics()` function must take an `EvalPrediction` object (which is a named tuple with a `predictions` field and a `label_ids` field) and will return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values).

In [24]:
predictions = trainer.predict(tokenized_datasets['validation'])

print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`.

The `metrics` field just contains the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we pass a `compute_metrics()` function to the `Trainer`, this field will also contain the metrics it returns.

The `predictions.predictions` is a two-dimensional array with shape 408x2 (408 being the number of elements in the dataset we use). Those are the logits for each element of the dataset we passed to `predict()`.

To transform them into predictions that we can compare to our labels, we take the index of the maximum value along the last axis:

In [25]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

To compare those predictions to the labels, we rely on the metrics from the Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset.

In [26]:
import evaluate

metric = evaluate.load('glue', 'mrpc')

metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.7034313725490197, 'f1': 0.7966386554621848}
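
To see what these two numbers measure, the sketch below recomputes accuracy and F1 by hand on made-up logits and labels. It assumes label 1 (`equivalent`) is the positive class; `accuracy_and_f1` is a hypothetical helper, not the Evaluate implementation.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_and_f1(logits, labels, positive=1):
    """Accuracy and binary F1 computed from raw logits (illustrative only)."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up values: four examples, two logits each.
toy_logits = [[0.2, 1.3], [1.1, -0.4], [-0.5, 0.9], [0.3, 0.1]]
toy_labels = [1, 0, 0, 1]
toy_result = accuracy_and_f1(toy_logits, toy_labels)
print(toy_result)
```

Here the predicted classes are `[1, 0, 1, 0]`: two of four are correct, with one true positive, one false positive, and one false negative, so both accuracy and F1 equal 0.5.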

Wrap everything together inside the `compute_metrics()` function:

In [27]:
def compute_metrics(eval_preds):
    metric = evaluate.load('glue', 'mrpc')
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    res = metric.compute(predictions=predictions, references=labels)
    return res

To see it used in action to report metrics at the end of each epoch:

In [28]:
training_args = TrainingArguments('test-trainer',
                                  evaluation_strategy='epoch')

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note that we create a new `TrainingArguments` with its `evaluation_strategy` set to `epoch` (recent Transformers releases call this argument `eval_strategy`) and a new model; otherwise, we would just be continuing the training of the model we have already trained. To launch a new training run:

In [None]:
trainer.train()

The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16=True` in your training arguments).

# A full training

Short summary to prepare the data:

In [29]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# load raw text data
raw_datasets = load_dataset('glue', 'mrpc')
# choose a model name
checkpoint = 'bert-base-uncased'

# create a tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

# get the tokenized data
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# create a data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Prepare for training

Before writing the training loop, we need to:
1. apply some postprocessing to our `tokenized_datasets` to take care of a few things that the `Trainer` did for us automatically:
  * Remove the columns corresponding to values the model does not expect (like the `sentence1` and `sentence2` columns).
  * Rename the column `label` to `labels` (because the model expects the argument to be named `labels`).
  * Set the format of the datasets so they return PyTorch tensors instead of lists.
2. create the dataloaders we will use to iterate over batches.

In [30]:
tokenized_datasets = tokenized_datasets.remove_columns(['sentence1', 'sentence2', 'idx'])

tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

tokenized_datasets.set_format('torch')

# check
tokenized_datasets['train'].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

Define our dataloaders:

In [31]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets['train'],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator,
)

eval_dataloader = DataLoader(
    tokenized_datasets['validation'],
    batch_size=8,
    collate_fn=data_collator,
)

Check the processed data:

In [32]:
for batch in train_dataloader:
    break

{k: v.shape for k,v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 72]),
 'token_type_ids': torch.Size([8, 72]),
 'attention_mask': torch.Size([8, 72])}

Now we can create the model:

In [33]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Make sure everything is fine:

In [34]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.6111, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


All Transformers models will return the loss when `labels` are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8x2).

One last thing: define an optimizer and a learning rate scheduler:

In [35]:
from torch.optim import AdamW

# optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)



In [36]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

# scheduler
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

print(num_training_steps)

1377
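
The step count and the shape of the schedule can be checked by hand. The sketch below assumes the `'linear'` schedule scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to 0; `linear_lr` is a hypothetical helper written for this note, not the library function.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)


# 3668 training examples in batches of 8; the last batch is partial.
steps_per_epoch = math.ceil(3668 / 8)   # 459
total_steps = 3 * steps_per_epoch       # 1377

for step in (0, steps_per_epoch, 2 * steps_per_epoch, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

With no warmup, the learning rate starts at 5e-5, loses one third of that value per epoch, and reaches 0 at the last step.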


## The training loop

In [37]:
import torch

# select device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

device

device(type='cuda')

In [38]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k,v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        # update the parameters, advance the schedule, then reset the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.update(1)

This training loop will not tell us anything about how the model behaves, so we need to add an evaluation loop for that.

## The evaluation loop

In [None]:
import evaluate

metric = evaluate.load('glue', 'mrpc')

model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k,v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

    metric.add_batch(predictions=predictions, references=batch['labels'])

metric.compute()
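
The `add_batch()` / `compute()` pattern accumulates results over the whole evaluation set before computing the final score. The toy sketch below shows why that matters; `RunningAccuracy` is a hypothetical class and the predictions and labels are made up.

```python
class RunningAccuracy:
    """Toy metric that accumulates counts over batches (illustrative only)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        self.correct += sum(p == r for p, r in zip(predictions, references))
        self.total += len(references)

    def compute(self):
        return {"accuracy": self.correct / self.total}


# Made-up predictions and labels for ten examples.
all_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
all_refs = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
batch_size = 4

tracker = RunningAccuracy()
per_batch = []
for start in range(0, len(all_preds), batch_size):
    preds = all_preds[start:start + batch_size]
    refs = all_refs[start:start + batch_size]
    tracker.add_batch(preds, refs)
    per_batch.append(sum(p == r for p, r in zip(preds, refs)) / len(refs))

accumulated = tracker.compute()["accuracy"]
mean_of_batches = sum(per_batch) / len(per_batch)
print(accumulated, mean_of_batches)
```

Because the last batch is smaller, averaging per-batch accuracies gives a different value than accumulating counts over all examples, which is why the evaluation loop calls `compute()` only once at the end.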

## Supercharge your training loop with HuggingFace Accelerate

The previous training loop works fine on a single CPU or GPU. However, using the HuggingFace Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs.

Our manual training loop looks like:

In [None]:
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

# create model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# create optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

# scheduler
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k,v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        # update the parameters, advance the schedule, then reset the gradients
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.update(1)

If we use the Accelerate library, we need to modify the training loop as follows:

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

# NO need to select device
#device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
#model.to(device)

# Apply the accelerator here
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer,
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    'linear',
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        # NO need to send data to device
        #batch = {k: v.to(device) for k,v in batch.items()}

        # INSTEAD, directly apply
        outputs = model(**batch)
        loss = outputs.loss
        # Do NOT use backward() method
        #loss.backward()

        # INSTEAD,
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

        progress_bar.update(1)

Once an `Accelerator` object is instantiated, it will look at the environment and initialize the proper distributed setup.

The main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`.

Putting this in a `train.py` script will make that script runnable on any kind of distributed setup.

In our distributed setup, run the command:
```bash
accelerate config
```
which will prompt us to answer a few questions and save our answers in a configuration file used by this command:
```bash
accelerate launch train.py
```
which will launch the distributed training.

For a notebook-like environment, put the training loop in a `training_function()` and run a last cell with:
```python
from accelerate import notebook_launcher

notebook_launcher(training_function)
```