In [None]:
import numpy as np
import pandas as pd

- GLUE: General Language Understanding Evaluation - A benchmark
- MRPC: Microsoft Research Paraphrase Corpus - A dataset

# Fine-tuning a model with the Trainer API or Keras

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
# !pip install datasets evaluate transformers[sentencepiece]

<b>Trainer API</b>
- Makes it easy to fine-tune transformer models on our own dataset
- `Trainer` class 
    - takes our 
        - dataset, 
        - tokenizer, data collator
            - final data-processing such as dynamic padding (if we provide the tokenizer)
        - model, 
        - training hyperparameters
        - metrics
        - ...
    - can 
        - run the training on any kind of setup (CPU, GPU, multiple GPUs, TPUs)
        - compute predictions on any dataset
        - evaluate our model on any dataset with the provided metrics

<img src="images/Trainer-API-overview.png" style="width:850px;" title="Trainer API overview">

With the code below,
- we do not apply padding during preprocessing, because the data collator will apply dynamic padding
- we also do not rename or remove columns, or set the format of the tokenizer's output to torch tensors

All of this is taken care of automatically by the `Trainer`, which analyses the model signature
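To make the idea of dynamic padding concrete, here is a minimal NumPy sketch of what a padding collator does conceptually: each batch is padded only to the length of its own longest sequence. The helper `pad_batch`, the token IDs, and `pad_id=0` are illustrative assumptions, not the actual `DataCollatorWithPadding` implementation.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest sequence in this batch (hypothetical helper)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # real tokens
        attention_mask[row, :len(seq)] = 1    # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs: each batch gets its own padded width.
batch_a = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
batch_b = pad_batch([[101, 102], [101, 2023, 102]])
print(batch_a["input_ids"].shape)  # (2, 5)
print(batch_b["input_ids"].shape)  # (2, 3)
```

Because the padded width depends on the batch, short batches waste less computation than they would if every example were padded to a single global maximum length during preprocessing.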

<img src="images/processing-steps-for-glue-mrpc-dataset.png" style="width:850px;" title="Easy preprocessing steps for GLUE MRPC dataset">

<img src="images/model-and-training-args-before-creating-trainer.png" style="width:850px;" title="model-and-training-args-before-creating-trainer">

The only downside of the `Trainer` below is that we do not pass a `compute_metrics` function, so it will only report the `loss` by default. However, accuracy and F1 score will be the most useful metrics for us.

<img src="images/pass-everything-to-trainer-class-and-start-training.png" style="width:750px;" title="pass-everything-to-trainer-class-and-start-training">

🤗 Transformers provides a Trainer class to help you fine-tune any of the pretrained models it provides on your dataset. Once you’ve done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it <span style="color:red">will run very slowly on a CPU</span>. If you don’t have a GPU set up, you can get access to free GPUs or TPUs on Google Colab. 
- <span style="color:red">Note: When run on the free version of Google Colab, `trainer.train()` took close to 3 hours</span>, whereas it took only ~15 minutes on a Mac with an M1 Pro chip

The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Found cached dataset glue (/Users/prasanth.thangavel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached processed dataset at /Users/prasanth.thangavel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-bb21e6423b980722.arrow
Loading cached processed dataset at /Users/prasanth.thangavel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-d1e8c90b5d349f7a.arrow
Loading cached processed dataset at /Users/prasanth.thangavel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-33304e37c309912f.arrow


## Training

The first step before we can define our `Trainer` is to create a `TrainingArguments` object that will contain all the hyperparameters the `Trainer` will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.

In [8]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test-trainer")

> 💡 If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`. We will learn more about this in Chapter 4.

The second step is to define our model. As in the previous chapter, we will use the `AutoModelForSequenceClassification` class, with two labels:

In [10]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

<span style="background-color:#FFFFCC">
    <span style="color:blue">You will notice that unlike in Chapter 2, you get a warning after instantiating this pretrained model</span>. <span style="color:red">This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead</span>. <span style="color:blue">The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.</span>
</span>

Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now (the model, the `training_args`, the training and validation datasets, our `data_collator`, and our `tokenizer`):

In [11]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

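Although the training and validation datasets are already tokenized, the `tokenizer=tokenizer` argument is still passed to the `Trainer` for the following reasons:
1. Decoding and post-processing: the tokenizer remains available for padding batches (it is what the default data collator is built from) and for mapping token IDs back to text
2. Consistency and flexibility: the tokenizer is saved alongside the model checkpoints, so the saved model can later be reloaded together with the matching tokenizer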

Note also that when you pass the tokenizer as we did here, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding` as defined previously, so you can skip the line `data_collator=data_collator` in this call. It was still important to show you this part of the processing in section 2!

To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:

In [12]:
# Takes ~16 mins with Apple M1 Pro 16 GB RAM
trainer.train()

Step,Training Loss
500,0.5311
1000,0.2988


TrainOutput(global_step=1377, training_loss=0.3403487472146338, metrics={'train_runtime': 954.1732, 'train_samples_per_second': 11.532, 'train_steps_per_second': 1.443, 'total_flos': 406183858377360.0, 'train_loss': 0.3403487472146338, 'epoch': 3.0})

<span style="color:blue">This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 500 steps</span>. <span style="color:red">It won’t, however, tell you how well (or badly) your model is performing, because</span>:

1. <span style="color:red">We didn’t tell the `Trainer` to evaluate during training</span> by setting `evaluation_strategy` to either `"steps"` (evaluate every `eval_steps`) or `"epoch"` (evaluate at the end of each epoch).
2. <span style="color:red">We didn’t provide the `Trainer` with a `compute_metrics()` function to calculate a metric during said evaluation</span> (otherwise the evaluation would just have printed the training and validation loss, which is not a very intuitive number).
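As a quick sanity check on the training log above, the step counts can be reproduced with some arithmetic. This is a rough sketch under stated assumptions: the MRPC training split has 3,668 examples, the default per-device batch size is 8, training runs on a single device, and the loss is logged every 500 steps. None of these values are set explicitly in the code above.

```python
import math

num_train_examples = 3668  # assumption: size of the MRPC training split
batch_size = 8             # assumption: default per-device train batch size, single device
num_epochs = 3             # matches 'epoch': 3.0 in the TrainOutput above
logging_steps = 500        # the training loss is reported every 500 steps

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 459
print(total_steps)      # 1377
print(logged_steps)     # [500, 1000]
```

This agrees with `global_step=1377` and with the two rows (steps 500 and 1000) in the loss table.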

## Evaluation

Let’s see how we can build a useful `compute_metrics()` function and use it the next time we train. The function must take an `EvalPrediction` object (which is a named tuple with a `predictions` field and a `label_ids` field) and will return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values). To get some predictions from our model, we can use the `Trainer.predict()` method:

In [13]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


In [41]:
predictions.metrics

{'test_loss': 0.7443095445632935,
 'test_runtime': 8.931,
 'test_samples_per_second': 45.684,
 'test_steps_per_second': 5.71}

<span style="color:blue">The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`</span>. The metrics field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our `compute_metrics()` function and pass it to the `Trainer`, that field will also contain the metrics returned by `compute_metrics()`.

As you can see, `predictions` is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the logits for each element of the dataset we passed to `predict()` (as you saw in the previous chapter, all Transformer models return logits). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [51]:
preds = np.argmax(predictions.predictions, axis=-1)

In [53]:
# Here, predictions.label_ids is the same as the original labels of the validation dataset
print('Accuracy:')
print(np.mean(np.array(tokenized_datasets['validation']['label']) == preds))
print(np.mean(predictions.label_ids == preds))

Accuracy:
0.8431372549019608
0.8431372549019608


We can now also compare those preds to the labels using library functions. To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 [Evaluate](https://github.com/huggingface/evaluate/) library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

In [10]:
# !pip install evaluate

In [54]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8431372549019608, 'f1': 0.8915254237288135}

The exact results you get may vary, as the random initialization of the model head might change the metrics it achieved. Here, <span style="color:blue">we can see our model has an accuracy of ~84.31% on the validation set and an F1 score of ~89.15. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark</span>. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. That was the uncased model, and the `bert-base-uncased` checkpoint used here is uncased as well, so the small difference is more plausibly down to the random head initialization and the evaluation split than to casing.
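To see what the accuracy and F1 numbers mean, here is a minimal NumPy sketch that computes both from logits and labels by hand. It assumes a binary task where label `1` is the positive class. The function name `accuracy_and_f1` and the toy logits are made up for illustration, and this is not the code the Evaluate library runs internally.

```python
import numpy as np

def accuracy_and_f1(eval_preds):
    """Hand-rolled accuracy and binary F1 (positive class = 1); illustrative only."""
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)  # index of the largest logit per example
    labels = np.asarray(labels)
    accuracy = float(np.mean(preds == labels))
    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up logits for five examples; predicted classes are [1, 0, 1, 0, 1].
toy_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]])
toy_labels = np.array([1, 0, 0, 0, 1])
toy_metrics = accuracy_and_f1((toy_logits, toy_labels))
print(toy_metrics)  # {'accuracy': 0.8, 'f1': 0.8}
```

The function takes a `(logits, labels)` pair and returns a dictionary mapping metric names to floats, which is the same contract the `Trainer` expects from a `compute_metrics()` function.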

Wrapping everything together, we get our `compute_metrics()` function:

In [16]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

And to see it used in action to report metrics at the end of each epoch, here is how we define a new `Trainer` with this `compute_metrics()` function:



In [17]:
training_args = TrainingArguments(
    "test-trainer", 
    evaluation_strategy="epoch" # Tells Trainer to evaluate model performance at end of every epoch
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"], 
    data_collator=data_collator, 
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, # Computes and returns metrics (accuracy and f1-score)
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

Note that we create a new `TrainingArguments` with its `evaluation_strategy` set to `"epoch"` and a new model; otherwise, we would just be continuing the training of the model we have already trained. To launch a new training run, we execute:

In [18]:
# Takes ~16 mins with Apple M1 Pro 16 GB RAM
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,No log,0.39805,0.823529,0.870968
2,0.499200,0.709245,0.835784,0.89141
3,0.271200,0.696411,0.862745,0.90378


TrainOutput(global_step=1377, training_loss=0.31093509870763264, metrics={'train_runtime': 979.8104, 'train_samples_per_second': 11.231, 'train_steps_per_second': 1.405, 'total_flos': 406183858377360.0, 'train_loss': 0.31093509870763264, 'epoch': 3.0})

This time, it will report the validation loss and metrics at the end of each epoch on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.

The `Trainer` will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use `fp16=True` in your training arguments). We will go over everything it supports in Chapter 10.

This concludes the introduction to fine-tuning using the `Trainer` API. An example of doing this for most common NLP tasks will be given in Chapter 7, but for now let’s look at how to do the same thing in pure PyTorch.

> ✏️ Try it out! Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.