# **Fine-tuning a Pretrained 🤗Hugging Face Model**

## Abstract

This notebook explores how to fine-tune 🤗Hugging Face pretrained models by fine-tuning BERT to classify whether two sentences are semantically equivalent.

## Table of Contents

>[Fine-tuning a Pretrained 🤗Hugging Face Model](#scrollTo=2m7BTz7fA4Y5)

>>[Abstract](#scrollTo=lNljD8U_BBqR)

>>[Table of Contents](#scrollTo=ConqRFDeBCkh)

>>[Setup and Imports](#scrollTo=3GuaZcdLBEWo)

>>[Download the Dataset](#scrollTo=roCW9iWIBcMb)

>>[Preprocessing the Dataset](#scrollTo=RunR9-gKC0ck)

>>>[Tokenization](#scrollTo=zbFyk5YjF_jq)

>>>[Tokenization with .map()](#scrollTo=PGCXDTGnKzje)

>>>[Dynamic Padding](#scrollTo=fRfmbzhIGPXO)

>>>[Putting it all Together](#scrollTo=HSxxffePIMV7)

>>[Training](#scrollTo=roitb9bpfaRg)

>>[Inference](#scrollTo=_1YyBolOh-iM)

>>>[Obtain the Model's Logits](#scrollTo=7OAFjW-FiD_h)

>>>[Logits to Classifications](#scrollTo=xcYzs1I-iF2A)

>>[Evaluation](#scrollTo=WLRJZijoiLY5)



## Setup and Imports

In [16]:
!pip install transformers datasets evaluate -q


In [50]:
from datasets import Dataset
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import DataCollatorWithPadding
from transformers import TFAutoModelForSequenceClassification

# Import all Keras components from tensorflow.keras for consistency
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers.schedules import PolynomialDecay

import evaluate
import numpy as np

## Download the Dataset

The Microsoft Research Paraphrase Corpus ([Dolan & Brockett, 2005](https://aclanthology.org/I05-5002.pdf)) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

This is one of the 10 datasets composing the [GLUE](https://gluebenchmark.com) benchmark, which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks.



In [3]:
dataset = load_dataset("glue", "mrpc")


As shown below, a `DatasetDict` object is returned, which contains the training set, the validation set, and the test set. Each of those contains several columns (`sentence1`, `sentence2`, `label`, and `idx`) and a number of rows equal to the number of elements in that set: 3,668 sentence pairs in the training set, 408 in the validation set, and 1,725 in the test set.

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

## Preprocessing the Dataset

Preprocessing the dataset (when using a BERT transformer) consists of:

1. Adding special `[CLS]` and `[SEP]` tokens.
2. Tokenizing the sentence-pairs.
3. Generating `token_type_ids` to distinguish the two sentences.

Additionally, every sequence in a batch must be truncated or padded to the same length.
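The layout these steps produce can be sketched with a toy example. This is only an illustration: whitespace splitting stands in for BERT's WordPiece tokenization (an assumption made to keep the sketch dependency-free), and `pack_pair` is a hypothetical helper, not part of 🤗Transformers.

```python
# Toy illustration of BERT-style sentence-pair packing.
# Assumption: whitespace splitting stands in for WordPiece tokenization.

def pack_pair(sentence1, sentence2):
    """Return (tokens, token_type_ids) for a sentence pair."""
    first = sentence1.lower().split()
    second = sentence2.lower().split()
    # Segment A: [CLS] + first sentence + [SEP]
    tokens = ["[CLS]"] + first + ["[SEP]"]
    token_type_ids = [0] * len(tokens)
    # Segment B: second sentence + closing [SEP]
    tokens += second + ["[SEP]"]
    token_type_ids += [1] * (len(second) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = pack_pair("The cat sat", "A cat was sitting")
print(tokens)
print(token_type_ids)
```

The real tokenizer additionally maps each token to an integer ID and returns an `attention_mask`; the sketch only shows where the special tokens go and how the two segments are told apart.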

In [5]:
# Specify the model's desired checkpoint
checkpoint = "bert-base-uncased"

# Instantiate the corresponding tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


### Tokenization

As shown below, the tokenizer can be fed a list of sentence-pairs by giving it the list of first sentences, then the list of second sentences. This is also compatible with the padding and truncation options.

In [6]:
tokenized_dataset = tokenizer(
    dataset["train"]["sentence1"],
    dataset["train"]["sentence2"],
    padding=True,
    truncation=True,
)

# Examine the returned dict keys
tokenized_dataset.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

### Tokenization with `.map()`

This works well, but it has the disadvantage of returning a dictionary (with the keys `input_ids`, `attention_mask`, and `token_type_ids`, and values that are lists of lists). It also only works if there is enough RAM to hold the whole dataset during tokenization, whereas the datasets from the 🤗Datasets library are Apache Arrow files stored on disk, so only the requested samples are loaded into memory.

To keep the data as a dataset, the `Dataset.map()` method can be used. This also allows some extra flexibility, if more preprocessing than just tokenization is needed. The `map()` method works by applying a function on each element of the dataset.

In [7]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

The padding argument has been left out in the tokenization function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when building a batch, as then the padding is done up to the maximum length in that batch, instead of that of the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths.

Additionally, `batched` is set to `True` in the call to `map` so the function is applied to multiple elements of the dataset at once, and not on each element separately. This allows for faster preprocessing.
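To make the batched calling convention concrete, here is a minimal pure-Python sketch. `map_batched` and `count_words` are hypothetical stand-ins written for this illustration, not the Datasets implementation; the point is only that, in batched mode, the function receives a dict of lists (one list per column) and returns lists of the same length.

```python
# Toy stand-in for batched mapping; `map_batched` is a hypothetical helper,
# not part of the Datasets library.

def map_batched(rows, function, batch_size=1000):
    """Apply `function` to column-wise batches and collect the new columns."""
    output = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batch is a dict mapping each column name to a list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            output.setdefault(key, []).extend(values)
    return output


def count_words(batch):
    # Receives lists of sentences and returns one list per new column.
    return {
        "n_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


rows = [
    {"sentence1": "a b c", "sentence2": "d e"},
    {"sentence1": "f", "sentence2": "g h"},
    {"sentence1": "i j", "sentence2": "k l"},
]
result = map_batched(rows, count_words, batch_size=2)
print(result)
```

This is why `tokenize_function` can pass `example["sentence1"]` and `example["sentence2"]` straight to the tokenizer: in batched mode both are lists of sentences, which the tokenizer accepts directly.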

In [8]:
tokenized_dataset = dataset.map(tokenize_function, batched=True)


In [9]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})

### Dynamic Padding

Dynamic padding means the samples in a batch should all be padded to the maximum length inside the batch. Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. 
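The saving can be illustrated with a small sketch that counts padding tokens under both strategies. The sequence lengths are made-up values chosen for illustration, and `pad_batch` is a hypothetical helper rather than anything from 🤗Transformers.

```python
# Count padding tokens under global vs. per-batch (dynamic) padding.
# The sequence lengths below are made-up values for illustration.

def pad_batch(sequences, pad_id=0, length=None):
    """Pad every sequence to `length` (default: the longest in the batch)."""
    target = length if length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]


lengths = [5, 7, 6, 30, 8, 9]
sequences = [[1] * n for n in lengths]  # non-zero ids stand in for real tokens
batch_size = 3
batches = [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

# Global padding: every batch is padded to the longest sequence overall.
global_max = max(lengths)
global_pad = sum(
    row.count(0) for batch in batches for row in pad_batch(batch, length=global_max)
)

# Dynamic padding: each batch is padded to its own longest sequence.
dynamic_pad = sum(row.count(0) for batch in batches for row in pad_batch(batch))

print(global_pad, dynamic_pad)
```

In this toy case a single long outlier forces every sample to length 30 under global padding, while dynamic padding confines that cost to the one batch that contains it.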

The function that is responsible for putting together samples inside a batch is called a **collate** function. The default collator is a function that will just convert the samples to `tf.Tensor`. Padding has been deliberately postponed so that it's only applied as necessary on each batch and to avoid having inputs with a lot of padding. 

To do this in practice, the 🤗Transformers library provides the `DataCollatorWithPadding` class. It takes a tokenizer when instantiated (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs).

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

### Putting it all Together

Once the dataset is tokenized and a collator function is defined, `to_tf_dataset()` will wrap a `tf.data.Dataset` around the dataset, with an optional collation function. `tf.data.Dataset` is a native TensorFlow format that Keras can use for `model.fit()`, so this one method immediately converts a 🤗Dataset to a format that’s ready for training. 

In [11]:
tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

tf_validation_dataset = tokenized_dataset["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

tf_test_dataset = tokenized_dataset["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols=["labels"],
    shuffle=False,
    collate_fn=data_collator,
    batch_size=8,
)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


## Training

TensorFlow models imported from 🤗Transformers are already Keras models.

In [12]:
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Transformer models benefit from a much lower learning rate than Adam's default of 1e-3 (0.001). A learning rate of 5e-5 (0.00005), twenty times lower, is a much better starting point.

In addition to starting from a lower value, the learning rate is slowly reduced over the course of training. In the literature, this is referred to as decaying or annealing the learning rate. In Keras, the best way to do this is to use a learning rate scheduler. A good one to use is `PolynomialDecay`: despite the name, with default settings it simply linearly decays the learning rate from the initial value to the final value over the course of training.
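The linear behaviour can be sketched in plain Python. This is a simplified re-implementation of the schedule's formula with `power=1.0`, written for illustration rather than taken from Keras; `decay_steps=1000` is an arbitrary value assumed for the example.

```python
# Pure-Python sketch of a polynomial decay schedule with power=1 (linear).
# `decay_steps=1000` is an arbitrary value chosen for illustration.

def polynomial_decay(step, initial_lr=5e-5, end_lr=0.0, decay_steps=1000, power=1.0):
    """Learning rate after `step` optimizer updates."""
    step = min(step, decay_steps)  # the rate stays at end_lr once decay finishes
    fraction_left = 1 - step / decay_steps
    return (initial_lr - end_lr) * fraction_left ** power + end_lr


for step in (0, 250, 500, 1000, 1500):
    print(step, polynomial_decay(step))
```

With `power=1.0` the rate falls along a straight line from the initial value to the final value and is halfway there at the midpoint of training; a larger `power` would bend that line into a curve.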

The scheduler, though, needs to know how long training is going to be. This is computed as `num_train_steps`: the number of samples in the dataset divided by the batch size, then multiplied by the total number of epochs. Note that `tf_train_dataset` here is a batched `tf.data.Dataset`, not the original Hugging Face `Dataset`, so its `len()` is already the number of batches per epoch (roughly `num_samples // batch_size`).

In [13]:
batch_size = 8
num_epochs = 3

num_train_steps = len(tf_train_dataset) * num_epochs
lr_scheduler = PolynomialDecay(
    initial_learning_rate=5e-5, 
    end_learning_rate=0.0, 
    decay_steps=num_train_steps
)

opt = Adam(learning_rate=lr_scheduler)

In [14]:
model.compile(
    optimizer=opt,
    loss=SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(
    tf_train_dataset,
    validation_data=tf_validation_dataset,
    epochs=num_epochs
)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f932041c810>

## Inference

### Obtain the Model's Logits

In [20]:
preds = model.predict(tf_test_dataset)["logits"]
preds.shape



(1725, 2)

### Logits to Classifications

In [21]:
class_preds = np.argmax(preds, axis=1)
class_preds.shape

(1725,)
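Taking the `argmax` of the logits directly is enough because softmax preserves the ordering of its inputs, so the most likely class is the same either way. The sketch below illustrates this with made-up logits; `softmax` and `argmax` are small helpers written for this example. In MRPC, label 1 marks a pair as equivalent and label 0 as not equivalent.

```python
import math

# Made-up logits for three sentence pairs (two classes each).
example_logits = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


class_ids = [argmax(row) for row in example_logits]
probabilities = [softmax(row) for row in example_logits]

print(class_ids)
print(probabilities)
```

Softmax is only needed when calibrated-looking probabilities are wanted, for example to threshold low-confidence predictions.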

## Evaluation

In [22]:
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=class_preds, references=dataset["test"]["label"])


{'accuracy': 0.8463768115942029, 'f1': 0.887186036611324}
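For reference, the two reported numbers can be reproduced by hand. The sketch below computes accuracy and binary F1 (with class 1 as the positive class) on made-up predictions; `accuracy_and_f1` is a hypothetical helper written for this illustration, not the code used by the `evaluate` library.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 for the `positive` class."""
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    true_pos = sum(p == positive and r == positive for p, r in pairs)
    false_pos = sum(p == positive and r != positive for p, r in pairs)
    false_neg = sum(p != positive and r == positive for p, r in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up labels for illustration only.
toy_predictions = [1, 1, 0, 1, 0, 1]
toy_references = [1, 0, 0, 1, 1, 1]
scores = accuracy_and_f1(toy_predictions, toy_references)
print(scores)
```

F1 is reported alongside accuracy because it balances precision and recall on the positive class, which makes it more informative than accuracy alone when the two classes are not equally frequent.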