Material taken from https://huggingface.co/learn/nlp-course/chapter3/1?fw=pt

In [1]:
!pip install transformers[torch]
!pip install datasets


# Overview
In this notebook we will explore how to fine-tune the pre-trained BERT model on a sentence-pair classification task.

We will use as an example the **MRPC** (Microsoft Research Paraphrase Corpus) dataset, which consists of 5,801 pairs of sentences, with a label indicating if they are paraphrases or not (i.e., if both sentences mean the same thing). More specifically, this is one of the 10 datasets composing the [GLUE benchmark](https://gluebenchmark.com/), which is an academic benchmark that is used to measure the performance of ML models across 10 different text classification tasks.

# Data processing

The 🤗 Datasets library provides a very simple command to download this dataset:

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [3]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

We can see the labels are already integers, so we won’t have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of our raw_train_dataset. This will tell us the type of each column:

In [4]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}
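
The order of names in the ClassLabel feature is what defines the mapping: the integer label is simply an index into that list. A minimal sketch in plain Python, with the list copied from the output above (the helper names are illustrative and not part of 🤗 Datasets):

```python
# Label names copied from the ClassLabel output above; the integer label is the list index.
label_names = ["not_equivalent", "equivalent"]


def label_to_name(label: int) -> str:
    """Hypothetical helper: map an integer label to its class name."""
    return label_names[label]


def name_to_label(name: str) -> int:
    """Hypothetical helper: map a class name back to its integer label."""
    return label_names.index(name)


print(label_to_name(1))                  # equivalent
print(name_to_label("not_equivalent"))   # 0
```

So the first training example, which has label 1, is a pair of paraphrases. On the real feature object, raw_train_dataset.features["label"].int2str(1) should perform the same lookup.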

To preprocess the dataset, we need to convert the text to numbers the model can make sense of. As you saw in the previous chapter, this is done with a tokenizer. We can feed the tokenizer one sentence or a list of sentences, so we can directly tokenize all the first sentences and all the second sentences of each pair like this:

In [5]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we can’t just pass two sequences to the model and get a prediction of whether the two sentences are paraphrases or not. We need to handle the two sequences as a pair, and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

In [6]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

token_type_ids tells the model which part of the input is the first sentence and which is the second sentence.

If we decode the IDs inside input_ids back to words:

In [7]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

So we see the model expects the inputs to be of the form [CLS] sentence1 [SEP] sentence2 [SEP] when there are two sentences.

As you can see, the parts of the input corresponding to [CLS] sentence1 [SEP] all have a token type ID of 0, while the other parts, corresponding to sentence2 [SEP], all have a token type ID of 1.
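
To make the layout concrete, here is a small sketch that rebuilds the token sequence and the token type IDs by hand from two already-tokenized sentences. It only mimics the layout shown above; build_pair is a hypothetical helper, not the tokenizer's actual implementation.

```python
def build_pair(tokens_a, tokens_b):
    """Illustrative re-creation of the BERT pair layout (not the tokenizer's real code)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence1 and the first [SEP];
    # segment 1 covers sentence2 and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


first = ["this", "is", "the", "first", "sentence", "."]
second = ["this", "is", "the", "second", "one", "."]

tokens, token_type_ids = build_pair(first, second)
print(tokens)
print(token_type_ids)
```

With six tokens in each sentence this gives eight 0s followed by seven 1s, the same pattern as the token_type_ids returned by the tokenizer in the cell above.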

Note that if you select a different checkpoint, you won’t necessarily have the token_type_ids in your tokenized inputs (for instance, they’re not returned if you use a DistilBERT model). They are only returned when the model will know what to do with them, because it has seen them during its pretraining.

Now that we have seen how our tokenizer can deal with one pair of sentences, we can use it to tokenize our whole dataset: we can feed the tokenizer a list of pairs of sentences by giving it the list of first sentences, then the list of second sentences.

In [8]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)
tokenized_dataset.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

This works well, but it has the disadvantage of returning a dictionary (with our keys, input_ids, attention_mask, and token_type_ids, and values that are lists of lists). It will also only work if you have enough RAM to store your whole dataset during the tokenization.

A common way around this in PyTorch is to wrap the raw data in a custom "dataset" class that returns one untokenized example at a time, and to tokenize later, when batches are built.

In [9]:
raw_datasets["train"]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

In [10]:
from torch.utils.data import Dataset, DataLoader
import torch

class MyDataset(Dataset):

    def __init__(self, data):
        self.sentence1 = data['sentence1']
        self.sentence2 = data['sentence2']
        self.label = data['label']
        self.idx = data['idx']

    def __len__(self):
        return len(self.sentence1)

    def __getitem__(self, idx):
        return {
            'sentence1': self.sentence1[idx],
            'sentence2': self.sentence2[idx],
            'label': self.label[idx]
        }

In [11]:
train_dataset = MyDataset(raw_datasets["train"])
train_dataset

<__main__.MyDataset at 0x7e52fac57c40>

In [12]:
train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1}

In [13]:
valid_dataset = MyDataset(raw_datasets["validation"])

Note that we haven’t applied the tokenizer inside the dataset, because we want to encode a batch of sentence pairs at once rather than a single pair at a time. Furthermore, padding every sample to the maximum length in the dataset is inefficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

The function that is responsible for putting together samples inside a batch is called a **collate function**. It’s an argument you can pass when you build a **DataLoader**, the default being a function that will just convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries). This won’t be possible in our case since the inputs we have won’t all be of the same size. We have deliberately postponed applying the tokenizer (which also takes care of the padding) so that it runs only as needed on each batch, avoiding over-long inputs with a lot of padding. This will also speed up training by quite a bit.
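
The saving from per-batch padding can be illustrated with a toy calculation. The sketch below uses made-up sequence lengths and two hypothetical helpers (pad_batch and count_padding); it does not use the tokenizer and only counts how many padding tokens each strategy would add.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


def count_padding(sequences, target_length):
    """Number of padding tokens needed to bring every sequence to target_length."""
    return sum(target_length - len(seq) for seq in sequences)


# Made-up token-ID sequences with lengths 3, 4, 10 and 12.
lengths = [3, 4, 10, 12]
dataset = [[1] * n for n in lengths]
batches = [dataset[:2], dataset[2:]]  # two batches of two samples each

# Strategy 1: pad everything to the longest sequence in the whole dataset.
global_padding = count_padding(dataset, max(lengths))
# Strategy 2: pad each batch only to its own longest sequence.
dynamic_padding = sum(
    count_padding(batch, max(len(seq) for seq in batch)) for batch in batches
)

print(global_padding, dynamic_padding)  # 19 3
print(pad_batch(batches[0]))
```

In this toy case per-batch padding adds 3 padding tokens instead of 19. With shuffled batches the saving varies from batch to batch, but a batch never needs more padding than the dataset-wide strategy would give it.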

To do this in practice, we have to define a collate function that will apply the correct amount of padding to the items of the dataset we want to batch together.

In [14]:
class DataCollator:
    def __init__(self, tokenizer, max_length: int):
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __call__(self, examples):
        sentence1 = [example['sentence1'] for example in examples]
        sentence2 = [example['sentence2'] for example in examples]
        label = [example['label'] for example in examples]

        batch = self.tokenizer(
            sentence1, sentence2, padding=True, truncation=True,
            max_length=self.max_length, return_tensors='pt'
        )

        batch['labels'] = torch.LongTensor(label)

        return batch

In [15]:
max_length = 128
data_collator = DataCollator(tokenizer, max_length)

In [16]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=data_collator)
valid_loader = DataLoader(valid_dataset, batch_size=8, collate_fn=data_collator)

In [17]:
for batch in train_loader:
  break
batch

{'input_ids': tensor([[  101,  1996, 24265,  5571,  2008,  1049,  6844, 21041,  2001,  2920,
          1999, 12929,  2005,  1996,  2886,  2127,  1996,  2203,  1998,  2001,
          5204,  1997,  1998,  3569,  1996, 17857,  1005,  6355,  3289,  1012,
           102, 19608,  2360,  1049,  6844, 21041,  2001,  2920,  1999,  1996,
         12929,  2005,  1996,  2886,  2127,  1996,  2345,  2617,  1998,  2008,
          2002,  2001,  5204,  1997,  1998,  3569,  1996, 17857,  1005,  6355,
          3289,  1012,   102,     0,     0,     0,     0,     0,     0],
        [  101,  2004,  1997,  9317,  1010,  2045,  2020,  3515, 15596, 18906,
          2015,  3572,  1999,  1996,  4361,  2555,  1012,   102,  2004,  1997,
          6928,  1010,  2045,  2020,  5764, 15596,  3572,  1999,  1998,  2105,
          4361,  1010,  1037,  2103,  1997,  1018,  2454,  2111,  1012,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0, 

# Fine-tuning
We will explore two approaches for performing the fine-tuning. The first uses the Trainer class provided by the Transformers library, while the second implements a custom training loop.

## With HuggingFace's Transformers

🤗 Transformers provides a Trainer class to help you fine-tune any of the pretrained models it provides on your dataset. Once you’ve done all the data preprocessing work in the last section, you have just a few steps left to define the Trainer. The hardest part is likely to be preparing the environment to run Trainer.train().

The first step before we can define our Trainer is to create a TrainingArguments object that will contain all the hyperparameters the Trainer will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning. Here we also set remove_unused_columns=False so that the raw sentence1 and sentence2 fields reach our collate function instead of being dropped beforehand.

In [18]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-trainer",
    remove_unused_columns=False,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir='./logs',
    logging_steps=100
)

The second step is to define our model: we will use the AutoModelForSequenceClassification class, with two labels.

In [19]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Once we have our model, we can define a Trainer by passing it the objects constructed so far (the model, the training_args, the training and validation datasets, and our data_collator):

In [20]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator
)

To fine-tune the model on our dataset, we just have to call the train() method of our Trainer:

In [21]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100  | 0.6371        |
| 200  | 0.5963        |
| 300  | 0.4964        |
| 400  | 0.4689        |


TrainOutput(global_step=459, training_loss=0.5355236618607132, metrics={'train_runtime': 78.6157, 'train_samples_per_second': 46.657, 'train_steps_per_second': 5.839, 'total_flos': 135411749085120.0, 'train_loss': 0.5355236618607132, 'epoch': 1.0})

The output above only reports the training loss; the Trainer won’t tell you how well (or badly) your model is performing on held-out data. This is because:

*   We didn’t tell the Trainer to evaluate during training by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
*   We didn’t provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).

Let’s see how we can build a useful compute_metrics() function and use it the next time we train. The function must take an EvalPrediction object (which is a named tuple with a predictions field and a label_ids field) and will return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values). To get some predictions from our model, we can use the Trainer.predict() command:

In [22]:
predictions = trainer.predict(valid_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The output of the predict() method is another named tuple with three fields: predictions, label_ids, and metrics. The metrics field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our compute_metrics() function and pass it to the Trainer, that field will also contain the metrics returned by compute_metrics().

As you can see, predictions is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the logits for each element of the dataset we passed to predict(). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [23]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
preds

array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,

We can now compare those preds to the labels.

In [24]:
from sklearn.metrics import f1_score

def my_custom_metric(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=-1)
    accuracy = np.mean(labels == preds)
    f1 = f1_score(labels, preds)

    return {"accuracy": accuracy, "f1": f1}
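
To see what the two numbers mean, the sketch below recomputes accuracy and F1 by hand on a made-up set of six labels, treating label 1 (equivalent) as the positive class, which is our understanding of what f1_score does by default for binary labels. The function name and the toy data are illustrative only.

```python
def accuracy_and_f1(labels, preds):
    """Accuracy and binary F1 (positive class = 1) computed from raw counts."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    return {"accuracy": accuracy, "f1": f1}


# Made-up labels and predictions, not taken from MRPC.
toy_labels = [1, 1, 1, 0, 0, 1]
toy_preds = [1, 1, 0, 0, 1, 1]

toy_scores = accuracy_and_f1(toy_labels, toy_preds)
print(toy_scores)
```

Here four of six predictions are correct (accuracy of about 0.67), while precision and recall are both 3/4, giving an F1 of 0.75. F1 ignores true negatives, so the two metrics can differ noticeably when one class dominates.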

In [25]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test-trainer",
    remove_unused_columns=False,
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch"
)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=data_collator,
    compute_metrics=my_custom_metric,
)
trainer.train()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


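| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | 0.543         | 0.429231        | 0.79902  | 0.867314 |
| 2     | 0.3983        | 0.470806        | 0.838235 | 0.886986 |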


TrainOutput(global_step=918, training_loss=0.4835704287152924, metrics={'train_runtime': 150.0961, 'train_samples_per_second': 48.875, 'train_steps_per_second': 6.116, 'total_flos': 270291109394160.0, 'train_loss': 0.4835704287152924, 'epoch': 2.0})

## With PyTorch

In [26]:
max_length = 128
data_collator = DataCollator(tokenizer, max_length)

In [27]:
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=data_collator)
valid_loader = DataLoader(valid_dataset, batch_size=8, collate_fn=data_collator)

In [28]:
for batch in train_loader:
    break

In [29]:
import torch.nn as nn
from transformers import AutoModel

class MyModelForSequenceClassification(nn.Module):
    def __init__(self, model_name: str, num_classes: int):
        super().__init__()
        # Pre-trained encoder without any task-specific head
        self.pt_model = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(p=0.3)
        # Linear layer mapping the encoder's hidden size to the number of classes
        self.classifier = nn.Linear(self.pt_model.config.hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.pt_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        # Use the final hidden state of the [CLS] token (position 0) to represent the pair
        cls_output = outputs.last_hidden_state[:, 0, :]
        # Alternative: cls_output = outputs[1]  (BERT's pooled output)
        cls_output = self.dropout(cls_output)
        logits = self.classifier(cls_output)

        # Return (loss, logits) when labels are given, otherwise only the logits
        if labels is not None:
            loss = self.loss_fn(logits, labels)
            return loss, logits
        return logits

In [30]:
mymodel = MyModelForSequenceClassification(model_name=checkpoint, num_classes=2)
outputs = mymodel(**batch)
print(outputs[0], outputs[1].shape) # Loss, Logits

tensor(0.7801, grad_fn=<NllLossBackward0>) torch.Size([8, 2])
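
As a sanity check on that first loss value, it helps to know what cross-entropy gives when a two-class model is completely undecided. The sketch below computes the loss for a single example in plain Python; cross_entropy is an illustrative helper, not the PyTorch implementation, which by default also averages over the batch.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


# Equal logits mean the model assigns probability 0.5 to each class.
uniform_loss = cross_entropy([0.0, 0.0], 1)
print(round(uniform_loss, 4))  # 0.6931, i.e. ln(2)
```

A freshly initialized classification head is expected to start in the neighborhood of ln(2) ≈ 0.693, so 0.7801 is plausible for an untrained head; the exact value depends on the random initialization, the dropout mask and the batch drawn.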


We’re almost ready to write our training loop! We’re just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate what the Trainer was doing by hand, we will use the same defaults.

In [31]:
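from torch.optim import AdamW

# transformers.AdamW is deprecated in favor of torch.optim.AdamW.
# weight_decay=0.0 matches the Trainer default (torch's own default is 0.01).
optimizer = AdamW(mymodel.parameters(), lr=5e-5, weight_decay=0.0)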

from transformers import get_scheduler

num_epochs = 2
num_training_steps = num_epochs * len(train_loader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

918
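
The number of steps and the shape of the schedule can be checked with a few lines of plain Python. The sketch below reflects our understanding of the "linear" schedule (optional linear warmup, then linear decay to zero); linear_lr is a hypothetical helper and not the function used by get_scheduler.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)


# 3668 training pairs in batches of 8: the last batch is only half full.
steps_per_epoch = math.ceil(3668 / 8)
total_steps = 2 * steps_per_epoch
print(steps_per_epoch, total_steps)  # 459 918

for step in (0, 459, 918):
    print(step, linear_lr(step, 5e-5, total_steps))
```

With no warmup, the learning rate starts at 5e-5, is halved after the first epoch (step 459), and reaches 0 at step 918.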




In [32]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
mymodel.to(device)
device

device(type='cuda')

In [33]:
from sklearn.metrics import f1_score

def my_custom_metric(preds, labels):
    accuracy = (preds == labels).float().mean().item()
    f1 = f1_score(labels.cpu(), preds.cpu())

    return {"accuracy": accuracy, "f1": f1}

In [34]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    mymodel.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = mymodel(**batch)
        loss = outputs[0]
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)


    mymodel.eval()
    preds = []
    labels = []
    for batch in valid_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = mymodel(**batch)

        logits = outputs[1]
        predictions = torch.argmax(logits, dim=-1)
        preds.append(predictions)
        labels.append(batch["labels"])

    preds = torch.cat(preds)
    labels = torch.cat(labels)

    scores = my_custom_metric(preds, labels)
    print(f"[EPOCH: {epoch}] {scores}")

[EPOCH: 0] {'accuracy': 0.8774510025978088, 'f1': 0.9125874125874125}
[EPOCH: 1] {'accuracy': 0.8553921580314636, 'f1': 0.8981001727115717}
