# Fine-tuning a pretrained model for text classification

In this notebook, we learn how to fine-tune a pretrained language model on our own dataset. In this case, we are using the IMDB dataset for sentiment analysis. You can find more info about the dataset here: https://huggingface.co/datasets/imdb.

The model we are using is DistilBERT, which is a significantly smaller and faster version of BERT, produced through a process called knowledge distillation. It is reported to retain around 97% of BERT's language understanding capabilities.

If you are using Google Colab, make sure that you are using a GPU (Runtime > Change runtime type > Hardware accelerator > GPU).

In [1]:
# Install the required libraries
!pip install transformers
!pip install datasets




If using Google Colab: Mount Google Drive to save the fine-tuned model.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Define where to save the fine-tuned model. If you are using Colab, save the model to Google Drive (as specified below) so that it survives the end of the session. Otherwise, you can use a local directory.

In [3]:
import os
output_dir = os.path.join('drive', 'My Drive', 'distilbert-finetuned-imdb')

In [4]:
# Import torch and check if GPU is available
import torch
train_on_gpu = torch.cuda.is_available()
print('Train on GPU: ', train_on_gpu)

Train on GPU:  True


## 1 Data preparation

We use the Datasets library to download the data. We then split the data into training, validation and test sets. We only use 3000 of the 25000 training examples, because fine-tuning on the full set would take too much time.

In [5]:
# Load the dataset and create the data splits
from datasets import load_dataset

imdb = load_dataset("imdb")
imdb = imdb.shuffle(seed=42)

# We use a small subset of the dataset to decrease training time: 3000 training examples and 300 validation/test examples.
train_dataset = imdb["train"].select(range(3000))
val_dataset = imdb["train"].select(range(3000, 3300))
test_dataset = imdb["test"].select(range(300))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Inspect the data to see if it looks as you expect.
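One way to carry out this inspection is to preview a few records and count the labels in a split. The sketch below uses a hypothetical helper, `summarize_split`, and a toy list standing in for the real data. It assumes each record is a dict with `text` and `label` keys, as in the IMDB splits loaded above.

```python
from collections import Counter

def summarize_split(records, n_preview=2, max_chars=60):
    """Print the first few records and return how often each label occurs."""
    counts = Counter()
    for i, record in enumerate(records):
        counts[record["label"]] += 1
        if i < n_preview:
            # Truncate long reviews so the preview stays readable
            print(record["label"], "|", record["text"][:max_chars])
    return dict(counts)

# Toy records standing in for a real split (illustrative only)
toy_split = [
    {"text": "A wonderful film with a great cast.", "label": 1},
    {"text": "Dull, predictable and far too long.", "label": 0},
    {"text": "I would happily watch it again.", "label": 1},
]

label_counts = summarize_split(toy_split)
print(label_counts)
```

Since iterating over a Datasets split yields dicts, calling `summarize_split(train_dataset)` should work the same way. A heavily skewed label count would be a sign that the subset needs to be drawn differently.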

We preprocess the data using a Transformers tokenizer, which tokenizes the text and formats it for input to the model. Transformer models use sub-word tokenizers, meaning that a token can be a whole word or a part of a word. The splitting rules and vocabulary differ between tokenizers, so it is important to use the tokenizer that matches your chosen model. Typically, the tokenizer name is the same as the model name. If this does not work, you can find the correct tokenizer name on the model card of your chosen model.

In this case, the tokenizer we use is distilbert-base-uncased (same as the model), and we specify that we want to use the fast version of the tokenizer.
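To make the idea of sub-word tokens concrete, here is a minimal sketch of greedy longest-match splitting in the WordPiece style used by BERT-family models. The function `wordpiece_tokenize` and the tiny vocabulary are hypothetical and for illustration only; the real tokenizer also lowercases, splits on punctuation and uses a vocabulary of tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary
toy_vocab = {"un", "##believ", "##able", "movie", "##s", "great"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("movies", toy_vocab))        # ['movie', '##s']
```

Because the split depends entirely on the vocabulary, feeding a model token IDs produced with a different tokenizer's vocabulary would give it meaningless input.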



In [6]:
# Instantiate the tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased", use_fast=True)

We apply the map method to tokenize the entire dataset at once. The data is passed in batches for faster tokenization.

In [7]:
# Prepare the text inputs for the model
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)
tokenized_test = test_dataset.map(preprocess_function, batched=True)


Also inspect the tokenized data to see which transformations were applied. You should see lists of token IDs and attention masks.

In [8]:
# Use a data collator to convert our samples to PyTorch tensors and pad each batch to the length of its longest sequence
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
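The padding step can be sketched in plain Python. The helper `pad_batch` below is hypothetical and only mimics the idea: every sequence in a batch is padded to the length of the longest one, and the attention mask marks padded positions with 0. The token IDs are illustrative and the pad ID of 0 is an assumption; the real collator also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest sequence."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Append pad tokens and mask them out with zeros
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

# Two toy examples of different lengths (illustrative token IDs)
toy_features = [
    {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2023, 3185, 2001, 102], "attention_mask": [1, 1, 1, 1, 1]},
]

batch = pad_batch(toy_features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which is why the tokenization step above only truncates and does not pad.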

## 2 Training the model

In [9]:
# Load DistilBERT with a two-class classification head and move it to the GPU if one is available
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

if train_on_gpu:
  model = model.to('cuda')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [10]:
# Define the evaluation metrics
import numpy as np
from datasets import load_metric

# Load the metrics once, rather than on every evaluation call
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the reference labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
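To see what these two metrics measure, the sketch below recomputes them with NumPy only. The helper `accuracy_and_f1` is hypothetical, and it assumes the binary setting with class 1 as the positive class; it is meant to illustrate the arithmetic, not to replace the library metrics.

```python
import numpy as np

def accuracy_and_f1(logits, labels):
    """Compute accuracy and binary F1 (positive class = 1) from raw logits."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    preds = np.argmax(logits, axis=-1)

    accuracy = float(np.mean(preds == labels))

    tp = int(np.sum((preds == 1) & (labels == 1)))  # true positives
    fp = int(np.sum((preds == 1) & (labels == 0)))  # false positives
    fn = int(np.sum((preds == 0) & (labels == 1)))  # false negatives
    denom = 2 * tp + fp + fn
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Toy logits for four examples; the predicted classes are [0, 1, 1, 0]
toy_logits = [[2.0, 0.1], [0.3, 1.2], [0.2, 0.9], [1.5, 0.4]]
toy_labels = [0, 1, 0, 1]

metrics = accuracy_and_f1(toy_logits, toy_labels)
print(metrics)  # {'accuracy': 0.5, 'f1': 0.5}
```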

We use the Trainer class for fine-tuning. Trainer is specifically optimised for training models from the Transformers library. If you prefer to write your own training loop, that is also possible. More info here: https://huggingface.co/docs/transformers/training.

We also specify the training arguments, which define the hyperparameters and the logging, evaluation and checkpointing strategies. Since we are only training for two epochs on a modest number of examples, we evaluate every 50 steps so that we can monitor progress. Take a look at the documentation if you want to understand the arguments better.
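A quick calculation shows how many optimizer steps this setup produces and when evaluation is triggered. The helper `training_schedule` is a hypothetical sketch that assumes a single device and approximates the Trainer's bookkeeping; exact counts may differ slightly between library versions.

```python
import math

def training_schedule(num_examples, batch_size, num_epochs, eval_steps, grad_accum_steps=1):
    """Estimate the total number of optimizer steps and the steps at which evaluation runs."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # With gradient accumulation, several batches make up one optimizer step
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_steps = updates_per_epoch * num_epochs
    eval_points = list(range(eval_steps, total_steps + 1, eval_steps))
    return total_steps, eval_points

# 3000 training examples, batch size 16, 2 epochs, evaluation every 50 steps
total_steps, eval_points = training_schedule(3000, 16, 2, 50)
print(total_steps)   # 376
print(eval_points)   # [50, 100, 150, 200, 250, 300, 350]
```

With roughly 376 steps in total, evaluating every 50 steps gives seven intermediate measurements, enough to see whether the validation loss is still falling.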

In [11]:

!pip install accelerate -U



In [12]:
!pip install transformers[torch]



In [13]:
# Define a new Trainer with all the objects we constructed so far
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy='steps',
    logging_steps=50,
    eval_steps=50,
    save_steps=200
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)





In [16]:
# Train and save the model
trainer.train()
trainer.save_model(output_dir=output_dir)


## 3 Testing the model

In [15]:
# Compute the evaluation metrics
trainer.evaluate(tokenized_test)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


{'eval_loss': 0.3342441916465759,
 'eval_accuracy': 0.86,
 'eval_f1': 0.8636363636363636,
 'eval_runtime': 5.7651,
 'eval_samples_per_second': 52.037,
 'eval_steps_per_second': 3.296,
 'epoch': 2.0}

## 4 Improving the results

When fine-tuning a transformer model, we have far less flexibility than when training a neural network from scratch, because the architecture of the pretrained model is already fixed. We can still change the training hyperparameters.

Try to get better results by varying training parameters such as the learning rate, the weight decay or the number of epochs. You can also try changing the batch size, but increasing it significantly might cause an out-of-memory error on the GPU.

Once you find the best combination of hyperparameters, try training on more data (remember: we only used a subset for faster processing). Does more data improve the results?
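A systematic way to vary several hyperparameters is to enumerate every combination of a small grid. The sketch below only builds the list of configurations; the search space and the helper `expand_grid` are hypothetical, and the values are illustrative rather than recommendations.

```python
from itertools import product

# Hypothetical search space; the values are illustrative, not recommendations
search_space = {
    "learning_rate": [2e-5, 3e-5, 5e-5],
    "weight_decay": [0.0, 0.01],
    "num_train_epochs": [2, 3],
}

def expand_grid(space):
    """Return one dict per combination of the values in the search space."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]

configs = expand_grid(search_space)
print(len(configs))  # 12
print(configs[0])
```

Each dict could then be unpacked into `TrainingArguments(output_dir=..., **config)`. For a fair comparison, every run should start from a freshly loaded pretrained model rather than from one that has already been fine-tuned.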

In [None]:
from transformers import TrainingArguments, Trainer

# Adjust hyperparameters
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=3e-5,  # Experiment with different learning rates
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,  # Increase the number of epochs
    weight_decay=0.01,  # Experiment with different weight decay values
    evaluation_strategy='steps',
    logging_steps=100,
    eval_steps=100,
    save_steps=200,
    warmup_steps=500,  # Add warmup steps
    gradient_accumulation_steps=2,  # Use gradient accumulation
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # Ensure this is correctly defined
)
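It is worth checking how `warmup_steps=500` interacts with the length of this run. The sketch below assumes a linear schedule with warmup (linear ramp-up, then linear decay) and a single device. The function `linear_warmup_lr` is a hypothetical re-implementation for illustration, and the total of 282 steps is an estimate derived from 3000 examples, batch size 16, gradient accumulation of 2 and 3 epochs.

```python
import math

def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # Afterwards, decay linearly to 0 at the final step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Estimated optimizer steps: 188 batches per epoch, 2 batches per update, 3 epochs
estimated_total_steps = (math.ceil(3000 / 16) // 2) * 3
final_lr = linear_warmup_lr(estimated_total_steps, 3e-5, 500, estimated_total_steps)

print(estimated_total_steps)  # 282
print(final_lr)
```

Under these assumptions the run ends while the warmup is still in progress, so the learning rate never reaches the configured 3e-5 and peaks at a little over half of it. A warmup of a few dozen steps would be more in proportion to a run of this length.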



In [18]:
# Train and save the model
trainer.train()
trainer.save_model(output_dir=output_dir)

| Step | Training Loss | Validation Loss | Accuracy | F1 |
|------|---------------|-----------------|----------|----|
| 100 | 0.1079 | 0.312467 | 0.906667 | 0.90604 |
| 200 | 0.1074 | 0.366444 | 0.893333 | 0.888889 |




In [19]:
trainer.evaluate(tokenized_test)



{'eval_loss': 0.37853336334228516,
 'eval_accuracy': 0.8633333333333333,
 'eval_f1': 0.8664495114006515,
 'eval_runtime': 5.8699,
 'eval_samples_per_second': 51.108,
 'eval_steps_per_second': 3.237,
 'epoch': 3.0}

# Analysis of the two models

1. Eval loss:

    First training run: 0.3342
    Second training run: 0.3785
    Analysis: The eval loss is slightly higher in the second run. This may suggest that the second model generalises slightly less well to the held-out test set (these numbers come from `trainer.evaluate(tokenized_test)`, not from the validation set).

2. Eval accuracy:

    First training run: 0.86
    Second training run: 0.8633
    Analysis: The accuracy is almost the same, with a slight improvement in the second run. On 300 test examples this is 258 versus 259 correct predictions, which suggests that the hyperparameter changes had no significant effect on accuracy.

3. Eval F1 score:

    First training run: 0.8636
    Second training run: 0.8664
    Analysis: The F1 score is slightly better in the second run, which indicates that the balance between precision and recall improved marginally.

4. Eval runtime:

    First training run: 5.7651 seconds
    Second training run: 5.8699 seconds
    Analysis: The evaluation took about 0.1 seconds longer after the second run. This is most likely run-to-run noise: the extra epoch and gradient accumulation affect training only, while evaluation runs the same architecture over the same 300 examples.

5. Eval samples per second and eval steps per second:

    First training run: 52.037 samples/s, 3.296 steps/s
    Second training run: 51.108 samples/s, 3.237 steps/s
    Analysis: Evaluation throughput is slightly lower after the second run. These figures are the eval runtime expressed differently (300 samples divided by the runtime), so they carry the same small, probably incidental, difference and do not reflect the training parameters.

### Conclusion:

The second run, with a higher learning rate, more epochs, warmup steps and gradient accumulation, yields a slightly higher F1 score and accuracy, although its eval loss is slightly higher. The differences are small overall, too small on a 300-example test set to conclude that the more complex training configuration is better. At most they hint at a marginally better balance between precision and recall (higher F1) at the cost of a slightly higher eval loss. Note also that the second `Trainer` was given the same `model` object as the first, so unless the model is reloaded from `distilbert-base-uncased` beforehand, the second run continues from already fine-tuned weights and the comparison is not like for like.
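How small the gap between the two runs is can be checked with a few lines of arithmetic. The sketch below converts the reported test accuracies back into counts of correct predictions on the 300 test examples, and adds a rough binomial standard error as a yardstick. That standard error treats the test examples as independent draws and is only an approximation.

```python
import math

n_test = 300
accuracy_run_1 = 0.86
accuracy_run_2 = 0.8633333333333333

# Convert each accuracy back into a number of correct predictions
correct_run_1 = round(accuracy_run_1 * n_test)
correct_run_2 = round(accuracy_run_2 * n_test)
print(correct_run_1, correct_run_2)  # 258 259

# Rough binomial standard error of an accuracy of 0.86 measured on 300 examples
standard_error = math.sqrt(accuracy_run_1 * (1 - accuracy_run_1) / n_test)
print(round(standard_error, 3))  # 0.02
```

The two runs differ by one test example, while the measurement uncertainty is around two percentage points, so a larger test subset would be needed to separate the two configurations.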



## BONUS: Fine-tune a model on a different dataset/task of your choice.