<a href="https://colab.research.google.com/github/nyp-sit/iti107-2024s2/blob/main/session-5/fine-tune-bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT for Text Classification

One way to use BERT for a downstream task such as text classification is to fine-tune the pretrained model.

In this lab, we will take a pretrained DistilBERT model and fine-tune it on custom training data for a text classification task.

At the end of this session, you will be able to:

- prepare data and use the model-specific tokenizer to format it for use by the model
- configure the transformer model for fine-tuning
- train the model for binary and multi-class text classification

## Install Transformers and other libraries

If you are running this notebook in Google Colab, you will need to install the Hugging Face transformers library as it is not part of the standard environment.


In [None]:
%%capture
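# Quote each requirement so the shell does not treat ">=" as an output redirect
!pip install "datasets>=2.18.0" "transformers>=4.38.2" "sentence-transformers>=2.5.1" "setfit>=1.0.3" "accelerate>=0.27.2" "seqeval>=1.2.2" evaluate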

## Prepare Dataset

The training set has 40,000 samples. We will use only a small subset (2,500 samples) to fine-tune our pretrained model. Similarly, we will evaluate the model on a smaller test set of 500 samples.

In [None]:
import numpy as np
from datasets import load_dataset

# URLs of the train and test CSV files
test_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_test.csv'
train_data_url = 'https://nyp-aicourse.s3-ap-southeast-1.amazonaws.com/datasets/imdb_train.csv'

train_data = load_dataset('csv', data_files=train_data_url, split="train").shuffle(seed=128).select(range(2500))
test_data = load_dataset('csv', data_files=test_data_url, split="train").shuffle(seed=128).select(range(500))

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer that matches the pretrained model
model_id = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

### Tokenization

Let us take a closer look at the output of the tokenization process.

The tokenizer returns a dictionary with two items, 'input_ids' and 'attention_mask'. 'input_ids' contains the IDs of the tokens, while 'attention_mask' marks real tokens with 1 and padded positions with 0. If you use a BERT tokenizer instead, there is an additional item called 'token_type_ids'.

For the example sentence, a word such as 'Transformer' may be broken up into sub-word tokens such as 'Trans' and '##former'. Because this model is uncased, the tokens you see are lowercased, and the exact split depends on the vocabulary. The '##' prefix means that the token should be attached to the previous one.

We also see that the tokenizer prepended [CLS] (token_id=101) to the beginning of the token sequence and appended [SEP] (token_id=102) at the end.

In [None]:
test_sentence = "Transformer is really good for Natural Language Processing."

encoding = tokenizer(test_sentence, padding=True, truncation=True)
print(f"Encoding keys:  {encoding.keys()}\n")

print(f"token ids: {encoding['input_ids']}\n")
print(f"attention_mask: {encoding['attention_mask']}\n")
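# convert_ids_to_tokens shows the individual sub-word tokens, including any '##' pieces
print(f"tokens: {tokenizer.convert_ids_to_tokens(encoding['input_ids'])}\n")
# decode merges the sub-word tokens back into a single string
print(f"decoded: {tokenizer.decode(encoding['input_ids'])}")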
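
To make the '##' convention concrete, here is a minimal sketch of greedy longest-match-first sub-word splitting in plain Python. The vocabulary `TOY_VOCAB` and the helpers `wordpiece` and `toy_tokenize` are made up for illustration. The real DistilBERT tokenizer has a vocabulary of 30,522 entries and does more careful text normalization, so its splits may differ.

```python
# Toy vocabulary for illustration only; this is NOT the DistilBERT vocabulary.
TOY_VOCAB = {"[CLS]", "[SEP]", "[UNK]", "trans", "##former", "##s", "is", "good"}


def wordpiece(word, vocab):
    """Split one lowercase word by repeatedly taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the '##' prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word becomes the unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(sentence, vocab=TOY_VOCAB):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


toy_tokens = toy_tokenize("Transformer is good")
print(toy_tokens)
```

With this toy vocabulary, 'transformer' is not a single entry, so it is split into 'trans' followed by the continuation piece '##former'.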

### Create the tokenized dataset

We first convert the sentiment label from text to a numeric label and store it in a new field called 'label', because the model uses the 'label' key as the target.

We also use the model's tokenizer (here, the DistilBERT tokenizer) to produce inputs suitable for the DistilBERT model, i.e. the input_ids and the attention_mask. The tokenizer automatically adds the [CLS] token at the front of the sequence of token IDs and the [SEP] token at the end, and it builds the attention mask that distinguishes real tokens from padded positions.

We also specify the DataCollator to use. Data collators are objects that will form a batch by using a list of dataset elements as input. These elements are of the same type as the elements of train_dataset or eval_dataset. To be able to build batches, data collators may apply some processing (like padding).

In [None]:
from transformers import DataCollatorWithPadding

# Pad to the longest sequence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def convert_label(sample):
    """Map the text sentiment to a numeric label: 'negative' -> 0, anything else -> 1"""
    return {
        "label" : 0 if sample['sentiment'] == 'negative' else 1
    }

def preprocess_function(examples):
    """Tokenize input data"""
    return tokenizer(examples["review"], truncation=True)


# Tokenize train/test data
tokenized_train = train_data.map(convert_label).map(preprocess_function, remove_columns=['review', 'sentiment'], batched=True)
tokenized_test = test_data.map(convert_label).map(preprocess_function, remove_columns=['review', 'sentiment'], batched=True)
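
The idea behind padding per batch can be sketched in plain Python. This is only an illustration of the concept, not the library's implementation: `pad_batch` and `toy_features` are hypothetical names, and the token IDs other than 101 ([CLS]) and 102 ([SEP]) are arbitrary. The sketch assumes a padding ID of 0.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example to the longest sequence in this batch only."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Two tokenized examples of different lengths (arbitrary IDs between 101 and 102)
toy_features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 3185, 2001, 102]},
]

padded = pad_batch(toy_features)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps short batches short and avoids wasted computation on padding.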

In [None]:
print(tokenized_test)

We will define a compute_metrics() function to calculate the necessary metrics. With compute_metrics we can define any number of metrics that we are
interested in and that can be printed out or logged during training. This is
especially helpful during training as it allows for detecting overfitting
behavior.

In [None]:
import numpy as np
import evaluate

# Load the metric once, rather than on every evaluation call
f1_metric = evaluate.load("f1")


def compute_metrics(eval_pred):
    """Calculate F1 score"""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)

    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"f1": f1}
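
To see what this metric measures, the following sketch reproduces the two steps by hand in plain Python: taking the argmax of each row of logits, then computing the binary F1 score as the harmonic mean of precision and recall for the positive class (label 1). The helper names and the toy logits are invented for illustration, and the sketch assumes the usual binary F1 definition.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_f1(predictions, references, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up logits for four reviews: column 0 = negative, column 1 = positive
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 0.2]]
toy_labels = [0, 1, 0, 1]

toy_predictions = argmax_rows(toy_logits)
toy_f1 = binary_f1(toy_predictions, toy_labels)
print(toy_predictions, toy_f1)
```

In this toy case there is one true positive, one false positive, and one false negative, so precision and recall are both 0.5.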

### Train the Model

We will instantiate a pretrained model 'distilbert-base-uncased', using AutoModelForSequenceClassification.

We define the number of labels that we want to predict beforehand. This is
necessary to create the feedforward neural network that is applied on top of
our pretrained model:

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

Transformer models benefit from a much lower learning rate than the AdamW default of 0.001. Here we start training at 2e-5 (0.00002) and gradually reduce the learning rate over the course of training; the Trainer uses a linear decay schedule by default. In the literature, you will sometimes see this referred to as decaying or annealing the learning rate.
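
As a rough sketch of what linear decay looks like, the function below computes the learning rate at a given optimizer step. The name `linear_decay_lr` is hypothetical, and the step count assumes the 2,500 training samples from above, a batch size of 16 on a single device, and one epoch with no warmup.

```python
import math


def linear_decay_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` under optional linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed setup: 2,500 samples, batch size 16, one epoch
total_steps = math.ceil(2500 / 16)

for step in (0, total_steps // 2, total_steps):
    print(f"step {step:3d}: lr = {linear_decay_lr(step, total_steps):.2e}")
```

The learning rate starts at the configured value and reaches zero at the final step.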

In [None]:
import os

# Go to https://wandb.ai/authorize to get your access key
os.environ['WANDB_API_KEY']="<your wandb access key>"
os.environ['WANDB_PROJECT']="transformer_proj"

In [None]:
from transformers import TrainingArguments, Trainer

# Training arguments for parameter tuning
training_args = TrainingArguments(
   "model",                  # output directory for checkpoints
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=1,
   weight_decay=0.01,
   save_strategy="epoch",
   eval_strategy='steps',
   eval_steps=0.1,           # a float below 1 is a fraction of the total training steps
   report_to="wandb",
   logging_steps=0.1,        # log at the same interval as evaluation
   run_name="bert-finetune"
)

# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
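
The fractional `eval_steps` and `logging_steps` values can be turned into concrete step numbers with a little arithmetic. The sketch below assumes a single device and that a ratio is converted by rounding up the product with the total number of steps; `resolve_steps` is a hypothetical helper written for this illustration, not a library function.

```python
import math

TRAIN_SAMPLES = 2500   # size of the training subset used in this lab
BATCH_SIZE = 16        # per_device_train_batch_size, assuming one device
EPOCHS = 1


def resolve_steps(value, max_steps):
    """Turn a ratio below 1 into an absolute step interval; keep whole numbers as-is."""
    return math.ceil(max_steps * value) if value < 1 else int(value)


max_steps = math.ceil(TRAIN_SAMPLES / BATCH_SIZE) * EPOCHS
eval_every = resolve_steps(0.1, max_steps)
eval_points = list(range(eval_every, max_steps + 1, eval_every))

print(f"total optimizer steps: {max_steps}")
print(f"evaluate every {eval_every} steps, at steps {eval_points}")
```

Under these assumptions the model is evaluated several times within the single epoch, which is what makes the F1 curve in the logs useful for spotting overfitting.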


In [None]:
trainer.train()

Evaluate results.

In [None]:
trainer.evaluate()

### Freeze Layers

To show the importance of training the entire network, we will now freeze the main DistilBERT model and allow updates only to the classification head.

In [None]:
# Load Model and Tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Our pretrained DistilBERT model contains a lot of layers that we can potentially
freeze. Inspecting these layers gives insight into the structure of the network
and what we might want to freeze:

In [None]:
# Print layer names
for name, param in model.named_parameters():
    print(name)

In [None]:
for name, param in model.named_parameters():

    # Trainable classification head
    if name.startswith("classifier") or name.startswith("pre_classifier"):
        param.requires_grad = True

    # Freeze everything else
    else:
        param.requires_grad = False

In [None]:
# We can check whether the model was correctly updated
for name, param in model.named_parameters():
    print(f"Parameter: {name} ----- {param.requires_grad}")
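
To get a feel for how little of the network remains trainable, the sketch below estimates the parameter counts by hand. It assumes the published configuration of distilbert-base-uncased (vocabulary of 30,522, hidden size 768, feed-forward size 3,072, 6 transformer blocks, 512 learned positions); the variable and helper names are made up for this illustration.

```python
# Assumed configuration values for distilbert-base-uncased
VOCAB, MAX_POS, DIM, FFN_DIM, N_LAYERS, N_LABELS = 30522, 512, 768, 3072, 6, 2


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * DIM  # scale and shift vectors

# Word embeddings + position embeddings + embedding LayerNorm
embeddings = VOCAB * DIM + MAX_POS * DIM + layer_norm

# One transformer block: q/k/v/out projections, two LayerNorms, two feed-forward layers
block = (4 * linear(DIM, DIM) + layer_norm
         + linear(DIM, FFN_DIM) + linear(FFN_DIM, DIM) + layer_norm)

# Classification head: pre_classifier followed by classifier
head = linear(DIM, DIM) + linear(DIM, N_LABELS)

total = embeddings + N_LAYERS * block + head
print(f"total parameters:     {total:,}")
print(f"trainable (head only): {head:,} ({100 * head / total:.2f}%)")
```

In the notebook itself, the same numbers can be checked against the loaded model with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.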

In [None]:
from transformers import TrainingArguments, Trainer

# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
trainer.train()

In [None]:
trainer.evaluate()

### Freeze blocks 1-4

Instead of freezing nearly all layers, let’s freeze the embeddings and the first four of DistilBERT's six transformer blocks (layer.0 to layer.3) and see how this affects performance. This reduces computation while still allowing updates to flow through part of the pretrained model:

In [None]:
# Print each parameter's index and name to find where each block starts
for index, (name, param) in enumerate(model.named_parameters()):
    print(f"Parameter: {index} {name} ----- {param.requires_grad}")

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The fifth transformer block (distilbert.transformer.layer.4) starts at index 68 and
# we freeze everything before that block (the embeddings and blocks 1-4)
for index, (name, param) in enumerate(model.named_parameters()):
    if index < 68:
        param.requires_grad = False
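
The index 68 can be derived rather than read off the printout. The sketch below rebuilds the expected order of parameter names without loading the model, assuming the standard DistilBERT layout of 4 embedding parameters followed by 16 parameters per transformer block. The function `expected_parameter_names` is a hypothetical helper for this illustration.

```python
EMBEDDING_PARAMS = [
    "embeddings.word_embeddings.weight",
    "embeddings.position_embeddings.weight",
    "embeddings.LayerNorm.weight",
    "embeddings.LayerNorm.bias",
]

# Sub-modules of one transformer block; each has a weight and a bias
BLOCK_MODULES = [
    "attention.q_lin", "attention.k_lin", "attention.v_lin", "attention.out_lin",
    "sa_layer_norm", "ffn.lin1", "ffn.lin2", "output_layer_norm",
]


def expected_parameter_names(n_layers=6):
    """Rebuild the assumed parameter order of the DistilBERT classification model."""
    names = [f"distilbert.{name}" for name in EMBEDDING_PARAMS]
    for layer in range(n_layers):
        for module in BLOCK_MODULES:
            for kind in ("weight", "bias"):
                names.append(f"distilbert.transformer.layer.{layer}.{module}.{kind}")
    for module in ("pre_classifier", "classifier"):
        for kind in ("weight", "bias"):
            names.append(f"{module}.{kind}")
    return names


names = expected_parameter_names()

# Index of the first parameter that belongs to layer.4 (the fifth block)
first_trainable = next(i for i, n in enumerate(names) if ".layer.4." in n)
frozen_names = names[:first_trainable]

print(f"{len(names)} parameters in total")
print(f"layer.4 starts at index {first_trainable}")
print(f"last frozen parameter: {frozen_names[-1]}")
```

Selecting parameters by name, as the previous section does with `name.startswith(...)`, is less fragile than a hard-coded index if the architecture changes.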



In [None]:
# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_test,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()