<a href="https://colab.research.google.com/github/pinzger/handsonllms/blob/main/Fine_tuning_BERT_for_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT for Classification
Example adapted from Hands-On Large Language Models.

In [2]:
# %%capture
!pip install datasets accelerate seqeval evaluate
# !pip install transformers>=4.38.2 sentence-transformers>=2.5.1 setfit>=1.0.3 accelerate>=0.27.2 seqeval>=1.2.2

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.

## Reuse the Rotten Tomatoes movie review data

In [3]:
from datasets import load_dataset

# Prepare data and splits
tomatoes = load_dataset("rotten_tomatoes")
train_data, validation_data, test_data = tomatoes["train"], tomatoes["validation"], tomatoes["test"]

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.46k [00:00<?, ?B/s]

train.parquet:   0%|          | 0.00/699k [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/90.0k [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/92.2k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/8530 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1066 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1066 [00:00<?, ? examples/s]

In [13]:
validation_data[:3]

{'text': ['compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
  'the soundtrack alone is worth the price of admission .',
  'rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .'],
 'label': [1, 1, 1]}

## Load model and tokenizer
We use the BERT base model to initialize a sequence classification model, which adds a feedforward neural network on top of BERT as the classification head.

In [14]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load Model and Tokenizer
model_id = "bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

## Tokenize the data
The `DataCollatorWithPadding` class builds the batches and pads each batch to the length of its longest sequence.

In [15]:
from transformers import DataCollatorWithPadding

# Pad to the longest sequence in the batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def preprocess_function(examples):
    """Tokenize input data"""
    return tokenizer(examples["text"], truncation=True)

# Tokenize train/validation/test data
tokenized_train = train_data.map(preprocess_function, batched=True)
tokenized_validation = validation_data.map(preprocess_function, batched=True)
tokenized_test = test_data.map(preprocess_function, batched=True)

Map:   0%|          | 0/8530 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]

Map:   0%|          | 0/1066 [00:00<?, ? examples/s]
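
To illustrate what padding to the longest sequence in a batch means, here is a minimal pure-Python sketch. It is not the implementation of `DataCollatorWithPadding`; the helper `pad_batch` and the short token-id lists are hypothetical and only mimic the shape of the collator's output (`input_ids` plus `attention_mask`).

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the length of the longest list in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Placeholder token ids, for illustration only
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to a fixed global length keeps batches of short reviews small, which is why the tokenizer call above only truncates and leaves padding to the collator.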

## Define the metrics for evaluating our model

In [16]:
import numpy as np
import evaluate


def compute_metrics(eval_pred):
    """Calculate F1 score"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    load_f1 = evaluate.load("f1")
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"f1": f1}
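
As a rough illustration of what `compute_metrics` calculates, the sketch below takes the argmax over each row of logits and computes the binary F1 score by hand, treating label 1 as the positive class. The helpers `argmax` and `binary_f1` and the toy logits are made up for this example; the notebook itself relies on `evaluate.load("f1")`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def binary_f1(predictions, references, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy logits for four reviews: column 0 = negative, column 1 = positive
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
labels = [1, 0, 0, 1]

preds = [argmax(row) for row in logits]
score = binary_f1(preds, labels)
print(preds)  # [1, 0, 1, 0]
print(score)  # 0.5
```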

## Train the model
Initialize the trainer first.

In [17]:
from transformers import TrainingArguments, Trainer

# Training arguments for parameter tuning
training_args = TrainingArguments(
   "model",
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=1, ## currently only 1 epoch!
   weight_decay=0.01,
   save_strategy="epoch",
   report_to="none"
)

# Trainer which executes the training process
trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_validation,
   processing_class=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
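
The arguments above set only the peak learning rate. Assuming the `Trainer` defaults (a linear schedule with no warmup), the learning rate decays from `2e-5` to 0 over the run. The sketch below reproduces that shape in plain Python; `linear_lr` is a hypothetical helper, and the total of 534 steps is the `global_step` that the 1-epoch run reports.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 534  # global_step reported for 1 epoch
for step in (0, 267, 534):
    print(step, linear_lr(step, total_steps))
```

Under this assumption the last updates of a run use a very small learning rate, so a 3-epoch run follows a different schedule from three separate 1-epoch runs.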

Run the training.

In [18]:
trainer.train()

Step,Training Loss
500,0.4054


TrainOutput(global_step=534, training_loss=0.4021025918396225, metrics={'train_runtime': 115.4939, 'train_samples_per_second': 73.857, 'train_steps_per_second': 4.624, 'total_flos': 227605451772240.0, 'train_loss': 0.4021025918396225, 'epoch': 1.0})
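
The reported `global_step=534` follows from the dataset and batch size. A small sketch of the arithmetic, assuming a single device and no gradient accumulation (`optimizer_steps` is a hypothetical helper, not part of the Trainer API):

```python
import math


def optimizer_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer steps; the last, smaller batch still counts as a step."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


print(optimizer_steps(8530, 16))            # 534
print(optimizer_steps(8530, 16, epochs=3))  # 1602
```

The second value matches the `global_step` of the 3-epoch run further down in the notebook.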

Evaluate the model on the specified `eval_dataset` (the validation split).

In [19]:
trainer.evaluate()

Downloading builder script:   0%|          | 0.00/6.79k [00:00<?, ?B/s]

{'eval_loss': 0.3439076244831085,
 'eval_f1': 0.8589743589743589,
 'eval_runtime': 4.9432,
 'eval_samples_per_second': 215.652,
 'eval_steps_per_second': 13.554,
 'epoch': 1.0}

Evaluate on the test data as well. The score should not differ much from the validation score, because we trained for only 1 epoch.

In [20]:
trainer.evaluate(tokenized_test)

{'eval_loss': 0.36189785599708557,
 'eval_f1': 0.8493919550982226,
 'eval_runtime': 4.5344,
 'eval_samples_per_second': 235.091,
 'eval_steps_per_second': 14.776,
 'epoch': 1.0}

# Train the model with multiple epochs

In [24]:
# Training arguments for parameter tuning
training_args_multiple_epochs = TrainingArguments(
   "model",
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   eval_strategy="epoch",  # Evaluate after each epoch
   num_train_epochs=3, ## train for 3 epochs
   weight_decay=0.01,
   save_strategy="epoch",
   report_to="none"
)

# Trainer which executes the training process
trainer_multiple_epochs = Trainer(
   model=model,
   args=training_args_multiple_epochs,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_validation,
   processing_class=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)



In [25]:
trainer_multiple_epochs.train()

Epoch,Training Loss,Validation Loss,F1
1,0.2066,0.356758,0.860404
2,0.1815,0.522898,0.851105
3,0.0976,0.662921,0.854991


TrainOutput(global_step=1602, training_loss=0.15571916505192104, metrics={'train_runtime': 356.1193, 'train_samples_per_second': 71.858, 'train_steps_per_second': 4.498, 'total_flos': 681337383407880.0, 'train_loss': 0.15571916505192104, 'epoch': 3.0})

In [26]:
trainer_multiple_epochs.evaluate()

{'eval_loss': 0.6629214882850647,
 'eval_f1': 0.8549905838041432,
 'eval_runtime': 4.784,
 'eval_samples_per_second': 222.826,
 'eval_steps_per_second': 14.005,
 'epoch': 3.0}

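The result is not really better than after training for only 1 epoch: validation F1 stays between 0.85 and 0.86, while the validation loss rises from 0.357 to 0.663 and the training loss keeps falling, which suggests overfitting. Note that `trainer_multiple_epochs` reuses `model`, which was already fine-tuned for 1 epoch above, so these 3 epochs continue from that state rather than from the pretrained checkpoint.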

In [27]:
trainer_multiple_epochs.evaluate(tokenized_test)

{'eval_loss': 0.757531464099884,
 'eval_f1': 0.8481973434535104,
 'eval_runtime': 4.7403,
 'eval_samples_per_second': 224.878,
 'eval_steps_per_second': 14.134,
 'epoch': 3.0}
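
One way to read the per-epoch table is to pick the epoch with the best validation score instead of simply keeping the last one. The sketch below does this by hand on the numbers logged above; the `eval_log` list is just a transcription of that table. `TrainingArguments` also has `load_best_model_at_end` and `metric_for_best_model` options for doing this during training, which this notebook does not use.

```python
# Per-epoch results copied from the training log above
eval_log = [
    {"epoch": 1, "train_loss": 0.2066, "eval_loss": 0.356758, "f1": 0.860404},
    {"epoch": 2, "train_loss": 0.1815, "eval_loss": 0.522898, "f1": 0.851105},
    {"epoch": 3, "train_loss": 0.0976, "eval_loss": 0.662921, "f1": 0.854991},
]

best_by_f1 = max(eval_log, key=lambda row: row["f1"])
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

print(best_by_f1["epoch"])    # 1
print(best_by_loss["epoch"])  # 1
```

Both criteria point to the first epoch of this run, while the evaluations above use the weights after epoch 3.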

# Rerun the process, training only the classification head
Freeze the BERT layers so that only the classifier weights are updated.

In [29]:
# Load a fresh model (the tokenizer from above is reused)
model_freezed = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
# tokenizer_freezed = AutoTokenizer.from_pretrained(model_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Inspect the model
Print the parameter names. Note the 12 encoder layers (`bert.encoder.layer.0` through `bert.encoder.layer.11`), each with its own self-attention block.

In [30]:
# Print layer names
for name, param in model_freezed.named_parameters():
    print(name)

bert.embeddings.word_embeddings.weight
bert.embeddings.position_embeddings.weight
bert.embeddings.token_type_embeddings.weight
bert.embeddings.LayerNorm.weight
bert.embeddings.LayerNorm.bias
bert.encoder.layer.0.attention.self.query.weight
bert.encoder.layer.0.attention.self.query.bias
bert.encoder.layer.0.attention.self.key.weight
bert.encoder.layer.0.attention.self.key.bias
bert.encoder.layer.0.attention.self.value.weight
bert.encoder.layer.0.attention.self.value.bias
bert.encoder.layer.0.attention.output.dense.weight
bert.encoder.layer.0.attention.output.dense.bias
bert.encoder.layer.0.attention.output.LayerNorm.weight
bert.encoder.layer.0.attention.output.LayerNorm.bias
bert.encoder.layer.0.intermediate.dense.weight
bert.encoder.layer.0.intermediate.dense.bias
bert.encoder.layer.0.output.dense.weight
bert.encoder.layer.0.output.dense.bias
bert.encoder.layer.0.output.LayerNorm.weight
bert.encoder.layer.0.output.LayerNorm.bias
bert.encoder.layer.1.attention.self.query.weight
bert.enc

## Make only the classification head trainable

In [31]:
for name, param in model_freezed.named_parameters():

    # Trainable classification head
    if name.startswith("classifier"):
        param.requires_grad = True

    # Freeze everything else
    else:
        param.requires_grad = False

In [32]:
# Check that the requires_grad flags were set as intended
for name, param in model_freezed.named_parameters():
    print(f"Parameter: {name} ----- {param.requires_grad}")

Parameter: bert.embeddings.word_embeddings.weight ----- False
Parameter: bert.embeddings.position_embeddings.weight ----- False
Parameter: bert.embeddings.token_type_embeddings.weight ----- False
Parameter: bert.embeddings.LayerNorm.weight ----- False
Parameter: bert.embeddings.LayerNorm.bias ----- False
Parameter: bert.encoder.layer.0.attention.self.query.weight ----- False
Parameter: bert.encoder.layer.0.attention.self.query.bias ----- False
Parameter: bert.encoder.layer.0.attention.self.key.weight ----- False
Parameter: bert.encoder.layer.0.attention.self.key.bias ----- False
Parameter: bert.encoder.layer.0.attention.self.value.weight ----- False
Parameter: bert.encoder.layer.0.attention.self.value.bias ----- False
Parameter: bert.encoder.layer.0.attention.output.dense.weight ----- False
Parameter: bert.encoder.layer.0.attention.output.dense.bias ----- False
Parameter: bert.encoder.layer.0.attention.output.LayerNorm.weight ----- False
Parameter: bert.encoder.layer.0.attention.output
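
To get a feeling for how little is left to train after freezing, the sketch below counts parameters by name prefix. It uses a hand-written dictionary of shapes for a few bert-base parameters (hidden size 768, 2 labels) rather than the real model, so it only illustrates the bookkeeping; `param_shapes` and the totals are not taken from the notebook output.

```python
from math import prod

# Shapes of a few parameters, written out by hand for illustration
param_shapes = {
    "bert.encoder.layer.0.attention.self.query.weight": (768, 768),
    "bert.encoder.layer.0.attention.self.query.bias": (768,),
    "classifier.weight": (2, 768),
    "classifier.bias": (2,),
}

# Same rule as in the freezing loop: only names starting with "classifier" train
trainable = sum(prod(s) for n, s in param_shapes.items() if n.startswith("classifier"))
frozen = sum(prod(s) for n, s in param_shapes.items() if not n.startswith("classifier"))

print(trainable)  # 1538
print(frozen)     # 590592
```

The classification head has 1538 parameters, fewer than a single bias-plus-weight pair of one attention projection, so the frozen model has far less capacity to adapt to the task.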

## Initialize a new trainer and run training

In [33]:
# Trainer which executes the training process
trainer_freezed = Trainer(
   model=model_freezed,
   args=training_args,
   train_dataset=tokenized_train,
   eval_dataset=tokenized_validation,
   processing_class=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)
trainer_freezed.train()

Step,Training Loss
500,0.6914


TrainOutput(global_step=534, training_loss=0.6905746995733025, metrics={'train_runtime': 51.1688, 'train_samples_per_second': 166.703, 'train_steps_per_second': 10.436, 'total_flos': 227605451772240.0, 'train_loss': 0.6905746995733025, 'epoch': 1.0})

Evaluation shows a much lower F1 score than for the fully fine-tuned model above.

In [34]:
trainer_freezed.evaluate()

{'eval_loss': 0.6827377676963806,
 'eval_f1': 0.6159292035398231,
 'eval_runtime': 6.1572,
 'eval_samples_per_second': 173.131,
 'eval_steps_per_second': 10.882,
 'epoch': 1.0}
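
For context on the loss values of the frozen model, it helps to know the cross-entropy loss of a binary classifier that always predicts 50/50. A minimal sketch (`cross_entropy` is a hypothetical helper for a single example):

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label for one example."""
    return -math.log(probs[label])


chance_loss = cross_entropy([0.5, 0.5], label=1)
print(round(chance_loss, 4))  # 0.6931
```

The frozen model's training loss (0.6906) and validation loss (0.6827) are only slightly below this chance-level value of ln 2, which suggests that one epoch of training the head alone has barely moved the predictions away from 50/50.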