# Fine-tuning a Pretrained Model

Learning how to fine-tune a pretrained model with a classification head

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 20/11/2025   | Martin | Created   | Created to learn how to fine-tune a pretrained model | 
| 08/12/2025   | Martin | Created   | Completed chapter on fine-tuning | 

# Content

* [Introduction](#introduction)
* [Training Pipeline](#training-pipeline)
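* [Fine-tuning](#fine-tuning)
* [Full Training Loop](#full-training-loop)
* [Training with Accelerate](#training-with-accelerate)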

# Introduction

This notebook fine-tunes BERT on the MRPC dataset, whose labels indicate whether the _two sentences in a pair are paraphrases of each other or not_.

In [2]:
%load_ext watermark

In [2]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

A sample training step (one forward pass, one backward pass, one optimizer update) using 2 sentences

In [3]:
# Define the original model - text classification model
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Additional sequences to train on
sequences = [
  "I've been waiting for a HuggingFace course my whole life.",
  "This course is amazing!",
]

# Process the sequences
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
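
For integer class labels, the `.loss` returned by a sequence classification model is a cross-entropy loss over the logits. The following is a minimal pure-Python sketch of that computation, not the library's implementation; the logits are hypothetical values chosen for illustration, and only the labels `[1, 1]` come from the cell above.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the correct class after a softmax."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for the two sentences (assumption, not real model output).
batch_logits = [[0.2, -0.1], [-0.3, 0.4]]
batch_labels = [1, 1]  # same labels as in the cell above

# The batch loss is the mean of the per-example losses.
per_example = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
loss = sum(per_example) / len(per_example)
print(loss)
```

Because the classification head is newly initialized, the first loss on a two-class problem is usually close to `log(2)`, the loss of a uniform guess.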


---

# Training Pipeline

One of the original BERT pretraining objectives was next sentence prediction: given a pair of sentences, predict whether the second sentence follows the first. For half of the pairs, the true second sentence was replaced with a randomly chosen sentence from elsewhere in the dataset to form the negative examples.
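
The pairing scheme described above can be sketched in a few lines. This is an illustrative assumption about how such pairs could be built, not the original BERT preprocessing code; `make_nsp_pairs` and the toy corpus are hypothetical.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Pair each sentence with its true successor or a random other sentence."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than the current one or its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs

# Toy corpus (hypothetical); needs at least three sentences.
corpus = [
    "The cat sat on the mat.",
    "It purred quietly.",
    "Rain fell all afternoon.",
    "The streets were empty.",
]
for first, second, is_next in make_nsp_pairs(corpus):
    print(is_next, "|", first, "->", second)
```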

`token_type_ids` indicate whether each token belongs to the first or the second sentence of a pair. The tokenizer produces them automatically, so they need no special handling here even though we are fine-tuning on a different objective. As long as the model architecture remains the same, the training targets can be changed.

<u>Additional Components</u>

- `datasets` is a Hugging Face library for loading datasets from, and uploading them to, the Hub

💡<u>Tips</u>

- Padding each batch separately (dynamic padding) is more efficient than padding the whole dataset to one length, since each batch is padded only to the length of its own longest sequence
  - `DataCollatorWithPadding` does this; it takes the relevant tokenizer as input

In [4]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

In [5]:
CHECKPOINT = "bert-base-uncased"

In [6]:
raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [7]:
raw_train_dataset = raw_datasets['train']
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [8]:
raw_train_dataset.features

{'sentence1': Value('string'),
 'sentence2': Value('string'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'idx': Value('int32')}

Preprocessing the data - tokenization

In [9]:
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
sent_1 = tokenizer(raw_datasets['train']['sentence1'][0])
sent_2 = tokenizer(raw_datasets['train']['sentence2'][0])

print("----- Comparing Tokenized Sentences ------")
print(sent_1)
print(sent_2)

----- Comparing Tokenized Sentences ------
{'input_ids': [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
{'input_ids': [101, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [10]:
inputs = tokenizer(
  "This is the first sentence.",
  "This is the second one.",
)
print(inputs)

decoded = tokenizer.convert_ids_to_tokens(inputs['input_ids'])
print(decoded)

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
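
The segment IDs in the output above follow a simple pattern: `[CLS]`, the first sentence, and its `[SEP]` get 0; the second sentence and its `[SEP]` get 1. Below is a minimal sketch that reproduces that layout from already-tokenized input. `build_pair_inputs` is a hypothetical helper for illustration, not part of the tokenizer API.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Lay out a sentence pair as [CLS] A [SEP] B [SEP] with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + A + [SEP]; segment 1 covers B + [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

# Word pieces taken from the decoded output above.
tokens_a = ["this", "is", "the", "first", "sentence", "."]
tokens_b = ["this", "is", "the", "second", "one", "."]

tokens, token_type_ids = build_pair_inputs(tokens_a, tokens_b)
print(tokens)
print(token_type_ids)
```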


In [11]:
# .map() applies a function to the dataset; with batched=True it receives batches of examples instead of one example at a time
def tokenize_function(val):
  return tokenizer(val['sentence1'], val['sentence2'], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
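
To make the effect of `batched=True` concrete, here is a rough pure-Python emulation of a batched map over a column-oriented dataset. It is a sketch under simplifying assumptions, not how `datasets` is implemented; `batched_map` and `count_words` are hypothetical names. The point it illustrates is that the mapped function receives a dict of lists and returns new columns as lists of the same length.

```python
def batched_map(columns, fn, batch_size=1000):
    """Apply fn to slices of a dict-of-lists and merge the returned columns."""
    num_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict mapping column name -> list of values.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

def count_words(batch):
    """Stand-in for a tokenizer call: one output value per sentence pair."""
    return {
        "num_words": [
            len(a.split()) + len(b.split())
            for a, b in zip(batch["sentence1"], batch["sentence2"])
        ]
    }

toy = {
    "sentence1": ["a short one", "another sentence here", "third"],
    "sentence2": ["its pair", "pair two", "pair number three"],
}
result = batched_map(toy, count_words, batch_size=2)
print(result["num_words"])
```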

Using the data collator to pad by batches

In [12]:
# Define the collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

samples = tokenized_datasets['train'][:8] # Take 8 samples
samples = {k: v for k, v in samples.items() if k not in ['idx', 'sentence1', 'sentence2']}

print("Variable sequence length for each sentence in the tokenized dataset")
print([len(x) for x in samples['input_ids']])
print()

batch = data_collator(samples)
print("Dynamically padded for the samples in this batch of size 8")
print({k: v.shape for k, v in batch.items()})

Variable sequence length for each sentence in the tokenized dataset
[50, 59, 47, 67, 59, 50, 62, 32]

Dynamically padded for the samples in this batch of size 8
{'input_ids': torch.Size([8, 67]), 'token_type_ids': torch.Size([8, 67]), 'attention_mask': torch.Size([8, 67]), 'labels': torch.Size([8])}
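
The saving from dynamic padding can be estimated from the sequence lengths printed above. The sketch below is a plain-Python illustration, not the `DataCollatorWithPadding` implementation; `pad_batch` is a hypothetical helper, the token IDs are dummies, and the fixed length of 512 is an assumption taken from BERT's maximum position count.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad to max_length if given, otherwise to the longest sequence in the batch."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (target - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (target - len(s)) for s in sequences]
    return input_ids, attention_mask

# Lengths of the 8 samples shown above; the token IDs themselves are dummies.
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
sequences = [[1] * n for n in lengths]

dynamic_ids, dynamic_mask = pad_batch(sequences)               # pad to 67
fixed_ids, fixed_mask = pad_batch(sequences, max_length=512)   # pad to 512

dynamic_pad_tokens = sum(row.count(0) for row in dynamic_mask)
fixed_pad_tokens = sum(row.count(0) for row in fixed_mask)
print(len(dynamic_ids[0]), dynamic_pad_tokens, fixed_pad_tokens)
```

The attention mask marks padded positions with 0 so that the model ignores them.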


---

# Fine-tuning

In [21]:
import numpy as np  # needed for np.argmax in the cells below
from transformers import Trainer, TrainingArguments
from transformers import AutoModelForSequenceClassification
import evaluate

In [None]:
training_args = TrainingArguments(
  output_dir="models",            # Path to save model to
  overwrite_output_dir=True,      # Overwrite path to save to
  # num_train_epochs=1,             # Number of training epochs
  # per_device_train_batch_size=2,  # Batch size per device
  # gradient_accumulation_steps=4,  # Batches to accumulate gradients over before each update. Effective batch size (here) = 2 * 4
  # learning_rate=2e-5,             # Learning rate
  # lr_scheduler_type='cosine',     # Learning rate scheduler strategy
  # save_steps=2000,                # Number of update steps between two checkpoint saves
  # logging_steps=2000,             # Number of update steps between two logs
  # do_eval=True,                   # Evaluate on the validation set
  # eval_strategy="steps",          # Evaluation strategy
  # eval_steps=2000,                # Number of steps between evaluations
  # report_to=[],                   # List of integrations to external reporting platforms
  # save_total_limit=2,             # Max number of checkpoints to keep
  # fp16=False,                     # Whether to use fp16 mixed-precision training
)
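
The interaction between `per_device_train_batch_size` and `gradient_accumulation_steps` noted in the comments above can be checked with some quick arithmetic. This is a back-of-the-envelope sketch that assumes a single device and the 3668 MRPC training examples; the exact rounding used internally by `Trainer` may differ.

```python
import math

num_examples = 3668                # MRPC training split
per_device_train_batch_size = 2    # value from the commented-out argument
gradient_accumulation_steps = 4    # value from the commented-out argument
num_devices = 1                    # assumption: single GPU or CPU

# Examples that contribute to each optimizer update.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Forward/backward passes per epoch, and the optimizer updates they produce.
batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)

print(effective_batch_size, batches_per_epoch, updates_per_epoch)
```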

In [15]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
trainer = Trainer(
  model,
  training_args,
  train_dataset=tokenized_datasets['train'],
  eval_dataset=tokenized_datasets['validation'],
  data_collator=data_collator,
  processing_class=tokenizer,
  compute_metrics=compute_metrics
)

`trainer.predict()` returns a named tuple with 3 fields:

1. `predictions`
2. `label_ids`
3. `metrics`

In [None]:
predictions = trainer.predict(tokenized_datasets['validation'])
print(predictions.predictions.shape, predictions.label_ids.shape)

preds = np.argmax(predictions.predictions, axis=-1)
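
The `argmax` over the last axis turns each row of logits into the index of the highest-scoring class. A minimal pure-Python sketch of that step follows; the logits are hypothetical values, not output from this model.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

# Hypothetical logits with shape (3 examples, 2 classes).
logits = [
    [1.2, -0.7],   # class 0 scores higher
    [-0.3, 0.9],   # class 1 scores higher
    [0.1, 0.4],    # class 1 scores higher
]

preds = [argmax(row) for row in logits]
print(preds)
```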

Evaluation with the evaluate library

In [24]:
metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

NameError: name 'preds' is not defined

In [None]:
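# Define this before constructing the Trainer, which receives it as compute_metrics
def compute_metrics(eval_preds):
  metric = evaluate.load("glue", "mrpc")
  logits, labels = eval_preds                # model outputs and ground-truth labels
  predictions = np.argmax(logits, axis=-1)   # pick the highest-scoring class
  return metric.compute(predictions=predictions, references=labels)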

trainer.train()

---

# Full Training Loop

This section implements the full training loop directly in PyTorch, replicating what the `Trainer` class does.

In [5]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification
from transformers import get_scheduler
import evaluate

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

from tqdm.auto import tqdm

In [6]:
raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
  return tokenizer(example['sentence1'], example['sentence2'], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Make some changes to the original datasets for the specific task:

1. Remove the columns the model does not expect (e.g. `sentence1`, `sentence2` and `idx`)
2. Rename the `label` column to `labels`, the argument name the model expects
3. Set the format of the datasets to return PyTorch tensors

In [7]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

In [8]:
tokenized_datasets['train']

Dataset({
    features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

In [9]:
train_dataloader = DataLoader(
  tokenized_datasets['train'],
  shuffle=True,
  batch_size=8,
  collate_fn=data_collator
)
eval_dataloader = DataLoader(
  tokenized_datasets['validation'],
  batch_size=8,
  collate_fn=data_collator
)

In [10]:
for batch in train_dataloader:
  break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 75]),
 'token_type_ids': torch.Size([8, 75]),
 'attention_mask': torch.Size([8, 75])}

In [11]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

optimizer = AdamW(model.parameters(), lr=5e-5)

# Define a learning rate scheduler - linear decay
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
  "linear",
  optimizer=optimizer,
  num_warmup_steps=0,
  num_training_steps=num_training_steps
)

# Set the device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [12]:
# Training Loop
progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
  for batch in train_dataloader:
    optimizer.zero_grad()
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()

    optimizer.step()
    lr_scheduler.step()
    progress_bar.update(1)

  0%|          | 0/1377 [00:00<?, ?it/s]
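
The progress bar total of 1377 follows from the dataset size, batch size, and epoch count, and the `"linear"` schedule with no warmup decays the learning rate from its initial value to zero over those steps. The sketch below reproduces both in plain Python; `linear_lr` is a hypothetical helper written to mirror the usual linear-with-warmup formula, not the library function itself.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_steps = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_steps)

num_examples, batch_size, num_epochs = 3668, 8, 3
batches_per_epoch = math.ceil(num_examples / batch_size)   # the last batch is partial
num_training_steps = num_epochs * batches_per_epoch

print(batches_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```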

In [13]:
# Evaluation Loop
metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dataloader:
  batch = {k: v.to(device) for k, v in batch.items()}
  with torch.no_grad():
    output = model(**batch)
  
  logits = output.logits
  preds = torch.argmax(logits, dim=-1)

  metric.add_batch(predictions=preds, references=batch['labels'])

metric.compute()

{'accuracy': 0.8504901960784313, 'f1': 0.893542757417103}
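
The two numbers reported above are accuracy and the F1 score of the positive (`equivalent`) class. As a rough illustration of what they measure, here is a hand-rolled version in plain Python. It is a sketch, not the `evaluate` implementation, and the predictions and labels are made-up values.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy over all examples and F1 for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Made-up predictions and labels for illustration only.
toy_preds = [1, 0, 1, 1, 0, 1]
toy_labels = [1, 0, 0, 1, 1, 1]
scores = accuracy_and_f1(toy_preds, toy_labels)
print(scores)
```

F1 is the harmonic mean of precision and recall, so it is less flattering than accuracy when one class dominates the labels.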

---

# Training with Accelerate

Accelerate is a Hugging Face library that lets a PyTorch training loop run on multi-GPU or TPU setups with less hassle.

It abstracts away device placement and distributed setup, so only a few lines of the standard PyTorch training loop need to change.

In [15]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

In [21]:
# NOTE: Wrapping the training code in a function is only needed when launching from a notebook
def training_function():
  accelerator = Accelerator() # >>: Create the accelerator, which detects the available device(s)

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

  train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
  ) # >>: Let Accelerate wrap the dataloaders, model, and optimizer and place them on the right device(s)

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dl)
  lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
    for batch in train_dl:
      optimizer.zero_grad()
      outputs = model(**batch)
      loss = outputs.loss
      accelerator.backward(loss) # >>: Change the backprop call

      optimizer.step()
      lr_scheduler.step()
      progress_bar.update(1)

In [22]:
from accelerate import notebook_launcher

notebook_launcher(
  training_function,
  num_processes=1
)

Launching training on one GPU.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]
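
With a single process the step count matches the plain PyTorch loop (1377). With more processes, the prepared dataloader is sharded so that each process sees only a share of the batches, which is why `num_training_steps` is computed from `len(train_dl)` after `accelerator.prepare`. The arithmetic below is a rough sketch of that effect under the assumption that batches are split evenly and the remainder is rounded up; it does not call Accelerate, and `batches_per_process` is a hypothetical helper.

```python
import math

def batches_per_process(total_batches, num_processes):
    """Approximate number of batches each process iterates over per epoch."""
    return math.ceil(total_batches / num_processes)

total_batches = 459   # len(train_dataloader) with batch_size=8 on 3668 examples
num_epochs = 3

for num_processes in (1, 2, 4):
    per_process = batches_per_process(total_batches, num_processes)
    print(num_processes, per_process, num_epochs * per_process)
```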

In [3]:
%watermark

Last updated: 2025-12-08T17:52:18.200493+08:00

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.37.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 6.6.87.2-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 20
Architecture: 64bit

