In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created: September 18, 2025
Author: Pranaydeep Singh
Last Modified: November 6, 2025
Modified by: Pranaydeep Singh
Description: Script for fine-tuning a pretrained BERT model for text classification, with inference.
"""


# Fine-tuning BERT for Text Classification

This notebook walks you through **fine-tuning a pretrained BERT model** on a text classification dataset using the 🤗 Hugging Face ecosystem.

## What you'll learn
- What *fine-tuning* is and why it's useful
- How to load an NLP dataset with `datasets`
- How tokenization works for BERT (input IDs + attention masks)
- How to train and evaluate with the `Trainer` API
- How to save a model and run inference on new texts

> Tip: Follow the instructions in the README to install the required packages before proceeding.


## 1) Imports + Quick environment check

We use:
- **transformers**: models, tokenizers, and the training utilities
- **datasets**: easy access to standard NLP datasets + fast preprocessing
- **evaluate**: standard evaluation metrics (accuracy, F1, etc.)

We'll also set a random seed so results are more reproducible.
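
To see what a seed buys you, here is a minimal sketch using only Python's standard `random` module. It illustrates the idea rather than the internals of `set_seed`, which seeds Python's `random`, NumPy, and PyTorch in one call. The helper name `draw` is hypothetical.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, leaves global state alone
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(42)
second = draw(42)

# Same seed, same sequence: this is what makes a run repeatable.
print(first == second)
```

Seeding removes one source of run-to-run variation. Results can still differ slightly across hardware or library versions, which is why the text says "more reproducible" rather than "identical".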



In [4]:
import os
import numpy as np
import torch

from datasets import load_dataset
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
    set_seed,
)

print('PyTorch:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))

set_seed(42)

PyTorch: 2.9.1+cu128
CUDA available: True
GPU: Tesla V100-SXM2-16GB


## 2) Choose a model + dataset

### Model
We'll start from a pretrained checkpoint: **`bert-base-uncased`**.

- *Pretrained* means it already learned general language patterns from lots of text.
- *Fine-tuning* means we add a small classification head and train on our labeled dataset.

### Dataset
We'll use **AG News** (`ag_news`), a classic 4-class topic classification dataset.

The dataset contains:
- `text`: the news headline + snippet
- `label`: an integer class id

We'll create a small validation split from the training set.

In [5]:
MODEL_NAME = 'bert-base-uncased'
DATASET_NAME = 'ag_news'

raw = load_dataset(DATASET_NAME)
raw

Generating train split: 100%|██████████| 120000/120000 [00:00<00:00, 237351.63 examples/s]
Generating test split: 100%|██████████| 7600/7600 [00:00<00:00, 134575.84 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})

### (Optional) Use a small subset for quick experiments

Training on the full dataset is totally fine, but when you're learning, it's often nicer to iterate quickly.

Set `USE_SMALL_SUBSET=True` to work with a small slice: 2,000 examples are drawn from the training set and then split 90/10 into 1,800 training and 200 validation samples, and 500 examples are drawn from the test set.
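
The subset arithmetic can be sketched in plain Python as follows. This is an illustration of the shuffle, select, and split logic only: it uses the standard `random` module, so the actual rows chosen will not match what `datasets` picks with the same seed. The constant names are hypothetical.

```python
import random

TRAIN_POOL = 120_000   # rows in the AG News training split
SUBSET = 2_000         # rows kept when USE_SMALL_SUBSET is True
VAL_FRACTION = 0.1     # mirrors test_size=0.1 in train_test_split

rng = random.Random(42)
indices = list(range(TRAIN_POOL))
rng.shuffle(indices)           # analogous to .shuffle(seed=42)
subset = indices[:SUBSET]      # analogous to .select(range(2000))

n_val = round(SUBSET * VAL_FRACTION)
val_idx = subset[:n_val]
train_idx = subset[n_val:]

print(len(train_idx), len(val_idx))        # 1800 200
print(set(train_idx).isdisjoint(val_idx))  # True
```

Shuffling before selecting matters: taking the first 2,000 rows without shuffling would only give a representative sample if the file happened to be stored in random order.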


In [6]:
USE_SMALL_SUBSET = True

if USE_SMALL_SUBSET:
    train_raw = raw['train'].shuffle(seed=42).select(range(2000))
    test_raw  = raw['test'].shuffle(seed=42).select(range(500))
else:
    train_raw = raw['train']
    test_raw  = raw['test']

# Create a validation split from the training data
split = train_raw.train_test_split(test_size=0.1, seed=42)
train_ds = split['train']
val_ds   = split['test']

label_names = raw['train'].features['label'].names
num_labels = len(label_names)
print('Labels:', label_names)
print('num_labels:', num_labels)
print('train/val/test:', len(train_ds), len(val_ds), len(test_raw))


Labels: ['World', 'Sports', 'Business', 'Sci/Tech']
num_labels: 4
train/val/test: 1800 200 500


## 3) Tokenization (turn text into model inputs)

BERT does not read raw strings. It reads numbers:
- **input_ids**: token IDs (words/subwords mapped to integers)
- **attention_mask**: 1 for real tokens, 0 for padding

We use the model's matching tokenizer so token IDs line up correctly.
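
BERT's tokenizer is a WordPiece tokenizer: each word is split greedily into the longest pieces found in the vocabulary, and word-internal pieces are marked with `##`. The sketch below shows that greedy longest-match-first idea on a toy vocabulary. `TOY_VOCAB` and `wordpiece` are hypothetical names, and the real tokenizer also handles lowercasing, punctuation splitting, and a vocabulary of about 30,000 entries.

```python
TOY_VOCAB = {"[UNK]", "token", "##ization", "##s", "play", "##ing", "the"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("tokenization"))  # ['token', '##ization']
print(wordpiece("playing"))       # ['play', '##ing']
print(wordpiece("xyz"))           # ['[UNK]']
```

Each piece is then looked up in the vocabulary to get its integer ID, which is why the tokenizer and the model checkpoint have to match.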

We'll use **dynamic padding** (pad to the longest sequence in a batch) which is usually faster than padding everything to a fixed length.
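
A minimal sketch of what dynamic padding does to one batch, written in plain Python. `pad_batch` is a hypothetical helper, not the `DataCollatorWithPadding` implementation, and the token IDs are illustrative apart from `101` and `102`, which open and close the tokenized example in this notebook.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-ID lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

batch = [
    [101, 7592, 102],
    [101, 7592, 2088, 999, 102],
]
ids, mask = pad_batch(batch)
print(ids)   # [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]

# Slots the model has to process: dynamic padding vs. a fixed length of 512.
dynamic_slots = len(ids) * len(ids[0])
fixed_slots = len(ids) * 512
print(dynamic_slots, fixed_slots)  # 10 1024
```

The saving comes from short batches: news snippets are far shorter than BERT's 512-token limit, so padding to the batch maximum avoids computing over long runs of padding.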

In [7]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize_function(batch):
    return tokenizer(batch['text'], truncation=True)

tokenized_train = train_ds.map(tokenize_function, batched=True, remove_columns=['text'])
tokenized_val   = val_ds.map(tokenize_function, batched=True, remove_columns=['text'])
tokenized_test  = test_raw.map(tokenize_function, batched=True, remove_columns=['text'])

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Peek at a tokenized example
tokenized_train[0]

Map: 100%|██████████| 1800/1800 [00:00<00:00, 7250.57 examples/s]
Map: 100%|██████████| 200/200 [00:00<00:00, 6547.72 examples/s]
Map: 100%|██████████| 500/500 [00:00<00:00, 7222.00 examples/s]


{'label': 0,
 'input_ids': [101, 3956, 17910, 2015, 5920, 8647, 2886, 1024, 2557, 5611, 3548, 2031, 17910, 2098, 1037, 2825, 5920, 8647, 2886, 1998, 14620, 1037, 9302, 2450, 1010, 3956, 2557, 2988, 2006, 9432, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## 4) Load the model

We load a pretrained model **with a classification head**.
Setting `num_labels` ensures the output layer matches the number of classes in the dataset.
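
To make the size of that head concrete, here is a back-of-the-envelope sketch in plain Python. It assumes the head is a single linear layer from the 768-dimensional hidden state to `num_labels` outputs, which matches the `classifier.weight` and `classifier.bias` parameters reported as newly initialized when the model loads. The random values are placeholders, not the library's initialization.

```python
import random

HIDDEN_SIZE = 768   # width of bert-base-uncased hidden states
NUM_LABELS = 4      # AG News classes

# Parameter count of one linear layer: weights plus biases.
n_weights = HIDDEN_SIZE * NUM_LABELS
n_biases = NUM_LABELS
print(n_weights + n_biases)  # 3076

# Toy forward pass: logits[j] = sum_i hidden[i] * weight[j][i] + bias[j]
rng = random.Random(0)
hidden = [rng.gauss(0, 1) for _ in range(HIDDEN_SIZE)]
weight = [[rng.gauss(0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

logits = [sum(h * w for h, w in zip(hidden, row)) + b for row, b in zip(weight, bias)]
print(len(logits))  # 4
```

The head is tiny compared with the encoder, but it starts from random values, which is why the untrained model cannot make useful predictions yet.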

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels,
    id2label={i: name for i, name in enumerate(label_names)},
    label2id={name: i for i, name in enumerate(label_names)},
)

model

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## 5) Metrics (evaluation)

We'll compute **accuracy** during evaluation.

Later, you can add F1/precision/recall (especially useful for imbalanced datasets).
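
As a reference for what those extra metrics measure, here is a minimal pure-Python sketch of per-class precision, recall, and F1, plus the macro average. The labels and predictions are made up for illustration, and `per_class_prf` is a hypothetical helper. In practice the `evaluate` library's `f1` metric with `average='macro'` should give the same macro-F1.

```python
def per_class_prf(preds, labels, num_classes):
    """Return one (precision, recall, f1) tuple per class."""
    results = []
    for c in range(num_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results.append((precision, recall, f1))
    return results

labels = [0, 0, 1, 1, 2, 2, 3, 3]   # made-up ground truth
preds  = [0, 1, 1, 1, 2, 2, 3, 0]   # made-up predictions

scores = per_class_prf(preds, labels, num_classes=4)
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)

print(accuracy)             # 0.75
print(round(macro_f1, 4))   # 0.7417
```

Macro-F1 weights every class equally, so a class the model handles badly pulls the score down even if it is rare. Accuracy hides that on imbalanced data.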

In [9]:
accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)

Downloading builder script: 4.20kB [00:00, 3.60MB/s]


## 6) Training

Key hyperparameters:
- **learning_rate**: how big each update step is (2e-5 is a common BERT starting point)
- **batch size**: how many examples per step
- **epochs**: passes through the dataset

Notes:
- If you hit out-of-memory on GPU, reduce `per_device_train_batch_size`.
- `fp16=True` enables mixed-precision training, which can speed things up and reduce memory use on many GPUs.
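
Batch size and epochs together determine how many optimizer steps a run takes. The sketch below works through that arithmetic for the 1,800-example training subset. It assumes no gradient accumulation and that the last partial batch is kept. `total_steps` is a hypothetical helper, and the device counts are assumptions for illustration.

```python
import math

def total_steps(num_examples, per_device_batch, num_devices, epochs):
    """Optimizer steps when each step consumes one batch per device."""
    effective_batch = per_device_batch * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

print(total_steps(1800, 16, 1, 3))  # 339
print(total_steps(1800, 16, 4, 3))  # 87
```

The run recorded in this notebook reports `global_step=87`, which would be consistent with an effective batch size of 64, for example four devices at 16 examples each. The notebook itself does not state the device count.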


In [11]:

OUTPUT_DIR = '../models/bert-finetuned-ag-news'

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',

    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,

    logging_dir=os.path.join(OUTPUT_DIR, 'logs'),
    logging_steps=25,

    fp16=torch.cuda.is_available(),
    report_to='none',  # change to 'wandb' if you use Weights & Biases to keep track of experiments (highly recommended)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()




| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.1299 | 0.656158 | 0.905 |
| 2 | 0.6137 | 0.359945 | 0.92 |
| 3 | 0.4114 | 0.302699 | 0.94 |




TrainOutput(global_step=87, training_loss=0.6663558291292738, metrics={'train_runtime': 36.2426, 'train_samples_per_second': 148.996, 'train_steps_per_second': 2.4, 'total_flos': 360094500981696.0, 'train_loss': 0.6663558291292738, 'epoch': 3.0})

## 7) Evaluate on the test set

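We trained on `train_ds` and selected the best checkpoint on `val_ds`; now we report performance on the held-out test split (`test_raw`, tokenized as `tokenized_test`).

Because `load_best_model_at_end=True`, the `trainer` already holds the best checkpoint, so we can evaluate directly without reloading anything from disk.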

In [12]:
test_metrics = trainer.evaluate(tokenized_test)
test_metrics



{'eval_loss': 0.4201270639896393,
 'eval_accuracy': 0.88,
 'eval_runtime': 1.0926,
 'eval_samples_per_second': 457.612,
 'eval_steps_per_second': 7.322,
 'epoch': 3.0}

## 8) Save the model + tokenizer

Saving lets you:
- reuse the model later
- share it with labmates
- upload it to the Hugging Face Hub (optional)


In [13]:
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print('Saved to:', OUTPUT_DIR)

Saved to: ../models/bert-finetuned-ag-news


## 9) Inference (predict on new text)

We'll reload from `OUTPUT_DIR` and run predictions on new sentences.

Two beginner-friendly options:
1. Use the model directly (`model(**inputs)`) (allows for tweaking)
2. Use a `pipeline` (simpler interface shown below)
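
Whichever option you choose, the last step is the same: the model emits one logit per class, softmax turns the logits into probabilities, and the largest probability becomes the predicted label and its score. The sketch below shows that post-processing in plain Python with made-up logits. `to_prediction` and `example_logits` are hypothetical names, and the label mapping follows the AG News label order printed earlier.

```python
import math

ID2LABEL = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label=ID2LABEL):
    """Map raw logits to a {'label', 'score'} dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

example_logits = [-0.5, 0.1, 2.3, 0.4]  # made-up values, not model output
prediction = to_prediction(example_logits)
print(prediction["label"])  # Business
```

Working with the model directly gives you access to the full probability vector, which is useful when you want the top two classes or a custom confidence threshold.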

In [14]:
from transformers import pipeline

clf = pipeline('text-classification', model=OUTPUT_DIR, tokenizer=OUTPUT_DIR, device=0 if torch.cuda.is_available() else -1)

sample_texts = [
    'The stock market fell sharply after the central bank announcement.',
    'The team won the championship after a thrilling final match.',
    'Scientists discovered a new method to improve battery life.',
    'Diplomats met to discuss a new ceasefire agreement.'
]

preds = clf(sample_texts)
preds

Device set to use cuda:0


[{'label': 'Business', 'score': 0.8443780541419983},
 {'label': 'Sports', 'score': 0.8677119016647339},
 {'label': 'Sci/Tech', 'score': 0.789272665977478},
 {'label': 'World', 'score': 0.8701167702674866}]

## 10) Recommended Next Steps

- **Try your own dataset** by loading it and adapting the preprocessing steps 
- Add **F1/precision/recall** and a confusion matrix
- Track experiments with W&B and keep an experiment card (hypothesis -> config -> results)
- Look into more advanced models if you are not happy with the performance on your task (decoder models, larger models, etc.)
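
One lightweight way to keep an experiment card is a small record that you dump to JSON next to each run. The sketch below is only a suggestion: `ExperimentCard` and its field names are hypothetical, the hypothesis string is a placeholder for you to fill in, and the config and accuracy values are copied from the run shown in this notebook.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ExperimentCard:
    hypothesis: str                               # what you expect and why
    config: dict                                  # settings needed to rerun
    results: dict = field(default_factory=dict)   # filled in after the run

card = ExperimentCard(
    hypothesis="PLACEHOLDER: write your own hypothesis here",
    config={
        "model": "bert-base-uncased",
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "num_train_epochs": 3,
    },
    results={"val_accuracy": 0.94, "test_accuracy": 0.88},
)

card_json = json.dumps(asdict(card), indent=2)
print(card_json)
```

Writing the hypothesis before the run, and the results after, makes it harder to rationalize a surprising number in hindsight.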
