# 05 – Transformer Fine‑Tune (DistilBERT)

Fine‑tune a lightweight transformer (DistilBERT) on the **Twitter‑Airline Sentiment** dataset and benchmark it against a classical **TF‑IDF + Logistic Regression** baseline.

> **Model** `distilbert-base-uncased`  
> **Training split** 80 % of cleaned data (11,712 tweets, stratified)  
> **Validation split** 10 % (1,464 tweets, held‑out during fine‑tuning)  
> **Test set** Untouched split created in `04_baseline_model.ipynb`  
> **Artifacts saved to** `models/distilbert_twitter/`

## 0 Imports & Global Config

Everything we need in one place:

1. **Path handling** (`pathlib.Path`) so the notebook is platform‑agnostic.  
2. **Reproducibility seeds** for Python, NumPy, PyTorch, and (if available) CUDA.  
3. **Key Hugging Face classes** (`AutoTokenizer`, `AutoModelForSequenceClassification`, `Trainer`, …).  
4. A line that tells Transformers to **ignore TensorFlow** so only PyTorch is used.

In [None]:
# %% 0 Imports & Global Config ──────────────────────────────────────
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"          # use PyTorch only

from pathlib import Path
import random
import numpy as np
import pandas as pd
import torch
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
)
from datasets import Dataset
from evaluate import load as load_metric
import json
import pprint

# reproducibility ---------------------------------------------------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# repo‑aware paths --------------------------------------------------
PROJ_ROOT = Path.cwd().parent
PROC_DIR  = PROJ_ROOT / "data" / "processed"
MODEL_DIR = PROJ_ROOT / "models" / "distilbert_twitter"
MODEL_DIR.mkdir(parents=True, exist_ok=True)



## 1 Load Pre‑made Feather Splits

Load the train and validation splits (Feather files) drawn from the **14 640 pre‑cleaned tweets**, verify the schema and row counts, and show the first few rows.

In [2]:
# %% 1 Load pre‑made Feather splits ─────────────────────────────────
def _load_xy_split(split: str):
    """
    Return (X, y) for the given split.
    X : DataFrame with 'text'
    y : Series with 'label'
    """
    X = pd.read_feather(PROC_DIR / f"X_{split}.ftr")        # ['text']
    y = pd.read_feather(PROC_DIR / f"y_{split}.ftr")["label"]
    return X, y

X_train, y_train = _load_xy_split("train")
X_val,   y_val   = _load_xy_split("val")

for name, X, y in [("train", X_train, y_train), ("val", X_val, y_val)]:
    assert list(X.columns) == ["text"]
    assert y.name == "label"
    assert len(X) == len(y)
    print(f"{name:5} | rows: {len(X):,}")

display(X_train.head())
display(y_train.head())

train | rows: 11,712
val   | rows: 1,464


|   | text |
|---|------|
| 0 | over an hour on hold so far |
| 1 | your gif game is strong. |
| 2 | i'm excited too, but perhaps you could scale y... |
| 3 | while other airlines weren't cancelled flighti... |
| 4 | conf number fmjtyl delayed - any chance of get... |


0    negative
1    negative
2    positive
3    negative
4     neutral
Name: label, dtype: object
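
Since the splits are described as stratified, a quick way to confirm this is to compare class shares across splits. The sketch below uses made-up toy labels in place of `y_train` and `y_val`; the helper name `class_shares` is hypothetical and not part of the notebook.

```python
import pandas as pd

def class_shares(y: pd.Series) -> pd.Series:
    """Return the fraction of rows per label, sorted by label name."""
    return y.value_counts(normalize=True).sort_index()

# Toy stand-ins for y_train / y_val (made-up labels, not the real data)
y_train_demo = pd.Series(["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2, name="label")
y_val_demo = pd.Series(["negative"] * 3 + ["neutral"] * 1 + ["positive"] * 1, name="label")

shares = pd.DataFrame({
    "train": class_shares(y_train_demo),
    "val": class_shares(y_val_demo),
})
# A stratified split should keep this difference close to zero for every class
shares["abs_diff"] = (shares["train"] - shares["val"]).abs()
print(shares.round(3))
```

Swapping in the real `y_train` and `y_val` should show near-identical shares per class if the stratification worked as intended.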

## 2 Tokenisation → HF Datasets

1. Build a label ↔ ID mapping.  
2. Use DistilBERT’s tokenizer to turn each tweet into `input_ids` and `attention_mask`.  
3. Convert the pandas DataFrames into **`datasets.Dataset`** objects, which are Arrow‑backed and cache the results of `map`.  
4. Remove the raw text and label columns so each dataset holds **model inputs only** (`input_ids`, `attention_mask`, `labels`).

In [3]:
# %% 2 Tokenisation → HF Datasets ──────────────────────────────────
TEXT_COL  = "text"
LABEL_COL = "label"

LABELS   = ["negative", "neutral", "positive"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(batch):
    enc = tok(batch[TEXT_COL],
              truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [label2id[x] for x in batch[LABEL_COL]]
    return enc

cols = [TEXT_COL, LABEL_COL]
train_ds = (Dataset.from_pandas(pd.concat([X_train, y_train], axis=1)[cols])
                     .map(encode, batched=True, remove_columns=cols))
val_ds   = (Dataset.from_pandas(pd.concat([X_val,   y_val],   axis=1)[cols])
                     .map(encode, batched=True, remove_columns=cols))

print("train_ds →", train_ds.column_names, "| rows:", train_ds.num_rows)
print("val_ds   →", val_ds.column_names,   "| rows:", val_ds.num_rows)

Map: 100%|██████████| 11712/11712 [00:01<00:00, 9715.33 examples/s] 
Map: 100%|██████████| 1464/1464 [00:00<00:00, 3798.82 examples/s]

train_ds → ['input_ids', 'attention_mask', 'labels'] | rows: 11712
val_ds   → ['input_ids', 'attention_mask', 'labels'] | rows: 1464
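
The `encode` function above pads every tweet to 128 tokens. The alternative is to pad each batch only to its longest sequence, which is what `DataCollatorWithPadding` does when the tokenizer is called without `padding`. The plain-Python sketch below illustrates the idea; `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption about the `distilbert-base-uncased` vocabulary.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: [PAD] has id 0 in the distilbert-base-uncased vocabulary

def pad_batch(seqs: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch (hypothetical helper)."""
    width = max(len(s) for s in seqs)
    input_ids = [s + [pad_id] * (width - len(s)) for s in seqs]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in seqs]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three short "tweets"
batch = pad_batch([[101, 2058, 102], [101, 2115, 21025, 2546, 102], [101, 102]])
padded_width = len(batch["input_ids"][0])
print(padded_width)  # 5, the length of the longest sequence in this batch
```

Tweets are short, so per-batch padding would usually produce sequences well under 128 tokens and cut compute per step; fixed-length padding keeps tensor shapes constant at the cost of that extra work.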





## 3 Model Instantiation

Load DistilBERT with a **new classification head** sized for 3 labels.  
Hugging Face warns that the classification weights are randomly initialised—exactly what we want before fine‑tuning.

In [4]:
# %% 3 Model Instantiation ─────────────────────────────────────────
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
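
To get a feel for how small the freshly initialised head is, the sketch below counts the parameters in the four tensors named in the warning. It assumes the usual DistilBERT layout (a 768 → 768 `pre_classifier` followed by a 768 → 3 `classifier`); the helper `linear_params` is hypothetical.

```python
HIDDEN = 768      # assumption: hidden size of distilbert-base-uncased
NUM_LABELS = 3    # negative / neutral / positive

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out

pre_classifier = linear_params(HIDDEN, HIDDEN)      # 768 -> 768
classifier = linear_params(HIDDEN, NUM_LABELS)      # 768 -> 3
new_params = pre_classifier + classifier
print(f"{new_params:,}")  # 592,899
```

Under these assumptions, fewer than 0.6 million parameters start from random values; the rest of the network starts from the pre-trained checkpoint.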


## 4 Training Arguments

Define *how* we train:

* 2 epochs, batch‑size 16, learning‑rate 2 × 10⁻⁵  
* Evaluate and save a checkpoint **once per epoch**  
* Weight decay of 0.01, logging every 50 steps, and the best checkpoint (by macro‑F1) reloaded at the end

> **Note** Recent versions of Transformers (4.41 and later) use `eval_strategy`,  
> whereas older versions expect `evaluation_strategy`.
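
Because the keyword name differs between releases, one way to stay version-agnostic is to inspect the signature at runtime. The sketch below is a minimal illustration using only the standard library; `eval_strategy_kwarg` and the two stand-in functions are hypothetical and exist only to demonstrate the lookup.

```python
import inspect
from typing import Callable

def eval_strategy_kwarg(cls_or_func: Callable) -> str:
    """Return whichever evaluation-strategy keyword the callable accepts."""
    params = inspect.signature(cls_or_func).parameters
    for name in ("eval_strategy", "evaluation_strategy"):
        if name in params:
            return name
    raise TypeError("callable accepts neither keyword")

# Stand-ins mimicking the two signatures (hypothetical, for illustration only)
def accepts_long_name(output_dir, evaluation_strategy="no"):
    return evaluation_strategy

def accepts_short_name(output_dir, eval_strategy="no"):
    return eval_strategy

long_kw = eval_strategy_kwarg(accepts_long_name)
short_kw = eval_strategy_kwarg(accepts_short_name)
print(long_kw, short_kw)
```

In the notebook this could be applied as `TrainingArguments(**{eval_strategy_kwarg(TrainingArguments): "epoch"}, ...)`, which should work because `TrainingArguments` is a dataclass with an inspectable signature.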

In [5]:
# %% 4 Training Arguments ──────────────────────────────────────────
EPOCHS        = 2
BATCH_SIZE    = 16
LEARNING_RATE = 2e-5

train_args = TrainingArguments(
    output_dir              = MODEL_DIR / "checkpoints",
    eval_strategy           = "epoch",
    save_strategy           = "epoch",
    load_best_model_at_end  = True,
    metric_for_best_model   = "eval_f1",
    greater_is_better       = True,
    learning_rate           = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size  = BATCH_SIZE,
    num_train_epochs        = EPOCHS,
    weight_decay            = 0.01,
    seed                    = SEED,
    logging_steps           = 50,
    save_total_limit        = 2,      # keep last 2 checkpoints only
    report_to               = "none",
)
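
As a sanity check on these settings, the number of optimizer steps can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper.

```python
import math

def total_optimizer_steps(n_rows: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps for a run on one device without gradient accumulation."""
    steps_per_epoch = math.ceil(n_rows / batch_size)
    return steps_per_epoch * epochs

steps = total_optimizer_steps(n_rows=11_712, batch_size=16, epochs=2)
log_points = steps // 50  # how often logging_steps=50 fires over the run
print(steps, log_points)  # 1464 29
```

The result of 732 steps per epoch and 1,464 in total matches the `global_step=1464` that the trainer reports after fine-tuning.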

## 5 Trainer + Fine‑Tune

Glue everything together:

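1. **DataCollatorWithPadding** assembles each batch into PyTorch tensors; because `encode` already pads every tweet to 128 tokens, it has no extra padding to add.  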
2. **compute_metrics** returns accuracy and macro‑F1 after every validation pass.  
3. **Trainer.train()** runs the full training loop and prints a neat progress bar plus validation scores.

In [6]:
# %% 5 Trainer + Fine‑Tune ─────────────────────────────────────────
data_collator = DataCollatorWithPadding(tokenizer=tok, return_tensors="pt")

metric_acc = load_metric("accuracy")
metric_f1  = load_metric("f1")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(-1)
    refs  = eval_pred.label_ids
    return {
        "accuracy": metric_acc.compute(predictions=preds, references=refs)["accuracy"],
        "f1": metric_f1.compute(predictions=preds, references=refs, average="macro")["f1"],
    }

trainer = Trainer(
    model           = model,
    args            = train_args,
    train_dataset   = train_ds,
    eval_dataset    = val_ds,
    data_collator   = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.485 | 0.410365 | 0.837432 | 0.787987 |
| 2 | 0.3195 | 0.419298 | 0.840164 | 0.798038 |




TrainOutput(global_step=1464, training_loss=0.4235726233388557, metrics={'train_runtime': 4311.2734, 'train_samples_per_second': 5.433, 'train_steps_per_second': 0.34, 'total_flos': 775742920556544.0, 'train_loss': 0.4235726233388557, 'epoch': 2.0})
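
Macro‑F1 comes out lower than accuracy in both epochs. Macro averaging gives every class equal weight, so a weaker score on a smaller class pulls the average down more than it affects accuracy. The NumPy sketch below reimplements both metrics from raw logits on made-up data to make that visible; it is an illustration, not the `evaluate` implementation, and `accuracy_and_macro_f1` is a hypothetical helper.

```python
import numpy as np

def accuracy_and_macro_f1(logits: np.ndarray, labels: np.ndarray, n_classes: int):
    """Compute accuracy and the unweighted mean of per-class F1 scores."""
    preds = logits.argmax(axis=-1)
    acc = float((preds == labels).mean())
    f1s = []
    for c in range(n_classes):
        tp = np.sum((preds == c) & (labels == c))
        fp = np.sum((preds == c) & (labels != c))
        fn = np.sum((preds != c) & (labels == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return acc, float(np.mean(f1s))

# Made-up logits: six examples (rows), three classes (columns)
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [1.5, 0.2, 0.1],
    [0.9, 0.4, 0.2],
    [1.1, 0.8, 0.3],   # predicted class 0, true class 1
    [0.2, 1.7, 0.1],
    [0.3, 0.2, 1.9],
])
toy_labels = np.array([0, 0, 0, 1, 1, 2])

acc, macro_f1 = accuracy_and_macro_f1(toy_logits, toy_labels, n_classes=3)
print(round(acc, 4), round(macro_f1, 4))
```

In this toy case the single error costs class 1 half of its recall, so its F1 drops to 2/3 while the majority class stays at 6/7.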

## 6 Save Artifacts & Export

Persist everything required for later inference or sharing:

* **Fine‑tuned model weights and config** (`models/distilbert_twitter/final/`)  
* **Tokenizer vocab & config** (`models/distilbert_twitter/final/tokenizer/`)  
* **Validation metrics** as a small JSON file (`val_metrics.json`) for easy comparison

In [None]:
# %% 6 Save Artefacts & Export ─────────────────────────────────────
VAL_METRICS = trainer.evaluate()            # fetch best‑epoch metrics

SAVE_DIR = MODEL_DIR / "final"
TOKEN_DIR = SAVE_DIR / "tokenizer"

SAVE_DIR.mkdir(parents=True, exist_ok=True)

# model & tokenizer
trainer.save_model(SAVE_DIR)                # saves both config & weights
tok.save_pretrained(TOKEN_DIR)

# metrics
with open(SAVE_DIR / "val_metrics.json", "w") as fp:
    json.dump(VAL_METRICS, fp, indent=2)

print("✅ Artefacts saved to", SAVE_DIR.resolve())
pprint.pp(VAL_METRICS)



✅ Artefacts saved to C:\Projects\twitter-airline-analysis\models\distilbert_twitter\final
{'eval_loss': 0.41929781436920166,
 'eval_accuracy': 0.8401639344262295,
 'eval_f1': 0.7980384320135547,
 'eval_runtime': 62.0646,
 'eval_samples_per_second': 23.588,
 'eval_steps_per_second': 1.482,
 'epoch': 2.0}
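
For the later comparison with the baseline, the saved metrics file can be read back with the standard library alone. The sketch below is one possible approach; `load_val_metrics` is a hypothetical helper, and the demo writes the values printed above to a temporary file instead of touching the real `val_metrics.json`.

```python
import json
import tempfile
from pathlib import Path

def load_val_metrics(path: Path, keys=("eval_accuracy", "eval_f1", "eval_loss")) -> dict:
    """Read a metrics JSON file and keep only the selected keys, rounded to 4 places."""
    with open(path) as fp:
        metrics = json.load(fp)
    return {k: round(metrics[k], 4) for k in keys}

# Demo on a temporary file holding the validation metrics printed above
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "val_metrics.json"
    demo_path.write_text(json.dumps({
        "eval_loss": 0.41929781436920166,
        "eval_accuracy": 0.8401639344262295,
        "eval_f1": 0.7980384320135547,
        "epoch": 2.0,
    }))
    summary = load_val_metrics(demo_path)

print(summary)
```

Pointing `load_val_metrics` at `MODEL_DIR / "final" / "val_metrics.json"` should return the same three numbers for the fine-tuned model.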
