# 05 – Transformer Fine‑Tune (DistilBERT)

Fine‑tune a lightweight transformer (DistilBERT) on the **Twitter‑Airline Sentiment** dataset and benchmark it against a classical **TF‑IDF + Logistic Regression** baseline.

> **Model** `distilbert‑base‑uncased`  
> **Training split** 90 % of cleaned data (stratified)  
> **Validation split** 10 % (held‑out during fine‑tuning)  
> **Test set** Untouched split created in `04_baseline_model.ipynb`  
> **Artifacts saved to** `models/distilbert_twitter/`

## 0 Imports & Global Config

Everything we need in one place:

1. **Path handling** (`pathlib.Path`) so the notebook is platform‑agnostic.  
2. **Reproducibility seeds** for Python, NumPy, PyTorch, and (if available) CUDA.  
3. **Key Hugging Face classes** (`AutoTokenizer`, `AutoModelForSequenceClassification`, `Trainer`, …).  
4. A line that tells Transformers to **ignore TensorFlow** so only PyTorch is used.
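The seeding in item 2 can also be wrapped in a small helper so that later notebooks reuse it. The following is a minimal sketch, not part of the original notebook: the name `seed_everything` is hypothetical, and NumPy and PyTorch are imported lazily so the snippet still runs where they are not installed. Transformers also provides `transformers.set_seed`, which covers the same libraries.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch (CPU + CUDA) when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Same seed, same draws: a quick sanity check on the helper
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding alone does not guarantee bit-identical results on a GPU, so small run-to-run differences in the metrics are still possible.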

In [1]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"

# --- core -------------------------------------------------------------
from pathlib import Path
import random

# --- third‑party ------------------------------------------------------
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# transformers / HF
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import Dataset 
from evaluate import load as load_metric
import torch

# --- reproducibility --------------------------------------------------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# --- paths ------------------------------------------------------------
PROJ_ROOT = Path.cwd().parent    
DATA_DIR   = PROJ_ROOT / "data" / "processed"
MODEL_DIR = PROJ_ROOT / "models" / "distilbert_twitter" 
MODEL_DIR.mkdir(parents=True, exist_ok=True)



## 1 Load Cleaned Data

Read the parquet file that contains **14 640 pre‑cleaned tweets** and show the first few rows to confirm the schema.

In [4]:
# Adjust filename to match actual parquet/CSV
df = pd.read_parquet(DATA_DIR / "tweets.parquet")

display(df.head())
print(f"Loaded {len(df):,} tweets")

Unnamed: 0,tweet_id,airline,airline_sentiment,clean_text,negativereason
0,570306133677760513,Virgin America,neutral,what said.,
1,570301130888122368,Virgin America,positive,plus you've added commercials to the experienc...,
2,570301083672813571,Virgin America,neutral,i didn't today... must mean i need to take ano...,
3,570301031407624196,Virgin America,negative,"it's really aggressive to blast obnoxious ""ent...",Bad Flight
4,570300817074462722,Virgin America,negative,and it's a really big bad thing about it,Can't Tell


Loaded 14,640 tweets


## 2 Train / Validation Split

Create a **90 % / 10 % stratified split** so that the class ratios (`negative`, `neutral`, `positive`) stay nearly identical in the training and validation sets.

In [5]:
print(df['airline_sentiment'].value_counts())

airline_sentiment
negative    9178
neutral     3099
positive    2363
Name: count, dtype: int64


In [6]:
train_df, val_df = train_test_split(
    df,
    test_size   = 0.10,
    stratify    = df["airline_sentiment"],
    random_state= SEED,
)

print(f"Train rows: {len(train_df):,}  │ Val rows: {len(val_df):,}")

Train rows: 13,176  │ Val rows: 1,464
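As a rough check on what stratification should produce, the class counts printed above can be turned into shares and expected validation rows. This is a sketch of the arithmetic only; scikit-learn rounds to whole rows, so the realised per-class counts may differ by a row.

```python
# Class counts copied from the value_counts() output above
counts = {"negative": 9178, "neutral": 3099, "positive": 2363}
total = sum(counts.values())
val_fraction = 0.10

shares = {lab: n / total for lab, n in counts.items()}
expected_val = {lab: n * val_fraction for lab, n in counts.items()}

for lab in counts:
    print(f"{lab:<9} share={shares[lab]:.3f}  expected val rows ~ {expected_val[lab]:.1f}")
print(f"total val rows ~ {sum(expected_val.values()):.0f}")
```

The expected total agrees with the 1,464 validation rows reported by the split, and the roughly 63 % negative share is the reason macro-F1 is tracked alongside accuracy later on.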


## 3 Tokenisation → HF Datasets

1. Build a label ↔ ID mapping.  
2. Use DistilBERT’s tokenizer to turn each tweet into `input_ids` and `attention_mask`.  
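3. Convert pandas DataFrames into Arrow‑backed **`datasets.Dataset`** objects, which support fast batched `map` calls.  
4. Remove the raw text and label columns so each dataset holds only model inputs (`input_ids`, `attention_mask`, `labels`), plus the leftover pandas index column `__index_level_0__`, which `Trainer` ignores by default.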
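The `encode` function in this section pads every tweet to a fixed `max_length=128`. To illustrate what that costs compared with padding each batch only to its longest row, here is a small sketch in plain Python. The token counts are hypothetical, chosen only for illustration, and are not measured from the dataset.

```python
# Hypothetical token counts for one batch of 8 tweets (illustrative only)
token_lengths = [12, 25, 31, 18, 40, 22, 15, 27]
MAX_LENGTH = 128


def padded_size(lengths, pad_to):
    """Total token slots in the batch once every row is padded to `pad_to`."""
    return len(lengths) * pad_to


real = sum(token_lengths)
fixed = padded_size(token_lengths, MAX_LENGTH)            # padding="max_length"
dynamic = padded_size(token_lengths, max(token_lengths))  # pad to the longest row

print(f"real tokens: {real}")
print(f"fixed padding: {fixed} slots ({real / fixed:.0%} real)")
print(f"dynamic padding: {dynamic} slots ({real / dynamic:.0%} real)")
```

With Hugging Face, per-batch padding is usually obtained by tokenising with `truncation=True` only and leaving the padding to `DataCollatorWithPadding`.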

In [None]:
# choose columns (names as shown in df.head() above)
TEXT_COL  = "clean_text"
LABEL_COL = "airline_sentiment"

# string ↔ id map
LABELS   = ["negative", "neutral", "positive"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(batch):
    enc = tok(batch[TEXT_COL],
              truncation=True,
              padding="max_length",
              max_length=128)
    enc["labels"] = [label2id[x] for x in batch[LABEL_COL]]
    return enc

# DataFrame ➜ Dataset ➜ tokenised tensors-only
train_ds = (Dataset.from_pandas(train_df[[TEXT_COL, LABEL_COL]])
                .map(encode, batched=True, remove_columns=[TEXT_COL, LABEL_COL]))
val_ds   = (Dataset.from_pandas(val_df[[TEXT_COL, LABEL_COL]])
                .map(encode, batched=True, remove_columns=[TEXT_COL, LABEL_COL]))

# quick check
for name, ds in [("train", train_ds), ("val", val_ds)]:
    print(f"{name}: {ds.num_rows} rows | columns → {ds.column_names} | labels present → {set(ds['labels'])}")

Map: 100%|██████████| 13176/13176 [00:01<00:00, 8578.67 examples/s]
Map: 100%|██████████| 1464/1464 [00:00<00:00, 11297.85 examples/s]


train: 13176 rows | columns → ['__index_level_0__', 'input_ids', 'attention_mask', 'labels'] | labels present → {0, 1, 2}
val: 1464 rows | columns → ['__index_level_0__', 'input_ids', 'attention_mask', 'labels'] | labels present → {0, 1, 2}


## 4 Model Instantiation

Load DistilBERT with a **new classification head** sized for 3 labels.  
Hugging Face warns that the classification weights are randomly initialised—exactly what we want before fine‑tuning.
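The new head emits one logit per label, and `id2label` turns the arg-max index back into a string. The sketch below shows that mapping in plain Python; the logits are made-up values, not model output.

```python
import math

LABELS = ["negative", "neutral", "positive"]
id2label = dict(enumerate(LABELS))


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.1, 0.3, -1.2]  # made-up values for illustration
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[pred_id], [round(p, 3) for p in probs])
```

Before fine-tuning, the randomly initialised head would produce near-arbitrary logits here, which is why the warning below the next cell is expected.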

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels = len(LABELS), # 3
    id2label   = id2label, # {0: 'negative', 1: 'neutral', 2: 'positive'}
    label2id   = label2id, # {'negative': 0, 'neutral': 1, 'positive': 2}
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 5 Training Arguments

Define *how* we train:

* 2 epochs, batch‑size 16, learning‑rate 2 × 10⁻⁵  
* Evaluate and save a checkpoint **once per epoch**  
* Weight decay of 0.01; the checkpoint with the best validation macro‑F1 is reloaded at the end

> **Note** Older versions of Transformers expect `evaluation_strategy`,  
> whereas recent releases (from about v4.41) use `eval_strategy`, as in the cell below.
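These settings fix the number of optimiser steps. A quick worked calculation, assuming a single device and no gradient accumulation (neither is set in the arguments below):

```python
import math

train_rows = 13_176   # training rows from the split in section 2
batch_size = 16
epochs = 2

# The last, partial batch of each epoch still counts as a step
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 824 1648
```

The total matches the `global_step=1648` reported by `trainer.train()` in section 6.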

In [9]:
EPOCHS        = 2
BATCH_SIZE    = 16
LEARNING_RATE = 2e-5

train_args = TrainingArguments(
    output_dir            = MODEL_DIR,         
    eval_strategy         = "epoch",
    save_strategy         = "epoch",
    load_best_model_at_end=True,
    metric_for_best_model = "eval_f1",
    greater_is_better     = True,
    learning_rate         = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size  = BATCH_SIZE,
    num_train_epochs      = EPOCHS,
    weight_decay          = 0.01,
    seed                  = SEED,
    save_total_limit      = 2,                  # keep last two checkpoints only
    report_to             = "none",
)


## 6 Trainer + Fine‑Tune

Glue everything together:

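1. **DataCollatorWithPadding** assembles each batch; because `encode` already pads every tweet to `max_length=128`, there is nothing left to pad dynamically (drop `padding="max_length"` there to get per‑batch padding).  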
2. **compute_metrics** returns accuracy and macro‑F1 after every validation pass.  
3. **Trainer.train()** runs the full training loop and prints a neat progress bar plus validation scores.
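To make the difference between the two metrics concrete, here is a dependency-free sketch of accuracy and macro-F1 on a toy, imbalanced example. The labels are made up, and the notebook itself uses the `evaluate` library rather than these hand-written functions.

```python
def accuracy(preds, refs):
    """Fraction of predictions that match the reference labels."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)


def macro_f1(preds, refs, num_classes=3):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(p == c and r == c for p, r in zip(preds, refs))
        fp = sum(p == c and r != c for p, r in zip(preds, refs))
        fn = sum(p != c and r == c for p, r in zip(preds, refs))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy labels (made up): 0 = negative, 1 = neutral, 2 = positive
refs  = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
preds = [0, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(f"accuracy={accuracy(preds, refs):.3f}  macro_f1={macro_f1(preds, refs):.3f}")
# accuracy=0.800  macro_f1=0.730
```

Because macro-F1 weights every class equally, mistakes on the smaller `neutral` and `positive` classes pull it below accuracy, which is the same pattern seen in the validation scores below.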

In [10]:
data_collator = DataCollatorWithPadding(tokenizer=tok, return_tensors="pt")

metric_acc = load_metric("accuracy")
metric_f1  = load_metric("f1")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(-1)
    refs  = eval_pred.label_ids
    return {
        "accuracy": metric_acc.compute(predictions=preds, references=refs)["accuracy"],
        "f1": metric_f1.compute(predictions=preds, references=refs, average="macro")["f1"],
    }

trainer = Trainer(
    model           = model,
    args            = train_args,
    train_dataset   = train_ds,
    eval_dataset    = val_ds,
    data_collator   = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.5455,0.451778,0.829235,0.772644
2,0.3287,0.453802,0.840164,0.794687




TrainOutput(global_step=1648, training_loss=0.4132809291765528, metrics={'train_runtime': 5045.3443, 'train_samples_per_second': 5.223, 'train_steps_per_second': 0.327, 'total_flos': 872710785626112.0, 'train_loss': 0.4132809291765528, 'epoch': 2.0})

## 7 Results Snapshot

| Epoch | Train Loss | Validation Loss | Accuracy | Macro‑F1 |
|-------|-----------:|---------------:|---------:|---------:|
| 1 | 0.5455 | 0.4518 | 0.8292 | 0.7726 |
| 2 | 0.3287 | 0.4538 | 0.8402 | 0.7947 |

*Numbers will vary slightly depending on seed and hardware.*


In [11]:
val_metrics = trainer.evaluate()
val_metrics



{'eval_loss': 0.453801691532135,
 'eval_accuracy': 0.8401639344262295,
 'eval_f1': 0.794687032440573,
 'eval_runtime': 61.631,
 'eval_samples_per_second': 23.754,
 'eval_steps_per_second': 1.493,
 'epoch': 2.0}

## 8 Save Artifacts & Export

Persist everything required for later inference or sharing:

* **Fine‑tuned model weights** (`models/distilbert_twitter/final/`)  
* **Tokenizer vocab & config** (`models/distilbert_twitter/final/tokenizer/`)  
* **Validation metrics** as a tiny CSV for easy comparison (not written by the save cell below; export `val_metrics` separately)

In [12]:
SAVE_DIR = Path("../models/distilbert_twitter/final")      # ONE folder
SAVE_DIR.mkdir(parents=True, exist_ok=True)

# save model – write pytorch_model.bin to avoid Windows mmap issue
model.save_pretrained(SAVE_DIR, safe_serialization=False)

# save tokenizer – put it in a sub‑folder for neatness
tok.save_pretrained(SAVE_DIR / "tokenizer")

('..\\models\\distilbert_twitter\\final\\tokenizer\\tokenizer_config.json',
 '..\\models\\distilbert_twitter\\final\\tokenizer\\special_tokens_map.json',
 '..\\models\\distilbert_twitter\\final\\tokenizer\\vocab.txt',
 '..\\models\\distilbert_twitter\\final\\tokenizer\\added_tokens.json',
 '..\\models\\distilbert_twitter\\final\\tokenizer\\tokenizer.json')
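The save cell above writes the model and tokenizer but not the metrics CSV. One way to export it is sketched here with the standard library only. The helper name `save_metrics_csv` and the file name `val_metrics.csv` are hypothetical, and the values are copied from the `trainer.evaluate()` output in section 7.

```python
import csv
import tempfile
from pathlib import Path


def save_metrics_csv(metrics: dict, path: Path) -> Path:
    """Write one header row and one value row; return the path written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    return path


# Values copied from the trainer.evaluate() output in section 7
val_metrics = {
    "eval_loss": 0.453801691532135,
    "eval_accuracy": 0.8401639344262295,
    "eval_f1": 0.794687032440573,
    "epoch": 2.0,
}

# A temp dir keeps this sketch side-effect free; in the notebook, SAVE_DIR would be used
out_path = save_metrics_csv(val_metrics, Path(tempfile.mkdtemp()) / "val_metrics.csv")
print(out_path.read_text(encoding="utf-8"))
```

In the notebook itself, passing the dictionary returned by `trainer.evaluate()` and a path under `SAVE_DIR` would place the file next to the model weights.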

In [13]:
print(MODEL_DIR)

c:\Projects\twitter-airline-analysis\models\distilbert_twitter
