# 05 – Transformer Fine‑Tune (DistilBERT)

Fine‑tune a lightweight transformer (DistilBERT) on the **Twitter‑Airline Sentiment** dataset and benchmark it against the classical **TF‑IDF + Logistic Regression** baseline.

> **Model** `distilbert-base-uncased`  
> **Training split** 90 % of cleaned data (**stratified**)  
> **Validation split** 10 % (held‑out during fine‑tuning)  
> **Test set** Untouched split created in `04_baseline_model.ipynb`  
> **Artifacts saved to** `models/distilbert_twitter/`

---

## Tools Used & Why  <!-- required by portfolio spec -->

| Tool | Purpose | Why this tool |
|------|---------|--------------|
| **Hugging Face Transformers** | Pre‑trained DistilBERT weights & Trainer API | Industry standard for NLP transfer learning |
| **Datasets** | Efficient dataset objects, streaming, mapping | Handles tokenisation, caching, and splits cleanly |
| **PyTorch** | Back‑end tensor engine | Widely supported, mature, CUDA‑ready |
| **evaluate** | Metrics (accuracy, F1) | Consistent with HF ecosystem |
| **scikit‑learn** | Baseline metrics & confusion matrix | Lightweight, familiar API for classic ML |

## 0 Imports & Global Config  

In [None]:
import os
os.environ["TRANSFORMERS_NO_TF"] = "1"

In [None]:
# --- core -------------------------------------------------------------
from pathlib import Path
import random

# --- third‑party ------------------------------------------------------
import numpy as np
import pandas as pd

# transformers / HF
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import Dataset 
from evaluate import load as load_metric
import torch

# --- reproducibility --------------------------------------------------
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# --- paths ------------------------------------------------------------
PROJ_ROOT  = Path.cwd().parent          # notebook runs from notebooks/; project root is one level up
DATA_DIR   = PROJ_ROOT / "data" / "processed"
MODEL_DIR  = PROJ_ROOT / "models" / "distilbert_twitter"
MODEL_DIR.mkdir(parents=True, exist_ok=True)



## 1 Load Cleaned Data

In [3]:
PROJ_ROOT = Path.cwd().parent           # one level up from notebooks/
DATA_DIR  = PROJ_ROOT / "data" / "processed"

# Adjust filename to match actual parquet/CSV
df = pd.read_parquet(DATA_DIR / "tweets.parquet")

display(df.head())
print(f"Loaded {len(df):,} tweets")

|   | tweet_id | airline | airline_sentiment | clean_text | negativereason |
|---|----------|---------|-------------------|------------|----------------|
| 0 | 570306133677760513 | Virgin America | neutral | what said. | |
| 1 | 570301130888122368 | Virgin America | positive | plus you've added commercials to the experienc... | |
| 2 | 570301083672813571 | Virgin America | neutral | i didn't today... must mean i need to take ano... | |
| 3 | 570301031407624196 | Virgin America | negative | it's really aggressive to blast obnoxious "ent... | Bad Flight |
| 4 | 570300817074462722 | Virgin America | negative | and it's a really big bad thing about it | Can't Tell |


Loaded 14,640 tweets


## 2 Train / Validation Split

In [4]:
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(
    df,
    test_size   = 0.10,
    stratify    = df["airline_sentiment"],
    random_state= SEED,
)

print(f"Train rows: {len(train_df):,}  │ Val rows: {len(val_df):,}")

Train rows: 13,176  │ Val rows: 1,464
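
Because the split is stratified on `airline_sentiment`, each class should make up nearly the same share of both subsets. Below is a minimal sketch of that sanity check using only the standard library. `class_shares` is a hypothetical helper, and the toy label lists are made-up stand-ins for `train_df["airline_sentiment"]` and `val_df["airline_sentiment"]`.

```python
from collections import Counter

def class_shares(labels):
    """Return each class's share of the total as a dict (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Toy stand-ins for the real label columns (made-up data, illustration only).
toy_train = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
toy_val   = ["negative"] * 3 + ["neutral"] * 1 + ["positive"] * 1

train_shares = class_shares(toy_train)
val_shares   = class_shares(toy_val)

# Largest absolute gap in class share between the two subsets.
max_gap = max(abs(train_shares[c] - val_shares[c]) for c in train_shares)
print(train_shares, val_shares, f"max gap = {max_gap:.3f}")
```

On the real data, passing the two `airline_sentiment` columns to the helper should show a very small gap if stratification worked as intended.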


## 3 Tokenisation → HF Datasets

In [None]:
LABELS   = ["negative", "neutral", "positive"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def encode(batch):
    enc = tok(
        batch["clean_text"],
        truncation=True,
        max_length=128,
    )
    enc["labels"] = [label2id[x] for x in batch["airline_sentiment"]]
    return enc

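cols = ["clean_text", "airline_sentiment"]

# Note: from_pandas keeps the shuffled pandas index as `__index_level_0__`.
# Trainer drops columns the model does not accept (remove_unused_columns=True
# by default), so it is harmless here; pass preserve_index=False to omit it.
train_ds = Dataset.from_pandas(train_df[cols]).map(encode, batched=True,
                                                   remove_columns=cols)
val_ds   = Dataset.from_pandas(val_df[cols]).map(encode, batched=True,
                                                 remove_columns=cols)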

print(train_ds)
print(val_ds)

Map: 100%|██████████| 13176/13176 [00:00<00:00, 20452.13 examples/s]
Map: 100%|██████████| 1464/1464 [00:00<00:00, 26142.30 examples/s]

Dataset({
    features: ['__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 13176
})
Dataset({
    features: ['__index_level_0__', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1464
})





## 4 Model Instantiation

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels = len(LABELS),
    id2label   = id2label,
    label2id   = label2id,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
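
The `id2label` and `label2id` mappings passed to `from_pretrained` are stored in the model config so that predicted class indices can be turned back into label strings. A minimal sketch of that round trip follows, with plain Python lists standing in for model logits. `decode` is a hypothetical helper and the scores are made up for illustration.

```python
LABELS = ["negative", "neutral", "positive"]
label2id = {lab: i for i, lab in enumerate(LABELS)}
id2label = {i: lab for lab, i in label2id.items()}

def decode(logits_row):
    """Map one row of class scores to its label string via argmax."""
    best = max(range(len(logits_row)), key=logits_row.__getitem__)
    return id2label[best]

# Made-up scores for two tweets, illustration only.
toy_logits = [[2.1, 0.3, -1.0], [-0.5, 0.2, 1.7]]
decoded = [decode(row) for row in toy_logits]
print(decoded)
```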


## 5 Training Arguments

In [7]:
EPOCHS        = 2
BATCH_SIZE    = 16
LEARNING_RATE = 2e-5

train_args = TrainingArguments(
    output_dir              = MODEL_DIR,
    eval_strategy           = "epoch",
    save_strategy           = "epoch",
    learning_rate           = LEARNING_RATE,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size  = BATCH_SIZE,
    num_train_epochs        = EPOCHS,
    weight_decay            = 0.01,
    logging_steps           = 50,
    seed                    = SEED,
    report_to               = "none",
)
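
These settings fix the number of optimiser steps. A short worked calculation, assuming a single device and no gradient accumulation (neither is overridden in the arguments above), gives the total that `trainer.train()` should report as `global_step`:

```python
import math

N_TRAIN, N_VAL = 13_176, 1_464   # row counts printed in section 2
EPOCHS, BATCH_SIZE = 2, 16       # values from the cell above

# Assumption: one device, gradient_accumulation_steps left at 1,
# and the last partial batch is kept (hence the ceiling).
steps_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
eval_batches = math.ceil(N_VAL / BATCH_SIZE)

print(steps_per_epoch, total_steps, eval_batches)  # 824 1648 92
```

With `logging_steps = 50`, that works out to roughly 32 training-loss log entries over the run.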


## 6 Trainer + Fine‑Tune

In [8]:
data_collator = DataCollatorWithPadding(tokenizer=tok, return_tensors="pt")

metric_acc = load_metric("accuracy")
metric_f1  = load_metric("f1")

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(-1)
    refs  = eval_pred.label_ids
    return {
        "accuracy": metric_acc.compute(predictions=preds, references=refs)["accuracy"],
        "f1": metric_f1.compute(predictions=preds, references=refs, average="macro")["f1"],
    }

trainer = Trainer(
    model           = model,
    args            = train_args,
    train_dataset   = train_ds,
    eval_dataset    = val_ds,
    data_collator   = data_collator,
    compute_metrics = compute_metrics,
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.4161 | 0.444006 | 0.835383 | 0.782617 |
| 2 | 0.3336 | 0.450154 | 0.838798 | 0.791774 |




TrainOutput(global_step=1648, training_loss=0.41087919589385247, metrics={'train_runtime': 1812.3851, 'train_samples_per_second': 14.54, 'train_steps_per_second': 0.909, 'total_flos': 256589787423456.0, 'train_loss': 0.41087919589385247, 'epoch': 2.0})
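
The `encode` function in section 3 truncates but does not pad, so padding is left to `DataCollatorWithPadding`, which pads each batch to its own longest sequence. The sketch below shows the idea in plain Python. It is a conceptual illustration rather than the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to the longest one and build attention masks."""
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids, illustration only.
batch = pad_batch([[101, 2054, 102], [101, 2009, 1005, 1055, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch instead of to a fixed `max_length` keeps short tweets from being padded out to 128 tokens, which reduces wasted computation.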

## 7 Evaluation on Validation Set

In [9]:
val_metrics = trainer.evaluate()
val_metrics



{'eval_loss': 0.4501541554927826,
 'eval_accuracy': 0.8387978142076503,
 'eval_f1': 0.7917735076611953,
 'eval_runtime': 19.4595,
 'eval_samples_per_second': 75.233,
 'eval_steps_per_second': 4.728,
 'epoch': 2.0}
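
`eval_f1` is a macro average: `compute_metrics` passes `average="macro"`, so every class counts equally regardless of how many tweets it has. This is one way F1 can sit below accuracy, as it does here. A minimal pure-Python sketch on made-up labels shows the effect; `macro_f1` is a hypothetical helper, not the `evaluate` implementation.

```python
def macro_f1(refs, preds, num_classes):
    """Unweighted mean of per-class F1; a class with no TP, FP or FN scores 0."""
    scores = []
    for c in range(num_classes):
        tp = sum(r == c and p == c for r, p in zip(refs, preds))
        fp = sum(r != c and p == c for r, p in zip(refs, preds))
        fn = sum(r == c and p != c for r, p in zip(refs, preds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

# Made-up example: the single class-1 item is misclassified as class 0.
refs  = [0, 0, 0, 0, 1, 2]
preds = [0, 0, 0, 0, 0, 2]

acc = sum(r == p for r, p in zip(refs, preds)) / len(refs)
f1 = macro_f1(refs, preds, num_classes=3)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

In this toy case one missed minority-class example costs a sixth of the accuracy but a full third of the macro F1, since that class's F1 drops to zero.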

## 8 Save Artifacts & Export