# Fine-tuning DistilBERT on Your Dataset (Single-Dataset Notebook)

## What this notebook does
- Loads **your dataset only** (`DATA_REVIEWS`)
- Splits into train/validation (stratified)
- Fine-tunes **DistilBERT base** (`distilbert-base-uncased`) for binary sentiment classification
- Saves the fine-tuned model locally (so you do **not** retrain next time)
- Evaluates on the validation split with:
  - full `classification_report` (per-class precision/recall/F1)
  - confusion matrix
  - error analysis table (longest misclassified reviews)

## Notes
- This notebook intentionally **does not** use `DATA_REVIEWS_REAL`.
- You can create a separate evaluation notebook later if you want cross-dataset generalization checks.


In [1]:
# ============================================
# 1) Imports & setup
# ============================================
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    pipeline,
)

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 140)

SEED = 42
np.random.seed(SEED)


  from .autonotebook import tqdm as notebook_tqdm


## 2) Path and expected schema
TSV is expected to contain:
- `Review` (text)
- `Liked` (label: `1` positive, `0` negative)


In [2]:
# ============================================
# 2) Path (EDIT IF NEEDED)
# ============================================
DATA_REVIEWS_PATH = "data/reviews_dataset.tsv"
SEP = "\t"


In [3]:
# ============================================
# 3) Load dataset
# ============================================
def load_reviews_tsv(path: str, sep: str = "\t") -> pd.DataFrame:
    df = pd.read_csv(path, sep=sep)
    required = {"Review", "Liked"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns in {path}: {missing}. Expected at least {required}.")
    df = df.copy()
    df["Review"] = df["Review"].fillna("").astype(str)
    df["Liked"] = df["Liked"].astype(int)
    return df

df = load_reviews_tsv(DATA_REVIEWS_PATH, sep=SEP)

print("DATA_REVIEWS:", df.shape)
display(df.head(5))

print("\nLabel distribution (0/1):")
print(df["Liked"].value_counts().to_dict())


DATA_REVIEWS: (6000, 2)


Unnamed: 0,Review,Liked
0,"I expected confusing, not this: impressive fans.",1
1,Not impressive at all — the check-in was actua...,0
2,I absolutely liked the drinks; it was outstand...,1
3,a pleasant surprise. The fans felt impressive.,1
4,"I thought it would be pleasant, but it was not...",0



Label distribution (0/1):
{1: 3000, 0: 3000}


## 4) Train/validation split
We use a stratified split to preserve the label ratio.


In [4]:
# ============================================
# 4) Train/val split
# ============================================
train_df, val_df = train_test_split(
    df,
    test_size=0.2,
    random_state=SEED,
    stratify=df["Liked"]
)

train_df = train_df.reset_index(drop=True)
val_df   = val_df.reset_index(drop=True)

print("train:", train_df.shape, "val:", val_df.shape)


train: (4800, 2) val: (1200, 2)


## 5) Tokenization and HF datasets
Speed tips:
- Reduce `MAX_LENGTH` (128/256)
- Keep epochs low (1–2) on CPU


In [5]:
# ============================================
# 5) Tokenization + HF datasets
# ============================================
BASE_MODEL = "distilbert-base-uncased"
OUT_DIR = "./models/finetuned_distilbert_data_reviews"

MAX_LENGTH = 128
BATCH_SIZE = 16
EPOCHS = 5

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def to_hf_dataset(df_in: pd.DataFrame) -> Dataset:
    ds = Dataset.from_pandas(df_in)
    ds = ds.rename_column("Liked", "labels")

    def tokenize_batch(batch):
        return tokenizer(batch["Review"], truncation=True, max_length=MAX_LENGTH)

    ds = ds.map(tokenize_batch, batched=True)

    keep = {"input_ids", "attention_mask", "labels"}
    remove_cols = [c for c in ds.column_names if c not in keep]
    if remove_cols:
        ds = ds.remove_columns(remove_cols)

    ds.set_format(type="torch")
    return ds

train_ds = to_hf_dataset(train_df)
val_ds   = to_hf_dataset(val_df)

print("train_ds:", train_ds.num_rows, "val_ds:", val_ds.num_rows)


Map: 100%|██████████| 4800/4800 [00:00<00:00, 23795.16 examples/s]
Map: 100%|██████████| 1200/1200 [00:00<00:00, 5186.40 examples/s]

train_ds: 4800 val_ds: 1200





## 6) Metrics


In [6]:
# ============================================
# 6) Metrics (macro)
# ============================================
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    acc = accuracy_score(labels, preds)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="macro", zero_division=0)
    return {"accuracy": acc, "precision_macro": p, "recall_macro": r, "f1_macro": f1}


## 7) Fine-tune DistilBERT
This is the only slow cell.


In [7]:
# ============================================
# 7) Fine-tuning
# ============================================
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL, num_labels=2)

args = TrainingArguments(
    output_dir="./results_finetune_data_reviews",
    eval_strategy="epoch",
    save_strategy="no",
    learning_rate=2e-5,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    logging_steps=100,
    seed=SEED,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()


Loading weights: 100%|██████████| 100/100 [00:00<00:00, 243.84it/s, Materializing param=distilbert.transformer.layer.5.sa_layer_norm.weight]   
[1mDistilBertForSequenceClassification LOAD REPORT[0m from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
classifier.weight       | MISSING    | 
pre_classifier.weight   | MISSING    | 
pre_classifier.bias     | MISSING    | 
classifier.bias         | MISSING    | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING[3m	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.[0m
  super().__init__(loader)


Epoch,Training Loss,Validation Loss,Accuracy,Precision Macro,Recall Macro,F1 Macro
1,0.000978,0.000531,1.0,1.0,1.0,1.0
2,0.000264,0.000165,1.0,1.0,1.0,1.0
3,0.000137,8.8e-05,1.0,1.0,1.0,1.0
4,9.5e-05,6.3e-05,1.0,1.0,1.0,1.0
5,8.1e-05,5.6e-05,1.0,1.0,1.0,1.0


  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)
  super().__init__(loader)


{'eval_loss': 5.578479976975359e-05,
 'eval_accuracy': 1.0,
 'eval_precision_macro': 1.0,
 'eval_recall_macro': 1.0,
 'eval_f1_macro': 1.0,
 'eval_runtime': 34.8746,
 'eval_samples_per_second': 34.409,
 'eval_steps_per_second': 2.151,
 'epoch': 5.0}

## 8) Save model locally


In [8]:
# ============================================
# 8) Save model locally
# ============================================
import os

os.makedirs(OUT_DIR, exist_ok=True)
trainer.save_model(OUT_DIR)
tokenizer.save_pretrained(OUT_DIR)

print("Saved fine-tuned model to:", OUT_DIR)


Writing model shards: 100%|██████████| 1/1 [00:00<00:00,  1.85it/s]

Saved fine-tuned model to: ./models/finetuned_distilbert_data_reviews





## 9) Evaluate on validation split (sequential style)
We generate:
- classification report
- confusion matrix
- error analysis table


In [9]:
# ============================================
# 9) Validation evaluation (pipeline)
# ============================================
ft_clf = pipeline(
    "sentiment-analysis",
    model=OUT_DIR,
    tokenizer=OUT_DIR,
    device=-1  # CPU; set 0 if you have CUDA GPU
)

texts = val_df["Review"].fillna("").astype(str).tolist()
y_true = val_df["Liked"].astype(int).tolist()

preds = ft_clf(texts, batch_size=32, truncation=True, max_length=MAX_LENGTH)
y_pred = [1 if p["label"] == "POSITIVE" else 0 for p in preds]

print("=== Fine-tuned classification report (VAL) ===")
print(classification_report(y_true, y_pred, digits=4))

cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm, index=["true_0","true_1"], columns=["pred_0","pred_1"])
print("\n=== Confusion matrix (VAL) ===")
display(cm_df)

val_acc = accuracy_score(y_true, y_pred)
print("\nVAL accuracy:", val_acc)


Loading weights: 100%|██████████| 104/104 [00:00<00:00, 191.08it/s, Materializing param=pre_classifier.weight]                                  


=== Fine-tuned classification report (VAL) ===
              precision    recall  f1-score   support

           0     0.5000    1.0000    0.6667       600
           1     0.0000    0.0000    0.0000       600

    accuracy                         0.5000      1200
   macro avg     0.2500    0.5000    0.3333      1200
weighted avg     0.2500    0.5000    0.3333      1200


=== Confusion matrix (VAL) ===


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


Unnamed: 0,pred_0,pred_1
true_0,600,0
true_1,600,0



VAL accuracy: 0.5


In [10]:
# ============================================
# 10) Error analysis (VAL)
# ============================================
out = val_df.copy()
out["y_true"] = y_true
out["y_pred"] = y_pred
out["is_error"] = out["y_true"] != out["y_pred"]

err = out[out["is_error"]].copy()
err["review_len"] = err["Review"].astype(str).str.len()

print("Errors (VAL):", len(err), "/", len(val_df))
display(err.sort_values("review_len", ascending=False).head(20)[["y_true","y_pred","review_len","Review"]])


Errors (VAL): 600 / 1200


Unnamed: 0,y_true,y_pred,review_len,Review
47,1,0,140,The performance started outstanding; even thou...
576,1,0,130,"absolutely coherent at first, however it becam..."
338,1,0,128,The lights started outstanding; even though th...
690,1,0,127,The atmosphere started pleasant; yet the setli...
252,1,0,126,The performance started pleasant; even though ...
936,1,0,126,The dessert started outstanding; however the r...
230,1,0,125,The ending started outstanding; although the c...
773,1,0,120,"The flight was outstanding, even though the ch..."
246,1,0,120,"I loved the cinematography, even though the di..."
388,1,0,120,"really excellent at first, however it became c..."


## 11) Load the saved model later (no retraining)
Use this snippet in any notebook/script:


In [11]:
from transformers import pipeline

clf = pipeline(
    "sentiment-analysis",
    model="./models/finetuned_distilbert_data_reviews",
    tokenizer="./models/finetuned_distilbert_data_reviews",
    device=-1
)

clf("This product is amazing!")


Loading weights: 100%|██████████| 104/104 [00:00<00:00, 384.54it/s, Materializing param=pre_classifier.weight]                                  


[{'label': 'LABEL_1', 'score': 0.999537467956543}]