# Toy Sentiment Classification with DistilBERT

In this notebook we:

1. **Pull the SST-2 dataset** from the GLUE benchmark:
   - **Train split:** 67 349 examples  
   - **Validation split:** 872 examples  
   - **Test split:** 1 821 examples  

2. **Subsample** due to limited compute:
   - **Training/Validation:** 40 examples (20 positive, 20 negative) drawn from the original train split, then split 80/20 into 32 training and 8 validation examples.  
   - **Held-out Test:** 200 examples (100 positive, 100 negative) also drawn from the train split for final evaluation.

3. **Fine-tune a DistilBERT classifier** (`distilbert-base-uncased`) on our tiny dataset:
   - Tokenise sentences to a maximum length of 64 tokens.  
   - Attach the model's small MLP classification head (768-unit linear layer → ReLU → dropout → 2 logits) on top of the 6-layer, 768-hidden-size DistilBERT backbone.  
   - Train for 5 epochs with a batch size of 4, learning rate 2e-5, on CPU only.

4. **Save the resulting model** and tokenizer to `distilbert_finetuned/`. This directory contains the model weights (`model.safetensors` in recent Transformers releases, `pytorch_model.bin` in older ones), `config.json`, `vocab.txt`, etc.

5. **Evaluate** on both the 8-example validation set and the 200-example held-out test set once training has finished.

> **Note**: This is a **toy example** to illustrate the full pipeline. In a production scenario you would fine-tune on the full SST-2 train split (≈67 k examples), use more epochs, a larger batch size or GPU acceleration, and tune hyperparameters to maximize real-world performance.



In [2]:
!pip install --quiet transformers datasets torch scikit-learn



In [11]:
!pip install --quiet tf-keras

In [17]:
!pip install --quiet evaluate


In [23]:
!pip install --quiet --upgrade transformers


In [9]:
!pip install --quiet accelerate


## Load and Inspect the SST-2 Sentiment Dataset

This cell performs the following steps in preparation for model fine-tuning:

1. **Import the dataset loader**  
   Brings in the `load_dataset` function from Hugging Face’s Datasets library.

2. **Download the SST-2 data**  
   Fetches the Stanford Sentiment Treebank (SST-2) from the GLUE benchmark, including `train`, `validation`, and `test` splits.

3. **Show available splits**  
   Prints out a summary of each split (its name and number of examples) so you can verify that the data was loaded correctly.

4. **Select the training split**  
   Extracts the `train` portion of the dataset (approximately 67 000 sentences) for use in fine-tuning.

5. **Display dataset details**  
   - Prints the total number of training examples.  
   - Shows the very first example (a dictionary containing a sentence and its sentiment label) so you can confirm the data format.


In [1]:
#pull in SST-2 from the GLUE benchmark
from datasets import load_dataset

# 1. Download the SST-2 dataset
dataset = load_dataset("glue", "sst2")

# 2. Inspect the splits
print(dataset)

# 3. Grab the training split
train_ds = dataset["train"]
print(f"Number of training examples: {len(train_ds)}")
print(train_ds[0])


DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})
Number of training examples: 67349
{'sentence': 'hide new secretions from the parental units ', 'label': 0, 'idx': 0}


## Create and Save Toy Training and Test Sets

This cell builds small, balanced subsets from the full SST-2 training data and writes them to Excel files:

1. **Import necessary libraries**  
   - `random` for reproducible sampling  
   - Hugging Face’s `load_dataset` to retrieve SST-2  
   - `pandas` to manipulate and save the data (writing `.xlsx` files also requires the `openpyxl` package)  

2. **Load the complete SST-2 training split**  
   Retrieves all ~67 000 examples from the GLUE SST-2 “train” split.

3. **Gather positive and negative indices**  
   Iterates through the dataset to collect two lists of example indices:  
   - `pos_idxs` for positive sentences (`label == 1`)  
   - `neg_idxs` for negative sentences (`label == 0`)  

4. **Sample a small training set**  
   - Seeds the random number generator for reproducibility.  
   - Randomly selects 20 positive and 20 negative indices.  
   - Combines them into `train_idxs`.

5. **Sample a held-out test set**  
   - Removes the training indices from the original pools.  
   - Randomly picks 100 new positive and 100 new negative indices.  
   - Combines them into `test_idxs`.

6. **Extract the subsets**  
   Uses `dataset.select()` to pull out only the sampled examples for both the training and test sets.

7. **Convert to pandas DataFrames and shuffle**  
   - Builds a DataFrame with columns `sentence` and `label` for each split.  
   - Shuffles the rows and resets the index for randomness.

8. **Save to Excel files**  
   - Writes the 40-example training set to **`training.xlsx`**  
   - Writes the 200-example test set to **`test.xlsx`**  

9. **Confirm saving**  
   Prints the number of rows saved in each file to verify that the correct number of examples were written.
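
The sampling logic in steps 3 to 5 can be sketched without downloading anything. This is a minimal illustration on synthetic labels, not the notebook's exact code: `balanced_disjoint_sample` and `toy_labels` are hypothetical names, and the sketch draws each class once and slices the draw, a variant of the remove-then-resample approach described above that gives the same guarantee (balanced classes, no overlap between train and test).

```python
import random

def balanced_disjoint_sample(labels, n_train_per_class, n_test_per_class, seed=42):
    """Return (train_idxs, test_idxs): class-balanced, non-overlapping index lists."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    train_idxs, test_idxs = [], []
    for cls in (1, 0):  # positive first, then negative
        pool = [i for i, y in enumerate(labels) if y == cls]
        # One draw per class, then slice it: train and test cannot share an index.
        picked = rng.sample(pool, n_train_per_class + n_test_per_class)
        train_idxs += picked[:n_train_per_class]
        test_idxs += picked[n_train_per_class:]
    return train_idxs, test_idxs

# Synthetic stand-in for the SST-2 label column (hypothetical data).
toy_labels = [i % 2 for i in range(1000)]
train_idxs, test_idxs = balanced_disjoint_sample(toy_labels, 20, 100)
print(len(train_idxs), len(test_idxs))  # 40 200
```

Using a dedicated `random.Random(seed)` instance keeps the result reproducible even if other code calls the global `random` functions in between.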


In [3]:
# training and test datasets
import random
from datasets import load_dataset
import pandas as pd

# 1) Load the full SST-2 train split
dataset = load_dataset("glue", "sst2")["train"]

# 2) Build index lists
pos_idxs = [i for i, ex in enumerate(dataset) if ex["label"] == 1]
neg_idxs = [i for i, ex in enumerate(dataset) if ex["label"] == 0]

random.seed(42)

# 3) Sample for TRAIN: 20 pos, 20 neg
train_pos = random.sample(pos_idxs, 20)
train_neg = random.sample(neg_idxs, 20)
train_idxs = train_pos + train_neg

# 4) Remove those from pools, then sample for TEST: 100 pos, 100 neg
remaining_pos = [i for i in pos_idxs if i not in train_pos]
remaining_neg = [i for i in neg_idxs if i not in train_neg]

test_pos = random.sample(remaining_pos, 100)
test_neg = random.sample(remaining_neg, 100)
test_idxs = test_pos + test_neg

# 5) Select subsets
train_ds = dataset.select(train_idxs)
test_ds  = dataset.select(test_idxs)

# 6) Convert to DataFrame
train_df = pd.DataFrame({
    "sentence": train_ds["sentence"],
    "label":    train_ds["label"]
}).sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle

test_df = pd.DataFrame({
    "sentence": test_ds["sentence"],
    "label":    test_ds["label"]
}).sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle

# 7) Save to Excel
train_df.to_excel("training.xlsx", index=False)
test_df.to_excel("test.xlsx",      index=False)

print(f"Saved:\n • training.xlsx ({len(train_df)} rows)\n • test.xlsx ({len(test_df)} rows)")



Saved:
 • training.xlsx (40 rows)
 • test.xlsx (200 rows)


## Load and Split the 40-Example Training Set into Train/Validation

This cell prepares our small toy dataset for fine-tuning by:

1. **Importing libraries**  
   - `pandas` for reading Excel files  
   - Hugging Face’s `Dataset` class for easy downstream processing  

2. **Loading the 40-example Excel file**  
   - Reads **`training.xlsx`** into a pandas DataFrame (`train_df`) containing exactly 40 rows.

3. **Converting to a Hugging Face Dataset**  
   - Wraps the DataFrame with `Dataset.from_pandas()` to get `ds40`, enabling `.map()`, `.train_test_split()`, and other Dataset operations.

4. **Cleaning up index columns**  
   - Removes any auto-generated pandas index columns (names starting with `__`) so only our original `sentence` and `label` fields remain.

5. **Splitting into train and validation**  
   - Uses an 80/20 split (`test_size=0.2`) to create:  
     - **`train_ds`** with 32 examples for fine-tuning  
     - **`val_ds`** with 8 examples for in-training validation  
   - A fixed seed guarantees reproducible splits.

6. **Confirmation printout**  
   - Prints the sizes of each split to ensure we have the expected 32 training and 8 validation examples.  
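
One point worth keeping in mind: a plain 80/20 shuffle split is not stratified, so the 8 validation examples need not contain 4 positives and 4 negatives. The sketch below illustrates this with the standard library only. `shuffle_split` is a hypothetical helper, and its shuffling differs from the one inside `train_test_split`, so the indices it picks will not match the notebook's split.

```python
import random

def shuffle_split(n_examples, test_size=0.2, seed=42):
    """Shuffle indices 0..n-1 and cut them into (train, validation) lists."""
    rng = random.Random(seed)
    idxs = list(range(n_examples))
    rng.shuffle(idxs)
    n_val = round(n_examples * test_size)  # 40 * 0.2 -> 8
    return idxs[n_val:], idxs[:n_val]

# 20 positive and 20 negative labels, as in the 40-example file.
labels = [1] * 20 + [0] * 20
train_idxs, val_idxs = shuffle_split(len(labels))
val_pos = sum(labels[i] for i in val_idxs)

print(len(train_idxs), len(val_idxs))  # 32 8
print(f"validation positives: {val_pos} of {len(val_idxs)}")
```

If a balanced validation set matters, the Datasets library offers a `stratify_by_column` argument on `train_test_split`; it expects the label column to be a `ClassLabel` feature, which a column loaded from Excel is not by default.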


In [5]:
#1. Load only the 40-example training file and split into training/validation
import pandas as pd
from datasets import Dataset

# Load the 40-example training set
train_df = pd.read_excel("training.xlsx")
ds40     = Dataset.from_pandas(train_df)

# Remove any pandas index column
ds40 = ds40.remove_columns([c for c in ds40.column_names if c.startswith("__")])

# Split: 80% train (32 examples), 20% validation (8 examples)
split = ds40.train_test_split(test_size=0.2, seed=42)
train_ds = split["train"]
val_ds   = split["test"]

print(f"Train on {len(train_ds)} examples, validate on {len(val_ds)} examples")


Train on 32 examples, validate on 8 examples


## Load the Held-Out Test Set for Final Evaluation

This cell performs the following steps to prepare the test data for a one-time evaluation after training:

1. **Read the 200-example test file**  
   Loads **`test.xlsx`** into a pandas DataFrame (`test_df`) containing 200 sentences and their labels.

2. **Convert to a Hugging Face Dataset**  
   Wraps the DataFrame with `Dataset.from_pandas()` to create `test_ds`, allowing it to be processed like our other datasets.

3. **Clean up any stray index columns**  
   Removes any auto-generated pandas index columns (names starting with `__`) so that only the original `sentence` and `label` fields remain.

4. **Confirm dataset size**  
   Prints out the total number of examples in the held-out test set (should be 200) to verify that it loaded correctly.


In [5]:
#2. Load the 200-example test set for final evaluation only (later)
test_df = pd.read_excel("test.xlsx")
test_ds = Dataset.from_pandas(test_df)
test_ds = test_ds.remove_columns([c for c in test_ds.column_names if c.startswith("__")])

print("Held-out test set size:", len(test_ds))


Held-out test set size: 200


## Fine-Tuning DistilBERT on Our Toy Sentiment Data

This cell puts everything together—from loading our small Excel-based datasets all the way through to training, saving, and evaluating a DistilBERT sentiment classifier. Below is a step-by-step explanation of what happens and the key parameters used.

---

### 1) Load the Data  
- **`training.xlsx`** (40 examples, 20 positive / 20 negative) is read into `train_df`.  
- **`test.xlsx`** (200 examples, 100 positive / 100 negative) is read into `test_df`.  

### 2) Wrap in Hugging Face Datasets  
- We convert `train_df` and `test_df` into `Dataset` objects so we can apply Hugging Face utilities like `train_test_split`, `.map()`, and `.set_format()`.

### 3) Split for Training/Validation  
- **Train/validation split**: We use an 80/20 split on the 40-example set, yielding **32 training** and **8 validation** examples.  
- A fixed random seed (`seed=42`) ensures that this split is reproducible.

### 4) Tokenisation  
- We load the **DistilBERT tokenizer** (`distilbert-base-uncased`), which uses WordPiece tokenisation with a 30 522-token vocabulary.  
- Sentences are padded or truncated to **64 tokens** to keep each batch small and fast on CPU.  
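
To make "padded or truncated to 64 tokens" concrete, here is a minimal sketch of what happens to a list of token ids. `pad_or_truncate` is a hypothetical helper, not the tokenizer's implementation: the real tokenizer also keeps the closing `[SEP]` token when it truncates, while this sketch simply cuts the tail. The pad id of 0 is assumed to match the `[PAD]` entry of the uncased vocabulary.

```python
def pad_or_truncate(token_ids, max_length=64, pad_id=0):
    """Force a token-id list to exactly max_length and build its attention mask."""
    ids = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(ids)     # padding needed to reach max_length
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

short = pad_or_truncate(list(range(101, 111)))     # 10 made-up token ids
long = pad_or_truncate(list(range(1000, 1100)))    # 100 made-up token ids

print(len(short["input_ids"]), sum(short["attention_mask"]))  # 64 10
print(len(long["input_ids"]), sum(long["attention_mask"]))    # 64 64
```

The attention mask is what lets the model ignore the padding positions, so padding changes the tensor shape but not which tokens are attended to.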

### 5) Formatting for PyTorch  
- The original column `label` is renamed to `labels` (what the Trainer expects).  
- We call `.set_format(type="torch", columns=[…])` so that each batch yields PyTorch tensors for `input_ids`, `attention_mask`, and `labels`.

### 6) Model, Metric and Data Collator  
- **Model**: We load **DistilBertForSequenceClassification** (a 6-layer, 768-hidden-size, 12-head Transformer with ∼66 million parameters) pre-trained on general English.  
- **Classification head (MLP)**:  
  - **pre_classifier**: a single 768-unit hidden layer, followed by ReLU activation and dropout (`seq_classif_dropout`, 0.2 by default).  
  - **classifier**: the output layer mapping those 768 activations down to 2 logits (positive vs. negative).  
  - The head therefore has **one** hidden layer (768 units) plus the final output layer (2 units), not two hidden layers of 768 each.  
- **Metric**: We use the `evaluate` library’s **accuracy** for both validation and test.  
- **Data collator**: `DataCollatorWithPadding` assembles the batches. Because tokenisation already pads every sentence to 64 tokens, it has no further padding to add here.
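
The size of the classification head follows directly from the layer shapes above. The arithmetic below is a sketch under the assumption that both layers are ordinary dense layers with a bias term, which matches the four tensor names (`pre_classifier.weight`, `pre_classifier.bias`, `classifier.weight`, `classifier.bias`) that the model reports as newly initialised when it loads.

```python
hidden_size = 768
num_labels = 2

# Dense layer parameters = inputs * outputs (weights) + outputs (biases)
pre_classifier_params = hidden_size * hidden_size + hidden_size  # 768 -> 768
classifier_params = hidden_size * num_labels + num_labels        # 768 -> 2
head_params = pre_classifier_params + classifier_params

print(pre_classifier_params)  # 590592
print(classifier_params)      # 1538
print(head_params)            # 592130
print(f"{head_params / 66e6:.1%} of a ~66M-parameter backbone")  # 0.9%
```

These roughly 0.6 million parameters are the only ones that start from random values; everything else begins from the pre-trained checkpoint.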

### 7) Training Arguments  
| Argument                         | Value             | Purpose                                   |
|----------------------------------|-------------------|-------------------------------------------|
| `output_dir="distilbert_finetuned"` | —               | Directory to save checkpoints & final model |
| `num_train_epochs=5`             | 5 epochs          | Number of full passes over the 32 examples|
| `per_device_train_batch_size=4`  | 4 examples/batch  | Small batch size to fit CPU memory        |
| `learning_rate=2e-5`             | 2 × 10⁻⁵          | Standard fine-tuning learning rate        |
| `logging_steps=10`               | every 10 steps    | Log training loss/metrics every 10 steps  |
| `save_steps=10`                  | every 10 steps    | Save checkpoint frequently for safety     |
| `no_cuda=True`                   | force CPU         | Ensure no GPU is used (recent Transformers releases prefer `use_cpu=True`) |
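
The values in the table determine how many optimisation steps the run takes and when logging and checkpointing fire. A quick sketch of that arithmetic, assuming a single device and no gradient accumulation (neither is changed in the arguments above):

```python
import math

n_train = 32
batch_size = 4
epochs = 5
logging_steps = 10
save_steps = 10

steps_per_epoch = math.ceil(n_train / batch_size)   # 32 / 4 = 8
total_steps = steps_per_epoch * epochs               # 8 * 5 = 40

# Steps at which a training loss is logged and a checkpoint is written
log_points = list(range(logging_steps, total_steps + 1, logging_steps))
save_points = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps)  # 8 40
print(log_points)                    # [10, 20, 30, 40]
print(save_points)                   # [10, 20, 30, 40]
```

With only 40 steps in total, the run logs four loss values and writes four checkpoints, which is why the training log for this notebook is so short.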

### 8) Trainer Setup  
- We instantiate the Hugging Face **`Trainer`** with all components:  
  - **`model`**: our DistilBERT classifier  
  - **`args`**: the training arguments above  
  - **`train_dataset`**, **`eval_dataset`**: our 32/8 split  
  - **`tokenizer`** & **`data_collator`**: for preparing batches  
  - **`compute_metrics`**: to report accuracy whenever an evaluation is run  

### 9) Training  
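- `trainer.train()` runs 5 epochs over the 32 training examples (8 steps per epoch, 40 steps in total). Because the training arguments set no evaluation strategy, nothing is evaluated during training; the 8-example validation set is scored once afterwards with `trainer.evaluate()`.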

### 10) Saving the Model  
- After training, the fine-tuned model and tokenizer are both saved into the directory **`distilbert_finetuned/`**:  

distilbert_finetuned/
├── config.json
├── model.safetensors ← fine-tuned weights (`pytorch_model.bin` in older Transformers releases)
├── tokenizer_config.json
├── vocab.txt ← tokenizer vocabulary
└── … ← other tokenizer files


### 11) Final Evaluation  
- We print **validation** accuracy (on the 8 examples held out from the 40-example set, which are never trained on) and **held-out test** accuracy (on 200 examples used for neither training nor validation).

---

**Model Specification Recap**  
- **Architecture**:  
  - DistilBERT Base (6 Transformer layers, hidden size 768, 12 self-attention heads per layer)  
  - Classification head: one 768-unit hidden layer (ReLU + dropout) → 2-unit output  
- **Pre-training**: Masked LM + distillation from BERT-Base on Wikipedia + BookCorpus  
- **Total parameters**: ∼67 million (∼66 million in the backbone plus ∼0.6 million in the MLP head)  

This cell encapsulates the full fine-tuning pipeline—from raw Excel data to a saved, ready-to-deploy sentiment classifier (`distilbert_finetuned`).


In [11]:
import pandas as pd
import torch
from datasets import Dataset
from evaluate import load as load_metric
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding
)

# ---- 1) Load Excel files into DataFrames ----
train_df = pd.read_excel("training.xlsx")  # 40 examples
test_df  = pd.read_excel("test.xlsx")      # 200 examples

# ---- 2) Create HF Datasets ----
full_train_ds = Dataset.from_pandas(train_df)
test_ds       = Dataset.from_pandas(test_df)

# ---- 3) Split full_train_ds into train/val ----
split    = full_train_ds.train_test_split(test_size=0.2, seed=42)
train_ds = split["train"]  # 32 examples
val_ds   = split["test"]   #  8 examples

# ---- 4) Tokeniser & tokenisation function ----
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_fn(batch):
    return tokenizer(
        batch["sentence"],
        padding="max_length",
        truncation=True,
        max_length=64
    )

train_ds = train_ds.map(tokenize_fn, batched=True)
val_ds   = val_ds.map(tokenize_fn, batched=True)
test_ds  = test_ds.map(tokenize_fn, batched=True)

# ---- 5) Rename label → labels & set PyTorch format ----
train_ds = train_ds.rename_column("label", "labels")
val_ds   = val_ds.rename_column("label", "labels")
test_ds  = test_ds.rename_column("label", "labels")

train_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
val_ds.set_format(  type="torch", columns=["input_ids", "attention_mask", "labels"])
test_ds.set_format( type="torch", columns=["input_ids", "attention_mask", "labels"])

# ---- 6) Prepare model, metric, data collator ----
model  = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = torch.argmax(torch.tensor(logits), dim=-1)
    return metric.compute(predictions=preds, references=labels)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

# ---- 7) TrainingArguments ----
training_args = TrainingArguments(
    output_dir="distilbert_finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    save_steps=10,
    no_cuda=True
)

# ---- 8) Trainer setup ----
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

# ---- 9) Train ----
trainer.train()

# ---- 10) Save fine-tuned model & tokenizer ----
# This writes model weights, config, and tokenizer files into `distilbert_finetuned/`
trainer.save_model(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)

# ---- 11) Evaluate ----
print("\nValidation set results:")
print(trainer.evaluate(eval_dataset=val_ds))

print("\nHeld-out test set results:")
print(trainer.evaluate(eval_dataset=test_ds))


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Step | Training Loss |
|------|---------------|
| 10   | 0.6832        |
| 20   | 0.6439        |
| 30   | 0.5723        |
| 40   | 0.5525        |



Validation set results:


{'eval_loss': 0.7560411691665649, 'eval_accuracy': 0.375, 'eval_runtime': 0.2213, 'eval_samples_per_second': 36.143, 'eval_steps_per_second': 4.518, 'epoch': 5.0}

Held-out test set results:
{'eval_loss': 0.6634376645088196, 'eval_accuracy': 0.58, 'eval_runtime': 4.7849, 'eval_samples_per_second': 41.798, 'eval_steps_per_second': 5.225, 'epoch': 5.0}
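
To put these two accuracy figures in perspective, the sketch below converts each one back into a count of correct predictions and attaches a rough binomial standard error. `accuracy_summary` is a hypothetical helper, and the standard error formula sqrt(p(1-p)/n) is a simple approximation that is crude for a sample as small as 8.

```python
import math

def accuracy_summary(accuracy, n_examples):
    """Return (number correct, approximate binomial standard error)."""
    n_correct = round(accuracy * n_examples)
    std_err = math.sqrt(accuracy * (1 - accuracy) / n_examples)
    return n_correct, std_err

# Accuracy values taken from the evaluation output above.
results = [("validation", 0.375, 8), ("held-out test", 0.58, 200)]
for name, acc, n in results:
    n_correct, std_err = accuracy_summary(acc, n)
    print(f"{name}: {n_correct}/{n} correct, standard error ~ {std_err:.3f}")
# validation: 3/8 correct, standard error ~ 0.171
# held-out test: 116/200 correct, standard error ~ 0.035
```

With only 8 validation examples, each prediction moves the accuracy by 12.5 percentage points, so the 0.375 figure says little on its own; the 200-example test set gives a steadier estimate, and 0.58 is only modestly above the 0.50 expected from guessing on a balanced set.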
