In [1]:
!pip install transformers datasets peft evaluate torch

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


## 🔄 Workflow Overview: Data Preparation & Utilities

This section of the script handles **reproducibility**, **device setup**, **data preprocessing**, and defines helper functions essential for fine-tuning the model.

---

### 1. 🧪 Environment Setup

```python
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

- Sets a **random seed** for reproducibility.
- Detects **GPU** if available; otherwise defaults to CPU.
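
The snippet above seeds PyTorch and NumPy only. A minimal sketch of a broader helper follows, assuming you also want Python's built-in `random` module and all CUDA devices seeded. `seed_everything` is a hypothetical name, not part of the notebook, and NumPy and PyTorch are treated as optional so the sketch runs without them.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every random number generator that happens to be installed."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects interpreters started afterwards
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed: nothing to seed


# Re-seeding with the same value replays the same sequence of draws.
seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```

Seeding makes the random draws repeatable; some GPU kernels can still be non-deterministic, so results may differ slightly between runs.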

---

### 2. 🧹 Text Cleaning

```python
def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```

- Removes **URLs** and **extra whitespaces** from the input text.
- Ensures a clean input for tokenization.
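
As a quick illustration, the same two substitutions can be run on a made-up headline. The sample string is an assumption chosen for the example. Only tokens that start with `http` are removed, so a bare `www.` address is left in place.

```python
import re


def clean_text(text: str) -> str:
    text = re.sub(r"http\S+", "", text)       # drop "http..." up to the next whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace runs, trim both ends


sample = "Stocks rally:  see http://example.com/story?id=1 for details\n\n"
cleaned = clean_text(sample)
print(cleaned)

# A URL without the "http" prefix is not matched by the pattern.
untouched = clean_text("visit www.example.com today")
print(untouched)
```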

---

### 3. 🔁 Data Augmentation (Random Deletion)

```python
def random_deletion(text, p=0.1):
    ...
```

- Implements a **simple augmentation strategy**.
- Randomly deletes each word in the text with probability `p = 0.1`.
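- Never returns an empty string: if every word is dropped, the original words are returned unchanged.

Intended **only** for the training set, to introduce noise and improve generalization; validation and test data are tokenized without augmentation.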
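
The elided body can be sketched as follows. This version mirrors the notebook's logic but swaps `np.random` for a `random.Random` instance passed in as `rng`. That parameter is an assumption added here so the example is reproducible without NumPy.

```python
import random
from typing import Optional


def random_deletion(text: str, p: float = 0.1, rng: Optional[random.Random] = None) -> str:
    """Drop each whitespace-separated word independently with probability p."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() > p]
    # If every word was dropped, fall back to the full word list so the result is never empty.
    return " ".join(kept if kept else words)


original = "Wall Street stocks closed higher on Friday after strong earnings"
augmented = random_deletion(original, p=0.3, rng=random.Random(42))
print(augmented)
```

Because deletion is per word, the expected fraction removed is `p`, but any single sentence may lose more or fewer words.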

---

### 4. ✂️ Tokenization Function

```python
def tokenize_and_augment_function(examples, augment=False):
    ...
```

- Accepts a batch of text samples (`examples["text"]`).
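- If `augment=True`, applies `random_deletion` to the raw text, then replaces about 10% of the non-special, non-padding token IDs with `tokenizer.mask_token_id`.
- Uses a **HuggingFace tokenizer** (`tokenizer`) with:
  - `truncation=True`: Trims sequences longer than `max_length`
  - `padding="max_length"`: Pads shorter sequences to exactly `max_length`
  - `max_length`: A module-level value computed in the data-preparation step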
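
To make the truncation and padding options concrete, here is a toy stand-in that works on a plain list of token IDs. It is not the HuggingFace implementation: the real tokenizer also keeps the closing special token when it truncates. `pad_or_truncate` is a hypothetical helper, the IDs `713` and `16` are made up, and `pad_id=1` assumes RoBERTa's padding ID.

```python
from typing import Dict, List


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 1) -> Dict[str, List[int]]:
    """Mimic truncation=True plus padding="max_length" for one sequence of token IDs."""
    ids = ids[:max_length]                                # truncation: cut anything past max_length
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,              # padding: fill up to max_length
        "attention_mask": [1] * len(ids) + [0] * n_pad,   # 0 marks padded positions
    }


padded = pad_or_truncate([0, 713, 16, 2], max_length=6)
clipped = pad_or_truncate(list(range(10, 20)), max_length=6)
print(padded)
print(clipped)
```

Every output has exactly `max_length` entries, which is what lets the dataset be stacked into fixed-shape tensors.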

---

### 5. 📏 Evaluation Metrics

```python
def compute_metrics(eval_pred):
    ...
```

Computes the following evaluation metrics from model predictions:

| Metric      | Description                              |
|-------------|------------------------------------------|
| Accuracy    | Proportion of correct predictions        |
| Precision   | Weighted average of class-wise precision |
| Recall      | Weighted average of class-wise recall    |
| F1 Score    | Weighted F1 score (balance of P & R)     |

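- Uses `accuracy_score` and `precision_recall_fscore_support` from `sklearn.metrics`.
- `average="weighted"` weights each class's score by its number of true examples, so weighted recall always equals accuracy (the Accuracy and Recall columns in the training logs are identical for this reason).
- `zero_division=0` scores an undefined precision or recall (0/0) as 0 and suppresses the warning.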
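
The weighting can be worked through by hand. The sketch below is a pure-Python re-derivation of support-weighted precision, recall and F1 on six made-up labels, written for illustration rather than as a replacement for scikit-learn. `weighted_prf` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List


def weighted_prf(labels: List[int], preds: List[int]) -> Dict[str, float]:
    """Support-weighted precision, recall and F1, treating 0/0 as 0."""
    support = Counter(labels)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls, n_true in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        n_pred = sum(1 for p in preds if p == cls)
        precision = tp / n_pred if n_pred else 0.0  # same effect as zero_division=0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_true / len(labels)               # class share of the true labels
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return totals


labels = [0, 0, 1, 1, 2, 3]
preds = [0, 1, 1, 1, 2, 2]   # class 3 is never predicted
scores = weighted_prf(labels, preds)
accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
print(scores, accuracy)
```

Class 3 contributes a precision of 0 instead of a division error, and the weighted recall matches the accuracy exactly.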

---

These utilities form the **foundation** for data loading, augmentation, and evaluation in the full training pipeline that follows.


## 🧠 Model Training Pipeline: RoBERTa with LoRA + Distillation

This section outlines the full training and evaluation workflow using the AG News dataset. It includes data preparation, model configuration, distillation setup, training, fine-tuning, and test-time prediction generation.

---

### 📦 1. Data Preparation

- Loads AG News dataset using HuggingFace's `datasets` library.
- Applies text cleaning (`clean_text`) to remove noise.
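- Computes `max_length` from the first 1,000 training texts: the 95th-percentile token count, rounded to the nearest multiple of 8 and capped at 128 (96 in this run).
- Splits the training set 90/10 into **train** and **validation** sets, seeded with `SEED`.
- Applies **random deletion** and random token masking to the training split only.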
- Tokenizes all datasets using `AutoTokenizer` from `roberta-base`.
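
The `max_length` rule can be sketched without NumPy. The `percentile` function below re-implements linear interpolation, the default in `numpy.percentile`. `choose_max_length` is a hypothetical helper name and both lists of token counts are made up.

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """Linear-interpolation percentile, matching numpy.percentile's default rule."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def choose_max_length(token_counts: List[int], q: float = 95,
                      multiple: int = 8, cap: int = 128) -> int:
    """Round the q-th percentile to the nearest multiple, then apply the cap."""
    p = percentile(token_counts, q)
    return min(cap, multiple * round(p / multiple))


long_counts = list(range(40, 140))                       # 40, 41, ..., 139
short_counts = [30, 42, 51, 55, 60, 61, 64, 70, 75, 93]
capped = choose_max_length(long_counts)
uncapped = choose_max_length(short_counts)
print(capped, uncapped)
```

Rounding to the nearest multiple can land slightly below the percentile, so a little more than 5% of the sequences may end up truncated.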

---

### 🧪 2. Test Dataset Preparation

- Loads a pickled unlabeled test dataset (`test_unlabelled.pkl`).
- Applies the same cleaning and tokenization pipeline (without augmentation).
- Sets the dataset format to PyTorch tensors (`input_ids`, `attention_mask`).

---

### 🔧 3. LoRA Configuration for Student Model

```python
LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query", "key", "value"]
)
```

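- Keeps the student under 1M trainable parameters (814,852 in this run) by injecting rank-4 matrices into the attention projections (`query`, `key`, `value`); the classification head also stays trainable.
- The pretrained RoBERTa weights stay frozen, so only the LoRA matrices and the head receive gradient updates.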
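
The parameter budget can be checked with plain arithmetic. The sketch assumes `roberta-base` dimensions (hidden size 768, 12 layers) and that PEFT leaves the sequence-classification head trainable. Under those assumptions the sum reproduces the count the notebook prints.

```python
HIDDEN = 768      # roberta-base hidden size (assumed)
LAYERS = 12       # roberta-base encoder layers (assumed)
NUM_LABELS = 4    # AG News classes
R = 4             # LoRA rank from the config above
TARGETS = 3       # query, key, value

# Each adapted projection gains two matrices: A (r x hidden) and B (hidden x r).
lora_per_module = R * HIDDEN + HIDDEN * R
lora_total = lora_per_module * TARGETS * LAYERS

# Classification head: dense (hidden x hidden + bias) and out_proj (hidden x labels + bias).
head_total = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_total + head_total
print(lora_total, head_total, trainable)
```

Most of the budget goes to the classification head rather than to the adapters, so raising `r` moderately would still fit under 1M.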

---

### 🔁 4. Knowledge Distillation Trainer

- Custom `DistillationTrainer` defined using HuggingFace's `Trainer`.
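- Final loss:
  \[
  \text{Loss} = \alpha \cdot \tau^{2} \cdot \text{KL}(T_\tau \,\|\, S_\tau) + (1 - \alpha) \cdot \text{CE}
  \]
  where \(T_\tau\) and \(S_\tau\) are the teacher and student softmax distributions at temperature \(\tau\).
- Softens both sets of logits with **temperature scaling** (defaults: `temp=4.0`, `alpha=0.7`).
- Applies **label smoothing** (0.1) to the cross-entropy term to improve generalization.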
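
For a single example, the combined loss can be sketched in pure Python. This follows the `F.kl_div` call in the notebook's `DistillationTrainer`, where the teacher distribution is the target and the result is scaled by the squared temperature. It is an illustrative re-derivation with made-up logits, not the PyTorch code itself.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student: List[float], teacher: List[float], label: int,
                      temperature: float = 4.0, alpha: float = 0.7,
                      smoothing: float = 0.1) -> float:
    # Soft-target term: KL(teacher || student) at the given temperature, scaled by T^2.
    p_t = softmax(teacher, temperature)
    p_s = softmax(student, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s))
    kl *= temperature ** 2

    # Hard-label term: cross-entropy against a label-smoothed one-hot target.
    n = len(student)
    target = [smoothing / n + (1 - smoothing) * (i == label) for i in range(n)]
    ce = -sum(q * math.log(p) for q, p in zip(target, softmax(student)))

    return alpha * kl + (1 - alpha) * ce


teacher_logits = [4.0, 1.0, 0.5, 0.2]
matched = distillation_loss(teacher_logits, teacher_logits, label=0)
mismatched = distillation_loss([0.2, 0.5, 1.0, 4.0], teacher_logits, label=0)
print(matched, mismatched)
```

When the student reproduces the teacher's logits, the KL term vanishes and only the weighted cross-entropy remains.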

---

### 🧑‍🏫 5. Train the Teacher Model

- Standard fine-tuning of `roberta-base` on full training data.
- Evaluated on the validation split.
- Stops early after 3 consecutive evaluations without an accuracy gain and reloads the best checkpoint via `load_best_model_at_end`.
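
The patience rule can be traced on the teacher's own validation accuracies, copied from the training log later in this notebook. The function is a simplified sketch of patience-based stopping for a higher-is-better metric, not the `EarlyStoppingCallback` source. `early_stop` is a hypothetical name.

```python
from typing import List, Tuple


def early_stop(history: List[Tuple[int, float]], patience: int = 3) -> Tuple[int, int]:
    """Return (best_step, stop_step) for a list of (step, metric) evaluations."""
    best_step, best_metric = history[0]
    bad_evals = 0
    for step, metric in history[1:]:
        if metric > best_metric:
            best_step, best_metric, bad_evals = step, metric, 0
        else:
            bad_evals += 1                 # one more evaluation without improvement
            if bad_evals >= patience:
                return best_step, step     # patience exhausted: stop here
    return best_step, history[-1][0]       # ran to the end without triggering


teacher_log = [(500, 0.921417), (1000, 0.9255), (1500, 0.705417),
               (2000, 0.540667), (2500, 0.64825)]
best_step, stop_step = early_stop(teacher_log, patience=3)
print(best_step, stop_step)
```

With a patience of 3, the three non-improving evaluations after step 1000 end training at step 2500, and the step-1000 checkpoint is the one restored.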

---

### 👩‍🎓 6. Train the Student Model (with Distillation)

- Loads a new RoBERTa model and applies LoRA.
- Student model is trained using:
  - **Teacher soft logits** (KL loss)
  - **True labels** (Cross-Entropy)
- Validation monitored every 500 steps.

---

### 🎯 7. Final Fine-Tuning of the Student

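- Continues training the distilled student with a plain `Trainer` (cross-entropy only, no teacher) on `full_train`, the **entire** unaugmented training set.
- This raises validation accuracy from 93.0% to 94.7% in the run below, but `full_train` contains the validation examples, so that figure is optimistic rather than a held-out estimate.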

---

### 🧾 8. Generate Predictions on Test Data

- Sets model to eval mode and runs inference on the tokenized test set.
- Collects predictions using `argmax` over model logits.
- Saves outputs in `submission.csv`:

```csv
ID,label
0,2
1,0
...
```
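
The argmax-and-write step can be sketched with the standard library alone. The logits below are made up, and the output goes to an in-memory buffer rather than to `submission.csv` so the example has no side effects. `write_submission` is a hypothetical helper.

```python
import csv
import io
from typing import List


def argmax(row: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def write_submission(logits: List[List[float]], handle) -> None:
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "label"])
    for idx, row in enumerate(logits):
        writer.writerow([idx, argmax(row)])   # ID is the row position in the test set


fake_logits = [[0.1, -1.2, 3.4, 0.0], [2.2, 0.3, -0.5, 1.9]]  # made-up values
buffer = io.StringIO()
write_submission(fake_logits, buffer)
print(buffer.getvalue())
```

The IDs are positional, so the test set must not be shuffled between tokenization and inference.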

---

This modular pipeline pairs a fully fine-tuned teacher for **model quality** with a LoRA student for **efficiency**: the student trains fewer than 1M parameters on top of a frozen `roberta-base`.

In [2]:
import numpy as np
import pandas as pd
import torch
import os
import pickle
import re
import gc
import matplotlib.pyplot as plt

from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from peft import get_peft_model, LoraConfig, TaskType
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from torch.utils.data import DataLoader
import torch.nn.functional as F

# ---------------------------
# Reproducibility & Device
# ---------------------------
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# ---------------------------
# Helpers: Cleaning & Augmentation
# ---------------------------
def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def random_deletion(text, p=0.1):
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if np.random.rand() > p]
    return " ".join(kept if kept else words)

def tokenize_and_augment_function(examples, augment=False):
    texts = examples["text"]
    if augment:
        texts = [random_deletion(t, p=0.1) for t in texts]
    enc = tokenizer(texts, truncation=True, padding="max_length", max_length=max_length)
    if augment:
        # random mask ~10% of tokens (excluding special/pad)
        for seq in enc["input_ids"]:
            for i, tok in enumerate(seq):
                if tok not in {
                    tokenizer.pad_token_id,
                    tokenizer.cls_token_id,
                    tokenizer.sep_token_id
                } and np.random.rand() < 0.1:
                    seq[i] = tokenizer.mask_token_id
    return enc

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    pr, rc, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {"accuracy": acc, "precision": pr, "recall": rc, "f1": f1}

# ---------------------------
# Load & Prepare Data
# ---------------------------
dataset = load_dataset("ag_news")
dataset = dataset.map(lambda x: {"text": clean_text(x["text"])})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# decide max_length
lens = [len(tokenizer.encode(t)) for t in dataset["train"]["text"][:1000]]
max_length = min(128, 8 * round(np.percentile(lens, 95) / 8))
print(f"Using max_length: {max_length}")

# train/val split
train_val = dataset["train"].train_test_split(test_size=0.1, seed=SEED)

# tokenized versions
tokenized_train = train_val["train"].map(
    lambda x: tokenize_and_augment_function(x, augment=True), batched=True
)
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

tokenized_val = train_val["test"].map(
    lambda x: tokenize_and_augment_function(x, augment=False), batched=True
)
tokenized_val = tokenized_val.rename_column("label", "labels")
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# full train (no aug) for teacher & final student
full_train = dataset["train"].map(
    lambda x: tokenize_and_augment_function(x, augment=False), batched=True
)
full_train = full_train.rename_column("label", "labels")
full_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# unlabeled test
test_path = "/kaggle/input/dlp2-2025/test_unlabelled.pkl"
if not os.path.exists(test_path):
    for d, _, fs in os.walk("/kaggle/input"):
        for f in fs:
            if f == "test_unlabelled.pkl":
                test_path = os.path.join(d, f)
with open(test_path, "rb") as f:
    raw_test = pickle.load(f)

test_ds = Dataset.from_dict({"text": [clean_text(t) for t in raw_test["text"]]})
tokenized_test = test_ds.map(
    lambda x: tokenize_and_augment_function(x, augment=False), batched=True
)
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask"])

# ---------------------------
# LoRA Config for Student (<1M params)
# ---------------------------
def get_lora_config(r=4, alpha=16, dropout=0.1):
    return LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=r,
        lora_alpha=alpha,
        lora_dropout=dropout,
        bias="none",
        target_modules=["query", "key", "value"]
    )

student_config = get_lora_config()

# ---------------------------
# Distillation Trainer
# ---------------------------
class DistillationTrainer(Trainer):
    def __init__(self, teacher, temp=4.0, alpha=0.7, **kwargs):
        super().__init__(**kwargs)
        self.teacher = teacher.eval()
        self.temp = temp
        self.alpha = alpha

    def compute_loss(self, model, inputs, return_outputs=False, **kw):
        labels = inputs["labels"]
        outputs = model(**inputs)
        s_logits = outputs.logits

        # CE w/ label smoothing
        loss_ce = F.cross_entropy(s_logits, labels, label_smoothing=0.1)

        # teacher soft
        with torch.no_grad():
            t_logits = self.teacher(**inputs).logits

        T = self.temp
        loss_kl = F.kl_div(
            F.log_softmax(s_logits / T, dim=-1),
            F.softmax(t_logits / T, dim=-1),
            reduction="batchmean"
        ) * (T * T)

        loss = self.alpha * loss_kl + (1 - self.alpha) * loss_ce
        return (loss, outputs) if return_outputs else loss

# ---------------------------
# 1) Train Teacher Model
# ---------------------------
teacher = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4
).to(device)

ta = TrainingArguments(
    output_dir="./teacher",
    learning_rate=2e-4,  # high for full fine-tuning (2e-5 to 5e-5 is typical); the teacher log degrades after step 1000
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine_with_restarts",
    gradient_accumulation_steps=2,
)

teacher_trainer = Trainer(
    model=teacher,
    args=ta,
    train_dataset=full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(3)]
)

print("\n>> Training teacher")
teacher_trainer.train()
print("Teacher val acc:", teacher_trainer.evaluate()["eval_accuracy"])

# ---------------------------
# 2) Train Student w/ Distillation
# ---------------------------
student = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4
)
student = get_peft_model(student, student_config).to(device)

n_params = sum(p.numel() for p in student.parameters() if p.requires_grad)
print("Student trainable params:", n_params)

sa = TrainingArguments(
    output_dir="./student",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=6,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine_with_restarts",
    gradient_accumulation_steps=2,
)

distiller = DistillationTrainer(
    teacher=teacher,
    model=student,
    args=sa,
    train_dataset=full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(3)]
)

print("\n>> Training student")
distiller.train()
print("Student val acc:", distiller.evaluate()["eval_accuracy"])

# ---------------------------
# 3) Final Fine-tune Student
# ---------------------------
fa = TrainingArguments(
    output_dir="./final_student",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine_with_restarts",
    gradient_accumulation_steps=2,
)

final_trainer = Trainer(
    model=student,
    args=fa,
    train_dataset=full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(3)]
)

print("\n>> Fine-tuning student")
final_trainer.train()
print("Final student val acc:", final_trainer.evaluate()["eval_accuracy"])

# ---------------------------
# Plot Loss Curves (save instead of display)
# ---------------------------
def plot_trainer_loss(trainer, title="Loss Curve"):
    """
    Plots training and evaluation loss curves from a HuggingFace Trainer,
    then saves them to disk instead of displaying.

    Args:
        trainer: An instance of transformers.Trainer after .train() has been called.
        title:   Title for the plot (used as filename).
    """
    history = trainer.state.log_history
    if not history:
        raise ValueError("No logs found in trainer.state.log_history! Make sure you've called trainer.train().")

    df = pd.DataFrame(history)
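    # Training rows carry "loss"; evaluation rows carry "eval_loss". Either column may be absent.
    train_df = df[df["loss"].notna()] if "loss" in df.columns else df.iloc[0:0]
    eval_df  = df[df["eval_loss"].notna()] if "eval_loss" in df.columns else df.iloc[0:0]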

    # Loss vs. training step
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(train_df["step"], train_df["loss"], marker=".", linestyle="-", label="Train Loss")
    if not eval_df.empty:
        ax.plot(eval_df["step"], eval_df["eval_loss"], marker=".", linestyle="--", label="Eval Loss")
    ax.set_xlabel("Training Step")
    ax.set_ylabel("Loss")
    ax.set_title(title)
    ax.legend()
    ax.grid(True)
    plt.tight_layout()
    filename = f"{title.replace(' ', '_')}.png"
    fig.savefig(filename)
    plt.close(fig)

    # Per‑epoch average loss
    if "epoch" in train_df.columns:
        # Logged epochs are fractional, so bucket each row into the epoch it falls in.
        epoch_means = train_df.groupby(np.ceil(train_df["epoch"]))["loss"].mean()
        fig2, ax2 = plt.subplots(figsize=(8, 4))
        ax2.plot(epoch_means.index, epoch_means.values, marker="o", linestyle="-")
        ax2.set_xlabel("Epoch")
        ax2.set_ylabel("Average Train Loss")
        ax2.set_title(f"{title} (Per-Epoch Average)")
        ax2.grid(True)
        plt.tight_layout()
        filename2 = f"{title.replace(' ', '_')}_epoch.png"
        fig2.savefig(filename2)
        plt.close(fig2)

# Example: save plots to current working directory on Kaggle
plot_trainer_loss(teacher_trainer, title="Teacher_Loss_vs_Step")
plot_trainer_loss(distiller,       title="Student_Distill_Loss_vs_Step")
plot_trainer_loss(final_trainer,   title="Student_Finetune_Loss_vs_Step")

# ---------------------------
# 4) Inference & Submission
# ---------------------------
print("\n>> Generating submission")
student.eval()
all_preds = []
dl = DataLoader(tokenized_test, batch_size=128)
with torch.no_grad():
    for batch in dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        logits = student(**batch).logits
        preds = logits.argmax(dim=-1).cpu().numpy()
        all_preds.extend(preds)

pd.DataFrame({"ID": range(len(all_preds)), "label": all_preds}) \
  .to_csv("submission.csv", index=False)
print("Done – submission.csv written.")


Using device: cuda


Using max_length: 96



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



>> Training teacher


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.3702,0.241607,0.921417,0.923851,0.921417,0.920812
1000,0.3052,0.259011,0.9255,0.925364,0.9255,0.925335
1500,0.6299,0.585973,0.705417,0.594021,0.705417,0.629118
2000,0.9657,0.957589,0.540667,0.533548,0.540667,0.452681
2500,0.7624,0.766627,0.64825,0.560101,0.64825,0.575172


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Teacher val acc: 0.9255
Student trainable params: 814852

>> Training student




Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.9126,0.324126,0.914917,0.915576,0.914917,0.914858
1000,0.31,0.258791,0.923667,0.923538,0.923667,0.923429
1500,0.2703,0.250017,0.925,0.924699,0.925,0.92481
2000,0.2609,0.240335,0.92725,0.926869,0.92725,0.926961
2500,0.2489,0.233429,0.928083,0.927803,0.928083,0.927903
3000,0.2437,0.230863,0.925917,0.925632,0.925917,0.925655
3500,0.2387,0.226037,0.929,0.928819,0.929,0.928797
4000,0.2389,0.223399,0.929417,0.929108,0.929417,0.929207
4500,0.2319,0.22188,0.929583,0.929332,0.929583,0.929434
5000,0.2285,0.220656,0.93025,0.930003,0.93025,0.930106


Student val acc: 0.9303333333333333

>> Fine-tuning student




Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.2059,0.183089,0.938833,0.939744,0.938833,0.938969
1000,0.1863,0.169473,0.942417,0.942488,0.942417,0.942418
1500,0.1745,0.162543,0.944667,0.944732,0.944667,0.944679
2000,0.1709,0.155072,0.947167,0.947341,0.947167,0.947167
2500,0.1596,0.153103,0.947083,0.947391,0.947083,0.947128


Final student val acc: 0.9471666666666667

>> Generating submission
Done – submission.csv written.


In [3]:
trainable_params = sum(p.numel() for p in student.parameters() if p.requires_grad)
print(f"Student trainable parameters: {trainable_params}")
if trainable_params > 1_000_000:
    print("Warning: Student model exceeds 1 million trainable parameters!")

Student trainable parameters: 814852


## 📊 Results

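### 🧑‍🏫 Teacher Model
- **Validation Loss:** 0.2590  
- **Accuracy:** 92.55%  
- **F1 Score:** 92.53%

---

### 👩‍🎓 Student Model (LoRA + Distillation)

- **Validation Loss:** 0.2205  
- **Accuracy:** 93.03%  
- **F1 Score:** 93.02%

---

### 🎯 Student Model (after Final Fine-Tuning)

- **Validation Loss:** 0.1530  
- **Accuracy:** 94.71%  
- **F1 Score:** 94.71%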


