## 🔄 Workflow Overview: Data Preparation & Utilities

This section of the script handles **reproducibility**, **device setup**, **data preprocessing**, and defines helper functions essential for fine-tuning the model.

---

### 1. 🧪 Environment Setup

```python
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

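- Seeds PyTorch and NumPy with a fixed **random seed** (`SEED = 42`) for reproducibility. Python's built-in `random` module is not seeded here.
- Selects the **GPU** when CUDA is available; otherwise falls back to the CPU.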

---

### 2. 🧹 Text Cleaning

```python
def clean_text(text):
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```

- Removes **URLs** that start with `http` (including `https`) and collapses runs of **whitespace** into a single space.
- Gives the tokenizer clean input.
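
As a quick check of the behavior described above, the sketch below repeats `clean_text` and runs it on two made-up strings (the sample text and the `example.com` URLs are illustrative, not taken from the dataset). It also shows a limitation: links without an `http` prefix pass through unchanged.

```python
import re

def clean_text(text):
    # Drop anything that starts with "http" up to the next whitespace
    text = re.sub(r'http\S+', '', text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "Stocks rally   today\n see https://example.com/story for details"
print(clean_text(sample))                       # Stocks rally today see for details
print(clean_text("visit www.example.com now"))  # unchanged: no "http" prefix
```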

---

### 3. 🔁 Data Augmentation (Random Deletion)

```python
def random_deletion(text, p=0.1):
    ...
```

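- Implements a **simple augmentation strategy**.
- Deletes each word independently with probability `p` (default `0.1`).
- If every word would be deleted, returns the original words unchanged, so the output is never empty for non-empty input.

Intended **only** for the training set, to introduce noise and improve generalization. Note that the trainers later in this notebook all use the unaugmented `tokenized_full_train`, so the augmented split (`tokenized_train`) is built but never passed to a trainer.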
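
The fallback behavior is easiest to see by running the function. The sketch below is a stand-in that swaps NumPy's global generator for Python's `random` module so the demo is self-contained and repeatable; the `rng` parameter and the sample headline are assumptions made for illustration, not part of the notebook's function.

```python
import random

def random_deletion(text, p=0.1, rng=random):
    words = text.split()
    if not words:
        return text
    # Keep each word independently with probability 1 - p
    kept = [w for w in words if rng.random() > p]
    # If every word was dropped, fall back to the full original word list
    if not kept:
        kept = words
    return " ".join(kept)

rng = random.Random(0)  # seeded only to make the demo repeatable
headline = "Oil prices climb as supply concerns grow"
print(random_deletion(headline, p=0.3, rng=rng))  # some words may be missing
print(random_deletion(headline, p=1.0, rng=rng))  # fallback: original text
```

With `p=1.0` every word is dropped, so the fallback returns the whole headline rather than a single surviving word.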

---

### 4. ✂️ Tokenization Function

```python
def tokenize_and_augment_function(examples, augment=False):
    ...
```

- Accepts a batch of text samples (`examples["text"]`).
- If `augment=True`, applies `random_deletion` to each text before tokenizing.
- Uses a **HuggingFace tokenizer** (the global `tokenizer`) with:
  - `truncation=True`: Trims sequences longer than `max_length`
  - `padding="max_length"`: Pads shorter sequences up to `max_length`
  - `max_length`: A global set later, in the data preparation step, from the token lengths of a training sample
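
To make the effect of `truncation=True` and `padding="max_length"` concrete, here is a simplified pure-Python sketch of what happens to one sequence of token ids. The helper `pad_or_truncate` is hypothetical, the token ids are made up, and the pad id of `1` is assumed to match RoBERTa's `<pad>` token. The real tokenizer also keeps the closing special token when it truncates, which this sketch does not attempt.

```python
PAD_ID = 1  # assumed to match RoBERTa's <pad> token id

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    # Truncation: keep at most max_length ids
    ids = token_ids[:max_length]
    # Attention mask: 1 for real tokens, 0 for padding
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,
    }

short = pad_or_truncate([0, 713, 16, 2], max_length=6)       # padded with two pad ids
long_ = pad_or_truncate(list(range(10, 20)), max_length=6)   # cut to six ids
print(short)
print(long_)
```

Every example ends up with exactly `max_length` ids, which is what lets the batches be stacked into fixed-size tensors.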

---

### 5. 📏 Evaluation Metrics

```python
def compute_metrics(eval_pred):
    ...
```

Computes the following evaluation metrics from model predictions:

| Metric      | Description                              |
|-------------|------------------------------------------|
| Accuracy    | Proportion of correct predictions        |
| Precision   | Weighted average of class-wise precision |
| Recall      | Weighted average of class-wise recall    |
| F1 Score    | Weighted F1 score (balance of P & R)     |

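- Uses `sklearn.metrics` under the hood.
- Accounts for **class imbalance** via `average="weighted"`, which weights each class's score by its number of true examples.
- Sets `zero_division=0` so a class with no predictions scores 0 without raising a warning.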
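
The weighted averages can be reproduced by hand. The sketch below is a pure-Python approximation of what `average="weighted"` with `zero_division=0` computes; the helper `weighted_scores` and the toy labels are made up for illustration. One consequence worth noticing is that support-weighted recall always equals accuracy, which is why the Recall and Accuracy columns coincide in the training logs.

```python
from collections import Counter

def weighted_scores(labels, preds):
    n = len(labels)
    support = Counter(labels)    # true examples per class
    predicted = Counter(preds)   # predictions per class
    correct = Counter(l for l, p in zip(labels, preds) if l == p)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        # A class that is never predicted gets precision 0 (zero_division=0)
        p_c = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r_c = correct[cls] / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n       # class weight = share of true examples
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

labels = [0, 0, 1, 1, 2, 2, 2, 3]
preds = [0, 1, 1, 1, 2, 2, 0, 3]
scores = weighted_scores(labels, preds)
print(scores)
```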

---

These utilities form the **foundation** for data loading, augmentation, and evaluation in the full training pipeline that follows.


In [None]:
import numpy as np
import pandas as pd
import torch
import os
import pickle
import re
from datasets import load_dataset, Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    EarlyStoppingCallback
)
from peft import get_peft_model, LoraConfig, TaskType
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from torch.utils.data import DataLoader
import gc
import torch.nn.functional as F

# ---------------------------
# Set up reproducibility and device
# ---------------------------
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# ---------------------------
# Helper Functions
# ---------------------------
def clean_text(text):
    # Remove URLs and extra whitespace
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Simple data augmentation: random deletion (meant for training data only)
def random_deletion(text, p=0.1):
    words = text.split()
    if len(words) == 0:
        return text
    # Keep each word independently with probability 1 - p
    new_words = [word for word in words if np.random.rand() > p]
    if len(new_words) == 0:
        new_words = words  # Everything was deleted: fall back to the original words
    return " ".join(new_words)

def tokenize_and_augment_function(examples, augment=False):
    texts = examples["text"]
    if augment:
        texts = [random_deletion(text, p=0.1) for text in texts]
    return tokenizer(texts, truncation=True, padding="max_length", max_length=max_length)

# Compute evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="weighted", zero_division=0)
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

## 🧠 Model Training Pipeline: RoBERTa with LoRA + Distillation

This section outlines the full training and evaluation workflow using the AG News dataset. It includes data preparation, model configuration, distillation setup, training, fine-tuning, and test-time prediction generation.

---

### 📦 1. Data Preparation

- Loads the AG News dataset using HuggingFace's `datasets` library.
- Applies text cleaning (`clean_text`) to strip URLs and extra whitespace.
- Sets `max_length` to the 95th percentile of token lengths over the first 1,000 training texts, rounded to a multiple of 8 and capped at 128 (96 in the logged run).
- Splits the training set 90/10 into **train** and **validation** sets.
- Applies **random deletion** augmentation once, at tokenization time, to the 90% train split.
- Tokenizes all datasets with the `roberta-base` tokenizer loaded through `AutoTokenizer`.
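
The `max_length` rule is simple arithmetic on the 95th-percentile token count. The sketch below isolates that step in a hypothetical helper, `choose_max_length`, and feeds it illustrative percentile values (they are not measurements from the dataset).

```python
def choose_max_length(p95_tokens, cap=128, multiple=8):
    # Round the 95th-percentile token count to the nearest multiple of 8,
    # then cap it so padded sequences never exceed `cap` tokens.
    return min(cap, multiple * round(p95_tokens / multiple))

for p95 in (91.0, 93.0, 300.0):  # illustrative percentile values
    print(p95, "->", choose_max_length(p95))
```

Because the rule rounds to the nearest multiple rather than up, the result can land slightly below the percentile (91 becomes 88 here), in which case a little more than 5% of the sampled texts would be truncated.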

---

### 🧪 2. Test Dataset Preparation

- Loads a pickled unlabeled test dataset (`test_unlabelled.pkl`).
- Applies the same cleaning and tokenization pipeline (without augmentation).
- Sets the output format to PyTorch tensors (`input_ids` and `attention_mask`).

---

### 🔧 3. LoRA Configuration for Student Model

```python
LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["query", "key", "value"]
)
```

- Keeps the student under 1M trainable parameters (814,852 in the logged run) by injecting rank-4 matrices into the attention projections (`query`, `key`, `value`) while the pretrained RoBERTa weights stay frozen.
- LoRA improves parameter efficiency and allows fast adaptation.
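
The logged count of 814,852 trainable parameters can be reconstructed with a short calculation. The breakdown below assumes the standard `roberta-base` dimensions (hidden size 768, 12 encoder layers) and assumes that the classification head stays trainable alongside the LoRA matrices; both assumptions are consistent with the logged number but are not stated in the notebook itself.

```python
hidden = 768      # roberta-base hidden size
layers = 12       # roberta-base encoder layers
num_labels = 4    # AG News classes
r = 4             # LoRA rank from the config above
targets = 3       # query, key, value

# Each adapted linear layer gains A (r x hidden) and B (hidden x r)
lora_params = layers * targets * (r * hidden + hidden * r)

# Classification head: dense (hidden x hidden + bias)
# plus out_proj (hidden x num_labels + bias)
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

total = lora_params + head_params
print(lora_params, head_params, total)  # 221184 593668 814852
```

Under these assumptions the head accounts for most of the budget, so raising the rank `r` has a comparatively small effect on the total.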

---

### 🔁 4. Knowledge Distillation Trainer

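- Custom `DistillationTrainer` subclasses HuggingFace's `Trainer` and overrides `compute_loss`.
- Final loss:
  \[
  \text{Loss} = \alpha \cdot T^2 \cdot \text{KL}\left(p^{\text{teacher}} \,\|\, p^{\text{student}}\right) + (1 - \alpha) \cdot \text{CE}
  \]
  where both distributions are softmax outputs at temperature \(T\), and the \(T^2\) factor keeps the KL term on a scale comparable to the cross-entropy term.
- Softens teacher and student logits with **temperature scaling** (`temperature=4.0`) and weights the KL term with `alpha=0.7`.
- Applies **label smoothing** (`0.1`) to the cross-entropy term to improve generalization.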
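
For a single example, the combined loss can be worked through without PyTorch. The sketch below is a pure-Python illustration of the same formula as the trainer's `compute_loss` (KL from teacher to student at temperature `T`, scaled by `T * T`, plus label-smoothed cross-entropy); the helper names and the logits are made up for the demo.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7, smoothing=0.1):
    # KL(teacher || student) on temperature-softened distributions
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student))
    # Cross-entropy against a smoothed one-hot target (temperature 1)
    probs = softmax(student_logits)
    n = len(probs)
    target = [smoothing / n + (1 - smoothing) * (i == label) for i in range(n)]
    ce = -sum(t * math.log(p) for t, p in zip(target, probs))
    loss = alpha * (T * T) * kl + (1 - alpha) * ce
    return {"loss": loss, "kl": kl, "ce": ce}

teacher = [2.0, 0.5, 0.1, -1.0]
same = kd_loss(teacher, teacher, label=0)               # student matches teacher
diff = kd_loss([0.0, 2.0, 0.1, -1.0], teacher, label=0) # student disagrees
print(same)
print(diff)
```

When the student reproduces the teacher's logits the KL term vanishes and only the cross-entropy part remains; any disagreement adds a positive KL penalty.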

---

### 🧑‍🏫 5. Train the Teacher Model

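- Standard fine-tuning of `roberta-base` on the full training data.
- Evaluated on the validation split. Because that split is a subset of the full training data, validation metrics are optimistic.
- Saves the best model using early stopping (patience 3) and `load_best_model_at_end`. In the logged run, validation accuracy peaked at 0.9255 at step 1000 and then dropped sharply, so training stopped at step 2500 and the step-1000 checkpoint was restored.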
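
The interaction between early stopping and best-checkpoint selection can be traced with a small simulation. The helper below is a simplified, hypothetical model of patience-based stopping (it is not the `EarlyStoppingCallback` implementation), applied to the validation accuracies from the teacher's training log in this notebook.

```python
def simulate_early_stopping(evals, patience=3):
    """evals: list of (step, metric) pairs in evaluation order."""
    best_step, best_metric, bad_evals = None, float("-inf"), 0
    for step, metric in evals:
        if metric > best_metric:
            # New best: remember it and reset the patience counter
            best_step, best_metric, bad_evals = step, metric, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return {"stopped_at": step, "best_step": best_step,
                        "best_metric": best_metric}
    return {"stopped_at": None, "best_step": best_step, "best_metric": best_metric}

teacher_log = [(500, 0.921417), (1000, 0.9255), (1500, 0.705417),
               (2000, 0.540667), (2500, 0.64825)]
result = simulate_early_stopping(teacher_log)
print(result)
```

Three consecutive evaluations without improvement after step 1000 trigger the stop at step 2500, and the best checkpoint is the one from step 1000.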

---

### 👩‍🎓 6. Train the Student Model (with Distillation)

- Loads a new RoBERTa model and applies LoRA.
- Student model is trained using:
  - **Teacher soft logits** (KL loss)
  - **True labels** (Cross-Entropy)
- Validation monitored every 500 steps.

---

### 🎯 7. Final Fine-Tuning of the Student

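- Continues training the distilled student with a standard `Trainer` (plain cross-entropy, no teacher) on the **entire** training set without augmentation.
- In the logged run, validation accuracy rose from about 0.930 to 0.947, although the validation examples are part of this training data.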

---

### 🧾 8. Generate Predictions on Test Data

- Sets model to eval mode and runs inference on the tokenized test set.
- Collects predictions using `argmax` over model logits.
- Saves outputs in `submission.csv`:

```csv
ID,label
0,2
1,0
...
```
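
As a minimal sketch of the last step, the snippet below turns a couple of made-up logit rows into the `ID,label` layout shown above, using only the standard library in place of PyTorch and pandas. In the `ag_news` dataset the label ids 0 to 3 are understood to correspond to World, Sports, Business, and Sci/Tech.

```python
import csv
import io

def argmax(row):
    # Index of the largest logit, i.e. the predicted class id
    return max(range(len(row)), key=row.__getitem__)

logits = [[0.1, 0.2, 2.5, -1.0],   # made-up scores for two test examples
          [3.0, 0.5, 0.1, 0.0]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "label"])
for idx, row in enumerate(logits):
    writer.writerow([idx, argmax(row)])

csv_text = buffer.getvalue()
print(csv_text)
```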

---

This modular training pipeline pairs a fully fine-tuned teacher with a LoRA student that trains fewer than 1M parameters, aiming for both **model quality** (via the teacher's soft targets) and **efficiency** (via the small trainable footprint).

In [2]:
# ---------------------------
# Data Preparation
# ---------------------------
# Load the AG News dataset and clean texts
dataset = load_dataset("ag_news")
dataset = dataset.map(lambda x: {"text": clean_text(x["text"])})

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Determine optimal max_length from a sample of training texts
text_lengths = [len(tokenizer.encode(text)) for text in dataset["train"]["text"][:1000]]
max_length = min(128, 8 * round(np.percentile(text_lengths, 95) / 8))
print(f"Using max_length: {max_length}")

# Create training/validation split from the training set
train_val_dataset = dataset["train"].train_test_split(test_size=0.1, seed=SEED)

# Tokenize the training set (with augmentation) and validation set (without augmentation)
tokenized_train = train_val_dataset["train"].map(
    lambda x: tokenize_and_augment_function(x, augment=True), batched=True
)
tokenized_val = train_val_dataset["test"].map(
    lambda x: tokenize_and_augment_function(x, augment=False), batched=True
)
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_val = tokenized_val.rename_column("label", "labels")
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# For final training on full training data (without augmentation)
tokenized_full_train = dataset["train"].map(
    lambda x: tokenize_and_augment_function(x, augment=False), batched=True
)
tokenized_full_train = tokenized_full_train.rename_column("label", "labels")
tokenized_full_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Prepare test set: clean and tokenize (without augmentation)
test_file = "/kaggle/input/dlp2-2025/test_unlabelled.pkl"
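if not os.path.exists(test_file):
    # Fall back to the first matching file found anywhere under /kaggle/input
    matches = (
        os.path.join(dirname, filename)
        for dirname, _, filenames in os.walk("/kaggle/input")
        for filename in filenames
        if filename == "test_unlabelled.pkl"
    )
    test_file = next(matches, test_file)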
with open(test_file, "rb") as f:
    test_dataset = pickle.load(f)
test_dataset = Dataset.from_dict({"text": [clean_text(text) for text in test_dataset["text"]]})
tokenized_test = test_dataset.map(lambda x: tokenize_and_augment_function(x, augment=False), batched=True)
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask"])

# ---------------------------
# LoRA Configuration for the Student Model
# (This configuration ensures trainable parameters remain below 1 million)
def get_lora_config(r=4, lora_alpha=16, lora_dropout=0.1):
    return LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=r,
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        bias="none",
        target_modules=["query", "key", "value"]
    )

student_config = get_lora_config(r=4, lora_alpha=16, lora_dropout=0.1)

# ---------------------------
# Distillation Trainer Definition
# ---------------------------
class DistillationTrainer(Trainer):
    def __init__(self, teacher_model, temperature=4.0, alpha=0.7, *args, **kwargs):
        """
        temperature: smoothing temperature for teacher predictions.
        alpha: weight for the distillation (KL) loss. Final loss = alpha * KL_loss + (1 - alpha) * CE_loss.
        """
        super().__init__(*args, **kwargs)
        self.teacher_model = teacher_model
        self.temperature = temperature
        self.alpha = alpha
        self.teacher_model.eval()  # Disable dropout; gradients are skipped via torch.no_grad() in compute_loss

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        # Forward pass of student model
        outputs = model(**inputs)
        student_logits = outputs.logits

        # Standard cross-entropy loss with label smoothing
        loss_ce = F.cross_entropy(student_logits, labels, label_smoothing=0.1)

        # Get teacher model logits (no gradient)
        with torch.no_grad():
            teacher_outputs = self.teacher_model(**inputs)
            teacher_logits = teacher_outputs.logits

        T = self.temperature
        loss_kl = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean"
        ) * (T * T)

        loss = self.alpha * loss_kl + (1 - self.alpha) * loss_ce
        if return_outputs:
            return loss, outputs
        return loss

# ---------------------------
# Step 1: Train the Teacher Model
# ---------------------------
# Teacher model: standard RoBERTa fine-tuned on the full training data (without LoRA modification)
teacher_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
teacher_model.to(device)

teacher_training_args = TrainingArguments(
    output_dir="./teacher_model",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=2
)

teacher_trainer = Trainer(
    model=teacher_model,
    args=teacher_training_args,
    train_dataset=tokenized_full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

print("\nTraining the teacher model...")
teacher_trainer.train()
teacher_results = teacher_trainer.evaluate(eval_dataset=tokenized_val)
print(f"Teacher model validation accuracy: {teacher_results['eval_accuracy']}")

# ---------------------------
# Step 2: Train the Student Model with Distillation
# ---------------------------
# Create student model with LoRA modification (ensuring < 1M trainable parameters)
student_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
student_model = get_peft_model(student_model, student_config)
student_model.to(device)

# Check trainable parameter count
trainable_params = sum(p.numel() for p in student_model.parameters() if p.requires_grad)
print(f"Student trainable parameters: {trainable_params}")
if trainable_params > 1_000_000:
    print("Warning: Student model exceeds 1 million trainable parameters!")

student_training_args = TrainingArguments(
    output_dir="./student_model",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=6,  # Slight increase in epochs for improved convergence
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=2
)

# Create and use the custom DistillationTrainer with updated hyperparameters
distillation_trainer = DistillationTrainer(
    teacher_model=teacher_model,
    temperature=4.0,
    alpha=0.7,
    model=student_model,
    args=student_training_args,
    train_dataset=tokenized_full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

print("\nTraining the student model with distillation...")
distillation_trainer.train()
student_results = distillation_trainer.evaluate(eval_dataset=tokenized_val)
print(f"Student model (with distillation) validation accuracy: {student_results['eval_accuracy']}")

# ---------------------------
# Step 3: Final Fine-Tuning of the Student Model
# ---------------------------
# Further fine-tune the student model on full training data without augmentation (optional)
print("\nFine-tuning the student model on the full training dataset (no augmentation)...")
final_training_args = TrainingArguments(
    output_dir="./final_student_model",
    learning_rate=2e-4,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True,
    report_to="none",
    warmup_steps=500,
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=2
)

final_trainer = Trainer(
    model=student_model,
    args=final_training_args,
    train_dataset=tokenized_full_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)

final_trainer.train()
final_results = final_trainer.evaluate()
print(f"Final student model validation accuracy: {final_results['eval_accuracy']}")

# ---------------------------
# Step 4: Generate Predictions on Test Data
# ---------------------------
print("Generating predictions on test data...")
student_model.eval()
all_predictions = []
test_dataloader = DataLoader(tokenized_test, batch_size=128)
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = student_model(**batch)
        preds = torch.argmax(outputs.logits, dim=-1)
        all_predictions.extend(preds.cpu().numpy())

df = pd.DataFrame({
    "ID": list(range(len(all_predictions))),
    "label": all_predictions
})
df.to_csv("submission.csv", index=False)
print("✅ Predictions complete. Saved to submission.csv.")

Using device: cuda



Using max_length: 96



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Training the teacher model...


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.3702,0.241607,0.921417,0.923851,0.921417,0.920812
1000,0.3052,0.259011,0.9255,0.925364,0.9255,0.925335
1500,0.6299,0.585973,0.705417,0.594021,0.705417,0.629118
2000,0.9657,0.957589,0.540667,0.533548,0.540667,0.452681
2500,0.7624,0.766627,0.64825,0.560101,0.64825,0.575172


Teacher model validation accuracy: 0.9255


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Student trainable parameters: 814852

Training the student model with distillation...




Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.9126,0.324126,0.914917,0.915576,0.914917,0.914858
1000,0.31,0.258791,0.923667,0.923538,0.923667,0.923429
1500,0.2703,0.250017,0.925,0.924699,0.925,0.92481
2000,0.2609,0.240335,0.92725,0.926869,0.92725,0.926961
2500,0.2489,0.233429,0.928083,0.927803,0.928083,0.927903
3000,0.2437,0.230863,0.925917,0.925632,0.925917,0.925655
3500,0.2387,0.226037,0.929,0.928819,0.929,0.928797
4000,0.2389,0.223399,0.929417,0.929108,0.929417,0.929207
4500,0.2319,0.22188,0.929583,0.929332,0.929583,0.929434
5000,0.2285,0.220656,0.93025,0.930003,0.93025,0.930106


Student model (with distillation) validation accuracy: 0.9303333333333333

Fine-tuning the student model on the full training dataset (no augmentation)...




Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
500,0.2059,0.183089,0.938833,0.939744,0.938833,0.938969
1000,0.1863,0.169473,0.942417,0.942488,0.942417,0.942418
1500,0.1745,0.162543,0.944667,0.944732,0.944667,0.944679
2000,0.1709,0.155072,0.947167,0.947341,0.947167,0.947167
2500,0.1596,0.153103,0.947083,0.947391,0.947083,0.947128


Final student model validation accuracy: 0.9471666666666667
Generating predictions on test data...
✅ Predictions complete. Saved to submission.csv.


In [3]:
trainable_params = sum(p.numel() for p in student_model.parameters() if p.requires_grad)
print(f"Student trainable parameters: {trainable_params}")
if trainable_params > 1_000_000:
    print("Warning: Student model exceeds 1 million trainable parameters!")

Student trainable parameters: 814852
