![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# *Yahoo Answers Topic Classification*

### Authors: **Tirdod Behbehani, Marvin Ernst, Pol Garcia, Oliver Tausendschön**

#### Class: **22DM015 Advanced Methods in Natural Language Processing**

##### *Final Assignment*

##### Supervisor: **Arnault Gombert**

**Date: June 15, 2025**


# **Part 4: Model Distillation and Quantization**

In this section, we aim to reduce the computational load of our best-performing model from Part 3. This is essential for efficient deployment in resource-constrained environments.

We want to reduce computational cost while retaining as much performance as possible.

**Importing the relevant libraries:**

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

from datasets import load_dataset, Dataset
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split

import torch
import torch.nn.functional as F
# transformers does not export AdamW in this environment, so use the PyTorch implementation.
from torch.optim import AdamW
from torch.utils.data import DataLoader

from sklearn.metrics import accuracy_score, precision_score, recall_score

import time

Set a seed for reproducibility:

In [None]:
seed = 42
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
set_seed(seed)

## a. Model Distillation/Quantization 

**Goal**: Convert our large BERT-based model into a smaller and faster model using knowledge distillation and/or quantization techniques.

### 1. Distillation Setup

We will use the `distilbert-base-uncased` architecture as our student model. The teacher is our fine-tuned BERT model from Part 2, which was our best model.


In [None]:
import os
from huggingface_hub import login

# Log in to the Hugging Face Hub only when a token is set in the environment.
# A token is needed only if the checkpoint is private or gated.
hf_token = os.environ.get("HF_TOKEN")
if hf_token:
    login(token=hf_token)

teacher_ckpt = "tirdodbehbehani/yahoo-bert-32shot_stratified_augm_2"
teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_ckpt)
teacher_tokenizer = AutoTokenizer.from_pretrained(teacher_ckpt)

student_ckpt = "distilbert-base-uncased"
student_model = DistilBertForSequenceClassification.from_pretrained(student_ckpt, num_labels=10)
student_tokenizer = DistilBertTokenizerFast.from_pretrained(student_ckpt)

### 2. Dataset Preparation

We will use 10% of the dataset for distillation due to computational constraints.

First, we load the Yahoo dataset:

In [5]:
df = pd.concat([
    load_dataset("community-datasets/yahoo_answers_topics", split="train").to_pandas(),
    load_dataset("community-datasets/yahoo_answers_topics", split="test").to_pandas()
])

df = df[df["question_content"].str.strip() != ""]
df = df[df["best_answer"].str.strip() != ""]
df = df[["question_content", "topic"]].rename(columns={"topic": "label"})

Sample 10%:

In [6]:
df_sample, _ = train_test_split(df, train_size=0.1, stratify=df["label"], random_state=42)
df_sample = df_sample.reset_index(drop=True)

In [7]:
label_to_id = {label: idx for idx, label in enumerate(sorted(df_sample["label"].unique()))}
df_sample["label"] = df_sample["label"].map(label_to_id)

Convert to Hugging Face Dataset:

In [8]:
dataset = Dataset.from_pandas(df_sample)

In [9]:
def tokenize_student(example):
    return student_tokenizer(example["question_content"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize_student, batched=True)
dataset = dataset.train_test_split(test_size=0.2, seed=42)
dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])


### 3. Distillation Training

We use a simple loss function that encourages the student to match the teacher's predictions.

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
teacher_model.to(device).eval()
student_model.to(device).train()

train_loader = DataLoader(dataset["train"], batch_size=32, shuffle=True)
optimizer = AdamW(student_model.parameters(), lr=5e-5)


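We define a distillation loss that combines two terms, weighted by `alpha`: the cross-entropy between the student's predictions and the true labels, and the Kullback-Leibler divergence between the teacher's and the student's output distributions after both are softened by a `temperature`. The KL term is multiplied by `temperature ** 2` so that its gradients stay on a scale comparable to the cross-entropy term.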

In [None]:
def distillation_loss(student_logits, teacher_logits, true_labels, alpha=0.5, temperature=2.0):
    ce_loss = F.cross_entropy(student_logits, true_labels)
    kl_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean"
    ) * (temperature ** 2)
    return alpha * ce_loss + (1 - alpha) * kl_loss
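
To make the two terms concrete, here is a minimal pure-Python sketch of the same loss for a single example. It is an illustration only, not the training code: `softmax`, `distillation_loss_single`, and the toy logits are hypothetical names and values introduced here. It shows how raising the temperature flattens the teacher's distribution, which is what lets the student learn from the relative probabilities of the non-target classes.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss_single(student_logits, teacher_logits, true_label,
                             alpha=0.5, temperature=2.0):
    """Distillation loss for one example (illustration only)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    ce = -math.log(softmax(student_logits)[true_label])
    # Soft-label term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(p * (math.log(p) - math.log(q)) for p, q in zip(p_teacher, q_student))
    return alpha * ce + (1 - alpha) * kl * temperature ** 2

# Hypothetical logits for a 3-class toy problem.
teacher = [4.0, 1.0, 0.0]
student = [2.0, 1.5, 0.5]

print("teacher probs, T=1:", softmax(teacher, temperature=1.0))
print("teacher probs, T=2:", softmax(teacher, temperature=2.0))
print("loss:", distillation_loss_single(student, teacher, true_label=0))
```

With `alpha=1.0` the loss reduces to ordinary supervised cross-entropy, and with `alpha=0.0` the student learns only from the teacher's soft targets.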

In [None]:
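for epoch in range(3):
    student_model.train()  # make sure dropout is active in every epoch
    total_loss = 0.0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)

        # The teacher only supplies soft targets, so no gradients are needed.
        # Note: the student-tokenized IDs are reused for the teacher, which is
        # only valid if both models share the same vocabulary.
        with torch.no_grad():
            teacher_logits = teacher_model(input_ids=input_ids, attention_mask=attention_mask).logits

        student_logits = student_model(input_ids=input_ids, attention_mask=attention_mask).logits
        loss = distillation_loss(student_logits, teacher_logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    # Report the mean loss per batch so the value does not depend on dataset size.
    print(f"Epoch {epoch + 1}: Mean loss = {total_loss / len(train_loader):.4f}")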

### 4. Quantization with ONNX or PyTorch Static Quantization

- Quantization can further reduce model size and speed up inference.
- To implement: use Hugging Face Optimum + ONNX export, or PyTorch FX graph mode for static quantization.
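
As a conceptual illustration of what quantization does to a model's weights, the sketch below applies symmetric int8 quantization to a small list of floats in pure Python. It is not the Optimum/ONNX or PyTorch FX route mentioned above, and `quantize_int8`, `dequantize`, and the example weights are hypothetical names and values introduced here. The point is only to show the trade-off: each weight is stored as an 8-bit integer plus one shared scale, at the cost of a bounded rounding error.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.88, -0.51]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max reconstruction error:", max_error)
```

Because the scale is derived from the largest absolute weight, no value is clipped and the reconstruction error of each weight is at most half a quantization step (`scale / 2`).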


## b. Performance and Speed Comparison

In [None]:
student_model.eval()
test_loader = DataLoader(dataset["test"], batch_size=64)
all_preds, all_labels = [], []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to("cpu").numpy()

        outputs = student_model(input_ids=input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, axis=1).cpu().numpy()

        all_preds.extend(preds)
        all_labels.extend(labels)

acc = accuracy_score(all_labels, all_preds)
prec = precision_score(all_labels, all_preds, average="macro")
rec = recall_score(all_labels, all_preds, average="macro")

print(f"Student Model Accuracy: {acc:.4f}")
print(f"Student Model Precision: {prec:.4f}")
print(f"Student Model Recall: {rec:.4f}")

### Inference Speed Comparison

Sample 100 texts:

In [None]:
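texts = df_sample["question_content"].tolist()[:100]
inputs = student_tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

student_model.eval()
# CUDA kernels run asynchronously, so synchronize before reading the clock.
if device.type == "cuda":
    torch.cuda.synchronize()
start_time = time.perf_counter()
with torch.no_grad():
    _ = student_model(**inputs)
if device.type == "cuda":
    torch.cuda.synchronize()
end_time = time.perf_counter()

print(f"Inference time on 100 samples (student): {end_time - start_time:.2f} seconds")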
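
A single timed forward pass can be noisy, since the first call often includes one-off setup costs. The sketch below shows one way to get a steadier number, using warm-up runs and the median over several repeats. `time_callable` and the two stand-in workloads are hypothetical names introduced here; in the notebook, the callables would wrap the teacher and student forward passes instead.

```python
import statistics
import time

def time_callable(fn, warmup=2, repeats=5, sync=None):
    """Return the median wall-clock time of fn() in seconds.

    sync is an optional zero-argument callable run before each clock read,
    for example a GPU synchronization function.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workloads of different cost (placeholders for real models).
def slow_model():
    return sum(i * i for i in range(200_000))

def fast_model():
    return sum(i * i for i in range(20_000))

t_slow = time_callable(slow_model)
t_fast = time_callable(fast_model)
print(f"slow: {t_slow:.4f}s, fast: {t_fast:.4f}s, speedup: {t_slow / t_fast:.1f}x")
```

For the models in this notebook, one could pass something like `lambda: student_model(**inputs)` as `fn` inside a `torch.no_grad()` block, with `sync=torch.cuda.synchronize` when running on a GPU, and repeat the call for the teacher on identically sized inputs to obtain an actual comparison.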

## c. Analysis and Improvements