# LLM Project: Email Routing (5 departments)

This notebook is the **main deliverable**. It loads the dataset via `datapreparation.py`, trains a **DistilBERT** classifier, and reports **Accuracy**, **Inference Time**, and **Memory**.

**Note:** The project specification requires two other agents (GPT-2 prompting and GPT-2 + LoRA). The prompting agent is implemented in `src/agents/` and evaluated in Section 5; the LoRA agent is still a placeholder.


## 0. Environment / GPU check
Training runs much faster on an NVIDIA GPU.

If training is slow:
- verify that CUDA is available (`torch.cuda.is_available()`)
- free VRAM (close other Python notebooks or apps using the GPU)
- reduce the batch size to fit a 4 GB GPU (e.g., RTX 3050 Laptop).


In [26]:
import os, time, json, random
from pathlib import Path

import numpy as np
import torch
import psutil

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(RANDOM_SEED)

# Paths
ROOT = Path('.').resolve()
OUT_METRICS = ROOT / 'outputs' / 'metrics'
OUT_FIG = ROOT / 'outputs' / 'figures'
OUT_METRICS.mkdir(parents=True, exist_ok=True)
OUT_FIG.mkdir(parents=True, exist_ok=True)

# =========================
# Device selection (CPU / GPU)
# =========================

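# Set FORCE_DEVICE to None to auto-detect, or to "cpu" / "cuda" to force a device.
# Note: `Trainer` selects its own device from `TrainingArguments`; this variable
# only affects code that uses `device` explicitly.
FORCE_DEVICE = "cpu"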

if FORCE_DEVICE is not None:
    device = torch.device(FORCE_DEVICE)
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Device:", device)
print("CUDA available:", torch.cuda.is_available())
if device.type == "cuda":
    print("GPU:", torch.cuda.get_device_name(0))


def ram_mb() -> float:
    return psutil.Process(os.getpid()).memory_info().rss / (1024**2)


Device: cpu
CUDA available: True


## 1. Load and prepare data (provided script)
We **must** use `datapreparation.py` as provided in the project instructions:
- load Hugging Face dataset `Tobi-Bueck/customer-support-tickets`
- filter English tickets
- keep only 5 departments
- shuffle and split into train / val / test
- build `label2id` and `id2label`
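
The last step in the list can be sketched as follows. This is only an illustration, not the contents of `datapreparation.py` (which is provided and not reproduced here). The helper name `build_label_maps` is hypothetical, and sorting labels alphabetically is an assumption that happens to agree with the label order printed by the next cell.

```python
def build_label_maps(labels):
    """Return (label_list, label2id, id2label) for an iterable of label strings."""
    # Assumption: ids follow the alphabetical order of the distinct labels.
    label_list = sorted(set(labels))
    label2id = {name: idx for idx, name in enumerate(label_list)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label_list, label2id, id2label


# Toy input using the five department names of this task.
queues = [
    "Technical Support", "Customer Service", "Billing and Payments",
    "Sales and Pre-Sales", "General Inquiry", "Technical Support",
]
label_list_demo, label2id_demo, id2label_demo = build_label_maps(queues)
print(label2id_demo["Customer Service"])
```

Keeping both directions of the mapping lets the model configuration carry readable class names while the training loop works with integer ids.
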


In [27]:
from datapreparation import load_and_prepare_data

train_ds, val_ds, test_ds, label_list, label2id, id2label = load_and_prepare_data()

print('Labels:', label_list)
print('Train size:', len(train_ds), '| Val size:', len(val_ds), '| Test size:', len(test_ds))


Label distribution (train):
Counter({'Technical Support': 6476, 'Customer Service': 3471, 'Billing and Payments': 2307, 'Sales and Pre-Sales': 655, 'General Inquiry': 340})
Labels: ['Billing and Payments', 'Customer Service', 'General Inquiry', 'Sales and Pre-Sales', 'Technical Support']
Train size: 13249 | Val size: 1656 | Test size: 1657


### Quick sanity check
We inspect a few examples and the label distribution to ensure the dataset matches the task.


In [28]:
from collections import Counter

print('Train label distribution:')
print(Counter(train_ds['queue']))

# Show a few samples
for i in range(3):
    ex = train_ds[i]
    print('\n--- Example', i, '---')
    print('queue:', ex.get('queue'))
    print('subject:', (ex.get('subject') or '')[:120])
    print('body:', (ex.get('body') or '')[:200])


Train label distribution:
Counter({'Technical Support': 6476, 'Customer Service': 3471, 'Billing and Payments': 2307, 'Sales and Pre-Sales': 655, 'General Inquiry': 340})

--- Example 0 ---
queue: Customer Service
subject: Guidance on Investment Data Analytics
body: Is it possible to receive guidance on optimizing investments through the use of data analytics and available tools and services? I am interested in learning how to make data-driven decisions.

--- Example 1 ---
queue: Sales and Pre-Sales
subject: 
body: Dear customer support, the data analytics tool is failing to process investment data efficiently. The problem might be due to software compatibility issues. After updating the associated software devi

--- Example 2 ---
queue: Customer Service
subject: Concern Regarding CRM System Malfunction
body: Dear Support Team, our marketing agency is facing issues with the Salesforce CRM system, which is disrupting our client data management process. It seems that recent software upda

## 2. Text formatting + labels
We create a single text field by concatenating subject and body, then add an integer label.


In [29]:
def format_text(ex):
    subject = ex.get('subject', '') or ''
    body = ex.get('body', '') or ''
    return f"Subject: {subject}\nBody: {body}"

def add_text_and_label(ds):
    def _map(ex):
        ex['text'] = format_text(ex)
        ex['label'] = label2id[ex['queue']]
        return ex
    return ds.map(_map)

train_ds2 = add_text_and_label(train_ds)
val_ds2   = add_text_and_label(val_ds)
test_ds2  = add_text_and_label(test_ds)

print('Columns:', train_ds2.column_names)
print('Example fields:', {k: train_ds2[0][k] for k in ['queue','label','text']})


Columns: ['subject', 'body', 'answer', 'type', 'queue', 'priority', 'language', 'version', 'tag_1', 'tag_2', 'tag_3', 'tag_4', 'tag_5', 'tag_6', 'tag_7', 'tag_8', 'text', 'label']
Example fields: {'queue': 'Customer Service', 'label': 1, 'text': 'Subject: Guidance on Investment Data Analytics\nBody: Is it possible to receive guidance on optimizing investments through the use of data analytics and available tools and services? I am interested in learning how to make data-driven decisions.'}


## 3. DistilBERT (Encoder): discriminative classifier
We fine-tune `distilbert-base-uncased` for 5-way classification.

### Why DistilBERT?
- Encoder models are naturally suited for classification tasks.
- Fine-tuning is efficient and usually yields strong accuracy.


In [30]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_transform.weight  | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
vocab_layer_norm.weight | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
classifier.weight       | MISSING    | 
pre_classifier.weight   | MISSING    | 
pre_classifier.bias     | MISSING    | 
classifier.bias         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


### Tokenization
Important: we set the target column name to `labels` because `Trainer` expects that name.


In [31]:
def tokenize_batch(batch):
    tok = tokenizer(
        batch['text'],
        truncation=True,
        padding='max_length',
        max_length=256,
    )
    tok['labels'] = batch['label']
    return tok

train_tok = train_ds2.map(tokenize_batch, batched=True, remove_columns=train_ds2.column_names)
val_tok   = val_ds2.map(tokenize_batch, batched=True, remove_columns=val_ds2.column_names)
test_tok  = test_ds2.map(tokenize_batch, batched=True, remove_columns=test_ds2.column_names)

# Torch format helps performance
train_tok.set_format(type='torch')
val_tok.set_format(type='torch')
test_tok.set_format(type='torch')

print('train_tok columns:', train_tok.column_names)


train_tok columns: ['input_ids', 'token_type_ids', 'attention_mask', 'labels']


### Training (Transformers v5 compatible)
Notes:
- With small GPUs (4 GB VRAM), use a smaller batch size and gradient accumulation.
- `eval_strategy` replaces the older `evaluation_strategy` argument name.
- `Trainer` in v5 no longer accepts `tokenizer=`; pass `processing_class=` instead if the tokenizer needs to be attached.
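
The arithmetic behind batch size and gradient accumulation can be checked with a few lines. This is a rough sketch for a single-device run using the training set size reported earlier (13,249) and the settings in the next cell. The helper `optimizer_steps` is hypothetical, and it assumes that a trailing, incomplete accumulation group still triggers an optimizer update; under that assumption the result agrees with the `global_step=1658` reported by `trainer.train()`.

```python
import math


def optimizer_steps(n_train, per_device_batch, grad_accum, epochs):
    """Estimate the number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    # Assumption: an incomplete final accumulation group still counts as one update.
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


# per_device_train_batch_size * gradient_accumulation_steps
effective_batch = 8 * 2
total_steps = optimizer_steps(n_train=13249, per_device_batch=8, grad_accum=2, epochs=2)
print(effective_batch, total_steps)
```

Halving the per-device batch size while doubling `gradient_accumulation_steps` keeps the effective batch size constant and lowers peak VRAM, at the cost of more forward/backward passes per update.
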


In [32]:
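from transformers import TrainingArguments, Trainer, default_data_collator

# `data_collator` is passed to `Trainer` below but is not defined in the cells shown.
# Assumption: the default collator is sufficient, because tokenization already
# pads every example to `max_length`.
data_collator = default_data_collator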

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, preds)}

args = TrainingArguments(
    output_dir=str(ROOT / 'outputs' / 'checkpoints' / 'distilbert'),
    eval_strategy='epoch',
    save_strategy='epoch',
    logging_strategy='steps',
    logging_steps=50,

    num_train_epochs=2,

    # Good defaults for RTX 3050 Laptop (4GB)
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size = 8 * 2 = 16

    learning_rate=2e-5,
    weight_decay=0.01,

    fp16=torch.cuda.is_available(),
    dataloader_pin_memory=True,
    dataloader_num_workers=2,

    report_to='none',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    greater_is_better=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [33]:
trainer.train()


Epoch,Training Loss,Validation Loss,Accuracy
1,1.844028,0.879475,0.65157
2,1.538122,0.814499,0.687198


There were missing keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.weight', 'distilbert.embeddings.LayerNorm.bias'].
There were unexpected keys in the checkpoint model loaded: ['distilbert.embeddings.LayerNorm.beta', 'distilbert.embeddings.LayerNorm.gamma'].


TrainOutput(global_step=1658, training_loss=1.7423453601335588, metrics={'train_runtime': 681.9986, 'train_samples_per_second': 38.853, 'train_steps_per_second': 2.431, 'total_flos': 1755154461834240.0, 'train_loss': 1.7423453601335588, 'epoch': 2.0})

## 4. Evaluation on test set
We report:
- Accuracy
- Inference time (total + per item)
- RAM usage (process RSS)


In [34]:
def evaluate_on_test(trainer: Trainer, test_tok, label_list):
    ram_before = ram_mb()

    t0 = time.perf_counter()
    pred = trainer.predict(test_tok)
    t1 = time.perf_counter()

    logits = pred.predictions
    y_true = pred.label_ids
    y_pred = np.argmax(logits, axis=-1)

    acc = accuracy_score(y_true, y_pred)
    elapsed = t1 - t0

    ram_after = ram_mb()

    report = classification_report(y_true, y_pred, target_names=label_list, zero_division=0)
    cm = confusion_matrix(y_true, y_pred)

    result = {
        'model': MODEL_NAME,
        'agent': 'DistilBERT classifier',
        'accuracy': float(acc),
        'inference_time_sec_total': float(elapsed),
        'inference_time_sec_per_item': float(elapsed / max(1, len(y_true))),
        'ram_before_mb': float(ram_before),
        'ram_after_mb': float(ram_after),
        'ram_delta_mb': float(ram_after - ram_before),
        'n_test': int(len(y_true)),
    }
    return result, report, cm

# Run after training:
result, report, cm = evaluate_on_test(trainer, test_tok, label_list)
print('Accuracy:', result['accuracy'])
print('Inference total (s):', result['inference_time_sec_total'])
print('RAM delta (MB):', result['ram_delta_mb'])
print('\nClassification report:\n', report)
print('\nConfusion matrix:\n', cm)


Accuracy: 0.687990343995172
Inference total (s): 22.551611000002595
RAM delta (MB): 0.23046875

Classification report:
                       precision    recall  f1-score   support

Billing and Payments       0.89      0.71      0.79       291
    Customer Service       0.47      0.53      0.50       395
     General Inquiry       0.00      0.00      0.00        29
 Sales and Pre-Sales       1.00      0.03      0.06        93
   Technical Support       0.73      0.85      0.79       849

            accuracy                           0.69      1657
           macro avg       0.62      0.42      0.43      1657
        weighted avg       0.70      0.69      0.67      1657


Confusion matrix:
 [[207  41   0   0  43]
 [ 13 209   0   0 173]
 [  1  12   0   0  16]
 [  2  60   0   3  28]
 [  9 119   0   0 721]]
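
The per-class figures in the report can be re-derived from the confusion matrix alone, which makes it easier to see where the errors concentrate. The sketch below uses plain Python and the matrix printed above; the helper name `per_class_metrics` is hypothetical. It follows the scikit-learn convention (rows are true classes, columns are predicted classes) and returns 0 when a class is never predicted, mirroring `zero_division=0`.

```python
def per_class_metrics(cm):
    """Return (accuracy, precisions, recalls) for a square confusion matrix.

    Rows are true classes and columns are predicted classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    precisions, recalls = [], []
    for i in range(n):
        row_sum = sum(cm[i])                        # number of true samples of class i
        col_sum = sum(cm[r][i] for r in range(n))   # number of predictions of class i
        recalls.append(cm[i][i] / row_sum if row_sum else 0.0)
        precisions.append(cm[i][i] / col_sum if col_sum else 0.0)
    return correct / total, precisions, recalls


cm_demo = [
    [207, 41, 0, 0, 43],
    [13, 209, 0, 0, 173],
    [1, 12, 0, 0, 16],
    [2, 60, 0, 3, 28],
    [9, 119, 0, 0, 721],
]
acc_demo, prec_demo, rec_demo = per_class_metrics(cm_demo)
print(round(acc_demo, 3), [round(r, 2) for r in rec_demo])
```

Reading the matrix this way shows that `General Inquiry` is never predicted and that most `Sales and Pre-Sales` emails are routed to `Customer Service`, which is why the macro average is well below the accuracy.
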


### Save results
After evaluation, save metrics as JSON in `outputs/metrics/`.


In [35]:
# After evaluation:
(OUT_METRICS / 'distilbert_results.json').write_text(json.dumps(result, indent=2), encoding='utf-8')
print('Saved:', OUT_METRICS / 'distilbert_results.json')


Saved: C:\Users\Luc\Documents\ING5\NLP\LLM\Project\llm-email-router\outputs\metrics\distilbert_results.json


## 5. Other required agents (to be added)
The project requires two additional methods:
1. **GPT-2 / DistilGPT-2 prompting** (no training)
2. **GPT-2 + LoRA fine-tuning**

We will implement them in `src/agents/` and call them from this notebook.


### Agent 1: Prompting (LLM zero-shot) with GPT-2 (baseline)

In this section, we evaluate a *prompt-based* routing approach using a small decoder-only language model (`distilgpt2`).
The goal is to test whether a generative LLM can infer the correct department **without any supervised training**.

**Input to the agent**
- We build a short prompt from the email `subject` and `body`.
- The model must output one label among:
  `Billing and Payments`, `Customer Service`, `General Inquiry`, `Sales and Pre-Sales`, `Technical Support`.

**Why this baseline matters**
- It provides a reference "no-training" solution.
- It is expected to be weaker than a discriminative classifier (DistilBERT fine-tuned), but it helps quantify the gap.

**Compute constraints**
- We run this agent on **CPU** to avoid saturating GPU memory (and because inference is already slow for generation).
- We measure: accuracy, inference time, and RAM usage on a subset of 200 test samples (for faster execution).
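
To make the prompting setup concrete, here is a minimal sketch of how a prompt could be assembled and how generated text could be mapped back to a label. It is not the code in `src/agents/gpt2_prompting.py`, which is not shown in this notebook: the prompt wording, the helper names `build_prompt` and `parse_label`, and the `max_body_chars` cut-off are all assumptions made for illustration.

```python
LABELS = [
    "Billing and Payments", "Customer Service", "General Inquiry",
    "Sales and Pre-Sales", "Technical Support",
]


def build_prompt(subject, body, labels=LABELS, max_body_chars=500):
    """Assemble a short zero-shot routing prompt (hypothetical format)."""
    options = "\n".join(f"- {name}" for name in labels)
    return (
        "Route the email to one department.\n"
        f"Departments:\n{options}\n\n"
        f"Subject: {subject or ''}\n"
        f"Body: {(body or '')[:max_body_chars]}\n"
        "Department:"
    )


def parse_label(generated, labels=LABELS):
    """Map generated text to a label index, or None when no label name appears."""
    text = generated.lower()
    hits = [(text.find(name.lower()), idx) for idx, name in enumerate(labels)]
    hits = [(pos, idx) for pos, idx in hits if pos >= 0]
    # Prefer the label that is mentioned first in the generated text.
    return min(hits)[1] if hits else None


prompt_demo = build_prompt("Invoice question", None)
print(parse_label(" Technical Support\nThe customer"))
```

Returning `None` for unparsable output, rather than silently falling back to a default class, keeps parsing failures visible so they can be counted separately from genuine misclassifications.
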

In [36]:
from src.agents.gpt2_prompting import GPT2PromptingRouter, PromptingConfig
from src.eval_utils import ram_mb, timed_predict, eval_classification

# Prepare test items
test_items = [{"subject": ex.get("subject") or "", "body": ex.get("body") or ""} for ex in test_ds]
y_true = [label2id[ex["queue"]] for ex in test_ds]

# Force the CPU device here (important)
router = GPT2PromptingRouter(
    label_list=label_list,
    cfg=PromptingConfig(
        model_name="distilgpt2",
        device="cpu",          
        max_new_tokens=8,
        temperature=0.0,
        do_sample=False,
    )
)

ram_before = ram_mb()
timed = timed_predict(router.predict_batch, test_items[:200])  # start with 200 for speed
ram_after = ram_mb()

y_pred = timed["preds"]
metrics = eval_classification(y_true[:200], y_pred, label_list)

result_prompting = {
    "model": "distilgpt2",
    "agent": "GPT-2 prompting (CPU)",
    "accuracy": metrics["accuracy"],
    "inference_time_sec_total": timed["total_sec"],
    "inference_time_sec_per_item": timed["per_item_sec"],
    "ram_before_mb": ram_before,
    "ram_after_mb": ram_after,
    "ram_delta_mb": ram_after - ram_before,
    "n_test": timed["n_items"],
}

result_prompting, metrics["report"]

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


({'model': 'distilgpt2',
  'agent': 'GPT-2 prompting (CPU)',
  'accuracy': 0.155,
  'inference_time_sec_total': 32.21349569997983,
  'inference_time_sec_per_item': 0.16106747849989916,
  'ram_before_mb': 516.87109375,
  'ram_after_mb': 2808.06640625,
  'ram_delta_mb': 2291.1953125,
  'n_test': 200},
 '                      precision    recall  f1-score   support\n\nBilling and Payments       0.15      1.00      0.27        31\n    Customer Service       0.00      0.00      0.00        56\n     General Inquiry       0.00      0.00      0.00         2\n Sales and Pre-Sales       0.00      0.00      0.00        12\n   Technical Support       0.00      0.00      0.00        99\n\n            accuracy                           0.15       200\n           macro avg       0.03      0.20      0.05       200\n        weighted avg       0.02      0.15      0.04       200\n')

## Results: GPT-2 Prompting baseline

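**Observed performance (200 test samples):**
- Accuracy is 0.155, compared with 0.688 for the fine-tuned DistilBERT classifier on the full test set (1,657 samples). The classification report shows recall 1.00 for `Billing and Payments` and 0.00 for every other class, so the accuracy equals that class's share of the subset (31/200).
- Inference takes about 0.16 s per email on CPU (32.2 s in total), compared with about 0.014 s per email for DistilBERT (22.6 s for 1,657 samples), because the model must *generate* text tokens for every email.
- RAM usage is high: process RSS grows by about 2,291 MB around the prediction call (from 516.9 MB to 2,808.1 MB), presumably because the GPT-2 model, tokenizer, and generation cache occupy significant memory.

The two agents were evaluated on different numbers of samples, so these figures are indicative rather than strictly comparable.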

**Interpretation**
This confirms that a small decoder-only model (`distilgpt2`) used in a zero-shot prompting setup is not well-suited for reliable multi-class routing on this dataset.

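Main reasons:
1. **No supervised learning**: the model is not trained to map emails to queues.
2. **Unconstrained generation**: `temperature=0` with `do_sample=False` gives greedy, repeatable decoding, but nothing forces the 8 generated tokens to spell a valid label. The load log also warns that right-padding was detected for a decoder-only model, which can degrade batched generation unless the tokenizer is created with `padding_side='left'`.
3. **Label format mismatch**: the model may output partial text, synonyms, or irrelevant tokens, which hurts exact label matching.
4. **Class imbalance**: rare classes (e.g., `General Inquiry`, 2 of the 200 samples) are especially difficult to predict without training.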
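
A quick way to put the prompting score in context is to compare it with constant predictors. The sketch below uses the per-class support from the classification report for the 200-sample subset; the helper name `constant_baseline_accuracy` is hypothetical. The report is consistent with the agent emitting `Billing and Payments` for every email, although the notebook does not show the raw generations, so that reading remains an inference.

```python
def constant_baseline_accuracy(support):
    """Accuracy obtained by always predicting each class, given true class counts."""
    total = sum(support.values())
    return {label: count / total for label, count in support.items()}


# True class counts in the 200-sample subset, taken from the classification report.
support_200 = {
    "Billing and Payments": 31,
    "Customer Service": 56,
    "General Inquiry": 2,
    "Sales and Pre-Sales": 12,
    "Technical Support": 99,
}
baselines = constant_baseline_accuracy(support_200)
for label, acc in sorted(baselines.items(), key=lambda kv: -kv[1]):
    print(f"{label:22s} {acc:.3f}")
```

Always predicting `Technical Support` would reach 0.495 on this subset, so the prompting agent is below the majority-class baseline, not just below the fine-tuned classifier.
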

**Conclusion**
Prompting with GPT-2 is a useful baseline, but it is clearly outperformed by the discriminative DistilBERT classifier fine-tuned on labeled data.
The next step is to evaluate a *fine-tuned generative agent* (e.g., GPT-2 + LoRA) to see how much training improves routing quality.

## 6. Final comparison table (to be added)
At the end, we will build a table comparing all agents:
- Accuracy
- Inference time
- Memory usage
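
Until the table is built, a sketch of how it could be rendered from the result dictionaries is given here. The helper name `comparison_table` and the column choice are assumptions; the keys mirror those used in `result` and `result_prompting`, and the values are copied from the outputs above (the DistilBERT per-item time is derived as total time divided by `n_test`).

```python
def comparison_table(results):
    """Render a list of result dicts as a Markdown table."""
    header = "| Agent | Accuracy | Time per item (s) | RAM delta (MB) | n_test |"
    sep = "|---|---|---|---|---|"
    rows = [
        f"| {r['agent']} | {r['accuracy']:.3f} | "
        f"{r['inference_time_sec_per_item']:.4f} | "
        f"{r['ram_delta_mb']:.1f} | {r['n_test']} |"
        for r in results
    ]
    return "\n".join([header, sep, *rows])


results_demo = [
    {
        "agent": "DistilBERT classifier",
        "accuracy": 0.687990343995172,
        "inference_time_sec_per_item": 22.551611000002595 / 1657,
        "ram_delta_mb": 0.23046875,
        "n_test": 1657,
    },
    {
        "agent": "GPT-2 prompting (CPU)",
        "accuracy": 0.155,
        "inference_time_sec_per_item": 0.16106747849989916,
        "ram_delta_mb": 2291.1953125,
        "n_test": 200,
    },
]
table_demo = comparison_table(results_demo)
print(table_demo)
```

The `n_test` column is kept in the table because the agents were evaluated on different numbers of samples, which matters when reading the accuracy and timing columns side by side.
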
