# Email classification

Workflow:
1. Load CSV (columns: `target`, `content`)  
2. Tokenise + train (RoBERTa-large-MNLI, see `MODEL_NAME`)  
3. Dynamic quantisation (CPU)  
4. Push model to HF Hub  
5. Benchmark base vs. quantised

In [1]:
# 0. Install missing packages (Colab / clean env)
!pip install -q -U transformers datasets evaluate accelerate scikit-learn

In [2]:
# 1. Imports
import os, torch, csv, time
import numpy as np
from datasets import load_dataset, DatasetDict, ClassLabel
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)
from sklearn.metrics import accuracy_score, f1_score

MODEL_NAME = "FacebookAI/roberta-large-mnli"
# HF_HUB_MODEL_ID = "your-hf-username/distilbert-10cls-quant"  # <- CHANGE
# !hf download --repo-type=dataset zefang-liu/phishing-email-dataset --cache-dir ./dataset
CSV_PATH = "/content/dataset/sample-phising.csv"

# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [4]:
# 2. Load CSV -> DatasetDict
raw_ds = load_dataset("csv", data_files=CSV_PATH)["train"]

# Build label mapping automatically
labels = sorted(raw_ds.unique("target"))
assert len(labels) == 2, f"Expected 2 labels, got {len(labels)}"
print(f"Labels: {labels}")

label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

def preprocess(examples):
    examples["label"] = [label2id[t] for t in examples["target"]]
    return examples

raw_ds = raw_ds.map(preprocess, batched=True)

# Convert 'label' column to ClassLabel
raw_ds = raw_ds.cast_column("label", ClassLabel(names=labels))

# 80/10/10 split
train_test = raw_ds.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
test_valid = train_test["test"].train_test_split(test_size=0.5, stratify_by_column="label", seed=42)
ds = DatasetDict({
    "train": train_test["train"],
    "validation": test_valid["train"],
    "test": test_valid["test"]
})
print(ds)

Generating train split: 0 examples [00:00, ? examples/s]

Labels: ['Phishing Email', 'Safe Email']


Map:   0%|          | 0/452 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/452 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'content', 'target', 'label'],
        num_rows: 361
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'content', 'target', 'label'],
        num_rows: 45
    })
    test: Dataset({
        features: ['Unnamed: 0', 'content', 'target', 'label'],
        num_rows: 46
    })
})
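
The split sizes printed above (361 / 45 / 46 out of 452 rows) are consistent with rounding the held-out part up at each of the two splits. A minimal sketch of that arithmetic, assuming this rounding rule; `split_sizes` is a hypothetical helper, not part of `datasets`:

```python
import math

def split_sizes(n_rows, held_out_fraction):
    """Return (n_kept, n_held_out), rounding the held-out part up (assumed rule)."""
    n_held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - n_held_out, n_held_out

n_train, n_rest = split_sizes(452, 0.2)     # first split: 80 % kept, 20 % held out
n_valid, n_test = split_sizes(n_rest, 0.5)  # second split halves the held-out 20 %
print(n_train, n_valid, n_test)
```

Because 452 is not divisible by 10, the validation and test sets differ by one row.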


In [5]:
ds['train'][0]['label']

0

In [6]:
# 3. Tokeniser
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # Replace non-string entries (e.g. missing values) with a "*" placeholder
    # so the tokenizer only ever receives strings.
    texts = [item if isinstance(item, str) else "*" for item in batch["content"]]
    # No padding here: DataCollatorWithPadding pads each batch at training time.
    return tokenizer(texts, truncation=True)

ds_tok = ds.map(tokenize, batched=True, remove_columns=["content", "target"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Map:   0%|          | 0/361 [00:00<?, ? examples/s]

Map:   0%|          | 0/45 [00:00<?, ? examples/s]

Map:   0%|          | 0/46 [00:00<?, ? examples/s]
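
Tokenisation above truncates but does not pad; padding is left to `DataCollatorWithPadding`, which pads each batch to its own longest sequence. The sketch below illustrates the idea in plain Python. `pad_batch` is a hypothetical helper, and the token ids and `pad_id=1` are illustrative assumptions rather than values read from the tokenizer:

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one in this batch and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two made-up token-id sequences of different lengths.
batch = pad_batch([[0, 713, 2], [0, 713, 16, 10, 2]], pad_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short emails cheap to process.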

## Original model response

In [7]:
from transformers import pipeline

classifier = pipeline('zero-shot-classification', model='roberta-large-mnli')
# sequence_to_classify = ds['train'][0]['content']
sequence_to_classify = "software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet wild break able build . tail , send subtract represent . job cow student inch gave . let still warm , family draw , land book . glass plan include . sentence is , hat silent nothing . order , wild famous long their . inch such , saw , person , save . face , especially sentence science . certain , cry does . two depend yes , written carry ."

classifier(sequence_to_classify, labels)

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


{'sequence': 'software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet wild break able build . tail , send subtract represent . job cow student inch gave . let still warm , family draw , land book . glass plan include . sentence is , hat silent nothing . order , wild famous long their . inch such , saw , person , save . face , especially sentence science . certain , cry does . two depend yes , written carry .',
 'labels': ['Safe Email', 'Phishing Email'],
 'scores': [0.643171489238739, 0.3568284809589386]}
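
The two scores above sum to 1 because, in single-label mode, the zero-shot pipeline compares the candidate labels against each other: each label is turned into a hypothesis, the NLI model produces an entailment logit for it, and a softmax is taken across the labels. A rough sketch of that last step; the logits below are made-up placeholders, not values from this run:

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(entailment_logits):
    """Rank candidate labels by softmax over their entailment logits."""
    names = list(entailment_logits)
    probs = softmax([entailment_logits[name] for name in names])
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical entailment logits, one per candidate label.
ranked = zero_shot_scores({"Phishing Email": 0.2, "Safe Email": 0.8})
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```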

## Train custom model

In [8]:
# 4. Build model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
).to(device)

Some weights of the model checkpoint at FacebookAI/roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-large-mnli and are newly initialized because the shapes did not match:
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instanti

In [9]:
# 5. Metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro")
    }
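
For reference, macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. A minimal pure-Python sketch of the same quantity that `f1_score(..., average="macro")` reports; the labels in the example are made up:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example: class 0 has F1 = 0.8, class 1 has F1 = 2/3.
example_f1 = macro_f1([0, 0, 0, 1], [0, 0, 1, 1])
print(round(example_f1, 4))
```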

In [10]:
# 6. Training arguments
args = TrainingArguments(
    output_dir="./out",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
    report_to="none",  # Disable wandb reporting
    fp16=torch.cuda.is_available(),  # Mixed precision only when a GPU is present
    # hub_model_id=HF_HUB_MODEL_ID,
    # hub_strategy="every_save",
    # hub_token=os.getenv("HF_TOKEN") or None,  # export HF_TOKEN=...
)
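
No scheduler is set in the arguments above, so the `Trainer` falls back to its default, which to my knowledge is a linear decay from `learning_rate` to zero with no warm-up. A small sketch of that schedule, under that assumption; `linear_decay_lr` is a hypothetical helper and 460 is the `global_step` reported in the training output of this notebook:

```python
def linear_decay_lr(step, total_steps, base_lr=2e-5):
    """Learning rate after `step` optimizer updates, decaying linearly to zero."""
    remaining = max(0.0, (total_steps - step) / total_steps)
    return base_lr * remaining

TOTAL_STEPS = 460  # taken from the TrainOutput of this run
for step in (0, 230, 460):
    print(step, linear_decay_lr(step, TOTAL_STEPS))
```

Under this schedule the later epochs update the weights with a much smaller step size than the early ones.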

In [11]:
# 7. Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds_tok["train"],
    eval_dataset=ds_tok["validation"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument in recent transformers releases
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro |
|---|---|---|---|---|
| 1 | No log | 0.287836 | 0.866667 | 0.86336 |
| 2 | No log | 0.449748 | 0.911111 | 0.907407 |
| 3 | No log | 0.351628 | 0.888889 | 0.883117 |
| 4 | No log | 0.683332 | 0.911111 | 0.905462 |
| 5 | No log | 1.002756 | 0.866667 | 0.854526 |
| 6 | No log | 0.538156 | 0.933333 | 0.92987 |
| 7 | No log | 0.634823 | 0.933333 | 0.92987 |
| 8 | No log | 0.615268 | 0.933333 | 0.92987 |
| 9 | No log | 0.609644 | 0.933333 | 0.92987 |
| 10 | No log | 0.608111 | 0.933333 | 0.92987 |


TrainOutput(global_step=460, training_loss=0.10232805998429008, metrics={'train_runtime': 2453.4996, 'train_samples_per_second': 1.471, 'train_steps_per_second': 0.187, 'total_flos': 3256097196508716.0, 'train_loss': 0.10232805998429008, 'epoch': 10.0})

In [18]:
# Best model
print(trainer.state.best_model_checkpoint)

./out/checkpoint-276
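
`checkpoint-276` corresponds to epoch 6: the run took 460 steps over 10 epochs (46 per epoch), and epoch 6 is the first to reach the top validation accuracy of 0.933333. The sketch below reproduces that selection, assuming the best checkpoint is only replaced on a strict improvement so that ties keep the earlier epoch:

```python
# Validation accuracy per epoch, copied from the training log in this notebook.
accuracy_by_epoch = [0.866667, 0.911111, 0.888889, 0.911111, 0.866667,
                     0.933333, 0.933333, 0.933333, 0.933333, 0.933333]
STEPS_PER_EPOCH = 460 // 10  # global_step / num_train_epochs from TrainOutput

best_epoch, best_accuracy = None, float("-inf")
for epoch, accuracy in enumerate(accuracy_by_epoch, start=1):
    if accuracy > best_accuracy:  # strict comparison: ties keep the earlier epoch
        best_epoch, best_accuracy = epoch, accuracy

best_checkpoint = f"checkpoint-{best_epoch * STEPS_PER_EPOCH}"
print(best_checkpoint)  # checkpoint-276
```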


## Benchmark

In [12]:
from transformers import pipeline

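# NOTE: the fine-tuned model has a 2-class head with no 'entailment' label, so the
# zero-shot pipeline cannot locate the entailment logit (see the warning in the
# output) and the scores it returns are not a reliable prediction. For the
# fine-tuned model, the matching task is
# pipeline('text-classification', model=model, tokenizer=tokenizer).
classifier = pipeline('zero-shot-classification', model=model, tokenizer=tokenizer)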

sequence_to_classify = "software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet wild break able build . tail , send subtract represent . job cow student inch gave . let still warm , family draw , land book . glass plan include . sentence is , hat silent nothing . order , wild famous long their . inch such , saw , person , save . face , especially sentence science . certain , cry does . two depend yes , written carry ."

classifier(sequence_to_classify, labels)

Device set to use cuda:0
Failed to determine 'entailment' label id from the label2id mapping in the model config. Setting to -1. Define a descriptive label2id mapping in the model config to ensure correct outputs.


{'sequence': 'software at incredibly low prices ( 86 % lower ) . drapery seventeen term represent any sing . feet wild break able build . tail , send subtract represent . job cow student inch gave . let still warm , family draw , land book . glass plan include . sentence is , hat silent nothing . order , wild famous long their . inch such , saw , person , save . face , especially sentence science . certain , cry does . two depend yes , written carry .',
 'labels': ['Phishing Email', 'Safe Email'],
 'scores': [0.5050869584083557, 0.4949130117893219]}

In [13]:
# 8. Evaluate base fine-tuned model on held-out test set
print("=== Fine-tuned (FP32) ===")
fp32_metrics = trainer.evaluate(ds_tok["test"])
print(fp32_metrics)

=== Fine-tuned (FP32) ===


{'eval_loss': 0.13930277526378632, 'eval_accuracy': 0.9782608695652174, 'eval_f1_macro': 0.9777455249153362, 'eval_runtime': 4.0344, 'eval_samples_per_second': 11.402, 'eval_steps_per_second': 1.487, 'epoch': 10.0}
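
With only 46 test emails, the reported accuracy of 0.978 corresponds to a single misclassified example (45 / 46), so the estimate carries wide uncertainty. As a rough illustration, the sketch below computes a 95 % Wilson score interval for that proportion; the choice of interval is mine, not something the notebook computes:

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

accuracy = 45 / 46  # one error on the 46-row test split
low, high = wilson_interval(45, 46)
print(f"accuracy={accuracy:.4f}, 95% interval=({low:.3f}, {high:.3f})")
```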


## Acceleration for CPU

In [14]:
# 9. Dynamic quantisation (CPU only)
if device.type == "cpu":
    from torch.quantization import quantize_dynamic
    quantized_model = quantize_dynamic(
        trainer.model, {torch.nn.Linear}, dtype=torch.qint8
    )
else:
    print("GPU detected – skipping quantisation (only CPU supported)")
    quantized_model = None

GPU detected – skipping quantisation (only CPU supported)
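
Since the quantisation branch did not run in this session, here is a plain-Python sketch of the idea behind int8 weight quantisation: store each weight as an 8-bit integer plus one shared scale, and multiply back when needed. This uses a simplified symmetric per-tensor scheme with made-up weights; PyTorch's `quantize_dynamic` kernels differ in detail, and `quantize_int8` / `dequantize` are hypothetical helpers:

```python
def quantize_int8(weights):
    """Map floats to int8 values using one symmetric scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-128, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]

weights = [0.50, -1.27, 0.03, 0.90]  # illustrative values only
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_weights, scale, max_error)
```

Each weight shrinks from 32 bits to 8, at the cost of a rounding error of at most half a scale step per weight.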


In [15]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
