# Spam Classification using Encoder LLMs with Linear Probing [5 points]
In this part, we will use encoder Large Language Models (LLMs) for spam classification. We will leverage the rich features of pre-trained LLMs without fine-tuning them. Instead, we will freeze the LLM weights and train a lightweight classifier head (MLP) on top for spam classification.

**Dataset:** Enron Spam Dataset

**Expected Performance (Best Model):** {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}

1. Load the Enron Spam dataset. Use the train/val/test splits and tokenize the text using your pre-trained LLM’s tokenizer. Use your best judgement for the relevant input fields.

In [1]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split

dataset = load_dataset("SetFit/enron_spam")
df = dataset["train"].to_pandas()

train_df, val_df = train_test_split(df, test_size=0.1, stratify=df["label"], random_state=42)

dataset = DatasetDict({
    "train": dataset["train"].select(train_df.index.tolist()),
    "validation": dataset["train"].select(val_df.index.tolist()),
    "test": dataset["test"]
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])

Output (download progress bars omitted): the source train split has 31,716 examples and the test split has 2,000. After the 90/10 split, tokenization maps 28,544 train, 3,172 validation, and 2,000 test examples.
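The split sizes in the output follow from holding out 10% of the 31,716 training examples. The sketch below checks that arithmetic, assuming the held-out count is rounded up and the remainder is used for training; the helper name `split_sizes` is made up for this illustration.

```python
import math

def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest goes to
    training, which reproduces the sizes shown in the cell output.
    """
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_train, n_val = split_sizes(31716, 0.1)
print(n_train, n_val)
```

Under this assumption the result is 28,544 training and 3,172 validation examples, which agrees with the mapped counts.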

2. Model Setup – Probing:

   a. Load a pre-trained LLM (e.g., DistilBERT, BART-encoder) for sequence classification. Choose a lightweight encoder model that is amenable to your GPU size. Consider using DistilBERT, TinyBERT, MobileBERT, ALBERT, or others. **Specify the chosen LLM below.**

   **Chosen Encoder LLM:** <span style='color:green'>DistilBERT</span>

In [2]:
from transformers import AutoModel

base_model = AutoModel.from_pretrained("distilbert-base-uncased").to("cuda")


   b. Freeze all base model weights and attach a lightweight MLP (the classification head) that maps the model’s representations to binary labels. You may want to create a separate model class that defines these components and a forward function or use out of the box 🤗 classification wrappers.

   c. Use the [CLS] token if available or mean-pooled final hidden states from the LLM as input to your classifier head.

In [14]:
import torch.nn as nn

# Freeze the encoder so that only the classifier head receives gradient updates.
for param in base_model.parameters():
    param.requires_grad = False

class LLMClassifier(nn.Module):
    """Frozen encoder followed by a small MLP classification head."""

    def __init__(self, base_model, hidden_size=768, num_labels=2):
        super().__init__()
        self.base_model = base_model
        # Lightweight head: hidden_size -> 256 -> num_labels
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels)
        )
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.base_model(input_ids=input_ids, attention_mask=attention_mask)
        # Representation at position 0, i.e. the [CLS] token
        cls_output = outputs.last_hidden_state[:, 0]
        logits = self.classifier(cls_output)

        # Return a dict so that Trainer can read "loss" and "logits"
        if labels is not None:
            loss = self.loss_fn(logits, labels)
            return {"loss": loss, "logits": logits}
        return {"logits": logits}
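The class above reads the representation at the [CLS] position. Step c also allows mean-pooled final hidden states; a minimal sketch of that alternative follows. The function name `mean_pool` and the toy tensors are assumptions for illustration and are not part of the original notebook.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over real (non-padding) positions.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # guard against empty rows
    return summed / counts

# Toy check: the second sequence has one padding position that must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0]],
                       [[5.0, 6.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
pooled = mean_pool(hidden, mask)
print(pooled)
```

To try it in `LLMClassifier.forward`, the line that computes `cls_output` could be swapped for `mean_pool(outputs.last_hidden_state, attention_mask)`. Because the inputs here are padded to 512 tokens, masking matters: an unmasked mean would be dominated by padding positions.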

3. Configure your training parameters (learning rate, batch size, epochs) and train the model using only the classifier head while the LLM remains frozen.
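To make concrete how little is trained in this setup, the sketch below counts the parameters of the MLP head defined above (768 → 256 → 2). The helper `linear_params` is a hypothetical name used only for this arithmetic.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# Head from LLMClassifier: Linear(768, 256) -> ReLU -> Linear(256, 2)
head_params = linear_params(768, 256) + linear_params(256, 2)
print(head_params)  # 197378
```

DistilBERT-base has roughly 66 million parameters according to its model card, so the trainable head is well under 1% of the full model.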

In [15]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {"accuracy": acc, "precision": precision, "recall": recall, "f1": f1}

In [16]:
!pip install -U transformers


In [21]:
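from transformers import Trainer, TrainingArguments

model = LLMClassifier(base_model)

# Note: eval_steps only takes effect when a step-based evaluation strategy is
# also set; as written, evaluation runs only when trainer.evaluate() is called.
# The DistilBERT log below reports epoch 5.0 with loss logged every 10 steps,
# so that run appears to predate this version of the cell (3 epochs, logging
# every 100 steps), which is the configuration used for the ALBERT run.
training_args = TrainingArguments(
    output_dir="./results",
    do_eval=True,
    eval_steps=500,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    learning_rate=2e-4,
    logging_dir='./logs',
    logging_steps=100,
    report_to="none"
)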

4. Evaluation and Analysis:

In [18]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)

trainer.train()



Step,Training Loss
10,0.6778
20,0.6201
30,0.5552
40,0.4834
50,0.4439
60,0.3925
70,0.3601
80,0.2932
90,0.2576
100,0.2793




TrainOutput(global_step=4460, training_loss=0.09152704664409962, metrics={'train_runtime': 1346.2326, 'train_samples_per_second': 106.014, 'train_steps_per_second': 3.313, 'total_flos': 0.0, 'train_loss': 0.09152704664409962, 'epoch': 5.0})
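The logged `global_step=4460` at `epoch 5.0` can be cross-checked against the 28,544 training examples. The sketch below is plain arithmetic, not Trainer internals; `total_steps` is a hypothetical helper, and it assumes the last partial batch of each epoch is kept.

```python
import math

def total_steps(n_examples: int, effective_batch: int, epochs: int) -> int:
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / effective_batch) * epochs

n_train = 28544
print(total_steps(n_train, 16, 5))  # 8920
print(total_steps(n_train, 32, 5))  # 4460, matches global_step in the log
```

The log is therefore consistent with an effective batch size of 32 over 5 epochs. One possible explanation is two devices at 16 examples each; another is a different batch size in an earlier version of the configuration cell. Both are assumptions, since the notebook does not state the hardware.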

   a. Evaluate the model on the test set using accuracy, precision, recall, and F1-score.

In [19]:
trainer.evaluate(tokenized_datasets["test"])



{'eval_loss': 0.05272546038031578,
 'eval_accuracy': 0.9795,
 'eval_precision': 0.9820538384845464,
 'eval_recall': 0.9771825396825397,
 'eval_f1': 0.9796121332670313,
 'eval_runtime': 15.7191,
 'eval_samples_per_second': 127.234,
 'eval_steps_per_second': 1.018,
 'epoch': 5.0}
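The reported test metrics can be tied back to a confusion matrix. The counts below are inferred from the reported values on the 2,000 test examples and are not printed by the notebook; `metrics_from_counts` is a hypothetical helper. The positive class is label 1, which `average='binary'` treats as positive by default.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts inferred from the reported metrics (an assumption, not notebook output).
distilbert = metrics_from_counts(tp=985, fp=18, fn=23, tn=974)
print({name: round(value, 4) for name, value in distilbert.items()})
```

Read this way, the model makes 41 errors on the test set: 18 false positives and 23 false negatives.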

   b. Select **two** encoder LLMs, repeat steps 2-4 for the second LLM, and compare and discuss any performance trends between the two models. **Specify the second chosen LLM below and report performance comparison.**

   **Second Chosen Encoder LLM:** <span style='color:green'>ALBERT-base-v2</span>

In [23]:
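from transformers import AutoModel

# Caveat: tokenized_datasets was built with the DistilBERT tokenizer in step 1.
# ALBERT uses its own SentencePiece vocabulary, so these input IDs do not line
# up with ALBERT's embedding table; re-tokenizing with the "albert-base-v2"
# tokenizer is needed for a like-for-like comparison.
base_model2 = AutoModel.from_pretrained("albert-base-v2").to("cuda")

# Freeze the second encoder as well
for param in base_model2.parameters():
    param.requires_grad = False

model2 = LLMClassifier(base_model2).to("cuda")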

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

In [24]:
trainer2 = Trainer(
    model=model2,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics
)

trainer2.train()
results2 = trainer2.evaluate(tokenized_datasets["test"])
print(results2)



Step,Training Loss
100,0.7082
200,0.6894
300,0.6918
400,0.679
500,0.6915
600,0.6731
700,0.6942
800,0.6713
900,0.6654
1000,0.6646




{'eval_loss': 0.622987687587738, 'eval_accuracy': 0.673, 'eval_precision': 0.6603260869565217, 'eval_recall': 0.7232142857142857, 'eval_f1': 0.6903409090909091, 'eval_runtime': 34.0664, 'eval_samples_per_second': 58.709, 'eval_steps_per_second': 0.47, 'epoch': 3.0}
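The ALBERT run above consumes `tokenized_datasets`, which was produced by the DistilBERT tokenizer. A like-for-like run would tokenize the text with ALBERT's own tokenizer first. The sketch below shows one way to bind the mapping function to a given tokenizer; `make_tokenize_fn` and `tokenized_albert` are hypothetical names, and the commented usage is not executed here because it needs the model download.

```python
def make_tokenize_fn(tokenizer, max_length: int = 512):
    """Build a batched map function bound to one model's tokenizer."""
    def tokenize_function(batch):
        return tokenizer(
            batch["text"],
            truncation=True,
            padding="max_length",
            max_length=max_length,
        )
    return tokenize_function

# Intended use in the notebook (assumes `dataset` is the DatasetDict from step 1):
#   from transformers import AutoTokenizer
#   albert_tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
#   tokenized_albert = dataset.map(make_tokenize_fn(albert_tokenizer), batched=True)
#   tokenized_albert.set_format("torch", columns=["input_ids", "attention_mask", "label"])
```

Passing `tokenized_albert` to `trainer2` in place of `tokenized_datasets` would remove the vocabulary mismatch. The metrics reported above were not produced this way.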


   **Performance Comparison and Trend Discussion:**

Performance Comparison:

| Model           | Accuracy | Precision | Recall | F1 Score |
|----------------|----------|-----------|--------|----------|
| DistilBERT      | 97.95%   | 98.21%    | 97.72% | 97.96%   |
| ALBERT-base-v2  | 67.30%   | 66.03%    | 72.32% | 69.03%   |

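Discussion:

DistilBERT outperformed ALBERT-base-v2 on every metric, by roughly 25 to 32 percentage points (97.95% vs. 67.30% accuracy, 97.96% vs. 69.03% F1).

This gap should not be read as a pure architecture effect. The ALBERT run reused inputs tokenized with the DistilBERT tokenizer, so the token IDs did not match ALBERT's vocabulary. Its training loss stayed between about 0.71 and 0.66 over the first 1,000 steps, close to the ln 2 ≈ 0.693 of a chance-level binary classifier, which is consistent with the frozen encoder receiving largely meaningless input.

ALBERT's cross-layer parameter sharing, which trades representational capacity for a smaller model, may also make its frozen features less useful to a small classifier head, but that effect cannot be separated from the tokenizer mismatch in this experiment.

DistilBERT, a distilled version of BERT, keeps a good trade-off between speed and performance, and its frozen [CLS] representation was sufficient for strong results here.

The two runs also differ in training length (3 epochs for ALBERT vs. 5 for DistilBERT). Additional training might improve ALBERT's results, but a fair comparison would first re-tokenize the data with ALBERT's tokenizer and train both models with the same settings.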

   c. The best model is expected to attain {Accuracy: >85%, F1: >85%, Precision: >85%, Recall: >82%}. Report whether your best model achieves these metrics and discuss.

   **Performance vs. Expected Metrics Discussion:**

Best Model: DistilBERT-base-uncased

| Metric    | Expected | Achieved |
|-----------|----------|----------|
| Accuracy  | >85%     | 97.95%   |
| F1 Score  | >85%     | 97.96%   |
| Precision | >85%     | 98.21%   |
| Recall    | >82%     | 97.72%   |

Discussion:

The DistilBERT model exceeds every expected threshold, by roughly 13 to 16 percentage points.

It separates spam from ham in the Enron dataset effectively without any fine-tuning of the encoder, which shows that frozen pre-trained representations paired with a lightweight classifier head carry enough signal for this task.

This suggests that a lightweight encoder such as DistilBERT can be sufficient for this kind of classification task at modest compute cost, since only the classifier head is trained.
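The comparison against the expected thresholds can also be written as a short check. The values are the rounded test metrics reported above; the variable names are illustrative only.

```python
thresholds = {"accuracy": 0.85, "f1": 0.85, "precision": 0.85, "recall": 0.82}
achieved = {"accuracy": 0.9795, "f1": 0.9796, "precision": 0.9821, "recall": 0.9772}

# Margin over each threshold, in percentage points
margins = {name: round((achieved[name] - thresholds[name]) * 100, 2) for name in thresholds}
meets_all = all(achieved[name] > thresholds[name] for name in thresholds)
print(meets_all, margins)
```

Recall has the largest margin because its threshold (82%) is lower than the other three.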

5. References. Include details on all the resources used to complete this part.

- [Hugging Face Transformers Documentation](https://huggingface.co/docs/transformers/index)
- [DistilBERT Model Card](https://huggingface.co/distilbert-base-uncased)
- [ALBERT Model Card](https://huggingface.co/albert-base-v2)
- [SetFit Enron Spam Dataset](https://huggingface.co/datasets/SetFit/enron_spam)
- [Hugging Face Datasets Documentation](https://huggingface.co/docs/datasets)
- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [Scikit-learn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)
