<a href="https://colab.research.google.com/github/gkementzidis/spam_email_LSTM_LM/blob/main/spam_detection_LLM_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report

import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import get_scheduler
from tqdm import tqdm

from transformers import GPT2ForSequenceClassification, GPT2Tokenizer, TrainingArguments, Trainer

In [None]:
!pip install evaluate

In [None]:
import evaluate
from google.colab import drive
drive.mount("/content/drive")

## Load the data

In [None]:
# loading data
data = pd.read_csv('/content/drive/MyDrive/emails.csv')
print(data.shape)
data.head()

### Class imbalance

A future version of this notebook will address class imbalance and ways to mitigate its adverse effects. For now, we only inspect the label counts.

In [None]:
data['spam'].value_counts()
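
One common way to act on the counts above is to weight each class inversely to its frequency. The sketch below is only an illustration of that idea, not something this notebook currently does: the helper name `balanced_class_weights` and the toy label list are hypothetical, and the formula `n_samples / (n_classes * class_count)` is the usual "balanced" heuristic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical toy labels (0 = ham, 1 = spam); these are NOT the counts from emails.csv
toy_labels = [0] * 75 + [1] * 25
weights = balanced_class_weights(toy_labels)
print(weights)
```

Weights like these could later be passed to a weighted loss (for example, the `weight` argument of `torch.nn.CrossEntropyLoss`) so that mistakes on the rarer class cost more.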

### Preprocessing

In contrast with the LSTM version, I chose to keep many stopwords in the text, since the attention mechanism learns how much weight each token deserves. Stopwords such as "and" or "through" can carry significant meaning and help avoid misreadings. I also do not lowercase the text: the GPT-2 tokenizer is case-sensitive and handles upper-case letters directly.

In [None]:
###

## Tokenizer

We are using the GPT-2 tokenizer. GPT-2 does not define a padding token, so the end-of-sequence token is reused for padding.

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

### Prepare the data

In [None]:
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize text
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(label, dtype=torch.long)
        }
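
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified pure-Python sketch of what happens to a single sequence of token ids. It is an illustration under assumptions, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the short token-id lists are made up, and right-side padding and truncation are assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Bring one sequence to exactly max_length tokens and build its attention mask."""
    ids = token_ids[:max_length]          # truncation: drop tokens beyond max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad          # padding: fill up on the right
    mask = mask + [0] * n_pad             # 0 marks a padding position
    return ids, mask

PAD_ID = 50256  # GPT-2's end-of-text token id, reused as the padding token here

short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which matters here because every email is padded to the tokenizer's maximum length.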

In [None]:
# Create dataset
dataset = CustomDataset(
    texts=data["text"].tolist(),
    labels=data["spam"].tolist(),
    tokenizer=tokenizer,
    max_length=tokenizer.model_max_length
)

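# train-test split (80/20), seeded so the split is reproducible across runs
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size

train_dataset, test_dataset = torch.utils.data.random_split(
    dataset, [train_size, test_size], generator=torch.Generator().manual_seed(42)
)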

# DataLoader for batching
# train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True)
# test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)

## GPT-2 Model

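GPT-2 stands for "Generative Pre-trained Transformer 2". Developed by OpenAI, it is a decoder-only transformer language model. `GPT2ForSequenceClassification` adds a classification head on top of it; the head reads the last non-padding token of each sequence, which is why `pad_token_id` must be set in the model config.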

In [None]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.eos_token_id

### Evaluation metrics

Accuracy, precision, recall, and F1-score are typical metrics for classification tasks. On imbalanced datasets in particular, accuracy alone is not a good indicator of how well the model is trained.

In [None]:
# Load metrics
accuracy = evaluate.load("accuracy")
precision = evaluate.load("precision")
recall = evaluate.load("recall")
f1 = evaluate.load("f1")

# Compute multiple metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    # Calculate each metric
    acc = accuracy.compute(predictions=predictions, references=labels)
    prec = precision.compute(predictions=predictions, references=labels, average="weighted")
    rec = recall.compute(predictions=predictions, references=labels, average="weighted")
    f1_score = f1.compute(predictions=predictions, references=labels, average="weighted")

    # Return a dictionary of all metrics
    return {
        "accuracy": acc["accuracy"],
        "precision": prec["precision"],
        "recall": rec["recall"],
        "f1": f1_score["f1"]
    }
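
One detail of `average="weighted"` is worth seeing on a small example: the weighted-average recall is always equal to accuracy, so it can hide poor recall on the spam class. The sketch below uses made-up labels and a hypothetical helper `recall_for_class`; it is a hand computation for illustration, not a replacement for the `evaluate` metrics above.

```python
def recall_for_class(y_true, y_pred, cls):
    """Fraction of examples whose true label is `cls` that were predicted as `cls`."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    support = sum(1 for t in y_true if t == cls)
    return true_positives / support

# Hypothetical toy data: 90 ham (0) and 10 spam (1).
# The imagined model labels every ham email correctly but catches only 2 of 10 spam.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 2 + [0] * 8

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_ham = recall_for_class(y_true, y_pred, 0)
recall_spam = recall_for_class(y_true, y_pred, 1)

# Weighted average: each class's recall weighted by its share of the data
weighted_recall = (90 / 100) * recall_ham + (10 / 100) * recall_spam

print(f"accuracy        = {accuracy:.2f}")
print(f"weighted recall = {weighted_recall:.2f}")
print(f"spam recall     = {recall_spam:.2f}")
```

If the spam class is the one of interest, reporting its precision, recall, and F1 separately (for instance with `average="binary"` or `average=None` in the underlying scikit-learn metrics) may be more informative than the weighted averages.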

### Training setup

I am using a very small batch size (1) due to compute limitations. By default, `Trainer` uses the AdamW optimizer with a linear learning-rate schedule that decays from the learning rate we specify. With two labels and integer targets, the loss is cross-entropy over the two class logits, the gold standard for classification tasks.
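
With `num_labels=2` and integer class labels, the loss on one example is a softmax cross-entropy over the two logits the classification head produces. The following is a minimal pure-Python sketch of that quantity; `softmax_cross_entropy` is a hypothetical helper and the logits are made up for illustration.

```python
import math

def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Made-up logits in the order [ham, spam]
logits = [2.0, -1.0]

loss_if_ham = softmax_cross_entropy(logits, 0)    # model agrees with the label: small loss
loss_if_spam = softmax_cross_entropy(logits, 1)   # model disagrees with the label: large loss
loss_uniform = softmax_cross_entropy([0.0, 0.0], 1)  # no preference: log(2)

print(loss_if_ham, loss_if_spam, loss_uniform)
```

The loss grows with how confidently the model favors the wrong class, which is what pushes the logits toward the correct label during fine-tuning.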

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="test_trainer",
    eval_strategy="epoch",  # Evaluate at the end of each epoch
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=2,
    save_strategy="epoch",  # Save the model every epoch
    learning_rate=5e-5,
    logging_dir="./logs",
    logging_steps=50,  # Log every 50 steps
)

# Define trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # 80% split created above
    eval_dataset=test_dataset,  # held-out 20% split, evaluated at the end of each epoch
    tokenizer=tokenizer,  # saved with checkpoints and used by the default data collator
    compute_metrics=compute_metrics  # accuracy, precision, recall, and F1
)
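
The linear schedule mentioned above can be sketched in a few lines. This is a simplified illustration assuming no warmup steps (none are set in the training arguments); `linear_lr` is a hypothetical helper and the total of 1000 steps is a made-up number, since the real value depends on the dataset size, batch size, and number of epochs.

```python
def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Learning rate at a given step: linear warmup (if any), then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

BASE_LR = 5e-5       # learning_rate from the training arguments
TOTAL_STEPS = 1000   # hypothetical; the real total is computed by the Trainer

lr_start = linear_lr(0, TOTAL_STEPS, BASE_LR)
lr_middle = linear_lr(500, TOTAL_STEPS, BASE_LR)
lr_end = linear_lr(1000, TOTAL_STEPS, BASE_LR)

print(lr_start, lr_middle, lr_end)
```

The learning rate starts at the configured value and reaches zero at the final training step, so the largest updates happen early in fine-tuning.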

In [None]:
if torch.cuda.is_available():
    print(f"GPU is available: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU found.")

In [None]:
if "COLAB_TPU_ADDR" in os.environ:
    print("TPU is available!")
else:
    print("No TPU found.")

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
model.to(device)

### Training the model

In [None]:
# Train the model
trainer.train()

In [None]:
trainer.evaluate()

I will address the results in more detail in the future. So far, however, they look much better than those of the LSTM model I designed a few months ago.