# Email spam classification using Feedforward Neural Networks

---


In this project, our goal is to classify emails as either "spam" or "not spam" (also known as "ham") using a feedforward neural network. Email spam classification is a common problem in natural language processing, where the objective is to automatically detect and filter out unwanted or potentially harmful messages.

Usually I think we would reach for a simpler model, such as logistic regression or Naive Bayes, if we believe the relationship between the features and the output is roughly linear, or if the dataset is small and we worry about overfitting. But I would like to learn more about neural networks, so I will try a feedforward neural network and see whether it can capture more complex interactions between features, the kind of non-linear structure that makes neural networks popular for tasks such as image or speech recognition.


## Loading the dataset


In [1]:
import numpy as np
import pandas as pd
import torch 
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


df = pd.read_csv('email.csv')
display(df.head())

# Check for missing values
print(df.isnull().sum())

# Check unique labels
print(df['Category'].value_counts())
# The last row is a stray non-data line (it shows up as its own one-off category in the counts), so drop it
df = df.iloc[:-1]
df = df[['Category', 'Message']]  # keep only the columns we need




|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Category    0
Message     0
dtype: int64
Category
ham               4825
spam               747
{"mode":"full"       1
Name: count, dtype: int64


The dataset has **5572 rows** once the stray last row is dropped, and only two categories remain: 'ham' and 'spam'. Next we encode each message's category as a numeric value in a 'Label' column (0 = ham, 1 = spam), so that when we calculate the loss we know what the true value is.
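
To make the encoding step concrete, here is a minimal pure-Python sketch of what `LabelEncoder` does conceptually: it sorts the unique class names and maps each one to its index. The `categories` list is a made-up sample standing in for `df['Category']`, and `class_to_index` is a hypothetical helper name.

```python
# Toy sample standing in for df['Category'] (hypothetical values).
categories = ["ham", "ham", "spam", "ham", "spam"]

# LabelEncoder sorts the unique classes and uses each class's position as its label.
classes = sorted(set(categories))
class_to_index = {name: idx for idx, name in enumerate(classes)}

# Encode every category with its integer label.
labels = [class_to_index[c] for c in categories]

print(class_to_index)
print(labels)
```

Because 'ham' sorts before 'spam', ham becomes 0 and spam becomes 1, which matches the 'Label' column in the output.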


In [2]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Label'] = le.fit_transform(df['Category'])

display(df.head(10))
display(df.shape)


|   | Category | Message | Label |
|---|----------|---------|-------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |


(5572, 3)

## Split the data to training and testing data (80/20)


In [3]:
from sklearn.model_selection import train_test_split

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    df['Message'],     # input: raw messages
    df['Label'],       # target: 0 = ham, 1 = spam
    test_size=0.2,
    random_state=42
)

print("Training set:")
print(X_train[:3], X_train.shape)
print(y_train[:3], y_train.shape)

print("Testing set:")
print(X_test[:3], X_test.shape)
print(y_test[:3], y_test.shape)








Training set:
1978    Reply to win £100 weekly! Where will the 2006 ...
3989    Hello. Sort of out in town already. That . So ...
3935     How come guoyang go n tell her? Then u told her?
Name: Message, dtype: object (4457,)
1978    1
3989    0
3935    0
Name: Label, dtype: int64 (4457,)
Testing set:
3245    Squeeeeeze!! This is christmas hug.. If u lik ...
944     And also I've sorta blown him off a couple tim...
1044    Mmm thats better now i got a roast down me! i...
Name: Message, dtype: object (1115,)
3245    0
944     0
1044    0
Name: Label, dtype: int64 (1115,)


Here, X holds the raw messages (which will be converted to vectors in the next step), and y holds the labels (1 for spam, 0 for ham).
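
The split sizes in the output (4457 and 1115) follow from simple arithmetic on the 5572 rows. The sketch below reproduces only the sizes and the idea of shuffling indices before cutting them; it uses Python's `random` module, so the actual rows chosen will differ from what `train_test_split` with `random_state=42` picks.

```python
import math
import random

n_samples = 5572
test_size = 0.2

# With a fractional test_size, the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Shuffle the row indices, then cut them into two disjoint groups.
rng = random.Random(42)  # assumption: a different RNG stream from scikit-learn's
indices = list(range(n_samples))
rng.shuffle(indices)
test_idx = indices[:n_test]
train_idx = indices[n_test:]

print(n_train, n_test)
```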


## Vectorisation (Converting word to vectors/numerical values)


In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Convert to dense arrays: torch.tensor() cannot be built directly from a SciPy sparse matrix
X_train_dense = X_train_vec.toarray()
X_test_dense = X_test_vec.toarray()

# Layout of the training data (the values shown are illustrative):
#   sample 1:    [0.0042, 0.0042, 0.0042, ...] (3000 features/columns) -> 1 label (0 or 1)
#   sample 2:    [0.0042, 0.0042, 0.0042, ...] (3000 features/columns) -> 1 label (0 or 1)
#   sample 3:    [0.0042, 0.0042, 0.0042, ...] (3000 features/columns) -> 1 label (0 or 1)
#   ...
#   sample 4457: [0.0042, 0.0042, 0.0042, ...] (3000 features/columns) -> 1 label (0 or 1)
print(X_train_dense.shape, y_train.shape)







(4457, 3000) (4457,)
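
To show what the numbers in each row mean, here is a simplified pure-Python sketch of TF-IDF weighting with smoothed IDF and L2 normalisation, which are the `TfidfVectorizer` defaults. It is not a replacement for the vectoriser: it tokenises by whitespace only and skips lowercasing, stop-word removal, and the `max_features` cap. The three `docs` are made-up examples.

```python
import math
from collections import Counter

# Made-up mini corpus (hypothetical messages).
docs = [
    "free entry win prize",
    "ok see you later",
    "win free cash prize now",
]
tokenised = [d.split() for d in docs]
vocab = sorted({w for doc in tokenised for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(w for doc in tokenised for w in set(doc))

# Smoothed inverse document frequency: rarer words get a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Return the L2-normalised TF-IDF vector of one tokenised document."""
    tf = Counter(tokens)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenised]
print(len(vectors), len(vocab))
```

Each message becomes one row with one column per vocabulary word, and most entries are zero because a message only contains a few of the words.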


## Convert to tensors and batch


In [12]:
#convert to tensors
X_train_tensor = torch.tensor(X_train_dense, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)  # shape (N, 1)
X_test_tensor = torch.tensor(X_test_dense, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)    # shape (N, 1)

from torch.utils.data import TensorDataset, DataLoader
#combine tensors into a dataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
print(train_dataset[0])

#batch them up
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # no need to shuffle for evaluation

print(len(train_loader))
print(len(test_loader))

#we can see we've batched it into 70 training batches and 18 testing batches

for X_batch, y_batch in train_loader:
    print(X_batch.shape)
    print(y_batch.shape)  # should now be torch.Size([64, 1])
    break


(tensor([0., 0., 0.,  ..., 0., 0., 0.]), tensor([1.]))
70
18
torch.Size([64, 3000])
torch.Size([64, 1])
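
The batch counts of 70 and 18 can be checked by hand. The sketch below assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`); `batch_layout` is a hypothetical helper name.

```python
import math

def batch_layout(n_samples, batch_size):
    """Return (number of batches, size of the last batch) when the last partial batch is kept."""
    n_batches = math.ceil(n_samples / batch_size)
    last_batch = n_samples - (n_batches - 1) * batch_size
    return n_batches, last_batch

print(batch_layout(4457, 64))  # training set
print(batch_layout(1115, 64))  # test set
```

So the last training batch holds 41 samples and the last test batch holds 27, rather than the full 64.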


The network takes 3000 input features per message (each batch holds 64 samples with 3000 features each). The first hidden layer has 128 neurons (units) and the second hidden layer has 64.

The output layer has a single neuron, since this is binary classification. If we were doing something like letter prediction instead, we might use a 26-neuron output (A-Z).
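
As a rough sense of model size, the layer widths above determine the number of trainable parameters: each fully connected layer has one weight per input-output pair plus one bias per output. This is a back-of-the-envelope sketch; `count_parameters` is a hypothetical helper.

```python
# Layer widths: input, two hidden layers, output.
layer_sizes = [3000, 128, 64, 1]

def count_parameters(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each consecutive pair of layers."""
    return [n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:])]

per_layer = count_parameters(layer_sizes)
print(per_layer, sum(per_layer))
```

Almost all of the parameters sit in the first layer, because it connects every one of the 3000 TF-IDF features to 128 hidden units.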


## Model Architecture FFNN


In [None]:
import torch.nn as nn


class SpamClassifier(nn.Module):


    #input = n_features (3000)
    #hidden layers = n_hidden
    #output = n_output
    def __init__(self, n_features, n_hidden=(128, 64), n_output=1):
        #initialise the model (a tuple default avoids sharing one mutable list between instances)
        super().__init__()

        in_features = n_features
        layers = []

        #for each hidden layer, we add a linear layer and a ReLU activation function. 
        #Input layer will have 3000 features
        #1st hidden layer will have 128 features
        #2nd hidden layer will have 64 features

        for hidden_layer in n_hidden:
            layers.append(nn.Linear(in_features,hidden_layer))
            layers.append(nn.ReLU())
            in_features = hidden_layer

        #output layer will have 1 feature
        layers.append(nn.Linear(in_features,n_output))
        #use sigmoid rather than softmax because this is binary classification with a single output
        layers.append(nn.Sigmoid())


        #create a sequential model
        self.model = nn.Sequential(*layers)
        
    def forward(self,x):
        x = self.model(x)
        return x




## Set up model, optimiser, and loss function


In [43]:
import torch.optim as optim


#the number of features, read from the training tensor rather than a leftover loop variable
input_size = X_train_tensor.shape[1]
hidden_layers = [128,64]
output_size = 1

#create the model
ffnn_model = SpamClassifier(input_size,hidden_layers,output_size).to(device)

#loss function
loss_fn = nn.BCELoss() 

#optimiser
learning_rate = 0.001
optimiser = optim.AdamW(ffnn_model.parameters(),lr=learning_rate)
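
For intuition about what `nn.BCELoss` measures, here is a minimal pure-Python sketch of binary cross-entropy averaged over a batch. The `eps` clipping is my own guard against `log(0)`, not a description of PyTorch's internals, and `binary_cross_entropy` is a hypothetical helper name.

```python
import math

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)

confident = binary_cross_entropy([0.9, 0.1], [1, 0])
unsure = binary_cross_entropy([0.6, 0.4], [1, 0])
print(confident, unsure)
```

Confident correct predictions give a small loss, while predictions near 0.5 are penalised more. A common alternative design is to drop the final `nn.Sigmoid()` and use `nn.BCEWithLogitsLoss`, which applies the sigmoid inside the loss in a numerically more stable way.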



## Training Loop


In [44]:
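num_epochs = 50
for epoch in range(num_epochs):
    ffnn_model.train() #set the model to training mode
    running_loss = 0
    #reset the counters every epoch so the printed accuracy is per-epoch;
    #the log below came from a run where they were initialised once before the loop,
    #so the accuracy it reports is a running average over all epochs so far
    correct = 0
    total = 0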

    for X_batch,y_batch in train_loader:
        #move the batch to the selected device (GPU if available, otherwise CPU)
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)
        #forward pass
        outputs = ffnn_model(X_batch)
        loss = loss_fn(outputs,y_batch)


        #calculate accuracy
        predicted = (outputs > 0.5).float()
        correct += (predicted == y_batch).sum().item()
        total += y_batch.size(0)

        #backward pass
        #the goal here is to update the weights of the model by calculating the gradient of the loss function with respect to the weights

        optimiser.zero_grad()
        loss.backward()
        optimiser.step()

        running_loss += loss.item()

    train_accuracy = correct / total
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss/len(train_loader):.4f}, Accuracy: {train_accuracy*100:.2f}%")








Epoch 1/50, Loss: 0.4450, Accuracy: 86.58%
Epoch 2/50, Loss: 0.1240, Accuracy: 91.09%
Epoch 3/50, Loss: 0.0373, Accuracy: 93.73%
Epoch 4/50, Loss: 0.0162, Accuracy: 95.19%
Epoch 5/50, Loss: 0.0088, Accuracy: 96.11%
Epoch 6/50, Loss: 0.0055, Accuracy: 96.75%
Epoch 7/50, Loss: 0.0044, Accuracy: 97.20%
Epoch 8/50, Loss: 0.0037, Accuracy: 97.54%
Epoch 9/50, Loss: 0.0035, Accuracy: 97.81%
Epoch 10/50, Loss: 0.0027, Accuracy: 98.02%
Epoch 11/50, Loss: 0.0024, Accuracy: 98.20%
Epoch 12/50, Loss: 0.0021, Accuracy: 98.35%
Epoch 13/50, Loss: 0.0019, Accuracy: 98.47%
Epoch 14/50, Loss: 0.0019, Accuracy: 98.58%
Epoch 15/50, Loss: 0.0017, Accuracy: 98.67%
Epoch 16/50, Loss: 0.0017, Accuracy: 98.75%
Epoch 17/50, Loss: 0.0016, Accuracy: 98.83%
Epoch 18/50, Loss: 0.0016, Accuracy: 98.89%
Epoch 19/50, Loss: 0.0016, Accuracy: 98.95%
Epoch 20/50, Loss: 0.0015, Accuracy: 99.00%
Epoch 21/50, Loss: 0.0015, Accuracy: 99.04%
Epoch 22/50, Loss: 0.0015, Accuracy: 99.09%
Epoch 23/50, Loss: 0.0015, Accuracy: 99.1
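
One thing to keep in mind when reading a training log: if the `correct` and `total` counters are accumulated across epochs rather than reset, the printed accuracy is a running average that lags behind the true per-epoch accuracy. That would be consistent with the log above, where accuracy climbs slowly while the loss is already close to zero. The sketch below uses made-up counts to show the difference.

```python
# Hypothetical (correct, total) counts for three epochs.
epoch_counts = [(80, 100), (95, 100), (100, 100)]

per_epoch = []
cumulative = []
running_correct = 0
running_total = 0

for correct, total in epoch_counts:
    # Accuracy of this epoch alone.
    per_epoch.append(correct / total)
    # Accuracy when counters are never reset between epochs.
    running_correct += correct
    running_total += total
    cumulative.append(running_correct / running_total)

print(per_epoch)
print(cumulative)
```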

## Evaluation


In [50]:
ffnn_model.eval()  # set to evaluation mode
test_correct = 0
test_total = 0

with torch.no_grad():  
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        outputs = ffnn_model(X_batch)
        predicted = (outputs > 0.5).float()
        test_correct += (predicted == y_batch).sum().item()
        test_total += y_batch.size(0)

test_accuracy = test_correct / test_total
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

Test Accuracy: 98.83%
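
Because the classes are imbalanced, accuracy is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the majority class, using the test-set class counts from the classification report further down (966 ham, 149 spam).

```python
# Test-set class counts, taken from the 'support' column of the classification report.
support = {"ham": 966, "spam": 149}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy * 100, 2))
```

A model that labels everything as ham would already score roughly 86.6%, so the 98.83% above should be judged against that number rather than against 50%.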


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

# Collect all predictions and labels
y_preds = []
y_true = []

with torch.no_grad():
    for X_batch, y_batch in test_loader:
        X_batch = X_batch.to(device)
        y_batch = y_batch.to(device)

        outputs = ffnn_model(X_batch)
        predicted = (outputs > 0.5).float()

        #flatten (batch, 1) -> (batch,) so the report receives 1-D label arrays
        y_preds.extend(predicted.cpu().numpy().ravel())
        y_true.extend(y_batch.cpu().numpy().ravel())

print(classification_report(y_true, y_preds, target_names=['Ham', 'Spam']))


              precision    recall  f1-score   support

         Ham       0.99      0.99      0.99       966
        Spam       0.96      0.95      0.96       149

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
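
For reference, the precision, recall, and F1 columns are derived from counts of true positives, false positives, and false negatives for each class. The counts below are made up for illustration and are not the counts behind the report above; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical spam counts: 9 caught, 1 ham wrongly flagged, 3 spam missed.
precision, recall, f1 = precision_recall_f1(tp=9, fp=1, fn=3)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

For a spam filter, precision reflects how often a flagged message really is spam, while recall reflects how much of the spam gets caught.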



## Save the model

In [None]:
import torch
import pickle

torch.save(ffnn_model.state_dict(), 'ffnn_model.pth')
with open('tfidf_vectoriser.pkl', 'wb') as file:
    pickle.dump(vectorizer, file)




