## Spam Detector Demonstration (Sentiment Classification)
This model performs spam detection on emails, classifying them as spam or not spam.

Dataset from Kaggle: https://www.kaggle.com/datasets/nitishabharathi/email-spam-dataset

Contact: rohan11parekh@gmail.com

### Imports

In [85]:
import numpy as np
import tensorflow as tf
import nltk
import re
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F

In [86]:
# Setting Pytorch device
device = torch.device("cuda")
device

device(type='cuda')

### Data preparation
Loading the csv into a pandas dataframe

In [87]:
data = pd.read_csv('datasets/Emails.csv')

In [88]:
data.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1


Visualizing the ratio of spam to non-spam emails

In [89]:
print("Spam:", data['Label'].value_counts()[1] )
print("Not spam:", data['Label'].value_counts()[0])

Spam: 1896
Not spam: 4150


Note: 1 = spam, 0 = legit

Cleaning the dataset of unused columns, null values, etc.

In [90]:
value_to_remove = 'empty'
df = data[~data.apply(lambda row: value_to_remove in row.values, axis=1)]
df.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1


In [91]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,3676,use Perl Daily Headline MailerThis Week on per...,0
1,1422,"Easy to make ""Between $200,000 and $500,000 ev...",1
2,4170,"URL: http://www.newsisfree.com/click/-5,855353...",0
3,5924,"\nBuyer's Alert | July 18, 2002When ordering, ...",0
4,5773,"\nForwarded-by: Nev Dull \nForwarded-by: ""Simo...",0


In [92]:
print("Spam:", df['Label'].value_counts()[1])
print("Not spam:", df['Label'].value_counts()[0])

Spam: 1561
Not spam: 3952


In [93]:
temp = df['Body'].to_list()
emails = np.array(temp)
emails[0]



Defining methods to clean emails of stopwords, punctuation, etc. and tokenizing them.

In [94]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [95]:
def clean(text):
    text = text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')

    # Using RegEx to remove URLs, punctuation, newline characters
    out = re.sub(r'(https?://\S+|www\.\S+)|[^a-zA-Z\s]', ' ', text)
    out = out.lower()
    out = " ".join(out.split())
    
    # Tokenize and remove stop words
    word_tokens = word_tokenize(out)
    output = [w for w in word_tokens if not w.lower() in stop_words]
    return output

def clean_list(s):
    out = []
    for item in s:
        out.append(clean(item))
    return out

In [96]:
cleaned_emails = clean_list(emails)
cleaned_emails[30][:10]

['winner',
 'dear',
 'traveler',
 'congratulations',
 'may',
 'one',
 'lucky',
 'winners',
 'may',
 'spending']

Creating a vocabulary of all words and their frequencies using Counter()

In [97]:
def count_words(email_list):
    count = Counter()
    for email in email_list:
        for word in email:
            count[word.lower()] += 1
    return count

word_dict = count_words(cleaned_emails)

In [98]:
# Clearing variables for memory
del stop_words
del stopwords

Loading pretrained word2vec embeddings

In [99]:
from gensim.models import KeyedVectors
word_to_index = {"<UNK>": 0, **{word: idx + 1 for idx, word in enumerate(word_dict.keys())}}

word2vec_path = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [100]:
# Initialize the embedding matrix with Word2Vec vectors or random vectors
embedding_dim = 300 

vocab_size = len(word_to_index)

In [101]:
vocab_size

60691

In [102]:
# Initialize embedding matrix with random values
embedding_matrix = np.random.normal(0, 1, (vocab_size, embedding_dim))

# Fill embedding matrix with Word2Vec embeddings
for word, idx in word_to_index.items():
    if word in word2vec:
        embedding_matrix[idx] = word2vec[word]
    else:
        embedding_matrix[idx] = np.random.normal(0, 1, embedding_dim)  # For unknown words


In [103]:
# Convert embedding matrix to PyTorch tensor
embedding_matrix = torch.tensor(embedding_matrix, dtype=torch.float32)

In [104]:
embedding_matrix[0][0:100]

tensor([-1.1238, -0.1111,  1.4762, -2.3536, -1.6161,  0.3414, -0.3679, -2.2525,
        -0.4218,  1.8223, -0.5226,  0.1847,  0.7273,  0.8440,  2.3363, -0.3953,
         0.3349, -0.7453, -0.4501, -2.7956,  1.4428,  0.1992, -1.3935,  0.7688,
        -0.6857, -0.0527,  0.6928, -0.5977,  0.9287, -1.4980,  0.3672, -0.9094,
         0.7140, -0.2357, -0.5076, -1.1802,  0.1182, -1.5683,  0.0143,  2.5732,
        -0.6733, -1.8206, -1.0340,  0.0183,  0.1113,  0.0449, -0.2892, -0.1344,
        -0.5470, -0.8952,  0.3009,  1.7594,  1.1873, -0.9170, -2.3336, -1.0710,
         1.4270, -0.0797, -0.2102,  1.2017, -0.5514,  2.0145,  0.0673, -0.2513,
        -0.3447,  0.8609,  0.5684, -0.6716,  0.7632, -0.7360, -0.3928, -1.2352,
         0.5281, -0.3512,  0.8552, -2.3752, -1.6317,  1.0024,  0.4473,  0.4650,
        -0.3945, -0.1444,  0.5996, -0.3501,  0.1139, -0.6309, -0.7605,  1.1564,
         0.1849,  2.3943, -0.1983,  0.3115,  0.1366, -1.1992,  0.8544, -0.2485,
         1.0890,  0.9571, -0.9503,  0.53

In [105]:
# Accounting for unknown words
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence

vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<UNK>"]

for text in cleaned_emails:
    for word in text:
        _ = vocab[word]

Now the text needs to be converting into tensors so they can be fed into the model. To do this I define a method called text_to_tensor

In [106]:
def text_to_tensor(text):
    indices = [vocab.get(word, UNK) for word in text]  # Convert words to indices
    return torch.tensor(indices, dtype=torch.long)

In [107]:
temp_tensor = []
for text in cleaned_emails:
    temp_tensor.append(text_to_tensor(text))
temp_tensor[0]

tensor([ 1,  2,  3,  4,  5,  6,  2,  7,  8,  9, 10, 11,  8, 12, 13, 14, 15, 16,
        17, 18, 19,  1,  2, 20, 21, 22,  1,  2, 23, 22, 24, 25, 26, 27, 28, 29,
        30, 24, 25])

In [108]:
# Padding the train and test sets so all the inputs are equal length 
emails_pad = [(seq[:500]) for seq in temp_tensor]
emails_pad = pad_sequence(emails_pad, batch_first = True)

In [109]:
# Clearing memory
del cleaned_emails
del data
del temp
del temp_tensor
del word_to_index
del vocab
del text

In [110]:
len(emails_pad)

5513

In [111]:
labels = np.array(df['Label'])
labels

array([0, 1, 0, ..., 1, 0, 0], dtype=int64)

In [112]:
len(labels)

5513

Now that the data is in a readable format, it can be split into train and test sets

In [113]:
X_train = torch.tensor(emails_pad.clone().detach()[:5250])
X_test = torch.tensor(emails_pad.clone().detach()[5250:])
Y_train = torch.tensor(labels[:5250])
Y_test = torch.tensor(labels[5250:])

  X_train = torch.tensor(emails_pad.clone().detach()[:5250])
  X_test = torch.tensor(emails_pad.clone().detach()[5250:])


In [114]:
print(X_train.shape)
print(Y_train.shape)

torch.Size([5250, 500])
torch.Size([5250])


In [115]:
Y_train[100:200]

tensor([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,
        0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0])

In [116]:
labels[100:200]

array([0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=int64)

In [117]:
df['Label'].shape

(5513,)

In [118]:
# Pytorch Dataloaders
import torch.utils.data as data_utils
from torch.utils.data import DataLoader
train = data_utils.TensorDataset(X_train, Y_train)
train_loader = DataLoader(train, batch_size=16, shuffle=True)
test = data_utils.TensorDataset(X_test, Y_test)
test_loader = DataLoader(test, batch_size=16, shuffle=True, drop_last=True)

In [119]:
# Listing the top 5 most common words
word_dict.most_common(5)

[('list', 4431), ('one', 3907), ('e', 3779), ('get', 3697), ('email', 3585)]

In [120]:
NUM_EPOCHS = 10
LEARNING_RATE = .01

### Defining the model using PyTorch

In [121]:
# Model Definition
class SpamDetector(nn.Module):
    def __init__(self):
        super(SpamDetector, self).__init__()
        self.emb = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.LSTM = nn.LSTM(300, 200, batch_first=True)
        self.LSTM2 = nn.LSTM(200, 300, batch_first=True)
        self.fc1 = nn.Linear(300, 1000)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(1000, 1)
        
    def forward(self, x):
        x = self.emb(x)
        x, _ = self.LSTM(x)
        x, _ = self.LSTM2(x)
        x = x[:, -1, :]  # Taking the last hidden state
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.fc2(x) 
        return x
    # I don't return it with sigmoid because BCEWithLogitsLoss expects raw logits

In [122]:
# Clearing memory
del df
del test
del train

### Training Loop (10 epochs)

In [123]:
# Instantiate Model, Loss Function, and Optimizer
model = SpamDetector().to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.FloatTensor([4150/1896]).to(device)) 
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

model.train()

for epoch in range(NUM_EPOCHS):

    accuracy = 0
    avg_loss = 0
    correct = 0
    total = 0

    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.float()
        # Forward Pass
        outputs = model(inputs)
        loss = criterion(outputs[:,0], labels)
        
        # Backward and Optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Get predictions and compute accuracy
        outputs = torch.tensor((outputs[:, 0].clone().detach() >= 0.5).float())
        total += labels.size(0)  # Total number of labels
        correct += (outputs == labels).sum().item()  # Count correct predictions
        
        avg_loss += loss.item()
        
    # Calculate and print the average loss and accuracy for the epoch
    avg_loss /= len(train_loader)
    accuracy = 100 * correct / total

    print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')

print('Finished Training')

  outputs = torch.tensor((outputs[:, 0].clone().detach() >= 0.5).float())


Epoch [1/10], Loss: 1.2886, Accuracy: 79.89%
Epoch [2/10], Loss: 0.1514, Accuracy: 97.33%
Epoch [3/10], Loss: 0.2210, Accuracy: 98.11%
Epoch [4/10], Loss: 2.0757, Accuracy: 96.78%
Epoch [5/10], Loss: 1.0634, Accuracy: 96.10%
Epoch [6/10], Loss: 0.4247, Accuracy: 98.19%
Epoch [7/10], Loss: 0.4943, Accuracy: 97.85%
Epoch [8/10], Loss: 0.4954, Accuracy: 98.10%
Epoch [9/10], Loss: 0.3628, Accuracy: 98.88%
Epoch [10/10], Loss: 0.3679, Accuracy: 99.20%
Finished Training


### Running model on test set
Outputs printed for proof. Note: You will have to scroll a bit.

In [124]:
# Evaluation on Test Data
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)  # Move to device
        
        outputs = model(inputs)
        labels = labels.float()
        # Move tensors to CPU and convert to numpy arrays
        outputs = outputs.cpu().numpy()
        labels = labels.cpu().numpy()
        
        predictions = [1.0 if value >= 0.5 else 0.0 for value in outputs]

        total += labels.shape[0]
        correct += (predictions == labels).sum().item()
        
        # Iterate over each sample in the batch
        for i in range(len(predictions)):
            print(f'Predicted: {predictions[i]} | Actual: {labels[i]}')

print(f'Accuracy on test set: {100 * correct / total:.2f}%')

Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 1.0
Predicted: 1.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0

### Displaying two individual results

In [125]:
emails_pad = emails_pad.to(device)
with torch.no_grad():
    out = model(emails_pad[12].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out)) 
    print(emails[12]) # Printing first spam email without fishy links for safety
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[12])

Snoring problems? Let Snore
Eliminator's all natural ingredients help you
sleep!Â Â Â Â Â Â  1r
Is snoring
keeping you up all night? No
more sleepless nights with
Snore Eliminator!!
Let your family sleep again by using the Snore Eliminator!
Do you know someone with a snoring problem? Snore Eliminator will help them
and they will love you for it.
Improve your sexual performance by reducing snoring and thus increasing
oxygen to your body!!People who snore have a higher risk of developing heart attacks, high blood
pressure, or strokes!! Snoring
also causes sleep disturbances that lead to increased anxiety,
...
Predicted class: 1.0
Actual class:  1.0


In [126]:
with torch.no_grad():
    out = model(emails_pad[2].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out))  
    print(emails[2]) # First non-spam email 
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[2])

URL: http://www.newsisfree.com/click/-5,8553538,1440/
Date: Not suppliedViral antibodies are identified in a one-month-old baby, as the US death toll 
rises sharply

Predicted class: 0.0
Actual class:  0.0


### Email: rohan11parekh@gmail.com 

### LinkedIn: https://www.linkedin.com/in/rohan-parekh-39b070225/