### Spam Detector Demonstration (Sentiment Classification)
This model performs spam detection on emails, classifying them as spam or not spam.
Contact: rohan11parekh@gmail.com

Imports

In [2]:
import numpy as np
import tensorflow as tf
import nltk
import re
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F

In [3]:
# Setting Pytorch device
device = torch.device("cuda")
device

device(type='cuda')

Reading the data from csv using Pandas

In [4]:
data = pd.read_csv('datasets/Emails.csv')

In [107]:
data.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1


Visualizing the ratio of spam to non-spam emails

In [6]:
print("Spam:", data['Label'].value_counts()[1] )
print("Not spam:", data['Label'].value_counts()[0])

Spam: 1896
Not spam: 4150


Note: 1 = spam, 0 = legit

Now to clean the dataset of unused columns, null values, etc.

In [108]:
value_to_remove = 'empty'
df = data[~data.apply(lambda row: value_to_remove in row.values, axis=1)]
df.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1


In [109]:
df = df.sample(frac=1).reset_index(drop=True)
df.head()

Unnamed: 0.1,Unnamed: 0,Body,Label
0,1275,\nUseful for your Individual and Business inve...,1
1,1860,WE NEED HELP. We are a 14 year old fortune 50...,1
2,4981,\nOh yeh one more thing. None of this would e...,0
3,2617,">\n> Sorry, Shrub, your political newspeak is ...",0
4,4459,"Hi,I wasn't sure if that ever started up.\nwha...",0


In [9]:
print("Spam:", df['Label'].value_counts()[1])
print("Not spam:", df['Label'].value_counts()[0])

Spam: 1561
Not spam: 3952


In [110]:
temp = df['Body'].to_list()
emails = np.array(temp)
emails[0]

"\nUseful for your Individual and Business investigation needs:\nGET  THE TRUTH ABOUT ANYONE AND  ANYTHINGÂ… \nover the internet\n Please\n  click on this link for more  information.\nFinally find,\n    track and learn  anything about\n    anyone just like a professional  Private Eye, but without a license!  Find  \npeople\n    who have moved or changed their  name. Identify addresses, phone  numbers,\n P.O. boxes. \n Locate  \nrelatives,\n    old friends or your deadbeat spouse  you haven't seen in more than a  \ndecade.\nExercise\n    your rights check public records and  obtain FBI,\n    CIA and other government  documents.  Perform  \ndo-it-yourself\n    background checks as often as you  like, through the, Freedom of  \nInformation\n    Act. \nCheck  the \nlicenses,\n    qualifications and disciplinary\n    records of Doctors, Lawyers,  Accountants, Contractors. \nDetect  \nthe identity\n    of birthparents or\n    children given up for adoption as well  as their current whereabou

Defining methods to clean emails of stopwords, punctuation, etc. and tokenizing them.

In [11]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [12]:
def clean(text):
    text = text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')

    # Using RegEx to remove URLs, punctuation, newline characters
    out = re.sub(r'(https?://\S+|www\.\S+)|[^a-zA-Z\s]', ' ', text)
    out = out.lower()
    out = " ".join(out.split())
    
    # Tokenize and remove stop words
    word_tokens = word_tokenize(out)
    output = [w for w in word_tokens if not w.lower() in stop_words]
    return output

def clean_list(s):
    out = []
    for item in s:
        out.append(clean(item))
    return out

In [106]:
cleaned_emails = clean_list(emails)
cleaned_emails[30][:10]

['wed',
 'aug',
 'oates',
 'isaac',
 'wrote',
 'new',
 'razor',
 'studied',
 'trust',
 'systems']

Creating a vocabulary of all words and their frequencies using Counter()

In [14]:
def count_words(email_list):
    count = Counter()
    for email in email_list:
        for word in email:
            count[word.lower()] += 1
    return count

word_dict = count_words(cleaned_emails)

In [None]:
# Clearing variables for memory
del stop_words
del stopwords

Loading pretrained word2vec embeddings

In [17]:
from gensim.models import KeyedVectors
word_to_index = {"<UNK>": 0, **{word: idx + 1 for idx, word in enumerate(word_dict.keys())}}

word2vec_path = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [18]:
# Initialize the embedding matrix with Word2Vec vectors or random vectors
embedding_dim = 300 

vocab_size = len(word_to_index)

In [19]:
vocab_size

60691

In [20]:
# Initialize embedding matrix with random values
embedding_matrix = np.random.normal(0, 1, (vocab_size, embedding_dim))

# Fill embedding matrix with Word2Vec embeddings
for word, idx in word_to_index.items():
    if word in word2vec:
        embedding_matrix[idx] = word2vec[word]
    else:
        embedding_matrix[idx] = np.random.normal(0, 1, embedding_dim)  # For unknown words


In [21]:
# Convert embedding matrix to PyTorch tensor
embedding_matrix = torch.tensor(embedding_matrix, dtype=torch.float32)

In [104]:
embedding_matrix[0][0:100]

tensor([ 2.3768e-01,  3.0560e-01,  8.3254e-01,  8.4590e-01,  7.6936e-01,
         5.6505e-03,  5.1453e-01,  2.5337e-02, -7.4711e-01, -8.9757e-01,
         9.2696e-01, -6.0390e-01,  4.3864e-01, -1.2269e+00,  1.3824e+00,
        -2.4003e+00, -3.3544e-02, -1.5005e+00, -5.9447e-01, -3.2046e+00,
        -2.0103e+00, -2.5262e-01,  1.4145e+00,  3.1132e-02,  4.6473e-01,
         1.2684e+00, -1.7197e-01, -1.5119e-01, -8.5677e-01,  3.6958e-02,
         1.3570e+00,  7.5293e-01, -1.4560e-01,  7.2415e-01,  4.7461e-03,
        -3.0534e-01,  8.4862e-01, -2.5290e+00,  7.5127e-02, -8.4160e-01,
        -8.6502e-01,  2.0582e-01,  8.6503e-01,  7.5544e-01, -1.2170e+00,
         3.6528e-01,  8.0880e-02,  2.2561e-01, -6.1852e-01, -2.6137e+00,
         2.4308e+00, -1.7626e-01, -8.7198e-01, -6.8907e-01, -2.0709e+00,
        -6.6116e-01,  4.8864e-01, -9.3343e-01,  1.4454e+00, -5.3716e-01,
         1.4327e+00, -1.3743e-01,  4.3624e-01,  4.3783e-01, -2.1283e-01,
         7.1833e-01, -1.1722e+00,  9.1544e-01, -1.0

In [23]:
# Accounting for unknown words
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence

vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<UNK>"]

for text in cleaned_emails:
    for word in text:
        _ = vocab[word]

Now the text needs to be converting into tensors so they can be fed into the model. To do this I define a method called text_to_tensor

In [24]:
def text_to_tensor(text):
    indices = [vocab.get(word, UNK) for word in text]  # Convert words to indices
    return torch.tensor(indices, dtype=torch.long)

In [105]:
temp_tensor = []
for text in cleaned_emails:
    temp_tensor.append(text_to_tensor(text))
temp_tensor[0]

tensor([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
         15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,   6,  26,  27,
         28,  29,  30,   6,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,
         41,  11,  42,  43,  44,  45,  46,  47,  48,  49,  50,  19,  51,  52,
         42,  53,  54,  49,  15,  55,  56,  57,  58,  59,  60,  61,  62,  63,
         64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,   3,   4,
         76,  77,  65,  66,  78,  79,  80,  81,  82,  83,  75,  84,  85,  86,
         87,  88,  89,  72,  73,   5,  90,  91,  92,  93,  94,  67,  76,  95,
         91,  96,  97,  98,  99,   6, 100, 101, 102, 103, 104, 105, 106, 107,
        108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121,
        118, 122, 123, 124, 125, 126,  38, 127, 128, 129, 115, 130, 131, 132,
        133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146,
          4, 147, 148,   4, 149, 150, 134, 151, 152, 153, 154, 1

In [None]:
# Padding the train and test sets so all the inputs are equal length 
emails_pad = [(seq[:500]) for seq in temp_tensor]
emails_pad = pad_sequence(emails_pad, batch_first = True)

In [27]:
# Clearing memory
del cleaned_emails
del data
del temp
del temp_tensor
del word_to_index
del vocab
del text

In [28]:
len(emails_pad)

5513

In [29]:
labels = np.array(df['Label'])
labels

array([0, 0, 1, ..., 1, 0, 0], dtype=int64)

In [30]:
len(labels)

5513

Now that the data is in a readable format, it can be split into train and test sets

In [31]:
X_train = torch.tensor(emails_pad[:5250])
X_test = torch.tensor(emails_pad[5250:])
Y_train = torch.tensor(labels[:5250])
Y_test = torch.tensor(labels[5250:])

  X_train = torch.tensor(emails_pad[:5250])
  X_test = torch.tensor(emails_pad[5250:])


In [32]:
print(X_train.shape)
print(Y_train.shape)

torch.Size([5250, 500])
torch.Size([5250])


In [33]:
Y_train[100:200]

tensor([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1,
        0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
        1, 0, 0, 1])

In [34]:
labels[100:200]

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1], dtype=int64)

In [35]:
df['Label'].shape

(5513,)

In [36]:
# Pytorch Dataloaders
import torch.utils.data as data_utils
from torch.utils.data import DataLoader
train = data_utils.TensorDataset(X_train, Y_train)
train_loader = DataLoader(train, batch_size=16, shuffle=True)
test = data_utils.TensorDataset(X_test, Y_test)
test_loader = DataLoader(test, batch_size=16, shuffle=True, drop_last=True)

In [37]:
# Listing the top 5 most common words
word_dict.most_common(5)

[('list', 4431), ('one', 3907), ('e', 3779), ('get', 3697), ('email', 3585)]

In [38]:
NUM_EPOCHS = 10
LEARNING_RATE = .01

Defining the model using PyTorch, then training it for 10 epochs

In [39]:
# Model Definition
class SpamDetector(nn.Module):
    def __init__(self):
        super(SpamDetector, self).__init__()
        self.emb = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.LSTM = nn.LSTM(300, 200, batch_first=True)
        self.LSTM2 = nn.LSTM(200, 300, batch_first=True)
        self.fc1 = nn.Linear(300, 1000)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(1000, 1)
        
    def forward(self, x):
        x = self.emb(x)
        x, _ = self.LSTM(x)
        x, _ = self.LSTM2(x)
        x = x[:, -1, :]  # Taking the last hidden state
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.fc2(x) 
        return x
    # I don't return it with sigmoid because BCEWithLogitsLoss expects raw logits

In [None]:
# Clearing memory
del df
del test
del train

Training Loop

In [42]:
# Instantiate Model, Loss Function, and Optimizer
model = SpamDetector().to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.FloatTensor([4150/1896]).to(device)) 
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

model.train()

for epoch in range(NUM_EPOCHS):

    accuracy = 0
    avg_loss = 0
    correct = 0
    total = 0

    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.float()
        # Forward Pass
        outputs = model(inputs)
        loss = criterion(outputs[:,0], labels)
        
        # Backward and Optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Get predictions and compute accuracy
        outputs = torch.tensor((outputs[:, 0] >= 0.5).float())
        total += labels.size(0)  # Total number of labels
        correct += (outputs == labels).sum().item()  # Count correct predictions
        
        avg_loss += loss.item()
        
    # Calculate and print the average loss and accuracy for the epoch
    avg_loss /= len(train_loader)
    accuracy = 100 * correct / total

    print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')

print('Finished Training')

  outputs = torch.tensor((outputs[:, 0] >= 0.5).float())


Epoch [1/10], Loss: 1.0381, Accuracy: 69.49%
Epoch [2/10], Loss: 0.9114, Accuracy: 72.82%
Epoch [3/10], Loss: 1.0137, Accuracy: 71.26%
Epoch [4/10], Loss: 0.2942, Accuracy: 94.99%
Epoch [5/10], Loss: 0.1139, Accuracy: 98.21%
Epoch [6/10], Loss: 0.0863, Accuracy: 98.74%
Epoch [7/10], Loss: 0.3395, Accuracy: 97.31%
Epoch [8/10], Loss: 1.1318, Accuracy: 95.12%
Epoch [9/10], Loss: 0.5285, Accuracy: 97.14%
Epoch [10/10], Loss: 0.2696, Accuracy: 98.55%
Finished Training


Running model on test set. I display the predictions to prove the results. Note: You will have to scroll a little bit.

In [46]:
# Evaluation on Test Data
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)  # Move to device
        
        outputs = model(inputs)
        labels = labels.float()
        # Move tensors to CPU and convert to numpy arrays
        outputs = outputs.cpu().numpy()
        labels = labels.cpu().numpy()
        
        predictions = [1.0 if value >= 0.5 else 0.0 for value in outputs]

        total += labels.shape[0]
        correct += (predictions == labels).sum().item()
        
        # Iterate over each sample in the batch
        for i in range(len(predictions)):
            print(f'Predicted: {predictions[i]} | Actual: {labels[i]}')

print(f'Accuracy on test set: {100 * correct / total:.2f}%')

Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0

Displaying two individual results

In [None]:
emails_pad = emails_pad.to(device)
with torch.no_grad():
    out = model(emails_pad[12].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out)) 
    print(emails[12]) # Printing first spam email without fishy links for safety
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[12])

Do You Want To Make $1000 Or More Per Week? If you are a motivated and qualified individual - I 
will personally demonstrate to you a system that will 
make you $1,000 per week or more! This is NOT mlm. Call our 24 hour pre-recorded number to get the 
details.   801-296-4210 I need people who want to make serious money.  Make 
the call and get the facts. Invest 2 minutes in yourself now! 801-296-4210 Looking forward to your call and I will introduce you 
to people like yourself who
are currently making $10,000 plus per week! 801-296-42103484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72

Predicted class: 1.0
Actual class:  1


In [None]:
with torch.no_grad():
    out = model(emails_pad[2].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out))  
    print(emails[2]) # First non-spam email 
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[2])

I will be out of the office starting  02/08/2002 and will not return until
06/08/2002.I am out of the office until Tuesday 6th August.   I will reply to messages
on my return.Thank you.
DermotImportant Email InformationThe information in this email is confidential and may be legally
privileged. It is intended solely for the addressee. Access to this email
by anyone else is unauthorized. If you are not the intended recipient, any
disclosure, copying, distribution or any action taken or omitted to be
taken in reliance on it, is prohibited and may be unlawful. If you are not
the intended addressee please contact the sender and dispose of this
e-mail.-- 
Irish Linux Users' Group: ilug@linux.ie
http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information.
List maintainer: listmaster@linux.ie

Predicted class: 0.0
Actual class:  0


Email: rohan11parekh@gmail.com 

LinkedIn: https://www.linkedin.com/in/rohan-parekh-39b070225/