### Spam Detector Demonstration (Sentiment Classification)
This model performs spam detection on emails, classifying them as spam or not spam.
Contact: rohan11parekh@gmail.com

Imports

In [1]:
import numpy as np
import tensorflow as tf
import nltk
import re
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import torch.nn.functional as F

In [None]:
# Setting Pytorch device
device = torch.device("cpu")
device

device(type='cpu')

Reading the data from csv using Pandas

In [50]:
data = pd.read_csv('datasets/Emails.csv')

In [51]:
data

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1
...,...,...,...
6041,6041,empty,0
6042,6042,___ ___ ...,0
6043,6043,IN THIS ISSUE:01. Readers write\n02. Extension...,0
6044,6044,empty,0


Visualizing the ratio of spam to non-spam emails

In [52]:
print("Spam:", data['Label'].value_counts()[1] )
print("Not spam:", data['Label'].value_counts()[0])

Spam: 1896
Not spam: 4150


Note: 1 = spam, 0 = legit

Now to clean the dataset of unused columns, null values, etc.

In [53]:
value_to_remove = 'empty'
df = data[~data.apply(lambda row: value_to_remove in row.values, axis=1)]
df

Unnamed: 0.1,Unnamed: 0,Body,Label
0,0,\nSave up to 70% on Life Insurance.\nWhy Spend...,1
1,1,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
2,2,1) Fight The Risk of Cancer!\nhttp://www.adcli...,1
3,3,##############################################...,1
4,4,I thought you might like these:\n1) Slim Down ...,1
...,...,...,...
6033,6033,----------------------------------------------...,0
6034,6034,"EFFector Vol. 15, No. 35 November ...",0
6039,6039,\nWe have extended our Free seat sale until Th...,0
6042,6042,___ ___ ...,0


In [7]:
df = df.sample(frac=1).reset_index(drop=True)
df

Unnamed: 0.1,Unnamed: 0,Body,Label
0,5881,\nShopper Newsletter: Alerts1\nCanon PowerShot...,0
1,379,\nNever Pay Retail!\nUnleash \n ...,1
2,1617,When America's top companies compete for your ...,1
3,3927,"URL: http://www.newsisfree.com/click/-4,828978...",0
4,1875,"Friend, We have recently been introduced to an...",1
...,...,...,...
5508,4387,Carlos Luna wrote:>Hi all.\n>Does anyone know ...,0
5509,1652,Hi !My name is Wayne Harrison and I would like...,1
5510,384,\nMarketingonTarget.com has teamed up \n ...,1
5511,4876,Hi i have a phillips head skrew thats holding ...,0


In [54]:
print("Spam:", df['Label'].value_counts()[1])
print("Not spam:", df['Label'].value_counts()[0])

Spam: 1561
Not spam: 3952


In [55]:
temp = df['Body'].to_list()
emails = np.array(temp)
emails

array(["\nSave up to 70% on Life Insurance.\nWhy Spend More Than You Have To?Life Quote Savings\nEnsuring your \n      family's financial security is very important. Life Quote Savings makes \n      buying life insurance simple and affordable. We Provide FREE Access to The \n      Very Best Companies and The Lowest Rates.Life Quote Savings is FAST, EASY and \n            SAVES you money! Let us help you get started with the best values in \n            the country on new coverage. You can SAVE hundreds or even thousands \n            of dollars by requesting a FREE quote from Lifequote Savings. Our \n            service will take you less than 5 minutes to complete. Shop and \n            compare. SAVE up to 70% on all types of Life insurance! Click Here For Your \n            Free Quote!Protecting your family is the best investment you'll ever \n          make!\nIf you are in receipt of this email \n      in error and/or wish to be removed from our list, PLEASE CLICK HERE AND TYPE REM

Defining methods to clean emails of stopwords, punctuation, etc. and tokenizing them.

In [56]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [57]:
def clean(text):
    text = text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')

    # Using RegEx to remove URLs, punctuation, newline characters
    out = re.sub(r'(https?://\S+|www\.\S+)|[^a-zA-Z\s]', ' ', text)
    out = out.lower()
    out = " ".join(out.split())
    
    # Tokenize and remove stop words
    word_tokens = word_tokenize(out)
    output = [w for w in word_tokens if not w.lower() in stop_words]
    return output

def clean_list(s):
    out = []
    for item in s:
        out.append(clean(item))
    return out

In [58]:
cleaned_emails = clean_list(emails)
cleaned_emails[30]

['seen',
 'nbc',
 'cbs',
 'cnn',
 'even',
 'oprah',
 'health',
 'discovery',
 'actually',
 'reverses',
 'aging',
 'burning',
 'fat',
 'without',
 'dieting',
 'exercise',
 'proven',
 'discovery',
 'even',
 'reported',
 'new',
 'england',
 'journal',
 'medicine',
 'forget',
 'aging',
 'dieting',
 'forever',
 'guaranteed',
 'click',
 'would',
 'like',
 'lose',
 'weight',
 'sleep',
 'dieting',
 'hunger',
 'pains',
 'cravings',
 'strenuous',
 'exercise',
 'change',
 'life',
 'forever',
 'guaranteed',
 'body',
 'fat',
 'loss',
 'improvement',
 'wrinkle',
 'reduction',
 'improvement',
 'energy',
 'level',
 'improvement',
 'muscle',
 'strength',
 'improvement',
 'sexual',
 'potency',
 'improvement',
 'emotional',
 'stability',
 'improvement',
 'memory',
 'improvement',
 'receiving',
 'email',
 'subscriber',
 'opt',
 'america',
 'mailing',
 'list',
 'unsubscribe',
 'future',
 'offers',
 'click',
 'mailto',
 'affiliateoptout',
 'btamail',
 'net',
 'cn',
 'subject']

Creating a vocabulary of all words and their frequencies using Counter()

In [60]:
def count_words(email_list):
    count = Counter()
    for email in email_list:
        for word in email:
            count[word.lower()] += 1
    return count

word_dict = count_words(cleaned_emails)

In [None]:
# Clearing variables for memory
del emails
del stop_words
del stopwords

Loading pretrained word2vec embeddings

In [61]:
from nltk.tokenize import word_tokenize
from gensim.models import KeyedVectors
word_to_index = {"<UNK>": 0, **{word: idx + 1 for idx, word in enumerate(word_dict.keys())}}

word2vec_path = 'GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [None]:
# Initialize the embedding matrix with Word2Vec vectors or random vectors
embedding_dim = 300 

vocab_size = len(word_to_index)

In [63]:
vocab_size

60691

In [None]:
# Initialize embedding matrix with random values
embedding_matrix = np.random.normal(0, 1, (vocab_size, embedding_dim))

# Fill embedding matrix with Word2Vec embeddings
for word, idx in word_to_index.items():
    if word in word2vec:
        embedding_matrix[idx] = word2vec[word]
    else:
        embedding_matrix[idx] = np.random.normal(0, 1, embedding_dim)  # For unknown words


In [65]:
# Convert embedding matrix to PyTorch tensor
embedding_matrix = torch.tensor(embedding_matrix, dtype=torch.float32)

In [66]:
embedding_matrix[0]

tensor([ 8.9507e-01, -5.6883e-01,  1.8751e-02, -8.0835e-01,  8.8713e-01,
        -8.0156e-01,  4.5586e-01,  3.5591e-02, -1.7381e+00,  1.0170e+00,
        -1.0653e+00,  1.5809e+00,  3.7834e-01, -1.2261e+00,  6.7350e-01,
         3.5411e-02,  1.9501e+00, -7.7990e-01,  1.2351e-01,  3.4095e-02,
        -4.1785e-01, -4.2822e-01, -1.1714e+00, -9.4657e-01, -1.0197e+00,
         1.2860e+00,  6.9539e-01, -4.9643e-01, -3.5695e-01,  9.9359e-01,
         1.4023e+00,  6.1561e-01, -5.0612e-01,  6.1453e-01, -4.4246e-02,
        -5.2959e-01,  5.4660e-01, -1.7878e+00,  8.2527e-01, -3.9683e-01,
         1.8633e-01, -2.3181e-01,  7.4346e-01, -4.1271e-01,  4.6692e-01,
        -3.3615e-01,  3.3050e-01,  2.2488e-01,  5.7755e-01,  1.1743e-01,
        -9.0490e-01, -1.3273e+00,  3.8992e-01, -7.1687e-01,  3.4020e-02,
        -1.0477e+00,  1.2991e+00,  1.3571e+00,  2.3975e-01, -1.7906e+00,
        -4.6713e-01, -7.5801e-02,  3.4914e-01,  4.9442e-01, -5.3434e-01,
        -1.0223e+00, -5.4836e-01,  2.2957e-01,  4.8

In [None]:
# Accounting for unknown words
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence

vocab = defaultdict(lambda: len(vocab))
UNK = vocab["<UNK>"]

for text in cleaned_emails:
    for word in text:
        _ = vocab[word]

Now the text needs to be converting into tensors so they can be fed into the model. To do this I define text_to_tensor

In [None]:
def text_to_tensor(text):
    indices = [vocab.get(word, UNK) for word in text]  # Convert words to indices
    return torch.tensor(indices, dtype=torch.long)

In [None]:
temp_tensor = []
for text in cleaned_emails:
    temp_tensor.append(text_to_tensor(text))
temp_tensor

[tensor([ 1,  2,  3,  4,  2,  5,  6,  7,  8,  9, 10, 11,  2,  5,  6, 12, 13,  2,
          3, 14, 15, 16, 17, 18, 19, 20, 21, 22,  2,  5,  6, 23, 24, 25, 26, 27,
         28, 29, 30, 31, 19, 32, 33, 34, 35,  1, 36, 37, 38, 39, 40, 17,  5, 41,
          6, 42, 43, 44, 45, 46, 47, 48,  1, 49,  2,  3, 50, 17,  5, 51,  8, 19,
         52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 50, 62, 63, 64, 65, 66, 67, 68,
         69,  3, 61, 70, 56]),
 tensor([ 71,  72,  73,  74,  75,  76,  77,  78,  30,  79,  80,  81,  17,  82,
          83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
          97,  98,  99,  58, 100,  60,  61, 101, 102, 103, 104, 105, 106, 107,
         104, 108, 109, 110, 111,  60, 112, 113, 104, 108]),
 tensor([ 71,  72,  73,  74,  75,  76,  77,  78,  30,  79,  80,  81,  17,  82,
          83,  84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,
          97,  98,  99,  58, 100,  60,  61, 101, 102]),
 tensor([114, 115, 116,  17, 117, 118,  18, 119, 120

In [None]:
# Padding the train and test sets so all the inputs are equal length 

emails_pad = [(seq[:500]) for seq in temp_tensor]
emails_pad = pad_sequence(emails_pad, batch_first = True)

In [None]:
# Clearing memory
del cleaned_emails
del data
del temp
del temp_tensor
del word_to_index
del vocab
del text

In [26]:
len(emails_pad)

5513

In [71]:
labels = np.array(df['Label'])
labels

array([1, 1, 1, ..., 0, 0, 0], dtype=int64)

In [28]:
len(labels)

5513

Now that the data is in a readable format, it can be split into train and test sets

In [29]:
X_train = torch.tensor(emails_pad[:5250])
X_test = torch.tensor(emails_pad[5250:])
Y_train = torch.tensor(labels[:5250])
Y_test = torch.tensor(labels[5250:])

  X_train = torch.tensor(emails_pad[:5250])
  X_test = torch.tensor(emails_pad[5250:])


In [30]:
print(X_train.shape)
print(Y_train.shape)

torch.Size([5250, 500])
torch.Size([5250])


In [32]:
Y_train[100:200]

tensor([1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 1, 0])

In [33]:
labels[100:200]

array([1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0], dtype=int64)

In [34]:
df['Label'].shape

(5513,)

In [None]:
# Pytorch Dataloaders
import torch.utils.data as data_utils
from torch.utils.data import DataLoader
train = data_utils.TensorDataset(X_train, Y_train)
train_loader = DataLoader(train, batch_size=16, shuffle=True)
test = data_utils.TensorDataset(X_test, Y_test)
test_loader = DataLoader(test, batch_size=16, shuffle=True, drop_last=True)

In [36]:
# Listing the top 5 most common words
word_dict.most_common(5)

[('list', 4431), ('one', 3907), ('e', 3779), ('get', 3697), ('email', 3585)]

In [37]:
NUM_EPOCHS = 10
LEARNING_RATE = .01

Defining the model using PyTorch, then training it for 10 epochs

In [None]:
# Model Definition
class SpamDetector(nn.Module):
    def __init__(self):
        super(SpamDetector, self).__init__()
        self.emb = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.LSTM = nn.LSTM(300, 200, batch_first=True)
        self.LSTM2 = nn.LSTM(200, 300, batch_first=True)
        self.fc1 = nn.Linear(300, 1000)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(1000, 1)
        
    def forward(self, x):
        x = self.emb(x)
        x, _ = self.LSTM(x)
        x, _ = self.LSTM2(x)
        x = x[:, -1, :]  # Taking the last hidden state
        x = self.fc1(x)
        x = self.dropout(x)
        x = self.fc2(x) 
        return x
    # I don't return it with sigmoid because BCEWithLogitsLoss expects raw logits

In [None]:
# Clearing memory
del df
del emails_pad
del test
del train

Training Loop

In [None]:
# Instantiate Model, Loss Function, and Optimizer
model = SpamDetector().to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.FloatTensor([4150/1896])) 
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

model.train()

for epoch in range(NUM_EPOCHS):

    accuracy = 0
    avg_loss = 0
    correct = 0
    total = 0

    for i, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        labels = labels.float()
        # Forward Pass
        outputs = model(inputs)
        loss = criterion(outputs[:,0], labels)
        
        # Backward and Optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Get predictions and compute accuracy
        outputs = torch.tensor((outputs[:, 0] >= 0.5).float())
        total += labels.size(0)  # Total number of labels
        correct += (outputs == labels).sum().item()  # Count correct predictions
        
        avg_loss += loss.item()
        
    # Calculate and print the average loss and accuracy for the epoch
    avg_loss /= len(train_loader)
    accuracy = 100 * correct / total

    print(f'Epoch [{epoch+1}/{NUM_EPOCHS}], Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')

print('Finished Training')

  outputs = torch.tensor((outputs[:, 0] >= 0.5).float())


Epoch [1/10], Loss: 0.8750, Accuracy: 76.04%
Epoch [2/10], Loss: 0.3861, Accuracy: 92.76%
Epoch [3/10], Loss: 0.1933, Accuracy: 96.57%
Epoch [4/10], Loss: 0.1612, Accuracy: 97.24%
Epoch [5/10], Loss: 0.1419, Accuracy: 97.14%
Epoch [6/10], Loss: 0.3585, Accuracy: 94.91%
Epoch [7/10], Loss: 0.2041, Accuracy: 96.51%
Epoch [8/10], Loss: 0.2522, Accuracy: 97.24%
Epoch [9/10], Loss: 0.1090, Accuracy: 98.23%
Epoch [10/10], Loss: 0.1626, Accuracy: 97.64%
Finished Training


Running model on test set

In [None]:
# Evaluation on Test Data
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)  # Move to device
        inputs = torch.clamp(inputs, max=vocab_size - 1)
        outputs = model(inputs)
        labels = labels.float()
        # Move tensors to CPU and convert to numpy arrays
        outputs = outputs.cpu().numpy()
        labels = labels.cpu().numpy()
        
        predictions = [1.0 if value >= 0.5 else 0.0 for value in outputs]

        total += labels.shape[0]
        correct += (predictions == labels).sum().item()
        
        # Iterate over each sample in the batch
        for i in range(len(predictions)):
            print(f'Predicted: {predictions[i]} | Actual: {labels[i]}')

print(f'Accuracy on test set: {100 * correct / total:.2f}%')

Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 0.0 | Actual: 1.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 1.0
Predicted: 1.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0 | Actual: 0.0
Predicted: 0.0

In [97]:
index = 8
with torch.no_grad():
    out = model(emails_pad[index].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out))  # or use .argmax() for multi-class
    print(emails[index])
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[index])

TIRED OF THE BULL OUT THERE?
Want To Stop Losing Money?WANT A REAL MONEY MAKER?
RECEIVE $1,000-$5,000 TODAY!
EXPERTS ARE CALLING THIS THE FASTEST WAY TO HUGE CASH FLOW EVER CONCEIVED!A POWERHOUSE Gifting Program You Don't Want To Miss!
We work as a TEAM! This is YOUR Private Invitation GET IN WITH THE FOUNDERS! This is where the BIG BOYS PLAY! The MAJOR PLAYERS are on This ONE For ONCE be where the Players areThis is a system that will drive $1,000's to your doorstep 
In a short period of time!Leverage $1000.00 into $50,000, Over and Over Again THE QUESTION HERE IS:YOU EITHER WANT TO BE WEALTHY OR YOU DON'T!!!WHICH ONE ARE YOU?I am tossing you a financial lifeline and for your sake I Hope you GRAB onto it and hold on tight For the Ride of your life!TestimonialsHear what average people are doing their first few days:
ï¿½We've received 8,000 in 1 day and we are doing that over and over again!' Q.S. in AL
 ï¿½I'm a single mother in FL and I've received 12,000 in the last 4 days.ï¿½ D. S. 

In [121]:
index = 5263
with torch.no_grad():
    out = model(emails_pad[index].unsqueeze(0))
    predicted_class = torch.round(torch.sigmoid(out))  # or use .argmax() for multi-class
    print(emails[index])
    print("Predicted class:", predicted_class.item())
    print("Actual class: ", labels[index])

FreeBSD-SA-02:36.nfs                                        Security Advisory
                                                          The FreeBSD ProjectTopic:          Bug in NFS server code allows remote denial of serviceCategory:       core
Module:         nfs
Announced:      2002-08-05
Credits:        Mike Junk 
Affects:        All releases prior to 4.6.1-RELEASE-p7
                4.6-STABLE prior to the correction date
Corrected:      2002-07-19 17:19:53 UTC (RELENG_4)
                2002-08-01 19:31:55 UTC (RELENG_4_6)
                2002-08-01 19:31:54 UTC (RELENG_4_5)
                2002-08-01 19:31:54 UTC (RELENG_4_4)
FreeBSD only:   NOI.   BackgroundThe Network File System (NFS) allows a host to export some or all of
its filesystems, or parts of them, so that other hosts can access them
over the network and mount them as if they were on local disks.  NFS is
built on top of the Sun Remote Procedure Call (RPC) framework.II.  Problem DescriptionA part of the NFS server cod

Email: rohan11parekh@gmail.com 

LinkedIn: https://www.linkedin.com/in/rohan-parekh-39b070225/