<a href="https://colab.research.google.com/github/mmonm17/SpamDetection/blob/main/SpamDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions

The code in this notebook relies on a CSV file stored in a shared Google Drive. To access it, the notebook has to be opened in Google Colaboratory (not Jupyter Notebook). Alternatively, the CSV can be downloaded from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset, after which the path may be changed to a local one at the user's discretion. After opening the notebook in Colaboratory, run each code cell in order, from top to bottom. Errors may occur if a cell is skipped, because later cells depend on variables defined in earlier ones.
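The choice between the shared Drive and a local copy could be sketched as follows. The local filename `spam.csv` and the helper names `running_in_colab` and `dataset_path` are assumptions made for illustration; they are not part of the original notebook.

```python
# Pick the dataset location depending on where the notebook runs.
DRIVE_PATH = "/content/drive/Shared drives/KEDS (M)/spam.csv"
LOCAL_PATH = "spam.csv"  # assumption: the Kaggle CSV was saved next to the notebook


def running_in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colaboratory)
    except ImportError:
        return False
    return True


def dataset_path() -> str:
    """Use the shared Drive in Colab, otherwise fall back to the local copy."""
    return DRIVE_PATH if running_in_colab() else LOCAL_PATH


path = dataset_path()
print(path)
```

In Colaboratory the Drive still has to be mounted before the file can be read; outside Colaboratory the mounting cell can be skipped.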



# Opening the Dataset


In [None]:
# Mount Google Drive
from google.colab import drive, data_table

drive.mount('/content/drive')

Mounted at /content/drive


### Load file
Load the dataset from a shared Google Drive

In [None]:
# Global
import pandas as pd
path = "/content/drive/Shared drives/KEDS (M)/spam.csv"

# Understanding the Dataset

In [None]:
df = pd.read_csv(path, encoding='latin1')

# check total number of documents
doc_num = df.shape[0]
print("Total number of documents: %d" % doc_num)

print(f"ham: {(df['v1'] == 'ham').sum()} ({(df['v1'] == 'ham').sum() / doc_num})")
print(f"spam: {(df['v1'] == 'spam').sum()} ({(df['v1'] == 'spam').sum() / doc_num})")

Total number of documents: 5572
ham: 4825 (0.8659368269921034)
spam: 747 (0.13406317300789664)


According to the Kaggle dataset description, messages labeled 'spam' are spam messages and messages labeled 'ham' are legitimate (non-spam) messages. There are 5572 documents in total: about 87% are ham and the other 13% are spam. The data is therefore heavily imbalanced, which is important to keep in mind for the rest of the notebook, particularly when testing the model.
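To see why the imbalance matters for testing, consider a trivial classifier that labels every message as ham. The sketch below uses only the counts printed above; the baseline itself is a hypothetical illustration, not a model used in this notebook.

```python
# Counts taken from the output above.
ham, spam = 4825, 747
total = ham + spam

# A trivial baseline that predicts "ham" for every message.
baseline_accuracy = ham / total    # every ham message counts as correct
baseline_spam_recall = 0 / spam    # no spam message is ever caught

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"spam recall: {baseline_spam_recall:.4f}")
```

Such a baseline reaches roughly 87% accuracy while detecting no spam at all, which is why accuracy alone is a weak indicator on this dataset.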

# Pre-processing
To know what pre-processing must be done before training, first view a few samples from the dataset.


In [None]:
df[82:98]

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
82,ham,Ok i am on the way to home hi hi,,,
83,ham,You will be in the place of that man,,,
84,ham,Yup next stop.,,,
85,ham,"I call you later, don't have network. If urgnt...",,,
86,ham,For real when u getting on yo? I only need 2 m...,,,
87,ham,Yes I started to send requests to make it but ...,,,
88,ham,I'm really not up to it still tonight babe,,,
89,ham,"Ela kano.,il download, come wen ur free..",,,
90,ham,Yeah do! DonÛ÷t stand to close tho- youÛ÷ll ...,,,
91,ham,Sorry to be a pain. Is it ok if we meet anothe...,,,


The text is spread over multiple columns: besides v2, there are three unnamed columns that are empty in the rows shown above and serve no purpose on their own. These columns are therefore concatenated into a single message column for ease of processing.

In [None]:
import re

# For readability
df.rename(columns={"v1": "label", "v2":"message"}, inplace=True)

# Join the strings found in all the feature columns
df["message"] = df[["message", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4"]].astype(str).apply("-".join, axis=1)
df["label"] = df["label"].astype(str)

# removing the other columns
df.drop(columns=["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], inplace=True)

df[:5]

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...-nan-nan-nan
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Although the columns have been concatenated, the messages now contain the token -nan wherever a column was empty. This token is not part of the original messages and should not be taken into consideration during training, so it is removed through RegEx.

In [None]:
df["message"] = df["message"].str.replace(r'\-nan', '', regex=True)

df[:5]

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
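An alternative that avoids creating the `-nan` token in the first place is to blank out missing cells before joining. This also sidesteps a side effect of the pattern `\-nan`, which would alter a message that legitimately contains that character sequence (for example, "pre-nanny" would become "preny"). The sketch below uses a small made-up frame, not rows from the dataset, and joins with an empty string instead of `-`, which is a different choice from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of the CSV; the contents are made up.
toy = pd.DataFrame({
    "message": ["Ok lar...", "first part", "pre-nanny call"],
    "Unnamed: 2": [np.nan, " second part", np.nan],
    "Unnamed: 3": [np.nan, np.nan, " third part"],
})

text_cols = ["message", "Unnamed: 2", "Unnamed: 3"]

# Replace missing cells with empty strings, then concatenate each row.
toy["message"] = toy[text_cols].fillna("").astype(str).apply("".join, axis=1)

# The extra columns are no longer needed once their text has been merged.
toy = toy.drop(columns=["Unnamed: 2", "Unnamed: 3"])

print(toy["message"].tolist())
```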


The last step prior to splitting the data for training and testing is to assign the labels and the messages to two separate arrays so that they can be used in the machine learning models.

In [None]:
# Define feature columns
x = df["message"].values
# Define label columns
y = df["label"].values

# Splitting the Data

30% of the data is assigned to be test data, and the other 70% will be the training data. The distribution of the labels in each dataset is also displayed below. Both sets have significantly more ham messages, which follows the distribution of the original dataset.

The stratify argument is used so that the label proportions in each split are similar to those of the original dataset. Moreover, the model is trained and tested on different variations of the data by altering the random_state value, in order to better assess the performance of the model and ensure it is not biased towards one variation.

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 20, stratify=y)

print("Training Data")
print(f"Total number of documents: {len(y_train)}")
print(f"Documents for ham messages: {(y_train == 'ham').sum()}")
print(f"Documents for spam messages: {(y_train == 'spam').sum()}")

print("\nTesting Data")
print(f"Total number of documents: {len(y_test)}")
print(f"Documents for ham messages: {(y_test == 'ham').sum()}")
print(f"Documents for spam messages: {(y_test == 'spam').sum()}")

Training Data
Total number of documents: 3900
Documents for ham messages: 3377
Documents for spam messages: 523

Testing Data
Total number of documents: 1672
Documents for ham messages: 1448
Documents for spam messages: 224
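The effect of stratification across different random_state values can be checked with a short sketch. The labels below are synthetic (870 ham, 130 spam, chosen to mimic the 87/13 ratio), the features are placeholders, and the seeds are arbitrary examples.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same imbalance as the dataset.
y_toy = np.array(["ham"] * 870 + ["spam"] * 130)
x_toy = np.arange(len(y_toy))  # placeholder features

spam_shares = []
for seed in (0, 16, 20):  # arbitrary example seeds
    _, _, _, y_te = train_test_split(
        x_toy, y_toy, test_size=0.3, random_state=seed, stratify=y_toy
    )
    # Fraction of spam in the test split for this seed.
    spam_shares.append((y_te == "spam").mean())

print(spam_shares)
```

Each seed shuffles the documents differently, but with stratification the share of spam in the test split stays close to the share in the full data.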


# Text Vectorization: Bag of Words Model



The features are first extracted using the Bag of Words (BoW) model and then used to train two machine learning models: Naïve Bayes and Logistic Regression. For further comparison of pre-processing approaches in natural language processing, TF-IDF is also used for feature extraction in the next section.
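As a rough illustration of what a Bag of Words representation contains, the sketch below builds count vectors by hand for two made-up messages. It splits on whitespace for simplicity, whereas CountVectorizer's default tokenizer keeps only tokens of two or more alphanumeric characters, so this is an approximation of the idea rather than a reimplementation.

```python
from collections import Counter

docs = ["free entry free prize", "ok see you later"]  # made-up messages

# Vocabulary: every distinct lowercased token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc.lower().split()})


def bow_vector(doc):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.lower().split())
    return [counts[tok] for tok in vocab]


vectors = [bow_vector(doc) for doc in docs]

print(vocab)    # ['entry', 'free', 'later', 'ok', 'prize', 'see', 'you']
print(vectors)  # [[1, 2, 0, 0, 1, 0, 0], [0, 0, 1, 1, 0, 1, 1]]
```

Word order is discarded; each message is described only by how many times each vocabulary word appears in it.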

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# lowercasing of words is done by default
count_vec = CountVectorizer(ngram_range = (1,1))

x_train = count_vec.fit_transform(x_train)
x_test = count_vec.transform(x_test)

### Training: Naïve Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(x_train, y_train)

prediction_nb = nb.predict(x_test)

### Training: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

prediction_logreg = logreg.predict(x_test)

### Testing

The models are tested using four evaluation metrics (F1 score, accuracy, precision, and recall), but ultimately the F1 score is the basis for the evaluation. F1, precision, and recall are macro-averaged over the two classes. The reasons for this are further justified and assessed in the paper submitted alongside this notebook.
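A hand-computed macro F1 on a tiny made-up example shows how it differs from accuracy on imbalanced labels. The helper names `f1_for` and `macro_f1` are hypothetical and the labels are illustrative only; the notebook itself relies on scikit-learn's f1_score.

```python
def f1_for(label, y_true, y_pred):
    """F1 score of a single class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for(label, y_true, y_pred) for label in labels) / len(labels)


# Made-up example: 8 ham and 2 spam messages, one spam message is missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.9, while macro F1 is lower because the missed spam message halves the recall of the minority class, and both classes count equally in the average.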


In [None]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

print("Naïve Bayes: %f" % f1_score(y_test, prediction_nb, average='macro'))
print("Logistic Regression: %f" % f1_score(y_test, prediction_logreg, average='macro'))
print("--------------------------------")

print("Naive Bayes (accuracy): %f" % accuracy_score(y_test, prediction_nb))
print("Logistic Regression (accuracy): %f" % accuracy_score(y_test, prediction_logreg))
print("--------------------------------")

print("Naive Bayes (precision): %f" % precision_score(y_test, prediction_nb, average = 'macro'))
print("Logistic Regression (precision): %f" % precision_score(y_test, prediction_logreg, average = 'macro'))
print("--------------------------------")

print("Naive Bayes (recall): %f" % recall_score(y_test, prediction_nb, average = 'macro'))
print("Logistic Regression (recall): %f" % recall_score(y_test, prediction_logreg, average = 'macro'))



Naïve Bayes: 0.960127
Logistic Regression: 0.947838
--------------------------------
Naive Bayes (accuracy): 0.982057
Logistic Regression (accuracy): 0.977273
--------------------------------
Naive Bayes (precision): 0.975318
Logistic Regression (precision): 0.980363
--------------------------------
Naive Bayes (recall): 0.946244
Logistic Regression (recall): 0.920839


# Text Vectorization: TF-IDF

To explore a different pipeline, the dataset features were also extracted using TF-IDF, which downweights words that occur frequently across the dataset. The entire code from splitting to testing is replicated in a single cell to avoid having to re-run each earlier step in the notebook.
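The downweighting can be illustrated with a hand-rolled sketch on three made-up, already tokenized messages. It assumes the weighting that TfidfVectorizer documents for its default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The names `doc_freq`, `idf`, and `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already tokenized messages; "call" occurs in every one of them.
docs = [["free", "prize", "call"], ["call", "me", "later"], ["call", "now"]]
n_docs = len(docs)

# Document frequency: in how many documents each token occurs.
doc_freq = Counter(tok for doc in docs for tok in set(doc))


def idf(tok):
    """Smoothed inverse document frequency (assumed default weighting)."""
    return math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1


def tfidf(doc):
    """TF-IDF weights of one document, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = {tok: count * idf(tok) for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


weights = tfidf(docs[0])
print(weights)
```

In the first message, "call" receives a smaller weight than "free" or "prize" because it appears in every document and therefore carries less discriminating information.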

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

# Splitting data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 16, stratify=y)

# Vectorization
tfidf_vec = TfidfVectorizer(ngram_range = (1,1))

x_train = tfidf_vec.fit_transform(x_train)
x_test = tfidf_vec.transform(x_test)

# Training
nb = MultinomialNB()
nb.fit(x_train, y_train)

prediction_nb = nb.predict(x_test)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

prediction_logreg = logreg.predict(x_test)

# Testing

print("Naïve Bayes: %f" % f1_score(y_test, prediction_nb, average='macro'))
print("Logistic Regression: %f" % f1_score(y_test, prediction_logreg, average='macro'))
print("--------------------------------")

print("Naive Bayes (accuracy): %f" % accuracy_score(y_test, prediction_nb))
print("Logistic Regression (accuracy): %f" % accuracy_score(y_test, prediction_logreg))
print("--------------------------------")

print("Naive Bayes (precision): %f" % precision_score(y_test, prediction_nb, average = 'macro'))
print("Logistic Regression (precision): %f" % precision_score(y_test, prediction_logreg, average = 'macro'))
print("--------------------------------")

print("Naive Bayes (recall): %f" % recall_score(y_test, prediction_nb, average = 'macro'))
print("Logistic Regression (recall): %f" % recall_score(y_test, prediction_logreg, average = 'macro'))



Naïve Bayes: 0.886842
Logistic Regression: 0.907536
--------------------------------
Naive Bayes (accuracy): 0.955144
Logistic Regression (accuracy): 0.962321
--------------------------------
Naive Bayes (precision): 0.975378
Logistic Regression (precision): 0.979153
--------------------------------
Naive Bayes (recall): 0.832589
Logistic Regression (recall): 0.859375


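On F1 score, accuracy, and recall, TF-IDF performs noticeably worse than BoW for both models, while macro precision is essentially unchanged. Note that the two runs use different splits (random_state 20 for BoW and 16 for TF-IDF), so part of the gap may come from the split itself. The reasons TF-IDF has worse results than BoW are expounded further in the paper.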
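One way to compare the two vectorizers on identical splits is to wrap the pipeline in a small helper and call it with the same seed for both. The sketch below is an assumption about how this could be organized, not the notebook's actual code: the helper name `evaluate` is hypothetical, only Naïve Bayes is shown, and the corpus is made up so the block runs on its own. In the notebook, the arrays x and y would be passed instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate(vectorizer, x, y, seed):
    """Hypothetical helper: split, vectorize, train Naive Bayes, return macro F1."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Fit the vocabulary on the training split only, then reuse it for testing.
    x_tr = vectorizer.fit_transform(x_tr)
    x_te = vectorizer.transform(x_te)
    model = MultinomialNB().fit(x_tr, y_tr)
    return f1_score(y_te, model.predict(x_te), average="macro")


# Made-up corpus, far smaller and cleaner than the real dataset.
x_toy = ["win a free prize now", "free entry call now"] * 10 \
      + ["see you later", "ok on my way home"] * 10
y_toy = ["spam", "spam"] * 10 + ["ham", "ham"] * 10

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    # Same seeds for both vectorizers, so the splits are identical.
    scores = [evaluate(vec, x_toy, y_toy, seed) for seed in (16, 20)]
    print(name, scores)
```

Because both vectorizers see the same training and test documents for each seed, any remaining difference in scores can be attributed to the feature extraction rather than to the split.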