Download the "spam.csv" dataset and load it into a pandas DataFrame

In [1]:
import pandas as pd
df = pd.read_csv("E:/downld/spam.csv", low_memory=False)

Import the required libraries: pandas, scikit-learn, and numpy

In [3]:
!pip install pycaret


In [2]:
import numpy as np

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

Approach 1 - CountVectorizer with SVM

Use the CountVectorizer class from the scikit-learn library to turn the email text into
word-count features

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [2]:
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
df = df.loc[:, ~df.columns.str.contains('Unnamed')]

In [4]:
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Rename the columns to Label and EmailText

In [5]:
df.columns=['Label','EmailText']

In [6]:
df.head()

Unnamed: 0,Label,EmailText
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Extract count features and split the dataset into training and testing sets

In [28]:
# Transform the email text into count-vector features with CountVectorizer
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(df['EmailText'])
y = df['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
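
The cell above fits the vectorizer on every message before splitting, so the vocabulary is learned partly from messages that end up in the test set. A minimal sketch of an alternative, assuming a small invented corpus in place of spam.csv: split the raw text first and put the vectorizer inside the pipeline, so it is fitted on the training portion only. The names `texts`, `labels`, and `model` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented messages standing in for the EmailText and Label columns
texts = [
    "Win a free prize now", "Free entry to the cash draw",
    "Claim your free ringtone today", "Urgent: you have won cash",
    "Call now to claim your prize", "Free tickets, reply WIN now",
    "Are we still meeting for lunch", "I will call you after work",
    "See you at home tonight", "Can you send me the notes",
    "Thanks, talk to you later", "Running late, be there soon",
]
labels = ["spam"] * 6 + ["ham"] * 6

# Split the raw text first; stratify keeps both classes in each part
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is a pipeline step, so fit() only sees the training text
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(text_train, y_train)

vocab = model.named_steps["countvectorizer"].vocabulary_
predictions = model.predict(text_test)
print(len(text_train), len(text_test))  # 9 3
```

With this arrangement, words that occur only in test messages never enter the vocabulary, which is closer to how the classifier would meet unseen email.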

Train an SVM (Support Vector Machine) classifier on the training data. Then evaluate the performance of the classifier by calculating accuracy, recall, precision, and
F1-score on the test set

In [29]:

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with SVM
pipeline = make_pipeline(StandardScaler(with_mean=False), SVC(kernel='linear'))

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.9748878923766816


In [34]:
# Predict on the test set
y_pred = pipeline.predict(X_test)
# accuracy_score returns accuracy, so the value is labelled accordingly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9748878923766816


In [35]:
# Evaluate the model
recall = recall_score(y_test, y_pred, average='macro')
print("Recall:", recall )

Recall: 0.9066666666666667


In [36]:
# Evaluate the f1 score
f1 = f1_score(y_test, y_pred, average='macro')
print("f1-score:", f1 )

f1-score: 0.941379258547137
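
The cells above report accuracy, macro recall, and macro F1; precision_score is imported but never called. A hedged sketch of how precision for the spam class could be computed, using invented label lists (`y_true_demo`, `y_pred_demo`) rather than the notebook's `y_test` and `y_pred`:

```python
from sklearn.metrics import accuracy_score, precision_score

# Invented labels: 6 ham and 4 spam messages
y_true_demo = ["ham"] * 6 + ["spam"] * 4
# One ham message is flagged as spam and one spam message is missed
y_pred_demo = ["ham"] * 5 + ["spam"] + ["spam"] * 3 + ["ham"]

# Accuracy: share of all messages classified correctly (8 of 10)
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)

# Precision for spam: of the 4 messages flagged as spam, 3 really are spam
precision_demo = precision_score(y_true_demo, y_pred_demo, pos_label="spam")

print("Accuracy:", accuracy_demo)    # Accuracy: 0.8
print("Precision:", precision_demo)  # Precision: 0.75
```

The two numbers differ, which is why accuracy cannot stand in for precision when the classes are imbalanced.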


Approach 2 - Cleaning Text and SVM

Preprocess the email text by performing cleaning-up techniques such as removing
stopwords, punctuation, and performing stemming or lemmatization.

Use the cleaned text as features and repeat steps similar to Approach 1: splitting,
training an SVM classifier, and evaluating performance metrics

In [35]:
# Remove stop words and punctuation, then stem
import string

import nltk
from nltk.stem import PorterStemmer
# Public import of the same stop-word list that _stop_words exposes privately
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Initialize the stemmer
stemmer = PorterStemmer()

# Define a function to clean and stem text
def clean(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove ASCII punctuation and curly quotes
    text = text.translate(str.maketrans('', '', string.punctuation + '“”'))
    # Drop stop words, then stem the remaining words
    stemmed_words = [stemmer.stem(word) for word in text.split() if word not in ENGLISH_STOP_WORDS]
    # Join the words back into a single space-separated string
    return ' '.join(stemmed_words)

# Apply the function to the 'EmailText' column
df['EmailText'] = df['EmailText'].apply(clean)

In [36]:
df.head()

Unnamed: 0,Label,EmailText
0,ham,jurong point crazi avail bugi n great world la...
1,ham,ok lar joke wif u oni
2,spam,free entri 2 wkli comp win fa cup final tkt 21...
3,ham,u dun say earli hor u c say
4,ham,nah dont think goe usf live


In [34]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kevin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

Train the SVM and evaluate it

In [44]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['EmailText'])
y = df['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Create a pipeline with SVM
pipeline = make_pipeline(StandardScaler(with_mean=False), SVC(kernel='linear'))

# Train the model
pipeline.fit(X_train, y_train)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.9721973094170404


In [45]:
# Predict on the test set
y_pred = pipeline.predict(X_test)
# accuracy_score returns accuracy, so the value is labelled accordingly
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9721973094170404


In [46]:
# Evaluate the model
recall = recall_score(y_test, y_pred, average='macro')
print("Recall:", recall )

Recall: 0.8966666666666667


In [47]:
# Evaluate the f1 score
f1 = f1_score(y_test, y_pred, average='macro')
print("f1-score:", f1 )

f1-score: 0.9344750516104938


Approach 3 - TF-IDF Vectors and SVM

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the cleaned text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df['EmailText'])

# Convert the TF-IDF matrix to a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

Split the data, train an SVM classifier, and evaluate the performance metrics as in the
previous approaches.

In [15]:
# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.1, max_df=0.9)

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(df['EmailText'])

# Extract labels
y = df['Label']  # Assuming the label column is named 'Label'

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, random_state=42)

# Train an SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = svm_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.8654708520179372


In [22]:
# Predict on the test set with the TF-IDF classifier trained above
# (`pipeline` belongs to the CountVectorizer approaches and was fitted on different features)
y_pred = svm_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.8654708520179372


In [23]:
# Evaluate the macro-averaged recall
recall = recall_score(y_test, y_pred, average='macro')
print("Recall:", recall)


In [24]:
# Evaluate the macro-averaged F1 score
f1 = f1_score(y_test, y_pred, average='macro')
print("f1-score:", f1)



Compare Performance

Compare the accuracy, recall, precision, and F1-score obtained from each approach

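CountVectorizer (uncleaned text)
Accuracy: 0.9748878923766816
Recall (macro): 0.9066666666666667
F1-score (macro): 0.941379258547137

Cleaned text
Accuracy: 0.9721973094170404
Recall (macro): 0.8966666666666667
F1-score (macro): 0.9344750516104938

TF-IDF
Accuracy: 0.8654708520179372

Precision is not listed: precision_score was imported but never called, so accuracy is the only figure available for that row. Recall and F1-score are not listed for TF-IDF: the values 0.9067 and 0.9414 were produced by the CountVectorizer `pipeline`, not by `svm_classifier`, so they describe Approach 1.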

Cleaning the text changed the vocabulary the SVM was trained on, and every score dropped slightly: accuracy from 0.9749 to 0.9722, macro recall from 0.9067 to 0.8967, and macro F1 from 0.9414 to 0.9345. The SVM was retrained on the cleaned text, so this is not a mismatch between training and test data. A more likely explanation is that removing stop words and punctuation and stemming the rest discarded some tokens that helped separate spam from ham.

The TF-IDF model reached an accuracy of 0.8655, well below the first two approaches. The email texts mostly contain different key phrases rather than a large number of frequently shared terms, and with min_df=0.1 the TfidfVectorizer keeps only terms that occur in at least 10% of the messages, which probably left the SVM with too few features to work with.

Analyze and discuss the results, identifying the most effective approach for spam
detection

The approach recommended here for spam detection is Approach 2, where the text is cleaned and preprocessed before vectorizing, even though Approach 1 scored marginally higher on this test split (accuracy 0.9749 against 0.9722). Lowercasing, removing punctuation and stop words, and stemming merge variants of the same word into one feature, so the SVM is trained on a smaller, more consistent vocabulary that should generalise better to upcoming email texts.

TF-IDF as configured here was the least effective approach: email texts vary so much in their words and key phrases that few terms pass the document-frequency thresholds, so weighting by term frequency had little to work with.
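
A comparison like this is easier to keep consistent when every approach is scored by the same helper, so that no metric is computed with the wrong model. The sketch below is one way to do that, assuming a small invented corpus in place of spam.csv; the `evaluate` function and the dictionary keys are hypothetical names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def evaluate(model, text_train, text_test, y_train, y_test):
    """Fit the model, then score its own predictions on the test text."""
    model.fit(text_train, y_train)
    y_pred = model.predict(text_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "f1": f1_score(y_test, y_pred, average="macro"),
    }


# Invented messages standing in for the EmailText and Label columns
texts = [
    "Win a free prize now", "Free entry to the cash draw",
    "Claim your free ringtone today", "Urgent: you have won cash",
    "Call now to claim your prize", "Free tickets, reply WIN now",
    "Are we still meeting for lunch", "I will call you after work",
    "See you at home tonight", "Can you send me the notes",
    "Thanks, talk to you later", "Running late, be there soon",
]
labels = ["spam"] * 6 + ["ham"] * 6

# One split shared by every approach
split = train_test_split(texts, labels, test_size=4, random_state=42, stratify=labels)

approaches = {
    "CountVectorizer": make_pipeline(CountVectorizer(), SVC(kernel="linear")),
    "TF-IDF": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

results = {name: evaluate(model, *split) for name, model in approaches.items()}
for name, scores in results.items():
    print(name, scores)
```

Because each pipeline is fitted and scored inside `evaluate`, the four metrics for an approach always come from the same predictions.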