<a href="https://colab.research.google.com/github/jackson-trader/CS462-Email-Spam-Classifier/blob/main/CS462_Email_Spam_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spam Likely | An Email Spam Classifier**

#### By Jackson Trader, Diego Lara, Mark Pack, & Alex Bryant
#### CS 46200 - Introduction To Artificial Intelligence

##### [Dataset from Kaggle](https://www.kaggle.com/datasets/ashfakyeafi/spam-email-classification)
##### [Link to GitHub repository](https://github.com/jackson-trader/CS462-Email-Spam-Classifier)


---

## **Import Libraries**

In [89]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split # To split data into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer # TF-IDF

## **Load the Data**

ham = legitimate email <br>
spam = spam email

In [77]:
# Loading the data from csv file on GitHub to a pandas DataFrame
raw_df = pd.read_csv('https://raw.githubusercontent.com/jackson-trader/CS462-Email-Spam-Classifier/refs/heads/main/email_spam_data.csv') # DataFrame
raw_df

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


Inspect the shape, columns, and class distribution of the data.

In [78]:
print('Shape: ', raw_df.shape)
print('Columns: ', raw_df.columns.tolist())

print('\nRatio between ham and spam: ')
print(raw_df['Category'].value_counts())

# The classes are imbalanced: ham outnumbers spam by roughly 6.5 to 1
# A ratio closer to 1:1 would be preferable for training

Shape:  (5573, 2)
Columns:  ['Category', 'Message']

Ratio between ham and spam: 
Category
ham               4825
spam               747
{"mode":"full"       1
Name: count, dtype: int64


## **Preprocess the Data**

In [79]:
df = raw_df.copy()

# Drop the rows with no message or category
df = df.dropna(subset=['Message', 'Category'])
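
The class counts printed earlier include one row whose `Category` is neither `ham` nor `spam`, and `dropna` alone does not remove a row like that when both of its fields are non-empty. A minimal sketch of one way to filter such rows, shown on a small made-up DataFrame (the sample rows and the `VALID_LABELS` name are assumptions, not part of the notebook):

```python
import pandas as pd

# Hypothetical sample that mimics the structure of the dataset
sample_df = pd.DataFrame({
    'Category': ['ham', 'spam', '{"mode":"full"', 'ham', None],
    'Message': ['See you at 5', 'Win a free prize now', 'not a real message', 'Ok', 'No label here'],
})

VALID_LABELS = ['ham', 'spam']  # assumed set of valid labels

# Drop rows with missing values, then keep only rows with a known label
clean_df = sample_df.dropna(subset=['Message', 'Category'])
clean_df = clean_df[clean_df['Category'].isin(VALID_LABELS)].reset_index(drop=True)

print(clean_df['Category'].value_counts())
```

Applying the same `isin` filter to `df` would change the dataset size, so the split sizes and accuracy scores shown later would need to be regenerated.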

## **Splitting the Data**

Split the data into training and test sets: <br>
The training set contains 80% of the data <br>
The test set contains 20% of the data

In [80]:
X = df['Message'] # Independent variable
y = df['Category'] # Dependent variable (What we are predicting)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.2,
    random_state = 100
)

print('Training set size: ', X_train.shape[0])
print('Testing set size: ', X_test.shape[0])

Training set size:  4458
Testing set size:  1115
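
Because ham heavily outnumbers spam, a purely random split can leave the two sets with slightly different spam proportions. A hedged sketch of a stratified split using the `stratify` parameter of `train_test_split`, run on made-up labels (the toy data is an assumption, and the notebook itself does not stratify):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with an imbalance: 80 ham, 20 spam (made up)
messages = [f"message {i}" for i in range(100)]
labels = ['ham'] * 80 + ['spam'] * 20

msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels,
    test_size=0.2,
    random_state=100,
    stratify=labels,  # keep the ham/spam ratio the same in both sets
)

print('Train:', Counter(lab_train))
print('Test: ', Counter(lab_test))
```

In the notebook, the equivalent change would be passing `stratify = y` to the existing call.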


## **Feature Extraction**

In [81]:
# Transform the message data to feature vectors that can be used as input into the logistic regression

feature_extraction = TfidfVectorizer(
    min_df = 2, # Ignore terms that appear in fewer than 2 messages
    stop_words = 'english',
    lowercase = True
)

X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)

In [None]:
print(X_train_features)

## **Train the Model**

### **Logistic Regression**

In [87]:
from sklearn.linear_model import LogisticRegression # Logistic Regression Model

model = LogisticRegression()

In [98]:
# Training the Logistic Regression model with the training data
model.fit(X_train_features, y_train)

## **Evaluate the Trained Model**

In [94]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions on the training data

y_train_pred = model.predict(X_train_features)
training_data_accuracy = accuracy_score(y_train, y_train_pred)

In [95]:
print('Accuracy score on training data: ', training_data_accuracy)

Accuracy score on training data:  0.9816061013907582


In [96]:
# Make predictions on the test data
y_test_pred = model.predict(X_test_features)
test_data_accuracy = accuracy_score(y_test, y_test_pred)

In [97]:
print('Accuracy score on test data: ', test_data_accuracy)

Accuracy score on test data:  0.9775784753363229
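
Accuracy is easier to interpret next to a baseline. Using the class counts printed earlier, a classifier that always predicts `ham` would already score about 87% on the full dataset (the figure on the test split may differ slightly). A quick check of that arithmetic:

```python
# Class counts taken from the value_counts() output earlier in the notebook
ham_count = 4825
spam_count = 747
other_count = 1  # the single malformed row

total = ham_count + spam_count + other_count

# Accuracy of a classifier that always predicts the majority class (ham)
baseline_accuracy = ham_count / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```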


## **Testing Out New Messages**

In [99]:
new_messages = [
    ""
]
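
A sketch of how new messages could be scored follows. New text must pass through the same fitted vectorizer via `transform`, not `fit_transform`, before reaching the model. The training texts and example messages are made up for illustration, and the `toy_` names are hypothetical stand-ins for the notebook's `feature_extraction` and `model`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set so the example runs on its own
toy_texts = [
    "win a free prize now", "free cash claim now", "urgent free offer call now",
    "are we still meeting for lunch", "see you at home tonight", "thanks for the notes today",
]
toy_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']

toy_vectorizer = TfidfVectorizer(lowercase=True)
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

# Hypothetical new messages
example_messages = [
    "claim your free prize",
    "lunch tonight?",
]

# Reuse the fitted vocabulary: transform only, never fit again
example_features = toy_vectorizer.transform(example_messages)

predictions = toy_model.predict(example_features)
probabilities = toy_model.predict_proba(example_features)

for message, label, probs in zip(example_messages, predictions, probabilities):
    # Columns of predict_proba follow the order of toy_model.classes_
    scores = {cls: round(float(p), 3) for cls, p in zip(toy_model.classes_, probs)}
    print(f"{message!r} -> {label} {scores}")
```

In the notebook, the same pattern would be `feature_extraction.transform(new_messages)` followed by `model.predict(...)` on the result.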

In [None]:
# Predictions on the training and test sets, using the TF-IDF features
y_train_pred = model.predict(X_train_features)
y_test_pred = model.predict(X_test_features)

print("Train accuracy:", accuracy_score(y_train, y_train_pred))
print("Test accuracy :", accuracy_score(y_test, y_test_pred))
print("\nClassification report (test):\n", classification_report(y_test, y_test_pred))
print("Confusion matrix (test):\n", confusion_matrix(y_test, y_test_pred))
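
To help read the confusion matrix, here is a small sketch with made-up labels. Rows are the true classes and columns are the predicted classes, in the order given by `labels`. Passing `labels` explicitly is an optional choice here, not something the notebook does:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels
true_labels = ['ham', 'ham', 'ham', 'spam', 'spam']
pred_labels = ['ham', 'ham', 'spam', 'spam', 'ham']

matrix = confusion_matrix(true_labels, pred_labels, labels=['ham', 'spam'])
print(matrix)
# Row 0: true ham  -> 2 predicted ham, 1 predicted spam
# Row 1: true spam -> 1 predicted ham, 1 predicted spam

# Precision and recall for the spam class
spam_precision = precision_score(true_labels, pred_labels, pos_label='spam')
spam_recall = recall_score(true_labels, pred_labels, pos_label='spam')
print(spam_precision, spam_recall)
```

For a spam filter, the top-right cell (ham predicted as spam) is usually the costliest error, since it hides a legitimate message.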