# Introduction
In this project, we'll use logistic regression to classify SMS messages as spam or not spam (ham). The dataset comes from Kaggle: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset/data

## Preprocessing

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [31]:
data = pd.read_csv('../data/spam.csv', encoding='latin-1')
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


Ham is the label for messages that aren't spam. Column v1 holds the label assigned to each message, and v2 holds the message text. Let's see what the Unnamed columns contain.

In [32]:
data.describe()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


In [33]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


The three Unnamed columns are almost entirely empty (50, 12, and 6 non-null values out of 5572 rows), so we are going to drop them.

In [34]:
data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
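
As a rough illustration of the reasoning behind this drop, the sketch below measures the share of missing values per column and drops the columns that are mostly empty. The toy DataFrame and the 0.5 threshold are assumptions made for the example; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw data: one mostly empty extra column
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["a", "b", "c", "d"],
    "Unnamed: 2": [None, None, "x", None],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Columns where more than half of the values are missing (threshold is an assumption)
sparse_columns = missing_share[missing_share > 0.5].index.tolist()

# Drop them without modifying the original frame
cleaned = toy.drop(columns=sparse_columns)
print(sparse_columns)
print(cleaned.columns.tolist())
```

Returning a new DataFrame instead of using inplace=True keeps the raw data available for comparison; either style works here.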

Let's also rename the columns and encode the values of v1 as numbers, so they are easier to understand and to work with: ham = 0, spam = 1.

In [35]:
data.rename(columns={'v1': 'label', 'v2': 'message'}, inplace=True)
data['label'] = data['label'].map({'ham': 0, 'spam': 1})
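
One thing worth knowing about Series.map with a dictionary is that any value missing from the dictionary becomes NaN. A minimal sketch of a sanity check, using a made-up Series with a deliberately misspelled label (the names toy_labels and unmapped_count are illustrative only):

```python
import pandas as pd

# Hypothetical labels; "Ham" (capitalised) is not a key in the mapping
toy_labels = pd.Series(["ham", "spam", "Ham"])
encoded = toy_labels.map({"ham": 0, "spam": 1})

# Values without a matching key come back as NaN
unmapped_count = int(encoded.isna().sum())
print(unmapped_count)
```

A count of zero on the real column would confirm that every row was labeled either ham or spam.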

In [36]:
data.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Since our data is text, we need to convert it into a numeric format. To achieve that, we'll use CountVectorizer (from scikit-learn), which turns each message into a vector with one column for each unique word found across all messages. The value in each column is the number of times that word appears in the message.

In [37]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(data['message'])
y = data['label']

X

<5572x8672 sparse matrix of type '<class 'numpy.int64'>'
	with 73916 stored elements in Compressed Sparse Row format>

vectorizer.fit_transform returns a sparse matrix (one in which most elements are 0). The train_test_split function from sklearn, which we use to split our dataset, accepts sparse matrices directly.
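
To get a feel for how sparse this matrix is, the figures in the summary above can be plugged into a quick calculation. This is plain arithmetic on the printed shape and stored-element count, with no assumptions beyond treating every stored element as a non-zero count.

```python
# Shape and stored-element count copied from the sparse matrix summary above
n_rows, n_cols = 5572, 8672
stored_elements = 73916

# Total number of cells a dense matrix of this shape would hold
total_cells = n_rows * n_cols

# Share of cells that actually hold a value
density = stored_elements / total_cells
print(f"{density:.4%} of the cells are stored")
```

Only about 0.15% of the cells hold a value, which is why storing the matrix in a sparse format saves so much memory.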

## Building our Model and Evaluation

In [38]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
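
Note that the call above does not pass random_state, so the split (and the exact scores that follow) will differ between runs. Below is a minimal sketch of a reproducible, stratified split on made-up data. The seed 42, the toy arrays, and the choice to stratify are assumptions for illustration, not what the notebook did.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 16 ham (0) and 4 spam (1)
toy_X = np.arange(20).reshape(-1, 1)
toy_y = np.array([0] * 16 + [1] * 4)

# Same seed twice: both calls should return identical splits
split_a = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# stratify keeps the ham/spam ratio the same in train and test
toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_a
spam_in_test = int(toy_y_test.sum())
print(same_split, len(toy_y_test), spam_in_test)
```

Stratifying is optional, but with only a small share of spam it prevents a test set that happens to contain unusually few spam messages.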

In [39]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

In [40]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.9838565022421525


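Our predictions seem to be very good on unseen data. Accuracy alone can be misleading, though, because most messages in the dataset are ham, so let's also look at per-class precision and recall and at the confusion matrix.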
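
To put the accuracy figure in context, it helps to compare it with a trivial baseline that labels every message as ham. The sketch below computes that baseline from the counts in the data.describe() output; it uses the full dataset rather than the test split, so treat it as an approximation.

```python
# Counts taken from the data.describe() output: 4825 of 5572 messages are ham
total_messages = 5572
ham_messages = 4825

# Accuracy of a classifier that always predicts ham
baseline_accuracy = ham_messages / total_messages
print(f"Always-ham baseline accuracy: {baseline_accuracy:.3f}")
```

The model's accuracy of roughly 0.98 is well above this baseline of roughly 0.87, but the gap is smaller than the raw number suggests.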

In [41]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99       974
           1       0.98      0.89      0.93       141

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115



In [42]:
print(confusion_matrix(y_test, y_pred))

[[971   3]
 [ 15 126]]


Rows are the true labels and columns are the predicted labels, so this states that:
- 971 'non-spam/ham' messages were correctly classified as 'non-spam/ham'
- 126 'spam' messages were correctly classified as 'spam'
- only 3 'non-spam/ham' messages were wrongly labeled as 'spam'
- 15 'spam' messages were wrongly labeled as 'non-spam/ham'
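
The figures in the classification report can be recomputed by hand from this matrix. The sketch below relies on scikit-learn's layout for confusion_matrix (true labels in rows, predicted labels in columns, classes in sorted order) and treats spam (1) as the positive class.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# with classes ordered 0 (ham), 1 (spam)
matrix = [[971, 3],
          [15, 126]]
(tn, fp), (fn, tp) = matrix

# Of the messages predicted as spam, how many really were spam
precision = tp / (tp + fp)
# Of the real spam messages, how many were caught
recall = tp / (tp + fn)
# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
# Share of all messages classified correctly
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The results match the report's row for class 1 (0.98, 0.89, 0.93) and the accuracy printed earlier. The lower recall shows that missed spam is the model's main weakness.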

Let's try it out with a couple of custom messages.

In [45]:
message = "Congratulations!!! You have won a lottery ticket worth $1,000,000. Please click the link below to claim your prize."
result = model.predict(vectorizer.transform([message]))
# predict returns an array with one label per message; we passed a single message
if result[0] == 1:
    print("Spam")
else:
    print("Ham")

Spam


In [46]:
message = "How are you doing? Please, let me know about our meeting tomorrow."
result = model.predict(vectorizer.transform([message]))
# predict returns an array with one label per message; we passed a single message
if result[0] == 1:
    print("Spam")
else:
    print("Ham")

Ham
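
The two cells above repeat the same predict-and-print logic. One possible way to package it is a small helper. In the sketch below, the classify_message function, the toy_pipeline name, and the four training messages are all hypothetical and exist only so the example runs on its own; its predictions are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = spam, 0 = ham), for illustration only
toy_messages = [
    "win a free prize now",
    "free lottery ticket claim now",
    "are we meeting tomorrow",
    "call me when you get home",
]
toy_labels = [1, 1, 0, 0]

# Chaining the vectorizer and the model lets us pass raw text to predict
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipeline.fit(toy_messages, toy_labels)


def classify_message(text, pipeline):
    """Return "Spam" or "Ham" for a single raw text message."""
    prediction = pipeline.predict([text])[0]
    return "Spam" if prediction == 1 else "Ham"


label = classify_message("claim your free prize", toy_pipeline)
print(label)
```

With the notebook's own objects, the same idea would apply by building the pipeline from the fitted vectorizer and model, or by calling vectorizer.transform inside the helper.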
