# Building a spam classifier using Naive Bayes

This notebook follows an exercise in Udacity's Machine Learning Using PyTorch Nanodegree program. We use the Naive Bayes algorithm to build a model that classifies SMS messages as spam or not spam ("ham"). The dataset was originally compiled and posted on the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/00228/).
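For reference, the decision rule behind the classifier can be written out. This is the standard multinomial Naive Bayes formulation with additive (Laplace) smoothing, not something stated in the exercise itself; the symbols $c$ (class), $w_i$ (the $i$-th word of a message with $n$ words), $V$ (vocabulary) and $\alpha$ (smoothing constant) are notation introduced here.

$$
\hat{c} = \arg\max_{c \in \{\text{ham},\, \text{spam}\}} \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right],
\qquad
P(w \mid c) = \frac{\operatorname{count}(w, c) + \alpha}{\sum_{w' \in V} \operatorname{count}(w', c) + \alpha \, |V|}
$$

The "naive" part is the assumption that words are conditionally independent given the class, which is what lets the per-word log probabilities simply add up.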

In [21]:
# Import packages and data
import pandas as pd

df = pd.read_table('C:/Users/lucie/Downloads/smsspamcollection/SMSSpamCollection',sep='\t', header=None, names=['label', 'sms_message'])

# Print the first 5 rows
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Let's convert the label column to binary values: 0 for ham and 1 for spam.

In [22]:
df['label'] = df.label.map({'ham':0, 'spam':1})
df.head()

| | label | sms_message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
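One thing worth checking after a `map` call is that every label was covered, because pandas fills in `NaN` for any value missing from the dictionary. The sketch below uses a small made-up series (the `'Spam'` entry with a capital S is hypothetical, added only to trigger the case) rather than the real dataset.

```python
import pandas as pd

# Toy labels, including one value the mapping does not cover (hypothetical data).
labels = pd.Series(['ham', 'spam', 'ham', 'Spam'])

mapped = labels.map({'ham': 0, 'spam': 1})

# Series.map leaves NaN where the key is missing, so count those rows.
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

A non-zero count would suggest cleaning the labels (for example with `.str.lower()`) before mapping.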


In [23]:
# Split into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
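The split sizes follow from scikit-learn's default of holding out 25% of the rows when no `test_size` is passed. As a quick arithmetic check (a sketch of the rounding as I understand it, with the test size rounded up and the remainder used for training):

```python
import math

n_samples = 5572      # rows in the full dataset, from the output above
test_size = 0.25      # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)   # test set size is rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)
```

This reproduces the 4179 and 1393 rows reported above.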


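Build a word-frequency matrix from the sms_message column using scikit-learn's CountVectorizer. The vectorizer is fitted on the training messages only and then reused to transform the test messages.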

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
X_train = count_vector.fit_transform(X_train)
X_test = count_vector.transform(X_test)
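The cell above calls `fit_transform` on the training messages but only `transform` on the test messages. A minimal sketch of what that implies, using made-up messages (`train_texts`, `test_texts` and `vectorizer` are illustrative names, not part of the notebook): words that appear only in the test data are silently dropped, and both matrices end up with the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ['free prize now', 'see you at lunch']   # hypothetical training messages
test_texts = ['free lunch tomorrow']                    # 'tomorrow' never appears in training

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)   # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)         # reuses it, learns nothing new

print(sorted(vectorizer.vocabulary_))
print(test_matrix.toarray())
```

Keeping the vocabulary fixed this way means the classifier sees the same feature layout at training and prediction time, and no information from the test set leaks into the features.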

Aside: Alternatively, we can implement bag of words manually and then use CountVectorizer to convert the data into a matrix, as follows. We use a small document set as an example.

In [25]:
documents = ['Hello, how are you!', 'Win money, win from home.','Call me now.', 'Hello, Call hello you tomorrow?']

# Convert strings into lower case.
lower_case_documents = []
for i in documents:
    lower_case_documents.append(i.lower())

# Remove punctuation
import string
sans_punctuation_documents = []
for i in lower_case_documents:
    sans_punctuation_documents.append(i.translate(str.maketrans('', '', string.punctuation)))

# Tokenize the strings
preprocessed_documents = []
for i in sans_punctuation_documents:
    preprocessed_documents.append(i.split(' '))

# Count word frequencies per document
from collections import Counter
frequency_list = []
for i in preprocessed_documents:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)

# Use a separate vectorizer here so the one fitted on the training set is not overwritten
example_vector = CountVectorizer()
example_vector.fit(documents)
doc_array = example_vector.transform(documents).toarray()
frequency_matrix = pd.DataFrame(doc_array)
frequency_matrix

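| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |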


In [28]:
# Implement Naive Bayes using scikit-learn
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
y_pred = naive_bayes.predict(X_test)
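To make the fitted model less of a black box, here is a minimal from-scratch sketch of multinomial Naive Bayes on a tiny made-up corpus. It is not scikit-learn's implementation, only an illustration of the same idea under stated assumptions: the training data, `log_posterior` and `predict` are hypothetical, and `alpha = 1.0` mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, label) with 0 = ham, 1 = spam.
train = [
    (['win', 'money', 'now'], 1),
    (['win', 'free', 'prize'], 1),
    (['call', 'me', 'now'], 0),
    (['see', 'you', 'at', 'home'], 0),
]

alpha = 1.0  # Laplace smoothing constant

# Count documents and words per class.
class_docs = Counter(label for _, label in train)
word_counts = {0: Counter(), 1: Counter()}
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def log_posterior(tokens, label):
    """Unnormalised log P(label | tokens) under the naive independence assumption."""
    total = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))  # log prior
    for w in tokens:
        if w in vocab:  # words never seen in training are skipped
            score += math.log((word_counts[label][w] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    """Return the label with the higher log posterior."""
    return max((0, 1), key=lambda label: log_posterior(tokens, label))

print(predict(['win', 'free', 'money']))
print(predict(['call', 'you', 'at', 'home']))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is unseen in one class from forcing that class's probability to zero.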

In [30]:
# Evaluating the model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy score: ', accuracy_score(y_test, y_pred))
print('Precision score: ', precision_score(y_test, y_pred))
print('Recall score: ', recall_score(y_test, y_pred))
print('F1 score: ', f1_score(y_test, y_pred))

Accuracy score:  0.9885139985642498
Precision score:  0.9720670391061452
Recall score:  0.9405405405405406
F1 score:  0.9560439560439562
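To see how the four scores relate, they can be recomputed from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are inferred from the reported scores and the 1393-row test set and should be treated as an assumption: 174 spam messages caught, 5 ham messages flagged as spam, and 11 spam messages missed.

```python
# Counts inferred from the reported scores, not taken from notebook output.
tp, fp, fn = 174, 5, 11          # true positives, false positives, false negatives
tn = 1393 - tp - fp - fn         # everything else in the test set is a true negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)       # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)          # of the actual spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Because ham heavily outnumbers spam, accuracy alone looks flattering; precision and recall show more directly how the classifier treats the minority spam class.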
