# Introduction

In this project, I'll be building a spam filter for SMS messages using the Naive Bayes algorithm.

To classify messages as spam or non-spam, the algorithm:

1. Learns how humans classify messages.

2. Uses that human knowledge to estimate probabilities for new messages: the probability that a message is spam and the probability that it is non-spam.

3. Classifies a new message based on those probability values: if the probability for spam is greater, the message is classified as spam, and vice versa.

To teach the algorithm, I'll be using a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

A text that is spam is labelled 'spam' in this dataset, and a text that is not spam is labelled 'ham'.

Our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%.

# Reading and Exploring the Dataset

In [1]:
import pandas as pd
import re

In [2]:
sms_df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
sms_df

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [3]:
# Percentage of messages that are ham and spam
sms_df['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

Around 13% of the messages are spam and 87% are ham. Before designing the spam filter, let's split the data into a training set and a test set, so that we can measure how well the filter classifies messages it has not seen during training.
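The class balance also gives a useful reference point: a filter that labels every message as ham would already be right about 87% of the time on this data, so any accuracy figure is best read against that baseline. A minimal sketch, assuming a small made-up list of labels (`labels` and `majority_baseline` are illustrative and not part of the project code):

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical toy labels with roughly the same imbalance as the dataset
labels = ['ham'] * 87 + ['spam'] * 13
label, accuracy = majority_baseline(labels)
print(label, accuracy)
```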

# Splitting the Data into a Training and Test Set

We'll begin by shuffling the whole dataset so that spam and ham messages are spread randomly throughout it, and then split the data 80/20 into a training set and a test set respectively.

In [4]:
# Shuffle all rows; random_state=1 makes the shuffle reproducible
sms_df = sms_df.sample(frac=1, random_state=1)

In [5]:
# The first 1,114 rows (about 20% of 5,572) form the test set
sms_df_test = sms_df[:1114]
sms_df_test = sms_df_test.reset_index(drop=True)
sms_df_test.shape

(1114, 2)

In [6]:
# The remaining 4,458 rows (about 80%) form the training set
sms_df_train = sms_df[1114:]
sms_df_train = sms_df_train.reset_index(drop=True)
sms_df_train.shape

(4458, 2)

In [7]:
sms_df_test['Label'].value_counts(normalize=True) * 100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

In [8]:
sms_df_train['Label'].value_counts(normalize=True) * 100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

The percentages of ham and spam in both the training set and the test set are similar to those in the full dataset, so both sets are representative of the data as a whole.
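The split above hard-codes the row count 1114. A minimal sketch of the same shuffle-then-slice idea with the test size computed from the data, assuming a toy DataFrame (`toy_df`, `shuffle_and_split`, and `test_fraction` are illustrative names, not part of the project):

```python
import pandas as pd

def shuffle_and_split(df, test_fraction=0.2, seed=1):
    """Shuffle all rows, then slice off the first test_fraction as the test set."""
    shuffled = df.sample(frac=1, random_state=seed)
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test].reset_index(drop=True)
    train = shuffled[n_test:].reset_index(drop=True)
    return train, test

# Hypothetical toy data: 10 messages, 2 of them spam
toy_df = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})
train, test = shuffle_and_split(toy_df)
print(train.shape, test.shape)  # (8, 2) (2, 2)
```

With 5,572 rows and `test_fraction=0.2`, `n_test` comes out at 1,114, which matches the slice used above.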

# Cleaning the Dataset

To make the Naive Bayes calculation easier, we want to transform the training set so that each row records how many times each word appears in that message. We'll do that by splitting the SMS column of every row into words. First, we'll remove all punctuation and convert every message to lower case.

In [9]:
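# regex=True is required in recent pandas versions, where str.replace treats the pattern as a literal string by default
sms_df_train['SMS'] = sms_df_train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()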

sms_df_train.head(10)

Unnamed: 0,Label,SMS
0,ham,yeah do don t stand to close tho you ll catc...
1,ham,hi where are you we re at and they re not ...
2,ham,if you r home then come down within 5 min
3,ham,when re you guys getting back g said you were...
4,ham,tell my bad character which u dnt lik in me ...
5,ham,i m leaving my house now
6,ham,hey they r not watching movie tonight so i ll ...
7,ham,s da al r above lt gt
8,spam,camera you are awarded a sipix digital camer...
9,ham,fyi i m gonna call you sporadically starting a...
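
To see what this step does to a single message, here is a small sketch using the standard `re` module on a made-up string (`clean_message` is an illustrative helper, not part of the project). Note that `\W` also replaces apostrophes, which is why contractions such as "don't" show up as "don t" in the table above.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r'\W', ' ', text).lower()

cleaned = clean_message("Don't MISS out!!")
print(cleaned.split())  # ['don', 't', 'miss', 'out']
```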


Next, we need to create a list containing all of the unique words in the training set. This list is known as the vocabulary, and it will be an important part of the Naive Bayes calculation.

In [10]:
sms_df_train['SMS'] = sms_df_train['SMS'].str.split()

In [11]:
# Collect every word from every tokenised message, then drop duplicates
vocabulary = []
for row in sms_df_train['SMS']:
    for word in row:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))
len(vocabulary)

7753

There are 7,753 unique words in the training set. Now we're ready to transform our data!

In [12]:
# Initialise a dictionary where each key is a unique word and each value is a list of zeros, one per message in the training set
word_counts_per_sms = {unique_word: [0] * len(sms_df_train['SMS']) for unique_word in vocabulary}

# For each message, add 1 to the count of every word it contains, at that message's index
for index, sms in enumerate(sms_df_train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [13]:
words_df = pd.DataFrame(word_counts_per_sms)
words_df

Unnamed: 0,0,00,000,008704050406,0121,01223585236,01223585334,0125698789,02,0207,...,zindgi,zoe,zoom,zouk,zyada,èn,é,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
train_df_clean = pd.concat([sms_df_train, words_df], axis=1)
train_df_clean.head(5)

Unnamed: 0,Label,SMS,0,00,000,008704050406,0121,01223585236,01223585334,0125698789,...,zindgi,zoe,zoom,zouk,zyada,èn,é,ü,〨ud,鈥
0,ham,"[yeah, do, don, t, stand, to, close, tho, you,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[hi, where, are, you, we, re, at, and, they, r...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[if, you, r, home, then, come, down, within, 5...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[when, re, you, guys, getting, back, g, said, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[tell, my, bad, character, which, u, dnt, lik,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
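
The same transformation on a tiny made-up corpus may make the shape of the result easier to see. This sketch assumes already-tokenised messages and uses `collections.Counter` rather than the pre-filled dictionary above (`toy_messages` and `count_words` are illustrative names):

```python
from collections import Counter

def count_words(messages):
    """Return the sorted vocabulary and one row of word counts per message."""
    vocabulary = sorted({word for message in messages for word in message})
    rows = []
    for message in messages:
        counts = Counter(message)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

# Hypothetical tokenised messages
toy_messages = [['win', 'money', 'win'], ['see', 'you', 'soon']]
vocabulary, rows = count_words(toy_messages)
print(vocabulary)  # ['money', 'see', 'soon', 'win', 'you']
print(rows)        # [[1, 0, 0, 2, 0], [0, 1, 1, 0, 1]]
```

Each row corresponds to one message and each column to one vocabulary word, which is the same layout as the word-count table above.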


# Calculating Factors in the Naive Bayes Equation

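Now that we have a cleaned training set, we can begin creating the spam filter. To begin with, we'll calculate some of the factors in the Naive Bayes equations:

* P(Spam) and P(Ham): the proportions of spam and ham messages in the training set

* N<sub>Spam</sub>, N<sub>Ham</sub>, and N<sub>Vocabulary</sub>: the total number of words in all spam messages, the total number of words in all ham messages, and the number of unique words in the vocabulary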

These factors will retain a constant value in the equations regardless of the message or word.
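
For reference, the equations these factors belong to can be written as follows. This is the standard multinomial Naive Bayes form with additive (Laplace) smoothing, and it matches the calculations in the code cells of this notebook. Here N<sub>w<sub>i</sub>|Spam</sub> denotes the number of times the word w<sub>i</sub> appears in spam messages, and α is the smoothing parameter (`alpha` in the code).

```latex
\begin{aligned}
P(Spam \mid w_1, w_2, \dots, w_n) &\propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam) \\
P(Ham \mid w_1, w_2, \dots, w_n) &\propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham) \\
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i \mid Ham) &= \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
```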


In [15]:

#Isolating Spam Messages and Ham Messages
spam_messages = train_df_clean[train_df_clean['Label'] == 'spam']
ham_messages = train_df_clean[train_df_clean['Label'] == 'ham']

#Calculating P(Spam) and P(Ham)
p_spam = len(spam_messages)/len(train_df_clean)
p_ham = len(ham_messages)/len(train_df_clean)

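#Calculating NSpam, NHam, and NVocabulary
#Each SMS is a list of words at this point, so len() gives the word count per message
n_spam = spam_messages['SMS'].apply(len).sum()
n_ham = ham_messages['SMS'].apply(len).sum()

n_vocabulary = len(vocabulary)

#Laplace smoothing parameter
alpha = 1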




# Calculating Parameters in the Naive Bayes Equation

Now we're ready to calculate the conditional probability of each word given that the message is spam or ham: P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham). Both are computed with Laplace smoothing, using the `alpha` value set above.

In [16]:
#initialising dictionaries for parameters
spam_params = {unique_word: 0 for unique_word in vocabulary}
ham_params = {unique_word: 0 for unique_word in vocabulary}

#calculating parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / ((n_spam) + (alpha * n_vocabulary))
    spam_params[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / ((n_ham) + (alpha * n_vocabulary))
    ham_params[word] = p_word_given_ham
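
To illustrate why `alpha` matters, the sketch below applies the same formula to made-up counts (`smoothed_probability` and all the numbers are illustrative). A word that never appears in spam still gets a small non-zero probability, so a single unseen word cannot zero out the whole product.

```python
def smoothed_probability(word_count, n_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (word_count + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts: 20 words across all spam messages, vocabulary of 10 words
unseen = smoothed_probability(0, n_class=20, n_vocabulary=10)               # 1/30
frequent = smoothed_probability(5, n_class=20, n_vocabulary=10)             # 6/30
unsmoothed = smoothed_probability(0, n_class=20, n_vocabulary=10, alpha=0)  # 0/20

print(unseen, frequent, unsmoothed)
```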

# Creating the Spam Filter

Now that all the constants and parameters are calculated, we can create the spam filter.

It can be understood as a function that:

* Takes as input a new message (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)

* Calculates P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>)

* Compares the above values, and:
    * If P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) the message is classified as ham.
    * If P(Spam|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) > P(Ham|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) the message is classified as spam.
    * If the values are equal, the algorithm may request human help.

In [17]:
def classify(message):
    # Clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the prior probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply by the probability of each word; words that are not in the
    # vocabulary are ignored
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]

        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [18]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.2784957584472927e-25
P(Ham|message): 2.5841428475044265e-27
Label: Spam


In [19]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 4.774748444294843e-25
P(Ham|message): 3.455584370145657e-21
Label: Ham


It seems as if the filter has classified these messages correctly!
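
The probabilities printed above are already around 1e-25 for short messages. For much longer messages, multiplying many small factors could underflow to 0.0 in floating point. A common workaround, which is not used in this project, is to add logarithms instead of multiplying. A minimal sketch with made-up parameters (`toy_spam_params`, `toy_ham_params`, the priors, and `classify_log` are all illustrative):

```python
import math
import re

# Hypothetical priors and per-word probabilities, not taken from the dataset
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_params = {'win': 0.05, 'money': 0.04, 'see': 0.001, 'you': 0.01}
toy_ham_params = {'win': 0.001, 'money': 0.002, 'see': 0.02, 'you': 0.03}

def classify_log(message):
    """Compare log P(Spam) + sum(log P(w|Spam)) with the ham equivalent."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_spam_params:
            log_spam += math.log(toy_spam_params[word])
        if word in toy_ham_params:
            log_ham += math.log(toy_ham_params[word])
    if log_ham > log_spam:
        return 'ham'
    if log_spam > log_ham:
        return 'spam'
    return 'needs human classification'

print(classify_log('WIN money!!'))  # spam
print(classify_log('See you'))      # ham
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products.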

# Testing the Accuracy of Our Spam Filter

Now we'll use the spam filter on our test set to see how well it classifies the messages, and to see whether it achieves our target of 80% or higher accuracy.

We'll begin by modifying the function above to return the label rather than print it, so that we can add a new column to the test set containing the algorithm's prediction for each message.

In [28]:
def classify_test_set(message):
    # Clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the prior probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # Multiply by the probability of each word; words that are not in the
    # vocabulary are ignored
    for word in message:
        if word in spam_params:
            p_spam_given_message *= spam_params[word]

        if word in ham_params:
            p_ham_given_message *= ham_params[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Equal probabilities, have a human classify this!'

In [29]:
sms_df_test['predicted'] = sms_df_test['SMS'].apply(classify_test_set)
sms_df_test.head(20)

Unnamed: 0,Label,SMS,predicted
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham
5,ham,Ok i thk i got it. Then u wan me 2 come now or...,ham
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...,ham
7,ham,No dear i was sleeping :-P,ham
8,ham,Ok pa. Nothing problem:-),ham
9,ham,Ill be there on &lt;#&gt; ok.,ham


We'll measure the accuracy of the spam filter by calculating the number of correctly classified messages divided by the total number of classified messages.

In [38]:
correct = 0
total = sms_df_test.shape[0]

# Count the rows where the predicted label matches the human label
for _, row in sms_df_test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy (%):', correct/total * 100)


Correct: 1101
Incorrect: 13
Accuracy (%): 98.8330341113106


The spam filter classified 98.8% of the test-set messages correctly (1,101 out of 1,114), which is very good!
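
The same accuracy figure can also be computed without an explicit loop, and listing the misclassified rows is a natural next step for seeing where the filter goes wrong. A minimal sketch on a made-up results table (`toy_results` is illustrative; the column names follow the ones used above):

```python
import pandas as pd

# Hypothetical predictions for four messages
toy_results = pd.DataFrame({
    'Label': ['ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'spam', 'spam', 'spam'],
})

is_correct = toy_results['Label'] == toy_results['predicted']
accuracy = is_correct.mean() * 100          # share of matching rows, in percent
misclassified = toy_results[~is_correct]    # rows where the prediction differs

print('Accuracy (%):', accuracy)  # 75.0
print(misclassified)
```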

# Conclusion

We successfully created a spam filter from the training data and applied it to the test set. The filter correctly classified 98.8% of the test messages, far exceeding our 80% target.