# Building a Spam Filter with Naive Bayes

So far we have covered the theory of classifying messages as spam or non-spam using probability, and in particular how the Naive Bayes algorithm does this. The goal of this project is to put that theory into practice on a real dataset.

We use a dataset of 5,572 SMS messages put together by Tiago A. Almeida and José María Gómez Hidalgo. We want to "teach" the computer to classify these messages using the Naive Bayes algorithm.

# Exploring the data

Our first order of business is to read in the data and take a look at what we are working with.

In [1]:
import pandas as pd
sms = pd.read_csv("SMSSpamCollection", sep='\t',header=None, names=['Label','SMS'])

In [2]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
sms['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

Across the entire dataset, roughly 13% of messages are classified as spam and 87% as ham (non-spam).

# Splitting the data

Now that we have taken a look at the data, we can move on to building the spam filter. 

But before doing that, we need a way to test how well it works, so that once the spam filter is built we can measure how well it classifies new messages.

We start by splitting the data into a training set and a test set.
We use the training set to "train" the computer to classify messages.
We use the test set to see how well the spam filter classifies new messages.

With the data we have, we will use about 80% for the training set (4,458 messages) and the remaining 20% for the test set (1,114 messages).

Our goal for the project is to create a spam filter that classifies new messages with an accuracy greater than 80%.
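
As a point of reference for the 80% goal, the class proportions reported earlier imply the accuracy of a trivial filter that labels every message as ham. The snippet below is only an arithmetic sketch using the percentages printed above; the variable names are illustrative.

```python
# Class proportions from the value_counts output above (percent).
ham_pct = 86.593683
spam_pct = 13.406317

# A filter that always answers 'ham' is right on every ham message
# and wrong on every spam message.
baseline_accuracy = ham_pct / (ham_pct + spam_pct)
print(f"Always-ham baseline accuracy: {baseline_accuracy:.1%}")
```

An accuracy above 80% is therefore most informative when it also clears this baseline of roughly 86.6%.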

In [5]:
sms_rand = sms.sample(frac=1,random_state=1)

In [6]:
training_set = sms_rand.sample(frac=0.80,random_state=1,replace=False)
test_set = sms_rand.sample(frac=0.20, random_state=1,replace=False)

training_set = training_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)

In [7]:
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [8]:
training_set['Label'].value_counts(normalize=True)*100

ham     86.675639
spam    13.324361
Name: Label, dtype: float64

In [9]:
test_set['Label'].value_counts(normalize=True)*100

ham     86.175943
spam    13.824057
Name: Label, dtype: float64

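So far we have shuffled the dataset and drawn a training set containing 80% of the data and a test set containing 20% of the data.
Both sets have a distribution of spam and non-spam (ham) comparable to the original dataset.

Note that both samples are drawn from sms_rand with the same random_state, so the two sets are not disjoint: the rows in test_set also appear in training_set (the first rows of both sets, shown in later cells, are the same messages). Accuracy measured on this test set will therefore be optimistic compared with truly unseen messages.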
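
A split in which no message lands in both sets can be sketched by shuffling once and slicing at a cut-off index. The snippet below is a minimal illustration on row indices only, using the standard library; `split_index` and the other names are assumptions, not part of the notebook.

```python
import random

n_messages = 5572                      # size of the dataset
indices = list(range(n_messages))

rng = random.Random(1)                 # fixed seed for reproducibility
rng.shuffle(indices)                   # shuffle once

split_index = round(n_messages * 0.8)  # 4458 rows for training
train_idx = indices[:split_index]
test_idx = indices[split_index:]

# The two index sets share no rows and together cover the dataset.
overlap = set(train_idx) & set(test_idx)
print(len(train_idx), len(test_idx), len(overlap))  # 4458 1114 0
```

With pandas, the same idea would be `sms_rand.iloc[:split_index]` for the training set and `sms_rand.iloc[split_index:]` for the test set.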

# Prepping the training set data

The next step is to use the training set to teach the algorithm to classify new messages.

We do this using the Naive Bayes algorithm, which is based on the number of times each word is used in each SMS. We therefore need to clean the data to get it into the right format (lower case, no punctuation).

We will then add a series of new columns alongside the SMS column, each corresponding to a unique word from the vocabulary. Each row represents one message, and each column shows the number of times a specific word is used in that message.

Let's first clean the data before we create the columns.

In [10]:
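# We start by removing all the punctuation from the SMS column and making words lower case
# (recent pandas versions need regex=True for the pattern to be treated as a regular expression)
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()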
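
To see what the cleaning does to a single message, here is a small standalone sketch using Python's `re` module instead of pandas. It also applies the word split that the notebook performs in a later cell. The sample text is the first message of the data shown in this notebook; `clean` is a hypothetical helper name.

```python
import re

def clean(message):
    """Replace non-word characters with spaces, lower-case, and split into words."""
    message = re.sub(r'\W', ' ', message)
    return message.lower().split()

words = clean("Good night my dear.. Sleepwell&amp;Take care")
print(words)
# → ['good', 'night', 'my', 'dear', 'sleepwell', 'amp', 'take', 'care']
```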

In [11]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,good night my dear sleepwell amp take care
1,ham,sen told that he is going to join his uncle fi...
2,ham,thank you baby i cant wait to taste the real ...
3,ham,when can ü come out
4,ham,no thank you you ve been wonderful


In [12]:
# We split each SMS into individual words and append the different words into a list
vocabulary = []

training_set['SMS'] = training_set['SMS'].str.split()
for message in training_set['SMS']:
    for word in message:
        vocabulary.append(word)

In [13]:
# We use the set function to remove duplicates from the vocabulary list and convert back into a list
vocabulary = list(set(vocabulary))
len(vocabulary)

7712

# Finalizing the training set

Now that we've created a list containing all of the words used in the messages from the training set, we can use this list to make the data transformation we need.

In [14]:
# We start by creating a dictionary where the key is a unique word from the vocabulary list
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

# Next, we loop over the SMS column and use the enumerate function to get an index and the SMS message
# For every word in the message, we increment that word's count at the message's index
# (the loop variable is named message so it does not overwrite the sms dataframe)
for index, message in enumerate(training_set['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1
        
# Once the dictionary is filled, we transform it into a dataframe with one column per unique word
word_counts = pd.DataFrame(word_counts_per_sms)

In [15]:
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,0121,01223585236,...,zoe,zogtorius,zoom,zouk,èn,é,ú1,ü,〨ud,鈥
0,ham,"[good, night, my, dear, sleepwell, amp, take, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[sen, told, that, he, is, going, to, join, his...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[thank, you, baby, i, cant, wait, to, taste, t...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[when, can, ü, come, out]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,ham,"[no, thank, you, you, ve, been, wonderful]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
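
The same transformation can be traced on a toy example. The sketch below builds the per-word count lists for two short tokenized messages using plain Python. The first message is taken from the output above, the second is made up, and the variable names mirror the notebook only for readability.

```python
# Two already-tokenized messages (the second one is illustrative).
messages = [
    ['no', 'thank', 'you', 'you', 've', 'been', 'wonderful'],
    ['thank', 'you', 'baby'],
]

# Unique words across all messages, sorted for a stable column order.
vocabulary = sorted({word for message in messages for word in message})

# One list of counts per word, with one slot per message.
word_counts_per_sms = {word: [0] * len(messages) for word in vocabulary}
for index, message in enumerate(messages):
    for word in message:
        word_counts_per_sms[word][index] += 1

for word in vocabulary:
    print(word, word_counts_per_sms[word])
```

Here `you` ends up with the counts `[2, 1]`: twice in the first message and once in the second.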


# Creating the Spam Filter

Now that the data cleaning is done and we have a training set we can work with, we can start creating the spam filter.

To use the Naive Bayes algorithm, we need to know, for every word wi in the vocabulary:
  - P(wi | Spam)
  - P(wi | Ham)

To find these values, we first calculate P(Spam), P(Ham), Nspam, Nham, and Nvocabulary.
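
For reference, these quantities fit together as follows. This is the usual multinomial Naive Bayes formulation with Laplace (additive) smoothing, written to match the variable names used in the code cells of this notebook; the subscripted symbols are notation introduced here.

```latex
\begin{aligned}
P(\text{Spam} \mid w_1, \ldots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \ldots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham}) \\
P(w_i \mid \text{Spam}) &= \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \, N_{\text{Vocabulary}}} \\
P(w_i \mid \text{Ham}) &= \frac{N_{w_i \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \, N_{\text{Vocabulary}}}
\end{aligned}
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, $N_{\text{Vocabulary}}$ is the number of unique words in the training set, and $\alpha$ is the smoothing constant (the ham quantities are defined the same way).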

In [17]:
# We first calculate P(spam) and P(ham)
p_spam = len(training_set_clean[training_set_clean['Label'] == 'spam'])/len(training_set_clean)
p_ham = len(training_set_clean[training_set_clean['Label'] == 'ham'])/len(training_set_clean)

# Next, we calculate Nspam, Nham, and Nvocabulary
spam = training_set_clean[training_set_clean['Label'] == 'spam']
ham = training_set_clean[training_set_clean['Label'] == 'ham']

# Number of words in the spam messages
n_spam = sum(spam['SMS'].apply(len))

# Number of words in the non-spam messages
n_ham = sum(ham['SMS'].apply(len))

# Number of unique words in the training dataset
n_vocab = len(vocabulary)

# Laplace smoothing variable
alpha = 1

# Calculating Parameters

Now that we've established the variables involved in calculating P(w|spam) and P(w|ham) we can go ahead and solve for those values.

In [18]:
# We first create two dictionaries each containing all of the unique words
p_w_spam = {unique_word:0 for unique_word in vocabulary}
p_w_ham = {unique_word:0 for unique_word in vocabulary}

# Then we isolate the spam and ham messages in the training set into two different dataframes
spam = training_set_clean[training_set_clean['Label'] == 'spam']
ham = training_set_clean[training_set_clean['Label'] == 'ham']

# For each unique word, we calculate p_w_ham and p_w_spam
for word in vocabulary:
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = ((n_word_given_spam) + alpha) / (n_spam + (alpha*n_vocab))
    p_w_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha)/(n_ham + alpha*n_vocab)
    p_w_ham[word] = p_word_given_ham

Our two dictionaries are now filled with the unique words and, for each word, the smoothed probability of that word appearing given that a message is spam or ham (non-spam).
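
The effect of the smoothing term is easiest to see with small numbers. The following sketch uses made-up counts (all values are assumptions for illustration) to show that a word never seen in spam still receives a small non-zero probability.

```python
alpha = 1          # Laplace smoothing constant, as in the notebook
n_spam = 20        # hypothetical total number of words in spam messages
n_vocab = 10       # hypothetical vocabulary size

def p_word_given_spam(n_word_given_spam):
    """Smoothed estimate of P(word | spam)."""
    return (n_word_given_spam + alpha) / (n_spam + alpha * n_vocab)

p_unseen = p_word_given_spam(0)   # word never seen in spam: 1/30
p_common = p_word_given_spam(5)   # word seen five times in spam: 6/30
print(p_unseen, p_common)
```

Without smoothing (alpha set to 0), the unseen word would get probability 0 and would zero out the whole product for any message containing it.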

# Finalizing the Spam Filter

Now that we have all of the constants and parameters that we need, we can construct the spam filter.

The spam filter can be understood as a function that:
  - Takes in a new message made up of the words w1, w2, ..., wn
  - Calculates P(Spam | w1, w2, ..., wn) and P(Ham | w1, w2, ..., wn)
  - Compares the two values and:
      - If P(Ham | w1, w2, ..., wn) > P(Spam | w1, w2, ..., wn), classifies the message as ham.
      - If P(Ham | w1, w2, ..., wn) < P(Spam | w1, w2, ..., wn), classifies the message as spam.
      - If P(Ham | w1, w2, ..., wn) = P(Spam | w1, w2, ..., wn), asks for human classification.

In [19]:
import re

def classify(message):
    # Clean the input message the same way we cleaned the training set
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors P(spam) and P(ham); the per-word factors are multiplied in below
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # For each word in the message, multiply in P(word | spam) and P(word | ham).
    # Words that are not in the vocabulary are skipped.
    for word in message:
        if word in p_w_spam:
            p_spam_given_message *= p_w_spam[word]
        if word in p_w_ham:
            p_ham_given_message *= p_w_ham[word]

    # Compare the two values to label the whole message as spam or ham
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_spam_given_message < p_ham_given_message:
        return 'ham'
    else:
        return 'needs human classification'

In [20]:
classify("Sounds good, Tom, then see u there")

'ham'

In [21]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'
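
One practical caveat with multiplying many probabilities is floating-point underflow: for a long message the product can round to 0.0 for both classes. A common remedy is to add logarithms instead of multiplying. The sketch below is a variant of `classify`, not the notebook's implementation, and it uses tiny hand-made parameters (`p_spam`, `p_ham`, `p_w_spam`, and `p_w_ham` here are toy values) so that it runs on its own.

```python
import math
import re

# Toy parameters (assumed values for illustration only).
p_spam, p_ham = 0.13, 0.87
p_w_spam = {'win': 0.05, 'money': 0.04, 'see': 0.001, 'you': 0.01}
p_w_ham = {'win': 0.001, 'money': 0.002, 'see': 0.02, 'you': 0.05}

def classify_log(message):
    """Classify by comparing log-probabilities instead of raw products."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in classify().
        if word in p_w_spam:
            log_spam += math.log(p_w_spam[word])
        if word in p_w_ham:
            log_ham += math.log(p_w_ham[word])

    if log_spam > log_ham:
        return 'spam'
    elif log_spam < log_ham:
        return 'ham'
    return 'needs human classification'

print(classify_log('WIN money!!'))       # spam
print(classify_log('See you ' * 500))    # ham
```

With plain multiplication and these toy values, the second call would underflow to 0.0 for both classes and end in a tie.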

# Trying out the spam filter on the test set

Now that we have a function that returns a label for any SMS message, we can apply it to our test set data.

In [22]:
test_set['predicted'] = test_set['SMS'].apply(classify)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Good night my dear.. Sleepwell&amp;Take care,ham
1,ham,Sen told that he is going to join his uncle fi...,ham
2,ham,Thank you baby! I cant wait to taste the real ...,ham
3,ham,When can ü come out?,ham
4,ham,No. Thank you. You've been wonderful,ham


Now we can compare the predicted values with the Label values to measure how well our spam filter classifies new messages.
We use accuracy as the metric.

In [49]:
correct = 0
total = len(test_set)

# We iterate through the dataframe to determine how many rows have the correct label
incorrect = []
for index, row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
    else:
        incorrect.append(row)
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1106
Incorrect: 8
Accuracy: 0.992818671454219
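
Accuracy is simply the share of predictions that match the true label. The same computation can be written more compactly; the sketch below uses two short made-up lists in place of the `Label` and `predicted` columns.

```python
# Illustrative labels and predictions (not the notebook's data).
labels    = ['ham', 'ham', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham']

# True counts as 1 and False as 0, so the sum is the number of matches.
correct = sum(actual == guess for actual, guess in zip(labels, predicted))
accuracy = correct / len(labels)
print(correct, accuracy)   # 3 0.6
```

In pandas the equivalent would be `(test_set['Label'] == test_set['predicted']).mean()`.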


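We get an accuracy of about 99.3%, well above our goal of 80%. Keep in mind, though, that test_set was sampled from the same shuffled data with the same random_state as training_set, so its messages were also seen during training and this figure likely overstates how well the filter handles unseen messages.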

# Looking into incorrect data

We can also look at the 8 messages that the filter labeled incorrectly and try to understand why the mislabeling occurred.

In [52]:
incorrect = pd.DataFrame(incorrect)
incorrect

Unnamed: 0,Label,SMS,predicted
96,ham,Anytime...,spam
115,spam,FreeMsg Hey there darling it's been 3 week's n...,ham
159,ham,The last thing i ever wanted to do was hurt yo...,needs human classification
413,ham,No calls..messages..missed calls,spam
415,spam,Hello darling how are you today? I would love ...,ham
467,spam,Not heard from U4 a while. Call me now am here...,ham
482,spam,Missed call alert. These numbers called but le...,ham
613,spam,Would you like to see my XXX pics they are so ...,ham


In [63]:
incorrect['SMS']

96                                            Anytime...
115    FreeMsg Hey there darling it's been 3 week's n...
159    The last thing i ever wanted to do was hurt yo...
413                     No calls..messages..missed calls
415    Hello darling how are you today? I would love ...
467    Not heard from U4 a while. Call me now am here...
482    Missed call alert. These numbers called but le...
613    Would you like to see my XXX pics they are so ...
Name: SMS, dtype: object

Looking at the incorrectly labeled messages, we see several possible reasons for the mislabeling:
  - One message was labeled 'needs human classification', meaning the values computed for spam and ham were equal. This message is very long, so a likely cause is that multiplying many small probabilities underflowed to 0.0 for both classes; many of its words would also be found in both spam and ham messages.
  - Several spam messages contained explicit references or offers but were labeled ham. Taken separately, most of their words are ordinary words that are probably frequent in ham messages, and the few words that point to the explicit nature of the message were not enough to tip the result toward spam.
  - Two very short ham messages ("Anytime..." and "No calls..messages..missed calls") were labeled spam. The classify function strips punctuation, so punctuation itself is unlikely to be the cause; with so few words, one or two words that are relatively more common in spam may decide the outcome.
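
Tallying the eight rows by error type makes the pattern easier to see. The label and prediction pairs below are copied from the table of misclassified messages above; the rest of the snippet is an illustrative sketch.

```python
from collections import Counter

# (true label, predicted label) for the eight misclassified rows.
errors = [
    ('ham', 'spam'),
    ('spam', 'ham'),
    ('ham', 'needs human classification'),
    ('ham', 'spam'),
    ('spam', 'ham'),
    ('spam', 'ham'),
    ('spam', 'ham'),
    ('spam', 'ham'),
]

tally = Counter(errors)
for (actual, guess), count in tally.most_common():
    print(f"{actual} labeled as {guess}: {count}")
```

Most of the errors are spam messages that slipped through as ham (5 of 8), while 2 ham messages were flagged as spam and 1 was left undecided.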

# Conclusion

As mentioned above, we got an accuracy of about 99.3% on our test set, well above our goal of 80%.

A recap of our project:
  - We took our dataset of 5,572 SMS messages and drew a training set and a test set from it.
  - We used the training set to estimate the parameters of the Naive Bayes algorithm: the prior probabilities P(Spam) and P(Ham) and, for every word in the vocabulary, the smoothed probabilities P(w | Spam) and P(w | Ham), stored in one dictionary per label.
  - We then wrote a function that uses these parameters to classify a message as spam or ham based on the words it contains.
  - Lastly, we applied this function to our test data and measured the accuracy of our spam filter.