# Building a Spam Filter with Naive Bayes

In this project, we will build a multinomial Naive Bayes algorithm to create an SMS spam filter that classifies messages with more than 80% accuracy. We will use 5,572 messages from a dataset available from the UCI Machine Learning Repository to train and test our filter.

Let's start by reading in and exploring the data.

In [32]:
import pandas as pd
import re

In [2]:
#reading in data
sms_data = pd.read_csv('SMSSpamCollection', sep='\t',
                      header=None,
                      names=['Label', 'SMS'])

In [3]:
#finding number of rows and columns
sms_data.shape

(5572, 2)

In [4]:
#finding percentage of spam and not-spam (aka ham)

sms_data['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The dataset includes two types of messages: spam and ham (otherwise known as not spam). The ham messages make up 87% of the data and the spam messages make up 13% of the data.

## Training and Test Set

Before we can build our spam filter, we need to divide the data into a training set and a test set. The data in the training set will be used to develop the spam filter. We'll use the test set to measure the accuracy of the filter later in the project.

The training set will comprise 80% of the data. We will use the remaining 20% of the data as test data. 

In [5]:
#randomizing the entire dataset
sms_data = sms_data.sample(frac=1, random_state=1)

In [6]:
#splitting dataset into two (80% train, 20% test)
split_80 = round(len(sms_data) * 0.80)

train = sms_data[:split_80].reset_index(drop=True)
test = sms_data[split_80:].reset_index(drop=True)

print(train.shape)
print(test.shape)

(4458, 2)
(1114, 2)


In [7]:
#finding percentage of spam and ham for both datasets

print(train['Label'].value_counts(normalize=True)*100)
print(test['Label'].value_counts(normalize=True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


Both the training and testing datasets match the breakdown of ham (non-spam) and spam messages in the original data. This means our data is representative and we can move forward with building our spam filter. 

## Cleaning Training Data

Before we can build a Naive Bayes algorithm to classify messages, we need to clean the messages in the training data. We will:
- Remove all punctuation from the messages.
- Ensure all messages are in lower case.
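
As a minimal sketch of what these two steps do to a single message, the snippet below applies them to a made-up string using only the standard library. The helper name `clean_message` and the example text are assumptions for illustration and are not part of the notebook. Note that `\W` also replaces apostrophes, so a contraction such as "Don't" ends up as two tokens.

```python
import re

def clean_message(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r"\W", " ", text).lower()

# Hypothetical example message, not taken from the dataset.
example = "Don't miss out!!"
cleaned = clean_message(example)
tokens = cleaned.split()
print(tokens)
```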

In [8]:
#Removing punctuation from messages

train['SMS'] = train['SMS'].str.replace(r"\W", ' ', regex=True)
train.head()


Unnamed: 0,Label,SMS
0,ham,Yep by the pretty sculpture
1,ham,Yes princess Are you going to make me moan
2,ham,Welp apparently he retired
3,ham,Havent
4,ham,I forgot 2 ask ü all smth There s a card on ...


In [9]:
#Setting every message to lower case
train['SMS'] = train['SMS'].str.lower()
train.sample(5)

Unnamed: 0,Label,SMS
4453,ham,sorry i ll call later in meeting any thing re...
1965,ham,y lei
2508,ham,i dunno they close oredi not ü v ma fan
3053,spam,123 congratulations in this week s competit...
2682,ham,he needs to stop going to bed and make with th...


## Creating the Vocabulary

To build our spam filter, we need a vocabulary: a list of all unique words in the training set. We will build that now.

In [10]:
#Splitting SMS column into list
train['SMS'] = train['SMS'].str.split()
train.sample(5)

Unnamed: 0,Label,SMS
2895,ham,"[my, sister, got, placed, in, birla, soft, da]"
2525,ham,"[reckon, need, to, be, in, town, by, eightish,..."
2486,ham,"[also, that, chat, was, awesome, but, don, t, ..."
680,ham,"[height, of, recycling, read, twice, people, s..."
1290,ham,"[i, can, call, in, lt, gt, min, if, thats, ok]"


In [11]:
#Adding words from training to vocab list
#dict.fromkeys drops duplicates while keeping first-seen order
vocabulary = list(dict.fromkeys(word for row in train['SMS'] for word in row))
            
len(vocabulary)

7783

## Updating Training Set

To create the spam filter, we also need our training set to include a count of words in each message. We'll create a dictionary with word counts for each message and then merge it with the training dataset. 
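
To make the target layout concrete, here is a small sketch on two made-up messages, using only the standard library. The variable names and messages are illustrative assumptions. Each message becomes one row, and each vocabulary word becomes one column holding how often that word appears in the message.

```python
from collections import Counter

# Hypothetical tokenized messages, not taken from the dataset.
messages = [["free", "prize", "free"], ["see", "you", "soon"]]

# Unique words in first-seen order.
toy_vocabulary = list(dict.fromkeys(w for msg in messages for w in msg))

# One row per message, one column per vocabulary word.
count_rows = []
for msg in messages:
    counts = Counter(msg)
    count_rows.append([counts[word] for word in toy_vocabulary])

print(toy_vocabulary)
for row in count_rows:
    print(row)
```

In the first row the column for "free" holds 2, and every word that does not occur in a message gets a 0.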

In [12]:
#creating dictionary 

word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [13]:
#transforming dictionary into dataframe

word_counts_df = pd.DataFrame(word_counts_per_sms)

In [14]:
#Concatenating the new dataframe to training data

train = pd.concat([train, word_counts_df], axis=1)
train.head()

Unnamed: 0,Label,SMS,yep,by,the,pretty,sculpture,yes,princess,are,...,beauty,hides,secrets,n8,jewelry,related,trade,arul,bx526,wherre
0,ham,"[yep, by, the, pretty, sculpture]",1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Calculating Constants

For our Naive Bayes algorithm, we'll need to calculate the following constants:
- The probability of a message being spam
- The probability of a message being ham
- The number of words in all spam messages
- The number of words in all ham messages
- The number of words in the vocabulary 
- A smoothing variable equal to 1 (which will ensure we don't have any zeros in our probability calculations)
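
For reference, these constants plug into the usual multinomial Naive Bayes formulas with additive smoothing. The symbols are notation introduced here for illustration: $N_{w_i \mid Spam}$ is the number of times word $w_i$ occurs in spam messages, $N_{Spam}$ is the number of words in all spam messages, $N_{Vocabulary}$ is the vocabulary size, and $\alpha$ is the smoothing variable. The ham formulas have the same shape.

```latex
P(\text{Spam} \mid w_1, \ldots, w_n) \propto P(\text{Spam}) \cdot \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
```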

In [15]:
#splitting train data into spam and ham sets
spam_messages = train[train['Label'] == 'spam']
ham_messages = train[train['Label'] == 'ham']

In [16]:
#calculating the probability of spam and ham

p_spam = len(spam_messages)/len(train)
p_ham = len(ham_messages)/len(train)

print(p_spam)
print(p_ham)

0.13458950201884254
0.8654104979811574


In [17]:
#calculating the number of words in spam and ham messages

count_of_spam_n = spam_messages['SMS'].apply(len)
num_spam = count_of_spam_n.sum()

count_of_ham_n = ham_messages['SMS'].apply(len)
num_ham = count_of_ham_n.sum()

print(num_spam)
print(num_ham)

15190
57237


In [18]:
#calculating n_vocab
n_vocab = len(vocabulary)
print(n_vocab)

7783


In [19]:
#initializing variable alpha
alpha = 1

## Calculating Parameters

Now that we've established values for our constants, we need to calculate the parameters for our spam filter. We will create dictionaries containing the probability of every word in the vocabulary given both spam and ham. 
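
Before running this on the real data, it may help to see the smoothed estimate on a tiny made-up example. The counts below are hypothetical and are chosen only so the arithmetic is easy to follow. The word "hello" never appears in the toy spam messages, yet smoothing still gives it a small non-zero probability.

```python
# Hypothetical toy values, not the constants computed from the training set.
toy_alpha = 1
toy_num_spam = 6          # total words across all toy spam messages
toy_counts_in_spam = {"free": 3, "prize": 2, "now": 1, "hello": 0}
toy_n_vocab = len(toy_counts_in_spam)

# Same formula as the filter: (count + alpha) / (num_spam + alpha * n_vocab)
toy_para_spam = {
    word: (count + toy_alpha) / (toy_num_spam + toy_alpha * toy_n_vocab)
    for word, count in toy_counts_in_spam.items()
}

for word, prob in toy_para_spam.items():
    print(word, prob)
```

With these numbers the denominator is 6 + 1 * 4 = 10, so the probabilities are 4/10, 3/10, 2/10 and 1/10, and they sum to 1.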

In [20]:
#initializing parameter dictionaries with one entry per vocab word
para_spam = {unique_word: 0 for unique_word in vocabulary}
para_ham = {unique_word: 0 for unique_word in vocabulary}

In [21]:
#Looping through vocabulary to create parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()  
    p_word_given_spam = (n_word_given_spam + alpha) / (num_spam + alpha*n_vocab)
    para_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   
    p_word_given_ham = (n_word_given_ham + alpha) / (num_ham + alpha*n_vocab)
    para_ham[word] = p_word_given_ham

## Classifying A New Message

Now that we have both our parameters and constants defined, we can build our Naive Bayes algorithm. To do this, we will create a function that accepts a new message and classifies it using the parameters learned from the training data.

In [22]:
def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in para_spam:
            p_spam_given_message *= para_spam[word]
        if word in para_ham:
            p_ham_given_message *= para_ham[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [23]:
#Testing function with spam message
message = 'WINNER!! This is the secret code to unlock the money: C3421.'
classify(message)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [24]:
#Testing function with ham message
message = "Sounds good, Tom, then see u there"
classify(message)

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
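
The scores printed above are already very small (around 1e-25) because many probabilities are multiplied together. One possible refinement, which is not part of this notebook's filter, is to add log-probabilities instead of multiplying raw probabilities, so that long messages cannot underflow to zero. The sketch below uses a hypothetical helper `log_score` and made-up toy parameters. The comparison between the two classes gives the same result because the logarithm preserves ordering.

```python
import math

def log_score(words, prior, word_probs):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:  # words outside the vocabulary are skipped
            score += math.log(word_probs[word])
    return score

# Hypothetical toy parameters, not values from the real training set.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_probs = {"free": 0.4, "prize": 0.3, "see": 0.1}
toy_ham_probs = {"free": 0.1, "prize": 0.1, "see": 0.5}

words = ["free", "prize"]
spam_score = log_score(words, toy_p_spam, toy_spam_probs)
ham_score = log_score(words, toy_p_ham, toy_ham_probs)
label = "spam" if spam_score > ham_score else "ham"
print(label)
```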


## Measuring Filter Accuracy

Next, we need to determine the filter's accuracy by running it on the test data. To measure the accuracy, we will tweak the classify function so that it returns a label instead of printing it. Then, we will compare the human-assigned labels to the results of the classification function.
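
Because about 87% of the messages are ham, a filter that labels everything as ham would already be right about 87% of the time, so it can be useful to look at the types of errors as well as the overall accuracy. The sketch below shows one way to do that on made-up label pairs. The data and variable names are assumptions for illustration, not results from the test set.

```python
from collections import Counter

# Hypothetical (actual, predicted) label pairs, not real test results.
pairs = [
    ("ham", "ham"), ("ham", "ham"), ("ham", "ham"), ("ham", "spam"),
    ("spam", "spam"), ("spam", "spam"), ("spam", "ham"), ("ham", "ham"),
]

outcomes = Counter(pairs)
toy_correct = sum(n for (actual, predicted), n in outcomes.items()
                  if actual == predicted)
toy_accuracy = toy_correct / len(pairs)

# Baseline: accuracy of a "filter" that labels every message as ham.
toy_baseline = sum(1 for actual, _ in pairs if actual == "ham") / len(pairs)

print(f"accuracy: {toy_accuracy:.2%}, always-ham baseline: {toy_baseline:.2%}")
print("ham flagged as spam:", outcomes[("ham", "spam")])
print("spam let through:", outcomes[("spam", "ham")])
```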

In [25]:
#tweaking classification function

def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in para_spam:
            p_spam_given_message *= para_spam[word]

        if word in para_ham:
            p_ham_given_message *= para_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
#applying updated classifier to testing data

test['predicted'] = test['SMS'].apply(classify_test_set)
test.sample(5)

Unnamed: 0,Label,SMS,predicted
641,ham,Dear how is chechi. Did you talk to her,ham
485,ham,Hey mr whats the name of that bill brison book...,ham
395,ham,Id have to check but there's only like 1 bowls...,ham
608,ham,I got it before the new year cos yetunde said ...,ham
712,ham,I'm going for bath will msg you next &lt;#&gt...,ham


In [31]:
#measuring accuracy of filter
#counting rows where the predicted label matches the human label
correct = int((test['Label'] == test['predicted']).sum())
total = len(test)

accuracy = round((correct/total)*100, 2)
print('The filter is {}% accurate.'.format(accuracy))

The filter is 98.74% accurate.


A filter with 98.74% accuracy is pretty good, and well above the 80% mentioned at the start. For future iterations, it might be useful to explore the roughly 1% of test messages (about 14 of 1,114) that were not classified correctly to see if we can make further improvements. Another interesting project might use a similar approach to build a filter for spam emails rather than spam text messages. Unfortunately, spammers are constantly iterating, so updating and improving filters is always necessary!

Thanks for checking out my project.