# Building a Spam Filter with Naive Bayes

In this project we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans to create a spam filter for messages.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the The UCI Machine Learning Repository. The datasetcan be downloaded directly from here: https://dq-content.s3.amazonaws.com/433/SMSSpamCollection. 

In [2]:
import pandas as pd

In [3]:
#read in the dataset and give it headers
messages = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [4]:
messages.describe()

Unnamed: 0,Label,SMS
count,5572,5572
unique,2,5169
top,ham,"Sorry, I'll call later"
freq,4825,30


In [5]:
messages['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

As we can see above, 87% of messages are not spam ('Ham') and the other 13% are spam. Total there are 5,572 rows

In [6]:
messages.head(10)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


Above we can see the first ten rows

## Split into training and test set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).

In [7]:
randomized_data = messages.sample(frac=1, random_state=1)
training_data = randomized_data[:4458].reset_index()
test_data = randomized_data[4459:].reset_index()

In [8]:
#Let's find the percentage of spam and ham in both the training and the test set. 
print(training_data['Label'].value_counts(normalize=True))
print(test_data['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.867925
spam    0.132075
Name: Label, dtype: float64


As we can see above the percentages for training and test are roughly equal

## Data Cleaning

We'll first need to perform a bit of data cleaning to bring the data in a format that will allow us to extract easily all the information we need. The SMS column will be replaced by a series of new columns, where each column represents a unique word from the vocabulary. All will be set to lower case and puncuation marks will be removed.

In [9]:
print(training_data['SMS'].head(5))
training_data['SMS'] = training_data['SMS'].str.replace('\W', ' ')
print(training_data['SMS'].head(5))
training_data['SMS'] = training_data['SMS'].str.lower()
print(training_data['SMS'].head(5))

0                         Yep, by the pretty sculpture
1        Yes, princess. Are you going to make me moan?
2                           Welp apparently he retired
3                                              Havent.
4    I forgot 2 ask ü all smth.. There's a card on ...
Name: SMS, dtype: object
0                         Yep  by the pretty sculpture
1        Yes  princess  Are you going to make me moan 
2                           Welp apparently he retired
3                                              Havent 
4    I forgot 2 ask ü all smth   There s a card on ...
Name: SMS, dtype: object
0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object


In [10]:
#Create a vocabulary for the messages in the training set.

training_data['SMS'] = training_data['SMS'].str.split()

vocabulary = [] #Initiate an empty list named vocabulary.
for message in training_data['SMS']: #Iterate over the the SMS column
    for word in message: #iterate each message in the SMS column
        vocabulary.append(word) #append each string (word) to the vocabulary list
vocabulary = set(vocabulary) #This will remove the duplicates from the vocabulary list.
vocabulary = list(vocabulary) #Transform the vocabulary set back into a list using the list() function
vocabulary[:10]

['flag',
 'algarve',
 'labor',
 'students',
 'receiving',
 'come',
 'novelty',
 'convincing',
 'studyn',
 'tomorrow']

In [11]:
#We'll first build a dictionary that we'll then convert to the DataFrame we need.
word_counts_per_sms = {unique_word: [0] * len(training_data['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [12]:
word_counts = pd.DataFrame(word_counts_per_sms) #Transform word_counts_per_sms into a DataFrame using pd.DataFrame()
training_data = pd.concat([training_data, word_counts], axis=1)

In [13]:
training_data.head(10)

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4642,ham,[havent],0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4674,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
5,5461,ham,"[ok, i, thk, i, got, it, then, u, wan, me, 2, ...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,4210,ham,"[i, want, kfc, its, tuesday, only, buy, 2, mea...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,4216,ham,"[no, dear, i, was, sleeping, p]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,1603,ham,"[ok, pa, nothing, problem]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,1504,ham,"[ill, be, there, on, lt, gt, ok]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Creating the spam filter

Now that the data is cleaned and we have a training set to work with, we can begin creating the spam filter. We will first calculate:

- P(Spam) and P(Ham)
- NSpam, NHam, NVocabulary

Important to remember:
- NSpam is equal to the number of words in all the spam messages — it's not equal to the number of spam messages, and it's not equal to the total number of unique words in spam messages.
- NHam is equal to the number of words in all the non-spam messages — it's not equal to the number of non-spam messages, and it's not equal to the total number of unique words in non-spam messages.

We'll also use Laplace smoothing and set α=1

In [21]:
probabilities = training_data['Label'].value_counts(normalize=True)
print(probabilities)
p_ham = probabilities[0]
p_spam = probabilities[1]

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
0.13458950201884254


In [18]:
# Isolating spam and ham messages first
spam_messages = training_data[training_data['Label'] == 'spam']
ham_messages = training_data[training_data['Label'] == 'ham']

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

In [28]:
#Initiate parameters
parameters_spam  = {}
parameters_ham  = {}

for word in vocabulary:
    parameters_spam[word] = 0
    parameters_ham[word] = 0
    
#Calculate parameters
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham

# Classifying a new message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [29]:
#Create a spam filter function  for new messages

import re

def classify(message):
    '''
    message: a string
    '''
    
    message = re.sub('\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [30]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3476009873135234e-25
P(Ham|message): 1.9365368329766623e-27
Label: Spam


In [31]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4364950561289247e-25
P(Ham|message): 3.687133462921691e-21
Label: Ham


## Test algorithm on test data

In the previous step we managed to create a spam filter, and we classified two new messages. We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

The algorithm will output a classification label for every message in our test set, which we'll be able to compare with the actual label (given by a human). Note that, in training, our algorithm didn't see these 1,114 messages, so every message in the test set is practically new from the perspective of the algorithm.

First off, we'll change the classify() function that we wrote previously to return the labels instead of printing them. Below, note that we now have return statements instead of print() functions

In [32]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [34]:
test_data['predicted'] = test_data['SMS'].apply(classify_test_set)
test_data.head()

Unnamed: 0,index,Label,SMS,predicted
0,3418,ham,But i haf enuff space got like 4 mb...,ham
1,3424,spam,Had your mobile 10 mths? Update to latest Oran...,spam
2,1538,ham,All sounds good. Fingers . Makes it difficult ...,ham
3,5393,ham,"All done, all handed in. Don't know if mega sh...",ham
4,2744,ham,But my family not responding for anything. Now...,ham


Now we can compare the predicted values with the actual values to measure how good our spam filter is with classifying new messages. To make the measurement, we'll use accuracy as a metric

In [39]:
correct = 0
total = len(test_data)

for index, row in test_data.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
print('Accuracy is {0}%'.format(round(correct/total*100, 0)))

Accuracy is 99.0%


## Conclusion

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 99% on the test set, which is an excellent result. We initially aimed for an accuracy of over 80%, but we managed to do way better than that.

## Future work

- Isolate the 14 messages that were classified incorrectly and try to figure out why the algorithm reached the wrong conclusions.
- Make the filtering process more complex by making the algorithm sensitive to letter case.