# Building a Spam Filter with Naive Bayes
In this project, we will attempt to create a spam filter using Naive Bayes to classify SMS messages as either spam or ham (non-spam). The dataset consists of 5,572 SMSs that have already been classified by humans. 

The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), with the collection details described [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

# Exploring the data

In [1]:
import pandas as pd
messages = pd.read_csv('SMSSpamCollection', sep = '\t', header=None, names=['Label','SMS'])

In [2]:
messages.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
messages.tail(5)

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [4]:
messages.shape

(5572, 2)

In [5]:
messages['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

So around 87% of messages are real and 13% are spam.

We will now randomize the dataset and then split it into a training (80%) and test (20%) set.

In [6]:
#randomize data
messages = messages.sample(frac=1, random_state=1)

In [7]:
#find index to split
ind = round(len(messages)*0.8)
print(ind)

4458


In [8]:
#create the splits
train = messages[:ind].reset_index(drop=True)
test = messages[ind:].reset_index(drop=True)

Now we'll make sure the splits are representative of the original data.

In [9]:
train['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [10]:
test['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The train and test sets also come out at around 87% ham and 13% spam, so both splits are representative and we can proceed to clean the data.
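
Had the proportions drifted noticeably, one alternative would be a stratified split, which samples each label separately so both sets keep the same ham/spam ratio by construction. The sketch below is only an illustration on a made-up toy dataframe (`toy`, `toy_train` and `toy_test` are hypothetical names, not part of this notebook) and is not the method used above.

```python
import pandas as pd

# Hypothetical toy data: 80 ham and 20 spam messages
toy = pd.DataFrame({'Label': ['ham'] * 80 + ['spam'] * 20,
                    'SMS': ['message'] * 100})

# Sample 80% of each label separately, so the ratio is preserved
toy_train = toy.groupby('Label').sample(frac=0.8, random_state=1)

# Everything that was not sampled becomes the test set
toy_test = toy.drop(toy_train.index).reset_index(drop=True)

# Shuffle the training rows so the labels are not grouped together
toy_train = toy_train.sample(frac=1, random_state=1).reset_index(drop=True)

print(toy_train['Label'].value_counts(normalize=True))
print(toy_test['Label'].value_counts(normalize=True))
```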

# Data cleaning

We're going to transform every SMS into a frequency matrix containing the words within the text. We'll strip out the punctuation for the purposes of this exercise, although in future it might be worth keeping it, since excessive or incorrect punctuation may indicate a spam message.
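
As a quick illustration of what this cleaning step does to a single message, here is a minimal sketch using Python's `re` module. The helper name `clean_sms` and the sample text are made up for this example and are not part of the notebook.

```python
import re

def clean_sms(text):
    """Replace every non-word character with a space, then lower-case."""
    return re.sub(r'\W', ' ', text).lower()

# Made-up sample message, used only for illustration
sample = "WINNER!! Call 0800-123 now."
cleaned = clean_sms(sample)

# Splitting on whitespace drops the extra spaces left by the punctuation
tokens = cleaned.split()
print(tokens)
```

Note that digits and underscores count as word characters for `\W`, so numbers such as phone numbers survive the cleaning.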

In [11]:
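# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()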

In [12]:
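# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
test['SMS'] = test['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()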

Next up, we split each training SMS into a list of words and build a vocabulary of every unique word.

In [13]:
train['SMS'] = train['SMS'].str.split()
vocabulary = []
for message in train['SMS']:
    for word in message:
        vocabulary.append(word)

In [14]:
#remove duplicate words
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

In [15]:
len(vocabulary)

7783

Time to make the dataframe containing word frequencies. 
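
To make the target shape concrete, here is a small sketch of the same idea on a made-up two-message corpus: one row per message, one column per vocabulary word, and each cell holding how often that word occurs in that message. All `toy_*` names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two made-up, already tokenized messages
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'soon']]

# Unique words across all messages (sorted only to get stable column order)
toy_vocabulary = sorted({word for message in toy_messages for word in message})

# Start every word with a zero count for every message
counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the count of each word at the position of its message
for index, message in enumerate(toy_messages):
    for word in message:
        counts[word][index] += 1

toy_word_count = pd.DataFrame(counts)
print(toy_word_count)
```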

In [16]:
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [17]:
word_count = pd.DataFrame(word_counts_per_sms)

In [18]:
word_count.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [19]:
clean_training = pd.concat([train,word_count],axis=1)

In [20]:
clean_training.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


# Creating the spam filter

Now that the data is clean, we can start to create the spam filter using a Naive Bayes algorithm. To do this we must answer two probability questions:

$P(Spam|w_1,w_2,...,w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i|Spam)$

$P(Ham|w_1,w_2,...,w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i|Ham)$

We first need to calculate probabilities for spam and ham as well as the numbers of each.

Additionally, we'll use Laplace smoothing and set alpha to 1.
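
Written out, the smoothed per-word probabilities take the following form. The symbol names are chosen here for readability and are an assumption of this note: $N_{w_i|Spam}$ is the number of times word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in all spam messages, $N_{Vocabulary}$ is the number of unique words in the vocabulary, and $\alpha$ is the smoothing parameter.

$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$

$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$

With $\alpha = 1$, a word that never occurs in one class still gets a small non-zero probability, so a single unseen word cannot drive the whole product to zero.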

In [21]:
# first isolate ham and spam messages in separate lists
spam_list = clean_training[clean_training['Label'] == 'spam']
ham_list = clean_training[clean_training['Label'] == 'ham']

In [22]:
# P(spam)
p_spam = len(spam_list)/len(clean_training)
# P(ham)
p_ham = len(ham_list)/len(clean_training)

In [23]:
# N(spam)
num_words_per_spam_message = spam_list['SMS'].apply(len)
n_spam = num_words_per_spam_message.sum()
#N(ham)
num_words_per_ham_message = ham_list['SMS'].apply(len)
n_ham = num_words_per_ham_message.sum()

In [24]:
n_spam

15190

In [25]:
n_ham

57237

In [26]:
n_vocab = len(vocabulary)

In [27]:
n_vocab

7783

In [28]:
#for Laplace smoothing
alpha = 1

Since the probability for each individual word is constant once the training set is fixed, it makes sense to calculate the spam and ham probabilities of every word in our vocabulary just once and store them, rather than re-calculating them for every new message that comes in. In effect, this means we calculate P(wi|Spam) and P(wi|Ham) for each word wi in the vocabulary, so 7,783 x 2 = 15,566 probabilities. In the long run this will significantly speed up the classification of new messages.
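
As a small worked example of a single parameter, the sketch below plugs the counts printed above (`n_spam = 15190`, `n_ham = 57237`, `n_vocab = 7783`) into the smoothing formula for a word that never appears in a given class. The function name `smoothed_probability` is hypothetical and introduced only for this illustration.

```python
def smoothed_probability(word_count, n_class, n_vocab, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (n_class + alpha * n_vocab)

# Totals taken from the outputs printed earlier in this notebook
n_spam, n_ham, n_vocab = 15190, 57237, 7783

# A word with zero occurrences in a class still gets a non-zero probability
p_unseen_spam = smoothed_probability(0, n_spam, n_vocab)
p_unseen_ham = smoothed_probability(0, n_ham, n_vocab)

print(p_unseen_spam, p_unseen_ham)
print('Parameters to store:', 2 * n_vocab)
```

Because there are far fewer spam words than ham words in the training set, an unseen word gets a larger smoothed probability under spam than under ham.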

In [33]:
#initiate parameters
spam_dict = {}
ham_dict = {}

for word in vocabulary:
    spam_dict[word] = 0
    ham_dict[word] = 0

In [34]:
#calculate parameters
for word in vocabulary:
    num_word_given_spam = spam_list[word].sum()
    p_word_given_spam = (num_word_given_spam + alpha)/(n_spam+alpha*n_vocab)
    #update parameter
    spam_dict[word] = p_word_given_spam
    
    num_word_given_ham = ham_list[word].sum()
    p_word_given_ham = (num_word_given_ham + alpha)/(n_ham+alpha*n_vocab)
    #update parameter
    ham_dict[word] = p_word_given_ham
    
    

Next up we're going to define a function for classifying new messages. We first clean the message the same way we cleaned the training and test sets and split it into a list of words, then calculate both probabilities from the words in the message. Words that are not in the vocabulary are simply skipped.

In [39]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
        
        
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
        


In [41]:
#now we test some obvious ones
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [42]:
#and an obvious ham
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
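
The printed values are already as small as 1e-27 for messages of around ten words. For much longer messages, a product of many small factors can eventually underflow to zero in floating point. A common workaround, which is not what this notebook does, is to add logarithms instead of multiplying probabilities, since the comparison between the two scores is unchanged. Below is a minimal sketch using made-up toy parameters (`toy_spam_dict`, `toy_ham_dict` and the priors are hypothetical, not values from the trained model).

```python
import math

# Hypothetical toy parameters, not taken from the trained model
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_dict = {'win': 0.02, 'money': 0.01, 'see': 0.001}
toy_ham_dict = {'win': 0.001, 'money': 0.002, 'see': 0.01}

def log_scores(words, p_spam, p_ham, spam_params, ham_params):
    """Return (log score for spam, log score for ham) for a list of words."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Unknown words are skipped, as in the classify function
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])
    return log_spam, log_ham

log_spam, log_ham = log_scores(['win', 'money'], toy_p_spam, toy_p_ham,
                               toy_spam_dict, toy_ham_dict)
label = 'spam' if log_spam > log_ham else 'ham'
print(log_spam, log_ham, label)
```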


Cool, looks like it passes the initial test. Let's move on to the test set that we put aside at the beginning. We'll first define a new function that will add the predicted label to the dataframe so that we can directly compare the human labels to our algorithm.

# Classifying the test set: measuring accuracy

In [43]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
        


In [44]:
test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest oran...,spam
3,ham,all sounds good fingers makes it difficult ...,ham
4,ham,all done all handed in don t know if mega sh...,ham


In [46]:
#now we can measure the accuracy
correct = 0
total = test.shape[0]
for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1


In [47]:
accuracy = correct/total

In [52]:
print('Correct:', correct)
print('Incorrect:', total-correct)
print(accuracy*100,'%')

Correct: 1100
Incorrect: 14
98.74326750448833 %


14 out of 1114 were classified incorrectly, giving us an accuracy of close to 99%, which is really quite satisfying. Let's now have a quick look at those 14 that were incorrectly classified.

In [53]:
incorrect_list = test[test['predicted'] != test['Label']]

In [54]:
incorrect_list

Unnamed: 0,Label,SMS,predicted
114,spam,not heard from u4 a while call me now am here...,ham
135,spam,more people are dogging in your area now call...,ham
152,ham,unlimited texts limited minutes,spam
159,ham,26th of july,spam
284,ham,nokia phone is lovly,spam
293,ham,a boy loved a gal he propsd bt she didnt mind...,needs human classification
302,ham,no calls messages missed calls,spam
319,ham,we have sent jd for customer service cum accou...,spam
504,spam,oh my god i ve found your number again i m s...,ham
546,spam,hi babe its chloe how r u i was smashed on s...,ham
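
To see at a glance how errors split between spam that slipped through as ham and ham that was flagged as spam, a cross-tabulation of the true and predicted labels can help. The sketch below runs on a made-up six-row dataframe (`toy_results` is hypothetical); the same two calls could be applied to the `Label` and `predicted` columns of `test`.

```python
import pandas as pd

# Made-up labels and predictions, only for illustration
toy_results = pd.DataFrame({
    'Label':     ['spam', 'spam', 'ham', 'ham', 'ham', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham',
                  'needs human classification'],
})

# Rows are the human labels, columns are the predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)

# Accuracy as the share of rows where both columns agree
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print(toy_accuracy)
```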


Looks like one of the messages requires human classification. For the other 13, it might be that some of their words were not seen in the training set and so contributed nothing to either probability. For example, row 885 just contains a random mixture of mostly numbers. If this exact phrase wasn't in the training set then of course it is unlikely to be classified as spam. On the other hand, "2" is sometimes used in real messages as a substitute for "to" or "too", which may incorrectly push the message towards ham. Let's check the probabilities of a couple of these.

In [66]:
classify('unlimited texts limited minutes')

P(Spam|message): 1.0545851553531082e-11
P(Ham|message): 1.3852684406008066e-12
Label: Spam


In [67]:
classify('2 2 146tf150p')

P(Spam|message): 7.027355709792959e-06
P(Ham|message): 1.4159232914931763e-05
Label: Ham


In [68]:
classify('rct thnq adrian for u text rgds vatian')

P(Spam|message): 2.0890554864799025e-08
P(Ham|message): 5.885659287002676e-08
Label: Ham


In [69]:
classify('nokia phone is lovly')

P(Spam|message): 3.2302276335356862e-09
P(Ham|message): 3.3416452792129453e-10
Label: Spam


In [70]:
classify('26th of july')

P(Spam|message): 5.328430258626231e-12
P(Ham|message): 1.3191533559357677e-12
Label: Spam


One thing we notice is that these incorrectly classified messages have larger probabilities than those we tested earlier, most likely because they are short and so fewer small word probabilities get multiplied together. With this in mind, some further steps are:
- introduce a probability cutoff that would require human intervention;
- include capitalization in the algorithm to make the filter sensitive to case;
- gather more data, although given the high accuracy already achieved, the minimal gains are unlikely to be worth the time.
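
The first idea in the list could be sketched as follows. Since the two scores are only proportional to the true posteriors, one option is to normalize them so they sum to 1 and require the winning class to clear a threshold. The function name `label_with_cutoff` and the threshold of 0.9 are arbitrary assumptions for this illustration, not tuned values. The scores passed in are the ones printed earlier in this notebook.

```python
def label_with_cutoff(score_spam, score_ham, threshold=0.9):
    """Label a message only when one class clearly dominates."""
    # Normalize the two unnormalized scores into a share for spam
    posterior_spam = score_spam / (score_spam + score_ham)
    if posterior_spam >= threshold:
        return 'spam'
    if posterior_spam <= 1 - threshold:
        return 'ham'
    return 'needs human classification'

# Scores printed above for 'WINNER!! This is the secret code ...'
print(label_with_cutoff(1.3481290211300841e-25, 1.9368049028589875e-27))

# Scores printed above for 'Sounds good, Tom, then see u there'
print(label_with_cutoff(2.4372375665888117e-25, 3.687530435009238e-21))

# Scores printed above for the misclassified 'unlimited texts limited minutes'
print(label_with_cutoff(1.0545851553531082e-11, 1.3852684406008066e-12))
```

With this threshold the two obvious test messages keep their labels, while the misclassified ham message would be routed to a human instead of being marked as spam.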