# **Building a Spam Filter with Naive Bayes**
The aim of this project is to build a spam filter using the multinomial Naive Bayes algorithm. The dataset used to 'teach' the computer how to classify messages as spam or non-spam consists of 5,572 SMS messages. It was put together by Tiago A. Almeida and Jose Maria Gomez Hidalgo, and it can be accessed at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) or downloaded [directly](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection).

Goal: create a spam filter with an accuracy greater than 80%.

In [1]:
import pandas as pd
import re
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
sms.shape

(5572, 2)

In [2]:
sms.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
sms.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


As can be inferred from the two tables above, 'ham' means 'non-spam'.

In [4]:
100 * sms['Label'].value_counts(normalize=True)

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

### **Splitting the Dataset**
Next, the dataset needs to be split into two sets:
1. Training: data used to train the algorithm so that it 'learns' how to classify messages as spam or non-spam.
2. Test: data used to verify whether the training process worked.

The training set is made up of 80% of the whole sample; the test set is the remaining 20%.

In [5]:
sample = sms.sample(frac=1, random_state=1)
top_80 = round(len(sample) * 0.8)
training = sample[:top_80].reset_index(drop=True)
test = sample[top_80:].reset_index(drop=True)
print(training.shape)
print(test.shape)

(4458, 2)
(1114, 2)
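
The shuffle-then-slice pattern used above can also be illustrated without pandas. The following is a minimal sketch in which a range of placeholder row numbers stands in for the real messages; `shuffle_split` is a hypothetical helper, not part of the notebook, and its seed is unrelated to pandas' `random_state`, so it selects different rows.

```python
import random

def shuffle_split(rows, train_fraction=0.8, seed=1):
    """Shuffle a copy of rows and cut it into (training, test) lists."""
    shuffled = list(rows)                   # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 5,572 placeholder rows stand in for the SMS messages.
train_rows, test_rows = shuffle_split(range(5572))
print(len(train_rows), len(test_rows))
```

The two lengths match the shapes printed above because the cut point is computed the same way, with `round(len(...) * 0.8)`.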


In [6]:
100 * training['Label'].value_counts(normalize=True)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [7]:
100 * test['Label'].value_counts(normalize=True)

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

Both datasets have the same proportions of spam and non-spam messages as the whole sample (87% ham and 13% spam, to 2 significant figures), so the split can be used.

In [8]:
fixed_training = training.copy()
# Replace every non-word character with a space, then lowercase the text.
fixed_training['SMS'] = fixed_training['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
fixed_training


Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
...,...,...
4453,ham,sorry i ll call later in meeting any thing re...
4454,ham,babe i fucking love you too you know fuck...
4455,spam,u ve been selected to stay in 1 of 250 top bri...
4456,ham,hello my boytoy geeee i miss you already a...


In [9]:
fixed_training['SMS'] = fixed_training['SMS'].str.split()
fixed_training['SMS']

0                       [yep, by, the, pretty, sculpture]
1       [yes, princess, are, you, going, to, make, me,...
2                         [welp, apparently, he, retired]
3                                                [havent]
4       [i, forgot, 2, ask, ü, all, smth, there, s, a,...
                              ...                        
4453    [sorry, i, ll, call, later, in, meeting, any, ...
4454    [babe, i, fucking, love, you, too, you, know, ...
4455    [u, ve, been, selected, to, stay, in, 1, of, 2...
4456    [hello, my, boytoy, geeee, i, miss, you, alrea...
4457                              [wherre, s, my, boytoy]
Name: SMS, Length: 4458, dtype: object
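
To make the effect of the two cleaning steps concrete, here is a small sketch that applies the same regular expression, lowercasing, and splitting to a single string with the standard `re` module. The `tokenize` helper and the sample message are made up for illustration.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message)
    return cleaned.lower().split()

print(tokenize("WINNER!! Call 0800-123 now, it's FREE."))
# → ['winner', 'call', '0800', '123', 'now', 'it', 's', 'free']
```

Apostrophes and hyphens count as non-word characters, which is why contractions end up as separate tokens such as `s` and `ll` in the output above.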

In [29]:
vocabulary = []
for message in fixed_training['SMS']:
    for word in message:
        vocabulary.append(word)
vocab = list(set(vocabulary))
vocab[10:20]  # Example of values in vocab; set order is arbitrary, so these 10 words can differ between runs.

['thought',
 'top',
 '330',
 'she',
 'img',
 'stop',
 'thoughts',
 'games',
 'rwm',
 'life']

In [11]:
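# One list of per-message counts for every word in the vocabulary.
word_counts_per_sms = {unique_word: [0] * len(fixed_training['SMS']) for unique_word in vocab}

# Use `message` as the loop variable so the `sms` DataFrame is not overwritten.
for index, message in enumerate(fixed_training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1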
        
word_count_df = pd.DataFrame(word_counts_per_sms)
word_count_df.head()

Unnamed: 0,mix,steering,alert,ujhhhhhhh,c52,mum,midnight,shining,club4,81010,...,oreo,1hr,bullshit,pt2,177,b4,sports,mush,river,budget
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
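
Because the real table is almost entirely zeros, the counting logic is easier to follow on a tiny corpus. The sketch below repeats the same dictionary-of-lists approach on three made-up tokenized messages; the names `toy_messages`, `toy_vocab`, and `toy_counts` are illustrative only.

```python
# Three made-up, already tokenized messages.
toy_messages = [['free', 'prize', 'free'], ['see', 'you', 'later'], ['free', 'later']]

# Sorted so the printed order is stable.
toy_vocab = sorted({word for message in toy_messages for word in message})

# One list per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocab}
for index, message in enumerate(toy_messages):
    for word in message:
        toy_counts[word][index] += 1

for word in toy_vocab:
    print(word, toy_counts[word])
# → free [2, 0, 1]
# → later [0, 1, 1]
# → prize [1, 0, 0]
# → see [0, 1, 0]
# → you [0, 1, 0]
```

Each list becomes one column of the word-count table, and each position in the list corresponds to one message (one row).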


In [12]:
training_word_count = pd.concat([fixed_training, word_count_df], axis=1)
training_word_count.dropna(inplace = True, subset = ['Label'])
training_word_count.head()

Unnamed: 0,Label,SMS,mix,steering,alert,ujhhhhhhh,c52,mum,midnight,shining,...,oreo,1hr,bullshit,pt2,177,b4,sports,mush,river,budget
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# Look the priors up by label rather than by position in the value_counts output.
label_shares = training_word_count['Label'].value_counts(normalize=True)
p_ham = label_shares['ham']
p_spam = label_shares['spam']
print(p_ham, p_spam)

0.8654104979811574 0.13458950201884254


In [14]:
spam_word_total = training_word_count.loc[training_word_count['Label'] == 'spam']['SMS'].apply(len).sum()
spam_word_total

15190

In [15]:
ham_word_total = training_word_count.loc[training_word_count['Label'] == 'ham']['SMS'].apply(len).sum()
ham_word_total

57237

In [16]:
n_vocab = len(vocab)
n_vocab

7783

In [17]:
alpha = 1  # Additive (Laplace) smoothing parameter

In [18]:
spam_parameters = {unique_word:0 for unique_word in vocab}
ham_parameters = {unique_word:0 for unique_word in vocab}
spam_messages = training_word_count[training_word_count['Label'] == 'spam']
ham_messages = training_word_count[training_word_count['Label'] == 'ham']

In [19]:
for word in vocab:
    # P(word | spam) with additive smoothing
    n_word_spam = spam_messages[word].sum()
    p_w_given_spam = (n_word_spam + alpha) / (spam_word_total + alpha*n_vocab)
    spam_parameters[word] = p_w_given_spam
    
    # P(word | ham) with additive smoothing
    n_word_ham = ham_messages[word].sum()
    p_w_given_ham = (n_word_ham + alpha) / (ham_word_total + alpha*n_vocab)
    ham_parameters[word] = p_w_given_ham
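
Written out as a formula, the loop above computes the following for each word $w$, where $N_{w \mid \text{Spam}}$ corresponds to `n_word_spam`, $N_{\text{Spam}}$ to `spam_word_total`, $N_{\text{Vocabulary}}$ to `n_vocab`, and $\alpha$ to `alpha` (the subscript notation is chosen here for illustration):

$$
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}},
\qquad
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
$$

For example, with the values computed earlier ($\alpha = 1$, $N_{\text{Spam}} = 15190$, $N_{\text{Vocabulary}} = 7783$), a vocabulary word that never occurs in a spam message gets

$$
P(w \mid \text{Spam}) = \frac{0 + 1}{15190 + 1 \cdot 7783} = \frac{1}{22973} \approx 4.35 \times 10^{-5}
$$

rather than zero, so a single unseen word cannot wipe out the whole product during classification.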

In [20]:
def classify(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        return 'Ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'Spam'
    else:
        return 'Equal probabilities, classify by a human!'
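
The comparison that `classify` performs can be summarized as follows, where $w_1, \dots, w_n$ are the words of the message that appear in the vocabulary (words outside the vocabulary are skipped by the `if` checks):

$$
P(\text{Spam} \mid w_1, \dots, w_n) \propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}),
\qquad
P(\text{Ham} \mid w_1, \dots, w_n) \propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
$$

The two numbers the function prints are these products. They are proportional to the posterior probabilities rather than normalized, so they need not sum to 1; only their relative size decides the label.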

In [21]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(ham|message): 1.9368049028589875e-27


'Spam'

In [22]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(ham|message): 3.687530435009238e-21


'Ham'
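
The scores printed above are already as small as 1e-27 for short messages. For much longer messages, a product of this kind can underflow to 0.0 for both classes. A common variant, which is not what this notebook does, sums logarithms instead of multiplying probabilities. The sketch below assumes made-up toy parameters; `log_score`, `classify_log`, and every `toy_*` name are hypothetical.

```python
import math

# Hypothetical toy parameters; the notebook's real values live in
# spam_parameters, ham_parameters, p_spam and p_ham.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_params = {'free': 0.02, 'prize': 0.01, 'see': 0.001}
toy_ham_params = {'free': 0.002, 'prize': 0.0005, 'see': 0.01}

def log_score(words, prior, params):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    score = math.log(prior)
    for word in words:
        if word in params:  # unknown words are skipped, as in classify()
            score += math.log(params[word])
    return score

def classify_log(words):
    """Compare the two log-scores; the larger one decides the label."""
    spam_score = log_score(words, toy_p_spam, toy_spam_params)
    ham_score = log_score(words, toy_p_ham, toy_ham_params)
    if spam_score > ham_score:
        return 'spam'
    if ham_score > spam_score:
        return 'ham'
    return 'needs human classification'

print(classify_log(['free', 'prize']))   # short message
print(classify_log(['free'] * 500))      # long message; the raw product would underflow
```

Because the logarithm is monotonic, comparing log-scores gives the same decision as comparing the raw products whenever the raw products are still representable.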

In [23]:
def classify_test_set(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'Classify by a human!'

In [24]:
test['Predicted'] = test['SMS'].apply(classify_test_set)
test.head()

Unnamed: 0,Label,SMS,Predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [25]:
correct = 0
total = len(test)
for index, row in test.iterrows():
    if row['Label'] == row['Predicted']:
        correct+=1
accuracy = correct/total
print('This Naive Bayes Algorithm has an accuracy of: {0:0.2f}%'.format(100*accuracy))

This Naive Bayes Algorithm has an accuracy of: 98.74%
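
Accuracy is the share of messages whose predicted label matches the actual label. Since about 86.8% of the test messages are ham, a filter that labelled every message as ham would already score roughly that much, so it can help to read the accuracy next to this majority-class baseline. The sketch below shows both calculations on ten made-up labels; the `accuracy` helper and the label lists are hypothetical.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Made-up labels for illustration only: 8 ham, 2 spam, one spam message missed.
actual = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
predicted = ['ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham']

# Baseline: always predict the most frequent label.
majority_label, majority_count = Counter(actual).most_common(1)[0]

print('model accuracy:', accuracy(actual, predicted))
print('always-' + majority_label + ' baseline:', majority_count / len(actual))
```

On the real test set, the 98.74% reported above is well above the roughly 86.8% that always predicting ham would achieve.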


## **Success!**
The aim of a spam filter with an accuracy greater than 80% has been achieved: the filter classified 98.74% of the test messages correctly.