## Building a Spam Filter with Naive Bayes

As we saw in the previous mission, to classify messages as spam or non-spam, the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate two probabilities for each new message: the probability that it is spam and the probability that it is non-spam.
3. Classifies a new message based on these probability values. If the probability for spam is greater, it classifies the message as spam; if the probability for non-spam is greater, it classifies it as non-spam. If the two values are equal, we may need a human to classify the message.
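
The decision rule in the last step can be sketched on its own, before any probabilities are estimated. This is a minimal illustration: `decide` is a hypothetical helper name, and the numbers passed to it are made up rather than computed from the dataset.

```python
def decide(p_spam_given_message, p_ham_given_message):
    """Pick a label from two already-computed scores (hypothetical helper)."""
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    # Equal scores: the filter cannot decide on its own.
    return 'needs human classification'

# Made-up scores, only to exercise the three branches.
print(decide(0.8, 0.2))
print(decide(0.1, 0.9))
print(decide(0.5, 0.5))
```

Only the comparison matters here, so the two scores do not need to sum to 1.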

In [2]:
import pandas as pd
import numpy as np

# The file is tab-separated and has no header row, so we name the columns ourselves
dataset_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
dataset_spam.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
dataset_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

In [5]:
random_dataset=dataset_spam.sample(frac=1,random_state=1)
training_len=round(random_dataset.shape[0]*0.80)
training_len

4458

In [6]:
training_dataset=random_dataset.iloc[:training_len].reset_index(drop=True)
test_dataset=random_dataset.iloc[training_len:].reset_index(drop=True)
print(len(training_dataset))
print(len(test_dataset))

4458
1114


In [7]:
training_dataset.head(2)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"


In [8]:
test_dataset.head(2)

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...


In [9]:
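# Replace every non-word character with a space (regex=True is needed in pandas 2.0+), then lowercase
training_dataset['SMS'] = training_dataset['SMS'].str.replace(r'\W', ' ', regex=True)
training_dataset['SMS'] = training_dataset['SMS'].str.lower()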
training_dataset.head(5)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [10]:
training_dataset_copy=training_dataset.copy()
training_dataset_copy['SMS']=training_dataset_copy['SMS'].str.split()
training_dataset_copy.head(5)

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [11]:
# Collect every unique word that appears in the training messages
vocabulary = []
for sms in training_dataset_copy['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))
len(vocabulary)

7783

In [12]:
training_dataset.head(2)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan


In [13]:
word_counts_per_sms = {unique_word: [0] * len(training_dataset_copy['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_dataset_copy['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [14]:
word_counts = pd.DataFrame(word_counts_per_sms)


In [15]:
word_counts.columns[1000:1007]

Index(['another', 'ans', 'ansr', 'answer', 'answered', 'answerin',
       'answering'],
      dtype='object')

In [16]:
training_set_clean = pd.concat([training_dataset_copy, word_counts], axis=1)

In [17]:
training_set_clean.head(3)

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

In [22]:
# Probability of a spam message in a training set
p_spam = len(spam_messages) / len(training_set_clean)
# Probability of a non spam message in a training set
p_ham = len(ham_messages) / len(training_set_clean)

In [26]:
# N_spam is equal to the number of words in all the spam messages
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam=n_words_per_spam_message.sum()
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham=n_words_per_ham_message.sum()
n_vocabulary = len(vocabulary)
# Laplace smoothing
alpha = 1

So far we have calculated the constants P(Spam), P(Ham), N_Spam, N_Ham, and N_Vocabulary.
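
For reference, these constants feed the Laplace-smoothed estimates that the notebook's code computes for each word. The notation below is an assumption chosen to mirror the variable names (`n_spam`, `n_ham`, `n_vocabulary`, `alpha`); $N_{w_i \mid Spam}$ stands for the number of times the word $w_i$ occurs in spam messages.

$$P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

A message made of words $w_1, \dots, w_n$ is then scored for each class by multiplying the prior by the per-word probabilities:

$$P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$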

In [28]:
parameters_spam = {unique: 0 for unique in vocabulary}
parameters_ham = {unique: 0 for unique in vocabulary}

In [29]:
parameters_spam

{'ifink': 0,
 'pobox84': 0,
 'centre': 0,
 'doubletxt': 0,
 'buffet': 0,
 'camp': 0,
 'checkmate': 0,
 'why': 0,
 'ringtoneking': 0,
 'hsbc': 0,
 'mk45': 0,
 'directly': 0,
 'weight': 0,
 'bawling': 0,
 '6pm': 0,
 'butting': 0,
 'mth': 0,
 'equally': 0,
 'even': 0,
 'nokias': 0,
 'afford': 0,
 'insurance': 0,
 'collapsed': 0,
 'dress': 0,
 'falls': 0,
 'gei': 0,
 'breathe': 0,
 'avenue': 0,
 'messy': 0,
 'm6': 0,
 'advice': 0,
 'audrey': 0,
 'accidentally': 0,
 'hv': 0,
 'jenne': 0,
 'happenin': 0,
 'matra': 0,
 'bmw': 0,
 'fuckin': 0,
 'forwarding': 0,
 'fondly': 0,
 'wrking': 0,
 'totes': 0,
 'cann': 0,
 'samus': 0,
 'sex': 0,
 'greatest': 0,
 'cd': 0,
 'cutie': 0,
 'height': 0,
 'tcr': 0,
 'breather': 0,
 'psychiatrist': 0,
 'courage': 0,
 'obviously': 0,
 'inst': 0,
 'sol': 0,
 'accident': 0,
 'indeed': 0,
 'iyo': 0,
 'monoc': 0,
 'oble': 0,
 'subscriptn3gbp': 0,
 'will': 0,
 'wkend': 0,
 'drinks': 0,
 'understand': 0,
 'mobilesvary': 0,
 'tue': 0,
 'terms': 0,
 'december': 0,
 'vi

In [31]:
# Estimate P(word|Spam) and P(word|Ham) for every word, with Laplace smoothing
for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_word_given_spam

    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_word_given_ham
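
To see what the smoothing does, the same calculation can be run on a tiny hand-made corpus. This is only a sketch: the `toy_*` names, the `smoothed_parameters` helper, and the messages themselves are hypothetical and not part of the SMS dataset.

```python
from collections import Counter

# Hypothetical toy corpus (not from the SMS dataset), already tokenized.
toy_spam = [['win', 'money', 'now'], ['win', 'prize']]
toy_ham = [['see', 'you', 'now']]

toy_vocabulary = sorted({word for sms in toy_spam + toy_ham for word in sms})
toy_alpha = 1  # Laplace smoothing

def smoothed_parameters(messages, vocabulary, alpha):
    """Return P(word|class) for every vocabulary word, with additive smoothing."""
    counts = Counter(word for sms in messages for word in sms)
    n_class = sum(counts.values())  # total number of words in this class
    denominator = n_class + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

toy_parameters_spam = smoothed_parameters(toy_spam, toy_vocabulary, toy_alpha)
toy_parameters_ham = smoothed_parameters(toy_ham, toy_vocabulary, toy_alpha)

print(toy_parameters_spam['win'])  # word seen twice in spam
print(toy_parameters_spam['see'])  # word never seen in spam, still non-zero
```

With six vocabulary words and five spam words, 'win' gets (2 + 1) / (5 + 6) and 'see' gets (0 + 1) / (5 + 6). Without smoothing, 'see' would have a probability of zero and would wipe out the whole product for any message containing it.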

In [32]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the priors, then multiply in P(word|Spam) and P(word|Ham)
    # for every word that is in the vocabulary; unseen words are skipped.
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
            

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [33]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [34]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
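
The values printed above are already very small (around 1e-21 to 1e-27) for short messages. For much longer messages, a product of many such factors can underflow to 0.0 in floating point, which would make both classes tie. A common workaround, shown here as a sketch and not as what this notebook does, is to add logarithms instead of multiplying probabilities. The `log_score` helper and the numbers are hypothetical.

```python
import math

def log_score(prior, word_probabilities):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    return math.log(prior) + sum(math.log(p) for p in word_probabilities)

# Hypothetical numbers chosen only to force an underflow.
tiny_probabilities = [1e-5] * 100

product = 0.5
for p in tiny_probabilities:
    product *= p

print(product)                             # the raw product collapses to 0.0
print(log_score(0.5, tiny_probabilities))  # the log score stays finite
```

Because the logarithm is increasing, comparing log scores gives the same label as comparing the raw products whenever the products are representable.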


In [35]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [37]:
test_dataset['predicted'] = test_dataset['SMS'].apply(classify_test_set)
test_dataset.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [43]:
correct=0
total=len(test_dataset)
total

1114

In [44]:
correct=len(test_dataset[test_dataset['Label']==test_dataset['predicted']])

In [47]:
accuracy=correct/total
accuracy*100

98.74326750448833

The accuracy is about 98.74%, which is very good. Our spam filter looked at 1,114 messages that it had not seen during training and classified 1,100 of them correctly.
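
Since about 87% of the messages are ham, accuracy alone does not show which class the errors fall in. One way to check, sketched here on a small made-up table and not on the real `test_dataset`, is to cross-tabulate the true labels against the predictions with `pd.crosstab`. The `toy_results` frame is hypothetical and only reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical results table using the same column names as test_dataset.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'spam', 'spam', 'ham', 'ham'],
})

# Share of rows where the prediction matches the true label.
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])

print(toy_accuracy)
print(confusion)
```

Applying the same two lines to `test_dataset` would show how many of the misclassified messages were spam let through and how many were ham wrongly flagged.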