## Building a Spam Filter with Naive Bayes

A project to practice probability calculations and the Naive Bayes algorithm.

We will classify SMS messages as spam or non-spam ("ham"), using the SMS Spam Collection dataset downloaded from
[The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Our goal is to create a spam filter that classifies new messages with an accuracy greater than 80%.

In [1]:
import pandas as pd
sms = pd.read_csv('SMSSpamCollection', 
            sep='\t', 
            header=None,
            names=['Label', 'SMS'])

In [2]:
sms.shape

(5572, 2)

In [3]:
sms.head()

,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
sms['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64
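
The class balance gives a useful reference point for the accuracy goal: because about 86.6% of the messages are ham, a filter that labels every message as ham would already score roughly that accuracy. A minimal sketch of this majority-class baseline, using a small made-up `labels` series (hypothetical data, not the real dataset):

```python
import pandas as pd

# Hypothetical toy labels; the real notebook would use sms['Label'].
labels = pd.Series(['ham'] * 13 + ['spam'] * 2)

# Accuracy of a filter that always predicts the most frequent class.
majority_class = labels.value_counts().idxmax()
baseline_accuracy = (labels == majority_class).mean()

print(majority_class, round(baseline_accuracy * 100, 2))
```

Any trained filter is worth keeping only if it clearly beats this baseline on the test set.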

## Splitting the Data Set

We shuffle the dataset, then use 80% of the messages as the training set and the remaining 20% as the test set.

In [5]:
sms_random = sms.sample(frac=1, random_state=1)

In [6]:
training_test_index = round(len(sms_random) * 0.8)

In [7]:
training = sms_random[:training_test_index].reset_index(drop=True)
test = sms_random[training_test_index:].reset_index(drop=True)

In [8]:
print(training.shape)
training['Label'].value_counts(normalize=True)*100

(4458, 2)


ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [9]:
print(test.shape)
test['Label'].value_counts(normalize=True)*100

(1114, 2)


ham     86.804309
spam    13.195691
Name: Label, dtype: float64

## Cleaning the Text

In [10]:
training.head()

,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
# Replace every non-word character with a space, then lowercase.
# regex=True is required in recent pandas, where str.replace is literal by default.
training['SMS'] = training['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
training.head()


,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


## Creating a Vocabulary

In [12]:
vocabulary = []
training['SMS'] = training['SMS'].str.split()

In [13]:
for message in training['SMS']:
    for word in message:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))

In [14]:
len(vocabulary)

7783

## The Final Training Set

In [15]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

# Loop variable is named `message` so it does not overwrite the `sms` DataFrame
for index, message in enumerate(training['SMS']):
    for word in message:
        word_counts_per_sms[word][index] += 1

In [16]:
word_count = pd.DataFrame(word_counts_per_sms)
word_count.head()

,switch,cr9,sausage,eggs,ghodbandar,linux,water,ing,08706091795,burger,...,cat,relaxing,amy,100p,help08718728876,kath,mila,groovying,bcm4284,tells
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
training_clean = pd.concat([training, word_count], axis=1)
training_clean.head()

,Label,SMS,switch,cr9,sausage,eggs,ghodbandar,linux,water,ing,...,cat,relaxing,amy,100p,help08718728876,kath,mila,groovying,bcm4284,tells
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
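
To make the structure of this table easier to see, here is a minimal sketch of the same counting step on two made-up messages. The `messages` list is hypothetical, and the vocabulary is sorted here only so that the column order is reproducible (in the notebook it comes from a `set`, so the order is arbitrary):

```python
import pandas as pd

# Hypothetical toy messages, already cleaned and split into words.
messages = [['free', 'prize', 'free'], ['see', 'you', 'there']]

# Sorted so that column order is reproducible (a set has no fixed order).
vocabulary = sorted({word for message in messages for word in message})

# One list of per-message counts for every word in the vocabulary.
word_counts = {word: [0] * len(messages) for word in vocabulary}
for index, message in enumerate(messages):
    for word in message:
        word_counts[word][index] += 1

counts = pd.DataFrame(word_counts)
print(counts)
```

Each row is one message and each column one vocabulary word, so most cells are zero, just as in the output above.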


## Calculating Constants

In [18]:
# Calculating P(Spam) and P(Ham)
spam_training_clean = training_clean[training_clean['Label']=='spam']
ham_training_clean = training_clean[training_clean['Label']=='ham']

p_spam = len(spam_training_clean)/len(training_clean)
p_ham = len(ham_training_clean)/len(training_clean)
print('P(spam) =',round(p_spam*100,2),'%')
print('P(ham) =',round(p_ham*100,2),'%')

P(spam) = 13.46 %
P(ham) = 86.54 %


In [19]:
# Calculating total number of words in spam messages
n_spam = 0
for message in spam_training_clean['SMS']:
    n_spam += len(message)
print(n_spam)

15190


In [20]:
# Calculating total number of words in ham messages
n_ham = 0
for message in ham_training_clean['SMS']:
    n_ham += len(message)
print(n_ham)

57237


In [21]:
# Number of words in vocabulary
n_voc = len(vocabulary)
print(n_voc)

7783


## Calculating Parameters

In [22]:
spam_word_p = {word:0 for word in vocabulary}
ham_word_p = {word:0 for word in vocabulary}
alpha = 1

In [23]:
# Calculating probability for each word
for word in vocabulary:
    n_w_if_spam = spam_training_clean[word].sum()
    p_w_if_spam = (n_w_if_spam + alpha) / (n_spam + alpha*n_voc)
    spam_word_p[word] = p_w_if_spam
    n_w_if_ham = ham_training_clean[word].sum()
    p_w_if_ham = (n_w_if_ham + alpha) / (n_ham + alpha*n_voc)
    ham_word_p[word] = p_w_if_ham
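
The loop applies additive (Laplace) smoothing to every word. Written with the notebook's variables as symbols (the notation is ours, chosen for illustration), the two estimates are:

$$
P(w \mid \text{spam}) = \frac{N_{w \mid \text{spam}} + \alpha}{N_{\text{spam}} + \alpha \cdot N_{\text{voc}}},
\qquad
P(w \mid \text{ham}) = \frac{N_{w \mid \text{ham}} + \alpha}{N_{\text{ham}} + \alpha \cdot N_{\text{voc}}}
$$

Here $N_{w \mid \text{spam}}$ is `n_w_if_spam`, $N_{\text{spam}}$ is `n_spam`, $N_{\text{voc}}$ is `n_voc`, and $\alpha$ is `alpha`. With $\alpha = 1$, a word that never appears in spam still gets a small non-zero probability, so a single unseen word cannot zero out the whole product.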

In [24]:
# Print the estimates for the first seven words in the vocabulary
for word in list(spam_word_p)[:7]:
    print('spam_p for', word, 'is', spam_word_p[word])
    print('ham_p for', word, 'is', ham_word_p[word])

spam_p for switch is 8.705872110738693e-05
ham_p for switch is 3.075976622577668e-05
spam_p for cr9 is 0.0003482348844295477
ham_p for cr9 is 1.537988311288834e-05
spam_p for sausage is 4.3529360553693465e-05
ham_p for sausage is 3.075976622577668e-05
spam_p for eggs is 4.3529360553693465e-05
ham_p for eggs is 4.6139649338665025e-05
spam_p for ghodbandar is 4.3529360553693465e-05
ham_p for ghodbandar is 3.075976622577668e-05
spam_p for linux is 4.3529360553693465e-05
ham_p for linux is 3.075976622577668e-05
spam_p for water is 4.3529360553693465e-05
ham_p for water is 0.00013841894801599507
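
These values can be checked by hand against the constants printed earlier. A short sketch of that arithmetic, using only numbers that appear in the outputs above (the variable names `p_unseen_spam`, `cr9_spam_count`, and so on are ours):

```python
import math

# Constants printed earlier in the notebook.
n_spam, n_ham, n_voc, alpha = 15190, 57237, 7783, 1

# Smallest possible estimates: a word with zero occurrences in the class.
p_unseen_spam = (0 + alpha) / (n_spam + alpha * n_voc)
p_unseen_ham = (0 + alpha) / (n_ham + alpha * n_voc)

# Implied raw counts for 'cr9', recovered from the printed probabilities.
cr9_spam_count = 0.0003482348844295477 * (n_spam + alpha * n_voc) - alpha
cr9_ham_count = 1.537988311288834e-05 * (n_ham + alpha * n_voc) - alpha

print(p_unseen_spam, p_unseen_ham)
print(round(cr9_spam_count), round(cr9_ham_count))
```

The first line reproduces the spam value printed for "sausage" and the ham value printed for "cr9". The second line suggests that "cr9" occurs 7 times in spam training messages and never in ham.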


## Creating a Spam Filter

In [25]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_word_p:
            p_spam_given_message *= spam_word_p[word]
        if word in ham_word_p:
            p_ham_given_message *= ham_word_p[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [26]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [27]:
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


In [28]:
classify('I´ll be there')

P(Spam|message): 5.796259616204653e-14
P(Ham|message): 1.1789144783618678e-09
Label: Ham
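
The products above are already as small as 1e-27 for short messages. For long messages, multiplying many such factors can underflow to 0.0 in floating point, and then both classes tie. A common workaround is to add logarithms instead of multiplying probabilities. The sketch below assumes toy values for the priors and word probabilities (`toy_p_spam`, `toy_spam_word_p`, and so on are hypothetical stand-ins, not the trained parameters), and `classify_log` is an illustrative name:

```python
import math
import re

# Hypothetical toy parameters standing in for the trained ones.
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_spam_word_p = {'free': 0.02, 'prize': 0.01, 'see': 0.001, 'you': 0.005}
toy_ham_word_p = {'free': 0.001, 'prize': 0.0005, 'see': 0.01, 'you': 0.02}

def classify_log(message):
    """Return 'spam' or 'ham' by comparing summed log-probabilities."""
    words = re.sub(r'\W', ' ', message).lower().split()

    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify().
        if word in toy_spam_word_p:
            log_spam += math.log(toy_spam_word_p[word])
        if word in toy_ham_word_p:
            log_ham += math.log(toy_ham_word_p[word])

    if log_spam > log_ham:
        return 'spam'
    elif log_ham > log_spam:
        return 'ham'
    return 'needs human classification'

print(classify_log('FREE prize!!'))
print(classify_log('See you'))
```

Because the logarithm is monotonic, comparing the sums gives the same label as comparing the products whenever the products do not underflow. With these toy values, a message that repeats "free" 500 times would drive both plain products to 0.0, while the log sums stay comparable.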


## Testing with the Test Set

In [29]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_word_p:
            p_spam_given_message *= spam_word_p[word]

        if word in ham_word_p:
            p_ham_given_message *= ham_word_p[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [30]:
test['predicted'] = test['SMS'].apply(classify_test_set)

In [31]:
test.head()

,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [32]:
# Measuring the accuracy
correct = 0
total = len(test)
for i, row in test.iterrows():
    if row['Label']==row['predicted']:
        correct += 1
accuracy = correct / total
print('Accuracy of spam filter is:', round(accuracy*100,2), '%')

Accuracy of spam filter is: 98.74 %
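
Accuracy alone can hide how the errors are distributed, which matters here because only about 13% of the messages are spam. A minimal sketch, on a hypothetical toy frame with the same `Label` and `predicted` columns, of a vectorized accuracy plus a confusion table built with `pd.crosstab`:

```python
import pandas as pd

# Hypothetical toy results; the real notebook would use the `test` frame.
results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'ham', 'ham'],
})

# Share of rows where the prediction matches the label, without a row loop.
accuracy = (results['Label'] == results['predicted']).mean()

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(results['Label'], results['predicted'])

# Of the true spam messages, the share that was caught (recall).
spam_recall = confusion.loc['spam', 'spam'] / confusion.loc['spam'].sum()

print(round(accuracy * 100, 2))
print(confusion)
print(spam_recall)
```

On the real test set, the off-diagonal cells of the confusion table would show whether the remaining errors are mostly missed spam or ham wrongly flagged as spam.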
