# Building a Spam Filter With Naive Bayes

Using a dataset from the UCI Machine Learning Repository, specifically the [SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), we'll be building a spam filter in Python using statistics and the Naive Bayes algorithm.

***Note: in our dataset, messages labeled "spam" are spam messages, and messages labeled "ham" are not spam.***

In [1]:
import pandas as pd

data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["Label","SMS"])



In [2]:
data.shape

(5572, 2)

In [3]:
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
data.groupby("Label").count()

Label,SMS
ham,4825
spam,747


## Percentage of spam vs non-spam messages

* As seen above, 4,825 out of 5,572 messages (~87%) are non-spam
* 747 out of 5,572 messages (~13%) are spam
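
The same shares can be read directly from the label column. A minimal sketch, assuming a small hypothetical `labels` Series standing in for `data["Label"]`:

```python
import pandas as pd

# Hypothetical stand-in for data["Label"]: 7 ham messages and 1 spam message
labels = pd.Series(["ham"] * 7 + ["spam"])

# normalize=True returns each label's share of the rows instead of raw counts
shares = labels.value_counts(normalize=True) * 100
print(shares.round(1))
```

On the full dataset, `data["Label"].value_counts(normalize=True)` should give roughly 0.87 for ham and 0.13 for spam, in line with the counts above.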

## Train / Test Split our data

In order to effectively build and test our algorithm, we'll need to split our data into 2 sets:
* 80% of our messages will be used for training our algorithm (~4,458 messages)
* 20% of our messages will be used for testing our algorithm (~1,114 messages)

Holding out a test set this way is standard practice in machine learning and data science work: it lets us measure performance on messages the algorithm has not seen during training.

### Our goal

Our goal is to create a spam filter that correctly classifies messages with an accuracy greater than 80%. So we expect that more than 80% of new messages will be classified correctly as spam or "ham" by our algorithm.

In [5]:
# Shuffle the entire dataset and make the results reproducible
# by seeding via the random_state argument
randomized_data = data.sample(frac=1, random_state=1)

# Calculate the index for splitting our train/test data
training_test_index = round(len(randomized_data) * 0.8)

train_data = randomized_data[:training_test_index].reset_index(drop=True)
test_data = randomized_data[training_test_index:].reset_index(drop=True)

In [6]:
print(train_data.shape)
print(test_data.shape)

(4458, 2)
(1114, 2)


In [7]:
train_data.groupby("Label").count()

Label,SMS
ham,3858
spam,600


In [8]:
test_data.groupby("Label").count()

Label,SMS
ham,967
spam,147


## Train test split results:

Based on our results above, the train and test sets look representative of the overall dataset:

* 3858 / 4458 (~87%) of our training data is non-spam
* 967 / 1114 (~87%) of our test data is non-spam

## Transforming our data

To be able to classify spam data with our dataset, we'll need to clean and transform our data so that:

1. Each column represents a unique word
1. The value in each column will be a count of the number of times a word appears in a message
1. All words will be lowercase
1. Punctuation will be ignored
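
To make the target format concrete, here is a minimal sketch on two made-up messages (not taken from the dataset). It uses `re` and `collections.Counter` rather than the pandas approach in the cells below, so treat it as an illustration of the four rules, not as this notebook's implementation.

```python
import re
from collections import Counter

import pandas as pd

# Two hypothetical messages used only for illustration
messages = ["Win CASH now!!", "See you now, now."]

rows = []
for message in messages:
    # Replace punctuation with spaces, lowercase, then split on whitespace
    words = re.sub(r"\W", " ", message).lower().split()
    rows.append(dict(Counter(words)))

# One column per unique word; words absent from a message get a count of 0
counts = pd.DataFrame(rows).fillna(0).astype(int)
print(counts)
```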



In [9]:
train_data.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [10]:
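# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
train_data["SMS"] = train_data["SMS"].str.replace(r'\W', ' ', regex=True)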
train_data["SMS"] = train_data["SMS"].str.lower()



In [11]:
train_data.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [12]:
vocabulary = []

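# Note: split(' ') keeps an empty string wherever two spaces are adjacent,
# so '' ends up in the vocabulary (visible in the output below)
train_data["SMS"] = train_data["SMS"].str.split(' ')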

for sms in train_data["SMS"]:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [13]:
vocabulary

['',
 'unhappiness',
 'anyhow',
 'cars',
 '07946746291',
 'chances',
 'apparently',
 'meeting',
 'stamps',
 'route',
 'mids',
 'cbe',
 'browse',
 'dis',
 'phd',
 'arguing',
 'abeg',
 'en',
 '3rd',
 'mutai',
 '6zf',
 'store',
 'bleak',
 'thinked',
 'z',
 'function',
 'honeymoon',
 'servs',
 'same',
 'shy',
 'kaila',
 'lodging',
 'holy',
 'old',
 '50pmmorefrommobile2bremoved',
 'gage',
 'liking',
 'darling',
 '3x',
 'wedding',
 'others',
 'duo',
 '864233',
 'sender',
 'steamboat',
 'matters',
 'village',
 'sitter',
 'grownup',
 'centre',
 'image',
 'personality',
 'vaazhthukkal',
 'vu',
 '3000',
 'falling',
 'velachery',
 '50pm',
 'matric',
 'noice',
 'and',
 'selling',
 'sts',
 'snickering',
 'ppm150',
 '08718726270',
 'dramatic',
 'doggin',
 'telediscount',
 'emigrated',
 'thankyou',
 'whenever',
 '08719181513',
 'inclusive',
 'natwest',
 'dark',
 'rum',
 'wine',
 '08719181503',
 'mobiles',
 'doubletxt',
 'try',
 'garage',
 'thout',
 'clearing',
 '195',
 'maneesha',
 'turning',
 'beer'

## Word counts

Now that we have our vocabulary, we'll count how many times each word appears in each message. These counts will feed our spam classifier later on.

In [14]:
word_counts_per_sms = {unique_word: [0] * len(train_data["SMS"]) for unique_word in vocabulary}

for index, sms in enumerate(train_data["SMS"]):
    for word in sms:
        word_counts_per_sms[word][index] += 1


In [15]:
word_counts = pd.DataFrame(word_counts_per_sms)

In [16]:
word_counts_clean = pd.concat([train_data, word_counts], axis=1)

In [17]:
word_counts_clean.head()

Unnamed: 0,Label,SMS,Unnamed: 3,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, , by, the, pretty, sculpture]",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, , princess, , are, you, going, to, make,...",3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[havent, ]",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s...",7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating Probabilities

Now that we have clean data, we'll start calculating probabilities based on our training set.

We'll also use Laplace smoothing and set alpha = 1.
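
Smoothing matters because a word that never appears in one class would otherwise get a probability of zero and wipe out the whole product for that class. A small sketch with hypothetical counts (the real counts are computed in the cells below; the helper name `smoothed_probability` is made up for illustration):

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """P(word | class) with additive (Laplace) smoothing."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Hypothetical class with 100 words in total and a 50-word vocabulary
unseen = smoothed_probability(0, 100, 50)               # word never seen in this class
unsmoothed = smoothed_probability(0, 100, 50, alpha=0)  # same word without smoothing
seen = smoothed_probability(9, 100, 50)                 # word seen 9 times

print(unseen, unsmoothed, seen)
```

With `alpha=0` the unseen word gets probability zero, while `alpha=1` gives it a small positive value.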

In [18]:
train_data["Label"].value_counts()

ham     3858
spam     600
Name: Label, dtype: int64

In [19]:
train_spam = word_counts_clean["Label"] == "spam"
train_spam = word_counts_clean[train_spam]

train_ham = word_counts_clean["Label"] == "ham"
train_ham = word_counts_clean[train_ham]

In [20]:
p_ham = len(train_ham) / len(train_data)
p_spam = len(train_spam) / len(train_data)

In [21]:
train_spam

Unnamed: 0,Label,SMS,Unnamed: 3,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
16,spam,"[freemsg, why, haven, t, you, replied, to, my,...",5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,spam,"[congrats, , 2, mobile, 3g, videophones, r, yo...",10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
56,spam,"[free, message, activate, your, 500, free, tex...",3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60,spam,"[call, from, 08702490080, , , tells, u, 2, cal...",10,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
61,spam,"[someone, has, conacted, our, dating, service,...",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
62,spam,"[ree, entry, in, 2, a, weekly, comp, for, a, c...",4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
70,spam,"[ur, cash, balance, is, currently, 500, pounds...",3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
71,spam,"[this, is, the, 2nd, time, we, have, tried, 2,...",7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
84,spam,"[final, chance, , claim, ur, , 150, worth, of,...",9,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
89,spam,"[urgent, , we, are, trying, to, contact, you, ...",6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
train_ham

Unnamed: 0,Label,SMS,Unnamed: 3,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, , by, the, pretty, sculpture]",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, , princess, , are, you, going, to, make,...",3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,"[havent, ]",1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, , , there, s...",7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
5,ham,"[ok, i, thk, i, got, it, , then, u, wan, me, 2...",2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ham,"[i, want, kfc, its, tuesday, , only, buy, 2, m...",5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,ham,"[no, dear, i, was, sleeping, , , p]",2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,ham,"[ok, pa, , nothing, problem, , , ]",4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,ham,"[ill, be, there, on, , , lt, , , gt, , , ok, ]",7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
n_spam = train_spam["SMS"].apply(len)
n_spam = n_spam.sum()
n_spam

17956

In [24]:
n_ham = train_ham["SMS"].apply(len)
n_ham = n_ham.sum()
n_ham

71219

In [25]:
n_vocabulary = len(vocabulary)

In [26]:
n_vocabulary

7784

In [27]:
alpha = 1

## Applying the Naive Bayes Algorithm

Below we'll use the Naive Bayes formula with Laplace smoothing to estimate, from the training set, the probability of each vocabulary word given spam and given ham.

Below is the algorithm we'll be using [(source)](https://en.wikipedia.org/wiki/Naive_Bayes_classifier):

![Naive Bayes Algorithm Equation](https://wikimedia.org/api/rest_v1/media/math/render/svg/52bd0ca5938da89d7f9bf388dc7edcbd546c118e "Naive Bayes Algorithm Equation")
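
For reference, the cells below compute the following quantities. The notation is chosen here to mirror the notebook's variable names: $N_{w_i \mid Spam}$ corresponds to `n_word_given_spam`, $N_{Spam}$ to `n_spam`, $N_{Vocabulary}$ to `n_vocabulary`, and $\alpha$ to `alpha`. The ham versions are analogous.

$$
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
$$

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$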

In [28]:
spam_parameters = {unique_word:0 for unique_word in vocabulary}
ham_parameters = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    # Find the number of occurrences of the word in our spam training set
    n_word_given_spam = train_spam[word].sum()
    # Find the number of occurrences of the word in our ham training set
    n_word_given_ham = train_ham[word].sum()
    
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocabulary)

    spam_parameters[word] = p_word_given_spam
    ham_parameters[word] = p_word_given_ham

    

## Creating our spam filter

Now that we have all the probabilities and parameters sorted out from our training set, we can now create our function to predict and filter spam messages based on our test set.

We want to create a function that:

1. Takes in a new message.
1. Calculates the probability of "spam" and "ham".
1. Compares the two values and classifies it as "spam" or "ham" depending on which value is greater.
1. If the probabilities are equal, then the algorithm will request human help determining whether the message is spam or not.

In [29]:
import re

def classify_spam(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]
        

    print('P(Spam|message): ', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
        


## Testing our function

Below we'll use the two test strings to test our function:

* **Spam** - 'WINNER!! This is the secret code to unlock the money: C3421.'

* **Ham** - 'Sounds good, Tom, then see u there'


In [30]:
spam_test = 'WINNER!! This is the secret code to unlock the money: C3421.'
ham_test = 'Sounds good, Tom, then see u there'

classify_spam(spam_test)

P(Spam|message):  4.8441099043219405e-26
P(Ham|message): 3.3551837806850006e-28
Label: Spam


In [31]:
classify_spam(ham_test)

P(Spam|message):  1.099416001503406e-25
P(Ham|message): 9.431033496608998e-22
Label: Ham
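
The values printed above are tiny because every word multiplies in a probability well below 1. For long messages such a product can underflow to `0.0` in floating point, which would make both classes tie. A common workaround is to compare sums of logarithms instead. The sketch below assumes small hypothetical priors and parameter dictionaries in place of `p_spam`, `p_ham`, `spam_parameters` and `ham_parameters`; the function name `classify_with_logs` is also made up for illustration.

```python
import math
import re

# Hypothetical priors and per-word probabilities, for illustration only
p_spam_demo, p_ham_demo = 0.13, 0.87
spam_params_demo = {"win": 0.02, "money": 0.03, "see": 0.001}
ham_params_demo = {"win": 0.001, "money": 0.002, "see": 0.02}

def classify_with_logs(message):
    words = re.sub(r"\W", " ", message).lower().split()

    # log(a * b) = log(a) + log(b), so the products become sums
    log_spam = math.log(p_spam_demo)
    log_ham = math.log(p_ham_demo)
    for word in words:
        if word in spam_params_demo:
            log_spam += math.log(spam_params_demo[word])
        if word in ham_params_demo:
            log_ham += math.log(ham_params_demo[word])

    if log_ham > log_spam:
        return "ham"
    if log_ham < log_spam:
        return "spam"
    return "needs human classification"

print(classify_with_logs("Win money!"))
print(classify_with_logs("See you"))
```

Because the logarithm is monotonic, comparing the log scores picks the same label as comparing the raw products whenever those products do not underflow.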


## Using our function with our test data

Below, we'll redefine our function to return the classifying labels instead of printing them, and use it in our test data set.

We'll also see how accurate our function is by measuring accuracy, which will be defined as:

* Number of correctly classified messages / total number of classified messages
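
As a sanity check on this definition, accuracy can also be computed without an explicit loop by comparing two columns element-wise. A minimal sketch on a hypothetical four-row frame that reuses the `Label` and `predicted` column names:

```python
import pandas as pd

# Hypothetical labels and predictions; 3 of the 4 rows agree
results = pd.DataFrame({
    "Label": ["ham", "spam", "ham", "spam"],
    "predicted": ["ham", "spam", "ham", "ham"],
})

# The comparison yields booleans; their mean is the fraction of matches
accuracy = (results["Label"] == results["predicted"]).mean()
print("Accuracy: {:.2%}".format(accuracy))
```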

In [34]:
import re

def classify_spam_data(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_parameters:
            p_spam_given_message *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_message *= ham_parameters[word]
        
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
        


In [35]:
test_data['predicted'] = test_data['SMS'].apply(classify_spam_data)

In [38]:
test_data.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [57]:
correct = 0
total = len(test_data)

for row in test_data.iterrows():
    row = row[1]
    if row["Label"] == row["predicted"]:
        correct += 1
        
spam_classifier_accuracy = correct / total

print("Correct # of classifications: ", correct)
print("Incorrect # of classifications", total - correct)
print("Accuracy rate: {:.2%}".format(spam_classifier_accuracy))

Correct # of classifications:  1099
Incorrect # of classifications 15
Accuracy rate: 98.65%


## Results

Overall, our spam filter is 98.65% accurate on the 1,114 messages in our test set (1,099 classified correctly, 15 incorrectly).

Our initial goal was an accuracy greater than 80%, so this result greatly exceeds our goal.

## Next steps

For next steps, we should:

1. Explore the 15 mis-labeled messages to understand why they were classified incorrectly
1. Tweak our algorithm to take into account casing (i.e. uppercase vs lowercase letters)
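
For the first step, a boolean mask is one way to pull out the mislabeled rows. A sketch, assuming a frame shaped like `test_data` with `Label`, `SMS`, and `predicted` columns (the rows here are made up):

```python
import pandas as pd

# Hypothetical stand-in for test_data after predictions were added
demo = pd.DataFrame({
    "Label": ["ham", "spam", "ham"],
    "SMS": ["see you soon", "call now to claim", "free tonight?"],
    "predicted": ["ham", "ham", "spam"],
})

# Keep only the rows where the prediction disagrees with the true label
misclassified = demo[demo["Label"] != demo["predicted"]]
print(misclassified)
```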