## BUILDING A SPAM FILTER WITH NAIVE BAYES

In this project, I'm going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm.
My goal is to build a filter that classifies new SMS messages with an accuracy greater than 80%.
To train the algorithm, I'll use this [dataset](https://data.world/lylepratt/sms-spam), created by Lyle Pratt, from [data.world](https://data.world/). The dataset comes in .txt format, so I converted it into a .csv file. It contains 5,574 SMS messages that have already been classified by humans.

### THE DATASET

In [1]:
#Importing necessary packages
import pandas as pd

#Reading in the dataset
data = pd.read_csv("sms.csv", encoding = "unicode_escape")
print(data.head())
data.info()

  Label                                               Text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
Label    5574 non-null object
Text     5574 non-null object
dtypes: object(2)
memory usage: 87.2+ KB


In [2]:
data['Label'].value_counts(normalize=True)

ham     0.865985
spam    0.134015
Name: Label, dtype: float64

About 87% of the messages are ham and 13% are spam. Since most messages that people actually receive are ham, the sample looks representative.

### TRAINING AND TEST SET
Now I'm going to split the dataset into a training set containing 80% of the data and a test set containing the remaining 20%.

In [3]:
#Randomizing the dataset
random_data = data.sample(frac=1, random_state=1)

#Index for splitting
index = round(len(random_data)*0.8)

#Splitting the dataset
training_set = random_data[:index].reset_index(drop=True)
test_set = random_data[index:].reset_index(drop=True)

print("Number of rows in training Set = ",training_set.shape[0])
print("Number of rows in test Set = ",test_set.shape[0])


Number of rows in training Set =  4459
Number of rows in test Set =  1115
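
The shuffle-then-slice idea used above can be tried on a small scale. This is a minimal sketch using only the standard library and ten made-up messages; the `messages` list, the seed, and the variable names are placeholders rather than part of the project data. It mirrors the idea of `sample(frac=1)` followed by slicing at the 80% mark, not pandas' exact shuffling.

```python
import random

# Ten placeholder messages standing in for the rows of the dataset
messages = [f"message {i}" for i in range(10)]

# Shuffle a copy with a fixed seed so the split is reproducible
shuffled = messages[:]
random.Random(1).shuffle(shuffled)

# Slice at the 80% mark: the first part is for training, the rest for testing
split_index = round(len(shuffled) * 0.8)
training_part = shuffled[:split_index]
test_part = shuffled[split_index:]

print(len(training_part), len(test_part))
```

Shuffling before slicing matters because the rows of a file may be ordered in some way, and slicing an ordered file could put very different messages in the two sets.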


I'll now calculate the percentage of spam and ham messages in the training and test sets. The percentages should be close to those calculated for the full dataset.

In [4]:
training_set["Label"].value_counts(normalize=True)

ham     0.865216
spam    0.134784
Name: Label, dtype: float64

In [5]:
test_set["Label"].value_counts(normalize=True)

ham     0.869058
spam    0.130942
Name: Label, dtype: float64

The spam and ham percentages are close to the percentages in the full dataset. So let's move on to the next step.

### DATA CLEANING

#### 1. CONVERTING TO LOWERCASE

In [6]:
training_set["Text"] = training_set["Text"].str.lower()
training_set.head()

| | Label | Text |
|---|---|---|
| 0 | ham | looks like u wil b getting a headstart im leav... |
| 1 | ham | i noe la... u wana pei bf oso rite... k lor, o... |
| 2 | ham | 2mro i am not coming to gym machan. goodnight. |
| 3 | spam | todays vodafone numbers ending with 4882 are s... |
| 4 | ham | hi. hope ur day * good! back from walk, table ... |


#### 2. REMOVING PUNCTUATION

In [7]:
import re
def punctuation(text):
    #Raw string, so that \W is passed to the regex engine unchanged
    return re.sub(r'\W', ' ', text)
training_set["Text"] = training_set["Text"].apply(punctuation)
training_set.head()



| | Label | Text |
|---|---|---|
| 0 | ham | looks like u wil b getting a headstart im leav... |
| 1 | ham | i noe la u wana pei bf oso rite k lor o... |
| 2 | ham | 2mro i am not coming to gym machan goodnight |
| 3 | spam | todays vodafone numbers ending with 4882 are s... |
| 4 | ham | hi hope ur day good back from walk table ... |
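
To see what the two cleaning steps do to a single message, here is a small sketch. The sample string and the helper name `clean` are made up for illustration. Note that `\W` matches every character that is not a letter, digit, or underscore, so underscores survive the cleaning.

```python
import re

def clean(text):
    """Lowercase the text and replace every non-word character with a space."""
    return re.sub(r"\W", " ", text.lower())

# Made-up sample message
sample = "Hi!! Call_me @ 5pm... OK?"
tokens = clean(sample).split()

print(tokens)  # ['hi', 'call_me', '5pm', 'ok']
```

Replacing punctuation with a space rather than deleting it keeps words such as `ok...bye` from being glued together into one token.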


### VOCABULARY

I'll now create a vocabulary, which is a list of all the unique words in the training set.

In [8]:
#Splitting each line into words
training_set["Text"] = training_set["Text"].str.split()

#Creating an empty list
vocab = []
for text in training_set["Text"]:
    for word in text:
        vocab.append(word)
print("Number of total words = ",len(vocab))        

#converting list to set removes all the duplicate values.
#So I'm converting the list to set and then converting the set to list, to get only the unique values
vocab = list(set(vocab))
print("Number of unique words = ",len(vocab))


Number of total words =  72469
Number of unique words =  7798


Now I'm going to create a dataframe that counts how many times each word in the vocabulary occurs in each message.

In [9]:
word_count = {uniqueword : [0]*len(training_set["Text"]) for uniqueword in vocab}
for index,text in enumerate(training_set["Text"]):
    for word in text:
        word_count[word][index] += 1

In [10]:
word_count = pd.DataFrame(word_count)
word_count.head()

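| | studying | sign | 09095350301 | singapore | high | mas | if | tickets | 18 | skirt | ... | la1 | talks | raji | pa | smidgin | sheets | atural | responding | loneliness | missions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |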


In [11]:
#Concatenating training_set with word_count
final_training_set = pd.concat([training_set,word_count],axis=1)
final_training_set.head()

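| | Label | Text | studying | sign | 09095350301 | singapore | high | mas | if | tickets | ... | la1 | talks | raji | pa | smidgin | sheets | atural | responding | loneliness | missions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [looks, like, u, wil, b, getting, a, headstart... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [i, noe, la, u, wana, pei, bf, oso, rite, k, l... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [2mro, i, am, not, coming, to, gym, machan, go... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | spam | [todays, vodafone, numbers, ending, with, 4882... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [hi, hope, ur, day, good, back, from, walk, ta... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |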
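
The vocabulary and the word-count table are easier to follow on a tiny example. The sketch below applies the same dictionary-of-lists approach to two made-up, already tokenized messages (the `messages` list is a placeholder, not taken from the dataset).

```python
# Two made-up messages, already cleaned and split into words
messages = [["win", "cash", "now"], ["see", "you", "now", "now"]]

# Vocabulary: every unique word across all messages (sorted for a stable order)
vocab = sorted({word for message in messages for word in message})

# One list per word, holding one counter per message
word_count = {word: [0] * len(messages) for word in vocab}
for index, message in enumerate(messages):
    for word in message:
        word_count[word][index] += 1

print(vocab)              # ['cash', 'now', 'see', 'win', 'you']
print(word_count["now"])  # [1, 2]
```

Each key of `word_count` becomes a column and each position in its list becomes a row, which is why the dictionary can be passed directly to `pd.DataFrame`.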


### BUILDING SPAM FILTER

When a new message comes in, the multinomial Naive Bayes algorithm will classify it based on the result of the two equations below:

1. $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })\propto P(Spam) \cdot\prod _{ i=1 }^{ n }{ P({ w }_{ i }|Spam) } $

2. $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })\propto P(Ham) \cdot\prod _{ i=1 }^{ n }{ P({ w }_{ i }|Ham) } $

If $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ is greater than $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$, then the message is considered to be spam.

To calculate $P({ w }_{ i }|Spam)$ and $P({ w }_{ i }|Ham)$, the following two equations will be used:

3. $P({ w }_{ i }|Spam)=\frac { { N }_{ { w }_{ i }|Spam }+\alpha  }{ { N }_{ Spam }+\alpha \cdot { N }_{ Vocabulary } }$

4. $P({ w }_{ i }|Ham)=\frac { { N }_{ { w }_{ i }|Ham }+\alpha  }{ { N }_{ Ham }+\alpha \cdot { N }_{ Vocabulary } }$

where,

${ N }_{ { w }_{ i }|Spam }$ = Number of times the word ${ w }_{ i }$ occurs in spam messages

${ N }_{ { w }_{ i }|Ham }$ = Number of times the word ${ w }_{ i }$ occurs in ham messages

${ N }_{ Spam }$ = Total number of words in spam messages

${ N }_{ Ham }$ = Total number of words in ham messages

${ N }_{ Vocabulary }$ = Number of unique words in the vocabulary

$\alpha$ = Laplace smoothing parameter. Set $\alpha$ = 1
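
As a quick worked example of equation 3, the sketch below plugs in small made-up numbers. The counts, `n_spam = 7`, `n_vocab = 9`, and the three words are all hypothetical and chosen only to keep the arithmetic easy to check.

```python
alpha = 1    # Laplace smoothing parameter
n_spam = 7   # hypothetical total number of words in spam messages
n_vocab = 9  # hypothetical number of unique words in the vocabulary

# Hypothetical counts for three of the nine vocabulary words
n_word_given_spam = {"secret": 4, "money": 2, "lunch": 0}

# Equation 3: (N_word|Spam + alpha) / (N_Spam + alpha * N_Vocabulary)
p_word_given_spam = {
    word: (count + alpha) / (n_spam + alpha * n_vocab)
    for word, count in n_word_given_spam.items()
}

print(p_word_given_spam)
# {'secret': 0.3125, 'money': 0.1875, 'lunch': 0.0625}
```

The word `lunch` never occurs in the hypothetical spam messages, yet it gets a probability of 1/16 instead of 0. Without smoothing, a single unseen word would turn the whole product in equation 1 into zero.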


Now I'll calculate the terms that stay constant for every message, so that they don't have to be recomputed each time a new message is classified.


In [12]:
#Isolating Spam and Ham messages

spam = final_training_set[final_training_set["Label"]=="spam"]
ham = final_training_set[final_training_set["Label"]=="ham"]

#Calculating P(Spam) and P(Ham)

p_spam = len(spam)/len(final_training_set)
p_ham = len(ham)/len(final_training_set)

#Calculating N_Spam and N_Ham

words_per_spam_message = spam["Text"].apply(len)
n_spam = words_per_spam_message.sum()

words_per_ham_message = ham["Text"].apply(len)
n_ham = words_per_ham_message.sum()

#N_Vocabulary

n_vocab = len(vocab)

#Setting the value of Laplace Smoothing

alpha = 1


The next step is to calculate the parameters $P({ w }_{ i }|Spam)$ and $P({ w }_{ i }|Ham)$ for every word in the vocabulary.

In [13]:
spam_parameters = {uniqueword:0 for uniqueword in vocab}
ham_parameters = {uniqueword:0 for uniqueword in vocab}

for word in vocab:
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha * n_vocab)
    spam_parameters[word] = p_word_given_spam
    
    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha * n_vocab)
    ham_parameters[word] = p_word_given_ham

### CLASSIFYING A NEW MESSAGE

Now that I have calculated all the parameters, I'll start building the spam filter.

The spam filter is a function that:

1. Takes in a message as an input
2. Converts the message into a list of words ${ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n }$ by removing punctuation and converting everything to lowercase
3. Calculates $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ and $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$
4. Compares the value of $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ and $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ and:
    1. If $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ > $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$, the message will be classified as Spam.
    2. If $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ < $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$, the message will be classified as Ham.
    3. If $P(Spam|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$ = $P(Ham|{ w }_{ 1 },{ w }_{ 2 },....,{ w }_{ n })$, the algorithm may need human help.



In [14]:
def classification(text):
    text = re.sub(r'\W', ' ', text)
    text = text.lower().split()
    
    p_spam_given_text = p_spam
    p_ham_given_text = p_ham
    
    for word in text:
        if word in spam_parameters:
            p_spam_given_text *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_text *= ham_parameters[word]
    
    print("P(Spam|message) = ",p_spam_given_text)
    print("P(Ham|message) = ",p_ham_given_text)
    
    if p_spam_given_text > p_ham_given_text:
        print("Spam")
    elif p_spam_given_text < p_ham_given_text:
        print("Ham")
    else:
        print("Equal")

In [15]:
classification('CONGRATS!! WINNER!! Code to unlock : 12345')

P(Spam|message) =  8.490646913435681e-13
P(Ham|message) =  1.1052417705809237e-15
Spam


In [16]:
classification("I'll meet you there")


P(Spam|message) =  1.0548661294420755e-16
P(Ham|message) =  7.209864739040585e-12
Ham
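
The probabilities printed above are already very small for messages of only a few words. For much longer messages, a product of many small numbers can underflow to 0.0 for both classes. A common variant, which is not what this notebook uses, is to add logarithms instead of multiplying probabilities. The sketch below shows the idea with made-up toy parameters; `classify_log` and all the numbers are hypothetical.

```python
import math
import re

# Made-up toy parameters standing in for the values computed from the training set
p_spam, p_ham = 0.4, 0.6
spam_parameters = {"win": 0.3, "cash": 0.3, "now": 0.2, "see": 0.1, "you": 0.1}
ham_parameters = {"win": 0.05, "cash": 0.05, "now": 0.3, "see": 0.3, "you": 0.3}

def classify_log(text):
    """Same comparison as the filter above, but carried out on log-probabilities."""
    words = re.sub(r"\W", " ", text).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are ignored, as in the original function
        if word in spam_parameters:
            log_spam += math.log(spam_parameters[word])
        if word in ham_parameters:
            log_ham += math.log(ham_parameters[word])
    if log_spam > log_ham:
        return "spam"
    elif log_spam < log_ham:
        return "ham"
    return "equal"

print(classify_log("Win cash now!!"))  # spam
print(classify_log("See you now"))     # ham

# A plain product of 2000 factors of 0.3 underflows to 0.0 ...
plain_product = spam_parameters["win"] ** 2000
print(plain_product)                   # 0.0
# ... while the log-based comparison still separates the two classes
print(classify_log("win " * 2000))     # spam
```

Because the logarithm is strictly increasing, comparing the sums of logs gives the same decision as comparing the products whenever the products do not underflow.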


### MEASURING THE ACCURACY OF THE FILTER

In [17]:
def classification_test_set(text):
    text = re.sub(r'\W', ' ', text)
    text = text.lower().split()
    
    p_spam_given_text = p_spam
    p_ham_given_text = p_ham
    
    for word in text:
        if word in spam_parameters:
            p_spam_given_text *= spam_parameters[word]
        if word in ham_parameters:
            p_ham_given_text *= ham_parameters[word]
    
    if p_spam_given_text > p_ham_given_text:
        return "spam"
    elif p_spam_given_text < p_ham_given_text:
        return "ham"
    else:
        return "equal"

In [18]:
#Creating a new column in test set
test_set["Predicted"] = test_set["Text"].apply(classification_test_set)
test_set.head()

| | Label | Text | Predicted |
|---|---|---|---|
| 0 | ham | Wherre's my boytoy ? :-( | ham |
| 1 | ham | Later i guess. I needa do mcat study too. | ham |
| 2 | ham | But i haf enuff space got like 4 mb... | ham |
| 3 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 4 | ham | All sounds good. Fingers . Makes it difficult ... | ham |


In [19]:
#Calculating the accuracy (a prediction of "equal" never matches a label, so it counts as incorrect)
correct = 0
total = test_set.shape[0]
for row in test_set.iterrows():
    row = row[1]
    if row["Label"] == row["Predicted"]:
        correct += 1
print("Correct = ", correct)
print("Incorrect = ", total-correct)
print("Accuracy = ", correct/total)

Correct =  1103
Incorrect =  12
Accuracy =  0.989237668161435


The accuracy is 98.9%, well above the goal of 80%. Out of the 1,115 messages that the spam filter did not see during training, it classified 1,103 correctly.
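
Since about 87% of the messages are ham, a filter that labelled every message as ham would already score about 87% accuracy, so it can be useful to see which kind of mistakes are being made. The sketch below counts (actual, predicted) pairs on six made-up labels; the two lists are placeholders and do not come from the test set.

```python
from collections import Counter

# Made-up actual labels and predictions, for illustration only
labels    = ["ham", "ham", "spam", "ham", "spam", "ham"]
predicted = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Count how often each (actual, predicted) combination occurs
pairs = Counter(zip(labels, predicted))

# Accuracy is the share of pairs where actual and predicted agree
correct = sum(count for (actual, guess), count in pairs.items() if actual == guess)
accuracy = correct / len(labels)

print(pairs[("ham", "spam")])  # ham wrongly flagged as spam: 1
print(pairs[("spam", "ham")])  # spam that slipped through: 1
print(accuracy)
```

On the real test set, the same counts could be obtained by passing `test_set["Label"]` and `test_set["Predicted"]` to `zip` in place of the two lists.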