## Creating a Spam Filter using Naive Bayes

The purpose of this project is to create a spam filter using the multinomial Naive Bayes algorithm. We will use a data set of 5,572 SMS messages that have already been classified by humans. We will use this __[data](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)__ collected by Tiago A. Almeida and José María Gómez Hidalgo to help "teach" the computer how to classify messages into spam and non-spam. Our goal is to write a program that is over 80% accurate.

## Exploring the Dataset

Let's start by reading in the data and getting familiar with it.

In [1]:
import pandas as pd

#sep ='\t' for tab separated data, not csv, no header, names create column labels
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names =['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Below we can see that about 87% of the SMS messages are non-spam, with the remaining being spam. This sample seems representative, since in practice most messages a person receives are ham.

In [2]:
#ham means non-spam
#percentage of each type
sms_spam['Label'].value_counts(normalize = True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

## Training and Test Set

We will now split our SMS data into a training and a test data set. We will use 80% of the data for training and 20% for testing our spam filter.

In [3]:
#randomizing rows of the data
sms_random = sms_spam.sample(frac=1, random_state = 1)

#calculating number of rows to use for training
training_set_index = round(len(sms_spam.index) * .8)

#splitting data
training_set = sms_random[0:training_set_index].reset_index(drop=True)
test_set = sms_random[training_set_index:].reset_index(drop=True)

#check row counts
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


Now that we have divided the data into a training set and a test set, let's examine the percentages of spam and ham in each to check that they are similar to the entire dataset. We are looking for around 87% ham and 13% spam.

In [4]:
print(training_set['Label'].value_counts(normalize=True) * 100, '\n')
print(test_set['Label'].value_counts(normalize=True) * 100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64 

ham     86.804309
spam    13.195691
Name: Label, dtype: float64


These results look good. We will proceed by cleaning the dataset.

## Cleaning Data

To calculate all the probabilities our algorithm needs, we first have to get the data into a format from which word counts are easy to extract.

Essentially, we want to get from the top format to the bottom format:

| | Label| SMS |
| --- | --- | --- |
| 0 | spam | WINNER! MUST CLAIM NOW! CLAIM BIG PRIZE! |
| 1 | ham | Coming to my big party? |
| 2 | spam | Winner! Claim Government money. |

| | Label | winner | must | claim | now | big | prize | coming | to | my | party | government | money |
| --- | --- | --- | ---| ---| --- | --- | --- | --- | --- | --- | --- | --- | ---|
| 0 | spam | 1 | 1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham  | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| 2 | spam | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
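
As a small illustration of this transformation, here is a sketch that rebuilds the bottom table from the three example messages using only the standard library. It is not the notebook's own pipeline, and the names `toy_messages`, `toy_vocabulary`, and `toy_counts` are made up for this example.

```python
import re
from collections import Counter

# The three example messages from the table above, with their labels.
toy_messages = [
    ("spam", "WINNER! MUST CLAIM NOW! CLAIM BIG PRIZE!"),
    ("ham", "Coming to my big party?"),
    ("spam", "Winner! Claim Government money."),
]

# Replace non-word characters with spaces, lower-case, and split into words.
tokenized = [re.sub(r"\W", " ", text).lower().split() for _, text in toy_messages]

# Vocabulary in order of first appearance, so the columns match the table.
toy_vocabulary = list(dict.fromkeys(word for words in tokenized for word in words))

# One row of counts per message, one entry per vocabulary word.
toy_counts = [[Counter(words)[word] for word in toy_vocabulary] for words in tokenized]

print(toy_vocabulary)
for (label, _), row in zip(toy_messages, toy_counts):
    print(label, row)
```

Each printed row should line up with the corresponding row of the bottom table.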

### Punctuation and Case

We will begin by cleaning the SMS messages. Below, we will strip the messages of punctuation and make all the letters lower case. 
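
To make the effect of the `\W` pattern concrete, here is a minimal sketch on two strings. The helper name `clean_and_split` and the second test string are assumptions made for illustration. Note that in Python 3 `\W` is Unicode-aware for `str` patterns, so letters such as `ü` are kept, and the underscore counts as a word character.

```python
import re

def clean_and_split(text):
    """Replace non-word characters with spaces, lower-case, and split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()

# A message from the training set, plus a made-up string with an underscore and a digit.
print(clean_and_split("Yep, by the pretty sculpture"))
print(clean_and_split("snake_case ü 2DAY!!"))
```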

In [5]:
#before cleaning for ease of checking
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [6]:
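#regex=True is required in recent pandas versions, where str.replace matches literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)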
training_set['SMS'] = training_set['SMS'].str.lower()
#after cleaning
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


### Creating the Vocabulary

Below we will create a vocabulary of all the words in the SMS messages. This will be a list of all the unique words in the training set.

In [7]:
#creating vocab from words in training set messages
training_set['SMS'] = training_set['SMS'].str.split()
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


We can see there are 7,783 unique words in our training set's SMS messages.

In [8]:
vocabulary = []

for row in training_set['SMS']:
    for word in row:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))      
len(vocabulary)

7783

### The Final Training Set

Now that we have our vocabulary for the training set, we can transform our dataset into the desired form. We will have each column represent a word from our training set vocabulary. The values in each column and row are determined by the number of times the word appears in each message.

In [9]:
#initialize dictionary: one list of zeros per vocabulary word, one slot per message
word_counts_per_sms = {unique_word : [0] * len(training_set['SMS']) for unique_word in vocabulary}

#get word counts for words in each message using index
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [10]:
#transform dictionary to dataframe
word_counts = pd.DataFrame(word_counts_per_sms)

word_counts.head()

Unnamed: 0,v,realise,tablet,actual,cann,landline,school,ticket,supreme,scream,...,jsco,carryin,keralacircle,platt,35p,wise,sad,withdraw,salary,preferably
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that we have the word counts as a dataframe, we will combine it with the training set. The combined dataframe will include the ham/spam `Label`, the list of words in each SMS message, and one count column per vocabulary word.

In [11]:
training_set_clean = pd.concat([training_set, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,v,realise,tablet,actual,cann,landline,school,ticket,...,jsco,carryin,keralacircle,platt,35p,wise,sad,withdraw,salary,preferably
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Calculating Constants for Filter

We are done cleaning the training set. Now we can start creating the spam filter. We will be using the Naive Bayes algorithm to classify SMS messages as spam or ham. To classify a message, we will evaluate the two expressions below and compare them.

_$$P(Spam|w_1, w_2, \ldots, w_n) \propto P(Spam)\cdot \prod_{i=1}^{n}P(w_i|Spam)$$_
_$$P(Ham|w_1, w_2, \ldots, w_n) \propto P(Ham)\cdot \prod_{i=1}^{n}P(w_i|Ham)$$_

In addition, we will need to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ in the formulas above. We need to use the following equations:

_$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$_

_$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$_

Also, some of the terms in the four equations above will maintain their value for each new message. We can avoid duplicate computations each time a new message comes in by calculating each of these terms once below. We will use our training set to calculate:

- $P(Spam)$ and $P(Ham)$
- $N_{Spam}, N_{Ham}, N_{Vocabulary}$

We will also use Laplace smoothing here to avoid any multiplying by zero issues and set $\alpha$ = 1.
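
As a worked example of these formulas, the sketch below computes the constants and a few smoothed parameters for the three-message toy data set from the Cleaning Data section. It uses exact fractions so the arithmetic can be checked by hand. All `toy_` names and the `smoothed` helper are assumptions made for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Toy training data: the three cleaned example messages and their labels.
toy_training = [
    ("spam", ["winner", "must", "claim", "now", "claim", "big", "prize"]),
    ("ham", ["coming", "to", "my", "big", "party"]),
    ("spam", ["winner", "claim", "government", "money"]),
]

toy_alpha = 1
toy_n_vocabulary = len({word for _, words in toy_training for word in words})

# N_{w_i|class}: how often each word appears in the messages of each class.
toy_word_counts = {"spam": Counter(), "ham": Counter()}
for label, words in toy_training:
    toy_word_counts[label].update(words)

# N_Spam and N_Ham: total number of words in the messages of each class.
toy_n_words = {label: sum(counts.values()) for label, counts in toy_word_counts.items()}

def smoothed(word, label):
    """P(word | label) with additive (Laplace) smoothing."""
    numerator = toy_word_counts[label][word] + toy_alpha
    denominator = toy_n_words[label] + toy_alpha * toy_n_vocabulary
    return Fraction(numerator, denominator)

print("N_vocabulary:", toy_n_vocabulary)
print("N_spam, N_ham:", toy_n_words["spam"], toy_n_words["ham"])
print("P(claim|spam):", smoothed("claim", "spam"))
print("P(claim|ham):", smoothed("claim", "ham"))
```

Because of the smoothing term, `claim` gets a small non-zero probability under ham even though it never appears in a ham message, so a single unseen word cannot force a product to zero.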

The constants are calculated below. We split the messages from our training set by `Label`. Then we will perform the necessary calculations.

In [12]:
#isolate spam and ham messages from training set
spam_messages = training_set_clean[training_set_clean['Label'] == 'spam']
ham_messages = training_set_clean[training_set_clean['Label'] == 'ham']

#P(spam) and P(ham) training set
p_spam = len(spam_messages) / len(training_set_clean)
p_ham = len(ham_messages)/ len(training_set_clean)

# N_spam
num_words_per_spam = spam_messages['SMS'].apply(len)
n_spam = num_words_per_spam.sum()

#N_ham
num_words_per_ham = ham_messages['SMS'].apply(len)
n_ham = num_words_per_ham.sum()

# N_vocabulary
n_vocabulary = len(vocabulary)

#Laplace smoothing
alpha = 1

## Calculating Parameters

Now that we have calculated the constants we need, we can proceed with computing the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each word in the vocabulary of our training set gets one conditional probability for each class.

The following formulas are used for calculating the parameters:

_$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$_

_$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$_

In [13]:
#Initializing parameters
parameters_ham = {unique_word : 0 for unique_word in vocabulary}
parameters_spam = {unique_word : 0 for unique_word in vocabulary}

#Calculate parameters
for word in vocabulary:
    n_words_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_words_given_ham + alpha) /(n_ham + alpha * n_vocabulary)
    parameters_ham[word] = p_word_given_ham
    
    n_words_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_words_given_spam + alpha) / (n_spam + alpha * n_vocabulary)
    parameters_spam[word] = p_word_given_spam 

## Classifying a New Message

Now that we have calculated our constants and parameters we can begin building the spam filter. Our spam filter is described as a function that performs the following:

- Takes a message $(w_1, w_2,...w_n)$ as input
- Calculates the probabilities $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$
- Compares $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ and:
    - If $P(Ham|w_1, w_2,...w_n) > P(Spam|w_1, w_2,...w_n)$, classifies the message as ham.
    -  If $P(Spam|w_1, w_2,...w_n) > P(Ham|w_1, w_2,...w_n)$, classifies the message as spam.
    - If $P(Spam|w_1, w_2,...w_n) = P(Ham|w_1, w_2,...w_n)$, asks for human help to classify the message.
    
Below is our function to classify messages as either ham or spam.  

In [14]:
#writing function to classify messages
import re

def classify(message):
    '''
    message: a string
    '''
    
    #replace non-word characters with spaces, lower-case, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    
    for word in message:
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
    
    
    print('P(spam|message):', p_spam_given_message)
    print('P(ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_spam_given_message > p_ham_given_message:
        print('Label: Spam')
    else:
        print('Probabilities are equal, ask for human help classifying this!')

Here we will test our classifying function on a couple of messages.

In [15]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(spam|message): 1.3481290211300841e-25
P(ham|message): 1.9368049028589875e-27
Label: Spam


In [16]:
classify('Sounds good, Tom, then see u there')

P(spam|message): 2.4372375665888117e-25
P(ham|message): 3.687530435009238e-21
Label: Ham


## Measuring the Spam Filter's Accuracy

Now that we have our function written and have seen it works, we need to determine how accurate it is. We will use our test set of 1,114 messages we established earlier.

We will write a function below that will return classification labels from the messages instead of printing them.

In [17]:
def classify_test_set(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    
    for word in message:
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Probabilities are equal, seek human help!'

Now that we have a function that returns classification labels, we can create a new column in our test set data to help facilitate checking for accuracy.

In [18]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


We can now check the accuracy of our filter by comparing the predicted labels with the actual labels in the test set.

In [19]:
correct = 0
total = len(test_set.index)

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
        
print('Correct:', correct)
print('Incorrect:', (total - correct))
print('Accuracy:', round(correct/total, 4))

Correct: 1100
Incorrect: 14
Accuracy: 0.9874


We see above that our accuracy was 98.74%. Our filter looked at 1,114 messages it had not seen before and correctly classified 1,100 of them. 
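
Overall accuracy hides which kind of mistake the filter makes. The sketch below, on a handful of made-up labels (not taken from the test set), computes accuracy and also counts every (actual, predicted) pair. The names `actual`, `predicted`, and `confusion` are assumptions for this example.

```python
from collections import Counter

# Hypothetical labels, made up for illustration only.
actual = ["ham", "ham", "spam", "ham", "spam", "ham"]
predicted = ["ham", "spam", "spam", "ham", "ham", "ham"]

# Accuracy is the share of positions where prediction and label agree.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Count each (actual, predicted) pair to see which kind of error occurs.
confusion = Counter(zip(actual, predicted))

print("Accuracy:", round(accuracy, 4))
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:<4} predicted={p:<4} count={count}")
```

On the notebook's dataframe, the same accuracy figure could be obtained without an explicit loop as `(test_set['Label'] == test_set['predicted']).mean()`.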

## Analyzing Incorrect Classifications

Although our accuracy was good, we will examine the 14 messages below that were incorrectly classified and look for reasons the algorithm may have classified incorrectly.

In [20]:
incorrect = test_set[test_set['Label'] != test_set['predicted']]
incorrect

Unnamed: 0,Label,SMS,predicted
114,spam,Not heard from U4 a while. Call me now am here...,ham
135,spam,More people are dogging in your area now. Call...,ham
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,"Probabilities are equal, seek human help!"
302,ham,No calls..messages..missed calls,spam
319,ham,We have sent JD for Customer Service cum Accou...,spam
504,spam,Oh my god! I've found your number again! I'm s...,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham


In [21]:
incorrect_ham_pct = (len(incorrect[incorrect['Label'] == 'ham'])/ len(test_set[test_set['Label'] == 'ham'])) * 100
print('Incorrect ham classification percentage is {}%.'.format(round(incorrect_ham_pct, 4)))

incorrect_spam_pct = (len(incorrect[incorrect['Label'] == 'spam'])/ len(test_set[test_set['Label'] == 'spam'])) * 100
print('Incorrect spam classification percentage is {}%.'.format(round(incorrect_spam_pct, 4)))

Incorrect ham classification percentage is 0.6205%.
Incorrect spam classification percentage is 5.4422%.


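We can make a few observations from the above information. The filter did worse, in terms of percentage incorrect, at recognizing spam messages (about 5.4% of spam misclassified versus about 0.6% of ham). A significant amount of capitalization appears in the incorrect messages. There was also one message with equal probabilities. Because the two priors differ and unseen words are simply skipped, a tie cannot come from unseen words alone; a more likely explanation is floating-point underflow, where multiplying many small probabilities for a long message drives both products to 0.0.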
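
One way a tie can arise in a product of many small probabilities is floating-point underflow. The sketch below uses made-up priors and parameters (all `toy_` names and the `log_score` helper are assumptions, not part of the notebook) to show both products collapsing to 0.0 for a long message, while sums of logarithms still separate the two classes. This is a common alternative formulation, not the method used in this notebook.

```python
import math

def log_score(words, prior, parameters):
    """log(prior) plus the sum of log P(word | class) over words with a known parameter."""
    total = math.log(prior)
    for word in words:
        if word in parameters:
            total += math.log(parameters[word])
    return total

# Hypothetical priors and parameters, chosen only to trigger underflow.
toy_prior_ham, toy_prior_spam = 0.87, 0.13
toy_params_ham = {"hello": 1e-3, "prize": 1e-5}
toy_params_spam = {"hello": 1e-4, "prize": 1e-3}

# An artificially long message: 200 words.
long_message = ["hello", "prize"] * 100

# Plain products, as in the classify function: both underflow to 0.0.
product_ham, product_spam = toy_prior_ham, toy_prior_spam
for word in long_message:
    product_ham *= toy_params_ham[word]
    product_spam *= toy_params_spam[word]

print("products:", product_ham, product_spam)
print("log scores:",
      log_score(long_message, toy_prior_ham, toy_params_ham),
      log_score(long_message, toy_prior_spam, toy_params_spam))
```

Since the logarithm is monotonic, comparing log scores picks the same class as comparing the products would if they could be represented exactly.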

## Filter with Case Sensitivity

Using our observations, we will now make our filter case sensitive and check to see if we improve our accuracy. We will start by getting our training and test sets once again. 

In [22]:
#getting training and test sets for case sensitive filter
training_set_two = sms_random[0:training_set_index].reset_index(drop=True)
print(training_set_two.shape)

test_set_two = sms_random[training_set_index:].reset_index(drop=True)
test_set_two.shape

(4458, 2)


(1114, 2)

### Cleaning Case Sensitive Data

Next we will clean the data without adjusting all the words to lower case.

In [23]:
#cleaning data, no case adjustment
training_set_two['SMS'] = training_set_two['SMS'].str.replace(r'\W', ' ', regex=True)

training_set_two.head()

Unnamed: 0,Label,SMS
0,ham,Yep by the pretty sculpture
1,ham,Yes princess Are you going to make me moan
2,ham,Welp apparently he retired
3,ham,Havent
4,ham,I forgot 2 ask ü all smth There s a card on ...


### Case Sensitive Vocabulary

Now that we have our data in a good clean format, we will establish our new version of the vocabulary with case sensitivity.

In [24]:
training_set_two['SMS'] = training_set_two['SMS'].str.split()

vocabulary_two = []

for row in training_set_two['SMS']:
    for word in row:
        vocabulary_two.append(word)
        
vocabulary_two = list(set(vocabulary_two)) 

In [25]:
len(vocabulary_two)

9656

Our new case sensitive vocabulary has 9,656 words. This is about a 24% increase in vocabulary words compared to our first iteration.
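
The growth figure follows directly from the two vocabulary sizes reported in this notebook. The variable names below are assumptions for this short check.

```python
# Vocabulary sizes reported earlier in the notebook.
n_vocabulary_lower = 7783   # case insensitive
n_vocabulary_cased = 9656   # case sensitive

# Relative increase of the case sensitive vocabulary, in percent.
growth_pct = (n_vocabulary_cased - n_vocabulary_lower) / n_vocabulary_lower * 100
print(round(growth_pct, 1))
```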

In [26]:
#initialize dictionary
word_counts_per_sms_two = {unique_word : [0] * len(training_set_two['SMS']) for unique_word in vocabulary_two}

#get word counts for words in each message
for index, sms in enumerate(training_set_two['SMS']):
    for word in sms:
        word_counts_per_sms_two[word][index] += 1

### Retest with Case Sensitive Vocabulary

Now that we have our new vocabulary that is case sensitive, we will use it with our training set and test set and review the accuracy of the test set classifications. Below we will get our training set into the format we would like.

In [27]:
#transform dictionary to dataframe
word_counts_two = pd.DataFrame(word_counts_per_sms_two)

#needed to rename column because SMS was in new vocabulary
word_counts_two.rename(columns = {'SMS':'SmS'}, inplace = True)

word_counts_two.head()

Unnamed: 0,Baaaaaaaabe,v,realise,tablet,actual,cann,SWAN,Unfortunately,landline,school,...,Cost,carryin,35p,wise,sad,withdraw,Gay,Algarve,salary,preferably
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
training_set_clean_two = pd.concat([training_set_two, word_counts_two], axis=1)
training_set_clean_two.head()

Unnamed: 0,Label,SMS,Baaaaaaaabe,v,realise,tablet,actual,cann,SWAN,Unfortunately,...,Cost,carryin,35p,wise,sad,withdraw,Gay,Algarve,salary,preferably
0,ham,"[Yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[Yes, princess, Are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[Welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[Havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[I, forgot, 2, ask, ü, all, smth, There, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Recalculate constants

Only the vocabulary size changes in the case sensitive version; the priors and the per-class word totals are the same as before. We recompute all of them anyway. These calculations run only once, so they cost little computational time, and repeating them keeps our training trials separate.

In [29]:
#isolate spam and ham messages from training set
spam_messages_two = training_set_clean_two[training_set_clean_two['Label'] == 'spam']
ham_messages_two = training_set_clean_two[training_set_clean_two['Label'] == 'ham']

#P(spam) and P(ham) training set
p_spam = len(spam_messages_two) / len(training_set_clean_two)
p_ham = len(ham_messages_two) / len(training_set_clean_two)

# N_spam
num_words_per_spam = spam_messages_two['SMS'].apply(len)
n_spam = num_words_per_spam.sum()

#N_ham
num_words_per_ham = ham_messages_two['SMS'].apply(len)
n_ham = num_words_per_ham.sum()

# N_vocabulary
n_vocabulary_two = len(vocabulary_two)

#Laplace smoothing
alpha = 1

### Recalculate Parameters

Now we will recalculate our parameters using the new vocabulary and word counts.

In [30]:
#Initializing parameters
parameters_spam_two = {unique_word : 0 for unique_word in vocabulary_two}
parameters_ham_two = {unique_word : 0 for unique_word in vocabulary_two}

#Calculate parameters for every word column (columns 0 and 1 are Label and SMS)
for word in training_set_clean_two.columns[2:]:
    n_words_given_ham_two = ham_messages_two[word].sum()
    p_word_given_ham_two = (n_words_given_ham_two + alpha) / (n_ham + alpha * n_vocabulary_two)
    parameters_ham_two[word] = p_word_given_ham_two
    
    n_words_given_spam_two = spam_messages_two[word].sum()
    p_word_given_spam_two = (n_words_given_spam_two + alpha) / (n_spam + alpha * n_vocabulary_two)
    parameters_spam_two[word] = p_word_given_spam_two

## Measure Case Sensitive Filter Accuracy

We will run our test set SMS messages through our case sensitive filter below and then check the results for accuracy. The function works the same way as the case insensitive one:

- Takes a message $(w_1, w_2,...w_n)$ as input
- Calculates the probabilities $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ using case sensitive vocabulary
- Compares $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ and:
    - If $P(Ham|w_1, w_2,...w_n) > P(Spam|w_1, w_2,...w_n)$, classifies the message as ham.
    -  If $P(Spam|w_1, w_2,...w_n) > P(Ham|w_1, w_2,...w_n)$, classifies the message as spam.
    - If $P(Spam|w_1, w_2,...w_n) = P(Ham|w_1, w_2,...w_n)$, asks for human help to classify the message.

In [31]:
def classify_test_set_two(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.split()
    
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    
    for word in message:
        if word in parameters_ham_two:
            p_ham_given_message *= parameters_ham_two[word]
            
        if word in parameters_spam_two:
            p_spam_given_message *= parameters_spam_two[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'Probabilities are equal, seek human help!'
    

In [32]:
test_set['predicted_two'] = test_set['SMS'].apply(classify_test_set_two)
test_set.head()

Unnamed: 0,Label,SMS,predicted,predicted_two
0,ham,Later i guess. I needa do mcat study too.,ham,ham
1,ham,But i haf enuff space got like 4 mb...,ham,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham,ham


In [33]:
correct = 0
total = len(test_set.index)

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted_two']:
        correct += 1
        
        
        
print('Correct:', correct)
print('Incorrect:', (total - correct))
print('Accuracy:', round(correct/total, 4))

Correct: 1092
Incorrect: 22
Accuracy: 0.9803


In [34]:
#messages misclassified by the case sensitive filter
incorrect_two = test_set[test_set['Label'] != test_set['predicted_two']]
incorrect_two

Unnamed: 0,Label,SMS,predicted,predicted_two
89,spam,goldviking (29/M) is inviting you to be his fr...,spam,"Probabilities are equal, seek human help!"
114,spam,Not heard from U4 a while. Call me now am here...,ham,ham
115,ham,1Apple/Day=No Doctor. 1Tulsi Leaf/Day=No Cance...,ham,spam
135,spam,More people are dogging in your area now. Call...,ham,ham
218,ham,Please protect yourself from e-threats. SIB ne...,ham,"Probabilities are equal, seek human help!"
284,ham,Nokia phone is lovly..,spam,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,"Probabilities are equal, seek human help!","Probabilities are equal, seek human help!"
319,ham,We have sent JD for Customer Service cum Accou...,spam,spam
323,ham,CHEERS U TEX MECAUSE U WEREBORED! YEAH OKDEN H...,ham,spam
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject:...,spam,ham


From the above we can see that our second filter isn't quite as accurate on our test set. It also returned more ties. The larger vocabulary may play a part, but one likely cause is visible in the code: the `SMS` word column was renamed to `SmS`, so the parameter stored under the key `SMS` keeps its initial value of 0, and any message containing the word `SMS` gets a product of zero for both classes. The second filter did do a better job of catching some spam messages: it correctly captured 3 of the 5 spam messages the case insensitive filter missed.
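
A parameter that is exactly zero forces a tie, because it zeroes the product for both classes. The sketch below reproduces that mechanism on a made-up three-word vocabulary in which the `SMS` column has been renamed to `SmS`, as in the case sensitive run. All names and numeric values here are hypothetical.

```python
# Hypothetical vocabulary; the word columns carry 'SmS' in place of 'SMS'.
toy_vocabulary = ["free", "SMS", "hello"]
toy_columns = ["free", "SmS", "hello"]

# Made-up smoothed estimates, keyed by column name.
toy_estimates_ham = {"free": 0.01, "SmS": 0.02, "hello": 0.05}
toy_estimates_spam = {"free": 0.06, "SmS": 0.03, "hello": 0.01}

def build_parameters(estimates):
    """Initialize every vocabulary word to 0, then fill in values by column name."""
    parameters = {word: 0 for word in toy_vocabulary}
    for column in toy_columns:
        parameters[column] = estimates[column]
    return parameters

def product_score(words, prior, parameters):
    """Prior times the product of the parameters of all known words."""
    score = prior
    for word in words:
        if word in parameters:
            score *= parameters[word]
    return score

toy_params_ham = build_parameters(toy_estimates_ham)
toy_params_spam = build_parameters(toy_estimates_spam)

# 'SMS' is still a key, but its value was never updated from 0.
message = ["free", "SMS"]
score_ham = product_score(message, 0.87, toy_params_ham)
score_spam = product_score(message, 0.13, toy_params_spam)
print(toy_params_ham["SMS"], score_ham, score_spam, score_ham == score_spam)
```

Filling in the `SMS` key from the renamed column would avoid the stale zero; the results reported in this notebook were produced without such a change.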

## Combined Filter

Based on what we have seen thus far, it appears that we should be able to get slightly better accuracy if we combine both filters. The case insensitive filter had better accuracy with the ham messages, so we will use it as the first stage. The case sensitive filter then takes a second look at any message the first stage flags as spam. We will use the good attributes from each to create a combined filter. Our combined filter:

- Takes a message $(w_1, w_2,...w_n)$ as input
- Calculates the probabilities $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ using all _lower case_ vocabulary
- Compares $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ and:
    - If $P(Ham|w_1, w_2,...w_n) > P(Spam|w_1, w_2,...w_n)$, classifies the message as ham.  
    - If $P(Spam|w_1, w_2,...w_n) = P(Ham|w_1, w_2,...w_n)$, asks for human help to classify the message.
    - If $P(Spam|w_1, w_2,...w_n) > P(Ham|w_1, w_2,...w_n)$, message is further analyzed:
        - Calculates the probabilities $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ using _case sensitive_ vocabulary
        - Compares $P(Spam|w_1, w_2,...w_n)$ and $P(Ham|w_1, w_2,...w_n)$ and:
            - If $P(Ham|w_1, w_2,...w_n) > P(Spam|w_1, w_2,...w_n)$, classifies the message as ham.
            - Otherwise, classifies the message as spam.

In [35]:
def classify_test_set_combined(message):
    
    message = re.sub(r'\W', ' ', message)
    message = message.split()
    
    p_ham_given_message = p_ham
    p_spam_given_message = p_spam
    p_ham_given_message_two = p_ham
    p_spam_given_message_two = p_spam
    
    for word in message:
        word_lower = word.lower()
        if word_lower in parameters_ham:
            p_ham_given_message *= parameters_ham[word_lower]
            
        if word_lower in parameters_spam:
            p_spam_given_message *= parameters_spam[word_lower]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message == p_spam_given_message:
        return 'Equal probability, seek human help!'
    else: 
        for word in message:
            if word in parameters_ham_two:
                p_ham_given_message_two *= parameters_ham_two[word]
            
            if word in parameters_spam_two:
                p_spam_given_message_two *= parameters_spam_two[word]
    
        if p_ham_given_message_two > p_spam_given_message_two:
            return 'ham'
        else:
            return 'spam'               

In [36]:
#combining results from all three trials
test_set['predicted_three'] = test_set['SMS'].apply(classify_test_set_combined)
test_set.head()

Unnamed: 0,Label,SMS,predicted,predicted_two,predicted_three
0,ham,Later i guess. I needa do mcat study too.,ham,ham,ham
1,ham,But i haf enuff space got like 4 mb...,ham,ham,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam,spam,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham,ham,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham,ham,ham


In [37]:
correct = 0
total = len(test_set.index)

for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted_three']:
        correct += 1
        
        
print('Correct:', correct)
print('Incorrect:', (total - correct))
print('Accuracy:', round(correct/total, 4))

Correct: 1102
Incorrect: 12
Accuracy: 0.9892


In [38]:
incorrect_three = test_set[test_set['Label'] != test_set['predicted_three']]
incorrect_three

Unnamed: 0,Label,SMS,predicted,predicted_two,predicted_three
114,spam,Not heard from U4 a while. Call me now am here...,ham,ham,ham
135,spam,More people are dogging in your area now. Call...,ham,ham,ham
284,ham,Nokia phone is lovly..,spam,spam,spam
293,ham,A Boy loved a gal. He propsd bt she didnt mind...,"Probabilities are equal, seek human help!","Probabilities are equal, seek human help!","Equal probability, seek human help!"
319,ham,We have sent JD for Customer Service cum Accou...,spam,spam,spam
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject:...,spam,ham,ham
504,spam,Oh my god! I've found your number again! I'm s...,ham,ham,ham
546,spam,"Hi babe its Chloe, how r u? I was smashed on s...",ham,ham,ham
741,spam,"0A$NETWORKS allow companies to bill for SMS, s...",ham,"Probabilities are equal, seek human help!",ham
876,spam,RCT' THNQ Adrian for U text. Rgds Vatian,ham,ham,ham


## Conclusion

We were able to create a spam filter using the multinomial Naive Bayes algorithm. We used two different vocabularies for our filter: one case sensitive, one not. Both had an accuracy of over 98% on our test set. We then combined both versions of the filter, capitalizing on their respective strengths, which raised accuracy to nearly 99% on our test set.

Our results here were quite good and we achieved our goal of over 80% accuracy. Next steps could be to expand the vocabulary and test the filter on another data set, since the dataset used was not particularly large. There are over 171,000 words in the __[Oxford English Dictionary](https://www.lexico.com/en/explore/how-many-words-are-there-in-the-english-language)__ and our _larger_ vocabulary had fewer than 10,000 words. Additionally, a new, independent set of data would help determine how practical our current filter is.