# Building a Spam Filter with Naive Bayes

In this project, a multinomial Naive Bayes algorithm is developed to mark SMS messages as spam or non-spam with an accuracy above 80%, using messages that have already been classified.

The algorithm learns from 5,572 SMS messages that people have already classified and uses them to classify newly received messages.

The dataset and its details can be referenced [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). Details of the data collection process can be referenced from [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition). The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo. Before the algorithm is developed the dataset has to be explored.   

# Exploring the dataset

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 7783)

data_spam = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS'])
print(data_spam.shape)

data_spam.head()

(5572, 2)


| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
# proportion of messages labeled spam or ham (non-spam)

data_spam['Label'].value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Approximately 87% of the SMS messages are categorized as non-spam, while the remaining 13% are spam. The dataset is assumed to be representative of the messages people receive, since people typically receive more non-spam messages than spam messages. It is good practice to train the model on one part of the dataset and test it on another, so that its performance is measured on messages it has not seen. That is what will be done next.
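Because the classes are imbalanced, it helps to know what a trivial classifier would score before judging the 80% accuracy target. The following is a minimal sketch using the proportions printed above; the names `label_share` and `baseline_accuracy` are illustrative and not part of the project code.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
label_share = {"ham": 0.865937, "spam": 0.134063}

# A baseline that always predicts the most common label is correct
# exactly as often as that label occurs in the data.
majority_label = max(label_share, key=label_share.get)
baseline_accuracy = label_share[majority_label]

print(majority_label, round(baseline_accuracy * 100, 1))
```

On this dataset, always answering "ham" is already right about 86.6% of the time, so a useful filter should be compared against that figure as well as the 80% target.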

# Training and testing 

For training and testing the classification model the dataset is split with 80% of it for training and the rest 20% for testing. The dataset is first randomized before splitting.

In [3]:
# Randomizing the dataset
rand_data = data_spam.sample(frac=1, random_state=1)
rand_data

# Determining index for split
index_split = round(len(rand_data)*0.8)

# Splitting the dataset
train_data = rand_data[:index_split].reset_index(drop=True)
test_data = rand_data[index_split:].reset_index(drop=True)

print(train_data.shape)
print(test_data.shape)

(4458, 2)
(1114, 2)


The proportions of spam and non-spam messages in the training and test sets should be approximately the same as in the original dataset.

In [4]:
# percentage of ham and spam in the training set

ham_spam_per_train = train_data['Label'].value_counts(normalize=True)
print(ham_spam_per_train)



ham     0.86541
spam    0.13459
Name: Label, dtype: float64


In [5]:
# percentage of ham and spam in the test set

ham_spam_per_test = test_data['Label'].value_counts(normalize=True)
print(ham_spam_per_test)


The training set has nearly the same percentages of non-spam and spam messages as the original dataset, and the test set is expected to be similar because the split is random. The next step is to clean the dataset.

# Cleaning and formatting the data

Before the model is built from the training data, each message needs to be split into words so that the word counts per message can be stored in the format below.

![Image](https://camo.githubusercontent.com/27a4a0a699bd8f0713d73347abe2929c267a03d5/68747470733a2f2f64712d636f6e74656e742e73332e616d617a6f6e6177732e636f6d2f3433332f637067705f646174617365745f332e706e67)

First, punctuation is removed from the SMS column and all letters are changed to lowercase.

In [6]:
# regex=True makes pandas treat r'\W' as a pattern (any non-word character)
train_data['SMS'] = train_data['SMS'].str.replace(r'\W', ' ', regex=True)
train_data['SMS'] = train_data['SMS'].str.lower()
train_data.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
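To see what the cleaning does to a single message, the same steps can be applied to a plain string. This is a minimal sketch using Python's `re` module; the helper name `clean_message` is hypothetical, and it also splits the result into words, which the notebook does in the next section.

```python
import re

def clean_message(text):
    """Replace non-word characters with spaces, lowercase, and split into words."""
    text = re.sub(r"\W", " ", text)
    return text.lower().split()

print(clean_message("Ok lar... Joking wif u oni..."))
# ['ok', 'lar', 'joking', 'wif', 'u', 'oni']
```

Note that `\W` keeps letters, digits, and underscores, so a token such as `2` survives the cleaning, while an apostrophe splits a word like "don't" into `don` and `t`.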


# Creating a vocabulary

The vocabulary of the training dataset is the list of all unique words that appear in its messages.

In [7]:
# Splitting the message by the space character
train_data['SMS'] = train_data['SMS'].str.split()

# looping through each list of words and putting them in a list
vocabulary = []
for sms in train_data['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))
vocabulary[:10]

['thinkin',
 'pobox12n146tf150p',
 'ovulation',
 'twilight',
 'either',
 'wind',
 'stress',
 'suffering',
 'operate',
 'pink']

In [8]:
len(vocabulary)

7783

There are 7,783 unique words in the SMS messages of the training set. The next step is to create a dictionary that stores, for every word in the vocabulary, how often it occurs in each message.

# Final training dataset

In [9]:
word_counts_per_sms = {}

for unique_word in vocabulary:
    word_counts_per_sms[unique_word] = [0] * len(train_data['SMS'])
  
for index, sms in enumerate(train_data['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1       
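The shape of this dictionary is easier to see on a tiny input. The sketch below repeats the same two loops on two made-up, already tokenized messages; `toy_sms`, `toy_vocabulary`, and `toy_counts` are hypothetical names and the messages are not from the dataset.

```python
# Two already-tokenized toy messages (hypothetical, not from the dataset).
toy_sms = [["free", "prize", "free"], ["see", "you", "there"]]

# Unique words across all toy messages.
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# One list per word, with one slot per message, all starting at zero.
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}

# Increment the slot of the message in which each word occurs.
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts["free"])   # [2, 0]
print(toy_counts["there"])  # [0, 1]
```

Each key becomes one column and each list position one row once the dictionary is passed to `pd.DataFrame`.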

In [None]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head(10)

In [None]:
training_set = pd.concat([train_data, word_counts], axis = 1)
training_set.head()

# Developing the Naive Bayes algorithm

Now that the training data is in the right format, the next step in the modeling process is the classification algorithm, which is based on the two equations below.

![image](https://render.githubusercontent.com/render/math?math=P%28Spam%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Spam%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CSpam%29&mode=display)
![image](https://render.githubusercontent.com/render/math?math=P%28Ham%20%7C%20w_1%2Cw_2%2C%20...%2C%20w_n%29%20%5Cpropto%20P%28Ham%29%20%5Ccdot%20%5Cprod_%7Bi%3D1%7D%5E%7Bn%7DP%28w_i%7CHam%29&mode=display)

To calculate P(wi|Spam) and P(wi|Ham), the following equations will be used:

![image](https://render.githubusercontent.com/render/math?math=P%28w_i%7CSpam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CSpam%7D%20%2B%20%5Calpha%7D%7BN_%7BSpam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)

![image](https://render.githubusercontent.com/render/math?math=P%28w_i%7CHam%29%20%3D%20%5Cfrac%7BN_%7Bw_i%7CHam%7D%20%2B%20%5Calpha%7D%7BN_%7BHam%7D%20%2B%20%5Calpha%20%5Ccdot%20N_%7BVocabulary%7D%7D&mode=display)

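Some terms have the same value every time a new message is classified, so they only need to be calculated once. These terms are the following:

* P(Spam) and P(Ham)
* $N_{Spam}$ (the number of words in all spam messages), $N_{Ham}$ (the number of words in all ham messages), and $N_{Vocabulary}$ (the number of unique words in the vocabulary)

We'll also use Laplace smoothing and set $\alpha = 1$.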
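As a worked example of the smoothed estimate, the sketch below applies the formula to made-up counts. The function name `smoothed_probability` and all numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical class with 10 words in total and a vocabulary of 5 unique words.
toy_word_counts = [3, 2, 5, 0, 0]  # occurrences of each vocabulary word in the class
n_class = sum(toy_word_counts)     # 10
n_vocabulary = len(toy_word_counts)

# A word never seen in the class still gets a non-zero probability: 1/15.
print(smoothed_probability(0, n_class, n_vocabulary))

# The smoothed probabilities over the whole vocabulary still sum to 1.
total = sum(smoothed_probability(c, n_class, n_vocabulary) for c in toy_word_counts)
print(round(total, 10))
```

Without smoothing, a single unseen word would multiply the whole product by zero; with $\alpha = 1$ every word contributes a small positive factor instead.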

In [None]:
# getting the number of spam and ham messages (plain integers, not Series)
spam_messages = training_set[training_set['Label'] == 'spam']
no_spam_messages = len(spam_messages)

ham_messages = training_set[training_set['Label'] == 'ham']
no_ham_messages = len(ham_messages)

print(no_spam_messages)
print(no_ham_messages)

In [None]:
# probability of spam and ham messages
prop_spam = no_spam_messages/len(training_set)
prop_ham = no_ham_messages/len(training_set)

print(prop_spam)
print(prop_ham)

In [None]:
# finding the number of words in spam messages
number_of_words_spam_messages = spam_messages['SMS'].apply(len)
n_spam = number_of_words_spam_messages.sum()
n_spam

In [None]:
# finding the number of words in ham messages
number_of_words_ham_messages = ham_messages['SMS'].apply(len)
n_ham = number_of_words_ham_messages.sum()
n_ham

In [None]:
alpha = 1

Next, the parameters of the classifier are calculated: the probability of each word given that a message is spam, P(wi|Spam), and given that it is ham, P(wi|Ham).

# Calculating the parameters

In [None]:
# number of unique words in the training vocabulary
n_vocabulary = len(vocabulary)

spam_para = {unique_word: 0 for unique_word in vocabulary}
ham_para = {unique_word: 0 for unique_word in vocabulary}

for word in vocabulary:
    # Laplace-smoothed P(word|Spam): denominator is N_Spam + alpha * N_Vocabulary
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    spam_para[word] = p_word_given_spam

    # Laplace-smoothed P(word|Ham): denominator is N_Ham + alpha * N_Vocabulary
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    ham_para[word] = p_word_given_ham

# Classifying messages as spam or ham

With all the parameters and constant terms calculated, the next step is the function that classifies a new message as spam or ham. The function does the following:

* Takes in as input a new message (w1, w2, ..., wn).
* Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
* Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
   * If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
   * If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
   * If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.

In [None]:
# classification function

import re

def classify(message):

    # same cleaning as for the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = prop_spam
    p_ham_given_message = prop_ham

    # each known word updates both probabilities; unknown words are ignored
    for word in message:
        if word in spam_para:
            p_spam_given_message *= spam_para[word]
        if word in ham_para:
            p_ham_given_message *= ham_para[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
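Multiplying many small probabilities can underflow to zero for long messages, which would make both results equal. One common workaround, sketched here as an assumption rather than as part of the original project, is to compare sums of logarithms instead. The function `classify_log` and the toy parameter dictionaries are hypothetical.

```python
import math
import re

def classify_log(message, p_spam, p_ham, spam_params, ham_params):
    """Return 'spam', 'ham', or 'tie' by comparing log-probabilities."""
    words = re.sub(r"\W", " ", message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Smoothed parameters are always positive, so math.log is safe here.
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])
    if log_ham > log_spam:
        return "ham"
    if log_ham < log_spam:
        return "spam"
    return "tie"

# Toy parameters (hypothetical values, not computed from the dataset).
toy_spam = {"free": 0.4, "prize": 0.4, "see": 0.1, "you": 0.1}
toy_ham = {"free": 0.1, "prize": 0.1, "see": 0.4, "you": 0.4}

long_message = "free prize " * 500   # 1000 words
print(0.4 ** 1000)                   # 0.0, the plain product underflows
print(classify_log(long_message, 0.5, 0.5, toy_spam, toy_ham))  # spam
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products do not underflow.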

In [19]:
classify("Sounds good, Tom, then see u there")
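To check the 80% accuracy target stated at the start, the predicted labels can be compared with the known labels of held-out messages. The sketch below shows only the bookkeeping; `accuracy`, `toy_classifier`, and `toy_test_set` are hypothetical, and in the notebook the classifier would be a variant of `classify` that returns the label instead of printing it, applied to `test_data`.

```python
def accuracy(labeled_messages, classify_fn):
    """Fraction of (label, text) pairs for which classify_fn(text) equals label."""
    correct = 0
    for label, sms in labeled_messages:
        if classify_fn(sms) == label:
            correct += 1
    return correct / len(labeled_messages)

# Hypothetical stand-in classifier, used only to make the sketch runnable.
def toy_classifier(sms):
    return "spam" if "free" in sms.lower() else "ham"

# Hypothetical labeled messages, not taken from the dataset.
toy_test_set = [
    ("ham", "See you there"),
    ("spam", "FREE entry to win"),
    ("ham", "Feel free to call me"),
    ("spam", "Claim your prize now"),
]

print(accuracy(toy_test_set, toy_classifier))  # 0.5
```

The keyword rule gets two of the four toy messages right, which illustrates why word probabilities learned from data are preferable to a single hand-picked keyword.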