# Building a Spam Filter with Naive Bayes

In this project, we aim to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. We will implement the algorithm from scratch, train it on a labeled dataset so that it learns to classify messages as spam or ham (non-spam), and then test it on new messages to measure the filter's accuracy.

Our goal is a spam filter with an accuracy of at least 75-80%, meaning that roughly 80% of new messages are classified correctly as spam or ham (non-spam).

To train the algorithm, we will use a dataset put together by Tiago A. Almeida and José María Gómez Hidalgo, which can be downloaded from [The UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The data is a collection of more than 5,000 SMS messages, each labeled as spam or ham.



## Exploring the Dataset

In [1]:
# Quick exploration of the data

import pandas as pd
# header=None because the file has no header row
sms_data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

# Check the shape and the first rows of the data
print(sms_data.shape)
sms_data.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


From the output above, we see that the dataset has 5,572 messages (one row per message) and two columns: the label and the SMS text.

In [2]:
#find percentage of spam and ham

round(sms_data['Label'].value_counts(normalize=True) * 100,2) 

ham     86.59
spam    13.41
Name: Label, dtype: float64

We also see that about 13% of the messages are labeled as spam, while about 87% are labeled as ham (non-spam).

## Training and Test Set

Before we train the Naive Bayes algorithm, we should split the dataset into a training set and a test set:
1. the training set is used to train the algorithm
2. the test set is used to measure the accuracy of the algorithm after training

We will use 80% of the data for training and the remaining 20% for testing.

Our goal is an accuracy greater than 80%; that is, we expect more than 80% of the new messages to be classified correctly as spam or ham.

In [3]:
# Randomize the dataset (random_state makes the shuffle reproducible)
random_data = sms_data.sample(frac=1, random_state=1)

# Split the data: training set = 80%, test set = 20%
# Row index at which to split
index_dataset_for_train = round(len(random_data) * 0.8)

# Training set
train_dataset = random_data[:index_dataset_for_train].reset_index(drop=True)

# Test set
test_dataset = random_data[index_dataset_for_train:].reset_index(drop=True)

# Check the relative sizes of the training and test sets

print('train_dataset')
print('{} %'.format(round(train_dataset.shape[0]/random_data.shape[0],2)*100))
print('\n')
print('test_dataset')
print('{} %'.format(round(test_dataset.shape[0]/random_data.shape[0],2)*100))



train_dataset
80.0 %


test_dataset
20.0 %
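
The shuffle-then-slice idea used above can be sketched without pandas. The snippet below is only an illustration on hypothetical toy rows (`rows`, `train_rows`, and `test_rows` are made-up names, not part of the notebook). It shows why shuffling matters: the toy rows are sorted by label, so slicing without shuffling would put every spam row into the test set.

```python
import random

# Hypothetical toy rows standing in for (Label, SMS) pairs, sorted by label
rows = [("ham", f"message {i}") for i in range(8)] + \
       [("spam", f"offer {i}") for i in range(2)]

rng = random.Random(1)          # fixed seed, so the shuffle is reproducible
shuffled = rows[:]              # copy, so the original order is kept
rng.shuffle(shuffled)

# First 80% of the shuffled rows for training, the rest for testing
split_index = round(len(shuffled) * 0.8)
train_rows = shuffled[:split_index]
test_rows = shuffled[split_index:]

print(len(train_rows), len(test_rows))   # 8 2
```

In the notebook itself, a similar check would be to compare `train_dataset['Label'].value_counts(normalize=True)` with the same call on `test_dataset`, to see whether both sets keep roughly the spam/ham mix of the full dataset.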


## Cleaning the Dataset

## Letter Case and Punctuation

Before calculating probabilities, we first need to clean the data and bring it into a format from which we can easily extract the information we need.

In [4]:
#before cleaning
train_dataset.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


From the output above, we can see that the messages in the SMS column contain punctuation (such as , and !) and a mix of uppercase and lowercase letters. To clean this up, we will:
1. remove all punctuation
2. convert every uppercase letter to lowercase

In [5]:
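# Replace every non-word character (anything other than a letter, digit, or underscore) with a space

# Remove punctuation; regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
train_dataset['SMS'] = train_dataset['SMS'].str.replace(r'\W', ' ', regex=True)

# Convert every uppercase character to lowercase
train_dataset['SMS'] = train_dataset['SMS'].str.lower()

# Check the data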
train_dataset.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
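
To see what the two cleaning steps do to a single message, here is a minimal sketch using only the standard library. The helper name `clean_text` is hypothetical and not part of the notebook; the sample string is the first training message shown above.

```python
import re

def clean_text(text):
    """Replace non-word characters with spaces, then lowercase."""
    # \W matches anything that is not a letter, digit, or underscore
    return re.sub(r'\W', ' ', text).lower()

sample = "Yep, by the pretty sculpture"
tokens = clean_text(sample).split()
print(tokens)   # ['yep', 'by', 'the', 'pretty', 'sculpture']
```

Because each punctuation mark is replaced by a space rather than deleted, the cleaned string can contain runs of spaces; `split()` with no arguments collapses them.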


## Creating the Vocabulary

To make it easier to extract information from the dataset, we will transform it into the following format:

![img](https://dq-content.s3.amazonaws.com/433/cpgp_dataset_3.png)

In the transformed dataset, each column after the label column represents one unique word, and each row holds the number of times that word occurs in the corresponding message. To build these columns, we first create a vocabulary: a list containing every unique word in the dataset.

In [6]:
# Split each message into a list of words
train_dataset['SMS'] = train_dataset['SMS'].str.split()

# Create an empty list
vocabulary = []

# Iterate over the SMS column and collect every word
for row in train_dataset['SMS']:
    for word in row:
        vocabulary.append(word)

# Convert to a set and back to a list to remove duplicate words
vocabulary = list(set(vocabulary))

In [7]:
# Check the number of words in the vocabulary
len(vocabulary)

7783

The vocabulary contains 7,783 unique words, which means the new dataset will have 7,783 new columns.
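
The same vocabulary can also be built in one step with a set comprehension. The sketch below runs on three hypothetical toy messages (`toy_messages` and `toy_vocabulary` are made-up names) and is meant only to show the idea on data small enough to check by hand.

```python
# Hypothetical toy messages, already cleaned and lowercased
toy_messages = [
    "secret prize claim now",
    "coming to my secret party",
    "winner claim your prize",
]

# A set keeps each distinct word once; sorting gives a stable order
toy_vocabulary = sorted(
    {word for message in toy_messages for word in message.split()}
)

print(len(toy_vocabulary))   # 10
print(toy_vocabulary)
```

The three messages contain 13 words in total, but only 10 distinct ones, because "secret", "prize", and "claim" each occur twice.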

## The Final Training Set

Now that we have a list of every unique word in the dataset, we will create a dictionary that maps each word to the number of times it appears in each message, build a new DataFrame from that dictionary, and concatenate it with the training dataset.

In [8]:
# Count how often each word occurs in each message
word_counts_per_sms = {}

# Loop over every unique word in the vocabulary
for unique_word in vocabulary:
    # Use the word as the key; the value is a list of zeros
    # with one entry per message in the SMS column
    word_counts_per_sms[unique_word] = [0] * len(train_dataset['SMS'])

# Loop over the messages in train_dataset
for index, sms in enumerate(train_dataset['SMS']):
    for word in sms:
        # Increment the count of this word for this message
        word_counts_per_sms[word][index] += 1

In [9]:
#create new dataset
word_counts = pd.DataFrame(word_counts_per_sms)

In [10]:
#check the dataset
word_counts.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [11]:
#concat the data
clean_train_dataset = pd.concat([train_dataset, word_counts], axis=1)

clean_train_dataset.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
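
To make the layout of this table concrete, the sketch below builds the same kind of word-count rows for two hypothetical toy messages, using `collections.Counter` from the standard library instead of a DataFrame. All names here are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical toy messages, already cleaned and split into words
toy_messages = [["secret", "prize", "secret"], ["party", "prize"]]
toy_vocabulary = sorted({word for msg in toy_messages for word in msg})

# One row per message, one column per vocabulary word
rows = []
for msg in toy_messages:
    counts = Counter(msg)
    # A Counter returns 0 for words that do not occur in the message
    rows.append([counts[word] for word in toy_vocabulary])

print(toy_vocabulary)   # ['party', 'prize', 'secret']
print(rows)             # [[0, 1, 2], [1, 1, 0]]
```

As in the real table above, most entries are zero, because any single message uses only a small part of the vocabulary.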


## Train Dataset

## Calculating Constants First

Recall that, to classify a message, the multinomial Naive Bayes algorithm needs the probability that the message is spam and the probability that it is ham, given the words it contains:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}



Also, to calculate P(w<sub>i</sub>|Spam) and P(w<sub>i</sub>|Ham) inside the formulas above, we'll need to use the formula:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

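Some of the terms in these formulas have the same value every time we calculate a probability, namely:

- P(Spam) and P(Ham)
- N<sub>Spam</sub>, N<sub>Ham</sub>, N<sub>Vocabulary</sub> (the total number of words in all spam messages, the total number of words in all ham messages, and the number of unique words in the vocabulary)
- the smoothing parameter $\alpha = 1$

To start, we will calculate all of these constants.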

In [12]:
#split train between spam and ham
train_spam = clean_train_dataset[clean_train_dataset['Label'] == 'spam']
train_ham = clean_train_dataset[clean_train_dataset['Label'] == 'ham']

#calculate Pspam and Pham
Pspam = len(train_spam) / len(clean_train_dataset)
Pham = len(train_ham) / len(clean_train_dataset)

#calculate Nspam 
Nspam_per_messages = train_spam['SMS'].apply(len)
Nspam = Nspam_per_messages.sum()

#calculate Nham
Nham_per_messages = train_ham['SMS'].apply(len)
Nham = Nham_per_messages.sum()

#Nvocabulary
Nvocabulary = len(vocabulary)

#alpha smoothing
a = 1



## Calculating Parameters

With the constants calculated, we can now calculate the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$ using the formulas:

\begin{equation}
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\end{equation}

\begin{equation}
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{equation}

Recall that $w_i$ stands for each word in the vocabulary.

In [13]:
# Create two dictionaries from the vocabulary (one for spam, one for ham)
parameters_spam = {}
parameters_ham = {}
for word in vocabulary:
    parameters_spam[word] = 0
    parameters_ham[word] = 0


# Iterate over the vocabulary
for word in vocabulary:

    # Count how many times the word occurs in spam and in ham messages
    n_word_given_spam = train_spam[word].sum()
    n_word_given_ham = train_ham[word].sum()

    # Calculate the smoothed probability of the word given spam and given ham
    p_word_given_spam = (n_word_given_spam + a) / (Nspam + a * Nvocabulary)
    p_word_given_ham = (n_word_given_ham + a) / (Nham + a * Nvocabulary)

    # Store the probabilities in the dictionaries under the word
    parameters_spam[word] = p_word_given_spam
    parameters_ham[word] = p_word_given_ham

In [14]:
#check the data
parameters_spam

{'those': 4.3529360553693465e-05,
 'txtstar': 8.705872110738693e-05,
 'mind': 8.705872110738693e-05,
 '50pm': 8.705872110738693e-05,
 'underwear': 4.3529360553693465e-05,
 'slowly': 4.3529360553693465e-05,
 'because': 0.0001305880816610804,
 'sol': 0.0003482348844295477,
 'indians': 4.3529360553693465e-05,
 'recently': 0.0001305880816610804,
 'hasnt': 4.3529360553693465e-05,
 'their': 4.3529360553693465e-05,
 'hostile': 4.3529360553693465e-05,
 'bunkers': 4.3529360553693465e-05,
 'dnt': 4.3529360553693465e-05,
 'yarasu': 4.3529360553693465e-05,
 'psychic': 8.705872110738693e-05,
 'permission': 4.3529360553693465e-05,
 '9755': 8.705872110738693e-05,
 'lightly': 4.3529360553693465e-05,
 'seconds': 4.3529360553693465e-05,
 '7684': 8.705872110738693e-05,
 'data': 4.3529360553693465e-05,
 'maximum': 8.705872110738693e-05,
 'chat': 0.001871762503808819,
 'mumhas': 4.3529360553693465e-05,
 'amanda': 8.705872110738693e-05,
 'colour': 0.0006094110477517085,
 'nange': 4.3529360553693465e-05,
 ...
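
The value 4.3529360553693465e-05 appears many times in the output above. With $\alpha = 1$, it is what the smoothing formula gives for a word that never occurs in a spam message, since the numerator is then just $\alpha$. The sketch below replays the formula on hypothetical toy counts (all names and numbers are made up for illustration) to show two properties: an unseen word still gets a nonzero probability, and the smoothed probabilities over the whole vocabulary sum to 1.

```python
# Hypothetical toy counts: occurrences of each vocabulary word in spam messages
toy_counts_in_spam = {"prize": 3, "secret": 2, "party": 0}

alpha = 1
n_spam = sum(toy_counts_in_spam.values())    # total words in spam: 5
n_vocabulary = len(toy_counts_in_spam)       # unique words: 3

# Same formula as P(w_i|Spam) above
toy_parameters_spam = {
    word: (count + alpha) / (n_spam + alpha * n_vocabulary)
    for word, count in toy_counts_in_spam.items()
}

print(toy_parameters_spam)                  # {'prize': 0.5, 'secret': 0.375, 'party': 0.125}
print(sum(toy_parameters_spam.values()))    # 1.0
```

Without smoothing, "party" would get probability 0, and any message containing it would get a spam score of exactly 0 no matter what other words it contains.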

## Classifying a New Message

Now we can calculate the probabilities for a new message. We will write a function that takes a message as input and:
1. cleans the message (removes punctuation, converts it to lowercase, and splits it into words)
2. initializes the spam and ham probabilities of the message with P(Spam) and P(Ham)
3. loops over the words in the message: if a word appears in parameters_spam, multiplies the spam probability by the word's value in parameters_spam; if it appears in parameters_ham, multiplies the ham probability by its value in parameters_ham
4. labels the message as spam if the spam probability is higher than the ham probability, and as ham in the opposite case
5. asks for a human to classify the message if the two probabilities are equal

In [15]:
import re

def classify(message):
    # Clean the message: remove punctuation, lowercase, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the prior probabilities P(Spam) and P(Ham)
    p_spam_given_message = Pspam
    p_ham_given_message = Pham

    # Multiply by the parameter of each word; words not in the vocabulary are ignored
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    print('p_spam_given_message: {}'.format(p_spam_given_message))
    print('p_ham_given_message: {}'.format(p_ham_given_message))

    # Classify the message
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
    
    
    

In [16]:
#check
classify('WINNER!! This is the secret code to unlock the money: C3421.')

p_spam_given_message: 1.3481290211300841e-25
p_ham_given_message: 1.9368049028589875e-27
Label: Spam


In [17]:
#check the data
classify('"Sounds good, Tom, then see u there')

p_spam_given_message: 2.4372375665888117e-25
p_ham_given_message: 3.687530435009238e-21
Label: Ham
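
The scores printed above are already as small as 1e-27 for short messages. For very long messages, a product of many small factors can underflow to 0.0 in floating point, which would make both scores equal. A common variation, which this notebook does not use, is to add logarithms instead of multiplying probabilities. The sketch below shows the idea with hypothetical toy parameters (`toy_p_spam`, `toy_params_spam`, and the other names are made up; in the notebook they would correspond to Pspam, Pham, parameters_spam, and parameters_ham).

```python
import math

# Hypothetical toy priors and parameters, for illustration only
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {"secret": 0.02, "prize": 0.03, "party": 0.001}
toy_params_ham = {"secret": 0.002, "prize": 0.001, "party": 0.01}

def classify_log(words):
    """Compare sums of log-probabilities instead of raw products."""
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in classify()
        if word in toy_params_spam:
            log_spam += math.log(toy_params_spam[word])
        if word in toy_params_ham:
            log_ham += math.log(toy_params_ham[word])
    if log_spam > log_ham:
        return "spam"
    if log_ham > log_spam:
        return "ham"
    return "needs human classification"

print(classify_log(["secret", "prize"]))   # spam
print(classify_log(["party"] * 500))       # ham
```

In the second call, the raw products would be 0.001 and 0.01 raised to the 500th power, both of which underflow to 0.0 and would produce a tie. Because the logarithm is increasing, comparing log-scores gives the same label as comparing the products whenever the products can be represented.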


## Measuring Spam Filter's Accuracy

We reuse the function above with one change: it returns the label instead of printing it.

In [18]:
# Same function as before, but it returns the label instead of printing it
import re

def classify_test_set(message):
    # Clean the message: remove punctuation, lowercase, split into words
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # Start from the prior probabilities P(Spam) and P(Ham)
    p_spam_given_message = Pspam
    p_ham_given_message = Pham

    # Multiply by the parameter of each word; words not in the vocabulary are ignored
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    # Classify the message (return the label instead of printing it)
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'need human classification'
    

    
    
    

Next, we classify every message in the test dataset and store the predictions in a new column of test_dataset.

In [19]:
# Create a new column in the test set to hold the predictions
test_dataset['predicted'] = test_dataset['SMS'].apply(classify_test_set)

test_dataset.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


To measure the accuracy, we count the number of correctly classified messages and divide it by the total number of messages in test_dataset.

In [20]:
# Measure the accuracy
correct = 0
total = test_dataset.shape[0]

# Loop over the rows of the test set
for row in test_dataset.iterrows():
    row = row[1]

    # If the actual label matches the predicted label, count it as correct
    if row['Label'] == row['predicted']:
        correct += 1

# Calculate the accuracy
accuracy = correct / total
print('accuracy : {} or {} %'.format(accuracy, accuracy*100))

accuracy : 0.9874326750448833 or 98.74326750448833 %
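
Accuracy alone can hide how the errors are distributed. Since 86.59% of the messages in the full dataset are ham, a filter that always answers "ham" would already be right for about 87% of messages with that mix. The sketch below computes accuracy together with per-class counts on hypothetical toy labels (the lists `actual` and `predicted` are made up; in the notebook they would correspond to `test_dataset['Label']` and `test_dataset['predicted']`).

```python
from collections import Counter

# Hypothetical toy labels, for illustration only
actual    = ["ham", "ham", "spam", "ham", "spam", "ham"]
predicted = ["ham", "ham", "spam", "spam", "ham", "ham"]

# Fraction of positions where the prediction matches the actual label
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Count each (actual, predicted) pair to see which kinds of errors occur
confusion = Counter(zip(actual, predicted))

print(round(accuracy, 3))            # 0.667
print(confusion[("spam", "ham")])    # 1 spam message missed
print(confusion[("ham", "spam")])    # 1 ham message wrongly flagged
```

Looking at the two error counts separately shows whether a filter tends to miss spam or to flag legitimate messages, which a single accuracy figure cannot tell.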


## Conclusion

The spam filter reaches an accuracy of about 98.74% on the test set, well above our goal of 80%: of the 1,114 messages in the test dataset, 1,100 were classified correctly.