## Naive Bayes Spam Filter Project

The goal of this project is to build a spam filter based on the naive Bayes algorithm. To do this, we will use a dataset of over 5000 SMS messages that have already been labeled as either spam or non-spam to train the algorithm. We will then apply the resulting classifier to messages it has not seen before. Our spam filter should have an accuracy of at least 80%.

Imports:

In [1]:
import numpy as np
import pandas as pd

Reading the tab-separated file into a pandas dataframe:

In [2]:
spam_col = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label','SMS'])

Check if it worked:

In [3]:
spam_col.head()

  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [4]:
print(spam_col.shape)

(5572, 2)


We can see that the dataset has two columns, corresponding to the assigned label and the text of the SMS message.
We can also see that each message is labeled either 'spam' or 'ham', where 'ham' stands for authentic messages. This makes sense, as ham is known to be way more authentic than spam.

Finding the percentage of authentic vs spam messages:

In [5]:
print(spam_col.Label.value_counts(normalize = True, dropna = False)*100)

ham     86.593683
spam    13.406317
Name: Label, dtype: float64


We can see that we have approximately 6.5 times as many authentic messages as we do spam.

### Testing the Spam Filter:

To avoid bias, we design the test for the spam filter now, before building the filter itself. To measure how good our spam filter is, we will divide the dataset into two parts:
* a training set containing about 80% of all messages
* a test set with the remaining 20%.

The training set will be used to train our algorithm, and the test set will be used to compare the predictions of the algorithm to the manual labeling done by users. We are aiming for an accuracy of more than 80%.

Creating the two datasets by random sampling:

In [40]:
random_spam = spam_col.sample(frac = 1, random_state = 1).copy()

In [41]:
training_set = random_spam[:4458].copy()
test_set = random_spam[4458:].copy()
print(training_set.shape, test_set.shape)
print(4458/5572)

(4458, 2) (1114, 2)
0.8000717875089735
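
The split position 4458 above is hard-coded. As a minimal sketch, it could instead be derived from the desired fraction. The toy dataframe and the names `toy`, `train_frac` and `split_at` are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for spam_col; the real dataset has 5572 rows.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': [f'message {i}' for i in range(10)],
})

train_frac = 0.8  # assumed fraction, matching the 80/20 split described above

# Shuffle all rows reproducibly, then cut at the rounded split position.
shuffled = toy.sample(frac=1, random_state=1)
split_at = round(len(shuffled) * train_frac)

toy_train = shuffled[:split_at]
toy_test = shuffled[split_at:]
print(toy_train.shape, toy_test.shape)
```

For the real dataset, `round(5572 * 0.8)` evaluates to 4458, the same position used in the cell above.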


Checking that both sets contain a similar share of spam and ham as the full dataset:

In [42]:
print(training_set.Label.value_counts(normalize = True)*100)
print(test_set.Label.value_counts(normalize = True)*100)

ham     86.54105
spam    13.45895
Name: Label, dtype: float64
ham     86.804309
spam    13.195691
Name: Label, dtype: float64


### Data Cleaning:

Here we want to accomplish several tasks. We want to:
* make all words lower case, so capitalization does not impact the count
* ignore all punctuation.

We do this because in the following step we want to transform our dataset for easier analysis.

Initial status of the datasets' heads:

In [43]:
print(training_set.head())
print(test_set.head())

     Label                                                SMS
1078   ham                       Yep, by the pretty sculpture
4028   ham      Yes, princess. Are you going to make me moan?
958    ham                         Welp apparently he retired
4642   ham                                            Havent.
4674   ham  I forgot 2 ask ü all smth.. There's a card on ...
     Label                                                SMS
2131   ham          Later i guess. I needa do mcat study too.
3418   ham             But i haf enuff space got like 4 mb...
3424  spam  Had your mobile 10 mths? Update to latest Oran...
1538   ham  All sounds good. Fingers . Makes it difficult ...
5393   ham  All done, all handed in. Don't know if mega sh...


We apply the same cleaning to both the training and the test set now, because it is the same step and this way we won't forget it later.

In [44]:
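# regex=True is required in recent pandas versions, where the pattern is otherwise treated literally
training_set.SMS = training_set.SMS.str.replace(r'\W', ' ', regex=True).str.lower()
test_set.SMS = test_set.SMS.str.replace(r'\W', ' ', regex=True).str.lower()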

In [45]:
training_set.SMS = training_set.SMS.str.split()
test_set.SMS = test_set.SMS.str.split()

In [12]:
print(training_set.head())
print(test_set.head())

     Label                                                SMS
1078   ham                  [yep, by, the, pretty, sculpture]
4028   ham  [yes, princess, are, you, going, to, make, me,...
958    ham                    [welp, apparently, he, retired]
4642   ham                                           [havent]
4674   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...
     Label                                                SMS
2131   ham          later i guess  i needa do mcat study too 
3418   ham             but i haf enuff space got like 4 mb   
3424  spam  had your mobile 10 mths  update to latest oran...
1538   ham  all sounds good  fingers   makes it difficult ...
5393   ham  all done  all handed in  don t know if mega sh...


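As we can see in the training set, where 'Yes, princess. Are you going to make me moan?' became [yes, princess, are, you, going, to, make, me, ...], it worked! The test set output above is lowercased and free of punctuation but not yet split into words; judging by the cell numbers it was printed in an earlier run, and the test set does appear as word lists further down.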
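
To make the effect of the two cleaning steps concrete, here is a small sketch on two sample strings. The series `sample` is made up for illustration, and `regex=True` is passed explicitly so the pattern is treated as a regular expression:

```python
import pandas as pd

# Two made-up messages, only for illustration.
sample = pd.Series(["Yep, by the pretty sculpture", "Don't know!!"])

# Replace every non-word character with a space, lowercase, then split on whitespace.
cleaned = sample.str.replace(r'\W', ' ', regex=True).str.lower().str.split()

print(cleaned.tolist())
# [['yep', 'by', 'the', 'pretty', 'sculpture'], ['don', 't', 'know']]
```

Note that the apostrophe is also replaced, so a contraction like "don't" ends up as the two tokens 'don' and 't', which matches what the cleaned datasets show.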

### Re-formatting the dataset


We will now change the format of the dataset, replacing the SMS column with one column per individual word. We will begin by making a list of all the unique words in our training set.

In [13]:
vocabulary_lst = []
for sms in training_set.SMS:
    for word in sms:
        if word not in vocabulary_lst:
            vocabulary_lst.append(word)
    

Checking that every word appears only once in the new list:

In [14]:
print(pd.Series(vocabulary_lst).value_counts().value_counts())

1    7783
dtype: int64


In [48]:
training_set.reset_index(inplace = True)
test_set.reset_index(inplace = True)

In [16]:
print(training_set.index)

RangeIndex(start=0, stop=4458, step=1)


The instructions suggest converting the list into a set and then back into a list to remove any duplicates. Since the loop above already excludes duplicates, and the length of 'vocabulary_lst' matches the list in the solution to this project, we can skip this step.
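
For comparison, here is a minimal sketch of the set-based approach on toy data. The name `toy_messages` is made up for illustration. A plain `set` does not keep the order of first appearance, so the sketch also shows `dict.fromkeys`, which removes duplicates while preserving order:

```python
# Toy tokenized messages, only for illustration.
toy_messages = [['yep', 'by', 'the'], ['the', 'pretty', 'the']]

# Set-based deduplication: fast, but the resulting order is arbitrary.
vocab_unordered = list(set(word for sms in toy_messages for word in sms))

# dict.fromkeys keeps the first occurrence of each word in order.
vocab_ordered = list(dict.fromkeys(word for sms in toy_messages for word in sms))

print(sorted(vocab_unordered))  # ['by', 'pretty', 'the', 'yep']
print(vocab_ordered)            # ['yep', 'by', 'the', 'pretty']
```

Both variants avoid scanning a growing list for every word, which is what `word not in vocabulary_lst` does in the loop above.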

Now we will create a dictionary 'word_count_per_sms' with every word in our vocabulary as a key. Each value is a list with one entry per message in the training set: the entry at index i is the number of times the key word appears in message i.

In [17]:
word_count_per_sms = {}
for word in vocabulary_lst:
    word_count_per_sms[word] = [0]*training_set.shape[0]
    
for index, sms in enumerate(training_set.SMS):
    for word in sms:
        word_count_per_sms[word][index] += 1

In [18]:
word_count = pd.DataFrame(word_count_per_sms)
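
As a small illustration of this structure, the same steps on two toy messages look as follows. The names `toy_sms`, `toy_vocab` and `counts` are made up for this sketch:

```python
import pandas as pd

# Two toy tokenized messages and their vocabulary, only for illustration.
toy_sms = [['free', 'entry', 'free'], ['ok', 'free']]
toy_vocab = ['free', 'entry', 'ok']

# One list per word, with one counter per message.
counts = {word: [0] * len(toy_sms) for word in toy_vocab}
for index, sms in enumerate(toy_sms):
    for word in sms:
        counts[word][index] += 1

print(counts)
# {'free': [2, 1], 'entry': [1, 0], 'ok': [0, 1]}

# Each key becomes a column, each message a row.
toy_word_count = pd.DataFrame(counts)
print(toy_word_count.shape)  # (2, 3)
```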

Concatenating the labels with our new dataframe:

In [19]:
# training_set.reset_index(inplace = True)
tsdf = pd.concat([training_set, word_count], axis = 1)
tsdf = tsdf.drop('index', axis = 1)

print(training_set.shape, word_count.shape,tsdf.shape)
print(training_set.Label.value_counts(dropna = False))
print(training_set.index)
print(tsdf.head())


(4458, 3) (4458, 7783) (4458, 7784)
ham     3858
spam     600
Name: Label, dtype: int64
RangeIndex(start=0, stop=4458, step=1)
  Label                                                SMS  yep  by  the  \
0   ham                  [yep, by, the, pretty, sculpture]    1   1    1   
1   ham  [yes, princess, are, you, going, to, make, me,...    0   0    0   
2   ham                    [welp, apparently, he, retired]    0   0    0   
3   ham                                           [havent]    0   0    0   
4   ham  [i, forgot, 2, ask, ü, all, smth, there, s, a,...    0   0    0   

   pretty  sculpture  yes  princess  are  ...  beauty  hides  secrets  n8  \
0       1          1    0         0    0  ...       0      0        0   0   
1       0          0    1         1    1  ...       0      0        0   0   
2       0          0    0         0    0  ...       0      0        0   0   
3       0          0    0         0    0  ...       0      0        0   0   
4       0          0    0      

In [20]:
print(tsdf.Label.value_counts(dropna = False))

ham     3858
spam     600
Name: Label, dtype: int64
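
One detail in the shapes printed above deserves a second look. Concatenating 3 and 7783 columns gives 7786, and dropping a single 'index' column would leave 7785, yet tsdf has 7784 columns. A possible explanation, which the notebook does not confirm, is that the word 'index' occurs in the vocabulary: `drop('index', axis=1)` removes every column with that label. The sketch below reproduces this behaviour on toy data; all names in it are illustrative assumptions:

```python
import pandas as pd

raw = pd.DataFrame({'Label': ['ham', 'spam']}, index=[10, 11])
meta = raw.reset_index()  # adds an 'index' column, as in the notebook

# Hypothetical word counts in which 'index' happens to be a word.
words = pd.DataFrame({'free': [0, 1], 'index': [1, 0]})

combined = pd.concat([meta, words], axis=1)
print(combined.shape)                        # (2, 4)
print(combined.drop('index', axis=1).shape)  # (2, 2): both 'index' columns are dropped

# Discarding the old index at reset time avoids the name clash.
safe = pd.concat([raw.reset_index(drop=True), words], axis=1)
print(safe.shape)                            # (2, 3)
```

If this is what happened here, the word 'index' would be missing from the word columns, which would also explain a vocabulary size of 7782 further down instead of 7783.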


### Spam filter construction

We are now done with the process of cleaning the data and will move on to constructing the actual spam filter.
The formula to gauge the probability that a message is spam or ham is:

$$P(Spam|w_1,w_2,...,w_n) \propto P(Spam) \cdot \prod_{i = 1}^{n} P(w_{i}|Spam)$$

$$P(Ham|w_1,w_2,...,w_n) \propto P(Ham) \cdot \prod_{i = 1}^{n} P(w_{i}|Ham)$$

where the probability of each word given that the message is spam or ham is:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} $$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}} $$

Here:

* $\alpha$ is the smoothing parameter. We will be using Laplace smoothing, where $\alpha = 1$
* $N_{Vocabulary}$ is the total number of unique words in our training set
* $N_{Spam}$ (or $N_{Ham}$) is the total number of words (not unique) in messages categorized as spam (ham)
* $N_{w_i|Spam}$ (or $N_{w_i|Ham}$) is the number of times the word $w_i$ occurs in spam (ham) messages
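
As a worked example of the smoothed estimate, the sketch below plugs in numbers that appear in cells further down in this notebook: the word 'the' occurs 157 times in spam, $N_{Spam} = 15188$ and $N_{Vocabulary} = 7782$. The helper name `smoothed_prob` is an assumption made for this illustration:

```python
def smoothed_prob(n_word_in_class, n_class, n_vocabulary, alpha=1):
    """Additive-smoothing estimate of P(word | class); alpha=1 is Laplace smoothing."""
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# 'the' in spam: (157 + 1) / (15188 + 1 * 7782)
p_the_spam = smoothed_prob(157, 15188, 7782)
print(f"{p_the_spam:.6f}")  # 0.006879

# A word that never occurs in spam still gets a small non-zero probability.
p_unseen_spam = smoothed_prob(0, 15188, 7782)
print(f"{p_unseen_spam:.6f}")  # 0.000044
```

Without smoothing, a single word with a zero count would force the whole product for that class to zero.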

We will start by calculating the constants we will use later, beginning with $N_{Spam}$ and $N_{Ham}$. The vectorized sums below give the same results as a slower alternative, but much faster:

In [21]:
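# numeric_only=True skips the Label and SMS columns, so only the word counts are summed
n_spam = tsdf.loc[tsdf.Label == 'spam'].sum(axis = 1, numeric_only = True).sum()
n_ham = tsdf.loc[tsdf.Label == 'ham'].sum(axis = 1, numeric_only = True).sum()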

In [22]:
print(n_spam)
print(n_ham)

15188
57237


Now we will calculate $P(Spam)$ and $P(Ham)$:

In [23]:
spham_counts = tsdf.Label.value_counts(normalize = True)
p_spam = spham_counts['spam']
p_ham = spham_counts['ham']
print(p_spam)
print(p_ham)

0.13458950201884254
0.8654104979811574


In [24]:
n_vocabulary = tsdf.shape[1] - 2
print(n_vocabulary)

7782


We can also build one more dataframe that has the words as its index and, in two columns, the number of times each word appears in spam and in ham messages. This will cut down on computational expense.

In [25]:
words_occur = pd.DataFrame()
words_occur['spam'] = tsdf.loc[tsdf.Label == 'spam'].iloc[:,2:].sum(axis = 0)
words_occur['ham'] = tsdf.loc[tsdf.Label == 'ham'].iloc[:,2:].sum(axis = 0)

In [26]:
print(words_occur)

           spam  ham
yep           0    9
by           34  110
the         157  920
pretty        0   12
sculpture     0    1
...         ...  ...
related       0    1
trade         0    1
arul          0    1
bx526         1    0
wherre        0    1

[7782 rows x 2 columns]


Let's turn this dataframe from occurrences into the probabilities $P(w_i|Spam)$ and $P(w_i|Ham)$:

In [28]:
alpha = 1  # Laplace smoothing; must be defined before the probabilities are computed

In [36]:
words_probs = words_occur.copy()
words_probs.spam = (words_probs.spam + alpha) / (n_spam + (alpha * n_vocabulary))
words_probs.ham = (words_probs.ham + alpha) / (n_ham + (alpha * n_vocabulary))

In [37]:
print(words_probs)

               spam       ham
yep        0.000044  0.000154
by         0.001524  0.001707
the        0.006879  0.014165
pretty     0.000044  0.000200
sculpture  0.000044  0.000031
...             ...       ...
related    0.000044  0.000031
trade      0.000044  0.000031
arul       0.000044  0.000031
bx526      0.000087  0.000015
wherre     0.000044  0.000031

[7782 rows x 2 columns]


Now we can classify the messages in the test set. First, a look at its current state:

In [60]:
print(test_set.head())

   index Label                                                SMS
0   2131   ham  [later, i, guess, i, needa, do, mcat, study, too]
1   3418   ham      [but, i, haf, enuff, space, got, like, 4, mb]
2   3424  spam  [had, your, mobile, 10, mths, update, to, late...
3   1538   ham  [all, sounds, good, fingers, makes, it, diffic...
4   5393   ham  [all, done, all, handed, in, don, t, know, if,...


In [73]:
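def throbulator(row):
    # Start from the class priors P(Spam) and P(Ham).
    spam_prob = p_spam
    ham_prob = p_ham
    for item in row:
        if item in words_probs.index:
            # Known word: multiply by the smoothed P(w_i|Spam) and P(w_i|Ham).
            spam_prob *= words_probs.loc[item, 'spam']
            ham_prob *= words_probs.loc[item, 'ham']
        else:
            # Unseen word: both classes get the same factor,
            # so it does not change which probability is larger.
            spam_prob *= (alpha / (alpha * n_vocabulary))
            ham_prob *= (alpha / (alpha * n_vocabulary))
    if spam_prob > ham_prob:
        return 'spam'
    elif ham_prob > spam_prob:
        return 'ham'
    else:
        return 'equal'

# Classify every message in the test set.
test_set['predict'] = test_set.SMS.apply(throbulator)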
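
Multiplying many probabilities below 1 can underflow to 0.0 for long messages, in which case both products become equal. A common remedy is to add logarithms instead. The following is only a sketch under stated assumptions: `classify_log` and `toy_probs` are hypothetical names with made-up probabilities, and unseen words are skipped rather than multiplied by a shared factor, which leaves the comparison between the two classes unchanged.

```python
import math

def classify_log(words, word_probs, p_spam, p_ham):
    """Compare log-scores of both classes; words missing from word_probs are skipped."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        if word in word_probs:
            spam_p, ham_p = word_probs[word]
            log_spam += math.log(spam_p)
            log_ham += math.log(ham_p)
    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'equal'

# Made-up (P(w|Spam), P(w|Ham)) pairs, only for illustration.
toy_probs = {'free': (0.02, 0.001), 'later': (0.0005, 0.01)}

print(classify_log(['free', 'free', 'prize'], toy_probs, 0.13, 0.87))  # spam

# A plain product over 500 factors underflows, the log-sum does not.
print(0.01 ** 500)                                                     # 0.0
print(classify_log(['later'] * 500, toy_probs, 0.13, 0.87))            # ham
```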


In [82]:
print((test_set.Label == test_set.predict).value_counts(normalize = True)* 100)

True     98.743268
False     1.256732
dtype: float64


This means our spam filter was 98.7% accurate on the test set, well above our goal of 80%. This worked amazingly!
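
Because about 86.8% of the test set is ham, a filter that labels every message 'ham' would already pass the 80% mark, so it helps to compare accuracy with that baseline and to look at spam separately. The sketch below does this on made-up labels and predictions; the dataframe `results` and its values are hypothetical and not taken from the test set:

```python
import pandas as pd

# Made-up labels and predictions, only for illustration.
results = pd.DataFrame({
    'Label':   ['ham', 'ham', 'ham', 'ham', 'spam', 'spam'],
    'predict': ['ham', 'ham', 'ham', 'spam', 'spam', 'ham'],
})

# Share of correct predictions.
accuracy = (results.Label == results.predict).mean()

# Accuracy of always predicting the majority class.
baseline = results.Label.value_counts(normalize=True).max()

# Rows are true labels, columns are predictions.
confusion = pd.crosstab(results.Label, results.predict)

# Share of actual spam messages that were caught.
spam_recall = confusion.loc['spam', 'spam'] / confusion.loc['spam'].sum()

print(accuracy, baseline, spam_recall)
```

Applied to the real `test_set`, the same `pd.crosstab(test_set.Label, test_set.predict)` call would show which kind of mistake accounts for the remaining 1.3%.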