# Building a Spam Filter with Naive Bayes

This project uses a dataset of 5,572 SMS messages and the multinomial Naive Bayes algorithm to build a spam filter.

In [1]:
import pandas as pd

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
sms_spam.head()

(5572, 2)


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
sms_spam['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

About 86.6% of messages are "ham" (meaning non-spam), while 13.4% are spam.

## Testing/Training Sets for the Spam Filter

We'll create a training set with 4,458 messages (about 80% of the dataset) and a test set with 1,114 messages (about 20% of the dataset).

In [3]:
#randomize the dataset
sms_randomized = sms_spam.sample(frac=1, random_state=1)

#split the dataset 80/20
training_test_index = round(len(sms_randomized) * 0.8)
training_set = sms_randomized[:training_test_index].reset_index(drop=True)
test_set = sms_randomized[training_test_index:].reset_index(drop=True)

#check split
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


Next, we check that the proportions of spam and ham in each set are similar to those in the full dataset.

In [4]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [5]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

These are close to the original proportions, so we can proceed.

## Data Cleaning

Next, we clean the data so that we can extract the word counts that the Naive Bayes calculations need.

### Remove Punctuation & Make Lowercase

Start by removing all punctuation and making all letters lowercase.

In [6]:
#before
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [7]:
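#replace every non-word character (punctuation included) with a space
#regex=True is needed because pandas 2.0+ treats the pattern as a literal string by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)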
#make all lowercase
training_set['SMS'] = training_set['SMS'].str.lower()

#after
training_set.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
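
To make the effect of the `\W` pattern concrete, here is a minimal sketch on a single string. The helper name `clean_sms` and the sample sentence are assumptions made up for illustration; they are not part of the notebook.

```python
import re

def clean_sms(message):
    # \W matches any character that is not a letter, digit, or underscore,
    # so punctuation (and whitespace) is replaced by a space
    message = re.sub(r'\W', ' ', message)
    return message.lower()

cleaned = clean_sms("There's a card!")
print(repr(cleaned))    # 'there s a card '
print(cleaned.split())  # ['there', 's', 'a', 'card']
```

Note that the apostrophe splits "there's" into "there" and "s", which matches the last row of the table above. Non-ASCII letters such as "ü" count as word characters in Python 3 and are kept.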


### Creating the Vocabulary

We'll create the vocabulary: a list of all the unique words that occur in the messages of the training set.

In [8]:
#split each SMS into a list of words
training_set['SMS'] = training_set['SMS'].str.split()

#create vocab list
vocabulary = []
for sms in training_set['SMS']:
    for w in sms:
        vocabulary.append(w)

#remove duplicates
vocabulary = list(set(vocabulary))

#find number of unique words
len(vocabulary)

7783

There are 7,783 unique words in the training set.
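
As an aside, the same vocabulary can probably be built in one step with a set comprehension. The sketch below uses a tiny hypothetical list of tokenized messages (`toy_messages`), not the real training set, and sorts the result so that the word order is reproducible between runs, which the cell above does not do.

```python
# Hypothetical tokenized messages standing in for training_set['SMS']
toy_messages = [
    ['free', 'entry', 'to', 'win'],
    ['ok', 'see', 'you', 'there'],
    ['win', 'a', 'free', 'prize'],
]

# The set comprehension drops duplicates; sorted() fixes the order
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(len(toy_vocabulary))  # 10
print(toy_vocabulary)
```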

### Transform to a New Format

We'll create a new dataframe in which each row is a message and each vocabulary word gets its own column holding the number of times that word occurs in the message.

In [9]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [10]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [11]:
#convert to dataframe
word_counts = pd.DataFrame(word_counts_per_sms)

#concatenate dataframes
training_set_clean = pd.concat([training_set, word_counts], axis=1)

training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
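
The real table is too wide to read comfortably, so here is a minimal sketch of the same transformation on two made-up messages. The names `toy`, `toy_vocab`, and `toy_clean` are assumptions for illustration only.

```python
import pandas as pd

# Two hypothetical, already tokenized messages
toy = pd.DataFrame({
    'Label': ['spam', 'ham'],
    'SMS': [['win', 'win', 'prize'], ['see', 'you']],
})
toy_vocab = ['prize', 'see', 'win', 'you']

# One list of zeros per word, one slot per message
counts = {word: [0] * len(toy) for word in toy_vocab}
for index, sms in enumerate(toy['SMS']):
    for word in sms:
        counts[word][index] += 1

# Attach the count columns to the original columns
toy_clean = pd.concat([toy, pd.DataFrame(counts)], axis=1)
print(toy_clean)
```

In the result, the first row has `win` equal to 2 and `prize` equal to 1, and the second row has `see` and `you` equal to 1.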


## Calculating Constants

In [12]:
spam_msgs = training_set_clean[training_set_clean['Label'] == 'spam']
ham_msgs = training_set_clean[training_set_clean['Label'] == 'ham']

#P(Spam) and P(Ham)
p_spam = len(spam_msgs)/len(training_set_clean)
p_ham = len(ham_msgs)/len(training_set_clean)

#n_spam and n_ham
n_spam = sum([len(sms) for sms in spam_msgs['SMS']])
n_ham = sum([len(sms) for sms in ham_msgs['SMS']])

#n_vocabulary
n_vocabulary = len(vocabulary)

#Laplace smoothing
alpha = 1

## Calculate Parameters

\begin{equation}
\begin{aligned}
P(w_i|Spam) &= \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i|Ham) &= \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
\end{equation}
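
A small worked example may help here. The sketch below applies the formula to hypothetical counts: a spam class containing 3 words in total ("win" twice, "prize" once) and a 4-word vocabulary. The function name `smoothed_probability` and all numbers are assumptions for illustration.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    # (N_{w|class} + alpha) / (N_class + alpha * N_vocabulary)
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical word counts inside the spam class
toy_spam_counts = {'win': 2, 'prize': 1, 'see': 0, 'you': 0}
toy_n_spam = sum(toy_spam_counts.values())  # 3 words in total
toy_n_vocabulary = len(toy_spam_counts)     # 4 unique words

toy_parameter_spam = {
    word: smoothed_probability(count, toy_n_spam, toy_n_vocabulary)
    for word, count in toy_spam_counts.items()
}

print(toy_parameter_spam['win'])  # (2 + 1) / (3 + 4) = 3/7
print(toy_parameter_spam['see'])  # (0 + 1) / (3 + 4) = 1/7, not zero
```

With smoothing, a word that never appears in spam still gets a small non-zero probability, and the probabilities over the whole vocabulary still sum to 1.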

In [23]:
#initiate dictionaries for parameters
parameter_spam = {word:0 for word in vocabulary}
parameter_ham = parameter_spam.copy()

#calculate parameters
for word in vocabulary:
    n_w_given_spam = sum(spam_msgs[word])
    p_w_given_spam = (n_w_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    parameter_spam[word] = p_w_given_spam
    
    n_w_given_ham = sum(ham_msgs[word])
    p_w_given_ham = (n_w_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    parameter_ham[word] = p_w_given_ham
    

## Function to Classify a New Message

We'll need these equations:

\begin{equation}
P(Spam | w_1,w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n}P(w_i|Spam)
\end{equation}

\begin{equation}
P(Ham | w_1,w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n}P(w_i|Ham)
\end{equation}

In [24]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #start p_spam_given_message and p_ham_given_message from the class priors
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    #calculate probabilities
    for word in message:
        if word in parameter_spam:
            p_spam_given_message *= parameter_spam[word]
    
        if word in parameter_ham:
            p_ham_given_message *= parameter_ham[word]  

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [25]:
#test classify spam
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [26]:
#test classify ham
classify('Sounds good, Tom, then see u there')

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
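
The values printed above are already on the order of 1e-21 to 1e-27 for short messages. For much longer messages, a product of many small probabilities could underflow to 0.0. A common variant, sketched below, sums logarithms instead of multiplying; since the logarithm is monotonic, the comparison between classes should be unchanged. This is not what the notebook does, and the function `log_score` and all toy values are assumptions for illustration.

```python
import math

def log_score(words, prior, parameters):
    # log(prior) + sum of log P(word | class); unknown words are skipped,
    # mirroring the `if word in parameter_spam` check in classify()
    score = math.log(prior)
    for word in words:
        if word in parameters:
            score += math.log(parameters[word])
    return score

# Hypothetical priors and smoothed parameters, not taken from the dataset
toy_p_spam, toy_p_ham = 0.5, 0.5
toy_parameter_spam = {'win': 3/7, 'prize': 2/7, 'see': 1/7, 'you': 1/7}
toy_parameter_ham = {'win': 1/6, 'prize': 1/6, 'see': 2/6, 'you': 2/6}

words = ['win', 'a', 'prize']  # 'a' is not in the toy vocabulary
spam_score = log_score(words, toy_p_spam, toy_parameter_spam)
ham_score = log_score(words, toy_p_ham, toy_parameter_ham)

# Ties are sent to 'ham' here for brevity; classify() handles them separately
label = 'spam' if spam_score > ham_score else 'ham'
print(label)  # spam
```

Laplace smoothing guarantees that every parameter is greater than zero, so `math.log` is always defined for words in the vocabulary.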


## Measuring the Spam Filter's Accuracy

We'll need to test the accuracy of the function using the test set.

First, we modify the function to return labels instead of printing them.

In [28]:
def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameter_spam:
            p_spam_given_message *= parameter_spam[word]

        if word in parameter_ham:
            p_ham_given_message *= parameter_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

We use this function to create a new column called 'predicted' in the test set.

In [29]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


Now we'll measure accuracy using:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
\end{equation}

In [43]:
correct = 0
total = len(test_set)

for _, row in test_set.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


This means the spam filter classified messages with an accuracy of 98.7%. Out of 1,114 messages, it classified 1,100 correctly.
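
Because roughly 87% of the test messages are ham, a filter that always answered "ham" would already reach about 87% accuracy, so it can be useful to see which kinds of errors were made. The sketch below computes accuracy without an explicit loop and builds a confusion table with `pd.crosstab`. It runs on a small made-up dataframe (`toy_results`), so its numbers are illustrative and are not results from the test set.

```python
import pandas as pd

# Hypothetical labels and predictions, standing in for test_set
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'ham',  'ham'],
})

# The mean of a boolean Series is the fraction of True values
accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print('Accuracy:', accuracy)  # 4 correct out of 5

# Rows are the true labels, columns are the predicted labels
confusion = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(confusion)
```

In this toy table, the single error is a spam message predicted as ham, which appears in the row `spam` and column `ham`.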