# Building a Spam Filter with Naive Bayes

In this project we will build an algorithm to filter out spam messages. We will use the dataset containing 5,572 SMS messages already classified by humans. The data set can be downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

# Data Overview

In [1]:
import pandas as pd
sms_spam = pd.read_csv("SMSSpamCollection", sep = '\t', header = None, names = ['Label','SMS']) 
sms_spam.info()
                       

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [2]:
sms_spam.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
print(sms_spam['Label'].value_counts(normalize= True))

ham     0.865937
spam    0.134063
Name: Label, dtype: float64


In [4]:
percentage_spam = 747/5572

print(round(percentage_spam*100), "%")

13 %


## Training and Test Dataset

In [5]:
sms_spam_random = sms_spam.sample(frac = 1, random_state = 1)

training_test_index = round(len(sms_spam_random) * 0.8)

training_set = sms_spam_random[:training_test_index].reset_index(drop = True) #80% of dataset index 

test_set = sms_spam_random[training_test_index:].reset_index(drop=True) #20% of the dataset index 

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [6]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [7]:
test_set.head()

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...
3,ham,All sounds good. Fingers . Makes it difficult ...
4,ham,"All done, all handed in. Don't know if mega sh..."


In [8]:
print(training_set["Label"].value_counts(normalize = True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64


In the training dataset 87% of SMS messages are ham(non-spam) and 13% are spam

In [9]:
print(test_set["Label"].value_counts(normalize = True))

ham     0.868043
spam    0.131957
Name: Label, dtype: float64


In the test set we also see that 87% of SMS messages are ham(non-spam) and 13% are spam.

These results are similar to watch we have in the entire dataset

## Data Cleaning

We will use the Multinomial Naive Bayes algorithm to calculate probabilities and make classifications of new messages into spam or ham.
To simplify calculations of probabilities we will clean the datasets, incluidng changing the table format, making all words lower case (consistent capitalization) and removing punctuation.

In [10]:
#remove all punction from SMS column

training_set['SMS'] = training_set["SMS"].str.replace("\W"," ")

training_set['SMS'] = training_set['SMS'].str.lower()

In [11]:
training_set.head(10)

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


In [12]:
#define vocabulary in training set

training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []


In [13]:
print(training_set['SMS'].head())
            

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object


In [14]:
for message in training_set["SMS"]:
    for word in message:
           vocabulary.append(word)

In [15]:
vocabulary_set = list(set(vocabulary))

print("Number of Vocabulary Words")
print(len(vocabulary_set))
#print first 8 elements of vocabulary set
print(vocabulary_set[0:8])


Number of Vocabulary Words
7783
['gin', 'lemme', '08707500020', 'aunts', 'careful', 'bbd', 'intro', 'bye']


In [16]:
#create dictionary for each word and their count in each sms 
word_count_per_sms = {unique_word: [0]*len(training_set["SMS"]) for unique_word in vocabulary_set}

In [17]:
for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_count_per_sms[word][index]+= 1


In [18]:
word_count_per_sms_df = pd.DataFrame(word_count_per_sms)

In [19]:
print(word_count_per_sms_df.shape) #check that size matches training_set
print(word_count_per_sms_df.columns)

(4458, 7783)
Index(['0', '00', '000', '000pes', '008704050406', '0089', '01223585334', '02',
       '0207', '02072069400',
       ...
       'zindgi', 'zoe', 'zogtorius', 'zouk', 'zyada', 'é', 'ú1', 'ü', '〨ud',
       '鈥'],
      dtype='object', length=7783)


In [20]:
total_training_set = pd.concat([training_set, word_count_per_sms_df], axis = 1)

In [21]:
total_training_set.head(5)

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating Probability Constants

To calculate the probabilities for a new message being a spam or ham based on its content, we need to calculate the following probabilities from the training set:
* P(Spam) - probability that a message is a spam
* P(Ham) - probability that a message is a ham
* N_spam - number of words in spam messages 
* N_ham - number of words in ham messages
* N_vocabulary - total number of words in the vocabulay, which already determined was 7782

In [22]:

spam_messages = total_training_set[total_training_set["Label"] == 'spam']
ham_messages = total_training_set[total_training_set["Label"]== 'ham']

#number of words in the vocabulary
N_vocabulary = len(vocabulary_set) 

#P(Spam) and P(Ham) 
P_spam = len(spam_messages)/len(total_training_set)
P_ham =len(ham_messages)/len(total_training_set)


n_word_per_spamsms = spam_messages["SMS"].apply(len)

N_spam = n_word_per_spamsms.sum()

n_word_per_hamsms = ham_messages["SMS"].apply(len)
N_ham = n_word_per_hamsms.sum()

total_count_sms  = total_training_set.shape[0]


alpha = 1 #smoothing parameter

print("P(Spam)", P_spam)
print("P(Ham)", P_ham)
print("Total Number of Training Set SMSs:",total_count_sms )

print(N_spam, N_ham, N_vocabulary)

P(Spam) 0.13458950201884254
P(Ham) 0.8654104979811574
Total Number of Training Set SMSs: 4458
15190 57237 7783


In [23]:
#initialize dictionary with unique word(key) and corresponding probabilities given spam/ham (value)

p_unique_word_given_spam ={ unique_word : 0 for unique_word in vocabulary_set}
p_unique_word_given_ham  ={ unique_word : 0 for unique_word in vocabulary_set}


In [24]:
for word in vocabulary_set:
    n_word_given_spam = spam_messages[word].sum() #recall each unique word is a column label with count of it's occurence in each sms
    p_word_given_spam = (n_word_given_spam +alpha)/ (P_spam + (N_vocabulary*alpha))
    p_unique_word_given_spam[word] =   p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum() #recall each unique word is a column label with count of it's occurence in each sms
    p_word_given_ham = (n_word_given_ham +alpha)/ (P_ham + (N_vocabulary*alpha))
    p_unique_word_given_ham[word] =   p_word_given_ham
    


## Classifying A New Message

In [25]:
#create spam filter
import re

def spam_filter(new_message):  #new message input is list of words
    #get spam given new message probability
    
    new_message = re.sub("\W"," ",new_message)
    new_message = new_message.lower()
    new_message = new_message.split()
    
    P_spam_given_new_message = P_spam
    P_ham_given_new_message = P_ham
    
    for word in new_message:
        
        if word in p_unique_word_given_spam:
            
            P_spam_given_new_message*=p_unique_word_given_spam[word]
                
        if word in p_unique_word_given_ham:

            P_ham_given_new_message*=p_unique_word_given_ham[word]
            
    

    print("P(Spam|New_Message)", P_spam_given_new_message)
    print("P(Ham|New_Message)", P_ham_given_new_message)
    
    if P_spam_given_new_message > P_ham_given_new_message:
        return "spam"
    elif P_spam_given_new_message < P_ham_given_new_message:
        return "ham"
    else:
        return "Please request human help for proper classification"
    
    

In [26]:
# apply spam filter on test set

test_set["predicted"] = test_set["SMS"].apply(spam_filter)

P(Spam|New_Message) 6.798436291464378e-23
P(Ham|New_Message) 1.2069296138440933e-12
P(Spam|New_Message) 5.295055564410864e-30
P(Ham|New_Message) 1.914541756707767e-20
P(Spam|New_Message) 3.233668566296403e-69
P(Ham|New_Message) 2.349816623783358e-71
P(Spam|New_Message) 6.1364934239512176e-30
P(Ham|New_Message) 2.933359058837065e-20
P(Spam|New_Message) 6.964076644084253e-59
P(Ham|New_Message) 1.8002559826957724e-39
P(Spam|New_Message) 3.309002989118682e-95
P(Ham|New_Message) 4.409684776352239e-59
P(Spam|New_Message) 5.776641994908997e-07
P(Ham|New_Message) 0.0010723260687245427
P(Spam|New_Message) 6.381215446881822e-38
P(Ham|New_Message) 7.90924426338011e-26
P(Spam|New_Message) 1.6699421785071928e-36
P(Ham|New_Message) 5.495364743727737e-24
P(Spam|New_Message) 2.3071908029043926e-13
P(Ham|New_Message) 2.626420559575449e-10
P(Spam|New_Message) 5.083423056469472e-14
P(Ham|New_Message) 8.944581137662487e-09
P(Spam|New_Message) 3.83059412942913e-35
P(Ham|New_Message) 3.203840025008001e-21
P

In [27]:
#check # of ham and spam messages in 'label' column and 'predicted' column 

print(test_set["Label"].value_counts())
print(test_set["predicted"].value_counts())


ham     967
spam    147
Name: Label, dtype: int64
ham     1023
spam      91
Name: predicted, dtype: int64


In [28]:
total = test_set.shape[0]

bool_correct = test_set["Label"] == test_set["predicted"]
correct = bool_correct.sum()
print("Number of correctly classified sms:", correct)
accuracy = correct/total
print(accuracy)

Number of correctly classified sms: 1058
0.9497307001795332


The accuracy of our spam filter is approximately 95%, which is quite good. Out of the 1114 new messages, the algorithm was able to classify 1058.