# Building a Spam Filter using Naive Bayes

The goal of this project is to build a spam filter for SMS messages using a dataset of 5572 SMS messages that have already been labeled as spam or ham (non-spam). The dataset was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). Our target is a filter that classifies new messages with an accuracy greater than 80%.

In [1]:
import pandas as pd
from IPython.display import display

In [2]:
sms_collection = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
display(sms_collection.head())
print('The dataset contains {} messages.' .format(sms_collection.shape[0]))
print('The percentage of non-spam (ham) vs spam messages are - ')
print(sms_collection['Label'].value_counts(normalize=True))

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


The dataset contains 5572 messages.
The percentage of non-spam (ham) vs spam messages are - 
ham     0.865937
spam    0.134063
Name: Label, dtype: float64


We will randomly split the dataset into a training set (80%, 4458 messages) and a test set (the remaining 20%, 1114 messages).

In [4]:
sms_collection = sms_collection.sample(random_state=1, frac=1)
train = sms_collection.iloc[:4458]
test = sms_collection.iloc[4458:]
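
The cutoff 4458 is hard-coded for this particular file (80% of 5572, rounded). As a minimal sketch, assuming a hypothetical helper named `split_train_test` and a toy DataFrame standing in for `sms_collection`, the cutoff can instead be derived from the length of the data:

```python
import pandas as pd

def split_train_test(df, train_frac=0.8, seed=1):
    """Shuffle df and split it into train and test parts (hypothetical helper)."""
    shuffled = df.sample(frac=1, random_state=seed)
    cutoff = round(len(shuffled) * train_frac)
    # .copy() gives each part its own data; on pandas versions that emit
    # SettingWithCopyWarning, this avoids it when columns are assigned later
    train = shuffled.iloc[:cutoff].copy()
    test = shuffled.iloc[cutoff:].copy()
    return train, test

# toy stand-in for sms_collection (illustrative data only)
toy = pd.DataFrame({'Label': ['ham'] * 8 + ['spam'] * 2,
                    'SMS': ['message {}'.format(i) for i in range(10)]})
toy_train, toy_test = split_train_test(toy)
print(len(toy_train), len(toy_test))
```

With `train_frac=0.8` and 5572 rows, `round(5572 * 0.8)` gives the same cutoff of 4458 used above.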

In [5]:
display(train.head())
display(test.head())

| | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


| | Label | SMS |
|---|---|---|
| 2131 | ham | Later i guess. I needa do mcat study too. |
| 3418 | ham | But i haf enuff space got like 4 mb... |
| 3424 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 1538 | ham | All sounds good. Fingers . Makes it difficult ... |
| 5393 | ham | All done, all handed in. Don't know if mega sh... |


In [6]:
display(train['Label'].value_counts(normalize=True))
display(test['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

Train and test datasets have been created with similar percentages of ham and spam labels. 

## Data Cleaning

We will remove all punctuation from the SMS column and convert all characters to lower case. Only the training set is cleaned here; test messages are cleaned later, inside the classification function.

In [7]:
# replace every non-word character with a space (raw string keeps \W a regex escape)
train['SMS'] = train['SMS'].replace(to_replace=r'\W', value=' ', regex=True)
train['SMS'] = train['SMS'].str.lower()


In [8]:
display(train.head())

| | Label | SMS |
|---|---|---|
| 1078 | ham | yep by the pretty sculpture |
| 4028 | ham | yes princess are you going to make me moan |
| 958 | ham | welp apparently he retired |
| 4642 | ham | havent |
| 4674 | ham | i forgot 2 ask ü all smth there s a card on ... |
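
To see what the cleaning does to a single message, here is a small sketch using only the standard library. The helper name `clean_sms` is an assumption for illustration; it bundles the regex replacement and lowercasing with a whitespace split.

```python
import re

def clean_sms(text):
    """Replace non-word characters with spaces, lowercase, and split on whitespace."""
    text = re.sub(r'\W', ' ', text)
    return text.lower().split()

print(clean_sms("Yep, by the pretty sculpture"))
# ['yep', 'by', 'the', 'pretty', 'sculpture']
print(clean_sms("Don't know"))
# ['don', 't', 'know']
```

Note that apostrophes count as non-word characters, so contractions are split into two tokens, which is why "There's" shows up as "there s" in the cleaned data. Letters such as ü are word characters and are kept.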


In [9]:
train['SMS'] = train['SMS'].str.split()
vocabulary = []
for i in train['SMS']:
    for j in i:
        vocabulary.append(j)
        
vocabulary = set(vocabulary) #to remove any duplicated terms
vocabulary = list(vocabulary) #convert back to list
vocabulary


['gosh',
 'emerging',
 'discussed',
 'inches',
 'msging',
 '09066362231',
 'random',
 'entropication',
 'uh',
 'wipe',
 'mahfuuz',
 'overheating',
 'size',
 'drms',
 'either',
 'interested',
 'confirmed',
 'woulda',
 '400thousad',
 'befor',
 'wap',
 'find',
 'vivek',
 'comfey',
 'route',
 'help08718728876',
 'kb',
 'l8er',
 'lanka',
 'willing',
 'hows',
 'energy',
 'lipo',
 'ertini',
 'cramps',
 'misss',
 '3750',
 'chapel',
 'fieldof',
 'at',
 'page',
 '69669',
 'lingerie',
 'chk',
 'torch',
 'enemy',
 'thankyou',
 'bad',
 'bangbabes',
 '09066380611',
 'chip',
 'shove',
 'shop',
 'award',
 'know',
 'envelope',
 'verifying',
 '08715203652',
 'flirtparty',
 'subscribed',
 'grown',
 'then',
 '5',
 'wining',
 'smth',
 'faggot',
 'promise',
 'went',
 'flyng',
 'dippeditinadew',
 'ktv',
 '09061701851',
 'permissions',
 'certificate',
 'tooth',
 'noi',
 'mmsto',
 'sportsx',
 'island',
 'cave',
 'loans',
 'dnot',
 'gmw',
 'pei',
 'life',
 'beauties',
 'adress',
 'age',
 'jos',
 'prayrs',
 'crc

In [10]:
word_count_per_sms = {}
for unique_word in vocabulary:
    word_count_per_sms[unique_word] = [0] * len(train)
train = train.reset_index(drop=True)
for index, msg in enumerate(train['SMS']):
    for word in msg:
        word_count_per_sms[word][index] += 1  
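
The loop above allocates one list of `len(train)` zeros per vocabulary word, which for 7783 words and 4458 messages is roughly 35 million entries. As an alternative sketch on toy data (the variable names below are illustrative, not part of the notebook), `collections.Counter` stores only the words that actually occur in each message, and sorting the vocabulary makes the column order reproducible between runs:

```python
from collections import Counter

# toy tokenized messages standing in for train['SMS'] (illustrative data)
messages = [['secret', 'prize', 'claim', 'now'],
            ['coming', 'to', 'my', 'secret', 'party'],
            ['winner', 'claim', 'prize', 'prize']]

# one Counter per message: word -> number of occurrences in that message
counts_per_message = [Counter(msg) for msg in messages]

# a sorted vocabulary gives a stable column order (a plain set does not)
vocab = sorted(set(word for msg in messages for word in msg))

# dense rows, one per message, in the same layout as word_counts
rows = [[counts[word] for word in vocab] for counts in counts_per_message]

print(vocab)
for row in rows:
    print(row)
```

A `Counter` returns 0 for missing keys, so no explicit zero-initialization is needed.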

In [11]:
word_counts = pd.DataFrame(word_count_per_sms)
word_counts.head()

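| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |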


In [12]:
#concatenate word_count dataframe with training dataframe
train_set_clean = pd.concat([train, word_counts], axis=1)

In [13]:
display(train_set_clean.head())

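| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |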


## Probability calculations

We'll start by calculating the quantities shared by all the probability equations: the prior probabilities p_spam and p_ham, n_ham (the total number of words in all ham messages), n_spam (the same for spam messages) and n_vocabulary (the number of unique words in the vocabulary). We will also set the Laplace smoothing parameter alpha to 1.
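
Written out, the smoothed conditional probabilities computed in the following cells take the form below. The symbols $N_{w \mid \text{Spam}}$ and $N_{w \mid \text{Ham}}$ are notation assumed here for the number of times word $w$ occurs in spam and ham messages; $N_{\text{Spam}}$, $N_{\text{Ham}}$, $N_{\text{Vocabulary}}$ and $\alpha$ correspond to n_spam, n_ham, n_vocabulary and alpha in the code.

\begin{equation}
P(w \mid \text{Spam}) = \frac{N_{w \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{equation}

\begin{equation}
P(w \mid \text{Ham}) = \frac{N_{w \mid \text{Ham}} + \alpha}{N_{\text{Ham}} + \alpha \cdot N_{\text{Vocabulary}}}
\end{equation}

With $\alpha = 1$, a word that never appears in one class still gets a small non-zero probability for that class, so a single unseen word cannot force the whole product to zero.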

In [14]:
ham_df = train_set_clean[train_set_clean['Label'] == 'ham']
spam_df = train_set_clean[train_set_clean['Label'] == 'spam']
p_ham = len(ham_df) / len(train_set_clean)
p_spam = len(spam_df) / len(train_set_clean)
print('p_ham = {}' .format(p_ham))
print('p_spam = {}' .format(p_spam))

p_ham = 0.8654104979811574
p_spam = 0.13458950201884254


In [15]:
n_ham = 0
for row in ham_df['SMS']:
    n_ham += len(row)
    
n_spam = 0
for row in spam_df['SMS']:
    n_spam += len(row)
n_vocabulary = len(vocabulary) 
alpha = 1
print('n_ham = {}' .format(n_ham))
print('n_spam = {}' .format(n_spam))
print('n_vocabulary = {}' .format(n_vocabulary))

n_ham = 57237
n_spam = 15190
n_vocabulary = 7783


In [16]:
#initialize probabilities
p_word_given_spam = {}
p_word_given_ham = {}
for word in vocabulary:
    p_word_given_ham[word] = 0
    p_word_given_spam[word] = 0

In [17]:
for word in vocabulary:
    # count occurrences of the word across all ham and all spam messages
    ham_word = sum(sms.count(word) for sms in ham_df['SMS'])
    spam_word = sum(sms.count(word) for sms in spam_df['SMS'])
    # Laplace-smoothed conditional probabilities
    p_word_given_ham[word] = (ham_word + alpha) / (n_ham + alpha*n_vocabulary)
    p_word_given_spam[word] = (spam_word + alpha) / (n_spam + alpha*n_vocabulary)
        

In [18]:
p_word_given_ham['forgot']

0.00039987696093509687
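
As a sanity check, this value equals 26 / (57237 + 7783) = 26 / 65020, which means 'forgot' occurs 25 times in the ham training messages. The same arithmetic can be sketched on a toy corpus small enough to verify by hand; the data and the names below are illustrative only and are not taken from the dataset.

```python
from collections import Counter

# toy tokenized spam messages (illustrative data, not from the dataset)
toy_spam_messages = [['win', 'prize', 'now'], ['win', 'cash']]
toy_vocab = ['win', 'prize', 'now', 'cash', 'hello']  # 'hello' never occurs in spam

toy_alpha = 1
toy_spam_counts = Counter(word for msg in toy_spam_messages for word in msg)
toy_n_spam = sum(len(msg) for msg in toy_spam_messages)  # 5 words in total
toy_n_vocabulary = len(toy_vocab)                        # 5 unique words

toy_p_word_given_spam = {
    word: (toy_spam_counts[word] + toy_alpha) / (toy_n_spam + toy_alpha * toy_n_vocabulary)
    for word in toy_vocab
}
print(toy_p_word_given_spam['win'])    # (2 + 1) / (5 + 5) = 0.3
print(toy_p_word_given_spam['hello'])  # (0 + 1) / (5 + 5) = 0.1
```

The smoothed probabilities over the whole vocabulary still sum to 1, which is a quick way to check an implementation.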

We have now calculated all the probabilities required by the Naive Bayes algorithm. Next, we will write a function that classifies any new incoming message as spam or ham.

In [23]:
import re
def classify(message):
    # clean the message the same way the training messages were cleaned
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # start from the prior probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # multiply by the probability of each word; words outside the vocabulary are ignored
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= p_word_given_spam[word]
            p_ham_given_message *= p_word_given_ham[word]
    print(p_spam_given_message)
    print(p_ham_given_message)
    if p_spam_given_message > p_ham_given_message:
        print('Label: Spam')
    elif p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    else:
        print('Equal probabilities, have a human classify this!')




In [24]:
classify('Sounds, good, Neha. See u there - Adarsh')

6.43136700206982e-18
2.824167598198773e-14
Label: Ham


In [25]:
classify('WINNER!! This is the secret code to unlock the money: C3421')

1.3481290211300841e-25
1.9368049028589875e-27
Label: Spam
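
The scores printed above are already as small as 1e-25 to 1e-27 for a short message. SMS texts are short, so this works here, but a long enough input would drive both products to exactly 0.0 and trigger the "equal probabilities" branch. A common workaround, sketched below with toy probabilities and a hypothetical helper `log_score` (neither is part of the notebook), is to compare sums of logarithms instead of products:

```python
import math

def log_score(words, prior, p_word_given_class):
    """Sum of log-probabilities; words missing from the table are skipped."""
    score = math.log(prior)
    for word in words:
        if word in p_word_given_class:
            score += math.log(p_word_given_class[word])
    return score

# toy probabilities (illustrative values, not taken from the notebook)
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_word_given_spam = {'win': 0.3, 'hello': 0.1}
toy_word_given_ham = {'win': 0.05, 'hello': 0.4}

# 1000 repeated words: multiplying the raw probabilities underflows to 0.0
# for both classes, while the log scores stay comparable
words = ['win'] * 1000
spam_score = log_score(words, toy_p_spam, toy_word_given_spam)
ham_score = log_score(words, toy_p_ham, toy_word_given_ham)
print('spam' if spam_score > ham_score else 'ham')
```

Because the logarithm is monotonic, comparing log scores picks the same label as comparing the products whenever the products do not underflow.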


Having tried the classify function on two example messages, we will now measure the accuracy of the spam filter on the test set. To do so, we modify the function to return the label instead of printing it, and store the returned labels in a new column of the test set.

In [28]:
def classify_test_set(message):
    # clean the message the same way the training messages were cleaned
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # start from the prior probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    # multiply by the probability of each word; words outside the vocabulary are ignored
    for word in message:
        if word in vocabulary:
            p_spam_given_message *= p_word_given_spam[word]
            p_ham_given_message *= p_word_given_ham[word]

    # return the label instead of printing it
    if p_spam_given_message > p_ham_given_message:
        classification = 'spam'
    elif p_ham_given_message > p_spam_given_message:
        classification = 'ham'
    else:
        classification = 'Equal probabilities, have a human classify this!'
    return classification

In [29]:
test['predicted'] = test['SMS'].apply(classify_test_set)
display(test.head())


| | Label | SMS | predicted |
|---|---|---|---|
| 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 1538 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 5393 | ham | All done, all handed in. Don't know if mega sh... | ham |


We will now measure the accuracy of our spam filter, defined as:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified messages}}{\text{total number of classified messages}}
\end{equation}




In [37]:
correct = sum(test['Label'] == test['predicted']) 
total = len(test)
accuracy = correct / total
display(accuracy*100)

98.74326750448833

In [52]:
print('The accuracy of our Naive Bayes spam filter is {:.3f}%' .format(accuracy*100))

The accuracy of our Naive Bayes spam filter is 98.743%
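
Since about 87% of the messages are ham, a filter that always answers "ham" would already score about 87% accuracy, so it can help to look at where the errors fall as well. The sketch below uses toy labels (not the notebook's test set) and illustrative variable names to count each (actual, predicted) pair and compute the share of spam that was caught:

```python
from collections import Counter

# toy labels (illustrative, not the notebook's test set)
actual    = ['ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham']
predicted = ['ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham']

toy_accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# (actual, predicted) -> count; pairs with differing labels are the errors
confusion = Counter(zip(actual, predicted))

# recall on spam: share of real spam messages that were labeled spam
spam_recall = confusion[('spam', 'spam')] / sum(a == 'spam' for a in actual)

print('accuracy = {:.3f}'.format(toy_accuracy))
print('spam recall = {:.3f}'.format(spam_recall))
print(dict(confusion))
```

On the notebook's own data, `pd.crosstab(test['Label'], test['predicted'])` would give the same kind of breakdown as a table.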
