# Building a Spam Filter with Naive Bayes

In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail on the dataset's web page, where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.

## Exploring the Dataset

In [1]:
import pandas as pd
# The file is tab-separated and has no header row
sms = pd.read_csv('SMSSpamCollection', sep='\t', header=None)

In [2]:
sms.head()

|   | 0 | 1 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [3]:
names = ['Label', 'SMS']
sms.columns = names
sms.head()

|   | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
sms.shape

(5572, 2)

In [5]:
sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [6]:
sms.Label.value_counts(normalize = True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

## Training and Test Set

Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we're first going to split our dataset into two sets:

* A training set, which we'll use to "train" the computer how to classify messages.
* A test set, which we'll use to test how well the spam filter classifies new messages.

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

* The training set will have 4,458 messages (about 80% of the dataset).
* The test set will have 1,114 messages (about 20% of the dataset).
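
The split sizes above follow from simple rounding. As a minimal sketch, assuming a plain Python list of row positions as a stand-in for the messages (the names `rows`, `train_rows`, and `test_rows` are hypothetical and not part of the notebook), the arithmetic and the slicing could look like this:

```python
import random

n_total = 5572                      # dataset size reported above
n_train = round(n_total * 0.8)      # 4457.6 rounds to 4458
n_test = n_total - n_train          # the remaining rows go to the test set

rows = list(range(n_total))         # hypothetical stand-in for the messages
random.Random(1).shuffle(rows)      # fixed seed so the shuffle is repeatable

train_rows = rows[:n_train]         # first 80% after shuffling
test_rows = rows[n_train:]          # last 20% after shuffling
print(n_train, n_test)
```

Shuffling before slicing matters because the two sets should both be random samples of the same data, which is what the randomization step that follows does with `DataFrame.sample`.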

First, we'll randomize the dataset:

In [7]:
sms_rand = sms.sample(frac = 1, random_state = 1)
sms_rand.head()

|   | Label | SMS |
|---|---|---|
| 1078 | ham | Yep, by the pretty sculpture |
| 4028 | ham | Yes, princess. Are you going to make me moan? |
| 958 | ham | Welp apparently he retired |
| 4642 | ham | Havent. |
| 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


Then we split the randomized data into a training set and a test set:

In [8]:
n_train = int(round(sms_rand.shape[0]*.8,0))
n_test = int(sms_rand.shape[0] - n_train)
train = sms_rand.iloc[:n_train,:].copy()
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4458 entries, 1078 to 3482
Data columns (total 2 columns):
Label    4458 non-null object
SMS      4458 non-null object
dtypes: object(2)
memory usage: 104.5+ KB


In [9]:
test = sms_rand.iloc[n_train:,:].copy()
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1114 entries, 2131 to 5157
Data columns (total 2 columns):
Label    1114 non-null object
SMS      1114 non-null object
dtypes: object(2)
memory usage: 26.1+ KB


We'll reset the indexes of both the test and training sets. Because `drop` is not set, the original row labels are kept in a new `index` column:

In [10]:
test.reset_index(inplace = True)
train.reset_index(inplace = True) 

In [11]:
test.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... |


In [12]:
train.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | Yep, by the pretty sculpture |
| 1 | 4028 | ham | Yes, princess. Are you going to make me moan? |
| 2 | 958 | ham | Welp apparently he retired |
| 3 | 4642 | ham | Havent. |
| 4 | 4674 | ham | I forgot 2 ask ü all smth.. There's a card on ... |


Let's check the percentage of spam messages in each set:

In [13]:
test.Label.value_counts(normalize = True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [14]:
train.Label.value_counts(normalize = True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

## Data Cleaning

To calculate all the probabilities, we'll first need to perform a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.

Let's begin the data cleaning process by removing the punctuation and bringing all the words to lower case.
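
As a minimal sketch of what this cleaning does to a single message, the snippet below applies the same two steps to one string. The helper name `clean` is hypothetical and only used for illustration; the sample text is the visible start of the first message in the dataset.

```python
import re

def clean(message):
    """Replace every non-word character with a space, then lower-case."""
    # \W matches anything that is not a letter, a digit, or an underscore
    return re.sub(r'\W', ' ', message).lower()

sample = "Go until jurong point, crazy.. Available only"
tokens = clean(sample).split()   # split() also collapses repeated spaces
print(tokens)
```

Replacing punctuation with a space rather than deleting it keeps words apart when they are joined only by punctuation, at the cost of splitting contractions such as "there's" into "there" and "s".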

### Letter Case and Punctuation

First we'll remove all the punctuation from the SMS column.

In [15]:
import re
# Replace every non-word character with a space (raw string avoids an invalid-escape warning)
train['SMS'] = train['SMS'].apply(lambda i: re.sub(r'\W', ' ', i))

In [16]:
train['SMS'] = train.SMS.str.lower()

In [17]:
train.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 1078 | ham | yep by the pretty sculpture |
| 1 | 4028 | ham | yes princess are you going to make me moan |
| 2 | 958 | ham | welp apparently he retired |
| 3 | 4642 | ham | havent |
| 4 | 4674 | ham | i forgot 2 ask ü all smth there s a card on ... |


### Creating the Vocabulary

The vocabulary is going to be a Python list containing all the unique words across all messages, where each word is represented as a string. We first split each message into a list of words:

In [18]:
train.SMS = train.SMS.str.split()
train.SMS.head()

0                    [yep, by, the, pretty, sculpture]
1    [yes, princess, are, you, going, to, make, me,...
2                      [welp, apparently, he, retired]
3                                             [havent]
4    [i, forgot, 2, ask, ü, all, smth, there, s, a,...
Name: SMS, dtype: object

In [19]:
vocabulary = []
for i in train.SMS:
    for k in i:
        vocabulary.append(k)
print(len(vocabulary))

72427


In [20]:

vocabulary = set(vocabulary)

In [21]:
len(vocabulary)

7783

In [22]:
vocabulary = list(vocabulary)
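
The same result can be reached in one step. As a sketch on toy data (the third message is made up for illustration), a set comprehension collects each distinct word once:

```python
# Toy tokenized messages; in the notebook these come from train.SMS
messages = [
    ['yep', 'by', 'the', 'pretty', 'sculpture'],
    ['welp', 'apparently', 'he', 'retired'],
    ['he', 'retired', 'by', 'the', 'sea'],      # hypothetical extra message
]

# The set keeps one copy of every word; sorting gives a reproducible order
vocabulary = sorted({word for message in messages for word in message})
print(len(vocabulary), vocabulary)
```

Sorting is optional. It is used here only because set ordering is not guaranteed, and a stable order makes the later word-count columns easier to compare between runs.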

### The Final Training Set

We start by creating a dictionary whose keys are the words in the vocabulary and whose values are lists of zeros. Each list has one entry per message in the `SMS` column of the training data frame.

In [23]:
dic_words = {i:[0]*len(train.SMS) for i in vocabulary}


Next, we count the words in each message and fill in the dictionary of zeros:

In [24]:
for i, s in enumerate(train.SMS):
    for k in s:
        dic_words[k][i]+=1
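
To make the shape of the result concrete, here is a small sketch of the same counting idea on two toy messages. It uses `collections.Counter` instead of the pre-filled dictionary, which is a swapped-in technique and not the notebook's method; the names `counts` and `matrix` are hypothetical.

```python
from collections import Counter

# Toy tokenized messages, made up for illustration
messages = [
    ['free', 'entry', 'free'],
    ['ok', 'lar'],
]
vocabulary = sorted({w for m in messages for w in m})

# One Counter per message; looking up a missing word returns 0
counts = [Counter(m) for m in messages]

# One row per message, one column per vocabulary word
matrix = [[c[w] for w in vocabulary] for c in counts]
print(vocabulary)
print(matrix)
```

Each row corresponds to a message and each column to a word, which is the same layout the notebook builds before turning the dictionary into a data frame.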

In [25]:
word_df = pd.DataFrame(dic_words)
word_df.shape

(4458, 7783)

In [26]:
train_words = pd.concat([train, word_df], axis = 1)
train_words.shape

(4458, 7786)

In [27]:
train_words.columns

Index(['index', 'Label', 'SMS', '0', '00', '000', '000pes', '008704050406',
       '0089', '01223585334',
       ...
       'zindgi', 'zoe', 'zogtorius', 'zouk', 'zyada', 'é', 'ú1', 'ü', '〨ud',
       '鈥'],
      dtype='object', length=7786)

## Calculating Constants First

In [28]:
# Split the training set by label
ham = train_words[train_words.Label == 'ham']
spam = train_words[train_words.Label == 'spam']

In [34]:
p_spam = train_words.Label.value_counts(normalize = True)['spam']
p_ham = train_words.Label.value_counts(normalize = True)['ham']
print('p_spam = ', p_spam, '\n', 'p_ham = ', p_ham)

p_spam =  0.13458950201884254 
 p_ham =  0.8654104979811574


In [30]:
N_ham = sum(ham.SMS.apply(len))
N_spam = sum(spam.SMS.apply(len))
N_vocab = len(vocabulary)


In [35]:
print('N_ham = ',N_ham,'\n',
      'N_spam = ',N_spam, '\n',
      'N_vocab = ', N_vocab,'\n',
      )

N_ham =  57237 
 N_spam =  15190 
 N_vocab =  7783 



In [36]:
a = 1  # Laplace smoothing parameter (alpha)

## Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters P(w_i|Spam) and P(w_i|Ham). Each parameter will thus be a conditional probability value associated with each word in the vocabulary.
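
Written out, the smoothed estimates used in the code below are as follows. The notation is a reconstruction from the code, not taken from the original text: the count of word w_i in spam messages corresponds to `Ni_spam`, alpha to `a`, and the totals to `N_spam`, `N_ham`, and `N_vocab`.

```latex
\begin{aligned}
P(w_i \mid Spam) &= \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}} \\
P(w_i \mid Ham)  &= \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
\end{aligned}
```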

First we initialize two dictionaries in which each key is a unique word from our vocabulary (represented as a string) and each value is 0.

In [40]:
spam_words = {i:0 for i in vocabulary}
nonspam_words = {i:0 for i in vocabulary}


Next we iterate over the vocabulary and, for each word, calculate P(w_i|Spam) and P(w_i|Ham):

In [41]:
for i in vocabulary:
    Ni_spam = spam[i].sum()
    Ni_ham = ham[i].sum()
    spam_words[i] = (Ni_spam + a)/(N_spam +a*N_vocab)
    nonspam_words[i] = (Ni_ham + a)/(N_ham + a*N_vocab)
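
To see why the smoothing term matters, the sketch below plugs the constants printed earlier into the same formula for a word that never appears in a spam message. The helper name `smoothed` is hypothetical.

```python
# Constants printed earlier in the notebook
N_spam = 15190
N_vocab = 7783
alpha = 1

def smoothed(count, n_class, n_vocab, alpha):
    """Additive-smoothing estimate of P(word | class)."""
    return (count + alpha) / (n_class + alpha * n_vocab)

# A word that occurs zero times in spam messages
p_unseen = smoothed(0, N_spam, N_vocab, alpha)
p_unsmoothed = 0 / N_spam
print(p_unseen, p_unsmoothed)
```

Without smoothing, the estimate is exactly zero, and a single such word would force the whole product for that class to zero no matter what the other words say. With alpha set to 1 the estimate stays small but positive.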

## Classifying A New Message

In [54]:
def classify(mes):
    # Clean the message the same way as the training data
    mes = re.sub(r'\W', ' ', mes)
    mes = mes.lower()
    mes = mes.split()

    # Start from the class priors
    p_spam_given_mes = p_spam
    p_ham_given_mes = p_ham

    # Multiply in the probability of each known word; unseen words are skipped
    for i in mes:
        if i in spam_words:
            p_spam_given_mes *= spam_words[i]
        if i in nonspam_words:
            p_ham_given_mes *= nonspam_words[i]

    if p_ham_given_mes > p_spam_given_mes:
        return 'ham'
    elif p_ham_given_mes < p_spam_given_mes:
        return 'spam'
    else:
        return 'needs human classification'

Let's test the `classify` function with two messages: one obvious spam and one ordinary message.

In [55]:
classify("Sounds good, Tom, then see u there")

'ham'

In [56]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'
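
One thing to keep in mind is that `classify` multiplies many small probabilities, so for a long message both products can underflow to `0.0` and end in a tie. A common alternative, sketched here as an assumption and not as the notebook's method, is to add logarithms instead. The function `log_score` and the toy probabilities are hypothetical.

```python
import math

def log_score(prior, word_probs, words):
    """Sum of log-probabilities instead of a product of probabilities."""
    score = math.log(prior)
    for word in words:
        if word in word_probs:           # unseen words are skipped
            score += math.log(word_probs[word])
    return score

def product_score(prior, word_probs, words):
    """Plain product, as in the classify function above."""
    score = prior
    for word in words:
        if word in word_probs:
            score *= word_probs[word]
    return score

# Hypothetical toy parameters, not values from the trained filter
spam_probs = {'win': 1e-2, 'money': 1e-2}
ham_probs = {'win': 1e-5, 'money': 1e-5}
words = ['win', 'money'] * 100           # an artificially long message

naive_spam = product_score(0.5, spam_probs, words)
naive_ham = product_score(0.5, ham_probs, words)
log_spam = log_score(0.5, spam_probs, words)
log_ham = log_score(0.5, ham_probs, words)
print(naive_spam, naive_ham, log_spam > log_ham)
```

In this toy case both plain products collapse to zero, so they cannot be compared, while the log scores still rank spam above ham. Since the logarithm is monotonic, comparing log scores gives the same decision as comparing the products whenever the products are representable.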

## Measuring the Spam Filter's Accuracy

Now we apply the `classify` function to the test set that we created at the beginning. We add a new column, `prediction`, holding the result for each message. Lastly, we'll compare the predictions to the actual labels to measure the accuracy of our function.

In [57]:
test.head()

|   | index | Label | SMS |
|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... |


In [58]:
test['prediction'] = test.SMS.apply(classify)

In [59]:
test.head()

|   | index | Label | SMS | prediction |
|---|---|---|---|---|
| 0 | 2131 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | 3418 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | 3424 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | 1538 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | 5393 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [74]:
correct = 0
total = test.shape[0]
for i, row in test.iterrows():
    if row['Label']==row['prediction']:
        correct += 1
accuracy = correct/test.shape[0]
print('Correct: ', correct, '\n'
     'Incorrect: ', total - correct, '\n'
     'Accuracy: ', accuracy)

Correct:  1100 
Incorrect:  14 
Accuracy:  0.9874326750448833


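With an accuracy of about 98.7%, our spam filter comfortably exceeds the 80% target. For comparison, always predicting ham would already score about 86.8% on this test set, since that is the share of ham messages in it.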
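
Because ham messages dominate the data, a single accuracy number does not show which kind of mistake the filter makes. As a sketch on made-up labels and predictions (not the notebook's results), the errors can be broken down by pair of actual and predicted label and compared with a majority-class baseline:

```python
from collections import Counter

# Hypothetical toy data, made up for illustration
labels      = ['ham', 'ham', 'spam', 'ham', 'spam', 'ham']
predictions = ['ham', 'ham', 'spam', 'ham', 'ham',  'ham']

# Count each (actual, predicted) pair
confusion = Counter(zip(labels, predictions))

# Share of predictions that match the actual label
accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)

# Accuracy of always predicting the most common label
baseline = Counter(labels).most_common(1)[0][1] / len(labels)

print(confusion[('spam', 'ham')], confusion[('ham', 'spam')])
print(accuracy, baseline)
```

The pair `('spam', 'ham')` counts spam that slipped through, and `('ham', 'spam')` counts legitimate messages flagged as spam. Which of the two is more costly depends on how the filter is used.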

Let's have a look at the 14 incorrect predictions and try to see why the classification function failed there.

In [77]:
test[test.prediction != test.Label]

|   | index | Label | SMS | prediction |
|---|---|---|---|---|
| 114 | 3460 | spam | Not heard from U4 a while. Call me now am here... | ham |
| 135 | 1940 | spam | More people are dogging in your area now. Call... | ham |
| 152 | 3890 | ham | Unlimited texts. Limited minutes. | spam |
| 159 | 991 | ham | 26th OF JULY | spam |
| 284 | 4862 | ham | Nokia phone is lovly.. | spam |
| 293 | 2370 | ham | A Boy loved a gal. He propsd bt she didnt mind... | needs human classification |
| 302 | 326 | ham | No calls..messages..missed calls | spam |
| 319 | 5046 | ham | We have sent JD for Customer Service cum Accou... | spam |
| 504 | 3864 | spam | Oh my god! I've found your number again! I'm s... | ham |
| 546 | 4676 | spam | Hi babe its Chloe, how r u? I was smashed on s... | ham |
