## Building a Spam Filter with Naive Bayes

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.

2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.

3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).


So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.


In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("SMSSpamCollection", sep = "\t", names = ['Label', 'SMS'], header = None )

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [4]:
data.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
round(data["Label"].value_counts(normalize = True) * 100,0)

ham     87.0
spam    13.0
Name: Label, dtype: float64

## Training and Test Set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

1. The training set will have 4,458 messages (about 80% of the dataset).

2. The test set will have 1,114 messages (about 20% of the dataset).

In [6]:
randomized_data = data.sample(frac = 1 , random_state = 1)

In [7]:
training_data_index = round(len(randomized_data) * 0.8)
test_data_index      = round(len(randomized_data) * 0.2)

In [8]:
training_data = randomized_data[:training_data_index].copy().reset_index(drop = True)
test_data     = randomized_data[:test_data_index].copy().reset_index(drop = True)

In [9]:
print("Training data shape : {}".format(training_data.shape))
print("Test data shape     : {}".format(test_data.shape))

Training data shape : (4458, 2)
Test data shape     : (1114, 2)


In [10]:
round(training_data["Label"].value_counts(normalize = True) * 100)

ham     87.0
spam    13.0
Name: Label, dtype: float64

In [11]:
round(test_data["Label"].value_counts(normalize = True) * 100)

ham     87.0
spam    13.0
Name: Label, dtype: float64

In [12]:
import re

In [13]:
training_data.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [14]:
training_data["SMS"] = training_data["SMS"].str.replace("\W", " ")
training_data["SMS"] = training_data["SMS"].str.lower()
training_data

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok


## Creating the Vocabulary

In [15]:
vocabulary = []
training_data["SMS"] = training_data["SMS"].str.split().copy()
training_data.head()

Unnamed: 0,Label,SMS
0,ham,"[yep, by, the, pretty, sculpture]"
1,ham,"[yes, princess, are, you, going, to, make, me,..."
2,ham,"[welp, apparently, he, retired]"
3,ham,[havent]
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,..."


In [16]:
for sms in training_data["SMS"]:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))

In [17]:
word_counts_per_sms = {unique_word: [0] * len(training_data["SMS"]) for unique_word in vocabulary}

for index, sms in enumerate(training_data["SMS"]):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [18]:
word_set = pd.DataFrame(word_counts_per_sms)

word_set.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [19]:
training_set = pd.concat([training_data,word_set], axis = 1)
training_set.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


## Calculating Constants First

In [20]:
p_spam = len(training_set[training_set["Label"] == "spam"])/ len(training_set)

p_spam

0.13458950201884254

In [21]:
p_ham = training_set["Label"].value_counts().get(0) / len(training_set)
p_ham

0.8654104979811574

In [22]:
n_spam = training_set.loc[training_set["Label"] == "spam", "SMS"].apply(len).sum()

In [23]:
n_ham = training_set.loc[training_set["Label"] == "ham", "SMS"].apply(len).sum()

In [24]:
alpha = 1

In [25]:
n_vocabulary = len(vocabulary)

In [26]:
d_spam = {word : 0 for word in vocabulary}
d_ham  = {word : 0 for word in vocabulary}

m_spam = training_set[training_set["Label"] == "spam"]
m_ham = training_set[training_set["Label"] == "ham"]

In [27]:
for word in vocabulary:
    word_spam = m_spam[word].sum()
    p_w_spam = (word_spam + alpha) / (n_spam + alpha * n_vocabulary)
    d_spam[word] = p_w_spam
    
    word_ham = m_ham[word].sum()
    p_w_ham  = (word_ham + alpha) / (n_ham + alpha * n_vocabulary)
    d_ham[word] = p_w_ham

In [28]:
d_spam
d_ham

{'turn': 6.151953245155337e-05,
 'aburo': 4.6139649338665025e-05,
 'opposed': 3.075976622577668e-05,
 'scores': 3.075976622577668e-05,
 'understood': 6.151953245155337e-05,
 'connections': 3.075976622577668e-05,
 '2waxsto': 4.6139649338665025e-05,
 'crazyin': 1.537988311288834e-05,
 'fucked': 4.6139649338665025e-05,
 'differ': 4.6139649338665025e-05,
 'touched': 6.151953245155337e-05,
 'slave': 0.00012303906490310673,
 'unsub': 1.537988311288834e-05,
 'ls1': 1.537988311288834e-05,
 'frndshp': 3.075976622577668e-05,
 'lifebook': 3.075976622577668e-05,
 'adventuring': 3.075976622577668e-05,
 'maretare': 3.075976622577668e-05,
 'by': 0.001707167025530606,
 'opener': 3.075976622577668e-05,
 'c52': 1.537988311288834e-05,
 'medical': 0.00012303906490310673,
 'cts': 3.075976622577668e-05,
 'jordan': 1.537988311288834e-05,
 'facilities': 3.075976622577668e-05,
 'safety': 3.075976622577668e-05,
 'bother': 7.689941556444171e-05,
 'open': 0.0001845585973546601,
 'throws': 3.075976622577668e-05,
 

In [29]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message  = p_ham
    
    for word in message:
        if word in d_spam:
            p_spam_given_message *= d_spam[word]
        
        if word in d_ham:
            p_ham_given_message *= d_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [30]:
m_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
m_2 = "Sounds good, Tom, then see u there"

In [31]:
classify(m_1)

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [32]:
classify(m_2)

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


## Measuring the Spam Filter's Accuracy

In [33]:
def classify_test_set(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    p_spam_given_message = p_spam
    p_ham_given_message  = p_ham
    
    for word in message:
        if word in d_spam:
            p_spam_given_message *= d_spam[word]
        
        if word in d_ham:
            p_ham_given_message *= d_ham[word]
            

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [34]:
test_data

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...
5,ham,Ok i thk i got it. Then u wan me 2 come now or...
6,ham,I want kfc its Tuesday. Only buy 2 meals ONLY ...
7,ham,No dear i was sleeping :-P
8,ham,Ok pa. Nothing problem:-)
9,ham,Ill be there on &lt;#&gt; ok.


In [35]:
test_data["predicted"] = test_data["SMS"].apply(classify_test_set)
test_data.head()

Unnamed: 0,Label,SMS,predicted
0,ham,"Yep, by the pretty sculpture",ham
1,ham,"Yes, princess. Are you going to make me moan?",ham
2,ham,Welp apparently he retired,ham
3,ham,Havent.,ham
4,ham,I forgot 2 ask ü all smth.. There's a card on ...,ham


In [36]:
correct = 0
total   = len(test_data)
total

1114

In [37]:
for row in test_data.iterrows():
    row = row[1]
    if row["Label"] == row["predicted"]:
        correct +=1

In [38]:
print("Correct:",correct)
print("Total:",total)
print("Wrong:", total - correct)
print("Accuracy:", correct / total)

Correct: 1106
Total: 1114
Wrong: 8
Accuracy: 0.992818671454219


The accuracy is close to 99.28%, which is really good. Our spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,106 correctly.

## Next Steps

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 99.28% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.

Next steps include:

Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm classified them incorrectly
Make the filtering process more complex by making the algorithm sensitive to letter case