In this guided project, we're going to study the practical side of the algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

1. Learns how humans classify messages.
2. Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
3. Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

# Exploring the Dataset

In [1]:
import pandas as pd

sms_spam = pd.read_csv('SMSSpamCollection', sep='\t',
                      header=None, names=['Label', 'SMS'])

print(sms_spam.shape)
print(sms_spam.head(3))
print(sms_spam.tail(3))

(5572, 2)
  Label                                                SMS
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
     Label                                                SMS
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name


In [2]:
spam_percentage = ((sms_spam["Label"] == "spam").sum() / sms_spam.shape[0])*100
spam_percentage

13.406317300789663

In [3]:
ham_percentage = ((sms_spam["Label"] == "ham").sum() / sms_spam.shape[0] )*100
ham_percentage

86.59368269921033

About 13.41% of the messages in the dataset are spam, and the remaining 86.59% are ham.
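As a shorter route to the same percentages, pandas can compute class proportions in one call. The sketch below uses a small made-up `labels` series (a stand-in for `sms_spam['Label']`, not the real data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy labels standing in for sms_spam['Label']
labels = pd.Series(['ham', 'ham', 'spam', 'ham'], name='Label')

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 turns them into percentages
percentages = labels.value_counts(normalize=True) * 100
print(percentages)
```

On the real dataset, the same call on `sms_spam['Label']` should reproduce the two percentages computed above.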

# Training and Test Set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data). The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
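The split sizes above follow directly from the 80/20 ratio. A minimal sketch of the arithmetic (the variable names are illustrative and not part of the project code):

```python
n_messages = 5572
train_fraction = 0.8

# 5572 * 0.8 = 4457.6, which rounds to 4458 training messages
n_training = round(n_messages * train_fraction)

# Everything that is not used for training goes to the test set
n_test = n_messages - n_training

print(n_training, n_test)
```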

To better understand the purpose of putting a test set aside, let's begin by observing that all 1,114 messages in our test set are already classified by a human. When the spam filter is ready, we're going to treat these messages as new and have the filter classify them. Once we have the results, we'll be able to compare the algorithm classification with that done by a human, and this way we'll see how good the spam filter really is.

For this project, our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

In [4]:
sms_random = sms_spam.sample(frac=1, random_state=1)

sms_training = sms_random[:4458].reset_index()
sms_test = sms_random[5572-1114:].reset_index()

print(sms_test.head(3))
print(sms_training.head(3))

   index Label                                                SMS
0   2131   ham          Later i guess. I needa do mcat study too.
1   3418   ham             But i haf enuff space got like 4 mb...
2   3424  spam  Had your mobile 10 mths? Update to latest Oran...
   index Label                                            SMS
0   1078   ham                   Yep, by the pretty sculpture
1   4028   ham  Yes, princess. Are you going to make me moan?
2    958   ham                     Welp apparently he retired


In [5]:
sms_test_ham_per = ((sms_test['Label'] == 'ham').sum() / sms_test.shape[0])*100
print(sms_test_ham_per)
sms_test_spam_per = ((sms_test['Label'] == 'spam').sum() / sms_test.shape[0] )*100
print(sms_test_spam_per)

86.80430879712748
13.195691202872531


In [6]:
sms_training_ham_per = ((sms_training['Label'] == 'ham').sum() / sms_training.shape[0])*100
print(sms_training_ham_per)
sms_training_spam_per = ((sms_training['Label'] == 'spam').sum() / sms_training.shape[0])*100
print(sms_training_spam_per)

86.54104979811575
13.458950201884253


The percentages of spam and ham messages in both the training set and the test set are similar to those in the full dataset (approximately 87% ham and 13% spam).


# Letter Case and Punctuation

In [7]:
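# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
sms_training['SMS'] = sms_training['SMS'].str.replace(r'\W', ' ', regex=True)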
sms_training['SMS'] = sms_training['SMS'].str.lower()

sms_training['SMS'].head()

0                         yep  by the pretty sculpture
1        yes  princess  are you going to make me moan 
2                           welp apparently he retired
3                                              havent 
4    i forgot 2 ask ü all smth   there s a card on ...
Name: SMS, dtype: object
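To see what this cleaning step does to a single message, here is a small standalone sketch using Python's `re` module. The helper name `clean` is an assumption made for illustration; the notebook applies the same two steps through pandas string methods.

```python
import re

def clean(message):
    # Replace every non-word character (punctuation, spaces, symbols)
    # with a space, then convert the result to lowercase
    return re.sub(r'\W', ' ', message).lower()

print(clean("Yep, by the pretty sculpture"))
print(clean("WINNER!! C3421").split())
```

Each punctuation mark becomes a space, so a comma followed by a space leaves two consecutive spaces. This is harmless, because the later `split()` call discards repeated whitespace.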

# Creating the Vocabulary

In [8]:
sms_training['SMS'] = sms_training['SMS'].str.split()

vocabulary = []

for sms in sms_training['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

# The Final Training Set

In [9]:
word_counts_per_sms = {unique_word: [0] * len(sms_training['SMS'])
                       for unique_word in vocabulary}

for index, sms in enumerate(sms_training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

dataframe = pd.DataFrame(word_counts_per_sms)
concat = pd.concat([sms_training, dataframe], axis=1)

concat.head(3)

Unnamed: 0,index,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,1078,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,4028,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,958,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
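The word-count table can be hard to read at full size, so the sketch below runs the same counting logic on two made-up tokenized messages. The `messages` list is hypothetical, and the vocabulary is sorted here only to make the printed result predictable.

```python
# Two hypothetical messages, already cleaned and split into words
messages = [['secret', 'prize', 'prize'], ['see', 'you', 'secret']]

# Unique words across all messages (sorted only for a stable display order)
vocabulary = sorted({word for sms in messages for word in sms})

# One list per word, with one counter slot per message
word_counts_per_sms = {word: [0] * len(messages) for word in vocabulary}

for index, sms in enumerate(messages):
    for word in sms:
        word_counts_per_sms[word][index] += 1

print(word_counts_per_sms)
```

Each key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`.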


# Calculating Constants First

In [10]:
spam_messages = concat[concat['Label'] == 'spam']
ham_messages = concat[concat['Label'] == 'ham']

#P(Spam) & P(Ham)
p_spam = len(spam_messages) / len(concat)
p_ham = len(ham_messages) / len(concat)

#N_spam
n_words_spam_messages = spam_messages['SMS'].apply(len)
n_spam = n_words_spam_messages.sum()

#N_ham
n_words_ham_messages = ham_messages['SMS'].apply(len)
n_ham = n_words_ham_messages.sum()

#N_vocabulary
n_vocab = len(vocabulary)

#alpha
alpha = 1

# Calculating Parameters

Now that we have the constant terms calculated above, we can move on with calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter will thus be a conditional probability value associated with each word in the vocabulary.
The parameters are calculated using the formulas:
$$
P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
$$

$$
P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$
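To make the formulas concrete, here is a small worked example with made-up counts. The function name `smoothed_probability` and the numbers are illustrative assumptions, not values from the dataset.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocab, alpha=1):
    # (N_{w_i|Class} + alpha) / (N_Class + alpha * N_Vocabulary)
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocab)

# Hypothetical counts: 7 words in all spam messages, 9 words in the vocabulary
p_seen = smoothed_probability(2, n_class=7, n_vocab=9)    # word seen twice in spam
p_unseen = smoothed_probability(0, n_class=7, n_vocab=9)  # word never seen in spam

print(p_seen)    # (2 + 1) / (7 + 9) = 0.1875
print(p_unseen)  # (0 + 1) / (7 + 9) = 0.0625
```

With $\alpha = 1$, a word that never appears in spam messages still gets a small non-zero probability, so a single unseen word cannot force the whole product to zero.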

In [11]:
ham_dict = {unique_word: 0 for unique_word in vocabulary}
spam_dict = {unique_word: 0 for unique_word in vocabulary}
    
for word in vocabulary:
    n_words_given_spam = spam_messages[word].sum()
    p_words_given_spam = (n_words_given_spam + alpha) / (n_spam + alpha * n_vocab)
    spam_dict[word] = p_words_given_spam
    
    n_words_given_ham = ham_messages[word].sum()
    p_words_given_ham = (n_words_given_ham + alpha) / (n_ham + alpha * n_vocab)
    ham_dict[word] = p_words_given_ham

# Classifying a New Message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:
- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
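The comparison described above multiplies many small probabilities, and for very long messages that product can underflow to zero in floating point. One common variation, which is not what this project uses, is to compare sums of logarithms instead. The sketch below illustrates that variation with made-up parameters; `classify_log`, `spam_params` and `ham_params` are hypothetical names.

```python
import math

def classify_log(words, p_spam, p_ham, spam_params, ham_params):
    # Start from the log of the prior probabilities
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)

    for word in words:
        # Words outside the vocabulary are skipped
        if word in spam_params:
            log_spam += math.log(spam_params[word])
        if word in ham_params:
            log_ham += math.log(ham_params[word])

    # The logarithm is increasing, so comparing logs gives the same
    # decision as comparing the original products
    if log_ham > log_spam:
        return 'ham'
    if log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

# Hypothetical parameters for a two-word vocabulary
spam_params = {'prize': 0.3, 'see': 0.05}
ham_params = {'prize': 0.01, 'see': 0.2}

print(classify_log(['prize', 'prize'], 0.13, 0.87, spam_params, ham_params))
print(classify_log(['see', 'unknown'], 0.13, 0.87, spam_params, ham_params))
```

For typical SMS lengths the plain product is fine, which is why the project keeps the simpler form.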

In [12]:
import re

def classify(message):
    
    # Raw string avoids Python's invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    #calculating p_spam_given_message and
    #p_ham_given_message
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
        
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        return('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        return('Label: Spam')
    else:
        return('Equal probabilities, have a human classify this')

In [13]:
print(classify('WINNER!! This is the secret code to unlock the money: C3421'))
print(classify('Sounds good, Tom, then see u there'))

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam
P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham


# Measuring the Spam Filter's Accuracy

In [14]:
def classify_sms_test(message):
    
    # Raw string avoids Python's invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()
    
    #calculating p_spam_given_message and
    #p_ham_given_message
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
    
    if p_ham_given_message > p_spam_given_message:
        return('ham')
    elif p_ham_given_message < p_spam_given_message:
        return('spam')
    else:
        return('Have a human classify this')
        
sms_test.head()

   index Label                                                SMS
0   2131   ham          Later i guess. I needa do mcat study too.
1   3418   ham             But i haf enuff space got like 4 mb...
2   3424  spam  Had your mobile 10 mths? Update to latest Oran...
3   1538   ham  All sounds good. Fingers . Makes it difficult ...
4   5393   ham  All done, all handed in. Don't know if mega sh...


In [15]:
sms_test['predicted'] = sms_test['SMS'].apply(classify_sms_test)
sms_test.head()

   index Label                                                SMS predicted
0   2131   ham          Later i guess. I needa do mcat study too.       ham
1   3418   ham             But i haf enuff space got like 4 mb...       ham
2   3424  spam  Had your mobile 10 mths? Update to latest Oran...      spam
3   1538   ham  All sounds good. Fingers . Makes it difficult ...       ham
4   5393   ham  All done, all handed in. Don't know if mega sh...       ham


In [16]:
correct = 0
total = len(sms_test)

for index, row in sms_test.iterrows():
    if row["Label"] == row["predicted"]:
        correct +=1
        
print("Correct: ", correct)
print("Total: ", total)
print("Accuracy: ", correct/total)


Correct:  1100
Total:  1114
Accuracy:  0.9874326750448833


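The filter reaches an accuracy of about 98.74% on the test set (1,100 of 1,114 messages classified correctly). This is well above the 80% goal we set, so the filter is good enough for us to use.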
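One way to put an accuracy figure in context is to compare it with a trivial baseline that labels every message as ham. Since about 86.8% of the test messages are ham, that baseline would already score about 86.8% there. The sketch below shows the comparison on made-up labels; the `accuracy` helper and the toy lists are illustrative assumptions.

```python
def accuracy(actual, predicted):
    # Fraction of positions where the predicted label matches the actual one
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels, not taken from the dataset
actual = ['ham', 'ham', 'spam', 'ham', 'spam']
predicted = ['ham', 'ham', 'spam', 'spam', 'spam']

# Baseline that ignores the message and always answers 'ham'
always_ham = ['ham'] * len(actual)

print(accuracy(actual, predicted))   # 4 of 5 correct
print(accuracy(actual, always_ham))  # 3 of 5 correct
```

With pandas, the same accuracy can be computed without an explicit loop as `(sms_test['Label'] == sms_test['predicted']).mean()`.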