# Introduction 


In this project, we're going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80%, meaning that more than 80% of new messages should be classified correctly as spam or ham (non-spam). To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans.

## Exploring the Dataset


In [4]:
import pandas as pd 

sms_spam = pd.read_csv('Downloads/Datasets/SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [5]:
sms_spam

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [6]:
print(sms_spam['Label'].value_counts(normalize=True)*100)

Label
ham     86.593683
spam    13.406317
Name: proportion, dtype: float64


- About 13.4% of the messages are labeled as spam; the remaining 86.6% are ham.

## Training and Test Sets

We are now going to split our dataset into a training set and a test set: 80% of the messages will make up the training set and the remaining 20% the test set. The dataset has 5,572 messages, which means that:

- The training set will have 4,458 messages (about 80% of the dataset).
- The test set will have 1,114 messages (about 20% of the dataset).
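
The two sizes follow from rounding 80% of 5,572 to a whole number of rows. A quick check of that arithmetic, using variable names chosen here for illustration only:

```python
# Reproduce the split sizes quoted above for a 5,572-row dataset.
n_messages = 5572

# Round 80% of the rows to get the size of the training set.
n_training = round(n_messages * 0.8)

# Everything that is not in the training set goes to the test set.
n_test = n_messages - n_training

print(n_training, n_test)
```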

In [7]:
data_randomized = sms_spam.sample(frac=1, random_state=1)

training_test_index = round(len(data_randomized)*0.8)

training = data_randomized[:training_test_index].reset_index(drop=True)
test = data_randomized[training_test_index:].reset_index(drop=True)

In [8]:
print(training['Label'].value_counts(normalize=True))
print(test['Label'].value_counts(normalize=True))

Label
ham     0.86541
spam    0.13459
Name: proportion, dtype: float64
Label
ham     0.868043
spam    0.131957
Name: proportion, dtype: float64


- Both the training and test sets are representative of the original dataset: each contains roughly 87% ham and 13% spam, close to the 86.6% / 13.4% split of the full dataset.

## Data cleaning 

To calculate all the probabilities required by the algorithm, we first need to do a bit of data cleaning to bring the data into a format from which we can easily extract the information we need: we replace punctuation with spaces and convert every message to lowercase.

In [9]:
import string
import re 

punct_pattern = f"[{re.escape(string.punctuation)}]"
training['SMS'] = training['SMS'].str.replace(punct_pattern, ' ',regex=True)
training['SMS'] = training['SMS'].str.lower()
training.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
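
To see what the cleaning step does to one message, here is a minimal sketch that applies the same pattern to a plain string. The `clean` helper and the sample text are invented for illustration and are not part of the notebook.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
punct_pattern = f"[{re.escape(string.punctuation)}]"

def clean(message):
    """Replace punctuation with spaces, then lowercase (hypothetical helper)."""
    return re.sub(punct_pattern, " ", message).lower()

cleaned = clean("Don't miss out!! Call NOW.")
print(cleaned.split())

# string.punctuation is ASCII-only, so symbols such as £ are left untouched.
print(clean("Txt STOP£1"))
```

Because only ASCII punctuation is replaced, tokens containing other symbols (for example a pound sign) survive the cleaning unchanged.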


## Creating the vocabulary 

Next, we split each message into a list of words, collect every unique word into a vocabulary, and count how many unique words there are:

In [10]:
training['SMS'] = training['SMS'].str.split()

vocabulary = []
for sms in training['SMS']:
    for word in sms:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

In [11]:
len(vocabulary)

7858

It looks like there are 7,858 unique words in all the messages of our training set.
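
The same idea on a toy example, assuming a handful of invented, already tokenized messages in place of `training['SMS']`. A set comprehension is used here as a compact alternative to the nested loop; sorting is only there to make the result easy to read.

```python
# Toy tokenized messages standing in for training['SMS'] (invented data).
toy_messages = [
    ["free", "entry", "win", "win"],
    ["ok", "see", "u", "there"],
    ["win", "free", "cash"],
]

# A set keeps each distinct word exactly once, however often it occurs.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

print(toy_vocabulary)
print(len(toy_vocabulary))
```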



## The Final Training Set

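We're now going to use the vocabulary we just created to transform the data: each message becomes a row, each unique word becomes a column, and each cell holds the number of times that word occurs in that message.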



In [12]:
word_counts_per_sms = {unique_word: [0] * len(training['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [13]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

Unnamed: 0,deposit,callertune,birthday,minuts,prevent,colleg,online,2nhite,buff,1st,...,singapore,08712466669,neglet,stoptxtstop£1,base,xxxx,hoo,destiny,hamster,registration
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
training_set_clean = pd.concat([training, word_counts], axis=1)
training_set_clean.head()

Unnamed: 0,Label,SMS,deposit,callertune,birthday,minuts,prevent,colleg,online,2nhite,...,singapore,08712466669,neglet,stoptxtstop£1,base,xxxx,hoo,destiny,hamster,registration
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
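
The transformation above can be traced by hand on a tiny input. This sketch uses two invented messages and follows the same dictionary-of-lists approach, where position `i` of each list is the count of that word in message `i`.

```python
# Two invented tokenized messages standing in for training['SMS'].
toy_messages = [["win", "cash", "win"], ["see", "u"]]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list of zeros per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of the current message each time a word occurs in it.
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts)
```

Passing a dictionary shaped like `toy_counts` to `pd.DataFrame` yields one column per word and one row per message, which is the layout shown in the output above.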


## Calculating constants

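We're now done with cleaning the training set, and we can begin creating the spam filter. We start with the values that stay the same for every new message: P(Spam) and P(Ham), the total number of words in spam messages (N_Spam) and in ham messages (N_Ham), the vocabulary size (N_Vocabulary), and the Laplace smoothing parameter alpha.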

In [20]:
spam_messages = training_set_clean[training_set_clean['Label']=='spam']
ham_messages = training_set_clean[training_set_clean['Label']=='ham']

# Prior probabilities P(Spam) and P(Ham)
p_spam = len(spam_messages)/len(training_set_clean['Label'])
p_ham = len(ham_messages)/len(training_set_clean['Label'])

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1


In [22]:
parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

for word in vocabulary:
    n_wordgivenspam = spam_messages[word].sum()
    p_wordgivenspam = (n_wordgivenspam + alpha)/(n_spam + alpha*n_vocabulary)
    parameters_spam[word] = p_wordgivenspam
    
for word in vocabulary:
    n_wordgivenham = ham_messages[word].sum()
    p_wordgivenham = (n_wordgivenham + alpha)/(n_ham + alpha*n_vocabulary)
    parameters_ham[word] = p_wordgivenham
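
Written as formulas, the two loops above compute one smoothed probability per vocabulary word and class. The notation below mirrors the names in the code comments (N_Spam, N_Ham, N_Vocabulary); the subscripted counts are a labeling chosen here for `n_wordgivenspam` and `n_wordgivenham`.

```latex
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
```

With alpha = 1, a word that never occurs in spam messages still gets the small non-zero probability 1 / (N_Spam + N_Vocabulary), so a single unseen word cannot force the whole product to zero.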

## Classifying a new message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

- Takes in as input a new message (w1, w2, ..., wn).
- Calculates P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn).
- Compares the values of P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn), and:
    - If P(Ham|w1, w2, ..., wn) > P(Spam|w1, w2, ..., wn), then the message is classified as ham.
    - If P(Ham|w1, w2, ..., wn) < P(Spam|w1, w2, ..., wn), then the message is classified as spam.
    - If P(Ham|w1, w2, ..., wn) = P(Spam|w1, w2, ..., wn), then the algorithm may request human help.
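
For reference, the quantities compared in these steps are the standard Naive Bayes scores, which is also what the function in the next cell multiplies together:

```latex
P(Spam \mid w_1, \ldots, w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)
\qquad
P(Ham \mid w_1, \ldots, w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)
```

Both sides omit the same normalizing constant, so the printed values are unnormalized scores rather than probabilities that sum to 1; only their comparison matters. Words that are not in the vocabulary are skipped and contribute no factor.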

In [32]:


def classify(message):
    
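    # Raw string avoids an invalid-escape warning. Note: \W also strips symbols such as £,
    # which the punctuation-based cleaning of the training set left in place.
    message = re.sub(r'\W', ' ', message)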
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
            
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
            
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)
    
    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [33]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3194255327164684e-25
P(Ham|message): 1.9325207278425043e-27
Label: Spam


In [34]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.3967807399651816e-25
P(Ham|message): 3.681184738543211e-21
Label: Ham
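
The scores above are already around 1e-25 for messages of a few words. SMS messages are short, so this is unlikely to be a problem here, but for much longer texts a plain product can underflow to 0.0 for both classes. A common workaround is to compare sums of logarithms instead. The sketch below uses invented probabilities and a hypothetical `log_score` helper; it is not part of the notebook's filter.

```python
import math

def log_score(prior, word_probs):
    """Log of prior * product(word_probs), computed as a sum (hypothetical helper)."""
    return math.log(prior) + sum(math.log(p) for p in word_probs)

# Invented parameters: a 400-word message with one fixed probability per word.
spam_probs = [1e-4] * 400
ham_probs = [2e-4] * 400

# The plain products are far below the smallest representable float.
product_spam = 0.13 * math.prod(spam_probs)
product_ham = 0.87 * math.prod(ham_probs)
print(product_spam, product_ham)

# The log scores stay finite and can still be ranked.
print(log_score(0.13, spam_probs), log_score(0.87, ham_probs))
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products are representable.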


## Measuring the Spam Filter's Accuracy 

The two results above look promising, but let's see how well the filter does on our test set, which has 1,114 messages.

We'll start by writing a function that returns classification labels instead of printing them.

In [35]:
def classify_test(message):
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split() 
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham: 
            p_ham_given_message *= parameters_ham[word]
            
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [36]:
test['predicted'] = test['SMS'].apply(classify_test)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


## Accuracy of the algorithm 

In [39]:
correct = 0 
total = test.shape[0]

for _, row in test.iterrows():
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Total:', total)
print('Accuracy:', correct/total)

Correct: 1100
Total: 1114
Accuracy: 0.9874326750448833


The accuracy is about 98.74%, which is really good. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 1,100 of them correctly.
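
Accuracy alone does not say which kind of mistake the filter makes, and since about 86.8% of the test messages are ham, a filter that always answered "ham" would already score about 86.8%. One way to look closer is to count each (actual, predicted) pair. The sketch below uses short invented label lists in place of `test['Label']` and `test['predicted']`.

```python
from collections import Counter

# Invented labels standing in for test['Label'] and test['predicted'].
actual = ["ham", "ham", "spam", "spam", "ham", "spam"]
predicted = ["ham", "ham", "spam", "ham", "ham", "spam"]

# Count how often each (actual, predicted) combination occurs.
pair_counts = Counter(zip(actual, predicted))

# Accuracy is the share of positions where the two labels agree.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(pair_counts)
print('Spam classified as ham:', pair_counts[('spam', 'ham')])
print('Ham classified as spam:', pair_counts[('ham', 'spam')])
print('Accuracy:', accuracy)
```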

## Conclusion 

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 98.74% on the test set we used, which is a pretty good result. Our initial goal was an accuracy of over 80%, and we managed to do way better than that.