# Building a Spam Filter with Naive Bayes

In this project, we're going to study the practical side of the Naive Bayes algorithm by building a spam filter for SMS messages.

To classify messages as spam or non-spam, we saw in the previous mission that the computer:

- Learns how humans classify messages.
- Uses that human knowledge to estimate probabilities for new messages — probabilities for spam and non-spam.
- Classifies a new message based on these probability values — if the probability for spam is greater, then it classifies the message as spam. Otherwise, it classifies it as non-spam (if the two probability values are equal, then we may need a human to classify the message).

So our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that are already classified by humans.
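
For reference, the quantities computed in the rest of this project can be written compactly. This is a restatement of the usual multinomial Naive Bayes formulation with additive (Laplace) smoothing, not something specific to this dataset. The symbols are chosen to mirror the variable names used in the code cells (`p_spam`, `n_spam`, `n_vocabulary`, `alpha`), and the same two formulas apply to Ham.

```latex
P(\text{Spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam})

P(w_i \mid \text{Spam}) = \frac{N_{w_i \mid \text{Spam}} + \alpha}{N_{\text{Spam}} + \alpha \cdot N_{\text{Vocabulary}}}
```

Here $N_{w_i \mid \text{Spam}}$ is the number of times the word $w_i$ occurs in spam messages, $N_{\text{Spam}}$ is the total number of words in spam messages, and $N_{\text{Vocabulary}}$ is the number of unique words in the training set.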

## Exploring the data

In [1]:
import pandas as pd

sms_spam = pd.read_csv("SMSSpamCollection", sep='\t', header = None, names=['Label', 'SMS'])

In [2]:
print(sms_spam.shape)

(5572, 2)


In [3]:
sms_spam["Label"].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [4]:
sms_spam["Label"].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

In [5]:
sms_spam.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Our goal is to create a spam filter that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

## Test Set and Training Set

We're going to keep 80% of our dataset for training, and 20% for testing (we want to train the algorithm on as much data as possible, but we also want to have enough test data).

In [6]:
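# Randomize the dataset (random_state=1 makes the shuffle reproducible)

randomized_data = sms_spam.sample(frac = 1, random_state = 1)

# Row index at which to split: the first 80% of rows go to training
training_test_index = round(len(randomized_data)*0.8)

training_set = randomized_data[:training_test_index].reset_index(drop=True)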

In [7]:
test_set = randomized_data[training_test_index:].reset_index(drop=True)

In [8]:
print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [9]:
test_set["Label"].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

In [10]:
training_set["Label"].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

Now that the dataset is split into a training set and a test set, the next big step is to use the training set to teach the algorithm to classify new messages.

## Data Cleaning

To calculate all the probabilities the algorithm requires, we first need to clean the data and bring it into a format that makes it easy to extract the information we need.

## Letter Case and Punctuation

In [11]:
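# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)
training_set['SMS'] = training_set['SMS'].str.lower()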

training_set.head()

| | Label | SMS |
|---|---|---|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |
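
To see what this cleaning step does to a single message, here is a minimal sketch using the standard `re` module with the same `\W` pattern. The helper name `clean` and the sample text are made up for illustration and are not part of the dataset.

```python
import re

def clean(message):
    """Replace every non-word character with a space, then lowercase."""
    return re.sub(r'\W', ' ', message).lower()

# Hypothetical sample message, not taken from the dataset
sample = "Free entry!! Txt WIN to 80086, it's FREE."
tokens = clean(sample).split()
print(tokens)
```

Note that `\W` keeps letters, digits, and underscores, so numbers such as `80086` survive as tokens, while the apostrophe in `it's` splits the word into `it` and `s`.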


## Creating the vocabulary

In [12]:
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []

for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)

In [13]:
vocabulary = list(set(vocabulary))

In [14]:
len(vocabulary)

7783

It looks like there are 7,783 unique words in all the messages of our training set.
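
The same idea on a toy scale: the sketch below collects the unique words from two tokenized messages with a set comprehension. The messages are invented for illustration, and `sorted` is used here only to make the order deterministic (the `list(set(...))` call above returns the words in arbitrary order).

```python
# Hypothetical tokenized messages, standing in for training_set['SMS']
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
]

# A set keeps one copy of each word; sorting gives a stable order
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})
print(toy_vocabulary)
print(len(toy_vocabulary))
```

The word `secret` appears in both messages but is counted once, so nine tokens yield eight vocabulary entries.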



## Create a Dictionary and a Dataframe out of it

In [15]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [16]:
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()

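| | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |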


In [17]:
training_set_clean = pd.concat([training_set, word_counts], axis =1)

In [18]:
training_set_clean.head(3)

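| | Label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |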
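
Because the real table has thousands of columns, it can help to trace the counting loop on a tiny input. The sketch below runs the same dictionary-of-lists logic on three invented messages using plain Python only; every name prefixed with `toy_` is an assumption made for this example.

```python
# Hypothetical tokenized messages
toy_messages = [
    ['secret', 'prize', 'claim', 'now'],
    ['coming', 'to', 'my', 'secret', 'party'],
    ['secret', 'secret', 'prize'],
]
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list per word, with one slot per message, all starting at zero
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}

# Increment the slot of the current message each time a word occurs
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

print(toy_counts['secret'])
print(toy_counts['party'])
```

Each dictionary key becomes a column and each list position becomes a row once the dictionary is passed to `pd.DataFrame`, which is why every list must have exactly one entry per message.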


## Calculating the Constants

In [19]:
#Isolating spam and ham messages

spam_messages = training_set_clean[training_set_clean["Label"] == "spam"]
ham_messages = training_set_clean[training_set_clean["Label"] == "ham"]

#Calculate P(Spam), P(Ham)

p_spam = len(spam_messages)/len(training_set_clean)
p_ham = len(ham_messages)/len(training_set_clean)

#N(Spam), N(Ham)

n_words_per_spam_message = spam_messages["SMS"].apply(len)
n_spam = n_words_per_spam_message.sum()

n_words_per_ham_message = ham_messages["SMS"].apply(len)
n_ham = n_words_per_ham_message.sum()


In [20]:
#N(Vocabulary)

n_vocabulary = len(vocabulary)

#Laplace smoothing

alpha = 1

## Calculating Parameters

In [21]:
#Initiate Parameters

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

#Calculate parameters

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()
    p_word_given_spam =(n_word_given_spam + alpha) / (n_spam + (alpha * n_vocabulary))
    
    n_word_given_ham = ham_messages[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocabulary))
    
    parameters_spam[word] = p_word_given_spam
    parameters_ham[word] = p_word_given_ham
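
A small worked example may make the smoothing formula more concrete. The sketch below applies the same expression, `(count + alpha) / (n_spam + alpha * n_vocabulary)`, to invented counts and uses exact fractions so the arithmetic is easy to check by hand. All numbers and `toy_` names are assumptions for illustration.

```python
from fractions import Fraction

alpha = 1                 # Laplace smoothing, as in the cell above
toy_n_vocabulary = 4      # hypothetical vocabulary size
toy_n_spam = 6            # hypothetical total number of words in spam messages

# Hypothetical word counts within spam messages (they sum to toy_n_spam)
toy_counts_in_spam = {'secret': 3, 'prize': 2, 'claim': 1, 'party': 0}

toy_parameters_spam = {
    word: Fraction(count + alpha, toy_n_spam + alpha * toy_n_vocabulary)
    for word, count in toy_counts_in_spam.items()
}

for word, p in toy_parameters_spam.items():
    print(word, p)
print('sum:', sum(toy_parameters_spam.values()))
```

Two things are visible here: `party` never occurs in spam yet still receives a non-zero probability of 1/10, and the smoothed probabilities over the whole vocabulary still sum to 1.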

## Classifying A New Message


In [22]:
import re

def classify(message):

    # Raw string avoids Python's invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]
        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')

In [23]:
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27
Label: Spam


In [24]:
classify("Sounds good, Tom, then see u there")

P(Spam|message): 2.4372375665888117e-25
P(Ham|message): 3.687530435009238e-21
Label: Ham
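
The values printed above are already as small as 1e-27. SMS messages are short, so this is unlikely to cause trouble here, but on longer texts a product of many small probabilities can underflow to zero. A common variation, which is not what the `classify` function above does, is to add logarithms instead of multiplying. The sketch below shows that variation with invented parameters; `log_score` and all `toy_` values are hypothetical.

```python
import math

def log_score(words, prior, parameters):
    """Sum log-probabilities instead of multiplying probabilities."""
    score = math.log(prior)
    for word in words:
        if word in parameters:        # unknown words are skipped, as in classify()
            score += math.log(parameters[word])
    return score

# Hypothetical priors and per-word parameters
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_parameters_spam = {'secret': 0.4, 'prize': 0.3, 'party': 0.1}
toy_parameters_ham = {'secret': 0.2, 'prize': 0.05, 'party': 0.5}

words = ['secret', 'prize', 'unknownword']
log_spam = log_score(words, toy_p_spam, toy_parameters_spam)
log_ham = log_score(words, toy_p_ham, toy_parameters_ham)

# Comparing logs gives the same winner as comparing products;
# in this sketch a tie falls through to 'ham'
label = 'spam' if log_spam > log_ham else 'ham'
print(label)

# Why it matters: 400 factors of 0.001 underflow to 0.0 as a product
product = 1.0
for _ in range(400):
    product *= 1e-3
log_total = 400 * math.log(1e-3)
print(product, log_total)
```

Because the logarithm is strictly increasing, the comparison between the two classes is unchanged; only the scale of the numbers differs.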


## Measuring the Spam Filter's Accuracy

In [25]:
def classify_test_set(message):

    # Raw string avoids Python's invalid-escape warning for '\W'
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in parameters_spam:
            p_spam_given_message *= parameters_spam[word]

        if word in parameters_ham:
            p_ham_given_message *= parameters_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [26]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)
test_set.head()

| | Label | SMS | predicted |
|---|---|---|---|
| 0 | ham | Later i guess. I needa do mcat study too. | ham |
| 1 | ham | But i haf enuff space got like 4 mb... | ham |
| 2 | spam | Had your mobile 10 mths? Update to latest Oran... | spam |
| 3 | ham | All sounds good. Fingers . Makes it difficult ... | ham |
| 4 | ham | All done, all handed in. Don't know if mega sh... | ham |


In [27]:
correct = 0
total = test_set.shape[0]

In [28]:
# iterrows() yields (index, row) pairs; the index is not needed here
for _, row in test_set.iterrows():
    if row["Label"] == row["predicted"]:
        correct += 1

print("Correct: ",correct)
print("Incorrect: ", total - correct)
print("Accuracy: ",(correct*100)/total,"%")

Correct:  1100
Incorrect:  14
Accuracy:  98.74326750448833 %


The accuracy is about 98.74%, well above our 80% target. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 1,100 of them correctly.
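
One caveat when reading this number: about 87% of the test messages are ham, so a filter that always answered "ham" would already score roughly 87% accuracy. Counting the four outcome types separately gives a fuller picture. The sketch below does this in plain Python on invented labels; `confusion_counts` is a hypothetical helper, and it assumes that any prediction other than `'spam'` (including `'needs human classification'`) is treated as a non-spam prediction.

```python
def confusion_counts(actual, predicted, positive='spam'):
    """Return (true pos, false pos, false neg, true neg) for the positive label."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # ham flagged as spam
        elif a == positive:
            fn += 1          # spam that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels, not the real test set
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)   # share of spam predictions that were right
recall = tp / (tp + fn)      # share of actual spam that was caught
print(tp, fp, fn, tn)
print(precision, recall)
```

With the real data, the same function could be called as `confusion_counts(test_set['Label'], test_set['predicted'])`.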



## Next steps

In this project, we built a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter reached an accuracy of 98.74% on the test set, comfortably above our initial goal of more than 80%.

Next steps include:

- Analyze the 14 messages that were classified incorrectly and try to figure out why the algorithm got them wrong.
- Make the filtering process more complex by making the algorithm sensitive to letter case.
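
As a starting point for the letter-case experiment, the cleaning steps could be gathered into one function with a switch for lowercasing. This is only a sketch: `tokenize` and its `lowercase` parameter are hypothetical names, and whichever setting is chosen would have to be used both when building the vocabulary and when classifying new messages.

```python
import re

def tokenize(message, lowercase=True):
    """Split a message into word tokens, optionally preserving letter case."""
    message = re.sub(r'\W', ' ', message)
    if lowercase:
        message = message.lower()
    return message.split()

print(tokenize('WINNER!! Claim now'))
print(tokenize('WINNER!! Claim now', lowercase=False))
```

With `lowercase=False`, `WINNER` and `winner` become separate vocabulary entries, so the vocabulary grows and each word has fewer occurrences to estimate its probability from.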