# Creating a spam filter using a multinomial Naive Bayes algorithm
The objective of this project is to create an SMS spam filter using a multinomial Naive Bayes algorithm. The dataset of 5,572 classified text messages can be found [here](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

Naive Bayes uses the historical probability of previous events (in our case, words appearing in spam messages) to predict the likelihood of current events (is a message containing these words spam?).

## Reading and Exploring the data

In [250]:
import pandas as pd
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])
data.head(10)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


What percentage of these are spam? What percentage are not?

In [251]:
data['Label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: Label, dtype: float64

Out of 5,572 text messages, 87% are not spam ('ham') while 13% are spam.

## Dividing data into training and testing sets
Let's keep 80% of the data for training, shuffling it first so that the ham/spam distribution is as even as possible across both sets.

In [252]:
data = data.sample(frac=1, random_state = 1)
threshold = round(.8*len(data))
train = data.iloc[:threshold,:].copy().reset_index(drop=True)
test = data.iloc[threshold:,:].copy().reset_index(drop=True)
print(train['Label'].value_counts(normalize=True))
print(test['Label'].value_counts(normalize=True))

ham     0.86541
spam    0.13459
Name: Label, dtype: float64
ham     0.868043
spam    0.131957
Name: Label, dtype: float64


The distribution is more or less the same as for the original dataset. Good!
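
A random shuffle happened to preserve the proportions here, but that is not guaranteed for every seed. As a minimal sketch of an alternative, the split can be stratified by label so that each class keeps the same share in both sets. The helper name `stratified_split` and the toy data are assumptions made for illustration; they are not part of the notebook above.

```python
import pandas as pd

def stratified_split(df, label_col="Label", train_frac=0.8, seed=1):
    """Split df so each label keeps roughly the same share in both parts."""
    # Sample train_frac of the rows within each label group.
    train_part = df.groupby(label_col, group_keys=False).sample(
        frac=train_frac, random_state=seed
    )
    # Everything that was not sampled goes to the test part.
    test_part = df.drop(train_part.index)
    return train_part.reset_index(drop=True), test_part.reset_index(drop=True)

# Toy data (made up): 80 ham and 20 spam messages.
toy = pd.DataFrame({
    "Label": ["ham"] * 80 + ["spam"] * 20,
    "SMS": [f"message {i}" for i in range(100)],
})
toy_train, toy_test = stratified_split(toy)
print(toy_train["Label"].value_counts(normalize=True))
print(toy_test["Label"].value_counts(normalize=True))
```

With this toy input both parts contain exactly 20% spam, because the sampling is done inside each label group rather than over the whole table.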

## Data cleaning
The training data needs to be cleaned in order to apply the algorithm. First:

- Case needs to be uniform, so we will set every message to lower case.
- Punctuation needs to be removed.

In [253]:
train.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [254]:
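# regex=True is required in recent pandas versions for the pattern to be treated as a regular expression
train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex=True)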
train['SMS'] = train['SMS'].str.lower()
train.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
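
To make the effect of the `\W` pattern concrete, here is a small sketch on plain strings. The helper name `tokenize` and the sample strings are made up for illustration. Note that `\W` matches anything that is not a letter, digit, or underscore, so digits, underscores, and accented letters such as `ü` survive the cleaning.

```python
import re

def tokenize(message):
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    return re.sub(r"\W", " ", message).lower().split()

first = tokenize("Yes, princess. There's a card!")
second = tokenize("WIN £1000 now_or_never")
print(first)
print(second)
```

The apostrophe in "There's" is replaced by a space, which is why a lone `s` shows up as its own token in the cleaned messages.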


Now that the data is clean, we want to transform it into numerical values. Each message will be represented by the number of times each word appears. To do this, we must first create a list of words.

### Creating vocabulary set

In [255]:
# changing each message into a list
train['SMS'] = train['SMS'].str.split()
# grouping up all the lists into a vocabulary list
vocabulary = []
for sms in train['SMS']:
    for word in sms:
        vocabulary.append(word)
        
# removing duplicates by turning into unique set then back to list
vocabulary = set(vocabulary)
vocabulary = list(vocabulary)
len(vocabulary)

7783

We now have a vocabulary of 7,783 unique words.

### Encoding the data as word frequency values
Let's count how many times each word of the vocabulary appears in each text message. We build a dictionary in which each key is a unique word and the corresponding value is a list holding that word's frequency in each message. This dictionary can then easily be turned into dataframe columns.

In [256]:
word_counts = {}
for unique_word in vocabulary:
    word_counts[unique_word] = [0]*len(train)

for index,sms in enumerate(train['SMS']):
    for word in sms:
        word_counts[word][index] +=1

word_counts = pd.DataFrame(word_counts)
word_counts.head()

Unnamed: 0,unmits,wherever,4few,size,appear,favorite,andros,0789xxxxxxx,cooped,general,...,schools,cliff,idu,night,box334,kinda,teams,gram,sittin,ironing
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
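
Because the real table is almost entirely zeros, the structure is easier to see on a tiny example. The sketch below repeats the same counting logic on two made-up, already tokenized messages; the `toy_` names are assumptions for illustration only.

```python
import pandas as pd

# Two already-tokenized toy messages (made up for illustration).
toy_messages = [["free", "prize", "free"], ["see", "you", "later"]]

# Sorted only so that the column order is reproducible in this sketch.
toy_vocabulary = sorted({word for sms in toy_messages for word in sms})

# One list per word, with one slot per message.
toy_counts = {word: [0] * len(toy_messages) for word in toy_vocabulary}
for index, sms in enumerate(toy_messages):
    for word in sms:
        toy_counts[word][index] += 1

toy_counts = pd.DataFrame(toy_counts)
print(toy_counts)
```

Each row is one message and each column is one vocabulary word, so the row sums equal the message lengths in words.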


We can now concatenate these columns with our original data.

In [257]:
train = pd.concat([train,word_counts], axis=1)
train

Unnamed: 0,Label,SMS,unmits,wherever,4few,size,appear,favorite,andros,0789xxxxxxx,...,schools,cliff,idu,night,box334,kinda,teams,gram,sittin,ironing
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,ham,"[ok, i, thk, i, got, it, then, u, wan, me, 2, ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,ham,"[i, want, kfc, its, tuesday, only, buy, 2, mea...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,ham,"[no, dear, i, was, sleeping, p]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,ham,"[ok, pa, nothing, problem]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,ham,"[ill, be, there, on, lt, gt, ok]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Implementing Naive Bayes
### Calculating the Parameters
To use the algorithm, we first need to calculate:
- The probability of spam and non-spam
- The number of words in all spam messages
- The number of words in all non-spam messages
- The number of unique words in the vocabulary

We also need an alpha value for our smoothing equation, which we will set at alpha = 1
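
For reference, the smoothed estimate these quantities feed into can be written as follows. The notation is introduced here for illustration: $N_{w_i \mid Spam}$ is the number of times word $w_i$ occurs in spam messages, $N_{Spam}$ is the total number of words in spam messages, and $N_{Vocabulary}$ is the vocabulary size. The ham version is analogous.

$$
P(w_i \mid Spam) = \frac{N_{w_i \mid Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}
\qquad
P(w_i \mid Ham) = \frac{N_{w_i \mid Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}
$$

With $\alpha = 1$, a word that never appears in spam still gets a small non-zero probability instead of zeroing out the whole product.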

In [258]:
alpha = 1

p_spam = train['Label'].value_counts(normalize=True)['spam']
p_ham = train['Label'].value_counts(normalize=True)['ham']
# Sum only the word-count columns; Label and SMS are not numeric
N_spam = train.loc[train['Label'] == 'spam', vocabulary].sum().sum()
N_ham = train.loc[train['Label'] == 'ham', vocabulary].sum().sum()
N_voc = len(vocabulary)

In [259]:
print(p_spam, p_ham, N_spam, N_ham, N_voc)

0.13458950201884254 0.8654104979811574 15190 57237 7783


Next, we need to calculate, for each word, the conditional probability that it occurs given that the message is spam, and the probability that it occurs given that the message is not spam.

In [260]:
word_probs_given_spam = {unique_word : 0 for unique_word in vocabulary}
word_probs_given_ham = {unique_word : 0 for unique_word in vocabulary}

spam = train[train['Label'] == 'spam'].copy()
ham = train[train['Label'] == 'ham'].copy()
for unique_word in vocabulary:
    word_probs_given_spam[unique_word] = (spam[unique_word].sum() + alpha) / (N_spam + alpha * N_voc)
    word_probs_given_ham[unique_word] = (ham[unique_word].sum() + alpha) / (N_ham + alpha * N_voc)    
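
As a sanity check on the smoothing, the probabilities for one class should add up to 1 across the whole vocabulary. The sketch below shows this on a made-up three-word count table, computing all probabilities at once from the column sums instead of looping; the `toy_` and lowercase `n_` names are assumptions for illustration.

```python
import pandas as pd

alpha = 1  # same smoothing value as in the text

# Toy word-count table for spam messages only (columns are words, rows are messages).
toy_spam_counts = pd.DataFrame({"free": [2, 1], "prize": [1, 0], "hello": [0, 0]})

n_spam = toy_spam_counts.to_numpy().sum()  # total words in the toy spam messages
n_voc = toy_spam_counts.shape[1]           # toy vocabulary size

# Column sums give per-word counts; the whole Series is smoothed in one step.
probs_given_spam = (toy_spam_counts.sum() + alpha) / (n_spam + alpha * n_voc)
print(probs_given_spam)
print(probs_given_spam.sum())
```

The word `hello` never occurs in the toy spam messages, yet it still receives a probability of 1/7 thanks to the smoothing term.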

### Creating the filter
With the parameters calculated, we can define a function that takes in a message, computes the probabilities of it being spam or ham, and returns the corresponding classification.
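
Written out, the comparison such a function makes for a message with words $w_1, \dots, w_n$ is the following (words that are not in the vocabulary are simply skipped):

$$
P(Spam \mid w_1, \dots, w_n) \propto P(Spam) \prod_{i=1}^{n} P(w_i \mid Spam)
\qquad
P(Ham \mid w_1, \dots, w_n) \propto P(Ham) \prod_{i=1}^{n} P(w_i \mid Ham)
$$

The message is labelled with whichever class gives the larger value.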

In [261]:
import re

def classify(message):
    # Apply the same cleaning as for the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower().split()
    
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in word_probs_given_spam:
            p_spam_given_message *= word_probs_given_spam[word]
            
        if word in word_probs_given_ham:
            p_ham_given_message *= word_probs_given_ham[word]
    
    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'
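
One caveat with multiplying many small probabilities: for a long message both products can underflow to `0.0`, in which case the function ends in the tie branch regardless of the words involved. A common workaround is to add logarithms instead of multiplying. The sketch below is one possible variant, not the function used in this notebook; `classify_log` and the toy parameter dictionaries are hypothetical and passed in explicitly so the example is self-contained.

```python
import math
import re

def classify_log(message, p_spam, p_ham, word_probs_given_spam, word_probs_given_ham):
    """Same comparison as the product version, done as a sum of logarithms."""
    words = re.sub(r"\W", " ", message).lower().split()

    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the product version.
        if word in word_probs_given_spam:
            log_spam += math.log(word_probs_given_spam[word])
        if word in word_probs_given_ham:
            log_ham += math.log(word_probs_given_ham[word])

    if log_ham > log_spam:
        return "ham"
    if log_spam > log_ham:
        return "spam"
    return "needs human classification"

# Toy parameters, invented for illustration only.
toy_spam = {"free": 1e-2, "prize": 1e-2, "hello": 1e-4}
toy_ham = {"free": 1e-4, "prize": 1e-4, "hello": 1e-2}

long_message = "free prize " * 200  # 400 words

# The plain product of 400 small factors underflows to zero.
product = 0.5
for _ in range(400):
    product *= 1e-2
print(product == 0.0)

print(classify_log(long_message, 0.5, 0.5, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing the sums gives the same decision as comparing the products whenever the products can be represented at all.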

## Applying the algorithm to our test set
Everything is ready. Let's apply the algorithm to each message in our test set.

In [262]:
test['predicted'] = test['SMS'].apply(classify)
test.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


How well did our predictions do?

In [267]:
total = len(test)
# Count the rows where the prediction matches the true label
correct = (test['Label'] == test['predicted']).sum()

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833
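
Accuracy alone does not show which class the errors fall in, which matters when only about 13% of messages are spam. One way to break the result down is a confusion table built with `pd.crosstab`. The sketch below uses made-up labels and predictions, not the notebook's results.

```python
import pandas as pd

# Toy labels and predictions, made up for illustration (not the notebook's results).
toy = pd.DataFrame({
    "Label":     ["ham", "ham", "ham", "spam", "spam", "ham"],
    "predicted": ["ham", "ham", "spam", "spam", "ham", "ham"],
})

# Rows are the true labels, columns are the predictions.
confusion = pd.crosstab(toy["Label"], toy["predicted"])
print(confusion)

# Share of real spam that was caught (recall for the spam class).
spam_recall = confusion.loc["spam", "spam"] / confusion.loc["spam"].sum()
print(spam_recall)
```

The same two lines applied to `test['Label']` and `test['predicted']` would show how the incorrect predictions split between missed spam and ham flagged as spam.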


Our algorithm was 98.7% accurate!
Let's see which messages were incorrectly classified, and see if we can improve our algorithm.

In [276]:
wrong_class = test[test['Label'] != test['predicted']]
for index,row in wrong_class.iterrows():
    print('IS ACTUALLY', row['Label'], 'GUESSED', row['predicted'], '\n', row['SMS'],'\n')

IS ACTUALLY spam GUESSED ham 
 Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net 

IS ACTUALLY spam GUESSED ham 
 More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB 

IS ACTUALLY ham GUESSED spam 
 Unlimited texts. Limited minutes. 

IS ACTUALLY ham GUESSED spam 
 26th OF JULY 

IS ACTUALLY ham GUESSED spam 
 Nokia phone is lovly.. 

IS ACTUALLY ham GUESSED needs human classification 
 A Boy loved a gal. He propsd bt she didnt mind. He gv lv lttrs, Bt her frnds threw thm. Again d boy decided 2 aproach d gal , dt time a truck was speeding towards d gal. Wn it was about 2 hit d girl,d boy ran like hell n saved her. She asked 'hw cn u run so fast?' D boy replied "Boost is d secret of my energy" n instantly d girl shouted "our energy" n Thy lived happily 2gthr drinking boost evryd

Looking at these messages gives no clear pattern of errors. With nearly 99% accuracy, our algorithm is already pretty good.

## Conclusion
By training a Naive Bayes algorithm on a set of pre-classified messages, we were able to predict whether messages were spam with 98.7% accuracy.