# SMS Spam Filter

In this project we will build a spam filter for SMS messages. Our first task is to "teach" the computer how to classify messages. To do that, we'll use the multinomial Naive Bayes algorithm along with a dataset of 5,572 SMS messages that have already been classified by humans.

The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. You can also download the dataset directly from this link. The data collection process is described in more detail on this page, where you can also find some of the authors' papers.

Let's start with reading in the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# The file is tab-separated and has no header row, so we name the columns ourselves.
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [3]:
data.shape

(5572, 2)

In [4]:
data.head()


Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
data.tail()

Unnamed: 0,Label,SMS
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...
5571,ham,Rofl. Its true to its name


In [6]:
data['Label'].value_counts(normalize=True) * 100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

The dataset contains 5,572 messages: about 87% are ham (non-spam) and the remaining 13% are spam.

Once our spam filter is done, we'll need to test how well it classifies new messages. To do that, we first split the dataset into two parts:
- A training set, which we'll use to "train" the computer how to classify messages.
- A test set, which we'll use to test how well the spam filter classifies new messages.

To appreciate why we set aside a test set, note that all 1,114 messages in it have already been classified by a human. Once the spam filter is ready, we'll treat these messages as new and have the filter classify them. We can then compare the algorithm's classifications with the human ones and see how effective the spam filter actually is.

# Main goal

The goal of this project is to build a spam filter that classifies new messages with an accuracy greater than 80%. In other words, we expect more than 80% of the new messages to be classified correctly as spam or ham (non-spam).

Before we return to testing, let's create the training set and the test set. We'll start by randomizing the full dataset so that spam and ham messages are spread evenly across both sets.

# Training and Test sets

Our dataset will now be divided into two sets: a training set that contains 80% of the data and a test set that contains the remaining 20%.

In [7]:
data_randomized = data.sample(frac=1, random_state=1)
indexes = round(len(data_randomized) *0.8)

training_set = data_randomized[:indexes].reset_index(drop=True)
test_set = data_randomized[indexes:].reset_index(drop=True)

Now let's check the percentages of spam and ham in both new datasets.

In [8]:
training_set['Label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: Label, dtype: float64

In [9]:
test_set['Label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: Label, dtype: float64

The percentages of spam and ham are almost the same in both dataframes and close to those of the full dataset, so the sampling went well.
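
The shuffle-and-split step can be tried on a small toy table first. The sketch below assumes a hypothetical ten-row dataframe in place of the real data; the names `toy`, `cutoff`, `toy_training`, and `toy_test` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the SMS dataset.
toy = pd.DataFrame({
    'Label': ['ham'] * 8 + ['spam'] * 2,
    'SMS': ['message {}'.format(i) for i in range(10)],
})

# Shuffle all rows reproducibly, then cut at the 80% mark.
toy_randomized = toy.sample(frac=1, random_state=1)
cutoff = round(len(toy_randomized) * 0.8)

# The first 80% of the shuffled rows train the filter, the rest test it.
toy_training = toy_randomized[:cutoff].reset_index(drop=True)
toy_test = toy_randomized[cutoff:].reset_index(drop=True)

print(len(toy_training), len(toy_test))
```

Because `frac=1` samples every row exactly once, the two parts never overlap and together contain the whole table.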

# Data Cleaning

To make the word counts consistent, we will remove all punctuation and convert every letter to lower case.

### Letter Case and Punctuation

In [10]:
training_set.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [11]:
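# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default.
training_set['SMS'] = training_set['SMS'].str.replace(r'\W', ' ', regex=True)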
training_set['SMS']=training_set['SMS'].str.lower()


In [12]:
training_set

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...
5,ham,ok i thk i got it then u wan me 2 come now or...
6,ham,i want kfc its tuesday only buy 2 meals only ...
7,ham,no dear i was sleeping p
8,ham,ok pa nothing problem
9,ham,ill be there on lt gt ok
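
To see what the cleaning step does to a single message, here is a minimal sketch that uses Python's `re` module instead of the pandas string methods. The helper name `clean` is hypothetical and is not part of the notebook.

```python
import re

def clean(message):
    # Replace every non-word character with a space, then lower-case the text.
    return re.sub(r'\W', ' ', message).lower()

example = "Yes, princess. There's a card!"
print(clean(example))
print(clean(example).split())
```

Note that the apostrophe is also replaced, so "There's" ends up as the two tokens "there" and "s", which matches the cleaned rows shown above. Letters such as "ü" count as word characters and are kept.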


### Creating a vocabulary

Now that the messages are cleaned, let's create the vocabulary: a list of every distinct word in our training set.



In [13]:
# Split each message into a list of words, then collect the unique words.
training_set['SMS'] = training_set['SMS'].str.split()
vocabulary = []
for sms in training_set['SMS']:
    for word in sms:
        vocabulary.append(word)
vocabulary = list(set(vocabulary))


In [14]:
len(vocabulary)

7783

There are 7,783 unique words in our vocabulary.
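
The same idea can be written more compactly with a set comprehension. This is a sketch on hypothetical toy data (`toy_tokenized` and `toy_vocabulary` are illustrative names); sorting is added only to make the result reproducible, since sets have no fixed order.

```python
# Hypothetical tokenized messages standing in for training_set['SMS'].
toy_tokenized = [['yep', 'by', 'the'], ['the', 'guy'], ['yep']]

# Collect each distinct word once, then sort for a stable order.
toy_vocabulary = sorted({word for sms in toy_tokenized for word in sms})

print(toy_vocabulary)
```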

In [15]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [16]:
word_counts=pd.DataFrame(word_counts_per_sms)
training_set_clean = pd.concat([training_set,word_counts], axis =1)

In [17]:
training_set_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0
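
The table above has one column per vocabulary word and one row per message, and each cell holds the number of times that word occurs in that message. A minimal sketch of the same construction on two hypothetical messages (all `toy_` names are illustrative):

```python
import pandas as pd

# Hypothetical tokenized messages.
toy_sms = [['secret', 'prize', 'prize'], ['see', 'you', 'prize']]
toy_vocab = sorted({word for sms in toy_sms for word in sms})

# Start with a zero count for every (word, message) pair.
toy_counts_per_sms = {word: [0] * len(toy_sms) for word in toy_vocab}

# Increment the count of each word at the position of its message.
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts_per_sms[word][index] += 1

toy_counts = pd.DataFrame(toy_counts_per_sms)
print(toy_counts)
```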


## Creating the spam filter

In [18]:
spam = training_set_clean[training_set_clean['Label']=='spam']
non_spam=training_set_clean[training_set_clean['Label']=='ham']

p_spam = len(spam) / len(training_set_clean)
p_non_spam = len(non_spam) / len(training_set_clean)

n_spam=0
n_non_spam=0
n_vocabulary=len(vocabulary)
alpha = 1 

for i in spam['SMS']:
    n_spam+=len(i)
for i in non_spam['SMS']:
    n_non_spam+=len(i)


In [19]:
print('Probability of spam message: {}\nProbability of non-spam message: {}\nNumber of words in spam messages: {}\nNumber of words in non-spam messages: {}\nNumber of unique words in the vocabulary: {}'.format(p_spam, p_non_spam, n_spam, n_non_spam, n_vocabulary))


Probability of spam message: 0.13458950201884254
Probability of non-spam message: 0.8654104979811574
Number of words in spam messages: 15190
Number of words in non-spam messages: 57237
Number of unique words in the vocabulary: 7783


In [20]:
spam_dic={}
non_spam_dic={}

parameters_spam = {unique_word:0 for unique_word in vocabulary}
parameters_ham = {unique_word:0 for unique_word in vocabulary}

In [21]:
training_spam_set = training_set_clean[training_set_clean['Label']=='spam']
training_nspam_set = training_set_clean[training_set_clean['Label']=='ham']

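# Laplace-smoothed estimate of P(word | class):
# (count of the word in the class + alpha) / (words in the class + alpha * vocabulary size)
for word in vocabulary:
    spam_dic[word] = (training_spam_set[word].sum() + alpha) / (n_spam + alpha * n_vocabulary)
    non_spam_dic[word] = (training_nspam_set[word].sum() + alpha) / (n_non_spam + alpha * n_vocabulary)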
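
The loop above evaluates, for every word, the count of that word in a class plus `alpha`, divided by the total number of words in the class plus `alpha` times the vocabulary size. A worked sketch on hypothetical toy messages may make the arithmetic easier to follow; the `toy_` names and the helper `smoothed` are illustrative only.

```python
toy_alpha = 1

# Hypothetical tokenized messages for each class.
toy_spam = [['win', 'prize', 'now'], ['win', 'cash']]
toy_ham = [['see', 'you', 'now']]

toy_vocab = sorted({word for sms in toy_spam + toy_ham for word in sms})
toy_n_vocab = len(toy_vocab)                    # 6 distinct words
toy_n_spam = sum(len(sms) for sms in toy_spam)  # 5 words in spam messages

def smoothed(word, messages, n_words):
    # Laplace-smoothed probability of `word` given the class of `messages`.
    count = sum(sms.count(word) for sms in messages)
    return (count + toy_alpha) / (n_words + toy_alpha * toy_n_vocab)

p_win_spam = smoothed('win', toy_spam, toy_n_spam)  # (2 + 1) / (5 + 6)
p_see_spam = smoothed('see', toy_spam, toy_n_spam)  # (0 + 1) / (5 + 6)
print(p_win_spam, p_see_spam)
```

The word "see" never occurs in the toy spam messages, yet its probability is not zero. Without smoothing, a single unseen word would force the whole product for that class to zero.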


In [22]:
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    

    p_spam_given_message = p_spam
    p_ham_given_message = p_non_spam
    for word in message:
        if word in spam_dic:
            p_spam_given_message *=spam_dic[word]
        if word in non_spam_dic:
            p_ham_given_message *=non_spam_dic[word]

    

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'
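
One possible variation, which is not what this notebook does, is to add logarithms instead of multiplying probabilities. Products of many small numbers can underflow to zero for very long inputs, although typical SMS messages are probably too short for that to happen. The sketch below uses hypothetical toy parameters (`toy_p_spam`, `toy_p_ham`, `toy_spam_dic`, `toy_ham_dic`) in place of the values computed earlier, and `classify_log` is an illustrative name.

```python
import math
import re

# Hypothetical toy parameters; in the notebook these roles are played by
# p_spam, p_non_spam, spam_dic and non_spam_dic.
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_dic = {'win': 0.3, 'prize': 0.3, 'you': 0.1}
toy_ham_dic = {'win': 0.05, 'prize': 0.05, 'you': 0.4}

def classify_log(message):
    # Same cleaning as classify(): strip punctuation, lower-case, split.
    words = re.sub(r'\W', ' ', message).lower().split()

    # Start from the log of the class priors.
    log_spam = math.log(toy_p_spam)
    log_ham = math.log(toy_p_ham)

    # Adding logs is equivalent to multiplying the probabilities.
    for word in words:
        if word in toy_spam_dic:
            log_spam += math.log(toy_spam_dic[word])
        if word in toy_ham_dic:
            log_ham += math.log(toy_ham_dic[word])

    if log_ham > log_spam:
        return 'ham'
    elif log_ham < log_spam:
        return 'spam'
    return 'needs human classification'

print(classify_log('Win a prize!'))
print(classify_log('See you'))
```

Since the logarithm is monotonic, comparing the two sums gives the same decision as comparing the two products whenever the products do not underflow.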

We'll now try to determine how well the spam filter does on our test set of 1,114 messages.

In [23]:
test_set['predicted']= test_set['SMS'].apply(classify)
test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [24]:
correct = 0 
total = test_set.shape[0]

In [25]:
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct +=1
percentages = correct/total

In [26]:
# percentages holds a fraction between 0 and 1, so format it as a percentage.
print('Correct predictions: {}\nTotal rows: {}\nAccuracy: {:.2%}'.format(correct, total, percentages))

Correct predictions: 1100
Total rows: 1114
Accuracy: 98.74%
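
The same accuracy can be computed without an explicit loop by comparing the two columns directly. Because about 87% of the messages are ham, a filter that labelled everything as ham would already score roughly 87%, so a per-class breakdown can be a useful complement to the single accuracy number. The sketch below runs on a hypothetical five-row results table (`toy_results`), not on the real test set.

```python
import pandas as pd

# Hypothetical results table with the same two columns as test_set.
toy_results = pd.DataFrame({
    'Label':     ['ham', 'ham', 'spam', 'spam', 'ham'],
    'predicted': ['ham', 'ham', 'spam', 'ham', 'ham'],
})

# The mean of a boolean column is the fraction of True values.
toy_accuracy = (toy_results['Label'] == toy_results['predicted']).mean()
print('Accuracy: {:.2%}'.format(toy_accuracy))

# Rows are the human labels, columns are the filter's predictions.
toy_breakdown = pd.crosstab(toy_results['Label'], toy_results['predicted'])
print(toy_breakdown)
```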
