# Creating a Spam Filter #

In this project, we build a spam filter for SMS messages using the Naive Bayes algorithm. Our goal is a filter with more than 80% accuracy. The dataset can be downloaded from [here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/#composition).

This is a guided project that I completed while learning on DataQuest, an online educational platform for data science.

In [1]:
import pandas as pd

## Read the data set 

In [2]:
df_sms = pd.read_csv('SMSSpamCollection',sep='\t',header=None,
                    names=['label','SMS'])

In [3]:
df_sms.head()

|   | label | SMS |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
df_sms.shape

(5572, 2)

There are 5572 entries. 

In [5]:
df_sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [6]:
df_sms.isnull().sum()

label    0
SMS      0
dtype: int64

There are no null values, so the dataset needs no cleaning for missing data.

In [7]:
df_sms['label'].value_counts(normalize=True)

ham     0.865937
spam    0.134063
Name: label, dtype: float64

The classes are imbalanced: about 87% of the entries are ham (non-spam) and about 13% are spam.

## Creating a training and test data set ##

We will keep 80% of the dataset as a training set and the remaining 20% as a test set. The test set will be used to measure the accuracy of our spam filter.

In [8]:
# randomize our dataset 
df_random = df_sms.sample(frac=1,random_state=1)

In [9]:
df_random.shape[0]*0.8

4457.6

In [10]:
# 80% of 5572 rows is 4457.6, so the first 4458 rows form the training set
training_df = df_random.iloc[:4458,:]
test_df = df_random.iloc[4458:,:]

In [11]:
# reset indices since they are meaningless
training_df.reset_index(inplace=True,drop=True)
test_df.reset_index(inplace=True,drop=True)

In [12]:
training_df['label'].value_counts(normalize=True)

ham     0.86541
spam    0.13459
Name: label, dtype: float64

In [13]:
test_df['label'].value_counts(normalize=True)

ham     0.868043
spam    0.131957
Name: label, dtype: float64

Good. The label distributions of both data sets are close to that of the full data set.
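The cutoff of 4458 rows is hard-coded above, but it can also be derived from the length of the data. The following is a minimal sketch, assuming a hypothetical helper named `split_frame` and a made-up toy DataFrame (neither is part of the original notebook):

```python
import pandas as pd

def split_frame(df, train_frac=0.8, seed=1):
    """Shuffle df and split it into a training and a test part.

    split_frame is a hypothetical helper, not part of the original notebook.
    """
    shuffled = df.sample(frac=1, random_state=seed)
    # for the SMS data: round(5572 * 0.8) == 4458
    cutoff = round(len(shuffled) * train_frac)
    train = shuffled.iloc[:cutoff].reset_index(drop=True)
    test = shuffled.iloc[cutoff:].reset_index(drop=True)
    return train, test

# made-up data standing in for df_sms
toy = pd.DataFrame({'label': ['ham'] * 8 + ['spam'] * 2,
                    'SMS': ['message %d' % i for i in range(10)]})
toy_train, toy_test = split_frame(toy)
print(len(toy_train), len(toy_test))
```

With ten toy rows this yields eight training rows and two test rows. Computing the cutoff from `len(df)` keeps the split at 80/20 if the data set changes size.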

## Data Transformation 

We now transform the training data set to make the calculations easier.

Let's begin by removing non-word characters, such as punctuation marks, from the ```SMS``` column.
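To see what this cleaning step does to a single message, here is a small sketch on a made-up string (the sample text is an assumption for illustration, not taken from the data set). In a regular expression, `\W` matches any character that is not a letter, digit, or underscore:

```python
import re

# made-up sample message
sample = "Free entry!! Don't miss it."

# replace every non-word character with a space, then lowercase
cleaned = re.sub(r'\W', ' ', sample).lower()

# splitting on whitespace gives the individual words
tokens = cleaned.split()
print(tokens)  # ['free', 'entry', 'don', 't', 'miss', 'it']
```

Note that the apostrophe is also replaced, so "Don't" becomes the two tokens `don` and `t`. Accented letters such as `ü` count as word characters in Python 3 and are kept.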

In [14]:
# work on an explicit copy so that the assignments below
# do not raise SettingWithCopyWarning
training_df = training_df.copy()

# replace non-word characters with spaces
# (recent pandas versions need regex=True to treat '\W' as a pattern)
training_df['SMS'] = training_df['SMS'].str.replace(r'\W', ' ', regex=True)

In [15]:
# convert all strings to lowercase
training_df['SMS'] = training_df['SMS'].str.lower()


In [16]:
training_df.head()

|   | label | SMS |
|---|-------|-----|
| 0 | ham | yep by the pretty sculpture |
| 1 | ham | yes princess are you going to make me moan |
| 2 | ham | welp apparently he retired |
| 3 | ham | havent |
| 4 | ham | i forgot 2 ask ü all smth there s a card on ... |


From here, we will expand the ```SMS``` column into many columns, one for each unique word in our SMS messages. Each value is the number of times that word occurs in the message.

First, we will make a list of all unique words in the messages of our training data set.

In [17]:
training_copy = training_df.copy()

In [18]:
training_copy['SMS'] = training_copy['SMS'].str.split()

In [19]:
vocabulary = [] 
for item in training_copy['SMS']:
    for word in item:
        vocabulary.append(word)

In [20]:
#remove duplicates by using set()
vocabulary = list(set(vocabulary))

Next, we create one column for each unique word in ```vocabulary```. The values give the frequency of the word in each message.

In [21]:
word_counts_per_sms = {unique_word:[0]*len(training_copy) for unique_word in vocabulary}

for index, sms in enumerate(training_copy['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] +=1 


In [22]:
vocab_count = pd.DataFrame(word_counts_per_sms)

In [23]:
vocab_count.head()

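|   | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | 0207 | 02072069400 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |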


In [24]:
# concatenate a data set describing frequency of words and the original data set. 
training_with_counts = pd.concat([training_copy,vocab_count],axis=1)

In [25]:
training_with_counts.head()

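|   | label | SMS | 0 | 00 | 000 | 000pes | 008704050406 | 0089 | 01223585334 | 02 | ... | zindgi | zoe | zogtorius | zouk | zyada | é | ú1 | ü | 〨ud | 鈥 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | [yep, by, the, pretty, sculpture] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | ham | [yes, princess, are, you, going, to, make, me,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | ham | [welp, apparently, he, retired] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ham | [havent] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ham | [i, forgot, 2, ask, ü, all, smth, there, s, a,... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 |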


Each row now has one column per vocabulary word, holding the frequency of that word in the SMS message.
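The same transformation is easier to follow on a tiny input. The sketch below repeats the counting approach on two made-up, already tokenized messages (the `toy_` names and the messages are assumptions for illustration only):

```python
import pandas as pd

# two made-up messages, already cleaned and split into words
toy_sms = [['free', 'prize', 'free'], ['see', 'you', 'soon']]

# sorted list of unique words across all messages
toy_vocabulary = sorted({word for sms in toy_sms for word in sms})

# one list of per-message counts for every word in the vocabulary
toy_counts = {word: [0] * len(toy_sms) for word in toy_vocabulary}
for index, sms in enumerate(toy_sms):
    for word in sms:
        toy_counts[word][index] += 1

# one row per message, one column per word
toy_table = pd.DataFrame(toy_counts)
print(toy_table)
```

The first row has a 2 in the `free` column and a 1 in the `prize` column, and the second row has a 1 in each of `see`, `soon`, and `you`.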

## Building a Filter 
Now that the data is transformed, we can build the filter. We are going to calculate the quantities needed for Naive Bayes. For more information on Naive Bayes, see [here](https://en.wikipedia.org/wiki/Naive_Bayes_classifier). If you need a refresher on Bayes' Theorem, this [YouTube video](https://youtu.be/HZGCoVF3YvM) is a good resource.
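For reference, the two quantities that the filter compares can be written as follows, where $w_1, \dots, w_n$ are the words of a message. This is the usual Naive Bayes formulation under the assumption that words are conditionally independent given the class; the notation is chosen here and does not come from the original notebook.

```latex
\begin{align}
P(\text{Spam} \mid w_1, \dots, w_n) &\propto P(\text{Spam}) \prod_{i=1}^{n} P(w_i \mid \text{Spam}) \\
P(\text{Ham} \mid w_1, \dots, w_n) &\propto P(\text{Ham}) \prod_{i=1}^{n} P(w_i \mid \text{Ham})
\end{align}
```

A message is labeled spam when the first quantity is larger than the second, and ham when it is smaller.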

In [26]:
#calculate p(spam) and p(ham)
prob = training_with_counts['label'].value_counts(normalize=True)
p_spam = prob['spam']
p_ham = prob['ham']

In [27]:
#separate our training df into spam and ham
#(explicit copies, so new columns can be added without SettingWithCopyWarning)
spam = training_with_counts[training_with_counts['label']=='spam'].copy()
ham = training_with_counts[training_with_counts['label']=='ham'].copy()

Next, we calculate the number of words in spam messages, the number of words in ham messages, and the size of the vocabulary.

In [28]:
# create a new column holding the number of words in each spam message
spam['total'] = spam['SMS'].apply(len)

In [29]:
# create a new column holding the number of words in each ham message
ham['total'] = ham['SMS'].apply(len)


In [30]:
#number of words in spam and ham
num_spam = spam['total'].sum()
num_ham = ham['total'].sum()

In [31]:
# number of unique words
num_vocab = len(vocabulary)

We are going to use Laplace smoothing so that a word that never occurs in one class does not get a probability of zero. We will set alpha = 1. For more information on Laplace smoothing, see [here](https://en.wikipedia.org/wiki/Additive_smoothing).
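As a worked example, the smoothed estimate is P(word|class) = (count of the word in the class + alpha) / (total number of words in the class + alpha * vocabulary size). The sketch below applies it to made-up counts (all numbers and the `toy_` names are assumptions chosen only to illustrate the arithmetic):

```python
# made-up counts, chosen only to illustrate the formula
toy_alpha = 1
toy_n_vocab = 5        # number of unique words in the vocabulary
toy_n_spam = 10        # total number of words across all spam messages
toy_n_word_spam = 0    # this word never occurs in a spam message

# without smoothing, the estimate is zero and would wipe out the whole product
unsmoothed = toy_n_word_spam / toy_n_spam

# with Laplace smoothing: (0 + 1) / (10 + 1 * 5) = 1/15
smoothed = (toy_n_word_spam + toy_alpha) / (toy_n_spam + toy_alpha * toy_n_vocab)
print(unsmoothed, smoothed)
```

Because every word gets a small non-zero probability, one word that is missing from the spam messages can no longer force the spam score of a message to zero.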

In [32]:
alpha = 1 

In [33]:
#create two dictionaries to store parameters 
ham_prob = {unique_word:0 for unique_word in vocabulary}
spam_prob = {unique_word:0 for unique_word in vocabulary}

In [34]:
# calculate p(word|spam) and p(word|ham), and store the probabilities into
# the two dictionaries we made above
for word in vocabulary:
    num_word_spam = spam[word].sum() 
    prob_word_spam = (num_word_spam+alpha) / (num_spam+alpha*num_vocab)
    spam_prob[word] = prob_word_spam
    
    
    num_word_ham = ham[word].sum() 
    prob_word_ham = (num_word_ham+alpha) / (num_ham+alpha*num_vocab)
    ham_prob[word] = prob_word_ham

## Creating the filter function

In [35]:
import re

def classify(message):

    # clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham 
    
    for word in message:
        # words outside the training vocabulary are ignored;
        # a dictionary lookup is faster than searching the vocabulary list
        if word in spam_prob:
            p_spam_given_message *= spam_prob[word]
            p_ham_given_message  *= ham_prob[word]

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [36]:
# let's check our filter a little bit 
classify('WINNER!! This is the secret code to unlock the money: C3421.')

P(Spam|message): 1.3481290211300841e-25
P(Ham|message): 1.9368049028589875e-27


'spam'

The result looks reasonable: the message is classified as spam.
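The decision rule can be checked by hand on a very small case. The following sketch uses made-up priors and word probabilities for a three-word vocabulary (every number and every `toy_` name is an assumption for illustration, not a value from the trained filter):

```python
# made-up parameters for a three-word vocabulary (illustration only)
toy_p_spam, toy_p_ham = 0.2, 0.8
toy_spam_prob = {'free': 0.5, 'prize': 0.4, 'lunch': 0.1}
toy_ham_prob = {'free': 0.2, 'prize': 0.1, 'lunch': 0.7}

def toy_classify(words):
    """Multiply each prior by P(word|class) for every known word."""
    score_spam, score_ham = toy_p_spam, toy_p_ham
    for word in words:
        if word in toy_spam_prob:  # unknown words are skipped
            score_spam *= toy_spam_prob[word]
            score_ham *= toy_ham_prob[word]
    if score_ham > score_spam:
        return 'ham'
    if score_ham < score_spam:
        return 'spam'
    return 'needs human classification'

# spam score 0.2 * 0.5 * 0.4 = 0.04, ham score 0.8 * 0.2 * 0.1 = 0.016
print(toy_classify(['free', 'prize', 'today']))  # spam
# spam score 0.2 * 0.1 = 0.02, ham score 0.8 * 0.7 = 0.56
print(toy_classify(['lunch']))                   # ham
```

The two scores are proportional to the posterior probabilities rather than equal to them. Dividing each score by the sum of both would give values that add up to 1, but the comparison gives the same label either way.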

## Test Accuracy

Let's define the function again, this time without the print statements.

In [37]:
import re

def classify(message):

    # clean the message the same way as the training data
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    # start from the priors P(Spam) and P(Ham)
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham 
    
    for word in message:
        # words outside the training vocabulary are ignored
        if word in spam_prob:
            p_spam_given_message *= spam_prob[word]
            p_ham_given_message  *= ham_prob[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [38]:
# make a column containing the predicted labels of the SMS messages in test_df
# (assigning to an explicit copy avoids SettingWithCopyWarning)
test_df = test_df.copy()
test_df['predicted'] = test_df['SMS'].apply(classify)


In [39]:
# calculate the accuracy of our spam filter 
correct = 0 
total = len(test_df)

for index, series in test_df.iterrows():
    if series['label'] == series['predicted']:
        correct += 1
accuracy = correct / total

In [40]:
accuracy

0.9874326750448833
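Accuracy is the number of correct predictions divided by the total number of predictions. As a sketch of an equivalent way to compute it without an explicit loop, the example below uses a made-up results table (the `toy_` names and the five rows are assumptions, standing in for `test_df`):

```python
import pandas as pd

# made-up labels and predictions, standing in for test_df
toy_results = pd.DataFrame({
    'label':     ['ham', 'ham', 'spam', 'ham', 'spam'],
    'predicted': ['ham', 'ham', 'spam', 'spam', 'spam'],
})

# fraction of rows where the prediction matches the label (4 of 5 here)
toy_accuracy = (toy_results['label'] == toy_results['predicted']).mean()
print(toy_accuracy)

# rows: true label, columns: predicted label
toy_table = pd.crosstab(toy_results['label'], toy_results['predicted'])
print(toy_table)
```

The cross-tabulation is optional. It shows which kind of mistake was made, for example ham messages predicted as spam, which a single accuracy number does not reveal.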

# Conclusion # 
Our goal was a spam filter with more than 80% accuracy, and the filter we built reaches about 98.7%! In other words, it correctly classified 98.7% of the messages in our test data.