## Building a Naive Bayes algorithm to classify spam messages

The goal is to build a multinomial Naive Bayes classifier that identifies spam text messages. The data set used for the project is available at https://dq-content.s3.amazonaws.com/433/SMSSpamCollection. It consists of 5572 messages, each pre-classified by a human as spam or non-spam ("ham"). The target accuracy is above 80%.

In [2]:
#Reading in file
import pandas as pd
data = pd.read_csv('SMSSpamCollection', sep='\t', header = None)
data.head()

Unnamed: 0,0,1
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [9]:
#label header
data.columns = ['Label','SMS']

#check 'Label' column for values other than 'spam' or 'ham'
data['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [17]:
#find the total number of entries and the proportions labeled 'spam' and 'ham'
total = data.Label.value_counts().sum()
proportion_ham = len(data[data.Label=='ham'])/total
proportion_spam = len(data[data.Label=='spam'])/total
print('Ham: {:.5f}, Spam: {:.5f}, Total: {}'.format(proportion_ham, proportion_spam, total))

Ham: 0.86594, Spam: 0.13406, Total: 5572


In [20]:
#randomize data
randomize_data = data.sample(frac=1, random_state=1) #aim is to randomize whole set and then pick 80%/20% for training and testing sets

#split data into training and testing sets(80/20)
n_train = int(total*0.8)
train_data = randomize_data.iloc[:n_train]
test_data = randomize_data.iloc[n_train:]

#reset index after randomizing the rows
train_data.reset_index(inplace=True, drop=True)
test_data.reset_index(inplace=True, drop=True)

#check whether 'spam' and 'ham' are distributed similarly in both sets - training and testing
#look the proportions up by label rather than by position, so the result does not depend on sort order
print('Train_ham: ',train_data.Label.value_counts(normalize=True)['ham'])
print('Train_spam: ',train_data.Label.value_counts(normalize=True)['spam'])
print('Test_ham: ',test_data.Label.value_counts(normalize=True)['ham'])
print('Test_spam: ',test_data.Label.value_counts(normalize=True)['spam'])

Train_ham:  0.8653803006506618
Train_spam:  0.13461969934933812
Test_ham:  0.8681614349775785
Test_spam:  0.13183856502242153
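
The random split above happens to keep the spam/ham proportions close in both sets. As an alternative (not what this notebook does), a stratified split samples each label separately so the proportions match by construction. A minimal sketch, assuming a hypothetical toy DataFrame `toy` with the same `Label` and `SMS` columns:

```python
import pandas as pd

# Hypothetical toy data: 80 ham and 20 spam messages
toy = pd.DataFrame({
    'Label': ['ham'] * 80 + ['spam'] * 20,
    'SMS': ['message {}'.format(i) for i in range(100)],
})

# Sample 80% of the rows within each label group, then use the rest for testing
train = toy.groupby('Label').sample(frac=0.8, random_state=1)
test = toy.drop(train.index)

print(train['Label'].value_counts(normalize=True))
print(test['Label'].value_counts(normalize=True))
```

With this toy data both sets end up with exactly 80% ham and 20% spam.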


In [25]:
#clean the data in training set

#convert 'SMS' column to lowercase
train_data_clean = train_data.copy()
train_data_clean['SMS'] = train_data_clean['SMS'].str.lower()

#remove punctuation (regex=True is needed in pandas 2.0+, where str.replace matches literally by default)
train_data_clean['SMS'] = train_data_clean['SMS'].str.replace(r'\W', ' ', regex=True)

#split message to list of strings
train_data_clean['SMS'] = train_data_clean['SMS'].str.split()
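
To see what these three cleaning steps do to a single message, here is a small sketch using the standard `re` module. The helper name `tokenize` is an assumption introduced for illustration; it is not part of the notebook.

```python
import re

def tokenize(message):
    """Lowercase the text, replace every non-word character with a space, split on whitespace."""
    cleaned = re.sub(r'\W', ' ', message.lower())
    return cleaned.split()

tokens = tokenize("Free entry in 2 a wkly comp to win FA Cup!")
print(tokens)

# Apostrophes are non-word characters too, so contractions are split in two
print(tokenize("I don't know"))
```

Note that `\W` keeps digits and underscores, so tokens such as `2` or phone numbers survive the cleaning.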

In [34]:
#create vocabulary - list of unique words or symbols from 'SMS' column

#loop through 'SMS' column and add all values to list
temp = []
for item in train_data_clean.loc[:,'SMS']:
    for i in item:
        temp.append(i)

#get unique values and save them back to list
vocabulary = list(set(temp))

In [38]:
#create dataframe to count word occurrences

#create dictionary mapping each unique word in the vocabulary to a list of zeros, one zero per message in the training set
word_count_sms_train = {item: [0] * len(train_data_clean) for item in vocabulary}

#count the occurrences
for index, sms in enumerate(train_data_clean['SMS']):
    for word in sms:
        word_count_sms_train[word][index] += 1

#convert dictionary to DataFrame
word_count_train = pd.DataFrame(word_count_sms_train)

#concat 'Label' and 'SMS' columns together with word_count_train DataFrame
full_df = pd.concat([train_data_clean, word_count_train], axis=1)
full_df.head()

Unnamed: 0,Label,SMS,tobed,gbp,edition,gopalettan,ah,callcost150ppmmobilesvary,gpu,meets,...,costs,diamonds,footbl,81303,wanted,textpod,09099725823,summers,gained,cme
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
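
The table above stores one column per vocabulary word and one row per message, and most of its cells are zero. Since the later steps only need per-class totals, a lighter alternative is to keep one counter per label. This is a sketch on hypothetical toy data, not the approach used in this notebook:

```python
from collections import Counter

# Hypothetical toy training data: (label, token list) pairs
training = [
    ('spam', ['win', 'free', 'prize', 'free']),
    ('ham', ['see', 'you', 'at', 'lunch']),
    ('ham', ['free', 'for', 'lunch']),
]

# One Counter per class accumulates how often each word appears in that class
counts = {'spam': Counter(), 'ham': Counter()}
for label, tokens in training:
    counts[label].update(tokens)

vocabulary = set(counts['spam']) | set(counts['ham'])
n_spam = sum(counts['spam'].values())  # total words in spam messages
n_ham = sum(counts['ham'].values())    # total words in ham messages

print(counts['spam']['free'], counts['ham']['free'], n_spam, n_ham, len(vocabulary))
```

A `Counter` returns 0 for words it has never seen, so no explicit zero-filling is required.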


In [41]:
#find total number of values(words) in training set, as well as total in 'spam' and in 'ham'
N_spam = 0
full_df_spam = full_df[full_df.Label=='spam']
for row in full_df_spam['SMS']:
    N_spam += len(row)
    
N_ham = 0
full_df_ham = full_df[full_df.Label=='ham']
for row in full_df_ham['SMS']:
    N_ham += len(row)
    
N_total = N_spam + N_ham

'N_spam: {}, N_ham: {}, N_total: {}'.format(N_spam, N_ham, N_total)

'N_spam: 15190, N_ham: 57233, N_total: 72423'

In [43]:
#find probabilities of spam and ham messages among training data set
P_spam = len(full_df_spam)/len(full_df)
P_ham = len(full_df_ham)/len(full_df)
'P(Spam):{}, P(Ham):{}'.format(P_spam, P_ham)

'P(Spam):0.13461969934933812, P(Ham):0.8653803006506618'

In [48]:
#calculate the probability of each vocabulary word occurring, given that the message is spam or non-spam

#initialize two dictionaries to save probabilities of P(word|Spam) and P(word|Ham)
dict_spam = {item:0 for item in vocabulary}
dict_ham = {item:0 for item in vocabulary}

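#loop through unique words in vocabulary and calculate the probability of each word given spam or non-spam
#P(word|Spam) = (N(word|Spam) + alpha) / (N(Spam) + alpha * N(vocabulary))
#note: the code below uses N_total (all words in the training set) where the formula has N(vocabulary) = len(vocabulary);
#the outputs shown in the following cells were produced with N_total
alpha = 1 #Laplace smoothing, so that a word never seen in one class does not get a zero probability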
for word in vocabulary:
    N_word_spam = full_df_spam[word].sum()
    P_word_spam=(N_word_spam + alpha) / (N_spam + alpha * N_total)
    dict_spam[word] = P_word_spam

    N_word_ham = full_df_ham[word].sum()
    P_word_ham=(N_word_ham + alpha) / (N_ham + alpha * N_total)
    dict_ham[word] = P_word_ham
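
As a worked example of the smoothing formula in the comment above, the sketch below uses the vocabulary size in the denominator, as the formula states. The function name `smoothed_probability` and the toy counts are assumptions made for illustration.

```python
def smoothed_probability(n_word_in_class, n_words_in_class, n_vocabulary, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (n_word_in_class + alpha) / (n_words_in_class + alpha * n_vocabulary)

# Hypothetical toy class: 4 words in total, drawn from a vocabulary of 8 unique words
word_counts_in_class = [2, 1, 1, 0, 0, 0, 0, 0]
n_words_in_class = sum(word_counts_in_class)
n_vocabulary = len(word_counts_in_class)

p_seen = smoothed_probability(2, n_words_in_class, n_vocabulary)    # (2 + 1) / (4 + 8)
p_unseen = smoothed_probability(0, n_words_in_class, n_vocabulary)  # (0 + 1) / (4 + 8)

# With the vocabulary size in the denominator, the smoothed probabilities sum to 1
total = sum(smoothed_probability(c, n_words_in_class, n_vocabulary)
            for c in word_counts_in_class)
print(p_seen, p_unseen, total)
```

The unseen word gets a small non-zero probability, which keeps a single unknown word from zeroing out the whole product for a class.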

In [49]:
#create function to test any given sentence against the algorithm to see whether it is spam or non-spam message

def classify(message):
    
    #clean the given message
    import re
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    #initiate probabilities
    p_spam_given_message = P_spam
    p_ham_given_message = P_ham

    #multiply by the probability of each known word; words outside the vocabulary are skipped
    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]
        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]


    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Label: undefined (equal probabilities)')

#test the function
classify('Free entry in 2 a wkly comp to win FA Cup')

P(Spam|message): 1.372814009317754e-38
P(Ham|message): 9.219001013815573e-42
Label: Spam
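
The two scores above are already around 1e-38 and 1e-42 for a ten-word message. For much longer messages a product of many small probabilities can underflow to 0.0 for both classes. A common remedy is to add logarithms instead of multiplying. The sketch below is a variant under assumed toy probabilities, not the function used in this notebook; `log_score` and the dictionaries are hypothetical.

```python
import math

def log_score(tokens, prior, word_probs):
    """Sum of log prior and log word probabilities; unknown words are skipped."""
    score = math.log(prior)
    for word in tokens:
        if word in word_probs:
            score += math.log(word_probs[word])
    return score

# Hypothetical toy probabilities
p_spam, p_ham = 0.13, 0.87
spam_probs = {'free': 0.02, 'win': 0.01, 'lunch': 0.001}
ham_probs = {'free': 0.002, 'win': 0.001, 'lunch': 0.01}

tokens = ['free', 'win', 'free']
spam_score = log_score(tokens, p_spam, spam_probs)
ham_score = log_score(tokens, p_ham, ham_probs)
label = 'spam' if spam_score > ham_score else 'ham'
print(label)

# A plain product underflows for long inputs, while the log score stays finite
print(0.001 ** 200)
print(log_score(['lunch'] * 200, p_spam, spam_probs))
```

Because the logarithm is monotonic, comparing log scores gives the same label as comparing the products whenever the products are representable.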


In [56]:
#evaluate how well the algorithm works against test data set

#create a function to run the testing data set

def classify_test_set(message):

    import re
    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = P_spam
    p_ham_given_message = P_ham

    for word in message:
        if word in dict_spam: p_spam_given_message *= dict_spam[word]
        if word in dict_ham: p_ham_given_message *= dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message < p_spam_given_message:
        return 'spam'
    else:
        return 'undefined'
    
#run above function and save result('spam'/'ham') in new column
test_data_results = test_data.copy()
test_data_results['Result'] = test_data_results['SMS'].apply(classify_test_set)

#calculate prediction accuracy
import numpy as np
test_data_results['Match'] = np.where((test_data_results['Result']==test_data_results['Label']) ,True, False)

print(test_data_results['Match'].value_counts())
print('Accuracy: ',len(test_data_results[test_data_results['Match']==True])/len(test_data_results))

True     1077
False      38
Name: Match, dtype: int64
Accuracy:  0.9659192825112107


The prediction accuracy is 96.6%, well above the 80% target.
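
Accuracy alone can be flattering on an imbalanced data set: a model that always answers 'ham' would already be right for about 86.8% of this test set (the Test_ham proportion above). A confusion matrix separates the two kinds of error. The sketch below uses a small hypothetical `results` frame with the same `Label` and `Result` columns as `test_data_results`; the numbers are toy values, not results from this notebook.

```python
import pandas as pd

# Hypothetical toy results: true label vs. predicted label
results = pd.DataFrame({
    'Label':  ['ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham', 'spam'],
    'Result': ['ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham', 'spam'],
})

# Rows are true labels, columns are predictions
confusion = pd.crosstab(results['Label'], results['Result'])
print(confusion)

true_positive = confusion.loc['spam', 'spam']   # spam correctly flagged
false_positive = confusion.loc['ham', 'spam']   # ham wrongly flagged as spam
false_negative = confusion.loc['spam', 'ham']   # spam that slipped through

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
print(precision, recall)
```

Applied to `test_data_results`, the same `pd.crosstab` call would show how the 38 errors divide between missed spam and wrongly flagged ham.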

In [57]:
#take a closer look into the messages which were guessed incorrectly by the model
test_data_results[test_data_results['Match']==False]

Unnamed: 0,Label,SMS,Result,Match
52,spam,FreeMsg: Hey - I'm Buffy. 25 and love to satis...,ham,False
90,spam,goldviking (29/M) is inviting you to be his fr...,ham,False
115,spam,Not heard from U4 a while. Call me now am here...,ham,False
136,spam,More people are dogging in your area now. Call...,ham,False
142,spam,Dear Voucher Holder 2 claim your 1st class air...,ham,False
153,ham,Unlimited texts. Limited minutes.,spam,False
170,spam,Hottest pics straight to your phone!! See me g...,ham,False
181,spam,Win the newest Harry Potter and the Order of ...,ham,False
264,spam,TheMob>Yo yo yo-Here comes a new selection of ...,ham,False
285,ham,Nokia phone is lovly..,spam,False


It is likely that some of the key words in these messages were absent from the training vocabulary and were therefore ignored when the probabilities were calculated. Addressing this requires a larger training set with a wider vocabulary.
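
One way to check this hypothesis is to measure, for each misclassified message, what share of its words the classifier had to skip. A minimal sketch follows, where `unknown_word_share` and the toy vocabulary are assumptions introduced for illustration:

```python
import re

def unknown_word_share(message, vocabulary):
    """Fraction of a message's tokens that are missing from the vocabulary."""
    tokens = re.sub(r'\W', ' ', message.lower()).split()
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocabulary]
    return len(unknown) / len(tokens)

# Hypothetical toy vocabulary; in the notebook this would be the training vocabulary
toy_vocabulary = {'call', 'me', 'now', 'free'}
share = unknown_word_share('Call me now, goldviking!', toy_vocabulary)
print(share)
```

A high share for the misclassified rows would support the explanation above; a low share would point to a different cause.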

Most of the misclassified messages were labeled 'spam' by a human but marked 'ham' by the model, so the model is more likely to mistake spam for non-spam than the other way around. A possible cause is the gap between the prior probabilities P_spam and P_ham (P(Spam): 0.1346, P(Ham): 0.8654), which carry significant weight in the result. Training on a data set with a spam to non-spam ratio closer to 50/50 might reduce this effect.
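
To put a number on the weight of the priors, the sketch below computes the head start that 'ham' receives before any word is considered, using the P_spam and P_ham values printed earlier. It only quantifies the prior; it does not show that rebalancing the training data would improve accuracy.

```python
import math

# Priors reported earlier in the notebook
P_spam = 0.13461969934933812
P_ham = 0.8653803006506618

# Every message starts with this factor in favour of 'ham'
prior_ratio = P_ham / P_spam
log_prior_ratio = math.log(prior_ratio)

print('Prior ratio ham/spam: {:.2f}'.format(prior_ratio))
print('Log prior ratio: {:.2f}'.format(log_prior_ratio))
```

The product of word probabilities for 'spam' therefore has to exceed the one for 'ham' by a factor of roughly 6.4 before a message is labeled spam.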