## Building a Spam Filet with Naive Bayes

In this project, the aim is to build a Naive Bayes algorithm that can classify messages as spam or non-spam. The data used can be downloaded [here](https://dq-content.s3.amazonaws.com/433/SMSSpamCollection) 

In [2]:
#importing packages
import pandas as pd

In [3]:
file = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Label', 'SMS'])

In [4]:
file.head()

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [5]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [6]:
file['Label'].value_counts(normalize=True)*100

ham     86.593683
spam    13.406317
Name: Label, dtype: float64

From the data exploration, the data was found to have 5572 rows and 2 columns. Of the classification, 87% is non spam while 13% is spam.

### Training & Testing Set

80% of the  dataset will be used for training while 20% will be for testing (to train the algorithm on as much data as possible, but also have enough test data). 

In [7]:
randomized = file.sample(frac=1, random_state=1)

# Calculate index for split
training_test_index = round(len(randomized) * 0.8)

# Training/Test split
training_set = randomized[:training_test_index].reset_index(drop=True)
test_set = randomized[training_test_index:].reset_index(drop=True)

print(training_set.shape)
print(test_set.shape)

(4458, 2)
(1114, 2)


In [8]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
Label    5572 non-null object
SMS      5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


In [9]:
training_set['Label'].value_counts(normalize=True)*100

ham     86.54105
spam    13.45895
Name: Label, dtype: float64

In [10]:
test_set['Label'].value_counts(normalize=True)*100

ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The dataset has been divided into training and testing set both of which have similar distribution of spam and non-spam labels (compared with the original dataset).

### Data Cleaning

Punctuations in the messages and mixed cases will be treated in the next couple of cells.

In [11]:
training_set['SMS'] = training_set['SMS'].str.replace('\W', ' ')
training_set['SMS'] = training_set['SMS'].str.lower()
# training_set['SMS'] = training_set['SMS'].apply(remove)

In [12]:
test_set['SMS'] = test_set['SMS'].str.replace('\W', ' ')
test_set['SMS'] = test_set['SMS'].str.lower()
# test_set['SMS'] = test_set['SMS'].apply(remove)

### Creating a vocabulary

Each word in the messages will be transformed to represent a unique word in the vocabulary. Specifically, each column shows the frequency of that unique word for any given message. 

In [13]:
# x = training_set.copy()
training_set['SMS'] = training_set['SMS'].str.split()

vocabulary = []

In [14]:
for i in range (0,len(training_set['SMS'])):
    for j in range (0, len(training_set['SMS'][i])):
        vocabulary.append(training_set['SMS'][i][j])

In [15]:
vocabulary = set(vocabulary)

In [16]:
vocabulary = list(vocabulary)

### Further Data Transformation

In the next couple of cells, I will create a new DataFrame from a dictionary containing the number of time each row of the training

In [17]:
word_counts_per_sms = {unique_word: [0] * len(training_set['SMS']) for unique_word in vocabulary}

for index, sms in enumerate(training_set['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [18]:
xx = pd.DataFrame(word_counts_per_sms)

In [19]:
new_data = pd.concat([training_set, xx], axis=1)

### Calculating Constants for the Algorithm

The probability of spam and ham; the number of words in all spam messages, the number of words in all nonspam messges and the number of words in the vocabulary. 

In [20]:
len_train = len(training_set)

p_spam = len(training_set[training_set['Label']=='spam'])/len_train

In [21]:
p_ham = len(training_set[training_set['Label']=='ham'])/len_train

In [22]:
ham = training_set[training_set['Label']=='ham'].reset_index().drop(columns=['index'])

In [23]:
ham['len'] = ham['SMS'].apply(lambda x: len(x))

In [24]:
ham_number = ham['len'].sum()

In [25]:
spam = training_set[training_set['Label']=='spam'].reset_index().drop(columns=['index'])

spam['len'] = spam['SMS'].apply(lambda x: len(x))
spam_number = spam['len'].sum()

In [26]:
vocab_number = len(vocabulary)
alpha = 1

### Calculating Parameters

The previous parameters are constants for every messages regardless of the word it contains. However, P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance, P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or P("lovely"|Spam) will most likely have other values.

This means that we can use our training set to calculate the probability for each word in our vocabulary. If our vocabulary contained only the words "lost", "navigate", and "sea", then we'd need to calculate six probabilities:

* P("lost"|Spam) and P("lost"|Ham)
* P("navigate"|Spam) and P("navigate"|Ham)
* P("sea"|Spam) and P("sea"|Ham)

The fact that we calculate so many values before even beginning the classification of new messages makes the Naive Bayes algorithm very fast (especially compared to other algorithms). When a new message comes in, most of the needed computations are already done, which enables the algorithm to almost instantly classify the new message.

In [27]:
dict_ham = {}
dict_spam = {}
for keys in vocabulary: 
    dict_ham[keys] = 0
    dict_spam[keys] = 0

In [28]:
spam_df = new_data[new_data['Label']=='spam']

In [29]:
ham_df = new_data[new_data['Label']=='ham']

In [30]:
for i in range(0, len(vocabulary)):
    dict_spam[vocabulary[i]] = (spam_df[vocabulary[i]].sum() + alpha) / (spam_number + alpha + vocab_number)

In [31]:
for i in range(0, len(vocabulary)):
    dict_ham[vocabulary[i]] = (ham_df[vocabulary[i]].sum() + alpha) / (ham_number + alpha + vocab_number)

### Classifying a New Message

Now that all the constants and parameters needed have been calculated, the spam filter can now be created. It takes in a new message, calculated the parameters of the words it contains (done above) ad compare the values for ham and spam to determine if it is spam or non-spam or even perhaps if human expertise will be needed if the probaility values are the same.

In [32]:
import re

def classify(message):

    message = re.sub('\W', ' ', message)
    message = message.lower()
    message = message.split()

   

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham 
    for i in message:
        if i in dict_spam:
            p_spam_given_message *= dict_spam[i] 
        if i in dict_ham:
            p_ham_given_message *= dict_ham[i]
    
    
    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal proabilities, have a human classify this!')

In [33]:
msg = 'WINNER!! This is the secret code to unlock the money: C3421.'

classify(msg)

P(Spam|message): 1.3476009873135234e-25
P(Ham|message): 1.9365368329766623e-27
Label: Spam


In [34]:
msg1 = "Sounds good, Tom, then see u there"

classify(msg1)

P(Spam|message): 2.4364950561289247e-25
P(Ham|message): 3.687133462921691e-21
Label: Ham


### Classifying Test Data & Accuracy

The classify function above will be modified a bit to return values instead of printing them. 

In [35]:
def classify_test_set(message):

#     message = re.sub('\W', ' ', message)
#     message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in dict_spam:
            p_spam_given_message *= dict_spam[word]

        if word in dict_ham:
            p_ham_given_message *= dict_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

In [36]:
test_set['predicted'] = test_set['SMS'].apply(classify_test_set)

test_set.head()

Unnamed: 0,Label,SMS,predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest oran...,spam
3,ham,all sounds good fingers makes it difficult ...,ham
4,ham,all done all handed in don t know if mega sh...,ham


In [37]:
#calculatin accuracy

correct = 0
total = len(test_set)

In [45]:
for row in test_set.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)

Correct: 1100
Incorrect: 14
Accuracy: 0.9874326750448833


### Conclusion

The Accuracy is 98.74% which is very good. The spam filter looked at 1,114 messages that it hasn't seen in training, and classified 1,100 correctly.