## Spam Classifier Project

### Introduction

In 2022, few everyday nuisances in the developed world are as persistent as robocalls and text scams. Whether they warn us about our car's extended warranty or promise a cash prize on the other side of a suspicious link, spam messages come in many forms. Luckily, with some programming skill and statistical knowledge, we can build a spam classifier that helps us decide whether a given message should be avoided.

In this project, we will use the Naive Bayes algorithm to build a machine learning model that accomplishes this goal. We will train and test the model on a dataset of 5,572 SMS messages. The dataset can be found in my GitHub repository.

### Exploratory Analysis

In [1]:
# The first thing we will do is import the required packages and load up our data
import pandas as pd
import numpy as np
messages = pd.read_csv('SMSSpamCollection', sep='\t', names=['Label', 'SMS'])

In [2]:
# Let's get an idea of some of the attributes of our data set
messages.head(5)

Unnamed: 0,Label,SMS
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
# It appears that "ham" means "non-spam". That's pretty funny actually
# Let's see how many messages (rows) and columns we have in total
print(messages.shape)

(5572, 2)


In [4]:
messages['Label'].value_counts()

ham     4825
spam     747
Name: Label, dtype: int64

In [5]:
ham_percentage = (4825 / 5572)
spam_percentage = 1 - ham_percentage
print(ham_percentage)

0.8659368269921034
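
The percentages above are typed in by hand from the `value_counts()` output. As a minimal sketch, the same shares can be computed directly from a list of labels. The `label_shares` helper and the toy `labels` list below are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def label_shares(labels):
    """Return {label: fraction of all labels} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy label list built from the counts reported by value_counts() above
labels = ["ham"] * 4825 + ["spam"] * 747
shares = label_shares(labels)
print(shares["ham"], shares["spam"])
```

Computing the shares from the data avoids copying counts by hand, which is easy to get wrong if the dataset changes.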


### Preprocessing

Any good machine learning model needs a validation score. If we can't judge our model against a quantitative measure, what good is it? We will therefore split our data into two groups, a training set and a test set, using an 80:20 split.

In [6]:
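# We now create a training set and a test set (without scikit-learn)
# We first shuffle the rows so that the original ordering cannot bias the split
randomized_messages = messages.sample(frac=1, random_state=1)
train_messages = randomized_messages.iloc[0:4457, :].reset_index(drop=True)  # rows 0-4456; 80% of 5572 is ~4457.6
# Note: starting the test slice at 4458 leaves row 4457 out of both sets (4457 + 1114 = 5571 rows).
# iloc[4457:] would include it; the outputs below come from the split as written.
test_messages = randomized_messages.iloc[4458:, :].reset_index(drop=True)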
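
Hard-coded slice boundaries are easy to get wrong by one. A minimal sketch of one way to avoid that is to compute a single cut point and reuse it for both halves. The helper `train_test_split_indices` is hypothetical and works on row positions with the standard library only, so it will not reproduce the exact shuffle or the counts shown in this notebook.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=1):
    """Shuffle row positions and split them at one shared cut point."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    cut = round(n_rows * train_fraction)  # one boundary used by both slices
    return indices[:cut], indices[cut:]


train_idx, test_idx = train_test_split_indices(5572)
print(len(train_idx), len(test_idx))
```

Because both slices share `cut`, every row lands in exactly one of the two sets. With a shuffled DataFrame, the same idea would be `iloc[:cut]` and `iloc[cut:]`.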

In [7]:
# Let's check to make sure our alterations worked
train_messages.head(3)

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired


In [8]:
test_messages.head(3)

Unnamed: 0,Label,SMS
0,ham,Later i guess. I needa do mcat study too.
1,ham,But i haf enuff space got like 4 mb...
2,spam,Had your mobile 10 mths? Update to latest Oran...


In [9]:
# Let's check to see if we have a similar distribution as our original dataset
train_messages['Label'].value_counts()

ham     3857
spam     600
Name: Label, dtype: int64

In [10]:
train_percentage_non_spam = 3857 / len(train_messages)
print(train_percentage_non_spam)

0.8653803006506618


In [11]:
test_messages['Label'].value_counts()

ham     967
spam    147
Name: Label, dtype: int64

In [12]:
test_percentage_non_spam = 967 / len(test_messages)
print(test_percentage_non_spam)

0.8680430879712747


We notice that in both our training and test sets, the share of ham (about 86.5% and 86.8%, respectively) is almost identical to the 86.6% in the full dataset, so the share of spam matches as well. This bodes well for our confidence going forward.

### Model Construction

Recall that the model we will be using for spam classification is based on the Naive Bayes algorithm. Allow me to explain how we can apply it here mathematically.

We are interested in the probability of a message being spam (or ham) given the words in the message. Under the "naive" assumption that words occur independently of one another given the class, this probability is proportional (NOT equal) to the prior probability of spam (or ham) times the product of the probabilities of each word given that class.
In symbols, we have:
P(Spam|w(1), ..., w(n)) ∝ P(Spam) * P(w(1)|Spam) * ... * P(w(n)|Spam)

The same expression with Ham in place of Spam gives the score for ham. Each P(w(i)|Spam) is estimated from word counts with additive smoothing: P(w(i)|Spam) = (N(w(i)|Spam) + alpha) / (N(Spam) + alpha * N(Vocabulary)), where N(w(i)|Spam) is the number of times w(i) appears in spam messages, N(Spam) is the total number of words in spam messages, N(Vocabulary) is the number of unique words, and alpha is a smoothing parameter. Together, these formulas make up the multinomial Naive Bayes algorithm.
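
To make the proportionality concrete, here is a small worked sketch with made-up numbers. The priors and word probabilities are hypothetical and chosen only for illustration. It computes the two unnormalized scores and then divides by their sum to recover an actual probability.

```python
# Hypothetical priors and per-word likelihoods (not taken from the dataset)
p_spam, p_ham = 0.25, 0.75
p_word_given_spam = {"free": 0.20, "prize": 0.10}
p_word_given_ham = {"free": 0.02, "prize": 0.01}


def score(prior, likelihoods, words):
    """Prior times the product of P(word | class) for every word."""
    result = prior
    for word in words:
        result *= likelihoods[word]
    return result


message = ["free", "prize"]
spam_score = score(p_spam, p_word_given_spam, message)  # 0.25 * 0.20 * 0.10
ham_score = score(p_ham, p_word_given_ham, message)     # 0.75 * 0.02 * 0.01

# The scores are only proportional to the posteriors; normalizing fixes that
posterior_spam = spam_score / (spam_score + ham_score)
print(spam_score, ham_score, posterior_spam)
```

For classification, only the comparison between the two scores matters, which is why the normalizing step can be skipped in practice.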

#### Data Cleaning

In [13]:
# We will first clean our datasets and isolate each word in our vocabulary
# Before cleaning
train_messages.head()

Unnamed: 0,Label,SMS
0,ham,"Yep, by the pretty sculpture"
1,ham,"Yes, princess. Are you going to make me moan?"
2,ham,Welp apparently he retired
3,ham,Havent.
4,ham,I forgot 2 ask ü all smth.. There's a card on ...


In [14]:
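# Remove punctuation and transform to lower case
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
train_messages['SMS'] = train_messages['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()
test_messages['SMS'] = test_messages['SMS'].str.replace(r'\W', ' ', regex=True).str.lower()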
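
As a quick illustration of what the `\W` pattern does to a single message, here is a minimal sketch using the standard `re` module. The `clean` helper is a hypothetical name introduced for this example.

```python
import re


def clean(text):
    """Replace every non-word character with a space, lowercase, and split."""
    return re.sub(r'\W', ' ', text).lower().split()


print(clean("Don't know, if..."))
print(clean("I forgot 2 ask ü all"))
```

Apostrophes count as non-word characters, so "don't" becomes the two tokens "don" and "t", while digits and letters such as "ü" are kept.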

In [15]:
# After cleaning
train_messages.head()

Unnamed: 0,Label,SMS
0,ham,yep by the pretty sculpture
1,ham,yes princess are you going to make me moan
2,ham,welp apparently he retired
3,ham,havent
4,ham,i forgot 2 ask ü all smth there s a card on ...


In [16]:
test_messages.head()

Unnamed: 0,Label,SMS
0,ham,later i guess i needa do mcat study too
1,ham,but i haf enuff space got like 4 mb
2,spam,had your mobile 10 mths update to latest oran...
3,ham,all sounds good fingers makes it difficult ...
4,ham,all done all handed in don t know if mega sh...


In [17]:
# Now that our data is cleaned, it's time to create our vocabulary set of words
train_messages['SMS'] = train_messages['SMS'].str.split()  # Creates a list of words for each message in the SMS column
vocabulary = []  # Our vocabulary list
for sentence in train_messages['SMS']:  # Iterate over each list-message
    for word in sentence:  # Iterate over each word in the list
        vocabulary.append(word)  # Add that word to our vocabulary list in the global scope
vocabulary = list(set(vocabulary)) # Removes all duplicates   

In [18]:
# We create a dict to count the number of times a word appears in each SMS
word_counts_per_sms = {unique_word: [0] * len(train_messages['SMS']) for unique_word in vocabulary}
for index, sms in enumerate(train_messages['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1

In [19]:
# We then create a DataFrame from this dict (we concatenate it with our train set in the next cell)
word_count_df = pd.DataFrame(word_counts_per_sms)
word_count_df.head()

Unnamed: 0,0,00,000,000pes,008704050406,0089,01223585334,02,0207,02072069400,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


In [20]:
train_clean = pd.concat([train_messages, word_count_df], axis=1)
train_clean.head()

Unnamed: 0,Label,SMS,0,00,000,000pes,008704050406,0089,01223585334,02,...,zindgi,zoe,zogtorius,zouk,zyada,é,ú1,ü,〨ud,鈥
0,ham,"[yep, by, the, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, are, you, going, to, make, me,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, he, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[i, forgot, 2, ask, ü, all, smth, there, s, a,...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,2,0,0


#### Probability and Number Calculations

In [21]:
# First, we calculate the probability of spam/ham in the training set
train_clean['Label'].value_counts()

ham     3857
spam     600
Name: Label, dtype: int64

In [22]:
p_spam = 600 / (3857 + 600)
p_ham = 3857 / (3857 + 600)

In [23]:
# We now calculate the total number of words in our spam messages
mask1 = train_clean['Label'] == 'spam'
train_spam = train_clean[mask1]
words_in_spam = []
for sms in train_spam['SMS']:
    for word in sms:
        words_in_spam.append(word)
n_spam = len(words_in_spam)

In [24]:
# We now calculate the total number of words in our ham messages
mask2 = train_clean['Label'] == 'ham'
train_ham = train_clean[mask2]
words_in_ham = []
for sms in train_ham['SMS']:
    for word in sms:
        words_in_ham.append(word)
n_ham = len(words_in_ham)

In [25]:
# We now count the total number of *unique* words in our vocabulary
n_vocabulary = len(vocabulary)

In [26]:
# Let's check the number of each 
print(n_spam)
print(n_ham)
print(n_vocabulary)

15190
57233
7782


In [27]:
# We also define our "smoothing parameter" alpha
alpha = 1

#### Calculating Parameters

We will now calculate the last elements of our algorithm: the conditional probabilities of a word appearing in an SMS message given that the message is spam or ham, such as P(w(i)|Spam). These are the parameters of our model.
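
Before computing these parameters for the whole vocabulary, it may help to see what the smoothing parameter does for a single word. The sketch below uses the `n_spam` and `n_vocabulary` values printed earlier. The function name `smoothed_probability` is an assumption introduced for illustration.

```python
def smoothed_probability(word_count, n_class_words, n_vocabulary, alpha=1):
    """Additive (Laplace) smoothing for P(word | class)."""
    return (word_count + alpha) / (n_class_words + alpha * n_vocabulary)


n_spam, n_vocabulary = 15190, 7782  # values printed in the cell above

# A vocabulary word that never appears in any spam message
with_smoothing = smoothed_probability(0, n_spam, n_vocabulary, alpha=1)
without_smoothing = smoothed_probability(0, n_spam, n_vocabulary, alpha=0)
print(with_smoothing, without_smoothing)
```

Without smoothing, a single unseen word would contribute a factor of zero and wipe out the whole product. With `alpha = 1`, it contributes a small but nonzero probability instead.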

In [28]:
# We create two dictionaries that will hold P(word|Spam) and P(word|Ham) for every word in our vocabulary
spam_word_dict = {unique_word: 0 for unique_word in vocabulary}
ham_word_dict =  {unique_word: 0 for unique_word in vocabulary}

In [29]:
# Recall that we isolated spam/ham messages into dataframes from above
# We will use these to calculate our parameters.
for word in vocabulary:  # We iterate over our entire vocabulary
    n_word_given_spam = train_spam[word].sum()  # We count the number of times that word appears in a spam message
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + (alpha * n_vocabulary))  # Formula used to calculate P(w(i)|Spam)
    spam_word_dict[word] = p_word_given_spam
    
    # We repeat the exact same process, this time for ham
    n_word_given_ham = train_ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocabulary))
    ham_word_dict[word] = p_word_given_ham
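
The loop above sums one DataFrame column per vocabulary word. As a rough alternative sketch, the same smoothed probabilities can be built from plain word lists with `collections.Counter`, which avoids the wide word-count table. The helper `word_probabilities` and the toy messages are hypothetical and are not the notebook's data.

```python
from collections import Counter


def word_probabilities(messages, vocabulary, alpha=1):
    """Map each vocabulary word to its smoothed P(word | class)."""
    counts = Counter(word for message in messages for word in message)
    n_words = sum(counts.values())  # total words in this class
    denominator = n_words + alpha * len(vocabulary)
    # Counter returns 0 for words this class never used
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}


# Toy tokenized messages, already cleaned and split
spam_messages = [["win", "cash", "now"], ["win", "prize"]]
ham_messages = [["see", "you", "now"]]
vocabulary = sorted({w for m in spam_messages + ham_messages for w in m})

spam_probs = word_probabilities(spam_messages, vocabulary)
ham_probs = word_probabilities(ham_messages, vocabulary)
print(spam_probs["win"], ham_probs["win"])
```

A useful sanity check is that the probabilities for one class sum to 1 over the vocabulary.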

#### Filter Construction

We finally have all the variables we need to create our spam filter. We will do so by writing a function that takes a new message as its input, calculates the conditional probability values of interest, and compares them to one another to determine a final classification.

In [30]:
import re
def classify(message):
    message = re.sub(r'\W', ' ', message)  # replace punctuation (any non-word character) with a space
    message = message.lower()  # make message all lowercase
    message = message.split()  # split string and create a list of words
    
    p_spam_given_message = p_spam  # Initialize our P(Spam|w1, .., w(n))
    for word in message:  # Iterate over each word in our messages
        if word in spam_word_dict:  # If the word exists in our dictionary above
            p_spam_given_message *= spam_word_dict[word]  # Multiply the running product by P(word|Spam)
        else:
            pass  # We skip words that don't exist in our vocabulary
    
    # We repeat the process for ham messages
    p_ham_given_message = p_ham
    for word in message:
        if word in ham_word_dict:
            p_ham_given_message *= ham_word_dict[word]
        else:
            pass
    
    if p_spam_given_message > p_ham_given_message:
        return 'spam'
    elif p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_ham_given_message == p_spam_given_message:
        return 'need human classification'
        

In [31]:
# We now test our function on two simple sentences before we move on to our test set
classify('WINNER!! This is the secret code to unlock the money: C3421.')

'spam'

In [32]:
classify("Sounds good, Tom, then see u there")

'ham'
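
One caveat with multiplying many small probabilities is floating-point underflow: for a very long message, both products can round to 0.0 and the comparison ends in a tie. A common remedy is to add logarithms instead of multiplying. The sketch below is a variant under that assumption, not the notebook's `classify`. The function `classify_log` takes the priors and word dictionaries as arguments, and the toy probabilities are made up.

```python
import math
import re


def classify_log(message, p_spam, p_ham, spam_word_probs, ham_word_probs):
    """Compare log-scores instead of raw products to avoid underflow."""
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in words:
        # Words outside the vocabulary are skipped, as in the original filter
        if word in spam_word_probs:
            log_spam += math.log(spam_word_probs[word])
        if word in ham_word_probs:
            log_ham += math.log(ham_word_probs[word])
    if log_spam > log_ham:
        return 'spam'
    if log_ham > log_spam:
        return 'ham'
    return 'need human classification'


# Hypothetical smoothed probabilities, for illustration only
toy_spam = {"winner": 0.05, "money": 0.04, "see": 0.001}
toy_ham = {"winner": 0.001, "money": 0.002, "see": 0.03}

print(classify_log("WINNER!! Money", 0.13, 0.87, toy_spam, toy_ham))
print(classify_log("See you", 0.13, 0.87, toy_spam, toy_ham))
# A 400-word message would underflow to 0.0 as a raw product
print(classify_log("winner " * 400, 0.13, 0.87, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing log-scores gives the same decision as comparing the products whenever the products are representable.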

### Testing our model

Now that we have calculated everything we need, we will apply our model to the test set that we created earlier and then measure its accuracy to see how well we did!

In [33]:
# We will create an additional column in our test set to generate our prediction.
test_messages['Predicted'] = test_messages['SMS'].apply(classify)
test_messages.head()

Unnamed: 0,Label,SMS,Predicted
0,ham,later i guess i needa do mcat study too,ham
1,ham,but i haf enuff space got like 4 mb,ham
2,spam,had your mobile 10 mths update to latest oran...,spam
3,ham,all sounds good fingers makes it difficult ...,ham
4,ham,all done all handed in don t know if mega sh...,ham


In [36]:
# Now we will calculate our model accuracy.
correct = 0
total = len(test_messages)
for row in test_messages.iterrows():
    row = row[1]
    if row['Label'] == row['Predicted']:
        correct += 1
percent_accurate = round((correct / total) * 100, 2)
print("Our model is {}% accurate".format(percent_accurate))

Our model is 98.74% accurate
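
Accuracy alone can flatter a classifier on imbalanced data: since about 86.8% of the test messages are ham, a model that always answered 'ham' would already score about 86.8%. As a hedged sketch, precision and recall for the spam class give a fuller picture. The `spam_metrics` helper and the toy labels below are assumptions for illustration and are not results from this notebook.

```python
def spam_metrics(actual, predicted):
    """Return (precision, recall) for the 'spam' class."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
    false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
    false_neg = sum(a == 'spam' and p != 'spam' for a, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy labels, not the notebook's test set
actual = ['ham', 'ham', 'spam', 'spam', 'ham', 'spam']
predicted = ['ham', 'spam', 'spam', 'ham', 'ham', 'spam']
precision, recall = spam_metrics(actual, predicted)
print(precision, recall)
```

On the real test set, the inputs would be the `Label` and `Predicted` columns of `test_messages`.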


### Conclusion

In the end, we were able to create a spam classifier that was 98.74% accurate at labeling messages as spam or ham within our test set, which is very good. Classification is an important component of machine learning. As such, I feel that this project has given me the opportunity to learn how to construct a viable, albeit simplistic, classification algorithm.