# Naive Bayes Classifier for Spam Detection

## Instructions

Total Points: 10

Complete this notebook and submit it. The notebook needs to be a complete project report with 

* your implementation,
* documentation including a short discussion of how your implementation works and your design choices, and
* experimental results (e.g., tables and charts with simulation results) with a short discussion of what they mean. 

Use the provided notebook cells and insert additional code and markdown cells as needed.

## Introduction

A spam detection agent gets as its percepts text messages and needs to decide if they are spam or not.
Create a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for the 
[UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to perform this task.

__About the use of libraries:__ The point of this exercise is to learn how a Bayes classifier is built. You may use libraries for tokenizing, stop words and to create a document-term matrix, but you need to implement parameter estimation and prediction yourself.

## Create a bag-of-words representation of the text messages [3 Points]

The first step is to tokenize the text. Here is an example of how to use the [natural language tool kit (nltk)](https://www.nltk.org/) to create tokens (separate words).

Experiment with removing frequent words (called [stopwords](https://en.wikipedia.org/wiki/Stop_word)) and very infrequent words so you end up with a reasonable number of words used in the classifier. Maybe you need to remove digits or all non-letter characters. You may also use a stemming algorithm. 

Convert the tokenized data into a data structure that indicates for each document what words it contains. The data structure can be a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) with 0s and 1s, a pandas dataframe or some sparse matrix structure. Note: words, tokens and terms are often used interchangeably. Make sure the data structure can be used to split the data into training and test documents (see below).

Report the 20 most frequent and the 20 least frequent words in your data set.

In [1]:
import pandas as pd
from collections import Counter
import re
import nltk

def print_full(x):
    pd.set_option('display.max_rows', None)
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 2000)
    pd.set_option('display.float_format', '{:20,.2f}'.format)
    pd.set_option('display.max_colwidth', None)
    display(x)
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_columns')
    pd.reset_option('display.width')
    pd.reset_option('display.float_format')
    pd.reset_option('display.max_colwidth')
    
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """
    pretty print for confusion matrixes
    taken from: https://gist.github.com/zachguo/10296432 
    """
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    
    # Begin CHANGES
    fst_empty_cell = (columnwidth-3)//2 * " " + "t/p" + (columnwidth-3)//2 * " "
    
    if len(fst_empty_cell) < len(empty_cell):
        fst_empty_cell = " " * (len(empty_cell) - len(fst_empty_cell)) + fst_empty_cell
    # Print header
    print("    " + fst_empty_cell, end=" ")
    # End CHANGES
    
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
        
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        for j in range(len(labels)):
            cell = "%{0}.1f".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print(cell, end=" ")
        print()

In [2]:
# list of stopwords
stopwords = ["a", "about", "above", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also","although","always","am","among", "amongst", "amoungst", "amount",  "an", "and", "another", "any","anyhow","anyone","anything","anyway", "anywhere", "are", "around", "as",  "at", "back","be","became", "because","become","becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom","but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven","else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fify", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own","part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", 
"somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "the"]

In [3]:
# import our dataset as pandas dataframe
sms_spam = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['Class', 'Text'])

In [4]:
# randomize the dataset
data_randomized = sms_spam.sample(frac=1, random_state=1)

# calculate index for split
training_test_idx = round(len(data_randomized) * 0.8)

# split into training and test sets
# note: i clean the training data here but also have a function below  
#   that cleans and classifies the test data as it analyzes it.
train = data_randomized[:training_test_idx].reset_index(drop=True)
test = data_randomized[training_test_idx:].reset_index(drop=True)

# check counts
print("Training Set Counts")
display(train['Class'].value_counts(normalize=True))
print("Test Set Counts")
display(test['Class'].value_counts(normalize=True))

Training Set Counts


ham     0.86541
spam    0.13459
Name: Class, dtype: float64

Test Set Counts


ham     0.868043
spam    0.131957
Name: Class, dtype: float64

In [5]:
# clean up the training set
# remove punctuation
train['Text'] = train['Text'].str.replace(r'\W', ' ', regex=True)
# set all to lowercase
train['Text'] = train['Text'].str.lower()
# loop
for i, row in train.iterrows():
    # tokenize
    tok = nltk.word_tokenize(row['Text'])
    # remove not alpha characters
    tok2 = [x for x in tok if x.isalpha()]
    # remove 1 letter words
    tok3 = [x for x in tok2 if len(x)>=2]
    # remove stopwords and join
    train.loc[i,'Text_tok'] = " ".join(filter(lambda w: w not in stopwords, tok3))
    # train.loc[i,'Text_tok'] = list(filter(lambda w: w not in stopwords, tok))
    
train['Text_tok'] = train['Text_tok'].str.split()
display(train)

Unnamed: 0,Class,Text,Text_tok
0,ham,yep by the pretty sculpture,"[yep, pretty, sculpture]"
1,ham,yes princess are you going to make me moan,"[yes, princess, going, make, moan]"
2,ham,welp apparently he retired,"[welp, apparently, retired]"
3,ham,havent,[havent]
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[forgot, ask, smth, card, da, present, lei, wa..."
...,...,...,...
4453,ham,sorry i ll call later in meeting any thing re...,"[sorry, ll, later, meeting, thing, related, tr..."
4454,ham,babe i fucking love you too you know fuck...,"[babe, fucking, love, know, fuck, good, hear, ..."
4455,spam,u ve been selected to stay in 1 of 250 top bri...,"[ve, selected, stay, british, hotels, holiday,..."
4456,ham,hello my boytoy geeee i miss you already a...,"[hello, boytoy, geeee, miss, just, woke, wish,..."


In [6]:
# obtain list of all vocabulary
vocab = []
for text in train['Text_tok']:
    for word in text:
        vocab.append(word)

vocab_counts = Counter(vocab)        
vocab = list(set(vocab))
print(len(vocab), " unique terms in dataset.")

6487  unique terms in dataset.


In [7]:
# create a document-term count matrix (one column per vocabulary word)
text_counts = {unique_word: [0] * len(train['Text_tok']) for unique_word in vocab}
for index, text in enumerate(train['Text_tok']):
    for word in text:
        text_counts[word][index] += 1

In [8]:
# convert to a dataframe
word_counts = pd.DataFrame(text_counts)
word_counts.head()

Unnamed: 0,holla,hum,east,australia,permission,nos,influx,lodging,hockey,onluy,...,paracetamol,barcelona,neglet,waves,starting,wadebridge,date,virgins,svc,wat
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# combine old dataframe to new one-hot dataframe
train_final = pd.concat([train, word_counts], axis=1)
train_final = train_final.drop(['Text'], axis=1)
train_final.head()

Unnamed: 0,Class,Text_tok,holla,hum,east,australia,permission,nos,influx,lodging,...,paracetamol,barcelona,neglet,waves,starting,wadebridge,date,virgins,svc,wat
0,ham,"[yep, pretty, sculpture]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, going, make, moan]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, retired]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[forgot, ask, smth, card, da, present, lei, wa...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
print("Most common words in dataset with counts: ")
vocab_counts.most_common(20)

Most common words in dataset with counts: 


[('ur', 316),
 ('just', 301),
 ('gt', 255),
 ('lt', 252),
 ('ok', 236),
 ('free', 223),
 ('ll', 221),
 ('know', 212),
 ('like', 196),
 ('good', 195),
 ('day', 194),
 ('come', 189),
 ('got', 184),
 ('time', 180),
 ('love', 172),
 ('send', 168),
 ('want', 154),
 ('text', 147),
 ('going', 146),
 ('txt', 142)]

In [11]:
print("Least common words in dataset with counts: ")
print("(Note that there were many words with a count of 1, these are just some)")
vocab_counts.most_common()[:-20-1:-1]

Least common words in dataset with counts: 
(Note that there were many words with a count of 1, these are just some)


[('wherre', 1),
 ('arul', 1),
 ('trade', 1),
 ('related', 1),
 ('jewelry', 1),
 ('secrets', 1),
 ('hides', 1),
 ('beauty', 1),
 ('prakesh', 1),
 ('genes', 1),
 ('recession', 1),
 ('nag', 1),
 ('inconsiderate', 1),
 ('readers', 1),
 ('phoenix', 1),
 ('potter', 1),
 ('abbey', 1),
 ('swann', 1),
 ('armenia', 1),
 ('occupied', 1)]

## Learn parameters [3 Points]

Use 80% of the data (called training set; randomly chosen) to learn the parameters of the naive Bayes classifier (prior probabilities and likelihoods). 
Remember, the naive Bayes classifier calculates:

$$P(spam|message) \propto score_{spam}(message) = P(spam) \prod_{i=1}^n P(w_i | spam)$$
$$P(ham|message) \propto score_{ham}(message) = P(ham) \prod_{i=1}^n P(w_i | ham)$$

and classifies a message as spam if 
$$score_{spam}(message) > score_{ham}(message).$$ 

You therefore need to
estimate: 

* the priors $P(spam)$ and $P(ham)$, and 
* the likelihoods $P(w_i | spam)$ and $P(w_i | ham)$ for all words

from counts obtained from the training data. Use [Laplacian smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) to estimate the
likelihoods. This deals with words that have very low count in the ham or spam messages and avoids
likelihoods of zero.

Report the top 20 words (highest conditional probability) for ham and for spam. These words represent the biggest clues that a message is ham or spam.

In [12]:
# Laplace smoothing
alpha = 1
# separate spam & ham
spam = train_final[train_final['Class'] == 'spam']
ham = train_final[train_final['Class'] == 'ham']
# num vocab
n_vocab = len(vocab)

In [13]:
# calc P(spam) and P(ham)
p_spam = len(spam) / len(train_final)
print("P(spam): ", p_spam)
p_ham = len(ham) / len(train_final)
print("P(ham): ", p_ham)

# Length of all spam words
num_words_spam = spam['Text_tok'].apply(len)
n_spam = num_words_spam.sum()
print("Length of all spam words: ", n_spam)

# Length of all ham words
num_words_ham = ham['Text_tok'].apply(len)
n_ham = num_words_ham.sum()
print("Length of all ham words: ", n_ham)

# init P(w|spam) and P(w|ham)
params_spam = {word:0 for word in vocab}
params_ham = {word:0 for word in vocab}

# calc P(w|spam) and P(w|ham)
for word in vocab:
    n_word_given_spam = spam[word].sum()
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + (alpha * n_vocab))
    params_spam[word] = p_word_given_spam

    n_word_given_ham = ham[word].sum()
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + (alpha * n_vocab))
    params_ham[word] = p_word_given_ham

P(spam):  0.13458950201884254
P(ham):  0.8654104979811574
Length of all spam words:  7345
Length of all ham words:  26855


In [19]:
count = Counter(params_ham)
display("Top ham words: ", count.most_common(20))
count = Counter(params_spam)
display("Top spam words: ", count.most_common(20))

'Top ham words: '

[('gt', 0.007678003719033051),
 ('lt', 0.007588027112950633),
 ('just', 0.0071681362845660125),
 ('ok', 0.006988183072401176),
 ('ll', 0.006568292244016556),
 ('ur', 0.006028432607522044),
 ('know', 0.005728510587247316),
 ('come', 0.005638533981164897),
 ('like', 0.005608541779137425),
 ('good', 0.005548557375082479),
 ('got', 0.005428588566972587),
 ('day', 0.0053086197588626954),
 ('time', 0.0049487133345330215),
 ('love', 0.004918721132505548),
 ('going', 0.004318877091956092),
 ('home', 0.004138923879791255),
 ('need', 0.004108931677763781),
 ('want', 0.0038989862635714716),
 ('sorry', 0.0038390018595165255),
 ('lor', 0.00377901745546158)]

'Top spam words: '

[('free', 0.012724117987275883),
 ('txt', 0.009543088490456911),
 ('ur', 0.008458646616541353),
 ('stop', 0.0074465008675534995),
 ('mobile', 0.00708502024291498),
 ('text', 0.006795835743204164),
 ('claim', 0.006578947368421052),
 ('reply', 0.006434355118565645),
 ('www', 0.00578368999421631),
 ('prize', 0.005566801619433198),
 ('just', 0.004626951995373048),
 ('send', 0.004554655870445344),
 ('new', 0.004482359745517641),
 ('won', 0.004482359745517641),
 ('cash', 0.004410063620589937),
 ('nokia', 0.004265471370734529),
 ('uk', 0.004193175245806825),
 ('win', 0.0037593984962406013),
 ('tone', 0.003687102371312898),
 ('urgent', 0.003470213996529786)]
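
As a check on the numbers above, the likelihoods follow the Laplace-smoothed estimate used in the code, where $N_{w,c}$ is the number of times word $w$ occurs in messages of class $c$, $N_c$ is the total number of tokens in class $c$, $|V|$ is the vocabulary size, and $\alpha = 1$ (the symbols $N_{w,c}$, $N_c$, and $|V|$ are notation introduced here for illustration):

$$P(w | c) = \frac{N_{w,c} + \alpha}{N_c + \alpha |V|}$$

For example, the value reported for 'gt' in ham is consistent with a ham count of 255, using the ham token total of 26855 and the 6487 vocabulary terms reported above:

$$P(gt | ham) = \frac{255 + 1}{26855 + 6487} = \frac{256}{33342} \approx 0.00768$$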

In [15]:
# function to classify dataset
def classify(message):
    # remove punctuation
    message = re.sub(r'\W', ' ', message)
    # lower case
    message = message.lower()
    # tokenize
    message = nltk.word_tokenize(message)
    # remove non alpha characters
    message = [x for x in message if x.isalpha()]
    # remove stopwords (result stays a list of tokens)
    message = list(filter(lambda w: w not in stopwords, message))
    # assign default probabilities
    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    # loop through words in given message
    for word in message:
        # iteratively calculate score for spam
        if word in params_spam:
            p_spam_given_message *= params_spam[word]
        # iteratively calculate score for ham
        if word in params_ham:
            p_ham_given_message *= params_ham[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'probabilities = 50%'
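
Multiplying many small likelihoods, as `classify` does, can underflow to zero for long messages. A common remedy is to add log-probabilities instead of multiplying probabilities. The following is a minimal sketch of that idea, not the code used for the results in this notebook; the helper `log_score` and all `toy_*` values are hypothetical and chosen only for illustration.

```python
import math

def log_score(tokens, prior, likelihoods):
    """Return log(prior) plus the summed log-likelihoods of known tokens."""
    score = math.log(prior)
    for word in tokens:
        # skip words that are not in the vocabulary, as classify() does
        if word in likelihoods:
            score += math.log(likelihoods[word])
    return score

# Hypothetical toy parameters (not the values learned above)
toy_p_spam, toy_p_ham = 0.13, 0.87
toy_params_spam = {"free": 0.0127, "prize": 0.0056, "home": 0.0002}
toy_params_ham = {"free": 0.0015, "prize": 0.0001, "home": 0.0041}

tokens = ["free", "prize", "unseenword"]
spam_score = log_score(tokens, toy_p_spam, toy_params_spam)
ham_score = log_score(tokens, toy_p_ham, toy_params_ham)
label = "spam" if spam_score > ham_score else "ham"
print(label)

# Why logs help: a product of 400 factors of 1e-3 underflows to 0.0,
# while the equivalent sum of logs stays finite.
product = 1.0
log_total = 0.0
for _ in range(400):
    product *= 1e-3
    log_total += math.log(1e-3)
print(product, log_total)
```

Because the logarithm is monotonic, comparing the two log scores gives the same decision as comparing the original products whenever those products do not underflow.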

## Evaluate the classification performance [4 Points] 

Classify the remaining 20% of the data (test set) and calculate classification accuracy. Accuracy is defined as the proportion of correctly classified test documents.

1. How good is your classifier's accuracy compared to a baseline classifier.

2. Inspect a few misclassified text messages and discuss why the classification failed.

3. Discuss how you deal with words in the test data that you have not seen in the training data.

In [16]:
# store off prediction in test dataframe
test_classify = test.copy()
test_classify['prediction'] = test_classify['Text'].apply(classify)
test_classify.head()

Unnamed: 0,Class,Text,prediction
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [17]:
correct = 0
total = test_classify.shape[0]
# loop through dataframe to obtain number correct
for row in test_classify.iterrows():
    # if correct add 1 to correct var
    if row[1]['Class'] == row[1]['prediction']:
        correct += 1
# print results
print('Total:', total)
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy: {}%'.format((correct/total)*100))

Total: 1114
Correct: 1090
Incorrect: 24
Accuracy: 97.84560143626571%


In [18]:
from sklearn.metrics import classification_report,confusion_matrix
print("Confusion Matrix: ")
print_cm(confusion_matrix(test_classify['Class'],test_classify['prediction']),['ham','spam'])
print("")
print("Classification Report: ")
print(classification_report(test_classify['Class'],test_classify['prediction']))

Confusion Matrix: 
     t/p    ham  spam 
      ham 953.0  14.0 
     spam  10.0 137.0 

Classification Report: 
              precision    recall  f1-score   support

         ham       0.99      0.99      0.99       967
        spam       0.91      0.93      0.92       147

    accuracy                           0.98      1114
   macro avg       0.95      0.96      0.95      1114
weighted avg       0.98      0.98      0.98      1114



1. <b>How good is your classifier's accuracy compared to a baseline classifier.</b>

As seen above, my classifier has an accuracy of 97.85%. Out of 1114 test messages it correctly classified 1090 and misclassified only 24. For comparison with a published baseline, I looked at the research paper https://pdfs.semanticscholar.org/22de/da99e2c06d497c41e1b7a45c9e920c0367f2.pdf, which reports an average accuracy of 97.4%, slightly below mine. Additionally, my classifier was clearly better than a weak baseline that always predicts a single class, which at best would reach an accuracy of 86.8% by predicting 'ham' for every message.

Furthermore, looking at the confusion matrix above (rows are true classes, columns are predicted classes) and treating spam as the positive class, the false positive count (# of messages that were legit but were classified as spam) was 14, which is pretty low, and the false negative count (# of messages that were spam but were classified as legit) was 10, which is even lower. Given the size of the dataset these numbers are quite good. Additionally, the classification report shows high precision and recall for both classes, although spam (precision 0.91, recall 0.93) is harder than ham (0.99 for both). This is important because recall for a class is the proportion of messages truly belonging to that class that were classified correctly, while precision is the proportion of messages assigned to that class that truly belong to it.

Given these results I would argue that the classifier performed quite well! I of course anticipate that more advanced ML algorithms would perform better, but for a naive Bayes classifier, this one worked quite well!
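
The figures discussed above can be re-derived directly from the confusion matrix. The following is a small sketch that assumes rows are true classes and columns are predicted classes, with the four counts copied by hand from the output above; the variable names are introduced here for illustration.

```python
# Counts copied from the confusion matrix above
# (rows = true class, columns = predicted class)
ham_as_ham, ham_as_spam = 953, 14
spam_as_ham, spam_as_spam = 10, 137

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Fraction of all test messages classified correctly
accuracy = (ham_as_ham + spam_as_spam) / total

# Majority-class baseline: always predict the more frequent class ('ham')
n_true_ham = ham_as_ham + ham_as_spam
n_true_spam = spam_as_ham + spam_as_spam
baseline_accuracy = max(n_true_ham, n_true_spam) / total

# Treating 'spam' as the positive class
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy:       {accuracy:.4f}")           # 0.9785
print(f"baseline:       {baseline_accuracy:.4f}")  # 0.8680
print(f"spam precision: {spam_precision:.4f}")     # 0.9073
print(f"spam recall:    {spam_recall:.4f}")        # 0.9320
```

These values agree with the accuracy printed earlier and with the rounded spam precision (0.91) and recall (0.93) in the classification report.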


In [19]:
# collect the misclassified messages, grouped by their true class
# (DataFrame.append was removed in pandas 2.0, so use boolean indexing)
misclassified = test_classify[test_classify['Class'] != test_classify['prediction']]
wrong_ham = misclassified[misclassified['Class'] == 'ham']
wrong_spam = misclassified[misclassified['Class'] == 'spam']

print_full(wrong_ham)
print_full(wrong_spam)

Unnamed: 0,Class,Text,prediction
9,ham,I liked the new mobile,spam
130,ham,How was txting and driving,spam
152,ham,Unlimited texts. Limited minutes.,spam
159,ham,26th OF JULY,spam
284,ham,Nokia phone is lovly..,spam
302,ham,No calls..messages..missed calls,spam
319,ham,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us",spam
361,ham,"8 at the latest, g's still there if you can scrounge up some ammo and want to give the new ak a try",spam
398,ham,Hasn't that been the pattern recently crap weekends?,spam
660,ham,Any chance you might have had with me evaporated as soon as you violated my privacy by stealing my phone number from your employer's paperwork. Not cool at all. Please do not contact me again or I will report you to your supervisor.,spam


Unnamed: 0,Class,Text,prediction
114,spam,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net,ham
135,spam,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB,ham
363,spam,Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123,ham
500,spam,"Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck!",ham
504,spam,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50",ham
604,spam,88066 FROM 88066 LOST 3POUND HELP,ham
673,spam,Burger King - Wanna play footy at a top stadium? Get 2 Burger King before 1st Sept and go Large or Super with Coca-Cola and walk out a winner,ham
876,spam,RCT' THNQ Adrian for U text. Rgds Vatian,ham
885,spam,2/2 146tf150p,ham
953,spam,Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r,ham


2. <b>Inspect a few misclassified text messages and discuss why the classification failed.</b>

Above you can see all of the misclassified messages. To make it a bit easier, let us first clean the data so we can see what the classifier saw and drop the class and prediction columns.

In [20]:
def clean(df_in):
    df = df_in.copy()
    # remove punctuation
    df['Text'] = df['Text'].str.replace(r'\W', ' ', regex=True)
    # set all to lowercase
    df['Text'] = df['Text'].str.lower()
    # loop
    for i, row in df.iterrows():
        # tokenize
        tok = nltk.word_tokenize(row['Text'])
        # remove not alpha characters
        tok2 = [x for x in tok if x.isalpha()]
        # remove 1 letter words
        tok3 = [x for x in tok2 if len(x)>=2]
        # remove stopwords and join
        df.loc[i,'Text'] = " ".join(filter(lambda w: w not in stopwords, tok3))
    return df

wrong_ham = wrong_ham.drop(['Class', 'prediction'], axis=1)
wrong_spam = wrong_spam.drop(['Class', 'prediction'],axis=1)

wrong_ham_clean = clean(wrong_ham)
wrong_spam_clean = clean(wrong_spam)

In [21]:
print_full(wrong_ham)
print_full(wrong_ham_clean)

Unnamed: 0,Text
9,I liked the new mobile
130,How was txting and driving
152,Unlimited texts. Limited minutes.
159,26th OF JULY
284,Nokia phone is lovly..
302,No calls..messages..missed calls
319,"We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us"
361,"8 at the latest, g's still there if you can scrounge up some ammo and want to give the new ak a try"
398,Hasn't that been the pattern recently crap weekends?
660,Any chance you might have had with me evaporated as soon as you violated my privacy by stealing my phone number from your employer's paperwork. Not cool at all. Please do not contact me again or I will report you to your supervisor.


Unnamed: 0,Text
9,liked new mobile
130,txting driving
152,unlimited texts limited minutes
159,july
284,nokia phone lovly
302,calls messages missed calls
319,sent jd customer service cum accounts executive ur mail id details contact
361,latest scrounge ammo want new ak try
398,hasn pattern recently crap weekends
660,chance evaporated soon violated privacy stealing phone number employer paperwork cool contact report supervisor


If we take a look at some of the instances where the messages were classified as spam but were ham (above), we can see that they typically contain many spelling errors or contain words pertaining to phone calls/texts or business-related situations. Additionally, some of these words may have appeared only rarely in training, and mostly in spam messages. This would explain the errors: a lot of words like 'hire', 'sms', 'unlimited', 'texts', and 'calls' appear quite often in spam messages, since those often try to scam you. Removing infrequent words here could help, but I felt it might make the classifier work less well given the size of the dataset. Additionally, as noted below for question 3, words not seen in training are skipped during classification and contribute nothing to either score. This could be another reason for the failures.

In [22]:
print_full(wrong_spam)
print_full(wrong_spam_clean)

Unnamed: 0,Text
114,Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
135,More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
363,Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123
500,"Ur balance is now £600. Next question: Complete the landmark, Big, A. Bob, B. Barry or C. Ben ?. Text A, B or C to 83738. Good luck!"
504,"Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50"
604,88066 FROM 88066 LOST 3POUND HELP
673,Burger King - Wanna play footy at a top stadium? Get 2 Burger King before 1st Sept and go Large or Super with Coca-Cola and walk out a winner
876,RCT' THNQ Adrian for U text. Rgds Vatian
885,2/2 146tf150p
953,Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r


Unnamed: 0,Text
114,heard night just knickers make beg like did time xx luv net
135,people dogging area join like minded guys arrange evening minapn
363,email alertfrom jeri stewartsize low cost prescripiton drvgsto listen email
500,ur balance question complete landmark big bob barry ben text good luck
504,oh god ve number glad text xafter msgs cst std ntwk chg
604,lost help
673,burger king wan na play footy stadium burger king sept large super coca cola walk winner
876,rct thnq adrian text rgds vatian
885,
953,hello need posh birds chaps user trial prods champneys need address dob asap ta


If we take a look at some of the instances where messages were classified as ham but were spam (above), we can see that these messages change drastically after going through the cleaning process, and it makes some sense that they were designated as non-spam. A lot of the numbers and symbols were also eliminated; including them probably would have helped, but it would make the classifier's vocabulary larger. For the most part, few words really scream spam after the cleaning, and where they do, there is a chance that, as stated above, they were not in the training set and were therefore skipped during classification. As above, removing infrequent words may help, and we will give that a try down below.

3. <b>Discuss how you deal with words in the test data that you have not seen in the training data.</b>

Words that were not seen during training, and are therefore not part of the vocabulary, are simply omitted from the probability scoring during classification. I felt this was the best approach for the purposes of this project, but stemming could be used for words that differ slightly in form but carry the same meaning. This would allow more words in a given message to be counted, since they wouldn't have to be identical to the words in the vocabulary.
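
One alternative to skipping unseen words would be to give them the zero-count Laplace-smoothed likelihood. The sketch below is not what this notebook does. It only illustrates, using the token totals and vocabulary size reported above, how that alternative would affect the two scores; the names `p_unseen_spam`, `p_unseen_ham`, and `ratio` are introduced here for illustration.

```python
alpha = 1
n_vocab = 6487   # unique training terms reported above
n_spam = 7345    # total spam tokens in the training set
n_ham = 26855    # total ham tokens in the training set

# Smoothed likelihood that a word with zero count would receive in each class
p_unseen_spam = alpha / (n_spam + alpha * n_vocab)
p_unseen_ham = alpha / (n_ham + alpha * n_vocab)

# Factor by which each unseen word would favor spam over ham
ratio = p_unseen_spam / p_unseen_ham
print(f"{ratio:.2f}")  # 2.41
```

Under that alternative, every unseen word would multiply the spam score by about 2.41 relative to the ham score, simply because the spam class has fewer tokens. Skipping unseen words avoids this bias.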

## Bonus task [+1 Point]

Describe how you could improve the classifier.

I believe that I could improve the classifier by simply removing the infrequent words that have a count of only one. Additionally, I feel that stemming could have improved the classifier, since it would extend the reach of the vocabulary to words with different endings (-ly, -ing, etc.). Incorporating numbers in a meaningful way would probably have helped as well. However, just to see how it performs, let's take a look at removing infrequent words with a count <= 1 below...

In [23]:
# randomize the dataset
data_randomized2 = sms_spam.sample(frac=1, random_state=1)

# calculate index for split
training_test_idx = round(len(data_randomized2) * 0.8)

# split into training and test sets
# note: i clean the training data here but also have a function below  
#   that cleans and classifies the test data as it analyzes it.
train2 = data_randomized2[:training_test_idx].reset_index(drop=True)
test2 = data_randomized2[training_test_idx:].reset_index(drop=True)

# check counts
print("Training Set Counts")
display(train2['Class'].value_counts(normalize=True))
print("Test Set Counts")
display(test2['Class'].value_counts(normalize=True))

Training Set Counts


ham     0.86541
spam    0.13459
Name: Class, dtype: float64

Test Set Counts


ham     0.868043
spam    0.131957
Name: Class, dtype: float64

In [24]:
# clean up the training set
# remove punctuation
train2['Text'] = train2['Text'].str.replace(r'\W', ' ', regex=True)
# set all to lowercase
train2['Text'] = train2['Text'].str.lower()
# loop
for i, row in train2.iterrows():
    # tokenize
    tok = nltk.word_tokenize(row['Text'])
    # remove not alpha characters
    tok2 = [x for x in tok if x.isalpha()]
    # remove 1 letter words
    tok3 = [x for x in tok2 if len(x)>=2]
    # remove stopwords and join
    train2.loc[i,'Text_tok'] = " ".join(filter(lambda w: w not in stopwords, tok3))
    # train.loc[i,'Text_tok'] = list(filter(lambda w: w not in stopwords, tok))
    
train2['Text_tok'] = train2['Text_tok'].str.split()
display(train2)

Unnamed: 0,Class,Text,Text_tok
0,ham,yep by the pretty sculpture,"[yep, pretty, sculpture]"
1,ham,yes princess are you going to make me moan,"[yes, princess, going, make, moan]"
2,ham,welp apparently he retired,"[welp, apparently, retired]"
3,ham,havent,[havent]
4,ham,i forgot 2 ask ü all smth there s a card on ...,"[forgot, ask, smth, card, da, present, lei, wa..."
...,...,...,...
4453,ham,sorry i ll call later in meeting any thing re...,"[sorry, ll, later, meeting, thing, related, tr..."
4454,ham,babe i fucking love you too you know fuck...,"[babe, fucking, love, know, fuck, good, hear, ..."
4455,spam,u ve been selected to stay in 1 of 250 top bri...,"[ve, selected, stay, british, hotels, holiday,..."
4456,ham,hello my boytoy geeee i miss you already a...,"[hello, boytoy, geeee, miss, just, woke, wish,..."


In [25]:
from collections import Counter

# obtain list of all vocabulary (every token in the training set)
vocab2 = []
for text in train2['Text_tok']:
    vocab2.extend(text)

# count the occurrences of each word in a single pass
unique_vocab = dict(Counter(vocab2))

print("Number of unique vocab words: ", len(unique_vocab))

Number of unique vocab words:  6487


In [26]:
# drop words that occur only once in the training set
for k,v in list(unique_vocab.items()):
    if v <= 1:
        del unique_vocab[k]
print("Number of unique vocab words occurring >= 2 times: ", len(unique_vocab))

Number of unique vocab words occurring >= 2 times:  3060


In [27]:
# create a document-term count matrix for the train set (one column per vocab word)
text_counts2 = {unique_word: [0] * len(train2['Text_tok']) for unique_word in unique_vocab}
for index, text in enumerate(train2['Text_tok']):
    for word in text:
        # dict membership is a constant-time lookup; no need to rebuild a set each time
        if word in unique_vocab:
            text_counts2[word][index] += 1

In [28]:
# convert to a dataframe
word_counts2 = pd.DataFrame(text_counts2)
word_counts2.head()

Unnamed: 0,yep,pretty,yes,princess,going,make,moan,welp,apparently,havent,...,lyfu,lyf,ali,ke,meow,favorite,pap,rhythm,cme,harry
0,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
# combine the cleaned training dataframe with the new word-count dataframe
train_final2 = pd.concat([train2, word_counts2], axis=1)
train_final2 = train_final2.drop(['Text'], axis=1)
train_final2.head()

Unnamed: 0,Class,Text_tok,yep,pretty,yes,princess,going,make,moan,welp,...,lyfu,lyf,ali,ke,meow,favorite,pap,rhythm,cme,harry
0,ham,"[yep, pretty, sculpture]",1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ham,"[yes, princess, going, make, moan]",0,0,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
2,ham,"[welp, apparently, retired]",0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,ham,[havent],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ham,"[forgot, ask, smth, card, da, present, lei, wa...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
# Laplace smoothing
alpha = 1
# separate spam & ham
spam2 = train_final2[train_final2['Class'] == 'spam']
ham2 = train_final2[train_final2['Class'] == 'ham']
# num vocab
n_vocab2 = len(unique_vocab)

In [32]:
# calc P(spam) and P(ham)
p_spam2 = len(spam2) / len(train_final2)
print("P(spam): ", p_spam2)
p_ham2 = len(ham2) / len(train_final2)
print("P(ham): ", p_ham2)

# total number of tokens across all spam messages
# (note: this also counts tokens that were pruned from the vocabulary)
num_words_spam2 = spam2['Text_tok'].apply(len)
n_spam2 = num_words_spam2.sum()
print("Length of all spam words: ", n_spam2)

# total number of tokens across all ham messages
num_words_ham2 = ham2['Text_tok'].apply(len)
n_ham2 = num_words_ham2.sum()
print("Length of all ham words: ", n_ham2)

# init P(w|spam) and P(w|ham)
params_spam2 = {word:0 for word in unique_vocab}
params_ham2 = {word:0 for word in unique_vocab}

# calc P(w|spam) and P(w|ham)
for word in unique_vocab:
    n_word_given_spam2 = spam2[word].sum()
    p_word_given_spam2 = (n_word_given_spam2 + alpha) / (n_spam2 + (alpha * n_vocab2))
    params_spam2[word] = p_word_given_spam2

    n_word_given_ham2 = ham2[word].sum()
    p_word_given_ham2 = (n_word_given_ham2 + alpha) / (n_ham2 + (alpha * n_vocab2))
    params_ham2[word] = p_word_given_ham2

P(spam):  0.13458950201884254
P(ham):  0.8654104979811574
Length of all spam words:  7345
Length of all ham words:  26855
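
The per-word estimates computed in the cell above follow the usual Laplace-smoothed form. Written out with the notebook's variables, where $N_{w,\text{spam}}$ stands for `n_word_given_spam2`, $N_{\text{spam}}$ for `n_spam2`, $\alpha$ for `alpha` and $|V|$ for `n_vocab2` (the symbols themselves are introduced here only for illustration):

```latex
P(w \mid \text{spam}) = \frac{N_{w,\text{spam}} + \alpha}{N_{\text{spam}} + \alpha\,|V|},
\qquad
P(w \mid \text{ham}) = \frac{N_{w,\text{ham}} + \alpha}{N_{\text{ham}} + \alpha\,|V|}
```

With $\alpha = 1$, $|V| = 3060$ and the token totals printed above, a vocabulary word that never occurs in spam gets $1 / (7345 + 3060) = 1/10405 \approx 9.6 \times 10^{-5}$, and one that never occurs in ham gets $1 / (26855 + 3060) = 1/29915 \approx 3.3 \times 10^{-5}$, so no word ends up with probability zero.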


In [33]:
# function to classify a single message
def classify2(message):
    # replace non-word characters (punctuation) with spaces
    message = re.sub(r'\W', ' ', message)
    # lower case
    message = message.lower()
    # tokenize
    message = nltk.word_tokenize(message)
    # keep only purely alphabetic tokens
    message = [x for x in message if x.isalpha()]
    # remove stopwords (the result stays a list of tokens)
    message = list(filter(lambda w: w not in stopwords, message))
    # start from the class priors
    p_spam_given_message2 = p_spam2
    p_ham_given_message2 = p_ham2
    # loop through words in given message; words outside the vocabulary are skipped
    for word in message:
        # iteratively calculate score for spam
        if word in params_spam2:
            p_spam_given_message2 *= params_spam2[word]
        # iteratively calculate score for ham
        if word in params_ham2:
            p_ham_given_message2 *= params_ham2[word]

    if p_ham_given_message2 > p_spam_given_message2:
        return 'ham'
    elif p_spam_given_message2 > p_ham_given_message2:
        return 'spam'
    else:
        # exact tie between the two scores
        return 'probabilities = 50%'
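
`classify2` multiplies many small probabilities together. SMS messages are short, so this is unlikely to be a problem here, but for long inputs the product can underflow to zero for both classes. A common alternative is to add log-probabilities instead. The sketch below is not the notebook's implementation: `classify_log` and the toy parameters are made up for illustration, and ties are resolved as 'ham' for simplicity.

```python
import math

def classify_log(tokens, p_spam, p_ham, params_spam, params_ham):
    """Score an already tokenized message in log space; returns 'spam' or 'ham'."""
    log_spam = math.log(p_spam)
    log_ham = math.log(p_ham)
    for word in tokens:
        # words outside the vocabulary are skipped, as in classify2
        if word in params_spam and word in params_ham:
            log_spam += math.log(params_spam[word])
            log_ham += math.log(params_ham[word])
    # assumption of this sketch: a tie goes to 'ham'
    return 'spam' if log_spam > log_ham else 'ham'

# toy parameters (made up for illustration, not taken from the notebook)
toy_spam = {"free": 0.05, "call": 0.02, "later": 0.001}
toy_ham = {"free": 0.002, "call": 0.01, "later": 0.02}

print(classify_log(["free", "call"], 0.13, 0.87, toy_spam, toy_ham))
print(classify_log(["call", "later"], 0.13, 0.87, toy_spam, toy_ham))
# a very long message: the plain products would both underflow to 0.0,
# while the log scores remain comparable
print(classify_log(["later"] * 400, 0.13, 0.87, toy_spam, toy_ham))
```

Because the logarithm is monotonic, comparing the summed logs gives the same decision as comparing the products whenever the products do not underflow.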

In [34]:
# store off prediction in test dataframe
test_classify2 = test2.copy()
test_classify2['prediction'] = test_classify2['Text'].apply(classify2)
test_classify2.head()

Unnamed: 0,Class,Text,prediction
0,ham,Later i guess. I needa do mcat study too.,ham
1,ham,But i haf enuff space got like 4 mb...,ham
2,spam,Had your mobile 10 mths? Update to latest Oran...,spam
3,ham,All sounds good. Fingers . Makes it difficult ...,ham
4,ham,"All done, all handed in. Don't know if mega sh...",ham


In [35]:
correct2 = 0
total2 = test_classify2.shape[0]
# loop through dataframe to obtain number correct
for row in test_classify2.iterrows():
    # if correct add 1 to correct var
    if row[1]['Class'] == row[1]['prediction']:
        correct2 += 1
# print results
print('Total:', total2)
print('Correct:', correct2)
print('Incorrect:', total2 - correct2)
print('Accuracy: {}%'.format((correct2/total2)*100))

Total: 1114
Correct: 1086
Incorrect: 28
Accuracy: 97.48653500897666%


In [36]:
from sklearn.metrics import classification_report,confusion_matrix
print("Confusion Matrix: ")
print_cm(confusion_matrix(test_classify2['Class'],test_classify2['prediction']),['ham','spam'])
print("")
print("Classification Report: ")
print(classification_report(test_classify2['Class'],test_classify2['prediction']))

Confusion Matrix: 
     t/p    ham  spam 
      ham 948.0  19.0 
     spam   9.0 138.0 

Classification Report: 
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       967
        spam       0.88      0.94      0.91       147

    accuracy                           0.97      1114
   macro avg       0.93      0.96      0.95      1114
weighted avg       0.98      0.97      0.98      1114
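
As a sanity check, the spam row of the classification report can be recomputed by hand from the confusion matrix. The sketch below treats spam as the positive class and copies the four counts from the output above; the variable names are chosen here for illustration.

```python
# counts copied from the confusion matrix above (rows = true class, columns = predicted)
tn, fp = 948, 19   # true ham predicted as ham / as spam
fn, tp = 9, 138    # true spam predicted as ham / as spam

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam precision: {precision:.2f}")
print(f"spam recall:    {recall:.2f}")
print(f"spam f1-score:  {f1:.2f}")
print(f"accuracy:       {accuracy:.2f}")
```

Rounded to two decimals, these values agree with the spam row and the accuracy line of the report (0.88, 0.94, 0.91 and 0.97).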



As seen above, removing the words that occur only once actually hurt our classifier's scores across the board. The pruned vocabulary (3060 words, down from 6487) was about half the size of the previous classifier's, but we saw an increase in the false negative count and a decrease in accuracy, which is not ideal. I believe that removing these infrequent words hurt because I only ever process alphabetic tokens (a-z); every token containing a non-alphabetic character is thrown out. This keeps odd symbols out, but it also makes the remaining infrequent words more important to classification. It wasn't a terrible classifier, but the one that included all unique vocabulary performed better. Overall, there are many approaches to this problem, and I have explored two, one of which worked out quite well.
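
One way the digits that are currently thrown out could be kept is to map every run of digits to a single placeholder token before the rest of the cleaning. This is only a sketch of the idea and was not evaluated in this notebook: `normalize_numbers` and the placeholder name `numtoken` are hypothetical, and the example message is made up. The placeholder is purely alphabetic so that it would survive the `isalpha()` filter used above.

```python
import re

def normalize_numbers(text, placeholder="numtoken"):
    """Replace every run of digits with a purely alphabetic placeholder token."""
    return re.sub(r'\d+', f' {placeholder} ', text)

# made-up example message
example = "Call 09061701461 now to claim your 1000 prize"
print(normalize_numbers(example).split())
```

To have any effect, the same step would need to run on both the training text and inside the classify function, before punctuation removal and tokenization.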