# Naive Bayes Classifier for Spam Detection

## Instructions

Total Points: 10

Complete this notebook and submit it. The notebook needs to be a complete project report with 

* your implementation,
* documentation including a short discussion of how your implementation works and your design choices, and
* experimental results (e.g., tables and charts with simulation results) with a short discussion of what they mean. 

Use the provided notebook cells and insert additional code and markdown cells as needed.

## Introduction

A spam detection agent receives text messages as its percepts and needs to decide whether each one is spam or not.
Create a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for the 
[UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to perform this task.

__About the use of libraries:__ The point of this exercise is to learn how a Bayes classifier is built. You may use libraries for tokenizing, stop words and to create a document-term matrix, but you need to implement parameter estimation and prediction yourself.

## Create a bag-of-words representation of the text messages [3 Points]

The first step is to tokenize the text. Here is an example of how to load the data as a Pandas dataframe and then how to use the [natural language tool kit (nltk)](https://www.nltk.org/) to create tokens (separate terms) for the first message in the dataset.

In [1]:
import pandas as pd

data = pd.read_csv("smsspamcollection/SMSSpamCollection", sep='\t',header = None,names = ["label","sms"])
data.head()

Unnamed: 0,label,sms
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [2]:
# ! pip install nltk
import nltk
# You need to install nltk and then download the Punkt tokenizer models (recent NLTK releases may ask for 'punkt_tab' instead).
# nltk.download('punkt')

message = data.at[0,'sms']
label = data.at[0,'label']
print(f"message: {message} (label: {label})")

tokens = nltk.word_tokenize(message)
print(f"tokens: {tokens}")

message: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... (label: ham)
tokens: ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


Experiment with removing frequent words (called [stopwords](https://en.wikipedia.org/wiki/Stop_word)) and very infrequent words so you end up with a reasonable number of words used in the classifier. Maybe you need to remove digits or all non-letter characters. You may also use a stemming algorithm. 

Convert the tokenized data into a data structure that indicates for each document what words it contains. The data structure can be a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) with 0s and 1s, a pandas dataframe or some sparse matrix structure. Note: words, tokens and terms are often used interchangeably. Make sure the data structure can be used to split the data into training and test documents (see below). 

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

#vectorizer = CountVectorizer(binary = True, stop_words='english')
vectorizer = CountVectorizer(binary = True)
X = vectorizer.fit_transform(data['sms'])
X

<5572x8713 sparse matrix of type '<class 'numpy.int64'>'
	with 74169 stored elements in Compressed Sparse Row format>

In [4]:
X_spam = X[data['label'] == "spam"]
X_ham = X[data['label'] == "ham"]

In [5]:
terms = pd.DataFrame({
    'term' : vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    'doc_freq' : np.squeeze(np.asarray(X.sum(axis = 0))),
    'spam_freq' : np.squeeze(np.asarray(X_spam.sum(axis = 0))),
    'ham_freq' : np.squeeze(np.asarray(X_ham.sum(axis = 0)))
})

terms = terms.sort_values('doc_freq', ascending=False)

len(terms)

8713

In [6]:
terms.head(n = 20)

Unnamed: 0,term,doc_freq,spam_freq,ham_freq
7806,to,1687,468,1219
8668,you,1591,242,1349
7674,the,1035,167,868
4114,in,810,70,740
1097,and,795,108,687
4233,is,752,142,610
4968,me,690,27,663
3323,for,624,179,445
4245,it,614,29,585
5254,my,613,12,601


In [7]:
terms.tail(n = 20)

Unnamed: 0,term,doc_freq,spam_freq,ham_freq
3814,healer,1,0,1
3813,heal,1,0,1
3812,headstart,1,0,1
3811,headset,1,1,0
3804,hdd,1,0,1
3803,hcl,1,0,1
3801,havn,1,0,1
3798,haventcn,1,0,1
3795,havebeen,1,0,1
3793,havbeen,1,0,1


Report the 20 most frequent and the 20 least frequent words and their frequency in your data set. Remember: words are only counted once per document.

In [8]:
# Code that prints the tables with the words

## Learn parameters [3 Points]

Use 80% of the data (called training set; randomly chosen) to learn the parameters of the naive Bayes classifier (prior probabilities and likelihoods). 
Remember, the naive Bayes classifier assumes conditional independence between words and estimates posterior probabilities as:

$$\hat{P}(spam|message) \propto score_{spam}(message) = P(spam) \prod_{i=1}^n P(w_i | spam)$$
$$\hat{P}(ham|message) \propto score_{ham}(message) = P(ham) \prod_{i=1}^n P(w_i | ham)$$

Messages are classified as spam if the posterior probability for spam is larger than for ham, which is
equivalent to 
$$score_{spam}(message) > score_{ham}(message)$$ 
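
A product of many small probabilities can underflow in floating-point arithmetic, so the comparison is commonly carried out on logarithms instead. As a sketch under the same model, taking the logarithm of both scores gives

$$\log score_{spam}(message) = \log P(spam) + \sum_{i=1}^n \log P(w_i | spam)$$
$$\log score_{ham}(message) = \log P(ham) + \sum_{i=1}^n \log P(w_i | ham)$$

Because the logarithm is strictly increasing, $score_{spam}(message) > score_{ham}(message)$ holds exactly when $\log score_{spam}(message) > \log score_{ham}(message)$, so the decision rule is unchanged.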

You therefore need to
estimate 

* the priors $P(spam)$ and $P(ham)$ for messages, and 
* the likelihoods $P(w_i | spam)$ and $P(w_i | ham)$ for all the words/tokens you chose to use

from counts obtained from the training data. Use [Laplacian smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) for the estimation of
likelihoods to avoid likelihoods of zero.
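
A minimal sketch of additive smoothing, assuming each word is treated as a binary feature (present or absent in a message), as in a 0/1 document-term matrix. The helper `smoothed_likelihood` and the counts are hypothetical and only illustrate the arithmetic.

```python
def smoothed_likelihood(doc_count, n_docs, alpha=1.0):
    """Estimate P(word present | class) with additive smoothing.

    doc_count: number of class documents that contain the word
    n_docs:    total number of documents in the class
    alpha:     pseudo-count added to each of the two outcomes (present/absent)
    """
    return (doc_count + alpha) / (n_docs + 2 * alpha)


# Hypothetical counts: a word seen in 0, 3, and all 10 of 10 spam messages.
for count in (0, 3, 10):
    print(count, smoothed_likelihood(count, n_docs=10))
```

With `alpha = 1` the estimate never reaches 0 or 1, so neither the logarithm of the likelihood nor the logarithm of its complement is undefined.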

In [9]:
# split data
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score

y = data["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Report the prior probabilities.

In [10]:
n_ham = sum(y_train == 'ham')
P_ham = n_ham / len(y_train)
P_ham

0.8658290329818263

In [11]:
n_spam = sum(y_train == 'spam')
P_spam = n_spam / len(y_train)
P_spam

0.13417096701817366

In [12]:
# calculate log of priors for later
log_P_ham = np.log(P_ham)
log_P_spam = np.log(P_spam)

In [13]:
# calculate term frequencies (column sums)
terms = pd.DataFrame({
    'term' : vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    'doc_freq' : np.squeeze(np.asarray(X_train.sum(axis = 0))),
    'spam_freq' : np.squeeze(np.asarray(X_train[y_train == "spam"].sum(axis = 0))),
    'ham_freq' : np.squeeze(np.asarray(X_train[y_train == "ham"].sum(axis = 0)))
})

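# for Laplace smoothing
alpha = 1
# each term is a binary feature (present/absent), so alpha is added once per outcome in the denominator
n_outcomes = 2

# calculate likelihoods and log likelihoods
terms['likelihood_spam'] = (terms['spam_freq'] + alpha) / (n_spam + alpha * n_outcomes) 
terms['likelihood_ham'] = (terms['ham_freq'] + alpha) / (n_ham + alpha * n_outcomes) 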

terms['likelihood_ratio'] = terms['likelihood_spam'] / terms['likelihood_ham']

terms['log_likelihood_spam'] = np.log(terms['likelihood_spam']) 
terms['log_likelihood_ham'] = np.log(terms['likelihood_ham']) 

# we also need the log likelihood of a term not being in spam/ham
terms['log_likelihood_not_spam'] = np.log(1-terms['likelihood_spam']) 
terms['log_likelihood_not_ham'] = np.log(1-terms['likelihood_ham']) 

terms.head()

Unnamed: 0,term,doc_freq,spam_freq,ham_freq,likelihood_spam,likelihood_ham,likelihood_ratio,log_likelihood_spam,log_likelihood_ham,log_likelihood_not_spam,log_likelihood_not_ham
0,00,8,8,0,0.015,0.000259,57.915,-4.199705,-8.258681,-0.015114,-0.000259
1,000,22,22,0,0.038333,0.000259,148.005,-3.261435,-8.258681,-0.039087,-0.000259
2,000pes,1,0,1,0.001667,0.000518,3.2175,-6.39693,-7.565534,-0.001668,-0.000518
3,008704050406,2,2,0,0.005,0.000259,19.305,-5.298317,-8.258681,-0.005013,-0.000259
4,0089,1,1,0,0.003333,0.000259,12.87,-5.703782,-8.258681,-0.003339,-0.000259


Report the top 20 words (highest conditional probability) for ham and for spam. These words represent the biggest clues that a message is ham or spam.
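
Ranking by the conditional probability itself and ranking by the ratio of the two conditional probabilities can give different answers: very common words score high for both classes, while the ratio highlights words that separate the classes. The following sketch uses made-up likelihood values (the frame `toy_terms` is hypothetical) to show the difference.

```python
import pandas as pd

# Made-up likelihoods for four terms (illustrative only).
toy_terms = pd.DataFrame({
    "term": ["to", "you", "free", "later"],
    "likelihood_spam": [0.60, 0.30, 0.25, 0.002],
    "likelihood_ham": [0.30, 0.35, 0.01, 0.03],
})
toy_terms["likelihood_ratio"] = toy_terms["likelihood_spam"] / toy_terms["likelihood_ham"]

# Top spam terms by P(w | spam) versus by P(w | spam) / P(w | ham).
top_by_likelihood = toy_terms.nlargest(2, "likelihood_spam")["term"].tolist()
top_by_ratio = toy_terms.nlargest(2, "likelihood_ratio")["term"].tolist()

print(top_by_likelihood)
print(top_by_ratio)
```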

In [14]:
# ham
terms.sort_values('likelihood_ratio', ascending=True).head(n = 20)

Unnamed: 0,term,doc_freq,spam_freq,ham_freq,likelihood_spam,likelihood_ham,likelihood_ratio,log_likelihood_spam,log_likelihood_ham,log_likelihood_not_spam,log_likelihood_not_ham
4793,lt,195,0,195,0.001667,0.050764,0.032832,-6.39693,-2.980567,-0.001668,-0.052098
3684,gt,194,0,194,0.001667,0.050505,0.033,-6.39693,-2.985682,-0.001668,-0.051825
3805,he,131,0,131,0.001667,0.034188,0.04875,-6.39693,-3.37588,-0.001668,-0.034786
4747,lor,114,0,114,0.001667,0.029785,0.055957,-6.39693,-3.513749,-0.001668,-0.030238
4550,later,111,0,111,0.001667,0.029008,0.057455,-6.39693,-3.540183,-0.001668,-0.029437
2428,da,106,0,106,0.001667,0.027713,0.06014,-6.39693,-3.585853,-0.001668,-0.028104
6843,she,94,0,94,0.001667,0.024605,0.067737,-6.39693,-3.704805,-0.001668,-0.024913
5533,oh,86,0,86,0.001667,0.022533,0.073966,-6.39693,-3.792773,-0.001668,-0.022791
2714,doing,74,0,74,0.001667,0.019425,0.0858,-6.39693,-3.941193,-0.001668,-0.019616
1054,already,67,0,67,0.001667,0.017612,0.094632,-6.39693,-4.039174,-0.001668,-0.017769


In [15]:
# spam
terms.sort_values('likelihood_ratio', ascending=True).tail(n = 20)

Unnamed: 0,term,doc_freq,spam_freq,ham_freq,likelihood_spam,likelihood_ham,likelihood_ratio,log_likelihood_spam,log_likelihood_ham,log_likelihood_not_spam,log_likelihood_not_ham
364,16,42,41,1,0.07,0.000518,135.135,-2.65926,-7.565534,-0.072571,-0.000518
2150,collection,21,21,0,0.036667,0.000259,141.57,-3.305887,-8.258681,-0.037356,-0.000259
1,000,22,22,0,0.038333,0.000259,148.005,-3.261435,-8.258681,-0.039087,-0.000259
8375,weekly,22,22,0,0.038333,0.000259,148.005,-3.261435,-8.258681,-0.039087,-0.000259
6284,rate,25,25,0,0.043333,0.000259,167.31,-3.138833,-8.258681,-0.0443,-0.000259
6525,ringtone,25,25,0,0.043333,0.000259,167.31,-3.138833,-8.258681,-0.0443,-0.000259
356,150ppm,26,26,0,0.045,0.000259,173.745,-3.101093,-8.258681,-0.046044,-0.000259
8016,uk,54,53,1,0.09,0.000518,173.745,-2.407946,-7.565534,-0.094311,-0.000518
8596,www,82,80,2,0.135,0.000777,173.745,-2.002481,-7.160069,-0.145026,-0.000777
1333,awarded,32,32,0,0.055,0.000259,212.355,-2.900422,-8.258681,-0.05657,-0.000259


## Evaluate the classification performance [4 Points] 

Classify the remaining 20% of the data (test set) and calculate classification accuracy. Accuracy is defined as the proportion of correctly classified test documents (see https://github.com/mhahsler/CS7320-AI/blob/master/ML/ML_example.ipynb).
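
As a small illustration of the definition, accuracy and the majority-class baseline can be computed by hand. The labels below are invented for the example and are not taken from the data set.

```python
import numpy as np

# Invented true and predicted labels for five messages.
y_true = np.array(["ham", "ham", "spam", "ham", "spam"])
y_pred = np.array(["ham", "spam", "spam", "ham", "ham"])

# Accuracy: proportion of positions where prediction and truth agree.
accuracy = np.mean(y_true == y_pred)

# Majority-class baseline: always predict the most frequent true label.
labels, counts = np.unique(y_true, return_counts=True)
majority = labels[np.argmax(counts)]
baseline_accuracy = np.mean(y_true == majority)

print(accuracy, majority, baseline_accuracy)
```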

1. How good is your classifier's accuracy compared to the baseline classifier that always predicts the majority class? You can also compare your classifier with the [Naive Bayes classifier implemented in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). 

2. Inspect a few misclassified text messages and discuss why the classification failed.

3. Discuss how you deal with words in the test data that you have not seen in the training data.
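
Regarding unseen words: if the document-term matrix is built with a fitted `CountVectorizer`, words that are not in the fitted vocabulary are silently dropped by `transform`, so they contribute nothing to either score. The sketch below demonstrates this on a toy example; the messages and the name `unseen_vectorizer` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on two toy "training" messages.
train_msgs = ["free prize inside", "see you later"]
unseen_vectorizer = CountVectorizer(binary=True)
unseen_vectorizer.fit(train_msgs)

# "unicorn" never occurred in the training messages.
new_row = unseen_vectorizer.transform(["free unicorn later"])
known = [t for t in ["free", "unicorn", "later"] if t in unseen_vectorizer.vocabulary_]

print(known)          # terms of the new message that are in the vocabulary
print(new_row.shape)  # one row, one column per training term
print(new_row.sum())  # number of known terms that were found
```

Ignoring unseen words is one possible design choice; another would be to reserve an explicit "unknown" token during training.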

### Weak baseline test accuracy (always predict ham)

In [16]:
accuracy_score(['ham'] * len(y_test), y_test)

0.8663677130044843

### Strong baseline test accuracy

In [17]:
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(alpha = 1)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)

accuracy_score(pred, y_test)

0.9847533632286996

In [18]:
# look at model parameters
print(f"Priors (log): {clf.class_log_prior_}")
print(f"Loglikelihoods: {clf.feature_log_prob_}")

Priors (log): [-0.14406781 -2.00864042]
Loglikelihoods: [[-8.2586815  -8.2586815  -7.56553432 ... -7.56553432 -8.2586815
  -8.2586815 ]
 [-4.19970508 -3.26143544 -6.39692966 ... -6.39692966 -5.70378247
  -6.39692966]]


In [19]:
# predicted class probabilities (columns: ham, spam) for the first five test examples
clf.predict_proba(X_test[0:5])

array([[9.99999991e-01, 9.16080713e-09],
       [9.99999970e-01, 3.02570471e-08],
       [1.00000000e+00, 3.26475627e-13],
       [1.00000000e+00, 1.29121253e-13],
       [9.99999853e-01, 1.47208670e-07]])

### Implementation

Calculate scores by summing the log likelihoods for words in the message and words not in the message. 
Since the messages are short, it is more efficient to start with the score for a message that contains none of the words and then, for each word that is present, swap its "absent" log likelihood for its "present" log likelihood. 
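
The shortcut can be checked on a toy example. The sketch below uses hypothetical likelihoods for a four-term vocabulary and one message, and compares the direct score with the shortcut; the prior is left out because it is the same constant in both versions.

```python
import numpy as np

# Hypothetical smoothed likelihoods P(word present | class) for a 4-term vocabulary.
p = np.array([0.20, 0.05, 0.50, 0.01])
log_p = np.log(p)
log_not_p = np.log(1 - p)

# Binary indicator row for one message: terms 0 and 2 are present.
x = np.array([1, 0, 1, 0])

# Direct log score: present terms use log p, absent terms use log(1 - p).
direct = np.sum(x * log_p + (1 - x) * log_not_p)

# Shortcut: start from "no term present" and correct only the present terms.
shortcut = log_not_p.sum() + x @ log_p - x @ log_not_p

print(direct, shortcut, np.isclose(direct, shortcut))
```

The sum over all "absent" log likelihoods is computed once, and each message then only needs two sparse dot products.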

In [20]:
sum_score_not_ham = sum(terms['log_likelihood_not_ham'])
sum_score_not_spam = sum(terms['log_likelihood_not_spam'])

# @ is the dot product
pred = pd.DataFrame({
    'score_ham' : np.exp(X_test @ terms['log_likelihood_ham'] + 
                         (sum_score_not_ham - X_test @ terms['log_likelihood_not_ham']) + 
                         log_P_ham),
    'score_spam': np.exp(X_test @ terms['log_likelihood_spam'] + 
                         (sum_score_not_spam - X_test @ terms['log_likelihood_not_spam']) + 
                         log_P_spam)
})

# normalize to probabilities
pred['P_ham'] = pred['score_ham'] / (pred['score_ham'] + pred['score_spam'])
pred['P_spam'] = pred['score_spam'] / (pred['score_ham'] + pred['score_spam'])

pred.head()

Unnamed: 0,score_ham,score_spam,P_ham,P_spam
0,3.7988199999999996e-51,3.480026e-59,1.0,9.160807e-09
1,6.083092e-52,1.840564e-59,1.0,3.025705e-08
2,6.659536e-35,2.174176e-47,1.0,3.264756e-13
3,1.583173e-25,2.0442119999999997e-38,1.0,1.291213e-13
4,2.8147509999999998e-42,4.143558e-49,1.0,1.472087e-07
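
The raw scores above are already as small as 1e-59. For longer messages or larger vocabularies, exponentiating the log scores before normalizing could underflow to zero. One possible alternative, sketched here with hypothetical log scores, is to normalize in log space with `np.logaddexp`.

```python
import numpy as np

# Hypothetical log scores (log prior + summed log likelihoods) for three messages.
log_score_ham = np.array([-115.0, -900.0, -40.0])
log_score_spam = np.array([-133.0, -880.0, -41.0])

# log(exp(a) + exp(b)) computed without leaving log space.
log_norm = np.logaddexp(log_score_ham, log_score_spam)

# Normalized probabilities; the differences are small enough to exponentiate safely.
P_ham_stable = np.exp(log_score_ham - log_norm)
P_spam_stable = np.exp(log_score_spam - log_norm)

# Naive exponentiation loses the second message entirely: exp(-900) underflows to 0.0.
naive_ham = np.exp(log_score_ham)

print(P_ham_stable)
print(P_spam_stable)
print(naive_ham)
```

For classification alone the normalization is not needed at all, since comparing the two log scores gives the same decision.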


In [21]:
pred['class'] = (pred['P_spam'] > pred['P_ham']).map({True : 'spam', False : 'ham'})
pred.head()

Unnamed: 0,score_ham,score_spam,P_ham,P_spam,class
0,3.7988199999999996e-51,3.480026e-59,1.0,9.160807e-09,ham
1,6.083092e-52,1.840564e-59,1.0,3.025705e-08,ham
2,6.659536e-35,2.174176e-47,1.0,3.264756e-13,ham
3,1.583173e-25,2.0442119999999997e-38,1.0,1.291213e-13,ham
4,2.8147509999999998e-42,4.143558e-49,1.0,1.472087e-07,ham


In [22]:
accuracy_score(pred['class'], y_test)

0.9847533632286996

In [23]:
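# note: arguments are passed as (predicted, true), so rows are predicted labels and columns are true labels (order: ham, spam)
confusion_matrix(pred['class'], y_test)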

array([[963,  14],
       [  3, 135]])

In [27]:
# to find the misclassified messages, re-split the original data with the same random_state so the rows align with X_test
message_train, message_test, y_train, y_test = train_test_split(data['sms'], y, test_size=0.2, random_state=42)

In [29]:
res = pd.DataFrame({'message' : message_test.reset_index(drop = True), 
              'y' : y_test.reset_index(drop = True), 
              'pred' : pred['class'].reset_index(drop = True)
                   })
             
res[ res.y != res.pred ].sort_values(by = 'y')

Unnamed: 0,message,y,pred
707,"Keep ur problems in ur heart, b'coz nobody wil...",ham,spam
562,Nowadays people are notixiquating the laxinorf...,ham,spam
360,Si.como no?!listened2the plaid album-quite gd&...,ham,spam
40,Reminder: You have not downloaded the content ...,spam,ham
927,PRIVATE! Your 2003 Account Statement for 078,spam,ham
678,Mobile Club: Choose any of the top quality ite...,spam,ham
635,Sorry I missed your call let's talk when you h...,spam,ham
561,Talk sexy!! Make new friends or fall in love i...,spam,ham
550,Bloomberg -Message center +447797706009 Why wa...,spam,ham
324,Bloomberg -Message center +447797706009 Why wa...,spam,ham


## Bonus task [+1 Point]

Describe how you could improve the classifier. Implement and test one of the improvements.

In [26]:
# Code goes here!