# Naive Bayes Classifier for Spam Detection

## Instructions

Total Points: 10

Complete this notebook and submit it. The notebook needs to be a complete project report with:

* your implementation,
* documentation including a short discussion of how your implementation works and your design choices, and
* experimental results (e.g., tables and charts with simulation results) with a short discussion of what they mean. 

Use the provided notebook cells and insert additional code and markdown cells as needed.

## Introduction

A spam detection agent receives text messages as its percepts and needs to decide whether each one is spam or not.
Create a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for the 
[UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to perform this task.

__About the use of libraries:__ The point of this exercise is to learn how a Bayes classifier is built. You may use libraries for tokenizing, removing stop words, and creating a document-term matrix, but you need to implement parameter estimation and prediction yourself.

## Create a bag-of-words representation of the text messages [3 Points]

The first step is to tokenize the text. Here is an example of how to load the data as a Pandas dataframe and then how to use the [natural language tool kit (nltk)](https://www.nltk.org/) to create tokens (separate terms) for the first message in the dataset.

In [1]:
import pandas as pd

data = pd.read_csv("smsspamcollection/SMSSpamCollection", sep="\t", header=None, names=["label", "sms"])
data.head()

|   | label | sms |
|---|-------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [2]:
import nltk
# You need to install nltk and then download the Punkt tokenizer models used by word_tokenize.
# ! pip install nltk
# nltk.download('punkt')

message = data.at[0,'sms']
label = data.at[0,'label']
print(f"message: {message} (label: {label})")

tokens = nltk.word_tokenize(message)
print(f"tokens: {tokens}")

message: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat... (label: ham)
tokens: ['Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


Experiment with removing frequent words (called [stopwords](https://en.wikipedia.org/wiki/Stop_word)) and very infrequent words so you end up with a reasonable number of words used in the classifier. You may also need to remove digits or all non-letter characters, and you may use a stemming algorithm.

Convert the tokenized data into a data structure that indicates for each document which words it contains. The data structure can be a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) with 0s and 1s, a pandas dataframe or some sparse matrix structure. Note: words, tokens and terms are often used interchangeably. Make sure the data structure can be used to split the data into training and test documents (see below). A very convenient way to create document-term matrices is implemented in sklearn as the
text feature extractor [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).
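
As a rough illustration of the target data structure, the following sketch builds a binary document-term matrix with only the standard library. The three toy messages and the helper `tokenize` are made up for this example and are not part of the assignment; a real solution would start from the tokenized SMS data and likely use a sparse structure.

```python
import re

# Toy messages standing in for the SMS data (hypothetical examples).
messages = [
    "Free entry to win a prize",
    "Ok see you at lunch",
    "Win a free prize now",
]

def tokenize(text):
    """Lowercase the text and keep only runs of letters (drops digits and punctuation)."""
    return re.findall(r"[a-z]+", text.lower())

# Each document becomes a set of terms, so a term counts at most once per document.
docs = [set(tokenize(m)) for m in messages]

# Sorted vocabulary so that the column order is reproducible.
vocabulary = sorted(set().union(*docs))

# Binary document-term matrix: one row per message, one column per term.
dtm = [[1 if term in doc else 0 for term in vocabulary] for doc in docs]

for message, row in zip(messages, dtm):
    print(row, message)
```

With sklearn, `CountVectorizer(binary=True)` should give an equivalent 0/1 matrix in sparse form, and its `stop_words`, `min_df` and `max_df` parameters can be used to drop very frequent and very infrequent terms.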

In [3]:
# Description and code goes here!

Report the 20 most frequent and the 20 least frequent words and their frequency in your data set. Remember: words are only counted once per document.
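
A minimal sketch of counting document frequencies, assuming each document is already available as a set of terms (so repeated words inside one message are counted once). The toy documents are hypothetical, and the sketch reports the top 2 instead of the top 20 to keep the output short.

```python
from collections import Counter

# Hypothetical documents, each already reduced to its set of distinct terms.
docs = [
    {"free", "win", "prize"},
    {"ok", "lunch"},
    {"free", "prize", "now"},
    {"free", "lunch"},
]

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in doc)

# Sort by count, breaking ties alphabetically so the result is reproducible.
most_frequent = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
least_frequent = sorted(doc_freq.items(), key=lambda kv: (kv[1], kv[0]))[:2]

print("most frequent:", most_frequent)
print("least frequent:", least_frequent)
```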

In [4]:
# Code that prints the tables with the words

## Prepare training/testing data [0 Points]

Split the data randomly into 80% for training and 20% for testing ([Example](https://github.com/mhahsler/CS7320-AI/blob/master/ML/ML_example.ipynb)). You can do the split directly on your document-term data structure.
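
One possible way to do the split, sketched with the standard library only: shuffle the row indices once with a fixed seed and cut them at the 20% mark. The function name `train_test_split_indices` and its parameters are assumptions for this example; the linked notebook may use a different approach.

```python
import random

def train_test_split_indices(n_docs, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) for a random split of n_docs rows."""
    indices = list(range(n_docs))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = round(n_docs * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = train_test_split_indices(10)
print("train:", train_idx)
print("test:", test_idx)
```

The same index lists can then be used to select rows from both the document-term structure and the label column, which keeps documents and labels aligned.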

In [5]:
# Code goes here!

## Learn parameters [3 Points]

Use the training set to learn the parameters of the naive Bayes classifier.
Remember, the naive Bayes classifier assumes conditional independence between words and estimates posterior probabilities as:

$$\hat{P}(spam|message) \propto score_{spam}(message) = P(spam) \prod_{i=1}^n P(w_i | spam)$$
$$\hat{P}(ham|message) \propto score_{ham}(message) = P(ham) \prod_{i=1}^n P(w_i | ham)$$

Note that $w_i$ is an indicator that is true (term is in the message) or false (term is not in the message), and each of these outcomes has a likelihood. That means you need to consider not only the likelihoods that
a term appears in ham/spam messages, but also the likelihoods that it does not appear in ham/spam messages.
Trivially, $P(w_i = true | spam) = 1 - P(w_i = false | spam)$ (same for ham).

Messages are classified as spam if the posterior probability for spam is larger than for ham, which is
equivalent to
$$score_{spam}(message) > score_{ham}(message)$$ 

You can also convert the scores back to probabilities using the normalization trick:
$\hat{P}(spam|message) = \frac{1}{score_{spam}(message) + score_{ham}(message)} score_{spam}(message)$
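
The normalization can be sketched as follows, under the assumption that the two scores are stored as logarithms (a common choice to avoid numerical underflow). The function name `posterior_spam` is made up for this example. Subtracting the larger log score before exponentiating leaves the ratio unchanged but keeps the exponentials in a safe range.

```python
import math

def posterior_spam(log_score_spam, log_score_ham):
    """Normalize two log scores into an estimate of P(spam | message)."""
    # Shift both log scores by the larger one; the common factor cancels
    # in the fraction, and exp() can no longer underflow to 0 for both terms.
    shift = max(log_score_spam, log_score_ham)
    exp_spam = math.exp(log_score_spam - shift)
    exp_ham = math.exp(log_score_ham - shift)
    return exp_spam / (exp_spam + exp_ham)

# Scores 0.002 (spam) and 0.006 (ham) normalize to 0.002 / 0.008.
p = posterior_spam(math.log(0.002), math.log(0.006))
print(round(p, 6))
```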


You therefore need to
estimate 

* the priors $P(spam)$ and $P(ham)$ for messages, and 
* the likelihoods $P(w_i | spam)$ and $P(w_i | ham)$ for all the words/tokens you chose to use

from counts obtained from the training data. Use [Laplacian smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) for the estimation of
likelihoods to avoid likelihoods of zero.
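
The estimation step could look roughly like this on a toy training set. The data, the function name `estimate_likelihoods` and the smoothing parameter `alpha` are assumptions for this sketch. Because each term has two outcomes (present or absent), the smoothed denominator adds `2 * alpha`.

```python
from collections import Counter

# Hypothetical training data: (set of terms in the message, label).
train = [
    ({"free", "win"}, "spam"),
    ({"free", "prize"}, "spam"),
    ({"lunch", "ok"}, "ham"),
    ({"ok", "free"}, "ham"),
    ({"lunch"}, "ham"),
]
vocabulary = sorted(set().union(*(terms for terms, _ in train)))

# Priors: fraction of training messages in each class.
class_counts = Counter(label for _, label in train)
priors = {label: n / len(train) for label, n in class_counts.items()}

def estimate_likelihoods(label, alpha=1.0):
    """Estimate P(w = true | label) for every vocabulary term with additive smoothing."""
    n_docs = class_counts[label]
    # Number of messages of this class that contain each term.
    doc_freq = Counter(t for terms, y in train if y == label for t in terms)
    # Two possible outcomes per term (present / absent), hence 2 * alpha.
    return {w: (doc_freq[w] + alpha) / (n_docs + 2 * alpha) for w in vocabulary}

likelihoods = {label: estimate_likelihoods(label) for label in class_counts}
print(priors)
print(likelihoods["spam"])
```

The likelihood of a term being absent follows directly as one minus the estimated value, so only one number per term and class needs to be stored.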

Implementation note: Multiplying many small numbers can be problematic because the product may underflow to zero. Most implementations add log likelihoods instead of multiplying likelihoods. Adding also means that you can use very efficient matrix algebra to calculate the score.
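
A small sketch of scoring in log space, using made-up priors and likelihoods for a three-word vocabulary (all numbers and the names `log_score` and `classify` are hypothetical). Each vocabulary term contributes either the log likelihood of being present or the log likelihood of being absent.

```python
import math

# Hypothetical parameters for a three-word vocabulary.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {  # P(w = true | class)
    "spam": {"free": 0.75, "lunch": 0.25, "ok": 0.25},
    "ham": {"free": 0.4, "lunch": 0.6, "ok": 0.6},
}

def log_score(terms, label):
    """Log of prior times the product of per-term likelihoods (present or absent)."""
    total = math.log(priors[label])
    for word, p in likelihoods[label].items():
        # Present terms use P(w = true | class), absent terms use 1 - P(w = true | class).
        total += math.log(p if word in terms else 1.0 - p)
    return total

def classify(terms):
    """Pick the class with the larger log score."""
    return max(priors, key=lambda label: log_score(terms, label))

print(classify({"free"}))
print(classify({"lunch", "ok"}))
```

In this sketch, terms outside the vocabulary are simply ignored. With a 0/1 document-term matrix `X`, the same sum can be written as a matrix product of `X` with the log likelihoods plus `1 - X` with the log of the complementary likelihoods.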

In [6]:
# Description and code goes here!

Report the prior probabilities.

In [7]:
# Code to print the prior probabilities

Report the top 20 words for ham and for spam. The most important words for ham can be found by looking at the ratio of likelihoods
$\frac{P(w_i = true | ham)}{P(w_i = true | spam)}$; the most important words for spam are those with the largest inverse ratio.
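
One way to rank words, sketched under the assumption that the ratio is read per word, that is, comparing the likelihood of a word being present in spam with the likelihood of it being present in ham. The likelihood values below are hypothetical, and the sketch shows the top 2 instead of the top 20.

```python
import math

# Hypothetical smoothed likelihoods P(w = true | class).
likelihoods = {
    "spam": {"free": 0.75, "prize": 0.5, "lunch": 0.25, "ok": 0.2},
    "ham": {"free": 0.4, "prize": 0.1, "lunch": 0.6, "ok": 0.7},
}

# Log likelihood ratio: positive values point to spam, negative values to ham.
log_ratio = {
    word: math.log(likelihoods["spam"][word] / likelihoods["ham"][word])
    for word in likelihoods["spam"]
}

top_spam = sorted(log_ratio, key=log_ratio.get, reverse=True)[:2]
top_ham = sorted(log_ratio, key=log_ratio.get)[:2]

print("top spam words:", top_spam)
print("top ham words:", top_ham)
```

Smoothing matters here: without it, a word that never occurs in one class would produce a ratio of zero or a division by zero.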

In [8]:
# Code that prints the tables with the words

## Evaluate the classification performance [4 Points] 

Classify the remaining 20% of the data (test set) and calculate classification accuracy. Accuracy is defined as the proportion of correctly classified test documents (see https://github.com/mhahsler/CS7320-AI/blob/master/ML/ML_example.ipynb).

1. How good is your classifier's accuracy compared to a weak baseline classifier that always predicts the majority class, and to a strong baseline given by the [Naive Bayes classifier implemented in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.BernoulliNB)
(use a Bernoulli naive Bayes classifier for binary features)? You are implementing the same classifier, so you should get the same performance.

2. Inspect a few misclassified text messages and discuss why the classification failed.

3. Discuss how you would deal with words in the test data that you have not seen in the training data.
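
The accuracy computation and the majority-class baseline can be sketched as follows. The label lists are invented for illustration and are not results from the SMS data; `y_pred` stands in for the output of your own classifier.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of documents whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels (not real results).
y_train = ["ham", "ham", "ham", "spam"]
y_test = ["ham", "spam", "ham", "ham", "spam"]
y_pred = ["ham", "spam", "ham", "spam", "spam"]

# Weak baseline: always predict the most common class in the training data.
majority = Counter(y_train).most_common(1)[0][0]
baseline_pred = [majority] * len(y_test)

print("classifier accuracy:", accuracy(y_test, y_pred))
print("baseline accuracy:", accuracy(y_test, baseline_pred))
```

Because most messages in a spam data set are typically ham, the majority baseline can already look strong, which is why comparing against it matters.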

In [9]:
# Description, code and discussion goes here!

## Bonus task [+1 Point]

Describe how you could improve the classifier. Implement and test one of the improvements.

In [10]:
# Code goes here!