# Naive Bayes Classifier for Spam Detection

## Instructions

Total Points: 10

Complete this notebook and submit it. The notebook needs to be a complete project report with 

* your implementation,
* documentation including a short discussion of how your implementation works and your design choices, and
* experimental results (e.g., tables and charts with simulation results) with a short discussion of what they mean. 

Use the provided notebook cells and insert additional code and markdown cells as needed.

## Introduction

A spam detection agent receives text messages as its percepts and needs to decide whether each one is spam or not.
Create a [naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) for the 
[UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) to perform this task.

__About the use of libraries:__ The point of this exercise is to learn how a Bayes classifier is built. You may use libraries for tokenizing, removing stop words, and creating a document-term matrix, but you need to implement parameter estimation and prediction yourself.

## Create a bag-of-words representation of the text messages [3 Points]

The first step is to tokenize the text. Here is an example of how to use the [Natural Language Toolkit (NLTK)](https://www.nltk.org/) to split a message into tokens (separate words).

In [1]:
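import nltk

# You need to install nltk and then download the tokenizer data once.
# nltk.download('punkt')

# Read the first line of the data set; the context manager closes the file.
with open("smsspamcollection/SMSSpamCollection", "r") as file:
    sentence = file.readline()

print(f"text message: \"{sentence}\"")

# Split the raw line (label and message text) into tokens.
tokens = nltk.word_tokenize(sentence)

print(f"tokens: {tokens}")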

text message: "ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
"
tokens: ['ham', 'Go', 'until', 'jurong', 'point', ',', 'crazy', '..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '...', 'Cine', 'there', 'got', 'amore', 'wat', '...']


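Note that each line of the file starts with the class label (`ham` or `spam`) followed by a tab, so the label shows up as the first token above; separate it from the message text before tokenizing. Experiment with removing frequent words (called [stopwords](https://en.wikipedia.org/wiki/Stop_word)) and very infrequent words so you end up with a reasonable number of words used in the classifier. You may also need to remove digits or all non-letter characters, and you may use a stemming algorithm.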
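
The cleaning steps above could look roughly like the following sketch. It uses only the standard library; the function names `parse_line` and `preprocess`, the tiny `STOPWORDS` set, and the `min_length` threshold are all assumptions made for illustration, not part of the assignment.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# a real run would more likely use a curated list.
STOPWORDS = {"a", "an", "the", "in", "to", "until", "only", "there", "and", "is"}

def parse_line(line):
    """Split one raw line into (label, text); the label and text are tab-separated."""
    label, text = line.rstrip("\n").split("\t", 1)
    return label, text

def preprocess(text, stopwords=STOPWORDS, min_length=2):
    """Lowercase, keep runs of letters only, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_length]

line = "ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet...\n"
label, text = parse_line(line)
words = preprocess(text)
print(label, words)
```

Keeping only letters is a strong choice: it discards digits and symbols, which may themselves be informative for spam, so it is worth comparing against less aggressive cleaning.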

Convert the tokenized data into a data structure that indicates for each document which words it contains. The data structure can be a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) with 0s and 1s, a pandas dataframe, or some sparse matrix structure. Note: words, tokens, and terms are often used interchangeably. Make sure the data structure can be used to split the data into training and test documents (see below).
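
As one possible shape for such a data structure, the sketch below builds a dense 0/1 document-term matrix as a list of rows and splits it by shuffled row indices. The toy documents, the helper names, and the fixed `seed` are made up for illustration; a sparse matrix or a dataframe would work just as well.

```python
import random

def build_vocabulary(docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    return {word: j for j, word in enumerate(sorted({w for doc in docs for w in doc}))}

def to_binary_matrix(docs, vocab):
    """One row per document; an entry is 1 if the word occurs in the document, else 0."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for w in doc:
            if w in vocab:          # tokens outside the vocabulary are skipped
                row[vocab[w]] = 1
        matrix.append(row)
    return matrix

def split_indices(n_docs, train_fraction=0.8, seed=0):
    """Shuffle the row indices and cut them into a training and a test part."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    cut = int(train_fraction * n_docs)
    return indices[:cut], indices[cut:]

# Made-up, already tokenized toy documents.
docs = [["free", "prize", "call"], ["call", "me", "later"],
        ["free", "free", "entry"], ["see", "you", "later"], ["win", "prize"]]
labels = ["spam", "ham", "spam", "ham", "spam"]

vocab = build_vocabulary(docs)
X = to_binary_matrix(docs, vocab)
train_idx, test_idx = split_indices(len(docs))
X_train = [X[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]
```

Because rows and labels are selected by the same index lists, documents and their labels stay aligned after the split.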

Report the 20 most frequent and the 20 least frequent words in your data set.
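
A frequency report like this can be produced with `collections.Counter`. The sketch below counts each word once per document (document frequency), which is an assumption that matches a 0/1 representation; counting every occurrence is an equally valid reading. The toy documents are made up, and ties are broken alphabetically so the result is reproducible.

```python
from collections import Counter

def document_frequencies(docs):
    """Count in how many documents each word occurs (once per document)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts

# Made-up, already tokenized toy documents.
docs = [["free", "prize", "call"], ["call", "me", "later"],
        ["free", "free", "entry"], ["see", "you", "later"], ["win", "prize"]]

df = document_frequencies(docs)
# Sort by descending count, then alphabetically to break ties.
ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))

k = 2  # use k = 20 for the actual report
most_frequent = ranked[:k]
least_frequent = ranked[-k:]
print(most_frequent, least_frequent)
```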

In [2]:
# Description and code go here!

## Learn parameters [3 Points]

Use 80% of the data (called training set; randomly chosen) to learn the parameters of the naive Bayes classifier (prior probabilities and likelihoods). 
Remember, the naive Bayes classifier calculates:

$$P(spam|message) \propto score_{spam}(message) = P(spam) \prod_{i=1}^n P(w_i | spam)$$
$$P(ham|message) \propto score_{ham}(message) = P(ham) \prod_{i=1}^n P(w_i | ham)$$

and classifies a message as spam if 
$$score_{spam}(message) > score_{ham}(message).$$ 
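
Multiplying many probabilities smaller than 1 quickly underflows to zero in floating point, so a common implementation choice is to compare the logarithms of the two scores instead; the logarithm is monotonic, so the decision is unchanged. The sketch below illustrates this with made-up priors and likelihoods; the function names are hypothetical, and every word in the message is assumed to have a likelihood entry.

```python
import math

def log_score(words, prior, likelihood):
    """log P(class) + sum_i log P(w_i | class)."""
    return math.log(prior) + sum(math.log(likelihood[w]) for w in words)

def classify(words, priors, likelihoods):
    """Return 'spam' if the spam log score is larger, otherwise 'ham'."""
    spam = log_score(words, priors["spam"], likelihoods["spam"])
    ham = log_score(words, priors["ham"], likelihoods["ham"])
    return "spam" if spam > ham else "ham"

# Made-up numbers, for illustration only.
priors = {"spam": 0.2, "ham": 0.8}
likelihoods = {
    "spam": {"free": 0.30, "call": 0.20, "later": 0.01},
    "ham":  {"free": 0.02, "call": 0.10, "later": 0.15},
}

first = classify(["free", "call"], priors, likelihoods)
second = classify(["call", "later"], priors, likelihoods)

# A product of 200 factors of 0.01 underflows, the sum of logs does not.
underflowed = 0.01 ** 200
log_version = 200 * math.log(0.01)
```

For the first message the plain scores are 0.2 · 0.30 · 0.20 = 0.012 for spam and 0.8 · 0.02 · 0.10 = 0.0016 for ham, so the log scores rank the classes the same way.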

You therefore need to
estimate: 

* the priors $P(spam)$ and $P(ham)$, and 
* the likelihoods $P(w_i | spam)$ and $P(w_i | ham)$ for all words

from counts obtained from the training data. Use [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing) (additive smoothing) to estimate the
likelihoods. Smoothing handles words that occur rarely or never in the ham or spam training messages and avoids
likelihoods of zero, which would otherwise drive the whole product to zero.
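
The estimation step could be sketched as follows, assuming a 0/1 document-term matrix stored as a list of rows. Here $P(w | c)$ is estimated as the smoothed fraction of class-$c$ training documents that contain $w$, with a denominator of $N_c + 2\alpha$ because a word is either present or absent; this is one reasonable choice, not the only one. The function name, the toy matrix, and its column order are made up for illustration.

```python
def estimate_parameters(X, y, classes=("ham", "spam"), alpha=1):
    """Estimate priors and Laplace-smoothed likelihoods from a 0/1 matrix."""
    n_docs = len(y)
    n_words = len(X[0])
    priors, likelihoods = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        priors[c] = len(rows) / n_docs
        # (documents of class c containing word j + alpha) / (N_c + 2 * alpha)
        likelihoods[c] = [
            (sum(row[j] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_words)
        ]
    return priors, likelihoods

# Made-up toy training data; columns are the words: free, call, later.
X_train = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]]
y_train = ["spam", "ham", "spam", "ham"]

priors, likelihoods = estimate_parameters(X_train, y_train)
print(priors)
print(likelihoods)
```

Without smoothing, the word in the last column would get a spam likelihood of exactly 0 here, because it never occurs in a spam training message; with $\alpha = 1$ it gets $(0 + 1) / (2 + 2) = 0.25$.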

Report the top 20 words (highest conditional probability) for ham and for spam. These words represent the biggest clues that a message is ham or spam.
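
Ranking words by their estimated likelihood is a simple sort. The sketch below assumes the likelihoods for one class are stored in a list aligned with a list of vocabulary words; the helper name `top_words` and the numbers are made up.

```python
def top_words(vocab_words, likelihood, k=20):
    """Return the k (word, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocab_words, likelihood), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]

# Made-up vocabulary and spam likelihoods, for illustration only.
vocab_words = ["call", "free", "later", "prize"]
spam_likelihood = [0.5, 0.75, 0.25, 0.6]

top_spam = top_words(vocab_words, spam_likelihood, k=2)
print(top_spam)
```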

In [3]:
# Description and code go here!

## Evaluate the classification performance [4 Points] 

Classify the remaining 20% of the data (test set) and calculate classification accuracy. Accuracy is defined as the proportion of correctly classified test documents.
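
Accuracy as defined above is a one-line computation. The sketch below also includes a majority-class predictor as one possible reference point for comparison; that choice, the function names, and the labels are assumptions made for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Proportion of test documents whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_predictions(y_train, n_test):
    """Always predict the most common label seen in the training data."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_test

# Made-up labels, for illustration only.
y_train = ["ham", "ham", "ham", "spam"]
y_test = ["ham", "spam", "ham", "ham", "spam"]
y_pred = ["ham", "spam", "ham", "spam", "spam"]

model_accuracy = accuracy(y_test, y_pred)
reference_accuracy = accuracy(y_test, majority_predictions(y_train, len(y_test)))
print(model_accuracy, reference_accuracy)
```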

1. How good is your classifier's accuracy compared to a baseline classifier, for example one that always predicts the majority class?

2. Inspect a few misclassified text messages and discuss why the classification failed.

3. Discuss how you deal with words in the test data that you have not seen in the training data.
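
For the third question, one simple option among several is to skip words that have no estimated likelihood, since they carry no evidence for either class. The sketch below shows this under the assumption that likelihoods are stored in a dictionary keyed by word; the function name and the numbers are made up.

```python
import math

def log_score_known_words(words, prior, likelihood):
    """Log score that ignores words without an estimated likelihood."""
    known = [w for w in words if w in likelihood]
    return math.log(prior) + sum(math.log(likelihood[w]) for w in known)

# Made-up spam parameters, for illustration only.
spam_prior = 0.5
spam_likelihood = {"free": 0.75, "call": 0.5}

# "xyzzy" was never seen in training, so it is skipped.
score = log_score_known_words(["free", "xyzzy"], spam_prior, spam_likelihood)
only_unknown = log_score_known_words(["xyzzy"], spam_prior, spam_likelihood)
```

A message made up entirely of unseen words is then scored by the prior alone, which is worth mentioning when discussing misclassified messages.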

In [4]:
# Description, code, and discussion go here!

## Bonus task [+1 Point]

Describe how you could improve the classifier.

In [5]:
# Code goes here!