# Lab - Text Classification with Naïve Bayes
This notebook serves as the starter code and lab description covering **Chapter 12 - Quantifying Uncertainty** and **Chapter 13 - Probabilistic Reasoning** from the book *Artificial Intelligence: A Modern Approach*.

In [1]:
# pip install pandas
# pip install scikit-learn
# pip install numpy

from starter import *
import pandas as pd
from sklearn.model_selection import train_test_split

# This helper prints the source code of the classes and functions loaded into Jupyter,
#  which is useful for reading through the starter code and for debugging.
#  Make sure all calls to `psource` are commented out in your submission.
def psource(*functions):
    """Print the source code for the given function(s)."""
    from inspect import getsource
    source_code = '\n\n'.join(getsource(fn) for fn in functions)
    try:
        from pygments.formatters import HtmlFormatter
        from pygments.lexers import PythonLexer
        from pygments import highlight
        from IPython.display import HTML, display

        display(HTML(highlight(source_code, PythonLexer(), HtmlFormatter(full=True))))

    except ImportError:
        print(source_code)

## OVERVIEW
In this lab, you will partially implement a Naïve Bayes text classifier that looks at SMS text messages and categorizes them into two classes: **ham** and **spam**. Along the way, we will learn how a probability distribution is implemented in Python code and how it can be extended to represent a joint probability distribution. Let's get started!

### Data
The data has been collected from sources on the Internet that are free or free for research ([Accessible Here](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/)). The data format is very simple: it is a *.tsv* file containing 5,574 text messages, of which 747 are spam and the other 4,827 are normal (ham) messages.

Let's load the dataset and look at the first text message and its label:

In [2]:
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t', header=0)
data.iloc[0]

LABEL                                                  ham
TEXT     Go until jurong point, crazy.. Available only ...
Name: 0, dtype: object

Since all the data is loaded in one place, we need to separate out a part of it for testing our model once it is ready. Use the `train_test_split` function from the *sklearn* library (already imported in the first cell) to hold out a portion of the data as the test set.

In [3]:
# TODO fill in this cell properly to have 10% of the data as test and the rest as train, 
# have train_test_split shuffle your data and use 42 as random seed.
train, test = data, data
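
As a rough illustration of the API, and not the solution to the cell above, the sketch below splits a hypothetical toy list with `train_test_split`. The names `toy_rows`, `toy_train`, and `toy_test` are assumptions made for this example and are not part of the lab data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the SMS rows: 20 (label, text) pairs.
toy_rows = [("spam" if i % 5 == 0 else "ham", f"message {i}") for i in range(20)]

# test_size=0.1 holds out 10% of the rows; shuffle=True (the default) permutes
# them first; random_state fixes the permutation so reruns give the same split.
toy_train, toy_test = train_test_split(
    toy_rows, test_size=0.1, shuffle=True, random_state=42
)

print(len(toy_train), len(toy_test))
```

Fixing `random_state` matters here because the class counts reported later in the notebook depend on exactly which rows end up in each split.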

Now let's see the distribution of messages across our classes in the train and test data:

In [4]:
train.groupby(by='LABEL').agg('count')

| LABEL | TEXT |
|-------|------|
| ham   | 4340 |
| spam  | 674  |


In [5]:
test.groupby(by='LABEL').agg('count')

| LABEL | TEXT |
|-------|------|
| ham   | 485  |
| spam  | 73   |


This means our test set contains enough instances of each class to help us evaluate both classes. Be careful about aggregating accuracy scores across both classes, though. Think about what could go wrong!
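
One way to see the risk is a small worked calculation. The sketch below uses the test-split counts shown above and a hypothetical baseline that labels every message as ham; the variable names are assumptions for illustration only.

```python
# Class counts taken from the test split shown above.
n_ham, n_spam = 485, 73

# Hypothetical baseline: a "classifier" that labels every message as ham.
correct = n_ham                      # every ham is right, every spam is wrong
overall_accuracy = correct / (n_ham + n_spam)
spam_recall = 0 / n_spam             # no spam message is ever caught

print(f"overall accuracy: {overall_accuracy:.3f}")
print(f"spam recall:      {spam_recall:.3f}")
```

A single pooled accuracy of roughly 87% would hide the fact that this baseline never detects spam, which is why the evaluation later in the lab reports per-class scores.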

## PROBABILITY DISTRIBUTION

Let us continue by specifying discrete probability distributions. The class **ProbDist** defines a discrete probability distribution. We name our random variable and then assign probabilities to its different values. Assigning probabilities works much like using a dictionary: each value is a key, and its probability is what we store under that key. This is possible because of the magic methods **`__getitem__`** and **`__setitem__`**, which store the probabilities in the `prob` dict of the object. You can keep the source window open alongside while playing with the rest of the code to get a better understanding.

In [6]:
# psource(ProbDist)
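
To make the dictionary analogy concrete, here is a minimal sketch of a dictionary-backed distribution. `TinyDist` is a hypothetical class written for this illustration; it is not the starter's `ProbDist`, whose actual source you can inspect with `psource`.

```python
class TinyDist:
    """Hypothetical stand-in for a discrete distribution; not the starter's ProbDist."""

    def __init__(self, name, freqs=None):
        self.name = name
        self.prob = {}
        for value, count in (freqs or {}).items():
            self[value] = count          # routed through __setitem__
        if self.prob:
            self.normalize()

    def __getitem__(self, value):
        # Unseen values get probability 0 instead of raising KeyError.
        return self.prob.get(value, 0.0)

    def __setitem__(self, value, p):
        self.prob[value] = p

    def normalize(self):
        # Rescale the stored numbers so that they sum to 1.
        total = sum(self.prob.values())
        for value in self.prob:
            self.prob[value] /= total


coin = TinyDist("Coin", {"heads": 3, "tails": 1})
print(coin["heads"], coin["tails"], coin["edge"])
```

Square-bracket access on `coin` calls `__getitem__`, and assignment calls `__setitem__`, which is the mechanism the paragraph above describes.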

To get a bit more comfortable with *ProbDist*, define an instance of it that receives a vocabulary containing the four words {`cat`, `dog`, `hamster`, `rabbit`} with respective frequencies {417, 330, 240, 32}. Print out the probability of `hamster` occurring in this vocabulary.

In [7]:
# TODO using ProbDist find probability of 'hamster' happening in the vocabulary which should be equal to 0.23552502453385674

0.23552502453385674


## Joint Probability Distribution

A probability model is completely determined by the joint distribution for all of the random variables. The probability module implements these as the class **JointProbDist**, which inherits from the **ProbDist** class. This class specifies a discrete probability distribution over a set of variables.

In [8]:
#psource(JointProbDist)

A *value* of a joint distribution is an ordered tuple in which each item corresponds to the value associated with a particular variable. For a joint distribution of X and Y, where X and Y take integer values, this can be something like (18, 19).

To specify a Joint distribution we first need an ordered list of variables.

In [9]:
variables = ['X', 'Y']
j = JointProbDist(variables)
j

P(['X', 'Y'])

Like the **ProbDist** class, **JointProbDist** also employs magic methods to assign probabilities to different values.
A probability can be assigned in either of two formats, a tuple of values in variable order or a dictionary mapping each variable to its value, for any possible value of the distribution. The **event_values** call inside **`__getitem__`** and **`__setitem__`** does the required processing to make this work.

In [10]:
j[1,1] = 0.2
j[dict(X=0, Y=1)] = 0.5

(j[1,1], j[0,1])

(0.2, 0.5)

It is also possible to list all the values for a particular variable using the **values** method.

In [11]:
j.values('X')

[1, 0]
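
The behaviour above can be sketched with a tuple-keyed dictionary. `TinyJoint` and its `_as_tuple` helper are hypothetical names for this illustration; the starter's `JointProbDist` and `event_values` may be implemented differently.

```python
class TinyJoint:
    """Hypothetical sketch of a joint distribution keyed by value tuples."""

    def __init__(self, variables):
        self.variables = variables
        self.prob = {}

    def _as_tuple(self, event):
        # Accept either a tuple in variable order or a {variable: value} dict.
        if isinstance(event, dict):
            return tuple(event[var] for var in self.variables)
        return tuple(event)

    def __getitem__(self, event):
        return self.prob.get(self._as_tuple(event), 0.0)

    def __setitem__(self, event, p):
        self.prob[self._as_tuple(event)] = p

    def values(self, var):
        # Distinct values seen for one variable, in insertion order.
        i = self.variables.index(var)
        seen = []
        for key in self.prob:
            if key[i] not in seen:
                seen.append(key[i])
        return seen


tj = TinyJoint(["X", "Y"])
tj[1, 1] = 0.2
tj[dict(X=0, Y=1)] = 0.5
print(tj[1, 1], tj[0, 1], tj.values("X"))
```

Converting both input formats to one canonical tuple is what lets `tj[0, 1]` retrieve a probability that was stored with a dictionary key.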

## Text classification with Naïve Bayes
We can now get back to our task of text classification for SMS messages. As we discussed in the lecture, the Naïve Bayes model consists of the prior probabilities $\textbf{P}(Category)$ and the conditional probabilities
$\textbf{P}(HasWord_i|Category)$. Here, our categories are clearly `ham` and `spam`, so the first thing we should do is collect statistics about our categories.
Make a *ProbDist* instance and fill it with the $\textbf{P}(Category)$ information. **You must collect this information only from your train data**.

In [12]:
df = train.groupby(by='LABEL').agg('count').reset_index()
ham_count = int(df.loc[df['LABEL'] == 'ham', 'TEXT'].iloc[0])  # select the single count explicitly before converting
# TODO continue from here and create p_category here using the train data
# use show_approx function and make sure the probability of spam is not too low here (e.g. below 0.11) 
# if it was the case re-run the 'train_test_split' cell!

'ham: 0.866, spam: 0.134'
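
The arithmetic behind these two numbers is a relative frequency. The sketch below reproduces it with a plain dictionary, using the training counts shown earlier; `train_counts` and `priors` are hypothetical names, and the lab still expects the result to live in a *ProbDist*.

```python
# Class counts from the training split shown earlier in this notebook.
train_counts = {"ham": 4340, "spam": 674}

# Each prior is the class count divided by the total number of training messages.
total = sum(train_counts.values())
priors = {label: count / total for label, count in train_counts.items()}

print(", ".join(f"{label}: {p:.3f}" for label, p in priors.items()))
```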

Similarly, $\textbf{P}(HasWord_i|Category)$ is estimated as the fraction of
documents of each category that contain word $i$. 

Using what you have learned and all of the code that has been provided to you in this lab so far, **collect/create $\textbf{P}(HasWord_i|Category)$. Again, you must collect this information using only your train data**.

Note that here, calculating $\textbf{P}(HasWord_i|Category)$ means counting how many times the word with index $i$ appears in the text messages of $Category$, divided by the total number of words in that category.

In [13]:
# TODO create p_has_word_category as a JointProbDist and fill it in by iterating through train instances
# Implementation hint 1: you can iterate through the rows of pandas DataFrame using the function `iterrows`
# Implementation hint 2: once you collected the joint word, category information, you must normalize 
#                        the collected counts and turn them to probability distributions over each category
#                        i.e. the content of p_has_word_category for each category must sum to 1.
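
As a hedged sketch of the count-then-normalize idea in the hints, the example below works on a hypothetical three-message corpus with plain dictionaries. The whitespace tokenization and the names `toy_train`, `word_counts`, and `p_word_given_cat` are assumptions for this illustration; the lab expects a *JointProbDist* filled from the real train data.

```python
from collections import Counter, defaultdict

# Hypothetical three-message training set; the real lab uses the SMS rows.
toy_train = [
    ("ham", "see you at lunch"),
    ("ham", "lunch at noon"),
    ("spam", "win cash now win"),
]

# Count word occurrences separately for each category.
word_counts = defaultdict(Counter)
for label, text in toy_train:
    word_counts[label].update(text.lower().split())

# Normalize within each category so the probabilities for that category sum to 1.
p_word_given_cat = {}
for label, counts in word_counts.items():
    total_words = sum(counts.values())
    for word, n in counts.items():
        p_word_given_cat[word, label] = n / total_words

print(p_word_given_cat["lunch", "ham"])   # 2 of the 7 ham words
print(p_word_given_cat["win", "spam"])    # 2 of the 4 spam words
```

Keys are `(word, category)` tuples, mirroring the order in which the check cell below indexes `p_has_word_category[w, c]`.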

In [None]:
for c in p_has_word_category.values('Category'):
    print("Sum probability for category {} should sum to 1. The actual summation is equal to {}.".format(
        c, sum((p_has_word_category[w, c] for w in p_has_word_category.values('HasWord')))))

## Putting it all together to classify test data
Based on Equation 12.21 of the textbook, for evidence $\textbf{e}$ (in our case, an SMS message) we can calculate the probability of each $Category$ (both `ham` and `spam`) using the following equation, in which $e_j$ is the $j$th word in our text message and $\textbf{y}$ stands for the unobserved variables. The last line follows because the sum over $\textbf{y}$ equals 1:

$$\textbf{P}(Category|\textbf{e}) = \alpha \sum_y{\textbf{P}(Category)\textbf{P}(\textbf{y}|Category)\big(\prod_j\textbf{P}(e_j|Category)\big)}$$
$$= \alpha \textbf{P}(Category) \big(\prod_j\textbf{P}(e_j|Category)\big) \sum_y \textbf{P}(\textbf{y}|Category)$$
$$= \alpha \textbf{P}(Category) \prod_j\textbf{P}(e_j|Category)$$

Your next task is to use the information you have created so far (and the equation we just reviewed) to classify the test data.
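
The final line of the equation can be sketched directly in code. The distributions below are hypothetical toy numbers, and `posterior` is an illustrative helper, not part of the starter code; treating unseen words as probability 0 is also an assumption of this sketch.

```python
# Hypothetical toy distributions; the lab's real ones come from the training data.
prior = {"ham": 0.8, "spam": 0.2}
p_word = {
    ("free", "ham"): 0.01, ("free", "spam"): 0.20,
    ("lunch", "ham"): 0.10, ("lunch", "spam"): 0.01,
}


def posterior(words, prior, p_word):
    """Return normalized P(Category | words) using the Naive Bayes product."""
    scores = {}
    for cat, p_cat in prior.items():
        score = p_cat
        for w in words:
            # Words never seen with this category contribute probability 0 here.
            score *= p_word.get((w, cat), 0.0)
        scores[cat] = score
    total = sum(scores.values())     # alpha in the equation is 1 / total
    if total == 0:
        return scores                # no category explains the message
    return {cat: s / total for cat, s in scores.items()}


post = posterior(["free", "free"], prior, p_word)
print(max(post, key=post.get), post)
```

For long messages the raw product can underflow to 0, so a common alternative is to sum log-probabilities instead; the sketch keeps the product form to stay close to the equation.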

In [None]:
spam_predicted_as_spam = 0.0
spam_predicted_as_ham = 0.0
ham_predicted_as_ham = 0.0
ham_predicted_as_spam = 0.0
total_ham = 0.0
total_spam = 0.0
# #############################################################################################
# IMPORTANT! DO NOT MODIFY THE LINES ABOVE THIS LINE

# TODO use the Naïve Bayes classification equation here to classify test data.
# Implementation hint: simply use the two distributions you just collected to calculate P(ham|text_message)
# and P(spam|text_message), and select the one with the higher probability as the message class.
# Once your prediction is ready for each instance, increment the matching counters among the 6 values above
# (for each instance only one *predicted_as* variable and one *total_* variable will be updated, depending on
# the actual test message label).

# Once you are done with the implementation, running this cell will use your collected stats and print out the 
# confusion matrix and precision, recall, and f-1 scores of your classifier.
    
# IMPORTANT! DO NOT MODIFY THE LINES BELOW THIS LINE
# #############################################################################################
print("confusion matrix\tprd_ham\t\tprd_spam\nact_ham         \t{}\t\t{}\nact_spam        \t{}\t\t{}\n".format(
    ham_predicted_as_ham, ham_predicted_as_spam, spam_predicted_as_ham, spam_predicted_as_spam))
# Precision: of the messages predicted as a class, the percentage that truly belong to it.
prec_ham = ham_predicted_as_ham * 100 / (ham_predicted_as_ham + spam_predicted_as_ham)
prec_spam = spam_predicted_as_spam * 100 / (spam_predicted_as_spam + ham_predicted_as_spam)
# Recall: of the messages that truly belong to a class, the percentage predicted as it.
rec_ham = ham_predicted_as_ham * 100 / total_ham
rec_spam = spam_predicted_as_spam * 100 / total_spam
f1_ham = 2 * prec_ham * rec_ham / (prec_ham + rec_ham)
f1_spam = 2 * prec_spam * rec_spam / (prec_spam + rec_spam)
print("Prediction precision\tham = {:.3f}\tspam = {:.3f}".format(prec_ham, prec_spam))
print("Prediction recall\tham = {:.3f}\tspam = {:.3f}".format(rec_ham, rec_spam))
print("Prediction F-1 score\tham = {:.3f}\tspam = {:.3f}".format(f1_ham, f1_spam))
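
For reference, the standard definitions of these scores can be worked through on a small example. The confusion-matrix counts below are hypothetical and chosen only for illustration; they are not results from this lab. Spam is treated as the positive class.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp = 60    # spam predicted as spam
fn = 13    # spam predicted as ham
fp = 5     # ham predicted as spam
tn = 480   # ham predicted as ham

precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision and recall divide the same numerator by different totals (predicted positives versus actual positives), and F-1 is their harmonic mean.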

## Analysis
Explain what you can learn from the results you just got. In particular, explain what the confusion matrix shows and what the accuracy (precision), recall, and F-1 scores tell you.

### Your Answer
...

## Improvement
Think of one change that could improve the accuracy of your Naïve Bayes classifier. Implement it, use the same evaluation script to calculate the results, and analyse its performance.

In [None]:
# TODO implement your improvement here and re-test it!