# DSCI 6003 4.2 Practicum: Naive Bayes

In this exercise you will implement Naive Bayes classification in Python. You should rely primarily on counters and dictionaries instead of numpy arrays for this implementation.

Recall the formulas we use for Naive Bayes:

![likelihood](images/likelihood2.png)

Let's unpack this a bit. The numerator is the number of times a given word appears in the training documents of a class, plus a Laplace smoothing term. The denominator is the total number of words in that class's training documents, plus the smoothing term multiplied by the number of distinct words (features).

Notice that each of these values is simply the (smoothed) probability that the word you are investigating would be drawn at random from all of the documents in a given class.
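
As a minimal sketch of the smoothed likelihood, the snippet below uses the 'fishing' class of the toy corpus that appears in the tests later in this practicum. The helper name `smoothed_likelihood` and the variable names are assumptions made for illustration; they are not part of the provided `NaiveBayes` class.

```python
from collections import Counter

# Training documents for a single class (taken from the toy test corpus).
fishing_docs = [
    "a long document about fishing".split(),
    "a book on fishing".split(),
]
# Every distinct word in the full toy training set (both classes).
vocabulary = {"a", "long", "document", "about", "fishing", "book", "on", "knot-tying"}

word_counts = Counter(word for doc in fishing_docs for word in doc)
total_words = sum(word_counts.values())  # total words in the 'fishing' class
p = len(vocabulary)                      # number of distinct features


def smoothed_likelihood(word, alpha=1):
    """P(word | class) with Laplace smoothing (hypothetical helper)."""
    return (word_counts[word] + alpha) / (total_words + alpha * p)


print(smoothed_likelihood("fishing"))     # (2 + 1) / (9 + 8)
print(smoothed_likelihood("knot-tying"))  # (0 + 1) / (9 + 8), small but never zero
```

Because of the smoothing, a word that never appears in the class still gets a small nonzero probability, and the probabilities over the whole vocabulary still sum to one.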

And here's how we calculate the probability that the document in question belongs to a class:

![posterior](images/posterior.png)

Here we determine the probability of a class given a document. Working in log space, this score is the log of the frequency of each class in the training set, `log(P(y))`, plus the sum of the log probabilities that each word in the document would be drawn at random from the class you are investigating, `sum(log(P(x_i|y)))`. You will need one of these scores for each class in your training set. Choose the class with the largest score.

The summation means we add up the log probabilities for each word in the document we are investigating. Summing logs is equivalent to multiplying the probabilities themselves, but it avoids numerical underflow. <a href='http://scikit-learn.org/stable/modules/naive_bayes.html'>Sklearn's formulation is pretty good too.</a>
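
The scoring rule can be sketched for the one-word document `["book"]`, using counts from the toy corpus in the tests below. The plain dictionaries and the function name `log_posterior` are assumptions for illustration only; your `NaiveBayes` class may organise this differently.

```python
import math

# Counts from the toy corpus used in the tests (written out by hand here).
class_freq = {"fishing": 2, "knot-tying": 1}    # documents per class
class_counts = {"fishing": 9, "knot-tying": 4}  # total words per class
class_feature_counts = {
    "fishing": {"a": 2, "long": 1, "document": 1, "about": 1,
                "fishing": 2, "book": 1, "on": 1},
    "knot-tying": {"a": 1, "book": 1, "on": 1, "knot-tying": 1},
}
p = 8      # number of distinct words in the training set
alpha = 1  # Laplace smoothing constant


def log_posterior(document, label):
    """log P(y) + sum_i log P(x_i | y) for one tokenized document."""
    n_docs = sum(class_freq.values())
    total = math.log(class_freq[label] / n_docs)
    for word in document:
        count = class_feature_counts[label].get(word, 0)
        total += math.log((count + alpha) / (class_counts[label] + alpha * p))
    return total


doc = ["book"]
scores = {label: log_posterior(doc, label) for label in class_freq}
prediction = max(scores, key=scores.get)  # class with the largest score
print(scores, prediction)
```

The scores are negative because they are logs of probabilities; only their relative order matters when picking the class.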

1. Open `code/naive_bayes.py` and look at the `fit` method in the `NaiveBayes` class definition. This method calculates the prior probabilities. In this case the prior is just the frequency of each class in the training set.

2. Implement the `_compute_likelihood` method. This is the majority of work we will need to do to train the model. Go to the test file for this practicum and see what the input for the model will look like.

    * The `class_counts` attribute should contain the total number of feature occurrences (words) across all training samples in each class. This is the denominator (minus the smoothing) from above. The keys should be the classes.

    * The `class_feature_counts` attribute should contain the number of occurrences of each word (feature) for each class. This is a dictionary of dictionaries (technically a defaultdict of Counters). This is the numerator (minus the smoothing) from above. You should be able to access this dictionary like this: `class_feature_counts[class y][feature j]`.

    This is in fact all that we need to precompute. We will be doing the Laplace smoothing when we do predictions. As you go, you can run `nosetests tests/test_nb.py` to verify you've correctly implemented each method.

3. Implement the `posteriors` method. For each row in the feature matrix `X` and for each potential label, you will need to calculate the log likelihood. You should follow the formula from above.

4. The `predict` method returns the class with the largest posterior for each data point. Implement it using the output of `posteriors`.

5. Run `nosetests tests/test_nb.py` to verify you've implemented everything correctly.
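
The two count structures described in step 2 can be sketched as a standalone function. This is one possible approach under stated assumptions, not the reference solution: the function name `compute_counts` and its return signature are hypothetical, and in the practicum these values live as attributes on the `NaiveBayes` instance.

```python
from collections import Counter, defaultdict


def compute_counts(X, y):
    """Tally per-class word counts (hypothetical standalone helper)."""
    class_counts = Counter()                     # total words per class
    class_feature_counts = defaultdict(Counter)  # per-class count of each word
    vocabulary = set()
    for document, label in zip(X, y):
        class_feature_counts[label].update(document)
        class_counts[label] += len(document)
        vocabulary.update(document)
    # The number of distinct words is the p used in the smoothing term.
    return class_counts, class_feature_counts, len(vocabulary)


X = [doc.split() for doc in ["a long document about fishing",
                             "a book on fishing",
                             "a book on knot-tying"]]
y = ["fishing", "fishing", "knot-tying"]
class_counts, class_feature_counts, p = compute_counts(X, y)
print(class_counts["fishing"], class_feature_counts["fishing"]["fishing"], p)
```

Using a `defaultdict` of `Counter` objects means a lookup such as `class_feature_counts['knot-tying']['fishing']` returns 0 for an unseen word instead of raising a `KeyError`.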


In [37]:
from __future__ import division
from importlib import reload  # the imp module is deprecated in Python 3

import nose
from code.naive_bayes import NaiveBayes


from nose.tools import assert_equal, assert_not_equal
import numpy as np

import pandas as pd
%load_ext autoreload


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [38]:
%autoreload 1
%aimport code.naive_bayes

In [39]:
X = ['a long document about fishing',
     'a book on fishing',
     'a book on knot-tying']
X = [x.split() for x in X]
y = np.array(['fishing', 'fishing', 'knot-tying'])
nb = NaiveBayes()
nb.fit(X, y)

def test_class_freq():
    assert nb.class_freq['fishing'] == 2
    assert nb.class_freq['knot-tying'] == 1

def test_class_counts():
    assert_equal(nb.class_counts['fishing'], 9)

def test_p_is_number_features():
    assert_equal(nb.p, 8)

def test_class_feature_counts():
    assert_equal(nb.class_feature_counts['fishing']['document'], 1)
    assert_equal(nb.class_feature_counts['knot-tying']['fishing'], 0)
    assert_equal(nb.class_feature_counts['fishing']['fishing'], 2)

def laplace(n, d, p):
    return (n + 1) / (d + 1 * p)

def test_predict():
    test_X = [["book"]]
    p = 8
    fishing_likelihood = sum((np.log(laplace(1, 9, p)),
                             np.log(2/3)))
    knot_tying_likelihood = sum((np.log(laplace(1, 4, p)),
                                np.log(1/3)))
    posts = nb.posteriors(test_X)
    
    print(fishing_likelihood, ' fishing likelihood')
    print(knot_tying_likelihood,' knot tying prob')
    #print(posts[0],'posts')
    
    preds = nb.predict(test_X)
    print(preds, ' preds')
    assert_equal(fishing_likelihood, posts[0]['fishing'])
    assert_equal(knot_tying_likelihood, posts[0]['knot-tying'])
    assert_equal(preds[0], 'fishing')
    assert_not_equal(preds[0], 'knot-tying')

def test_score():
    print(nb.posteriors(X))
    assert_equal(nb.score(X, y), 1.0)


In [40]:
test_class_freq()
test_class_counts()
test_p_is_number_features()
test_class_feature_counts()
test_predict()
test_score()

-2.5455312716  fishing likelihood
-2.8903717579  knot tying prob
['fishing']  final predictions
['fishing']  preds
[defaultdict(<class 'int'>, {'knot-tying': -12.829998357048165, 'fishing': -10.294865709373189}), defaultdict(<class 'int'>, {'knot-tying': -8.9587973461402743, 'fishing': -8.1547995458769176}), defaultdict(<class 'int'>, {'knot-tying': -8.2656501655803289, 'fishing': -9.2534118345450267})]
['fishing', 'fishing', 'knot-tying']  final predictions


In [41]:
nb.class_freq

Counter({'fishing': 2, 'knot-tying': 1})

# Reach Goals

1. Now that you can take in text and classify it as belonging to a certain class, try using your implementation of Naive Bayes on the mini20-train and mini20-test data. More information about these datasets can be found <a href='http://ana.cachopo.org/datasets-for-single-label-text-categorization'>at this webpage</a>.

2. This time, modify your code to read in tf-idf data. The spam.csv file contains a tf-idf representation of e-mails, some of which are spam. Use your version of Naive Bayes to classify the e-mails, then check it against sklearn's `MultinomialNB`.

In [13]:
mini20test_df = np.array(pd.read_csv('../NaiveBayes_Practicum/data/mini20-test.txt',header=None))
mini20train_df = np.array(pd.read_csv('../NaiveBayes_Practicum/data/mini20-train.txt',header=None))

In [14]:
spam_df = pd.read_csv('../NaiveBayes_Practicum/data/spam.csv',header=None)

In [26]:
## split each row into its label and its tokenized text
X_train = []
y_train = []
for row in mini20train_df:
    # each row holds a single string of the form 'label<TAB>document text'
    label, text = row[0].split('\t', 1)
    y_train.append(label)
    # tokenize so each document is a list of words, as in the toy example
    X_train.append(text.split())


In [29]:
len(X_train)

1334

In [30]:
len(y_train)

1334

In [16]:
## split each row into its label and its tokenized text
X_test = []
y_test = []
for row in mini20test_df:
    # each row holds a single string of the form 'label<TAB>document text'
    label, text = row[0].split('\t', 1)
    y_test.append(label)
    # tokenize so each document is a list of words, as in the toy example
    X_test.append(text.split())


In [31]:
len(X_test)

666

In [42]:
nb_model = NaiveBayes()

In [43]:
nb_model.fit(X_train,y_train)

In [45]:
# my code is too inefficient to run this :(
nb_model.predict(X_test)

KeyboardInterrupt: 

In [35]:
nb_model.score(X_test,y_test)

KeyboardInterrupt: 

This is how to implement Naive Bayes with a tf-idf vector as an input.

## Background

- Naive Bayes primarily relies on Bayes' theorem:

  $$p(y|x) = \frac{p(x|y) \times p(y)}{p(x)}$$

  <br>

  where 

  - $p(y|x)$ is the probability of observing a particular label / class given the data (posterior)
  - $p(x|y)$ is the probability of observing the data given a particular label / class (likelihood)
  - $p(y)$ is the probability of observing a particular label / class (prior)
  - $p(x)$ is the probability of observing the data

  <br>

- $p(x)$ does not depend on the class, so it is the same constant for every label being compared. We can therefore ignore the term and rewrite the formulation for Naive Bayes as:

  $$p(y|x) \propto p(x|y) \times p(y)$$

  <br>

- In more concrete terms, Naive Bayes makes the "naive" assumption that the features are conditionally independent given the class, so the likelihood of observing the data is the product of the probabilities of observing each feature:

  $$p(x|y) = p(x_1|y) \cdot p(x_2|y) \cdot \ldots \cdot p(x_n|y)$$
  
  <br>
  
- We compute the likelihood from the existing (training) data and set the prior based on the class distribution
- Based on the likelihood and prior, we can then compute the probability of observing a certain class given that we have observed feature $i$ two times and feature $i+1$ three times:

  $$p(y|x) \propto p(x_i|y)^2 \times p(x_{i+1}|y)^3 \times p(y)$$

  <br>

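- Taking the log of the above formulation, we get (up to an additive constant that is the same for every class):

  $$\log p(y|x) = 2\log p(x_i|y) + 3\log p(x_{i+1}|y) + \log p(y) + \text{const}$$
  
  <br>
  
- The general form to compute the posterior, where $x_i$ is the count (or tf-idf weight) of feature $i$ in the datapoint, would be:

  $$\log p(y|x) = \sum_{i=1}^n x_i \log p(x_i|y) + \log p(y) + \text{const}$$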

  <br>
  
- To compute the likelihood of observing a certain feature given a class, $p(x_i|y)$:

  $$p(x_i|y) = \frac{S_{y,i} + \alpha}{S_y + \alpha p}$$
  
  where 
  - $p$ is the number of features
  - $\alpha$ is a smoothing term that prevents zero probabilities for features never observed in a class, usually set to 1
  - $S_{y,i}$ is the sum of the $i^{th}$ feature over all the datapoints in class $y$
  - $S_y$ is the sum of all of the features over all the datapoints in class $y$