### Build and test a Naive Bayes classifier.

We will again use the iris data. 

Goals for this notebook:
1. Understand NB well enough to make a prediction by hand
2. Use the naive_bayes module in scikit-learn
3. Get a first glimpse of the pandas package for manipulating data

In [None]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

In [None]:
# Load the data, which is included in sklearn.
iris = load_iris()
print('Iris target names: {}'.format(iris.target_names))
print('Iris feature names: {}'.format(iris.feature_names))
X, y = iris.data, iris.target

In [None]:
# Train/test split
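The split itself is not shown here, although later cells rely on `X_train`, `X_test`, `y_train`, and `y_test`. A minimal sketch, assuming scikit-learn's default 75/25 split and an arbitrary `random_state=0` (both are assumptions, not necessarily the settings originally used, so numbers quoted later, such as the setosa prior, may differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Assumed settings: default test_size (25%) and a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```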


## EDA

The iris feature values are real-valued measurements in centimeters. Let's look at histograms of each feature.

In [None]:
# Create a new figure and set the figsize argument so we get square-ish plots of the 4 features.
plt.figure(figsize=(15, 3))

# Iterate over the features, creating a subplot with a histogram for each one.
for feature in range(X_train.shape[1]):
    plt.subplot(1, 4, feature+1)
    plt.hist(X_train[:,feature], 20)
    plt.title(iris.feature_names[feature])

To make things simple, let's binarize these feature values. That is, we'll treat each measurement as either "short" or "long". I'm just going to choose a threshold for each feature.

In [None]:
# Define a function that applies a threshold to turn real-valued iris features into 0/1 features.
# 0 will mean "short" and 1 will mean "long".
# The default thresholds are a tuple so the default argument cannot be mutated between calls.
def binarize_iris(data, thresholds=(6.0, 3.0, 2.5, 1.0)):
    # Initialize a new feature array with the same shape as the original data.
    binarized_data = np.zeros(data.shape)

    # Apply a threshold to each feature.
    for feature in range(data.shape[1]):
        binarized_data[:,feature] = data[:,feature] > thresholds[feature]
    return binarized_data

# Create new binarized training and test data
binarized_train_data = binarize_iris(X_train)
binarized_test_data = binarize_iris(X_test)

print(X_train[:10])
print(binarized_train_data[:10])

Recall that Naive Bayes assumes conditional independence of features. With $Y$ the set of labels and $X$ the set of features ($y$ is a specific label and $x$ is a specific feature), Naive Bayes gives the probability of a label $y$ given input features $X$ as:

$ \displaystyle P(y|X) \approx 
  \frac { P(y) \prod_{x \in X} P(x|y) }
        { \sum_{y' \in Y} P(y') \prod_{x \in X} P(x|y') }
$

Let's estimate some of these probabilities using maximum likelihood, which is just a matter of counting and normalizing. We'll start with the prior probability of the label $P(y)$.

In [None]:
# Initialize counters for all labels to zero.
label_counts = [0 for i in iris.target_names]

# Iterate over labels in the training data and update counts.
for label in y_train:
    label_counts[label] += 1

# Normalize counts to get a probability distribution.
total = sum(label_counts)
label_probs = [1.0 * count / total for count in label_counts]
for (prob, name) in zip(label_probs, iris.target_names):
    print('%15s : %.2f' % (name, prob))

### Repeat the steps above using pandas

Pandas allows us to stop thinking about code in terms of procedures, and start thinking about code in terms of _data manipulation_.

- Check out pandas here: pandas.pydata.org.
- [Here](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf) is a cheat sheet to get a sense of what is possible.
- [This](http://pandas.pydata.org/pandas-docs/stable/10min.html) is probably the best place to get started.


Load the training data into a pandas data frame
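One possible way to do this is sketched below. The names `df` and `colnames` and the extra `target` and `species` columns are choices made for this sketch (`colnames` and `target` match the names used by the MultinomialNB cells at the end of the notebook).

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature columns keep the names that scikit-learn provides.
colnames = list(iris.feature_names)
df = pd.DataFrame(iris.data, columns=colnames)

# Store the integer label and a readable species name alongside the features.
df['target'] = iris.target
df['species'] = iris.target_names[iris.target]

print(df.head())
```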

Redo the train/test split
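A sketch of the split on a data frame, assuming the default 75/25 proportions and an arbitrary `random_state=0`. `train_test_split` accepts a DataFrame directly and returns DataFrames, so features and labels stay together. The names `train_df` and `test_df` are the ones the MultinomialNB cells later in the notebook expect.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
colnames = list(iris.feature_names)
df = pd.DataFrame(iris.data, columns=colnames)
df['target'] = iris.target

# Rows are shuffled and split; the original row index is preserved in each part.
train_df, test_df = train_test_split(df, random_state=0)

print(len(train_df), len(test_df))
```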

Plot some histograms

Binarize the columns
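With pandas the loop in `binarize_iris` can be replaced by a single comparison. This sketch reuses the thresholds from `binarize_iris`; the names `thresholds` and `binarized_df` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
colnames = list(iris.feature_names)
df = pd.DataFrame(iris.data, columns=colnames)

# Same thresholds as binarize_iris, labeled with the column each one applies to.
thresholds = pd.Series([6.0, 3.0, 2.5, 1.0], index=colnames)

# gt(..., axis='columns') matches each threshold to its column by name.
binarized_df = df[colnames].gt(thresholds, axis='columns').astype(int)

print(binarized_df.head())
```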

Estimate priors
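Counting and normalizing is a one-liner in pandas. A sketch, again assuming a default split with `random_state=0`, so the exact values depend on that assumption:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
train_df, test_df = train_test_split(df, random_state=0)

# value_counts(normalize=True) counts each label and divides by the total.
priors = train_df['species'].value_counts(normalize=True)

print(priors.round(2))
```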

## Naive Bayes by Hand

Next, let's estimate $P(X|Y)$, that is, the probability of each feature given each label. Remember that we can get the conditional probability from the joint distribution:

$\displaystyle P(X|Y) = \frac{ P(X,Y) } { P(Y) } \approx \frac{ \textrm{Count}(X,Y) } { \textrm{Count}(Y) }$

Let's think carefully about the size of the count matrix we need to collect. There are 3 labels $y_0$, $y_1$, and $y_2$ and 4 features $x_0$, $x_1$, $x_2$, and $x_3$. Each feature has 2 possible values, 0 or 1. So there are actually $4 \times 2 \times 3 = 24$ probabilities we need to estimate:

$P(x_0=0, Y=y_0)$

$P(x_0=1, Y=y_0)$

$P(x_1=0, Y=y_0)$

$P(x_1=1, Y=y_0)$

...

However, we already estimated (above) the probability of each label. And, we know that each feature value is either 0 or 1. So, for example,

$P(x_0=0, Y=\textrm{setosa}) + P(x_0=1, Y=\textrm{setosa}) = P(Y=\textrm{setosa}) \approx 0.31$.

As a result, we can just estimate probabilities for one of the feature values, say, $x_i = 1$. This requires a $4 \times 3$ matrix.




Append target to binarized data

Conditional probabilities of feature_i > thresh
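A sketch of these two steps with pandas, assuming the same default split (`random_state=0`) and the thresholds from `binarize_iris`. Because each binarized column holds only 0s and 1s, its mean within a label group equals $\textrm{Count}(x_i = 1, y) / \textrm{Count}(y)$, which gives the $4 \times 3$ matrix directly.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
colnames = list(iris.feature_names)
df = pd.DataFrame(iris.data, columns=colnames)
df['species'] = iris.target_names[iris.target]
train_df, test_df = train_test_split(df, random_state=0)

# Binarize the training features, then append the label column.
thresholds = pd.Series([6.0, 3.0, 2.5, 1.0], index=colnames)
binarized_train = train_df[colnames].gt(thresholds, axis='columns').astype(int)
binarized_train['species'] = train_df['species']

# Rows are features, columns are labels: P(x_i = 1 | y).
cond_probs = binarized_train.groupby('species')[colnames].mean().T

print(cond_probs.round(2))
```

Note the exact 0 and 1 entries in the petal rows: maximum likelihood assigns zero probability to anything not seen in training, which matters once logs are taken.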

Priors

Now that we have all the pieces, let's try making a prediction for the first test example. 

#### Exercise

Compute the probability, according to the Naive Bayes model, that the label of this observation is 'virginica'.

1: Write a function that takes a label as input and returns _the numerator_ of the NB equation, that is, $P(y) \prod_{x \in X} P(x|y)$.

Note: This function does not need to handle any possible input observation. It can be hard-coded to specifically handle `test_instance_binarized`.

Note: The function should actually return the _log_ of the numerator, which turns the product into a sum and avoids numerical underflow.

In [None]:
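def lognum(label):
    """
    Compute log(P(label) * P(features | label)) for `test_instance_binarized`.
    """
    score = None  # Exercise: replace None with the log of the numerator.
    return score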

Test

In [None]:
label = 'virginica'
lognum(label)

2: Compute the log of the numerator for each possible label.

3: Convert the results of (2) into probabilities
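One possible solution to all three steps is sketched below. It is not the only way to do it, and it makes two assumptions that are not in the text above: the default split with `random_state=0`, and add-one smoothing (`alpha = 1.0`) on the conditional probabilities. The smoothing is there because the raw maximum likelihood estimates contain exact zeros, and $\log 0$ is undefined.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

thresholds = np.array([6.0, 3.0, 2.5, 1.0])
binarized_train_data = (X_train > thresholds).astype(int)
test_instance_binarized = (X_test[0] > thresholds).astype(int)

n_labels = len(iris.target_names)
alpha = 1.0  # assumed add-one smoothing, not part of the plain counting estimate

# Priors P(y) and smoothed conditionals P(x_i = 1 | y), one row per label.
label_counts = np.bincount(y_train, minlength=n_labels)
priors = label_counts / label_counts.sum()
ones_counts = np.array([binarized_train_data[y_train == k].sum(axis=0)
                        for k in range(n_labels)])
p_one = (ones_counts + alpha) / (label_counts[:, None] + 2 * alpha)


def lognum(label):
    """Return log(P(y) * prod_i P(x_i | y)) for test_instance_binarized."""
    k = list(iris.target_names).index(label)
    # Use P(x_i = 1 | y) where the feature is 1, and 1 - P(x_i = 1 | y) where it is 0.
    p_features = np.where(test_instance_binarized == 1, p_one[k], 1 - p_one[k])
    return np.log(priors[k]) + np.log(p_features).sum()


# Step 2: log numerator for every label.
log_nums = np.array([lognum(name) for name in iris.target_names])

# Step 3: normalize. Subtracting the max before exponentiating avoids underflow.
probs = np.exp(log_nums - log_nums.max())
probs /= probs.sum()

for name, prob in zip(iris.target_names, probs):
    print('%15s : %.3f' % (name, prob))
```

With this choice of smoothing, the result should agree with `BernoulliNB(alpha=1.0)` fitted on the same binarized data, which is a handy check on the hand computation.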

## Naive Bayes in Scikit-Learn

### BernoulliNB means _the features are distributed Bernoulli_
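A minimal sketch of fitting `BernoulliNB` to the binarized iris features, assuming the default split with `random_state=0` and the thresholds from `binarize_iris`. Setting `binarize=None` tells the model that the inputs are already 0/1; `alpha=1.0` is scikit-learn's default smoothing.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

thresholds = np.array([6.0, 3.0, 2.5, 1.0])
binarized_train_data = (X_train > thresholds).astype(int)
binarized_test_data = (X_test > thresholds).astype(int)

# The features are already binary, so skip BernoulliNB's own thresholding.
bnb = BernoulliNB(alpha=1.0, binarize=None)
bnb.fit(binarized_train_data, y_train)

bernoulli_accuracy = bnb.score(binarized_test_data, y_test)
print('BernoulliNB test accuracy: %.3f' % bernoulli_accuracy)

# One row per label, one column per feature: the smoothed P(x_i = 1 | y).
print(np.exp(bnb.feature_log_prob_).round(2))
```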

### GaussianNB means _the features are distributed Gaussian_

Exercise: Build a GaussianNB model using sklearn, and estimate the accuracy on the test set.

Let's look at those predicted probabilities
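One way to approach the exercise and inspect the probabilities is sketched here, assuming the default split with `random_state=0`. `GaussianNB` works on the raw measurements, so no binarization is needed; the names `gnb`, `gaussian_accuracy`, and `predicted_probs` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Fit one Gaussian per feature and label on the real-valued training data.
gnb = GaussianNB()
gnb.fit(X_train, y_train)

gaussian_accuracy = gnb.score(X_test, y_test)
print('GaussianNB test accuracy: %.3f' % gaussian_accuracy)

# One row per test example, one column per label; each row sums to 1.
predicted_probs = gnb.predict_proba(X_test)
print(predicted_probs[:5].round(3))
```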

The predicted probabilities are _very_ confident. Since the test predictions are also extremely accurate, we might conclude that GaussianNB is an appropriate model for this data.

### MultinomialNB means _the features are distributed multinomial_

In [None]:
# Percentile cuts to use for discretizing data
cuts = [0, .33, .66, 1]
breaks = train_df[colnames].quantile(cuts)
breaks.loc[0, :] = 0  # Set a global minimum
breaks.loc[1, :] = 10  # Set a global maximum
print(breaks)

In [None]:
# Discretize each column according to the breaks defined above
train_df_mult = train_df[colnames].apply(
    lambda col: pd.cut(col, bins=breaks[col.name], include_lowest=True, labels=False)
)
test_df_mult = test_df[colnames].apply(
    lambda col: pd.cut(col, bins=breaks[col.name], include_lowest=True, labels=False))

In [None]:
train_df_mult.head()

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=1)
nb.fit(train_df_mult, train_df['target'])
nb.score(test_df_mult, test_df['target'])