<img src='img/logo.png'>
<img src='img/title.png'>

# Naive Bayes classifiers

Naive Bayes classifiers learn the *statistical characteristics* of the labels represented in your training data.

New observations are compared to those *statistical characteristics* to make a prediction.

Naive Bayes classifiers assume *independence between every pair of features*, given the label.
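
Written out, this assumption lets the probability of a label factorize into one term per feature. The symbols $y$ (a label), $x_i$ (the value of feature $i$), and $n$ (the number of features) are notation introduced here for illustration:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The predicted label is the one that maximizes the right-hand side:

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$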

Because they train quickly, Naive Bayes classifiers may be preferable to linear models for datasets with:
* many training observations
* a large number of features

# Table of Contents
* [Naive Bayes classifiers](#Naive-Bayes-classifiers)
* [Automobile data](#Automobile-data)
* [GaussianNB](#GaussianNB)
	* [Prior probabilities](#Prior-probabilities)
	* [Feature Statistics](#Feature-Statistics)
		* [counts](#counts)
		* [means](#means)
		* [standard deviations](#standard-deviations)
* [BernouliNB](#BernouliNB)
	* [get_dummies](#get_dummies)


# Automobile data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import src.mglearn as mglearn
sns.reset_orig()
%matplotlib inline

In [None]:
auto = pd.read_csv('data/auto-mpg.csv')

In [None]:
auto.columns

In [None]:
sns.pairplot(auto, vars=['hp','mpg'], hue='origin', height=4)  # height was named size before seaborn 0.9

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(auto, random_state=0)

X_train = train[['hp', 'mpg']]
X_test = test[['hp', 'mpg']]

y_train = train['origin']
y_test = test['origin']

# GaussianNB

<img src='img/topics/Essential-Concept.png' align='left'>
<div class='alert alert-info' align='center'><font size='+2'>
GaussianNB assumes:<br/>an independent Gaussian distribution<br/> for each feature per label
</font></div>

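By learning the mean and variance of each feature for each label, the model can estimate how likely a new observation is under each label and predict the most likely one.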

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
gauss = GaussianNB()

How can we tell that this model is *actually* doing well?

In [None]:
gauss.fit(X_train, y_train)
gauss.score(X_test, y_test)
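
One way to approach the question above is to compare the score against a trivial baseline that always predicts the most frequent label. The sketch below uses synthetic data as a stand-in for the automobile data; the class sizes, centers, and label names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the automobile data (assumption): three imbalanced
# labels, each with two numeric features drawn from its own Gaussian.
rng = np.random.RandomState(0)
sizes = [250, 80, 70]
centers = [(120, 20), (80, 27), (80, 30)]
X = np.vstack([rng.normal(loc=c, scale=(15, 4), size=(n, 2))
               for c, n in zip(centers, sizes)])
y = np.repeat(['us', 'europe', 'japan'], sizes)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the most frequent label seen during training.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = GaussianNB().fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)
model_score = model.score(X_test, y_test)
print(f'baseline accuracy:   {baseline_score:.3f}')
print(f'GaussianNB accuracy: {model_score:.3f}')
```

If the model's accuracy is not clearly above the baseline, the features are probably adding little beyond the class frequencies.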

## Prior probabilities

GaussianNB takes into account the *prior probabilities* of each class in the training data when making a prediction.

Because most of the cars in our dataset were made in the US, that label has the largest prior, and predictions are weighted toward it accordingly.
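
As a minimal sketch of where the priors come from, the example below fits GaussianNB on a small, hypothetical imbalanced dataset (the labels and random features are made up for illustration) and checks that the learned priors are just the label frequencies. It also shows the `priors` parameter, which can replace the learned priors with fixed ones.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Hypothetical imbalanced labels: 6 'us', 2 'europe', 2 'japan'.
y = np.array(['us'] * 6 + ['europe'] * 2 + ['japan'] * 2)
X = rng.normal(size=(10, 2))

model = GaussianNB().fit(X, y)

# By default the priors are the label frequencies in the training data.
manual_priors = model.class_count_ / model.class_count_.sum()
print(dict(zip(model.classes_, model.class_prior_)))

# Passing priors= fixes them instead of learning them from the data.
uniform = GaussianNB(priors=[1/3, 1/3, 1/3]).fit(X, y)
print(dict(zip(uniform.classes_, uniform.class_prior_)))
```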

In [None]:
pd.Series(gauss.class_prior_, index=gauss.classes_)

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
sns.reset_orig()
# color each car by the integer code of its origin label
ax.scatter(auto['hp'], auto['mpg'], c=auto['origin'].astype('category').cat.codes)
# refit on the same integer codes so the decision regions match the scatter colors
gauss.fit(X_train[['hp','mpg']].values, y_train.astype('category').cat.codes)
mglearn.plot_2d_separator.plot_2d_classification(gauss, auto[['hp','mpg']].values, alpha=0.2, ax=ax)

## Feature Statistics

The GaussianNB model learns the following from the training data:
* the number of observations for each label
* the mean of each feature per label ($\theta$)
* the variance of each feature per label ($\sigma^2$)
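
To illustrate how these statistics are enough to make a prediction, here is a minimal sketch that recomputes the priors, means, and variances by hand and scores new observations with independent Gaussian densities. The data is synthetic and `manual_predict` is a hypothetical helper written for this example. scikit-learn also adds a tiny smoothing term to the variances, which this sketch ignores.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Synthetic two-feature data for two labels (an assumption, not the automobile data).
X = np.vstack([rng.normal((100, 20), (10, 3), size=(60, 2)),
               rng.normal((70, 30), (10, 3), size=(40, 2))])
y = np.array([0] * 60 + [1] * 40)

model = GaussianNB().fit(X, y)

def manual_predict(X_new):
    """Predict labels from per-label counts, means, and variances only."""
    classes = np.unique(y)
    scores = []
    for label in classes:
        X_label = X[y == label]
        prior = len(X_label) / len(X)   # from the counts
        mean = X_label.mean(axis=0)     # theta
        var = X_label.var(axis=0)       # population variance (ddof=0)
        # log of the independent Gaussian densities, summed over the features
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1)
        scores.append(np.log(prior) + log_likelihood)
    # pick the label with the highest log posterior for each observation
    return classes[np.argmax(scores, axis=0)]

X_new = np.array([[95.0, 22.0], [72.0, 31.0]])
print(manual_predict(X_new), model.predict(X_new))
```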

### counts

In [None]:
train['origin'].value_counts()

In [None]:
gauss.class_count_

### means

In [None]:
train.groupby('origin')[['hp','mpg']].mean()

In [None]:
gauss.theta_

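### standard deviations

GaussianNB stores the *variance* of each feature per label (the square of the standard deviation), so the pandas comparison uses `var(ddof=0)`.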

In [None]:
train.groupby('origin')[['hp','mpg']].var(ddof=0)

In [None]:
gauss.var_  # named sigma_ in scikit-learn versions before 1.0

# BernouliNB

The BernoulliNB model is designed to work with binary (True/False) feature data.

In [None]:
titanic = pd.read_csv('data/titanic.csv').dropna()
titanic.head()

## get_dummies

For Pandas DataFrames, `get_dummies` converts each categorical column into separate yes/no indicator columns, one per category.
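
A small sketch of the transformation on a hypothetical toy frame (the column names mirror the Titanic data, but the values are made up). The categorical column is expanded into one indicator column per category, while the boolean column is left as it is.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one boolean column.
toy = pd.DataFrame({
    'embarked': ['S', 'C', 'Q', 'S'],
    'alone': [True, False, True, False],
})

# 'embarked' becomes embarked_C, embarked_Q, embarked_S; 'alone' is unchanged.
dummies = pd.get_dummies(toy)
print(dummies)
```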

In [None]:
titanic_dummies = pd.get_dummies(titanic[['sex', 'class', 'embarked', 'adult_male', 'alone','survived']])
titanic_dummies.head()

In [None]:
train, test = train_test_split(titanic_dummies, random_state=0)

features = titanic_dummies.columns.drop('survived')

X_train = train[features]
X_test = test[features]
y_train = train['survived']
y_test = test['survived']

The BernoulliNB model then predicts probabilities from the fraction of observations with each label for which each feature is True.

The BernoulliNB model also explicitly accounts for features that are absent from an observation. This means that if a feature was always False for a given label in the training data, an observation where that feature is True is penalized for that label.

In [None]:
from sklearn.naive_bayes import BernoulliNB

bern = BernoulliNB()
bern.fit(X_train, y_train)
bern.score(X_test, y_test)

<img src='img/copyright.png'>