In [None]:
import numpy, pandas
import matplotlib.pyplot as plot
from helpers.bayes import cmap, sample_data_1, plot_decision, joint_histograms
%matplotlib inline
#%config InlineBackend.figure_format = 'svg'
#plot.rcParams['figure.figsize'] = [4, 4]

# from https://stackoverflow.com/a/21009774/6871666
numpy.set_printoptions(formatter={'float_kind': lambda x: "%.4f" % x})

Let's start with the same sample data: two features and points in three categories.

In [None]:
observations, classes = sample_data_1()
plot.scatter(observations[:, 0], observations[:, 1], c=classes/2.0, cmap=cmap, edgecolor='k');

# Naive Bayes Classifier

We will use Scikit-Learn to give us a Gaussian Naive Bayes classifier to work with the data.

1. **Gaussian**: synonym for "normally" distributed. We will assume the data (in one category on one feature) is normally distributed.
1. **Naive**: we assume that, within a class, the values of the features are independent. Once you know the class, the probability of a certain value for feature #1 doesn't change if you also know the values of the other features.
1. **Bayes**: we will use the conditional probability tricks from before and pick the class with the highest probability for the observation we're considering.
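
To make those three assumptions concrete, here is a minimal NumPy sketch of the same idea. It is an illustration only, not Scikit-Learn's implementation: the function names and the synthetic `X_demo`/`y_demo` data are made up for this example.

```python
import numpy

def fit_gaussian_nb(X, y):
    """Estimate a prior, plus a per-feature mean and variance, for every class."""
    labels = numpy.unique(y)
    priors = numpy.array([numpy.mean(y == c) for c in labels])
    means = numpy.array([X[y == c].mean(axis=0) for c in labels])
    variances = numpy.array([X[y == c].var(axis=0) for c in labels])
    return labels, priors, means, variances

def predict_gaussian_nb(params, X):
    """Pick the class with the highest (log) posterior for each row of X."""
    labels, priors, means, variances = params
    # Gaussian: log of the normal density for every (observation, class, feature).
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (numpy.log(2 * numpy.pi * variances)[None] + diff ** 2 / variances[None])
    # Naive: features are treated as independent within a class, so their
    # log-probabilities simply add up.
    # Bayes: add the log prior, then choose the most probable class.
    log_posterior = numpy.log(priors)[None, :] + log_pdf.sum(axis=2)
    return labels[numpy.argmax(log_posterior, axis=1)]

# Synthetic stand-in data (hypothetical): three well-separated clusters.
rng = numpy.random.default_rng(0)
centres = numpy.array([[0.0, 0.0], [6.0, 6.0], [6.0, -6.0]])
X_demo = numpy.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centres])
y_demo = numpy.repeat([0, 1, 2], 50)

params = fit_gaussian_nb(X_demo, y_demo)
predicted = predict_gaussian_nb(params, X_demo)
accuracy = numpy.mean(predicted == y_demo)
```

Working with log-probabilities avoids multiplying many very small numbers together, which is a common design choice for this kind of classifier.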

In [None]:
from sklearn.naive_bayes import GaussianNB

We will create a Scikit-Learn model that will understand the world with the Gaussian Naive Bayes assumptions.

In [None]:
model = GaussianNB()

# Data: Training and Testing

We usually call our observations "X" and labeled classes "y", because it's easier to spell.

In [None]:
X = observations
y = classes

Our X values are in an array with *n* rows (for *n* observations) and 2 columns, because there are two features in our data set.

The y values are the *n* corresponding classes/labels that we know for those observations.

In [None]:
X.shape

In [None]:
y.shape

Review: we want to use *some* of our data to train the model, and some to *test* the model, so we can see how it will do on data it hasn't seen before (and check that it isn't just memorizing the training set).

There is a function in Scikit-Learn that will randomly break up our data into training and testing sets (75%, 25% by default).

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
X_train.shape

In [None]:
X_test.shape

Since the split is random, we might get *slightly* different results each time we run the program, but they will be similar.
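
If you want the same split on every run, `train_test_split` accepts a `random_state` argument. The sketch below uses a small made-up array (`X_demo`, `y_demo`) rather than our sample data, and the seed value 42 is arbitrary.

```python
import numpy
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 observations with 2 features each.
X_demo = numpy.arange(40).reshape(20, 2)
y_demo = numpy.arange(20) % 2

# The same seed produces the same shuffle, so both calls give identical splits.
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=42)
same_split = numpy.array_equal(a_test, b_test)

# With the default 75%/25% split, 20 rows become 15 training and 5 testing rows.
print(a_train.shape, a_test.shape, same_split)
```

The proportions can also be changed with the `test_size` argument, for example `test_size=0.2`.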

# Training The Model

Now that we have a model and some training data, we can **train** the model (or **fit** it) on that data, so it learns what kind of decisions to make.

In [None]:
model.fit(X_train, y_train);

Let's have a look at the decisions it makes. This is the **training** data, and the background colours are the **predictions** that the model will make on each point if it is asked to guess the class at that observation.

We can see how the training points guided the way the predictions are made, and where the boundaries are between the classes in the model.

In [None]:
plot_decision(model, X_train, y_train)
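
The `plot_decision` helper is not shown here, so the following is only a guess at the general technique: ask the model for a prediction at every point of a regular grid, then colour the grid by the predicted class. The data from `make_blobs` and all of the names below are stand-ins for illustration.

```python
import numpy
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data: three clusters in two dimensions.
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=2, random_state=1)
demo_model = GaussianNB().fit(X_demo, y_demo)

# Build a regular grid that covers the data with a small margin.
xs = numpy.linspace(X_demo[:, 0].min() - 1, X_demo[:, 0].max() + 1, 100)
ys = numpy.linspace(X_demo[:, 1].min() - 1, X_demo[:, 1].max() + 1, 80)
grid_x, grid_y = numpy.meshgrid(xs, ys)

# Predict a class for every grid point, then restore the grid shape.
grid_points = numpy.column_stack([grid_x.ravel(), grid_y.ravel()])
grid_classes = demo_model.predict(grid_points).reshape(grid_x.shape)
```

Passing `grid_x`, `grid_y` and `grid_classes` to something like `plot.contourf` would then draw the coloured background, with the data points scattered on top.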

# Testing The Model

The whole point of having the testing data is to use it to evaluate our model. That is a meaningful evaluation because the model didn't use this data as part of the training, so we think it would be representative of new unlabeled data we find later and want to make predictions about.

This is what the **testing** data looks like on top of the predictions the model makes:

In [None]:
plot_decision(model, X_test, y_test)

And we can ask what fraction of the testing data the model predicts correctly: how many of the testing X values are predicted to match the testing y values?

In [None]:
model.score(X_test, y_test)
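
For a classifier like this one, `score` reports accuracy, so it should agree with comparing the predictions to the known labels by hand. A small self-contained check, using stand-in data from `make_blobs` instead of our sample data:

```python
import numpy
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data: three clusters in two dimensions.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

demo_model = GaussianNB().fit(Xtr, ytr)

# Fraction of testing observations whose predicted class matches the true class.
manual_accuracy = numpy.mean(demo_model.predict(Xte) == yte)
reported_accuracy = demo_model.score(Xte, yte)
```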

# Using The Model

Once we have a trained model, we can use it to actually make some predictions.

Let's pretend we have some new inputs and want to make our best guess what the category is.

In [None]:
new_observations = numpy.array([
    [4, 5],
    [8, -10],
    [2, -8],
    [6, -1],
])

The `model.predict` function takes these values and returns the model's predicted class for each one.

In [None]:
model.predict(new_observations)

We can look at those values on the plot (zoomed in slightly):

In [None]:
plot_decision(model, new_observations, model.predict(new_observations))

Remember that the naive Bayes classifier is working with probabilities: it calculates a probability for each class and selects the largest.

We can ask for the probability for each class with `model.predict_proba`:

In [None]:
model.predict_proba(new_observations)

So, the predictions for the third and fourth observations were close calls, but the model still had to choose a class.
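
The link between the two functions can be checked directly: each row of `predict_proba` sums to 1, and the column with the largest probability corresponds to the class that `predict` returns. This sketch again uses stand-in data from `make_blobs`, and the `new_points` values are made up.

```python
import numpy
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data and model.
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
demo_model = GaussianNB().fit(X_demo, y_demo)

new_points = numpy.array([[0.0, 4.0], [1.0, 1.0], [-2.0, 3.0]])
probabilities = demo_model.predict_proba(new_points)

# Columns of predict_proba follow the order of demo_model.classes_.
most_likely = demo_model.classes_[numpy.argmax(probabilities, axis=1)]
predictions = demo_model.predict(new_points)
```

A row where the top two probabilities are nearly equal marks an observation the model is unsure about, even though `predict` still reports a single class.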