In [None]:
import numpy, pandas
import matplotlib.pyplot as plot
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from helpers.bayes import correlated_data, plot_decision, cmap, joint_histograms
%matplotlib inline
#%config InlineBackend.figure_format = 'svg'
#plot.rcParams['figure.figsize'] = [4, 4]

# Features in Machine Learning

Remember what we assumed about the features for Gaussian Naive Bayes: features are independent of each other (within each class); features are normally distributed (again, within each class).

If we violate those assumptions, the classifier might not work well.

Let's have a look at some data where the features are not independent...

In [None]:
X, y = correlated_data()
X_train, X_test, y_train, y_test = train_test_split(X, y)
plot.scatter(X[:,0], X[:,1], c=y, edgecolor='k', cmap=cmap);

In [None]:
model = GaussianNB()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
plot_decision(model, X_test, y_test)

Why are the decisions so bad?

For the feature on the vertical axis, the two classes have almost the same mean and standard deviation. The only difference between the classes is in the horizontal direction, so that's all the classifier can work with.

In [None]:
joint_histograms(X_train, y_train)
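
The same point can be checked with numbers instead of histograms. This is only a sketch: `correlated_data` is a course helper, so the block below builds its own synthetic stand-in (`X_demo`, `y_demo`, and the chosen means and covariance are all assumptions, not the notebook's real data). It prints the per-class mean, standard deviation, and within-class correlation.

```python
import numpy

# Synthetic stand-in for correlated_data() (an assumption, not the real helper):
# both classes share the same strong correlation and differ only by a
# horizontal shift of the mean.
rng = numpy.random.default_rng(0)
n = 500
cov = [[1.0, 0.9], [0.9, 1.0]]
X_demo = numpy.concatenate([
    rng.multivariate_normal([0.0, 0.0], cov, size=n),
    rng.multivariate_normal([1.0, 0.0], cov, size=n),
])
y_demo = numpy.repeat([0, 1], n)

# Per-class summary: this is essentially all that GaussianNB gets to see,
# plus the class priors. The correlation is printed only to show what it misses.
stats = {}
for label in (0, 1):
    rows = X_demo[y_demo == label]
    stats[label] = {
        'mean': rows.mean(axis=0),
        'std': rows.std(axis=0),
        'corr': numpy.corrcoef(rows, rowvar=False)[0, 1],
    }
    print(label, stats[label])
```

In data like this, the vertical means and standard deviations nearly match across classes, while the within-class correlation is large: exactly the information Gaussian Naive Bayes ignores.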

# Pipelines

Manipulating the data is part of the machine learning process. We **don't always want to put our data directly into the model**: we often need to transform it first so that the model has something sensible it can work with.

This is one of those cases: it would be nice if we could remove the correlation between the features before we let `GaussianNB` at it.

The things that we use to manipulate the data in Scikit-Learn are called **transformers**. These are tools that somehow manipulate the feature values and turn them into a new set of values that can be passed along to either another transformer or the model, an **estimator**.

An **estimator** is a machine learning model that actually makes predictions. We have seen one: `GaussianNB`.

We can put transformer(s) and an estimator together in a **pipeline**.

A pipeline takes our observations (either the ones we're fitting with or making predictions on) and passes them through each step to an estimator where we get predictions.

In Scikit-Learn, we can use `make_pipeline` to create a machine learning model that includes several steps.

from sklearn.pipeline import make_pipeline

In this example, we will use the `PCA` transformer. It does "principal component analysis". We won't worry about the details, but one of the effects of PCA is that it tends to remove correlation between the features.

In [None]:
from sklearn.decomposition import PCA
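
To see the "removes correlation" effect in isolation, here is a small sketch on made-up data (`X_demo` is a hypothetical single cloud of correlated points, not the notebook's data). It compares the correlation coefficient before and after `PCA`.

```python
import numpy
from sklearn.decomposition import PCA

# Hypothetical correlated data: one Gaussian cloud with correlation 0.9.
rng = numpy.random.default_rng(0)
X_demo = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=1000)

corr_before = numpy.corrcoef(X_demo, rowvar=False)[0, 1]

# PCA centres the data and rotates it onto its principal axes.
X_rotated = PCA().fit_transform(X_demo)
corr_after = numpy.corrcoef(X_rotated, rowvar=False)[0, 1]

print(corr_before, corr_after)
```

Note that this is the correlation over the whole dataset that `PCA` was fitted on. With several classes, some correlation can remain within each class.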

We can use `make_pipeline` to put our steps together.

In [None]:
from sklearn.pipeline import make_pipeline
model = make_pipeline(
    PCA(),
    GaussianNB()
)

We can train and test a pipeline model just like any other Scikit-Learn model.

In [None]:
model.fit(X_train, y_train)
model.score(X_test, y_test)
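
It may help to see what the pipeline is doing on our behalf. The sketch below, on synthetic stand-in data (`X_demo` and `y_demo` are assumptions), writes the two steps out by hand and checks that the predictions match the pipeline's.

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption, not the notebook's data).
rng = numpy.random.default_rng(0)
n = 500
cov = [[1.0, 0.9], [0.9, 1.0]]
X_demo = numpy.concatenate([
    rng.multivariate_normal([0.0, 0.0], cov, size=n),
    rng.multivariate_normal([1.0, 0.0], cov, size=n),
])
y_demo = numpy.repeat([0, 1], n)

# By hand: fit the transformer, transform, then fit the estimator on the result.
pca = PCA()
nb = GaussianNB()
nb.fit(pca.fit_transform(X_demo), y_demo)
# When predicting, only transform (no refitting), then predict.
manual_pred = nb.predict(pca.transform(X_demo))

# The pipeline does the same bookkeeping for us.
pipe = make_pipeline(PCA(), GaussianNB())
pipe.fit(X_demo, y_demo)
pipe_pred = pipe.predict(X_demo)

print(numpy.array_equal(manual_pred, pipe_pred))
```

The practical benefit is that the transformer is always fitted on the training data only and then reused unchanged when predicting, without us having to remember to do that.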

We get *much* better predictions out of this model than `GaussianNB` by itself.

In [None]:
plot_decision(model, X_test, y_test)
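
As a rough, self-contained check of the comparison above, this sketch scores `GaussianNB` alone against the `PCA` pipeline on synthetic stand-in data (the data, `random_state` values, and variable names are assumptions; the scores will differ from the notebook's).

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: correlated features, classes shifted horizontally.
rng = numpy.random.default_rng(0)
n = 1000
cov = [[1.0, 0.9], [0.9, 1.0]]
X_demo = numpy.concatenate([
    rng.multivariate_normal([0.0, 0.0], cov, size=n),
    rng.multivariate_normal([1.0, 0.0], cov, size=n),
])
y_demo = numpy.repeat([0, 1], n)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same estimator, with and without the PCA step in front of it.
plain_score = GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)
pipe_score = make_pipeline(PCA(), GaussianNB()).fit(X_tr, y_tr).score(X_te, y_te)

print(plain_score, pipe_score)
```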

We can peek into the middle of the pipeline and see what the `PCA` transformer is doing in this case. This is the data after `PCA()` but before `GaussianNB`: the features still aren't truly independent, but things have been rotated enough that the Naive Bayes method can make some decent decisions.

In [None]:
X_transf = model.named_steps['pca'].transform(X_train)
plot.scatter(X_transf[:,0], X_transf[:,1], c=y_train, edgecolor='k', cmap=cmap);

In [None]:
joint_histograms(X_transf, y_train)
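
The key `'pca'` used with `named_steps` comes from `make_pipeline`, which names each step after its class in lowercase. A short sketch on throwaway random data (`X_demo` and `y_demo` are assumptions) lists the step names and shows that slicing the pipeline is another way to get "everything before the estimator".

```python
import numpy
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Throwaway data, only so the pipeline can be fitted.
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = numpy.repeat([0, 1], 20)

pipe = make_pipeline(PCA(), GaussianNB())
pipe.fit(X_demo, y_demo)

# make_pipeline generates step names from the lowercased class names.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Two ways to apply only the transformer part of the fitted pipeline.
by_name = pipe.named_steps['pca'].transform(X_demo)
by_slice = pipe[:-1].transform(X_demo)
print(numpy.allclose(by_name, by_slice))
```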

# Custom Transformers

There are a few built-in transformers in Scikit-Learn, but there are many more ways you might want to manipulate your data.

There is a transformer called `FunctionTransformer` that can be used to do whatever calculations you want to do with your features.

In [None]:
from sklearn.preprocessing import FunctionTransformer

Your job: write a function that takes the X array (`X_train`, `X_test`, or whatever features matrix you give your model), does some calculations, and returns a new array of features.
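
As a minimal illustration of that contract (array in, array out), here is a sketch with a deliberately trivial function. The name `double_first_column` and the tiny `X_demo` array are made up for this example.

```python
import numpy
from sklearn.preprocessing import FunctionTransformer

def double_first_column(X):
    # Hypothetical example function: returns a new array, leaving X untouched.
    X_new = X.copy()
    X_new[:, 0] = 2 * X_new[:, 0]
    return X_new

X_demo = numpy.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# FunctionTransformer has nothing to learn: fitting does no real work,
# and transforming simply calls the function.
transformer = FunctionTransformer(double_first_column)
X_out = transformer.fit_transform(X_demo)
print(X_out)
```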

In this case, I made up this function (because I also made up the data, so I could cheat by knowing exactly what to do): it takes a matrix of features and returns some better features.

In [None]:
def remove_correlation(X):
    X0 = X[:, 0] - 1.5 * X[:, 1]
    X1 = X[:, 1]
    return numpy.stack((X0, X1), axis=1)
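
To see why subtracting `1.5 * X[:, 1]` helps, here is a sketch on synthetic data built so that the first feature is roughly 1.5 times the second plus noise. That construction is an assumption chosen to match the function; it is not necessarily how `correlated_data` generates its values.

```python
import numpy

def remove_correlation(X):
    # Same function as above, repeated so this block runs on its own.
    X0 = X[:, 0] - 1.5 * X[:, 1]
    X1 = X[:, 1]
    return numpy.stack((X0, X1), axis=1)

# Hypothetical data: feature 0 is about 1.5 * feature 1, plus independent noise.
rng = numpy.random.default_rng(0)
x1 = rng.normal(size=1000)
x0 = 1.5 * x1 + rng.normal(scale=0.5, size=1000)
X_demo = numpy.stack((x0, x1), axis=1)

corr_before = numpy.corrcoef(X_demo, rowvar=False)[0, 1]
corr_after = numpy.corrcoef(remove_correlation(X_demo), rowvar=False)[0, 1]
print(corr_before, corr_after)
```

Subtracting the part of feature 0 that is predictable from feature 1 leaves only the independent noise, so the correlation drops to roughly zero.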

Then we can use this in a transformer in a pipeline. This pipeline model is really good.

In [None]:
model = make_pipeline(
    FunctionTransformer(remove_correlation),
    GaussianNB()
)
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
plot_decision(model, X_test, y_test)

Again, we can peek into the middle of the pipeline and see what happened to the (training) data when it went through the `FunctionTransformer`.

In [None]:
X_transf = model.named_steps['functiontransformer'].transform(X_train)
plot.scatter(X_transf[:,0], X_transf[:,1], c=y_train, edgecolor='k', cmap=cmap);

In [None]:
joint_histograms(X_transf, y_train)