Lesson 8 - Build statistical and predictive models
==================================================
<center>
<img src=http://scikit-learn.org/stable/_images/scikit-learn-logo-notext.png />
</center>
Reference: [Scikit-Learn documentation](http://scikit-learn.org/)

8.1 Use Scikit-Learn to create a predictive model
-------------------------------------------------

For these exercises you'll need to install or update the following packages to the latest versions:

* scikit-learn
* ggplot

This can be done by executing this command:

```bash
conda install -c conda-forge scikit-learn ggplot
```

Or use the equivalent in Anaconda Navigator: add the two channels `conda-forge` and `datasciencepythonr`, then update the indices and select those two packages.

### Supervised learning
<img src="http://ijstokes-public.s3.amazonaws.com/dspyr/img/supervised_workflow.svg" width=50%>

### Data representations
<img src="http://ijstokes-public.s3.amazonaws.com/dspyr/img/data_representation.svg" width=50%>

### Dataset train/test split
<img src="http://ijstokes-public.s3.amazonaws.com/dspyr/img/train_test_split_matrix.svg" width=50%>

Linear models in scikit-learn are either regressors, which predict a continuous target, or classifiers, which predict a categorical target.  The [scikit-learn documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) has a good overview of the `sklearn.linear_model` package's regressors and classifiers.

Regression models predict the target `Y` as a linear combination of the features plus an intercept, for an input matrix `X` with one or more continuous features (columns).  Here $c_0$ is the intercept and $c_1, ..., c_n$ are the coefficients of the $n$ features:

$Y = c_0 + c_1 X_1 + c_2 X_2 + ... + c_{n} X_{n} $
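
The formula above can be checked against a fitted model. The sketch below is only an illustration: it uses synthetic data (an assumption, so that it runs without the `autompg` sample data) and compares `predict` with the intercept plus the weighted sum of the features, using the fitted attributes `intercept_` and `coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 100 rows, 3 continuous features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_coefs = np.array([1.5, -2.0, 0.5])
y = 4.0 + X @ true_coefs + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Y = c_0 + c_1 X_1 + ... + c_n X_n, computed by hand
manual = model.intercept_ + X @ model.coef_

print(model.intercept_, model.coef_)
print(np.allclose(manual, model.predict(X)))
```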


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from bokeh.sampledata.autompg import autompg
from sklearn.model_selection  import train_test_split
from sklearn.linear_model     import LinearRegression

pd.options.display.max_rows = 10

In [None]:
%matplotlib inline

In [None]:
autompg

In [None]:
car_train, car_test = train_test_split(autompg, train_size=0.80, random_state=123)
car_train = car_train.copy()
car_test  = car_test.copy()
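
As a rough check of what the split above does, the following sketch uses a small stand-in DataFrame (hypothetical, not the `autompg` data) to show that `train_size=0.80` keeps 80% of the rows for training, that a fixed `random_state` makes the split repeatable, and that no row lands in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the autompg DataFrame
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

train_a, test_a = train_test_split(df, train_size=0.80, random_state=123)
train_b, test_b = train_test_split(df, train_size=0.80, random_state=123)

# Sizes of the two parts
print(len(train_a), len(test_a))

# Same random_state gives the same rows in the same order
print(train_a.index.equals(train_b.index))

# Rows shared by train and test (expected to be empty)
overlap = set(train_a.index) & set(test_a.index)
print(overlap)
```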

In [None]:
linear_regression = LinearRegression()

In [None]:
linear_regression.fit(car_train['cyl displ hp weight accel yr'.split()],
                      car_train[['mpg']])

8.2 Generate predictions with a model
-------------------------------------

In [None]:
predictions_lg = linear_regression.predict(car_test['cyl displ hp weight accel yr'.split()])
predictions_lg[:10]

In [None]:
from sklearn.svm          import SVC                    as support_vector_classifier
from sklearn.ensemble     import RandomForestClassifier as random_forest_classifier
from sklearn.neighbors    import KNeighborsClassifier   as knn_classifier
# LinearRegression is a regressor, not a classifier, so the alias says "model"
from sklearn.linear_model import LinearRegression       as linear_regression_model

In [None]:
model_rfc = random_forest_classifier(n_estimators=100)
model_rfc.fit(car_train['mpg cyl displ hp weight accel yr'.split()],
          car_train['origin'].astype(int).values.ravel())

In [None]:
predictions_rfc = model_rfc.predict(car_test['mpg cyl displ hp weight accel yr'.split()])
predictions_rfc

In [None]:
# Number incorrect
sum(car_test.origin != predictions_rfc)

In [None]:
# Accuracy
sum(car_test.origin == predictions_rfc) / len(predictions_rfc)
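
The two manual counts above have ready-made equivalents in `sklearn.metrics`. A minimal sketch, using short hypothetical label arrays in place of `car_test.origin` and `predictions_rfc`:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels standing in for car_test.origin and predictions_rfc
y_true = np.array([1, 1, 2, 3, 3, 1, 2, 1])
y_pred = np.array([1, 1, 3, 3, 3, 1, 2, 2])

# Fraction of matching labels, by hand and with scikit-learn
manual_accuracy = (y_true == y_pred).mean()
print(manual_accuracy, accuracy_score(y_true, y_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

The confusion matrix shows which classes are mistaken for which, something a single accuracy number hides.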

8.3 Score a model
-----------------

In [None]:
linear_regression.score(car_train['cyl displ hp weight accel yr'.split()], car_train[['mpg']])

In [None]:
linear_regression.score(car_test['cyl displ hp weight accel yr'.split()], car_test[['mpg']])
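
For a regressor, `score` returns the coefficient of determination $R^2$. The sketch below, on synthetic data (an assumption made to keep it self-contained), computes $R^2 = 1 - SS_{res} / SS_{tot}$ by hand and compares it with `r2_score` and `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (hypothetical): 200 rows, 2 features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
preds = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - preds) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y, preds), model.score(X, y))
```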

In [None]:
from sklearn import datasets, linear_model, model_selection, metrics

diabetes_dataset        = datasets.load_diabetes()
dd_examples, dd_targets = diabetes_dataset.data, diabetes_dataset.target

linreg = linear_model.LinearRegression()

In [None]:
# 10 train/test splits
kfold = model_selection.KFold(n_splits=10)

for train, test in kfold.split(dd_examples):
    
    linreg.fit(dd_examples[train], dd_targets[train])
    
    preds = linreg.predict(dd_examples[test])
    
    # R^2 and mean squared error for this fold (y_true first, then y_pred)
    print(linreg.score(dd_examples[test], dd_targets[test]),
          metrics.mean_squared_error(dd_targets[test], preds))

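There is an easier way: `cross_val_score` runs this fit-and-score loop for you.  With `scoring='neg_mean_squared_error'` it returns the negated mean squared error for each split, so that higher scores are always better.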

In [None]:
cv_scores = model_selection.cross_val_score(linreg, dd_examples, dd_targets, 
                                             cv=kfold, 
                                             scoring='neg_mean_squared_error', 
                                             n_jobs=-1) # all CPUs
print(cv_scores) # neg mean sq error for each train / test split
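
Because the scores are negated, they need a sign flip before they can be read as errors. The following sketch (one way to do it, not the only one) recomputes the per-fold mean squared error with an explicit loop, checks that it matches the negated `cross_val_score` output, and converts the result to a root mean squared error in the units of the target.

```python
import numpy as np
from sklearn import datasets, linear_model, metrics, model_selection

X, y = datasets.load_diabetes(return_X_y=True)
linreg = linear_model.LinearRegression()
kfold = model_selection.KFold(n_splits=10)

cv_scores = model_selection.cross_val_score(
    linreg, X, y, cv=kfold, scoring="neg_mean_squared_error")

# Same folds, scored by hand
manual_mse = []
for train, test in kfold.split(X):
    linreg.fit(X[train], y[train])
    manual_mse.append(
        metrics.mean_squared_error(y[test], linreg.predict(X[test])))

mse = -cv_scores        # flip the sign to get ordinary MSE
rmse = np.sqrt(mse)     # same units as the target

print(np.allclose(mse, manual_mse))
print(rmse.mean(), rmse.std())
```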

8.4 Visualize model performance
-------------------------------

In [None]:
import sys
import os

import numpy             as np
import matplotlib.pyplot as plt
import matplotlib        as mpl
import pandas            as pd

from sklearn.svm             import SVC
from sklearn.model_selection import train_test_split

In [None]:
plt.rcParams['image.interpolation'] = "none"
np.set_printoptions(precision=3)
mpl.rcParams['legend.numpoints'] = 1

In [None]:
%matplotlib inline

In [None]:
plt.scatter(car_test.mpg, predictions_lg)
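
A scatter of actual against predicted values is easier to read with a reference line and axis labels. The sketch below uses randomly generated values (hypothetical stand-ins for `car_test.mpg` and `predictions_lg`) and adds a dashed identity line: points on the line are predicted exactly, points above it are over-predicted.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values standing in for car_test.mpg and predictions_lg
rng = np.random.default_rng(2)
actual = rng.uniform(10, 45, size=80)
predicted = actual + rng.normal(scale=3.0, size=80)

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.6)

# Identity line spanning the range of both axes
lims = [min(actual.min(), predicted.min()),
        max(actual.max(), predicted.max())]
ax.plot(lims, lims, color="black", linestyle="--")

ax.set_xlabel("actual mpg")
ax.set_ylabel("predicted mpg")
```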

**NOTE:** The following two cells require Andreas Mueller's `mglearn` package.  Install it with `pip` from the command line, into the Conda environment you are using to run this Notebook:

```bash
pip install mglearn
```

In [None]:
import mglearn
from mglearn.datasets        import make_blobs 

X, y = make_blobs(n_samples=(400, 50), centers=2, cluster_std=[7.0, 2], random_state=22)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=.05).fit(X_train, y_train)

In [None]:
mglearn.plots.plot_decision_threshold()
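
The idea of a decision threshold can also be explored without the plotting helper. The sketch below is an approximation of the setup above, not a reproduction of it: it uses `make_blobs` from `sklearn.datasets` (an assumption; with a list for `n_samples` the cluster centers are drawn at random) and compares the default threshold of 0 on `decision_function` with a lower, arbitrarily chosen threshold of -0.8.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced two-class data: 400 samples of class 0, 50 of class 1
X, y = make_blobs(n_samples=[400, 50], cluster_std=[7.0, 2.0],
                  random_state=22)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(gamma=0.05).fit(X_train, y_train)
scores = svc.decision_function(X_test)

# Default rule: predict class 1 when the decision function is positive
default_preds = (scores > 0).astype(int)

# Lowered threshold (hypothetical value): more points are labelled class 1
lowered_preds = (scores > -0.8).astype(int)

print(np.array_equal(default_preds, svc.predict(X_test)))
print(default_preds.sum(), lowered_preds.sum())
```

Lowering the threshold can only add points to the positive class, which trades precision for recall on the minority class.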

Lesson 8 Easter egg
-------------------

In [None]:
from __future__ import braces