# Introduction to Machine Learning with `scikit-learn` III

## <img src='https://az712634.vo.msecnd.net/notebooks/python_course/v1/lightening.png' alt="Smiley face" width="42" height="42" align="left">Learning Objectives
* * *
* Become familiar with tools in `sklearn` for evaluating a model
* Learn what a pipeline is in `sklearn` and how to use it
* Learn and practice a technique for hyperparameter tuning with `GridSearchCV`
<img src='https://az712634.vo.msecnd.net/notebooks/python_course/v1/ml_process.png' alt="Smiley face" width="400"><br>

## Evaluating Models

### Evaluating using metrics
* <b>Confusion matrix</b> - visually inspect quality of a classifier's predictions (more [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)) - very useful to see if a particular class is problematic

<b>Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix.</b>

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Import model algorithm and data
from sklearn import svm, datasets

# Import splitter
from sklearn.model_selection import train_test_split

# Import metrics method
from sklearn.metrics import confusion_matrix

# Feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size = 0.90, random_state = 42)

In [None]:
# Perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'rbf', C = 0.5, gamma = 10)
y_pred = clf.fit(X_train, y_train).predict(X_test)

pd.DataFrame({'Prediction': iris.target_names[y_pred],
    'Actual': iris.target_names[y_test]})

In [None]:
# Accuracy score
clf.score(X_test, y_test)

In [None]:
# Define a plotting function for confusion matrices
#  (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, target_names, title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.tight_layout()
    
    # Add feature labels to x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)
    
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    
    plt.colorbar()

Numbers in confusion matrix:
* on-diagonal - counts of points for which the predicted label is equal to the true label
* off-diagonal - counts of mislabeled points
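
To make the diagonal/off-diagonal reading concrete, here is a minimal sketch on made-up labels (the `*_toy` names and values are hypothetical, not part of the iris example). It assumes the usual `sklearn` convention that rows are true labels and columns are predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels for three classes (0, 1, 2)
y_true_toy = [0, 0, 1, 1, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0, 2]

# Rows are true labels, columns are predicted labels
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)

# On-diagonal entries are correct predictions; everything else is mislabeled
correct = int(np.trace(cm_toy))
total = int(cm_toy.sum())
mislabeled = total - correct

print("correct:", correct, "mislabeled:", mislabeled)
print("accuracy from matrix:", correct / total)
```

The accuracy recovered from the matrix should agree with what `clf.score` reports for the same predictions.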

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Actual counts
print(cm)

# Visually inspect how well the classifier matched predictions to true labels
plot_confusion_matrix(cm, iris.target_names)

* <b>Classification reports</b> - a text report with important classification metrics (e.g. precision, recall)

In [None]:
from sklearn.metrics import classification_report

# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))
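
As a rough illustration of where the precision and recall columns come from, the sketch below recomputes them by hand for a made-up binary problem. The labels and the `*_bin` names are hypothetical; only `confusion_matrix`, `precision_score`, and `recall_score` come from `sklearn.metrics`.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true_bin = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_bin = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()

# Precision: of everything predicted positive, how much really was positive?
precision_manual = tp / (tp + fp)

# Recall: of everything that really was positive, how much did we find?
recall_manual = tp / (tp + fn)

print("precision:", precision_manual, precision_score(y_true_bin, y_pred_bin))
print("recall:   ", recall_manual, recall_score(y_true_bin, y_pred_bin))
```

False positives lower precision and false negatives lower recall, so which error is worse depends on the cost of each mistake in the application.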

Modify the code below in a new code cell.
```python
# Another example with some toy data - Fill in the blanks

y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']

# How did our predictor do?
print(classification_report(y_test, ___, target_names = ___))
```

In [None]:
# Try here

QUICK QUESTION:  Is it better to have too many false positives or too many false negatives?

### Evaluating Models and Under/Over-Fitting
* Over-fitting and under-fitting can be diagnosed with the curves described below and addressed, as we will see later, by parameter tuning with `GridSearchCV`
* A <b>validation curve</b> gives one an idea of the relationship of model complexity to model performance.
* For this examination it would help to understand the idea of the [<b>bias-variance tradeoff</b>](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
* A <b>learning curve</b> helps answer whether adding more training data would improve a model.  It is also a tool for investigating whether an estimator suffers more from variance error or from bias error.
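
Both curves can be computed with helpers in `sklearn.model_selection`. The following is a minimal sketch, assuming the iris data and an RBF `SVC`; the choice of `gamma` as the varied hyperparameter, its range, and the `*_iris` names are illustrative assumptions, not settings prescribed by this course.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Validation curve: vary one hyperparameter and compare train vs. validation scores
gamma_range = np.logspace(-3, 2, 6)
vc_train, vc_valid = validation_curve(
    SVC(kernel="rbf", C=0.5), X_iris, y_iris,
    param_name="gamma", param_range=gamma_range, cv=5)

for g, tr, va in zip(gamma_range, vc_train.mean(axis=1), vc_valid.mean(axis=1)):
    print("gamma=%g  train=%.3f  validation=%.3f" % (g, tr, va))

# Learning curve: fix the model and vary the amount of training data.
# shuffle=True matters here because the iris rows are ordered by class.
sizes, lc_train, lc_valid = learning_curve(
    SVC(kernel="rbf", C=0.5, gamma=0.1), X_iris, y_iris,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    shuffle=True, random_state=0)

for n, tr, va in zip(sizes, lc_train.mean(axis=1), lc_valid.mean(axis=1)):
    print("n_train=%d  train=%.3f  validation=%.3f" % (n, tr, va))
```

A large gap between the training and validation scores suggests over-fitting (variance error), while two low scores that sit close together suggest under-fitting (bias error).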

QUESTION:  Does a hyperparameter when increased/decreased cause overfitting or underfitting?  What are the implications of those cases?

## Search for best parameters and create a pipeline

### Easy reading...create and use a pipeline

> <b>Pipelining</b> (as an aside to this section)
* `Pipeline(steps=[...])` - where `steps` is a list of `(name, estimator)` tuples through which the data passes in order; every step except the last must be a transformer, and the last is usually the final estimator
* For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up.

See a full example [here](http://scikit-learn.org/stable/auto_examples/feature_stacker.html)

Note:  If you wish to apply <b>multiple transformations</b> to the same input and concatenate their outputs in your pipeline, try [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion)
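
A minimal sketch of `FeatureUnion`, loosely following the linked feature-stacker example: it runs two transformers side by side and concatenates their outputs column-wise. The use of `PCA`, the step names, and the component counts are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Two transformers applied to the same input; their outputs are stacked side by side
combined = FeatureUnion([
    ("pca", PCA(n_components=2)),               # 2 projected features
    ("univ_select", SelectKBest(chi2, k=1)),    # 1 selected original feature
])

X_combined = combined.fit_transform(X_iris, y_iris)
print(X_combined.shape)  # (150, 3): 2 PCA columns + 1 selected column

# A FeatureUnion is itself a transformer, so it can be a step in a Pipeline
model = Pipeline([("features", combined), ("svm", SVC(kernel="linear"))])
model.fit(X_iris, y_iris)
print(model.score(X_iris, y_iris))
```

Note that the score printed here is measured on the training data, so it only shows that the combined model runs end to end.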

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# A feature selection instance
selection = SelectKBest(chi2, k = 2)

# Classification instance
clf = SVC(kernel = 'linear')

# Make a pipeline
pipeline = Pipeline([("feature selection", selection), ("classification", clf)])

# Train the model on the training set only, so the test score is not inflated
pipeline.fit(X_train, y_train)

# Score
pipeline.score(X_test, y_test)

### Last, but not least, Searching Parameter Space with `GridSearchCV`

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(include_bias = False)
lm = LinearRegression()

pipeline = Pipeline([("polynomial_features", poly),
                         ("linear_regression", lm)])

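# Note: `normalize` was removed from LinearRegression in scikit-learn 1.2,
# so `fit_intercept` is searched here instead
param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),
                  linear_regression__fit_intercept = [False, True])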

grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)
print(grid_search.best_params_)
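
The keys of a parameter grid for a pipeline follow the pattern `<step name>__<parameter name>` (two underscores). The sketch below shows that convention and how to inspect the search afterwards; the step names `select` and `svc`, the candidate values, and `cv=5` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

pipe = Pipeline([("select", SelectKBest(chi2)), ("svc", SVC())])

# "<step name>__<parameter name>" routes each candidate list to the right step
param_grid = {
    "select__k": [1, 2, 3, 4],
    "svc__kernel": ["linear", "rbf"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_iris, y_iris)

print(search.best_params_)   # best combination found
print(search.best_score_)    # mean cross-validated score of that combination

# cv_results_ has one entry per parameter combination (4 values of k x 2 kernels)
print(len(search.cv_results_["params"]))
```

With the default `refit=True`, `search.best_estimator_` is the whole pipeline refit on all of the data passed to `fit`, using the best parameters.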

EXERCISE: Tune hyperparameters C and gamma in a support vector machine, SVC, using `GridSearchCV`

The parameter definitions are:
1. C: Penalty parameter C of the error term (e.g. try within the range from 0.01 to 1.0)
2. gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid' kernels (e.g. try positive values up to 10)
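
One possible way to build candidate values over these ranges is a logarithmic spacing, which is a common choice for scale-like parameters such as C and gamma. The variable names and the number of candidates below are assumptions, and this only prepares the values; it is not a solution to the exercise.

```python
import numpy as np

# Hypothetical candidate values spanning the suggested ranges
C_candidates = np.logspace(-2, 0, 5)       # 5 values from 0.01 to 1.0
gamma_candidates = np.logspace(-2, 1, 4)   # 0.01, 0.1, 1.0, 10.0

print(C_candidates)
print(gamma_candidates)
```

The gamma candidates start above zero because a numeric gamma must be positive.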

The non-`GridSearchCV` approach from earlier looks like:

```python
# Import model algorithm and data
from sklearn import svm, datasets

# Import splitter
from sklearn.model_selection import train_test_split

# Import metrics
from sklearn.metrics import confusion_matrix

# Feature data (X) and labels (y)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, train_size = 0.90, random_state = 42)
    
# Perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'rbf', C = 0.5, gamma = 10)
y_pred = clf.fit(X_train, y_train).predict(X_test)
```

In [None]:
# Code up your solution here...

### Additional Resources
* Another popular Python library, [PyBrain](http://pybrain.org/), includes reinforcement learning
* For a parallel, out-of-core Python tool, check out [dask](http://dask.pydata.org/en/latest/) and the learn example.

Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016