# Machine Learning with `scikit-learn`

## <img src='imgs/robotguy.png' alt="Smiley face" width="42" height="42" align="left">Learning Objectives
* * *
* Learn the importance of pre-processing data and how scikit-learn expects data and helps create it
* See data transformations for machine learning in action
* Get an idea of your options for learning on training sets and applying a model for prediction
* See what sort of metrics are commonly used in scikit-learn
* Learn options for model evaluation
* Become familiar with ways to make this process robust and simplified (pipelining and tuning parameters)

## A brief introduction to machine learning

I like to think of a couple of ways to categorize machine learning approaches (with the help of the [machine learning wikipedia article](https://en.wikipedia.org/wiki/Machine_learning)).  The first way of thinking about ML is by the type of information given to a system.  By that criterion, there are three classical categories:
1.  Supervised learning - we get the data and the labels
2.  Unsupervised learning - only get the data (no labels)
3.  Reinforcement learning - reward/penalty based information (feedback)

Another way of categorizing ML approaches, is to think of the desired output:
1.  Classification
2.  Regression
3.  Clustering
4.  Density estimation
5.  Dimensionality reduction

--> This second approach is how scikit-learn categorizes its ML algorithms...and you'll see how it works in this module (lucky you!).
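
As a rough illustration of the output-based categories above, here is a minimal sketch that fits one estimator per category on the built-in iris data. The specific estimators (`LogisticRegression`, `KMeans`, `KernelDensity`, and so on) are picked only for illustration and are not a recommendation; the point is that each one is used through the same `fit`-style interface.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KernelDensity

X, y = datasets.load_iris(return_X_y=True)

# 1. Classification: predict a discrete label
clf = LogisticRegression(max_iter=1000).fit(X, y)
class_pred = clf.predict(X[:3])

# 2. Regression: predict a continuous value
#    (illustrative choice: the fourth column from the first three)
reg = LinearRegression().fit(X[:, :3], X[:, 3])
reg_pred = reg.predict(X[:3, :3])

# 3. Clustering: group samples without using the labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# 4. Density estimation: log-density of samples under a fitted model
kde = KernelDensity(bandwidth=0.5).fit(X)
log_density = kde.score_samples(X[:3])

# 5. Dimensionality reduction: project 4 features down to 2
X_2d = PCA(n_components=2).fit_transform(X)

print(class_pred, reg_pred, cluster_ids[:5], log_density, X_2d.shape)
```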

Easy reading is damn hard writing, and vice versa. --Nathaniel Hawthorne

### Preprocessing the Input Data

<b>Commonly, machine learning algorithms will require your data to be standardized and preprocessed.  In scikit-learn the data must also take on a certain structure as well.</b>

<p>What you might have to do before using scikit-learn:</p>
* Non-numerics transformed to numeric (tip: use the `applymap()` method from `pandas`, which pandas 2.1 renamed to `DataFrame.map()`)
* Fill in missing values
* Standardization
* Normalization
* Encoding categorical features (e.g. one-hot encoding or dummy variables)

<b>Features should end up in a numpy.ndarray (hence numeric) and labels in a list.</b>
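
The checklist above can be sketched on a tiny made-up table. The column names and values below are hypothetical, and mean imputation is just one of several possible strategies; the sketch only shows how missing values, standardization and one-hot encoding fit together to produce a numeric `numpy.ndarray`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "state": ["WA", "NY", "WA", "CO"],
})

# Fill in missing values with the column mean
age = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Standardize to zero mean and unit variance
age_scaled = StandardScaler().fit_transform(age)

# One-hot encode the categorical column (one 0/1 column per category)
state_onehot = OneHotEncoder().fit_transform(df[["state"]]).toarray()

# Stack everything into a single numeric feature array
X = np.hstack([age_scaled, state_onehot])
print(type(X), X.shape)
```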

Data options:
* Use pre-processed [datasets](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets) from scikit-learn
* [Create your own](http://scikit-learn.org/stable/datasets/index.html#sample-generators)
* Read from a file

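<br>Let's use a built-in dataset, `diabetes`, and evaluate a model on it with `cross_val_score`. This function performs k-fold cross-validation: it splits the data into k folds, then trains on all but one fold and scores on the held-out fold, once per fold, returning one score per fold. The number of folds is set by the `cv` parameter (older releases default to three folds, newer ones to five).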

In [None]:
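# Note: the old sklearn.cross_validation module was replaced by sklearn.model_selection
from sklearn import model_selection, datasets, linear_model

# Get the dataset (first 250 samples only)
diabetes = datasets.load_diabetes()
X = diabetes.data[:250]
y = diabetes.target[:250]

# Score a linear regression with 3-fold cross-validation (one R^2 score per fold)
clf = linear_model.LinearRegression()
score = model_selection.cross_val_score(clf, X, y, cv = 3)
print(score)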

<p>Regarding the above code chunk, we could have used a more manual approach with the `train_test_split` function in scikit-learn ( [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) ) as in:</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
```

<p>There is also a `cross_val_predict` function that returns the cross-validated prediction for each sample rather than a score per fold ( [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) )</p>
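
A minimal sketch of `cross_val_predict`, reusing the same diabetes subset and linear model as above (the choice of `cv=3` here is an assumption made to mirror the earlier cell). Each sample's prediction comes from a model that did not see that sample during training.

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict

diabetes = datasets.load_diabetes()
X = diabetes.data[:250]
y = diabetes.target[:250]

# One out-of-fold prediction per sample, rather than one score per fold
y_cv_pred = cross_val_predict(linear_model.LinearRegression(), X, y, cv=3)

print(y_cv_pred.shape)
print(y_cv_pred[:5])
```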

### Transformers
* PCA
* SelectKBest
* One-Hot Encoder

In [None]:
# PCA for dimensionality reduction

from sklearn import decomposition
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target

pca = decomposition.PCA(n_components = 3)
pca.fit(X)
X_t = pca.transform(X)

#np.set_printoptions(suppress=True)
#import matplotlib.pyplot as plt
#%matplotlib inline
#plt.scatter(X_t[:, 0], X_t[:, 1], c=y)

In [None]:
# SelectKBest for selecting top-scoring features

from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)

# Do feature selection
#  input is scoring function (here chi2) to get univariate p-values
#  and number of top-scoring features (k) - here we get the top 2
X_t = SelectKBest(chi2, k = 2).fit_transform(X, y)

print(X_t.shape)

<b>Note on scoring function selection in `SelectKBest` transformations:</b>
* For regression - f_regression
* For classification - chi2, f_classif
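
For the regression case, a short sketch using `f_regression` on the built-in diabetes data. The value `k=3` is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression

# A regression dataset: continuous target, 10 features
X, y = datasets.load_diabetes(return_X_y=True)

# Keep the 3 features with the highest univariate F-scores
selector = SelectKBest(f_regression, k=3)
X_t = selector.fit_transform(X, y)

print(X.shape, X_t.shape)
# Column indices of the features that were kept
print(selector.get_support(indices=True))
```

Keep in mind that `chi2` expects non-negative feature values (counts or frequencies, for example), so it would not be a suitable scorer for data like this that has been centered around zero.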

In [None]:
# OneHotEncoder for dummying variables

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

data = pd.DataFrame({'index': range(1, 7),
                    'state': ['WA', 'NY', 'CO', 'NY', 'CA', 'WA']})

# We encode both our categorical variable and its labels
enc = OneHotEncoder()
label_enc = LabelEncoder()

# Encode labels (can use for discrete numerical values as well)
data_label_encoded = label_enc.fit_transform(data['state'])
data['state'] = data_label_encoded

# Encode and "dummy" variables
data_feature_one_hot_encoded = enc.fit_transform(data[['state']])


print(data_feature_one_hot_encoded.toarray())
print(data_label_encoded)
print(label_enc.inverse_transform(data_label_encoded))

### Learning and Predictions
* Using regression and classification to train on a dataset, create a model, and predict on new data
* Here is scikit-learn's algorithm diagram (note the regression and classification bubbles) - this is not an exhaustive list of model options
![](imgs/ml_map.png)

> "Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."

> "Different estimators are better suited for different types of data and different problems."

<a href = "http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" style = "float: right">-Choosing the Right Estimator from sklearn docs</a>


<b>An estimator for recognizing a hand-written digit from an image</b>
* Using some hand-written dataset:
![](imgs/plot_lle_digits1.png)
we wish to predict to which group (a digit in the range 0-9) a newly written digit belongs.

> Or, in machine learning parlance, we <i>fit</i> an estimator on known samples of the digit classes to <i>predict</i> the class to which an unseen digit belongs.

Let's give it a try!

In [None]:
from sklearn.datasets import load_digits
from sklearn import svm

# Let's load the digits dataset (yes, sklearn includes this one)
digits = load_digits()
X = digits.data
y = digits.target

# Let's train on all but last value ("hold out" one)
X_train = X[:-1]
y_train = y[:-1]

# Define an estimator instance (here, support vector classification)
#   this just means giving our instance a name and setting the parameters
clf = svm.SVC(gamma=0.001, C=100.)

# We can now fit and predict with this object instance

In [None]:
# Let's fit the data to the SVC instance object
clf.fit(X_train, y_train)

In [None]:
# Let's predict on some data outside the training set

# This was actually our "held out" sample digit
X_test = X[-1:]

clf.predict(X_test)

In [None]:
# Question:  what was the label associated with this data?

In [None]:
# Plot it as it would appear as an image
import matplotlib.pyplot as plt
%matplotlib inline

# It's just a long array, so reshape to 8x8 2D array
image = X_test.reshape(8, 8)

plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')

# Do you agree with this prediction?

### Metrics
* <b>Confusion matrix</b> - visually inspect the quality of a classifier's predictions

<b>Here, we will process some data, classify it with SVM (see [here](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for more info), and view the quality of the classification with a confusion matrix.</b>

In [None]:
import numpy as np

# import model algorithm and data
from sklearn import svm, datasets

# import splitter
from sklearn.model_selection import train_test_split

# import metrics
from sklearn.metrics import confusion_matrix

# feature data (X) and labels (y)
iris = datasets.load_iris()
X = iris.data
y = iris.target

# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70, random_state = 42)

In [None]:
# perform the classification step and run a prediction on test set from above
clf = svm.SVC(kernel = 'linear', C = 0.1)
y_pred = clf.fit(X_train, y_train).predict(X_test)

In [None]:
# Define a plotting function for confusion matrices
#  (from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, target_names, title = 'The Confusion Matrix', cmap = plt.cm.YlOrRd):
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()

    # Add class labels to x and y axes
    tick_marks = np.arange(len(target_names))
    plt.xticks(tick_marks, target_names, rotation=45)
    plt.yticks(tick_marks, target_names)

    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')

    # Adjust the layout last so the labels added above are not clipped
    plt.tight_layout()

In [None]:
%matplotlib inline

cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, iris.target_names)

* <b>Classification reports</b> - a text report with important classification metrics (e.g. precision, recall)

In [None]:
from sklearn.metrics import classification_report

# Using the test and prediction sets from above
print(classification_report(y_test, y_pred, target_names = iris.target_names))

In [None]:
# Another example with some toy data

y_test = ['cat', 'dog', 'mouse', 'mouse', 'cat', 'cat']
y_pred = ['mouse', 'dog', 'cat', 'mouse', 'cat', 'mouse']

# How did our predictor do?
# target_names needs one entry per class (in sorted label order), not one per sample
print(classification_report(y_test, y_pred, target_names = sorted(set(y_test))))

EXERCISE:  Explore Metrics for Raw vs. Normalized Data
* something like [this](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)
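
As one possible starting point for this exercise, the sketch below computes a raw confusion matrix and a row-normalized one for the same iris split and linear SVC used earlier. Row normalization is done by hand with NumPy here, which is only one way to go about it; after normalizing, each row sums to 1 and the diagonal holds the per-class recall.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, train_size=0.70, random_state=42)

y_pred = svm.SVC(kernel='linear', C=0.1).fit(X_train, y_train).predict(X_test)

# Raw counts: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

# Normalize each row by the number of true samples in that class
cm_normalized = cm.astype(float) / cm.sum(axis=1, keepdims=True)

print(cm)
print(cm_normalized.round(2))
```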

### Evaluating Models and Under/Over-Fitting (+Pipelines Intro)
* Over-fitting or under-fitting can be visualized as below and tuned, as we will see later, with `GridSearchCV` parameter tuning
* A <b>validation curve</b> gives one an idea of the relationship of model complexity to model performance.
* For this examination it would help to understand the idea of the [<b>bias-variance tradeoff</b>](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
* A <b>learning curve</b> helps answer the question of whether adding more training data would benefit a model.  It is also a tool for investigating whether an estimator is more affected by variance error or bias error.
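
Both curves can be computed with helpers in `sklearn.model_selection`. The sketch below uses made-up noisy cosine data and a polynomial regression pipeline; the degrees, training sizes and `cv=5` are arbitrary choices for illustration. It prints mean scores instead of plotting them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noisy samples of a cosine (unsorted, so each fold spans the x range)
rng = np.random.RandomState(0)
X = rng.rand(60)[:, np.newaxis]
y = np.cos(1.5 * np.pi * X.ravel()) + 0.1 * rng.randn(60)

model = Pipeline([("polynomial_features", PolynomialFeatures(include_bias=False)),
                  ("linear_regression", LinearRegression())])

# Validation curve: score as a function of model complexity (polynomial degree)
degrees = [1, 4, 15]
train_scores, valid_scores = validation_curve(
    model, X, y,
    param_name="polynomial_features__degree", param_range=degrees, cv=5)
for d, tr, va in zip(degrees, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print("degree %2d: train R^2 = %.3f, validation R^2 = %.3f" % (d, tr, va))

# Learning curve: score as a function of training set size, at a fixed degree
model.set_params(polynomial_features__degree=4)
sizes, lc_train_scores, lc_valid_scores = learning_curve(
    model, X, y, train_sizes=[0.2, 0.5, 1.0], cv=5)
for n, tr, va in zip(sizes, lc_train_scores.mean(axis=1), lc_valid_scores.mean(axis=1)):
    print("%2d samples: train R^2 = %.3f, validation R^2 = %.3f" % (n, tr, va))
```

A large gap between the training and validation scores points toward variance error (over-fitting), while two low scores that sit close together point toward bias error (under-fitting).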

In [None]:
import numpy as np

# Let's run a prediction on some test data given a trained model

# First, create some data
X = np.sort(np.random.rand(20))
func = lambda x: np.cos(1.5 * np.pi * x)
y = np.array([func(x) for x in X])

In [None]:
# A plotting function

import matplotlib.pyplot as plt
%matplotlib inline

def plot_fit(X_train, y_train, X_test, y_pred):
    plt.plot(X_test, y_pred, label = "Model")
    plt.plot(X_test, func(X_test), label = "Function")
    plt.scatter(X_train, y_train,  label = "Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    # Show the "Model", "Function" and "Samples" labels set above
    plt.legend()

> <b>Pipelining</b> (as an aside to this section)
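* `Pipeline(steps=[...])` - where `steps` is a list of `(name, estimator)` tuples through which the data is passed in order; every step but the last must be a transformer, and a step's parameters can be set afterwards using the `<name>__<parameter>` naming pattern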
* For example, here we do a transformation (SelectKBest) and a classification (SVC) all at once in a pipeline we set up

```python
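from sklearn import datasets, svm
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

# chi2 needs non-negative features, so we assume the iris data here
X, y = datasets.load_iris(return_X_y = True)

# a feature selection instance
selection = SelectKBest(chi2, k = 2)

# classification instance
clf = svm.SVC(kernel = 'linear')

# make a pipeline from a list of (name, step) tuples
pipeline = Pipeline([("feature selection", selection), ("classification", clf)])

# train the model (fits the selector, transforms X, then fits the classifier)
pipeline.fit(X, y)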
```

See a full example [here](http://scikit-learn.org/stable/auto_examples/feature_stacker.html)

Note:  If you wish to apply <b>multiple transformations</b> side by side and concatenate their outputs into one feature set, try [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion)
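
A minimal sketch of `FeatureUnion`, assuming the iris data and an arbitrary pairing of PCA with univariate selection. The two transformers run on the same input and their outputs are concatenated column-wise before the classifier sees them.

```python
from sklearn import datasets, svm
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import FeatureUnion, Pipeline

X, y = datasets.load_iris(return_X_y=True)

# Two transformers applied side by side to the same input
combined = FeatureUnion([("pca", PCA(n_components=2)),
                         ("univariate_selection", SelectKBest(chi2, k=1))])

# 2 PCA components + 1 selected feature = 3 columns
X_combined = combined.fit_transform(X, y)
print(X_combined.shape)

# The union can itself be a step in a pipeline
pipeline = Pipeline([("features", combined),
                     ("classification", svm.SVC(kernel="linear"))])
pipeline.fit(X, y)
print(pipeline.score(X, y))
```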

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree = 1, include_bias = False)
lm = LinearRegression()

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("polynomial_features", poly),
                         ("linear_regression", lm)])
pipeline.fit(X[:, np.newaxis], y)


X_test = np.linspace(0, 1, 100)

y_pred = pipeline.predict(X_test[:, np.newaxis])

plot_fit(X, y, X_test, y_pred)

### Last, but not least, Searching Parameter Space with `GridSearchCV`

In [None]:
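# Note: the old sklearn.grid_search module was replaced by sklearn.model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(include_bias = False)
lm = LinearRegression()

pipeline = Pipeline([("polynomial_features", poly),
                     ("linear_regression", lm)])

# Keys follow the pattern <step name>__<parameter name>.
# Note: `normalize` was removed from LinearRegression in scikit-learn 1.2,
# so `fit_intercept` stands in here as the second parameter to search.
param_grid = dict(polynomial_features__degree = list(range(1, 30, 2)),
                  linear_regression__fit_intercept = [True, False])

# Try every combination in the grid with cross-validation and keep the best
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X[:, np.newaxis], y)
print(grid_search.best_params_)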
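
Beyond `best_params_`, a fitted search exposes the score of each candidate and can predict directly with the best model it found. Here is a self-contained sketch on made-up noisy cosine data; the smaller grid of degrees and `cv=5` are assumptions chosen to keep the example quick.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up noisy samples of a cosine
rng = np.random.RandomState(0)
X = rng.rand(40)[:, np.newaxis]
y = np.cos(1.5 * np.pi * X.ravel()) + 0.1 * rng.randn(40)

pipeline = Pipeline([("polynomial_features", PolynomialFeatures(include_bias=False)),
                     ("linear_regression", LinearRegression())])
param_grid = {"polynomial_features__degree": [1, 2, 3, 4, 5]}

grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X, y)

# Best parameter combination and its mean cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)

# Mean cross-validated score for every candidate in the grid
mean_scores = grid_search.cv_results_["mean_test_score"]
print(mean_scores)

# By default the best model is refit on all the data, so we can predict with it
y_new = grid_search.predict(np.array([[0.25], [0.75]]))
print(y_new)
```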

### Some References
* [A. Mueller's Conference Notebooks and Presentation](https://github.com/amueller/odscon-sf-2015)
* [An interesting real-world example set of notebooks for learning ML](https://github.com/cmmalone/malone_OpenDataSciCon)

### Some Datasets
* [Machine learning datasets](http://mldata.org/)
* [Make your own with sklearn](http://scikit-learn.org/stable/datasets/index.html#sample-generators)

Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016 Micheleen Harris