# Tutorial: Machine Learning in scikit-learn
*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*

![Machine learning](images/01_robot.png)

## Agenda

1. **What is machine learning?**
    - What are the two main categories of machine learning?
    - What are some examples of machine learning?
    - How does machine learning "work"?

2. **Setting up Python for machine learning: scikit-learn and IPython Notebook**
    - What are the benefits and drawbacks of scikit-learn?
    - How do I install scikit-learn?
    - How do I use the IPython Notebook?
    - What are some good resources for learning Python?

3. **Getting started in scikit-learn with the famous iris dataset**

# - What is machine learning?

## What are the two main categories of machine learning?

**Supervised learning**: Making predictions using data
    
- Example: Would Leonardo DiCaprio's character on the Titanic be classified as "survived" or "not survived"?
- There is an outcome we are trying to predict

![Titanic filter](images/superv_Titanic.png)

**Unsupervised learning**: Extracting structure from data

- Example: Segment grocery store shoppers into clusters that exhibit similar behaviors
- There is no "right answer"

![Clustering](images/01_clustering.png)
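As a rough illustration of extracting structure without labels, the sketch below clusters a handful of made-up shoppers with `KMeans`. The feature values, the choice of two clusters, and the variable names are assumptions for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical shopper data: [visits per month, average basket size in dollars]
shoppers = np.array([
    [2, 15], [3, 18], [2, 20],      # infrequent visits, small baskets
    [12, 90], [11, 95], [13, 88],   # frequent visits, large baskets
])

# No labels are passed to the model: it only sees the features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(shoppers)

# Shoppers with similar behavior receive the same cluster id
print(cluster_ids)
```

The cluster ids themselves are arbitrary: there is no "right answer" to compare them against, only groups of similar observations.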

## How does supervised learning "work"?

High-level steps of supervised learning:

1. First, train a **machine learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine learning model" learns the relationship between the attributes of the data and its outcome

2. Then, make **predictions** on **new data** for which the label is unknown

![Supervised learning](images/01_supervised_learning.png)

The primary goal of supervised learning is to build a model that "generalizes": It accurately predicts the **future** rather than the **past**!

# - Setting up Python for machine learning: scikit-learn and IPython Notebook

## Benefits and drawbacks of scikit-learn

### Benefits:

- **Consistent interface** to machine learning models
- Provides many **tuning parameters** but with **sensible defaults**
- Exceptional **documentation**
- Rich set of functionality for **companion tasks** (model selection, model evaluation, data preparation, ...)
- **Active community** for development and support

### Potential drawbacks:

- Harder (than R) to **get started with machine learning**
- Less emphasis (than R) on **model interpretability**

### Further reading:

- Ben Lorica: [Six reasons why I recommend scikit-learn](http://radar.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html)
- scikit-learn authors: [API design for machine learning software](http://arxiv.org/pdf/1309.0238v1.pdf)
- Data School: [Should you teach Python or R for data science?](http://www.dataschool.io/python-or-r-for-data-science/)

![scikit-learn logo](images/02_sklearn_logo.png)

## Installing scikit-learn

**Option 1:** [Install scikit-learn library](http://scikit-learn.org/stable/install.html) and dependencies (NumPy and SciPy)

**Option 2:** [Install Anaconda distribution](https://www.continuum.io/downloads) of Python, which includes:

- Hundreds of useful packages (including scikit-learn)
- IPython and IPython Notebook
- conda package manager
- Spyder IDE

![IPython header](images/02_ipython_header.png)

## Using the IPython Notebook

### Components:

- **IPython interpreter:** enhanced version of the standard Python interpreter
- **Browser-based notebook interface:** weave together code, formatted text, and plots

### Installation:

- **Option 1:** Install [IPython](http://ipython.org/install.html) and the [notebook](https://jupyter.readthedocs.io/en/latest/install.html)
- **Option 2:** Included with the Anaconda distribution

### Launching the Notebook:

- Type **jupyter notebook** at the command line to open the dashboard
- Don't close the command line window while the Notebook is running

# Part A. Supervised Algorithms. 
## Methodology overview

To evaluate how well our supervised models generalize, we can split our data into a training and test set:

![ini_matrix](images/04_matrix.svg) 

The procedure is as follows:

![supervised_workflow](images/supervised_workflow.svg) 

## Getting started in scikit-learn

- **`model.fit(X_train, y_train)`**: train a new model on the training data
- **`model.predict(X_test)`**: predict the output values for new data
- **`model.score(X_test, y_test)`**: evaluate the model by comparing its predictions for `X_test` against the true values `y_test`
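The three calls can be sketched end to end as follows. This is a minimal example that assumes a `KNeighborsClassifier` on the iris dataset; any scikit-learn classifier should follow the same pattern. Note that `score` takes the test features and the true labels, and for classifiers it returns the mean accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

# train a new model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# predict the output values for data the model has not seen
y_pred = model.predict(X_test)

# evaluate: score() predicts internally and compares against y_test
score_direct = model.score(X_test, y_test)
score_from_pred = accuracy_score(y_test, y_pred)
print(score_direct, score_from_pred)
```

Both numbers are the same accuracy, computed in two ways.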

# Part B. Supervised Algorithm: Classification.

![superv_flowchart](images/supervised_learning_flowchart.png)

## Step 1. Getting started in scikit-learn with the famous iris dataset

![Iris](images/03_iris.png)

- 50 samples of 3 different species of iris (150 samples total)
- Measurements: sepal length, sepal width, petal length, petal width

## Step 2. Loading the iris dataset into scikit-learn

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [None]:
#Hide warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [None]:
# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

# print the names of the four features
print(iris.feature_names)

# print the iris data
print(iris.data[0:5])

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)
# print integers representing the species of each observation
print(iris.target)

## Requirements for working with data in scikit-learn

1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**
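When raw data does not meet these requirements yet, a conversion step is needed. The sketch below assumes the measurements arrive as nested lists and the species as strings; `raw_features`, `raw_species`, `X_demo`, and `y_demo` are illustrative names, and `LabelEncoder` is one possible way to make the response numeric.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative raw data: measurements as nested lists, species as strings
raw_features = [[5.1, 3.5, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4], [6.3, 3.3, 6.0, 2.5]]
raw_species = ['setosa', 'versicolor', 'virginica']

# features: a 2D NumPy array of shape (n_observations, n_features)
X_demo = np.asarray(raw_features, dtype=float)

# response: a separate 1D numeric array of shape (n_observations,)
encoder = LabelEncoder()
y_demo = encoder.fit_transform(raw_species)

print(X_demo.shape)      # (3, 4)
print(y_demo.shape)      # (3,)
print(y_demo, encoder.classes_)
```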

In [None]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

In [None]:
# store feature matrix in "X"
X = iris.data

# store response vector in "y"
y = iris.target

## Step 3. Split the data

## - Evaluation procedure #1: Train and test on the entire dataset

Problems with training and testing on the same data

- Goal is to estimate likely performance of a model on **out-of-sample data**
- But, maximizing training accuracy rewards **overly complex models** that won't necessarily generalize
- Unnecessarily complex models **overfit** the training data
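A small sketch of the problem, assuming KNN as the model: with K=1, every training observation is its own nearest neighbor, so training and testing on the same data reports perfect accuracy even though that says little about new data. The names `X_all`, `y_all`, and `training_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_iris(return_X_y=True)

training_accuracy = {}
for k in (1, 5):
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    # train and test on the same data
    knn_demo.fit(X_all, y_all)
    training_accuracy[k] = knn_demo.score(X_all, y_all)

# K=1 is the most complex KNN model and simply memorizes the training data
print(training_accuracy)
```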

<img src="images/Overfitting.png">

## - Evaluation procedure #2: Train/test split

1. Split the dataset into two pieces: a **training set** and a **testing set**.
2. Train the model on the **training set**.
3. Test the model on the **testing set**, and evaluate how well we did.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=4)

<img src="images/05_train_test_split.png">

### Stratification
Some classification problems exhibit a large imbalance in the distribution of the target classes: for instance, there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling, which preserves the class proportions in both the training and testing sets, especially for relatively small datasets.

In [None]:
import numpy as np
print ('TRAIN percentages: ', np.bincount(y_train)/float(len(y_train))*100)
print ('TEST percentages: ', np.bincount(y_test)/float(len(y_test))*100)

In order to stratify the split, we can pass the label array as an additional option to the *train_test_split* function:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
print ('TRAIN percentages: ', np.bincount(y_train)/float(len(y_train))*100)
print ('TEST percentages: ', np.bincount(y_test)/float(len(y_test))*100)

What did this accomplish?

- Model can be trained and tested on **different data**
- Response values are known for the testing set, and thus **predictions can be evaluated**
- **Testing accuracy** is a better estimate than training accuracy of out-of-sample performance

In [None]:
# print the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)

In [None]:
# print the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)

## Step 4. Visualization

Go to the notebook [05_matplotlib_viz_gallery.ipynb](05_matplotlib_viz_gallery.ipynb)

## Step 4. Training a machine learning model with scikit-learn

## - K-nearest neighbors (KNN) classification

1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
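The three steps can be sketched from scratch with NumPy. This is a simplified illustration, not how scikit-learn implements KNN: the function name `knn_predict` and the toy data are made up for this example, it uses straight-line (Euclidean) distance, and ties in the vote are broken arbitrarily.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 2: distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest_idx = np.argsort(distances)[:k]
    # Step 3: most popular response value among the k nearest neighbors
    votes = Counter(np.asarray(y_train)[nearest_idx].tolist())
    return votes.most_common(1)[0][0]


# Tiny hypothetical training set with two features and two classes
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

# Step 1: pick a value for K
print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # 0
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # 1
```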

### Example K-nearest neighbors

In [None]:
from IPython.display import Image
Image(filename='./images/04_5knn_dataset.png', width=400) 

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" (i.e. straight-line) distance between two points in Euclidean space. 
The Euclidean distance between points a and b is the length of the line segment connecting them:

$$d(a, b)= \sqrt{\sum\limits_{i=1}^n |a_i - b_i|^2}$$

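It is important that all features are on the same scale, because features with larger numeric ranges dominate the distance (in the iris dataset, all four measurements are in centimeters).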
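The formula and the effect of scale can be checked with a few lines of NumPy. The flowers below are hypothetical, and `euclidean_distance` is a helper written for this sketch: converting only one feature from centimeters to millimeters is enough to change which neighbor is nearest.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between points a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(np.sum(np.abs(a - b) ** 2))


print(euclidean_distance([0, 0], [3, 4]))  # 5.0

# Hypothetical flowers described by [sepal length, petal width], both in cm
query = np.array([5.0, 1.0])
flower_a = np.array([5.5, 1.0])  # differs by 0.5 cm in sepal length
flower_b = np.array([5.0, 1.4])  # differs by 0.4 cm in petal width

# Same unit for both features: flower_b is the nearer neighbor
d_a = euclidean_distance(query, flower_a)
d_b = euclidean_distance(query, flower_b)
print(d_a, d_b)

# Express only petal width in mm: that feature now dominates the distance
to_mm = np.array([1.0, 10.0])
d_a_mixed = euclidean_distance(query * to_mm, flower_a * to_mm)
d_b_mixed = euclidean_distance(query * to_mm, flower_b * to_mm)
print(d_a_mixed, d_b_mixed)  # flower_a is now the nearer neighbor
```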

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_k5 = KNeighborsClassifier(n_neighbors=5)
knn_k5.fit(X_train, y_train)

In [None]:
knn_k5.predict([[3, 5, 4, 2]])

In [None]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn_k5.predict(X_new)

In [None]:
from sklearn import metrics
y_pred_k5 = knn_k5.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_k5))

## Step 5. Cross-validation

In [None]:
# read in the iris data
iris = load_iris()

# create X (features) and y (response)
X = iris.data
y = iris.target

In [None]:
# use train/test split with different random_state values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2) #random_state=1

# check classification accuracy of KNN with K=10
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

### Steps for K-fold cross-validation

Diagram of **5-fold cross-validation:**

![5-fold cross-validation](images/07_cross_validation_diagram.png)
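In outline, K-fold cross-validation splits the data into K folds, uses each fold once as the testing set while training on the remaining folds, and averages the K testing scores. A minimal manual version is sketched below. It assumes `StratifiedKFold` so that class proportions stay similar across folds, and it compares the result with `cross_val_score` given the same folds; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_cv, y_cv = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, test_idx in folds.split(X_cv, y_cv):
    # train on 4 folds, test on the remaining fold
    fold_model = KNeighborsClassifier(n_neighbors=5)
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

# the cross-validated accuracy is the average over the 5 folds
manual_mean = np.mean(fold_scores)

# the same procedure in one call, reusing the same folds
library_mean = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_cv, y_cv, cv=folds).mean()
print(manual_mean, library_mean)
```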

In [None]:
# simulate splitting a dataset of 25 observations into 5 folds
from sklearn.model_selection import KFold
observations = list(range(25))  # 25 observations, indexed 0 to 24
kf = KFold(n_splits=5, shuffle=False)

# print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for i, (train, test) in enumerate(kf.split(observations), start=1):
    # convert the index array to a string so the alignment specifier can be applied
    print('{:^9} {} {:^45}'.format(i, train, str(test)))

### 10-fold cross-validation with KNeighborsClassifier (K=10)

In [None]:
from sklearn.model_selection import cross_val_predict

# 10-fold cross-validation with K=10 for KNN (the n_neighbors parameter)
predicted = cross_val_predict(knn, X, y, cv=10)
metrics.accuracy_score(iris.target, predicted) 

## Step 6. Selecting the best K

**Goal:** Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset.
More info about [cross_validation](http://scikit-learn.org/stable/modules/cross_validation.html) and [model scoring](http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter).

In [None]:
from sklearn.model_selection import cross_val_score

# search for an optimal value of K for KNN
k_range = list(range(1, 15))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())
print(k_scores)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# plot the value of K for KNN (x-axis) versus the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

## - Using a different value for K

In [None]:
# instantiate the model (using the value K=5)
knn_k5 = KNeighborsClassifier(n_neighbors=5)
y_pred_k5 = cross_val_predict(knn_k5, X, y, cv=10)
print(metrics.accuracy_score(y, y_pred_k5) )

# instantiate the model (using the value K=9)
knn_k9 = KNeighborsClassifier(n_neighbors=9)
y_pred_k9 = cross_val_predict(knn_k9, X, y, cv=10)
print(metrics.accuracy_score(y, y_pred_k9) )

## - Visualize K=5 vs K=9

In [None]:
# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt

# allow plots to appear within the notebook
%matplotlib inline

# indices of the samples classified correctly and incorrectly with K=5
correct_idx_K5 = np.where(y_pred_k5 == y)[0]
incorrect_idx_K5 = np.where(y_pred_k5 != y)[0]
print('Samples incorrectly classified K5:')
print(incorrect_idx_K5)

# indices of the samples classified correctly and incorrectly with K=9
correct_idx_K9 = np.where(y_pred_k9 == y)[0]
incorrect_idx_K9 = np.where(y_pred_k9 != y)[0]
print('Samples incorrectly classified K9:')
print(incorrect_idx_K9)

# one scatter plot per value of K, with the misclassified samples highlighted in red
colors = ["darkblue", "darkgreen", "gray"]
for k, incorrect_idx in [(5, incorrect_idx_K5), (9, incorrect_idx_K9)]:
    # plot two dimensions: sepal width (column 1) against petal length (column 2)
    for class_id, color in enumerate(colors):
        idx = np.where(y == class_id)[0]
        plt.scatter(X[idx, 1], X[idx, 2], color=color, label="Class %s" % str(class_id))
    plt.scatter(X[incorrect_idx, 1], X[incorrect_idx, 2], color="darkred")

    plt.xlabel('sepal width [cm]')
    plt.ylabel('petal length [cm]')
    plt.legend(loc=3)
    plt.title("Iris Classification results. K= " + str(k))
    plt.show()

## Step 7. Model Selection

**Goal:** Compare the best KNN model with logistic regression on the iris dataset

In [None]:
# 10-fold cross-validation with the best KNN model
knn = KNeighborsClassifier(n_neighbors=9)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

In [None]:
# 10-fold cross-validation with logistic regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

## Step 8. Efficiently searching for optimal tuning parameters using `GridSearchCV`

Allows you to define a **grid of parameters** that will be **searched** using K-fold cross-validation

In [None]:
from sklearn.model_selection import GridSearchCV

# define the parameter values that should be searched
k_range = list(range(1, 15))

# create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)
print(param_grid)

# instantiate the grid
grid = GridSearchCV(estimator=knn, param_grid = param_grid, cv=10, scoring='accuracy')

- You can set **`n_jobs = -1`** to run computations in parallel (if supported by your computer and OS)

In [None]:
grid

In [None]:
# fit the grid with data
grid.fit(X_train, y_train)

In [None]:
# examine the best model
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
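The same mechanism extends to several parameters at once. The sketch below searches `n_neighbors` together with the `weights` option of `KNeighborsClassifier`; the choice of `weights` as a second parameter and the names `param_grid_2d` and `grid_2d` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_gs, y_gs = load_iris(return_X_y=True)

# every combination of the listed values is evaluated with 10-fold cross-validation
param_grid_2d = {
    'n_neighbors': list(range(1, 15)),
    'weights': ['uniform', 'distance'],
}
grid_2d = GridSearchCV(KNeighborsClassifier(), param_grid_2d, cv=10, scoring='accuracy')
grid_2d.fit(X_gs, y_gs)

print(len(grid_2d.cv_results_['params']))  # 14 * 2 = 28 combinations
print(grid_2d.best_params_)
print(grid_2d.best_score_)
```

The grid grows multiplicatively with each added parameter, which is where `n_jobs` can help.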

### Model evaluation procedures

1. **Training and testing on the same data**
    - Rewards overly complex models that "overfit" the training data and won't necessarily generalize
2. **Train/test split**
    - Split the dataset into two pieces, so that the model can be trained and tested on different data
    - Better estimate of out-of-sample performance, but still a "high variance" estimate
    - Useful due to its speed, simplicity, and flexibility
3. **K-fold cross-validation**
    - Systematically create "K" train/test splits and average the results together
    - Even better estimate of out-of-sample performance
    - Runs "K" times slower than train/test split

## Step 9. Evaluating a classification model with a confusion matrix

Comparing the **true** and **predicted** response values

In [None]:
# print the first 25 true and predicted responses
print('True:', y_test[0:25])
print('Pred:', y_pred[0:25])

**Conclusion:**

- Classification accuracy is the **easiest classification metric to understand**
- But, it does not tell you what **"types" of errors** your classifier is making

In [None]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred))

![Small confusion matrix](images/09_confusion_matrix_3.png)

- Every observation in the testing set is represented in **exactly one box**
- It's a 3x3 matrix because there are **3 response classes**
- The format shown here is **not** universal
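To make the layout concrete, here is a small sketch with hand-made labels for three classes (the label lists are hypothetical). In scikit-learn's convention, rows correspond to true classes and columns to predicted classes, so the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for 10 observations
y_true_3 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_3 = [0, 0, 0, 1, 2, 1, 2, 2, 1, 2]

cm = confusion_matrix(y_true_3, y_pred_3)
print(cm)
# [[3 0 0]
#  [0 2 1]
#  [0 1 3]]

# every observation lands in exactly one box
print(cm.sum())  # 10

# fraction of each true class that was predicted correctly (per-class recall)
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print(per_class_recall)
```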

![Small confusion matrix](images/09_confusion_matrix.png)

Basic terminology for a binary classifier (here, one that predicts whether a patient has diabetes):

- True Positives (TP): we correctly predicted that they do have diabetes
- True Negatives (TN): we correctly predicted that they don't have diabetes
- False Positives (FP): we incorrectly predicted that they do have diabetes (a "Type I error")
- False Negatives (FN): we incorrectly predicted that they don't have diabetes (a "Type II error")
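For a binary problem, the four counts can be read directly from the confusion matrix. The labels below are hypothetical (1 = has diabetes, 0 = does not), and the variable names are chosen for this sketch.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels: 1 = has diabetes, 0 = does not
y_true_demo = [1, 0, 0, 1, 1, 0, 0, 1, 0, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

# for labels 0 and 1, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print('TP:', tp, 'TN:', tn, 'FP:', fp, 'FN:', fn)  # TP: 3 TN: 4 FP: 2 FN: 1

# accuracy counts both kinds of correct prediction
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.7
```

Two classifiers with the same accuracy can differ in how their errors split between FP and FN, which is the information accuracy alone hides.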

# SUMMARY

![superv_flowchart](images/supervised_learning_flowchart.png)

## Comments or Questions?

- Email: <izaskun.mendia@tecnalia.com>
- Github: https://github.com/izmendi/
