# Introduction to Machine Learning with `scikit-learn` II

## <img src='https://az712634.vo.msecnd.net/notebooks/python_course/v1/iris-setosa.jpg' alt="Iris setosa" width="42" height="42" align="left">Learning Objectives
* * *
* Become familiar with how supervised learning works in `sklearn`
* Become familiar with how unsupervised learning works in `sklearn`

## Learning Algorithms - Supervised Learning
<img src='https://raw.githubusercontent.com/PythonWorkshop/intro-to-sklearn/master/imgs/ml_process_by_micheleenharris.png' alt="Machine learning process diagram" width="400"><br>
>  Reminder:  All supervised estimators in scikit-learn implement a `fit(X, y)` method to fit the model and a `predict(X)` method that, given unlabeled observations X, returns the predicted labels y. (direct quote from `sklearn` docs)
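
To make the `fit` / `predict` contract concrete without committing to a "real" algorithm yet, here is a minimal sketch using `DummyClassifier`, a deliberately naive baseline from `sklearn.dummy`. The choice of this estimator and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A naive baseline: it always predicts the most frequent label seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)                    # learn from labeled observations
predicted = baseline.predict(X[:5])   # return one predicted label per observation

print(predicted)
```

Any other supervised estimator can be swapped in for `DummyClassifier` and the two calls stay the same.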

* Given that Iris is a fairly small, labeled dataset with relatively few features...what algorithm would you start with and why?

> "Often the hardest part of solving a machine learning problem can be finding the right estimator for the job."

> "Different estimators are better suited for different types of data and different problems."

<a href = "http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html" style = "float: right">-Choosing the Right Estimator from sklearn docs</a>


In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<b>An estimator for recognizing a new iris from its measurements</b>

> Or, in machine learning parlance, we <i>fit</i> an estimator on known samples of the iris measurements to <i>predict</i> the class to which an unseen iris belongs.

Let's give it a try!  (We are actually going to hold out a small percentage of the `iris` dataset and check our predictions against the labels)

Below, we will use the **`train_test_split`** helper function to split our data into a training set and a test set.  Another option, cross-validation, appears further on.
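
As a rough idea of what such a split involves, the sketch below shuffles the row indices and slices them into two groups. The function `simple_split` is a hypothetical helper written for this illustration; it is not how scikit-learn implements `train_test_split`, which offers more options (stratification, for instance).

```python
import numpy as np

def simple_split(X, y, test_size=0.3, seed=0):
    """Hypothetical helper: shuffle row indices, then slice off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so features and labels stay paired
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: 10 samples with 2 features each
X_demo = np.arange(20).reshape((10, 2))
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo, test_size=0.3)
print(X_tr.shape, X_te.shape)
```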

In [None]:
# Let's import sklearn modules
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases

# Let's load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and test sets
#   - These namings are used very commonly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
# Let's try a decision tree classification method
from sklearn import tree

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2) # random_state used for reproducibility
t.fit(X_train, y_train)

t.score(X_test, y_test) # what performance metric is this, do you think?

Modify the following code in a new cell below.
```python
# Feed in test data - Fill in blanks
y_pred = t.predict(___)

# Compare first predicted value to first test value for labels
print("Prediction: %d, Original label: %d" % (y_pred[0], ___))
```

In [None]:
# Try here

In [None]:
# Here's a nifty way to cross-validate (useful for quick model evaluation!)
# (cross_val_score lives in sklearn.model_selection in current scikit-learn)
from sklearn.model_selection import cross_val_score
from sklearn import tree

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2)

# Splits, fits and predicts all in one with scoring (repeats this once per cross-validation fold)
score = cross_val_score(t, X, y)
score

QUESTIONS:  What do these scores tell you?  Do you think they are too high or too low?  If a score is 1.0, what does that mean?
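
One way to think about these questions is to look at the spread of the per-fold scores rather than at a single number. The sketch below is only an illustration: the explicit `cv=5` and the variable names are assumptions, not part of the exercise above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=4, criterion="entropy", random_state=2)

# One score per fold; cv=5 is an assumption made explicit here
scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print("score per fold:", scores.round(3))
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

A large standard deviation across folds suggests the estimate depends heavily on which samples happened to be held out.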

### What does the graph look like for this decision tree?  
* i.e. what are the "questions" and "decisions" for this tree...
* Note:  You need both Graphviz app and the python package `graphviz` (It's worth it for this cool decision tree graph, I promise!)
* To install both on OS X:
```
sudo port install graphviz
sudo pip install graphviz
```
* For general Installation see [this guide](http://graphviz.readthedocs.org/en/latest/manual.html)
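
If installing Graphviz is not an option, a text rendering of the same "questions" and "decisions" may be enough. The sketch below uses `export_text` from `sklearn.tree` (available in recent scikit-learn releases); the shallow `max_depth=2` is an assumption chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, criterion="entropy", random_state=2)
clf.fit(iris.data, iris.target)

# Each indented line is a threshold test on a feature; leaves show the predicted class
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```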

In [None]:
# Let's reimport sklearn modules
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases

# Let's load the iris dataset again - what goes here?
iris = load_iris()
___, ___ = iris.data, iris.target

# Split data into training and test sets
#   - These namings are used very commonly
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [None]:
!pip install graphviz

In [None]:
from sklearn.tree import export_graphviz
import graphviz

# Let's rerun the decision tree classifier
from sklearn import tree

t = tree.DecisionTreeClassifier(max_depth = 4,
                                    criterion = 'entropy', 
                                    class_weight = 'balanced',
                                    random_state = 2)

# What method do we use to train our model?
t.___(X_train, y_train)

t.score(X_test, y_test) # what performance metric is this?

export_graphviz(t, out_file="mytree.dot",  
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)

with open("mytree.dot") as f:
    dot_graph = f.read()

graphviz.Source(dot_graph, format = 'png')

### From Decision Tree to Random Forest

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target

Modify the code here in a new cell and fill in the blank.

```python
# Split into training and test (30% into test) - Fill in the blank
X_train, X_test, y_train, y_test = ___(X, y, test_size = ___)
```

In [None]:
# Try here

Modify the following code in a new cell below.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(max_depth=4,
                                criterion = 'entropy', 
                                n_estimators = 100, 
                                class_weight = 'balanced',
                                n_jobs = -1,
                               random_state = 2)
# Fill in the blanks below
forest.fit(___, y_train)

y_preds = iris.target_names[forest.predict(___)]

forest.score(X_test, y_test)
```

In [None]:
# Try here

In [None]:
# Let's use cross validation / scoring method - Fill in the blank
# (the old sklearn.cross_validation module is now sklearn.model_selection)
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

# Reinitialize classifier
forest = RandomForestClassifier(max_depth=4,
                                criterion = 'entropy', 
                                n_estimators = 100, 
                                class_weight = 'balanced',
                                n_jobs = -1,
                               random_state = 2)

Modify the code below in a new cell to fill in the appropriate function.

```python
# Cross validation / scoring method
score = model_selection.___(forest, X, y)
score
```

In [None]:
# Try here

QUESTION:  Compared with the decision tree, what do these accuracy scores tell you?  Do they seem more reasonable?

### Splitting into train and test set vs. cross-validation

<p>We can be explicit and use the `train_test_split` function in scikit-learn ( [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) ), as shown above for the `iris` data:</p>

```python
# Create some data by hand and place 70% into a training set and the rest into a test set
# Here we are using labeled features (X - feature data, y - labels) in our made-up data
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.70)
clf = linear_model.LinearRegression()
clf.fit(X_train, y_train)
```

OR

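Be more concise and let `cross_val_score` do the splitting, fitting and scoring in one call:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score  # formerly in sklearn.cross_validation
X, y = np.arange(10).reshape((5, 2)), range(5)
clf = linear_model.LinearRegression()
score = cross_val_score(clf, X, y)
```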
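
To see what the concise version hides, the loop below spells the procedure out by hand with `KFold`. This is a sketch under stated assumptions: the synthetic data, the five folds and the variable names are made up for illustration, and the real helper handles more cases (custom scorers and stratified folds for classifiers, for example).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Made-up regression data: y is roughly 3 * x plus a little noise
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(30, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=30)

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    # Fit on four folds, score on the fold that was held out
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.round(scores, 3))
```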

<p>There is also a `cross_val_predict` function that returns out-of-fold predictions rather than scores, which is very useful when evaluating models with cross-validation ( [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html) ).</p>
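
A minimal sketch of `cross_val_predict` on the `iris` data follows. The choice of estimator, `cv=5` and the use of `accuracy_score` to summarize the predictions are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
clf = DecisionTreeClassifier(max_depth=4, random_state=2)

# Every sample gets a prediction made by a model that did not see it during fitting
y_pred = cross_val_predict(clf, X, y, cv=5)

print(y_pred.shape)
print("out-of-fold accuracy: %.3f" % accuracy_score(y, y_pred))
```

Because there is one prediction per sample, the result can be fed into any metric or plotted against the true labels.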

## Learning Algorithms - Unsupervised Learning
>  Reminder:  In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the training set given to the learner is unlabeled, there is no error or reward signal to evaluate a potential solution. Basically, we are just finding a way to represent the data and get as much information from it as we can.

HEY!  Remember PCA from above?  PCA is actually considered unsupervised learning in the ML world.  We just put it up there because it's a good way to visualize data at the beginning of the ML process.

Let's revisit it in a little more detail using the `iris` dataset.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

### PCA revisited

In [None]:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()

# Subset data to have only sepal width (cm) and petal length (cm) for simplification
X = iris.data[:, 1:3]
print(iris.feature_names[1:3])

# Standardize each feature to zero mean and unit variance (try it by uncommenting):
# from sklearn import preprocessing
# X = preprocessing.scale(X)

# If n_components is an integer, just the top n components that explain
#   the most variance are kept (if it's a float between 0 and 1, it is the fraction of variance that must be explained)
pca = PCA(n_components = 2)
pca.fit(X)

print("% of variance attributed to components: "+ \
      ', '.join(['%.2f' % (x * 100) for x in pca.explained_variance_ratio_]))
print('\ncomponents of each feature:', pca.components_)

print(list(zip(pca.explained_variance_, pca.components_)))

The `pca.explained_variance_` is like the magnitude of a component's influence (the amount of variance it explains), and the `pca.components_` is like the direction of influence for each feature in each component.
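
To connect "magnitude" and "direction" to the underlying linear algebra, the sketch below computes the same kind of quantities with plain NumPy, as the eigenvalues and eigenvectors of the covariance matrix. The synthetic data and variable names are assumptions for illustration; the directions found this way can differ from `pca.components_` by a sign flip.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, stretched along the first axis
X_demo = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

# Center the data, then eigendecompose its covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest variance
order = np.argsort(eigenvalues)[::-1]
explained_variance = eigenvalues[order]      # magnitudes
components = eigenvectors[:, order].T        # directions, one row per component
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio.round(3))
print(components.round(3))
```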

<p style="text-align:right"><i>Code in next cell adapted from Jake VanderPlas's code [here](https://github.com/jakevdp/sklearn_pycon2015)</i></p>

In [None]:
# Plot the original data in X (before PCA)
plt.plot(X[:, 0], X[:, 1], 'o', alpha=0.5)

# Grab the component means to get the center point for plot below
means = pca.mean_

# Here we use the direction of the components in pca.components_
#  and the magnitude of the variance explained by that component in
#  pca.explained_variance_

# We plot the vector (magnitude and direction) of the components
#  on top of the original data in X
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    plt.plot([means[0], v[0]+means[0]], [means[1], v[1]+means[1]], '-k', lw=3)


# Axis limits
plt.xlim(0, max(X[:, 0])+3)
plt.ylim(0, max(X[:, 1])+3)

# Original feature labels of our data X
plt.xlabel(iris.feature_names[1])
plt.ylabel(iris.feature_names[2])

QUESTION:  In which direction in the data is the most variance explained?

Recall, in the ML I module: unsupervised models have a `fit()`, `transform()` and/or `fit_transform()` in `sklearn`.


If you want to both get a fit and new dataset with reduced dimensionality, which would you use below? (Fill in blank in code)

Modify this code in a new cell and fill in the blanks.

```python
# Get back to our 4D dataset
X, y = iris.data, iris.target

pca = PCA(n_components = 0.95) # keep 95% of variance

# What method is useful here (to combine two steps...)
X_trans = pca.___(X)

print(X.shape)
print(X_trans.shape)
```

In [None]:
# Try here

In [None]:
# Plot components
plt.scatter(X_trans[:, 0], X_trans[:, 1], c = iris.target, edgecolor='none', alpha=0.5,
           cmap = plt.get_cmap('spring', 10))  # plt.cm.get_cmap is gone in newer matplotlib
plt.ylabel('Component 2')
plt.xlabel('Component 1')

### Clustering
We will go over the most straightforward of clustering methods, **KMeans**.  KMeans finds cluster centers that are the mean of the points assigned to them.  Each point, in turn, belongs to the cluster whose center is closest to it.


> If you don't have the `ipywidgets` package installed, go ahead and install it now (for example, by running `!pip install ipywidgets` in a new cell).

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from ipywidgets import interact
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer releases
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target
pca = PCA(n_components = 2) # keep 2 components which explain most variance
X = pca.fit_transform(X)

X.shape

In [None]:
# I have to tell KMeans how many cluster centers I want
n_clusters  = 3

# for consistent results when running the methods below
random_state = 2

<p style="text-align:right"><i>Code in next cell adapted from Jake VanderPlas's code [here](https://github.com/jakevdp/sklearn_pycon2015)</i></p>

In [None]:
def _kmeans_step(frame=0, n_clusters=n_clusters):
    rng = np.random.RandomState(random_state)
    labels = np.zeros(X.shape[0])
    centers = rng.randn(n_clusters, 2)

    nsteps = frame // 3

    for i in range(nsteps + 1):
        old_centers = centers
        if i < nsteps or frame % 3 > 0:
            dist = euclidean_distances(X, centers)
            labels = dist.argmin(1)

        if i < nsteps or frame % 3 > 1:
            centers = np.array([X[labels == j].mean(0)
                                for j in range(n_clusters)])
            nans = np.isnan(centers)
            centers[nans] = old_centers[nans]


    # plot the data and cluster centers
    plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow',
                vmin=0, vmax=n_clusters - 1);
    plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o',
                c=np.arange(n_clusters),
                s=200, cmap='rainbow')
    plt.scatter(old_centers[:, 0], old_centers[:, 1], marker='o',
                c='black', s=50)

    # plot new centers if third frame
    if frame % 3 == 2:
        for i in range(n_clusters):
            plt.annotate('', centers[i], old_centers[i], 
                         arrowprops=dict(arrowstyle='->', linewidth=1))
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c=np.arange(n_clusters),
                    s=200, cmap='rainbow')
        plt.scatter(centers[:, 0], centers[:, 1], marker='o',
                    c='black', s=50)

    plt.xlim(-4, 5)
    plt.ylim(-2, 2)
    plt.ylabel('PC 2')
    plt.xlabel('PC 1')

    if frame % 3 == 1:
        plt.text(4.5, 1.7, "1. Reassign points to nearest centroid",
                 ha='right', va='top', size=8)
    elif frame % 3 == 2:
        plt.text(4.5, 1.7, "2. Update centroids to cluster means",
                 ha='right', va='top', size=8)
    else:
        plt.text(4.5, 1.7, "3. Updated",
                 ha='right', va='top', size=8)

KMeans employs an <i>Expectation-Maximization</i> approach, which works as follows: 

1. Guess cluster centers
2. Assign points to the nearest cluster center
3. Set cluster centers to the mean of their assigned points
4. Repeat steps 2-3 until converged
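
The steps above can be sketched in a few lines of NumPy. `kmeans_sketch` is a hypothetical helper written for this illustration, with made-up blob data; it is not scikit-learn's implementation, which uses smarter initialization and several restarts.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Hypothetical helper illustrating the guess / assign / update loop."""
    rng = np.random.default_rng(seed)
    # Guess: use randomly chosen data points as the initial centers
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(n_clusters)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two made-up, well-separated blobs
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
                    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))])

centers, labels = kmeans_sketch(X_demo, n_clusters=2)
print(np.round(centers, 1))
```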

In [None]:
min_clusters, max_clusters = 1, 6
# Tuples of (min, max) give sliders; lists would be turned into dropdown menus
interact(_kmeans_step, frame=(0, 20),
                    n_clusters=(min_clusters, max_clusters))

> <b>Warning</b>! There is absolutely no guarantee of recovering a ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization, and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue.<br>  --Taken directly from sklearn docs
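
Both difficulties can be explored with scikit-learn's own `KMeans`. In the sketch below, `n_init=10` asks for several restarts from different initial centers, and `inertia_` (the within-cluster sum of squared distances) is printed for a range of cluster counts. The range of `k` and the variable names are assumptions; inertia is only one rough guide for picking the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data

inertias = []
for k in range(1, 7):
    # n_init restarts the algorithm several times and keeps the best run
    km = KMeans(n_clusters=k, n_init=10, random_state=2).fit(X_iris)
    inertias.append(km.inertia_)
    print("k=%d  inertia=%.1f" % (k, km.inertia_))
```

Inertia tends to keep shrinking as `k` grows, so a lower value alone does not mean a better clustering.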

### Novelty detection aka anomaly detection
QUICK QUESTION:
What is the difference between outlier detection and anomaly detection?

Below we will use a one-class support vector machine classifier to decide if a point is anomalous or not given our original data. (The code was adapted from sklearn docs [here](http://scikit-learn.org/stable/auto_examples/svm/plot_oneclass.html#example-svm-plot-oneclass-py) for the `iris` dataset)

In [None]:
%matplotlib inline
from matplotlib import rcParams, font_manager
rcParams['figure.figsize'] = (14.0, 7.0)
fprop = font_manager.FontProperties(size=14)

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases


# Iris data (load it) - fill in blank
iris = ___()
X, y = iris.data, iris.target

# Taking only two columns...
labels = iris.feature_names[1:3]
X = X[:, 1:3]

# split into training and test sets - fill in blank
X_train, X_test, y_train, y_test = ___(X, y, test_size = 0.3, random_state = 0)

In [None]:
# make some outliers
X_weird = np.random.uniform(low=-2, high=9, size=(20, 2))

# fit the model - fill in blank
clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=1)  # random_state is no longer accepted by OneClassSVM
clf.___(X_train)

# predict labels
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_weird)


n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size

Don't worry too much about the plotting below.  It's really just so you can visualize the results of your One Class SVM.

In [None]:
# set up grids for plot and decision boundaries
xx, yy = np.meshgrid(np.linspace(-2, 9, 500), np.linspace(-2, 9, 500)) # 500x500
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# plot decision boundary and region within
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), 0, 7), cmap=plt.cm.Blues_r)
a = plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='red')
plt.contourf(xx, yy, Z, levels=[0, Z.max()], colors='orange')

# observation points
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='white')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='green')
c = plt.scatter(X_weird[:, 0], X_weird[:, 1], c='red')

# helpful info and basic plot settings
plt.title("Novelty Detection aka Anomaly Detection")
plt.axis('tight')
plt.xlim((-2, 9))
plt.ylim((-2, 9))
plt.ylabel(labels[1], fontsize = 14)
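# legend_elements() returns the contour line artists (a.collections was removed in newer matplotlib)
plt.legend([a.legend_elements()[0][0], b1, b2, c],
           ["learned frontier", "training observations",
            "new regular observations", "new abnormal observations"],
           loc="best",
           prop=fprop)
# Use the actual sample counts as denominators instead of hard-coded numbers
plt.xlabel(
    "%s\nerror train: %d/%d ; errors novel regular: %d/%d ; "
    "errors novel abnormal: %d/%d"
    % (labels[0], n_error_train, len(X_train), n_error_test, len(X_test),
       n_error_outliers, len(X_weird)), fontsize = 14)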

TRY changing the values of the parameters in the SVM classifier above, especially `gamma`.  More information on `gamma` and support vector machine classifiers [here](http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html).
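
As a starting point for that experiment, the sketch below refits the one-class SVM for a few `gamma` values and counts how many of the training points each model flags as anomalous. The particular `gamma` values and the variable names are assumptions chosen for illustration. In general, a larger `gamma` makes the RBF kernel more local, so the learned frontier tends to follow the training points more closely.

```python
from sklearn import svm
from sklearn.datasets import load_iris

# Same two columns as above: sepal width and petal length
X_two = load_iris().data[:, 1:3]

flagged = {}
for gamma in (0.01, 1, 100):
    clf = svm.OneClassSVM(nu=0.1, kernel="rbf", gamma=gamma).fit(X_two)
    # predict returns -1 for points the model considers anomalous
    flagged[gamma] = int((clf.predict(X_two) == -1).sum())
    print("gamma=%s: %d of %d training points flagged" % (gamma, flagged[gamma], len(X_two)))
```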

Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016