# Supervised Learning - Part II (Chapter 5)

This module focuses on a particular class of supervised machine learning: classification, where we have a finite number of choices to label an observation. 

Topics for this module:
* Cross-Validation
* Learning Curves
* Support Vector Machines
* Random Forest

In [None]:
import numpy as np
import matplotlib.pyplot as plt

We ended the last workbook by comparing various classification methods built into the Scikit-Learn toolbox, using the iris dataset.  Let's reproduce those plots using the wine dataset.

In [None]:
from sklearn import datasets # import standard datasets
wine = datasets.load_wine() # load wine data set

We also import the KNN classifier, decision tree, and support vector classification modules.

In [None]:
from sklearn import neighbors
from sklearn import tree
from sklearn import svm

We generate boxplots of accuracy by repeatedly splitting the data into training and test sets, fitting each classifier on the training set, and predicting on the test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
nn1 = neighbors.KNeighborsClassifier(n_neighbors = 1)
nn3 = neighbors.KNeighborsClassifier(n_neighbors = 3)
nn5 = neighbors.KNeighborsClassifier(n_neighbors = 5)
svc = svm.SVC()
dt = tree.DecisionTreeClassifier()

acc = np.zeros((20,5))

for i in range(20):
    # split the data
    x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.4)

    # train the classifier using the training data
    nn1.fit(x_train, y_train)
    nn3.fit(x_train, y_train)
    nn5.fit(x_train, y_train)
    svc.fit(x_train, y_train)
    dt.fit(x_train, y_train)

    # compute the prediction of the test set using the model
    yhat_nn1 = nn1.predict(x_test)
    yhat_nn3 = nn3.predict(x_test)
    yhat_nn5 = nn5.predict(x_test)
    yhat_svc = svc.predict(x_test)
    yhat_dt = dt.predict(x_test)

    acc[i][0] = accuracy_score(yhat_nn1,y_test)
    acc[i][1] = accuracy_score(yhat_nn3,y_test)
    acc[i][2] = accuracy_score(yhat_nn5,y_test)
    acc[i][3] = accuracy_score(yhat_svc,y_test)
    acc[i][4] = accuracy_score(yhat_dt,y_test)
    
    
# generate box plot
plt.boxplot(acc);
for i in range(5):
    # add jitter to plot data
    xderiv = (i+1)*np.ones(acc[:,i].shape)+(np.random.rand(20,)-0.5)*0.1
    plt.plot(xderiv,acc[:,i],'bo',alpha=0.3)
    
ax = plt.gca()
ax.set_xticklabels(['1-NN','3-NN','5-NN','SVM','Decision Tree'])
_ = plt.ylabel('Accuracy')


The decision tree appears to perform best, based on our simple cross-validation test.  Here, we ran several experiments in which we randomly split the data set into a training and a test set, fit each model on the training set, and predicted on the test set.  This is known as repeated random sub-sampling validation (or Monte Carlo cross-validation).  There are other modes of cross-validation:
* *leave-one-out*:  Given $N$ samples, the model is trained with $N-1$ samples and tested with the remaining one.  This is repeated $N$ times, once per held-out sample, and the results are averaged.
* *leave-p-out*:  Given $N$ samples, the model is trained with $N-p$ samples and tested with the remaining $p$ samples.  This is repeated $\binom{N}{p}$ times, and the results are averaged.  This approach is impractical for most choices of $N$ and $p$.
* *k-fold cross-validation*: the data is split into $k$ non-overlapping splits.  Use $k-1$ splits for training, and the remaining split for testing.  Repeat $k$ times, leaving one split out each time, then average the results.
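
The modes listed above can be tried with the splitter classes in `sklearn.model_selection`. The following is a minimal sketch rather than the procedure used elsewhere in this workbook: the choice of classifier (`max_depth=3`), the number of splits, and the `random_state` values are assumptions made only for illustration.

```python
from sklearn import datasets, tree
from sklearn.model_selection import (KFold, LeaveOneOut, ShuffleSplit,
                                     cross_val_score)

wine = datasets.load_wine()
clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)

# Each splitter implements one of the validation modes described above.
splitters = {
    'Monte Carlo (20 random splits)': ShuffleSplit(n_splits=20, test_size=0.4,
                                                   random_state=0),
    'leave-one-out': LeaveOneOut(),
    '5-fold': KFold(n_splits=5, shuffle=True, random_state=0),
}

mean_acc = {}
for name, cv in splitters.items():
    # cross_val_score refits a fresh copy of clf on every split
    scores = cross_val_score(clf, wine.data, wine.target, cv=cv)
    mean_acc[name] = scores.mean()
    print('%-32s %3d fits, mean accuracy %.3f' % (name, len(scores), scores.mean()))
```

Leave-one-out needs one fit per sample, which is affordable here only because the wine data set is small.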

# Decision Trees

Let's learn about decision trees (DTs) before switching to the errors associated with the learning task.  DTs are a popular method for various machine learning tasks because
* they are invariant under scaling and other monotonic transformations of feature values (i.e. normalization is not needed)
* they are fairly robust to irrelevant features, which are rarely selected for a split
* they can handle both numerical and categorical data
* the final outcome is easy to understand, interpret, and visualize.

DTs have various downsides, however:
* prone to over-fitting (overly-complex trees)
* unstable: small variations in the data might result in a completely different tree
* learning an optimal decision tree is NP-complete; hence, most practical algorithms rely on greedy heuristics

What does a tree look like?

In [None]:
dt = tree.DecisionTreeClassifier(max_depth=2)
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3)
dt.fit(x_train, y_train)

import graphviz 
dot_data = tree.export_graphviz(dt, out_file=None, 
                         feature_names=wine.feature_names,  
                         class_names=wine.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

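How to read this tree?
* top entry in each non-leaf box gives the condition being tested.
* gini: a measure of impurity: how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the set.
* samples: the number of training samples that reach the node.
* value: the distribution of those samples over the classes (per-class counts; some scikit-learn versions report per-class fractions instead).
* class: the majority class at the node, i.e. the predicted label for samples that end up there.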

## How is a tree built?

The general approach is to split a set of samples into subsets based on some attribute, and to repeat the process recursively on each subset.  The simplest algorithm is a top-down, greedy search through the space of possible decision trees.
First, one computes the *entropy* of the data set, a measure of the uncertainty in the data set,
$$ H(S) = \sum_{c \in C} -p(c) \log_2{p(c)}.$$
Here,
* $S$ is the current sample set for which entropy is being calculated
* $C$ is the set of classes in $S$
* $p(c)$ is the proportion of elements of $S$ that belong to class $c$.

For a binary classification problem:
* the entropy is zero if all samples are true, or if all samples are false.
* if half the samples are true and half are false, the entropy is one, its maximum value.
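
The two binary cases above can be checked numerically. This is a small sketch; the helper name `entropy` is introduced here for illustration and is not part of Scikit-Learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # sum of p*log2(1/p) is the same as sum of -p*log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))

h_pure = entropy([1, 1, 1, 1])   # all samples in one class
h_half = entropy([1, 1, 0, 0])   # half true, half false
print(h_pure, h_half)            # prints: 0.0 1.0
```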

The greedy algorithm proceeds as follows:
* compute entropy of parent
* for each feature/attribute
    * compute information gain:  Entropy(parent) - Weighted Sum of Entropy(Children).
* pick feature that gives largest information gain.

This is then repeated recursively for each leaf in the tree.  Wikipedia has a nice example for a toy data set with four attributes / 14 data points.  https://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain
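
One step of the greedy algorithm can be sketched as follows, assuming categorical features and a multi-way split on every distinct feature value. The toy data and the names `information_gain`, `windy`, and `outlook` are hypothetical and chosen only for illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(feature, labels):
    """Entropy(parent) minus the weighted sum of Entropy(children)."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    weighted_children = 0.0
    for v in np.unique(feature):
        child = labels[feature == v]            # samples sent to this child
        weighted_children += len(child) / len(labels) * entropy(child)
    return entropy(labels) - weighted_children

# Hypothetical toy data: 8 samples, two categorical features
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
features = {
    'windy':   np.array(['y', 'n', 'y', 'n', 'y', 'n', 'y', 'n']),
    'outlook': np.array(['s', 's', 's', 's', 'r', 'r', 'r', 'r']),
}

gains = {name: information_gain(f, labels) for name, f in features.items()}
best = max(gains, key=gains.get)   # feature with the largest information gain
print(gains, best)
```

Here `outlook` separates the two classes perfectly, so it is the feature the greedy step would pick, whereas splitting on `windy` leaves both children as mixed as the parent.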

Alternatively, Scikit-learn uses an approach known as CART (Classification and Regression Trees).  The main difference is that the Gini index, rather than entropy, is used by default as the criterion for evaluating splits in the data set.  It is beyond the scope of this course to explore CART and its variants.  Interested readers should look at the collection: "The Top Ten Algorithms in Data Mining", edited by Wu and Kumar. A PDF of the relevant chapter appears to presently be posted at: http://www.uta.edu/faculty/rcli/Teaching/math6310/materials/Ten.pdf#page=192
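
For reference, the Gini impurity of a set with class proportions $p(c)$ is $1 - \sum_{c} p(c)^2$. The sketch below computes it by hand and compares it with the impurity Scikit-Learn stores for the root node of a fitted tree; the helper name `gini` is introduced here for illustration, and the tree is fit on the full wine data set purely to keep the example short.

```python
import numpy as np
from sklearn import datasets, tree

def gini(labels):
    """Gini impurity 1 - sum_c p(c)^2 of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

wine = datasets.load_wine()
dt = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
dt.fit(wine.data, wine.target)

root_gini = gini(wine.target)
# tree_.impurity[0] is the impurity of the root node (all training samples)
print(root_gini, dt.tree_.impurity[0])
```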


# Learning Curves

Denote the training error (i.e. in-sample error) as $E_{in}$, i.e., the error in the model measured over all data in the *training* set.  Denote the testing error (i.e. out-of-sample error / generalization error) as $E_{out}$, i.e., the error expected on unseen data.  Some intuitive statements:
* $E_{out} \ge E_{in}$
* want $E_{in} \to 0$
* want $E_{out} \approx E_{in}$, i.e.,
$\quad  E_{in} \le E_{out} \le E_{in} + \Omega, \quad \text{with } \Omega \to 0,$ where $\Omega$ typically depends on the number of samples $N$, the complexity of the model, ...

A learning curve shows how the training and test errors vary as a function of a parameter of the ML problem, such as the number of training samples.  We'll use the decision tree classifier, which seems to work best for this problem.  Additionally, we will control the maximum depth of the tree, which in some sense controls the complexity of the model.

In [None]:
dt = tree.DecisionTreeClassifier(max_depth=1)

Ein = np.zeros((99,40))
Eout = np.zeros((99,40))

for nratio in range(99):

    # fraction of the data held out for testing; (nratio+1)% is used for training
    ratio = 1 - (nratio+1)/100.0
    for i in range(40):
        # split the data
        x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=ratio)

        # train the classifier using the training data
        dt.fit(x_train, y_train)

        # compute the prediction of the test set using the model
        yhat_train = dt.predict(x_train)
        yhat_test = dt.predict(x_test)

        Ein[nratio][i] = 1 - accuracy_score(yhat_train,y_train)
        Eout[nratio][i] = 1 - accuracy_score(yhat_test,y_test)
    
p1,=plt.plot(np.mean(Ein[:,:].T,axis=0),'pink')
p2,=plt.plot(np.mean(Eout[:,:].T,axis=0),'c')
fig = plt.gcf()
#fig.set_size_inches(12,5)
plt.xlabel('Percent of samples used for training')
plt.ylabel('Error rate')
_ = plt.legend([p1,p2],["Training Error, Depth = 1","Test Error, Depth = 1"])    


Observations:
* ? 
* ? 

Let's repeat with a more complicated model (increasing the permitted depth of the tree).

In [None]:
dt = tree.DecisionTreeClassifier(max_depth=2)

# precomputed from above execution block
p1,=plt.plot(np.mean(Ein[:,:].T,axis=0),'pink')
p2,=plt.plot(np.mean(Eout[:,:].T,axis=0),'c')

Ein = np.zeros((99,40))
Eout = np.zeros((99,40))

for nratio in range(99):

    # fraction of the data held out for testing; (nratio+1)% is used for training
    ratio = 1 - (nratio+1)/100.0
    for i in range(40):
        # split the data
        x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=ratio)

        # train the classifier using the training data
        dt.fit(x_train, y_train)

        # compute the prediction of the test set using the model
        yhat_train = dt.predict(x_train)
        yhat_test = dt.predict(x_test)

        Ein[nratio][i] = 1 - accuracy_score(yhat_train,y_train)
        Eout[nratio][i] = 1 - accuracy_score(yhat_test,y_test)
    
p3,=plt.plot(np.mean(Ein[:,:].T,axis=0),'red')
p4,=plt.plot(np.mean(Eout[:,:].T,axis=0),'blue')
fig = plt.gcf()
#fig.set_size_inches(12,5)
plt.xlabel('Percent of samples used for training')
plt.ylabel('Error rate')
_ = plt.legend([p1,p2,p3,p4],["Training Error, Depth = 1","Test Error, Depth = 1","Training Error, Depth = 2","Test Error, Depth = 2"])  

These graphs are not as instructive as those in your textbook, since we are working with the wine data set rather than the toy problem generated on page 79.  Some terms:
* the training and test errors often converge as the number of training samples grows.  The value they converge to is known as the bias.  Here the curves have not converged, which is often the case when there are insufficient samples.
* the difference between the bias and the test error is known as the variance.
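
Scikit-Learn also provides a `learning_curve` helper in `sklearn.model_selection` that produces the same kind of data as the manual loops in this section. The sketch below is an alternative under assumed settings (five training sizes, five-fold cross-validation, fixed `random_state`), not a reproduction of the plots above.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.model_selection import learning_curve

wine = datasets.load_wine()

# Scores are accuracies; one row per training size, one column per CV fold.
sizes, train_scores, test_scores = learning_curve(
    tree.DecisionTreeClassifier(max_depth=2, random_state=0),
    wine.data, wine.target,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, shuffle=True, random_state=0)

E_in = 1 - train_scores.mean(axis=1)    # mean training error per size
E_out = 1 - test_scores.mean(axis=1)    # mean test error per size
for n, e_in, e_out in zip(sizes, E_in, E_out):
    print('n_train = %3d   E_in = %.3f   E_out = %.3f' % (n, e_in, e_out))
```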


### Over-fitting

If one creates a learning curve for a fixed number of data samples, but increases the complexity of the model, one often observes the following behavior.  (We'll use the data from the textbook since we don't really have enough samples to see such a curve in the wine data set.)


In [None]:
from sklearn import metrics

MAXC=20
N=1000
NTEST=4000
ITERS=3

yhat_test=np.zeros((ITERS,MAXC,2))
yhat_train=np.zeros((ITERS,MAXC,2))

# Repeat ITERS times and average to get smoother curves
for i in range(ITERS):
    X = np.concatenate([1.25*np.random.randn(N,2),5+1.5*np.random.randn(N,2)]) 
    X = np.concatenate([X,[8,5]+1.5*np.random.randn(N,2)])
    y = np.concatenate([np.ones((N,1)),-np.ones((N,1))])
    y = np.concatenate([y,np.ones((N,1))])
    perm = np.random.permutation(y.size)
    X = X[perm,:]
    y = y[perm]

    X_test = np.concatenate([1.25*np.random.randn(NTEST,2),5+1.5*np.random.randn(NTEST,2)]) 
    X_test = np.concatenate([X_test,[8,5]+1.5*np.random.randn(NTEST,2)])
    y_test = np.concatenate([np.ones((NTEST,1)),-np.ones((NTEST,1))])
    y_test = np.concatenate([y_test,np.ones((NTEST,1))])

    j=0
    for C in range(1,MAXC+1):
        #Evaluate the model
        clf = tree.DecisionTreeClassifier(min_samples_leaf=1, max_depth=C)
        clf.fit(X,y.ravel())
        yhat_test[i,j,0] = 1. - metrics.accuracy_score(clf.predict(X_test), y_test.ravel())
        yhat_train[i,j,0] = 1. - metrics.accuracy_score(clf.predict(X), y.ravel())
        j=j+1

p1, = plt.plot(np.mean(yhat_test[:,:,0].T,axis=1),'r')
p2, = plt.plot(np.mean(yhat_train[:,:,0].T,axis=1),'b')
fig = plt.gcf()
fig.set_size_inches(12,5)
plt.xlabel('Complexity')
plt.ylabel('Error rate')
plt.legend([p1, p2], ["Testing error", "Training error"])
plt.savefig("learning_curve_4.png",dpi=300, bbox_inches='tight')

* Here, we see that as the complexity increases, the training error decreases.
* Above a certain level of complexity, the test error starts increasing.  This is known as **over-fitting**

Most models are parameterized by hyper-parameters.
* e.g., nearest neighbors: have to specify number of neighbors to use
* e.g., decision tree: have to specify depth

A good heuristic for selecting the model is to choose the value of the hyper-parameter that yields the smallest estimated test error, where the estimate is obtained by cross-validation.  To address over-fitting, the following approaches are also used:
* regularization: penalizing complex models
* ensemble techniques: e.g., bagging.  The idea is to generate several subsets of the training data by sampling randomly with replacement, train a model on each subset, and combine their predictions (e.g. by majority vote).
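
Bagging can be sketched by hand as below. This is a minimal illustration under assumed settings (25 trees, a 60/40 split, fixed seeds); the variable names are hypothetical, and no claim is made about how the bagged accuracy compares with a single tree on any particular split.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.4, random_state=0)

n_trees = 25
votes = np.zeros((n_trees, len(y_test)), dtype=int)
for t in range(n_trees):
    # bootstrap sample: draw len(y_train) indices with replacement
    idx = rng.randint(0, len(y_train), size=len(y_train))
    clf = tree.DecisionTreeClassifier(random_state=t)
    clf.fit(x_train[idx], y_train[idx])
    votes[t] = clf.predict(x_test)

# majority vote over the trees, separately for each test sample
n_classes = len(wine.target_names)
yhat_bag = np.array([np.bincount(col, minlength=n_classes).argmax()
                     for col in votes.T])
bag_acc = accuracy_score(y_test, yhat_bag)
print('Bagged accuracy: %.3f' % bag_acc)
```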

# Recap: general classification process

Since we often need to select the best hyper-parameters for our model, we now need to divide the data into three sets: training, validation and test sets.  
* training set: what we train the models on
* validation set: what we use to select a model, by choosing the one with the smallest estimated out-of-sample error
* testing set: used exclusively for assessing performance.  (never used for learning)

Practically, because we are now suggesting to split the data into three sets, the classifier is trained with a smaller fraction of the data.  Often, it is best to use a *nested* cross-validation.  We call it nested because there is an *inner* and an *outer* cross-validation: the inner cross-validation is used for model selection, i.e. finding the best hyper-parameters using only the training portion of each outer split, while the outer cross-validation estimates how well the selected model performs on data that played no part in the selection.
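
A nested cross-validation can be sketched with `GridSearchCV` as the inner loop and `cross_val_score` as the outer loop. The grid of depths, the number of folds, and the `random_state` values below are assumptions made for illustration.

```python
from sklearn import datasets, tree
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

wine = datasets.load_wine()

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # model selection
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

# Inner loop: choose max_depth using only the data it is given
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=0),
                      param_grid={'max_depth': list(range(2, 10))},
                      cv=inner_cv)

# Outer loop: each fold reruns the whole search on its training part
# and scores the selected model on its held-out part
outer_scores = cross_val_score(search, wine.data, wine.target, cv=outer_cv)
print('Nested CV accuracy: %.3f +/- %.3f' % (outer_scores.mean(), outer_scores.std()))
```

Because the search is repeated inside every outer fold, the selected depth may differ from fold to fold; the outer score describes the selection procedure as a whole rather than one particular depth.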

Here is a five-fold cross validation of the wine data set.  

In [None]:
from sklearn.model_selection import KFold

# Create a 5-fold cross-validation splitter
kf = KFold(n_splits=5, shuffle=True)
      
# Possible hyper-parameter values (maximum tree depth) to check for the decision tree
C = np.arange(2,10)

# one row per fold, one column per candidate depth
acc = np.zeros((5,len(C)))
i=0
for train_index, test_index in kf.split(wine.data):
    x_train, x_test = wine.data[train_index], wine.data[test_index]
    y_train, y_test = wine.target[train_index], wine.target[test_index]
    j=0
    for c in C:
        dt = tree.DecisionTreeClassifier(min_samples_leaf=1, max_depth=c)
        dt.fit(x_train,y_train)
        yhat = dt.predict(x_test)
        acc[i][j] = metrics.accuracy_score(yhat, y_test)
        j=j+1
    i=i+1
    
plt.boxplot(acc);
for i in range(len(C)):
    # add jitter to plot data
    xderiv = (i+1)*np.ones(acc[:,i].shape)+(np.random.rand(5,)-0.5)*0.1
    plt.plot(xderiv,acc[:,i],'ro',alpha=0.3)
# label each box with the depth it corresponds to
plt.xticks(np.arange(1,len(C)+1), C)

print('Mean accuracy: ' + str(np.mean(acc,axis = 0)))
print('Selected model index: ' + str(np.argmax(np.mean(acc,axis = 0))))
print('Complexity: ' + str(C[np.argmax(np.mean(acc,axis = 0))]))
plt.ylim((0.7,1.))
fig = plt.gcf()
fig.set_size_inches(12,5)
plt.xlabel('Complexity')
plt.ylabel('Accuracy')
plt.savefig("model_selection.png",dpi=300, bbox_inches='tight')

# Support Vector Machines

Many practitioners believe that support vector machines (SVMs) are the best off-the-shelf supervised learning algorithms.  I have insufficient evidence/experience to confirm or refute this claim.  What I can say is that, of the four researchers I interact with who use ML, half use SVMs and half use random forests (RF).

Section 5.7.2.1 gives one viewpoint on how to derive the optimization problem that is posed by SVM.  It was mathematically unsatisfying for me, and makes some implicit assumptions on the coefficients of the hyperplanes, $\vec{a}$.  

I was going to write up a set of notes, but found this nice presentation that was given at the AMIA 2009 meeting.  https://med.nyu.edu/chibi/sites/default/files/chibi/Final.pdf


# Random Forest

The random forest (RF) is a decision-tree-based ensemble algorithm in which many trees are generated, each constructed with a "randomly" chosen subset of the features available and with access to a random set of the training samples.  The idea is that each tree brings a different background / view of the information to the problem.  Pooling the decisions made by the collection of trees will, on average, be more accurate than relying on a single tree.
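
A short sketch with Scikit-Learn's `RandomForestClassifier` follows. The settings (100 trees, `max_features='sqrt'`, five-fold cross-validation, fixed seed) are assumptions for illustration. Note that in this implementation the random subset of features is drawn anew at each split rather than once per tree, while `bootstrap=True` gives each tree its own random sample of the training data.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = datasets.load_wine()

rf = RandomForestClassifier(n_estimators=100,     # number of trees in the forest
                            max_features='sqrt',  # features considered per split
                            bootstrap=True,       # resample training data per tree
                            random_state=0)

scores = cross_val_score(rf, wine.data, wine.target, cv=5)
print('Random forest accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```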