## EXERCISE: Decision trees

[Adapted from http://scikit-learn.org/stable/modules/tree.html]

Let's try a decision tree on Iris data.

### Train and view a tree

In [None]:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

iris = load_iris()
key=', '.join(['{}={}'.format(i,name) for i,name in enumerate(iris.target_names)])

# First let's create a train and test split
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.33,
                                                    random_state=5) # so we get the same results

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
# Let's fit a model
tree = DecisionTreeClassifier(max_depth=2)
_ = tree.fit(X_train, Y_train)

# Evaluate
print('Classification report ({}):\n'.format(key))
print(classification_report(Y_test, tree.predict(X_test)))

In [None]:
#From http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html
def get_code(tree, feature_names, target_names, spacer_base="    "):
    """Produce pseudo-code for decision tree.

    Args
    ----
    tree -- fitted scikit-learn DecisionTreeClassifier.
    feature_names -- list of feature names.
    target_names -- list of target (class) names.
    spacer_base -- used for spacing code (default: "    ").

    Notes
    -----
    based on http://stackoverflow.com/a/30104792.
    """
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    features  = [feature_names[i] for i in tree.tree_.feature]
    value = tree.tree_.value

    def recurse(left, right, threshold, features, node, depth):
        spacer = spacer_base * depth
        if threshold[node] != -2:  # -2 marks a leaf node in tree_.threshold
            print(spacer + "if ( " + features[node] + " <= " +
                  str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features,
                        left[node], depth+1)
            print(spacer + "}\n" + spacer + "else {")
            if right[node] != -1:
                recurse(left, right, threshold, features,
                        right[node], depth+1)
            print(spacer + "}")
        else:
            target = value[node]
            for i, v in zip(np.nonzero(target)[1],
                            target[np.nonzero(target)]):
                target_name = target_names[i]
                target_count = int(v)
                print(spacer + "return " + str(target_name) + \
                      " ( " + str(target_count) + " examples )")

    recurse(left, right, threshold, features, 0, 0)
    
print('Decision tree:\n')
get_code(tree, iris.feature_names, iris.target_names)
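
scikit-learn also ships a built-in text rendering, `sklearn.tree.export_text`, which may be a simpler alternative to the hand-written `get_code` helper above. The following is a minimal sketch, assuming scikit-learn 0.21 or later; the names `demo_tree` and `rules` and the `random_state=0` setting are made up here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Separate variable name so the notebook's own `tree` is left untouched
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(iris.data, iris.target)

# export_text returns the fitted rules as an indented plain-text string
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Unlike `get_code`, this output shows class indices rather than class names, so the `key` string defined earlier is still useful for reading it.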

## EXERCISE: Model selection on test data





### McNemar's test

McNemar's test is [recommended when we have a single test split](http://sci2s.ugr.es/keel/pdf/algorithm/articulo/dietterich1998.pdf).

Under H0, the two algorithms should have the same error rate.
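
For reference, the statistic that the `mcnemar` function in the next cell computes (the variant with continuity correction) can be written as below. The symbols follow the variable names in that function: $n_1$ counts the test instances that only the second classifier gets right, and $n_2$ counts those that only the first classifier gets right.

$$\chi^2 = \frac{\left(|n_1 - n_2| - 1\right)^2}{n_1 + n_2}$$

Under H0 this statistic is approximately $\chi^2$-distributed with 1 degree of freedom, so a large value (small p-value) is evidence that the error rates differ.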

In [None]:
from scipy.stats import chi2

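def mcnemar(x, y):
    """McNemar's test (with continuity correction) on paired 0/1 correctness arrays."""
    n1 = np.sum(x < y)  # x wrong, y right
    n2 = np.sum(x > y)  # x right, y wrong
    if n1 + n2 == 0:
        # The classifiers never disagree, so there is no evidence against H0
        return 0.0, 1.0
    stat = (np.abs(n1-n2)-1)**2 / (n1+n2)
    df = 1
    pval = chi2.sf(stat, df)
    return stat, pval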
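
To make the arithmetic concrete, here is a small worked example on made-up correctness vectors (the arrays `a_yn` and `b_yn` are hypothetical and unrelated to the Iris results). It repeats the same computation inline so that it can be run on its own.

```python
import numpy as np
from scipy.stats import chi2

# Toy correctness vectors (1 = correct, 0 = incorrect); the values are made up
a_yn = np.array([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
b_yn = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1])

n1 = int(np.sum(a_yn < b_yn))  # a wrong, b right
n2 = int(np.sum(a_yn > b_yn))  # a right, b wrong

# Same statistic as in mcnemar(): continuity-corrected, 1 degree of freedom
stat = (abs(n1 - n2) - 1) ** 2 / (n1 + n2)
pval = chi2.sf(stat, 1)
print('n1={}, n2={}, stat={:.3f}, p={:.4f}'.format(n1, n2, stat, pval))
```

Only the instances on which the two classifiers disagree enter the statistic. Instances that both get right or both get wrong carry no information about which one is better.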

### TODO Compare classifiers

- Choose the decision tree max_depth in [2..6], criterion in ['entropy', 'gini'] and splitter in ['best', 'random']. What are the best parameters? Print out all grid scores to sanity check the selection. Is there a unique best set of parameters?
- Use `np.array` to create `l_yn` and `t_yn` arrays showing, for logistic regression and the decision tree respectively, whether each test instance is predicted correctly (`1`) or incorrectly (`0`). Are the classifiers significantly different at p<=0.05 according to McNemar's test? (Use the logistic regression code from the previous week.)
- Which classifier is significantly better at p<=0.05 using a paired t-test? (Use the F-score measure.)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Perform grid search
param_grid = [
    {'max_depth': [2, 3, 4, 5, 6],
     'criterion': ['entropy', 'gini'],
     'splitter': ['best', 'random']}
]
tree = GridSearchCV(DecisionTreeClassifier(), param_grid)
tree.fit(X_train, Y_train)

# Print grid search results
print('Grid search mean and stdev:\n')
scoring = tree.cv_results_
for mean_score, std, params in zip(scoring['mean_test_score'],scoring['std_test_score'],scoring['params']):
    print("{:0.3f} (+/-{:0.03f}) for {}".format(
            mean_score, std * 2, params))
    

# Print best params
print('\nBest parameters:', tree.best_params_)

# Evaluate on held-out test
print('\nClassification report ({}):\n'.format(key))
print(classification_report(Y_test, tree.predict(X_test)))
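
One way to check whether the best parameter set is unique is to look at `rank_test_score` in `cv_results_`, where every parameter set tied for the best mean score gets rank 1. The sketch below is self-contained and re-runs the same grid; the names `grid` and `tied` and the `random_state=0` setting are assumptions added for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=5)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {'max_depth': [2, 3, 4, 5, 6],
     'criterion': ['entropy', 'gini'],
     'splitter': ['best', 'random']})
grid.fit(X_tr, y_tr)

results = grid.cv_results_
# Collect every parameter set whose mean CV score ties for first place
tied = [p for p, r in zip(results['params'], results['rank_test_score']) if r == 1]
print('{} parameter set(s) share the best mean CV score:'.format(len(tied)))
for p in tied:
    print(p)
```

When several sets tie, `best_params_` reports only the first of them, so the printed "best" choice is partly an artefact of grid order.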

In [None]:
logreg = LogisticRegression()
_ = logreg.fit(X_train, Y_train)
# Calculate whether each test prediction is correct
l_yn = np.array([int(p==t) for p,t in zip(logreg.predict(X_test), Y_test)])
t_yn = np.array([int(p==t) for p,t in zip(tree.predict(X_test), Y_test)])

# There's very little difference in this data set
print(l_yn)
print(t_yn)

# We cannot reject H0. Accuracy is different but not reliably so.
# Therefore, we can select either classifier, e.g.,
# decision tree for interpretability.
print('\nCan we reject H0?', 'Yes' if mcnemar(l_yn, t_yn)[1]<0.05 else 'No')

print('Classification report ({}):\n'.format(key))
print(classification_report(Y_test, logreg.predict(X_test)))
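
The paired t-test from the TODO list needs several paired scores per classifier, which a single test split does not provide. One possible approach, sketched below, is to compute macro-averaged F-scores over the same cross-validation folds for both classifiers and pass them to `scipy.stats.ttest_rel`. The choice of 10 folds, cross-validating on the training split, `max_iter=1000`, the fixed seeds and the names `l_f1` and `t_f1` are all assumptions made for this sketch.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, _, y_tr, _ = train_test_split(
    iris.data, iris.target, test_size=0.33, random_state=5)

# Reusing one seeded splitter gives both classifiers identical folds,
# which is what makes the per-fold scores paired.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)

l_f1 = cross_val_score(LogisticRegression(max_iter=1000), X_tr, y_tr,
                       cv=folds, scoring='f1_macro')
t_f1 = cross_val_score(DecisionTreeClassifier(max_depth=2, random_state=0),
                       X_tr, y_tr, cv=folds, scoring='f1_macro')

stat, pval = ttest_rel(l_f1, t_f1)
print('Mean macro F1: logistic regression {:.3f}, decision tree {:.3f}'.format(
    l_f1.mean(), t_f1.mean()))
print('Paired t-test: t={:.3f}, p={:.3f}'.format(stat, pval))
print('Significantly different at p<=0.05?', 'Yes' if pval <= 0.05 else 'No')
```

If the test is significant, the classifier with the higher mean F-score is the better one. Otherwise neither can be called better on this evidence.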

# From data to decisions


## EXERCISE: Ensembling classifiers

### Load and split data

scikit-learn provides a `train_test_split` function (in `sklearn.model_selection`). However, it has no function for a three-way split into training, development and held-out test data.

Let's create a three-way train/dev/test split of roughly 50/25/25 by calling `train_test_split` twice.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits

# load digits data
digits = load_digits()
X, y = digits.data, digits.target
X_td, X_test, y_td, y_test = train_test_split(X, y, test_size=0.25,
                                              random_state=5) # so we get the same results
X_train, X_dev, y_train, y_dev = train_test_split(X_td, y_td, test_size=0.33,
                                                  random_state=5) # so we get the same results
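
The second call uses `test_size=0.33` because the dev set is carved out of the remaining 75% of the data: a quarter of the whole is about a third of what is left. The sketch below wraps the two calls in a helper that does this rescaling explicitly. The function name `train_dev_test_split` and its parameters are hypothetical (this is not a scikit-learn function), and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_dev_test_split(X, y, dev_size=0.25, test_size=0.25, random_state=None):
    """Hypothetical helper: three-way split built from two train_test_split calls."""
    X_td, X_test, y_td, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    # dev_size is a fraction of the full data, so rescale it for the second split
    rel_dev = dev_size / (1.0 - test_size)
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_td, y_td, test_size=rel_dev, random_state=random_state)
    return X_train, X_dev, X_test, y_train, y_dev, y_test

# Synthetic data, used only to check the resulting proportions
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000) % 10

parts = train_dev_test_split(X_demo, y_demo, random_state=5)
X_tr_demo, X_dev_demo, X_te_demo, y_tr_demo, y_dev_demo, y_te_demo = parts
for name, part in [('train', y_tr_demo), ('dev', y_dev_demo), ('test', y_te_demo)]:
    print('{:5s} {:4d} ({:.1%})'.format(name, len(part), len(part) / len(y_demo)))
```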

### Plot error vs complexity for decision tree

In [None]:
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
import random

NUM_SAMPLES = 10
NUM_TRAIN_SETS = 10

def subsample(X, y, sample_size):
    xy_tuples = list(zip(X, y))
    xy_sample = [random.choice(xy_tuples) for _ in range(sample_size)]
    X_sample, y_sample = zip(*xy_sample)
    return X_sample, y_sample

def error(clf, X, y):
    "Calculate error as 1-accuracy"
    return 1-clf.score(X,y)

def bootstrap_error(clf, X_train, y_train, X_test, y_test, sample_size, num_samples=NUM_SAMPLES):
    train_errors = []
    test_errors = []
    for _ in range(num_samples):
        X_sample, y_sample = subsample(X_train, y_train, sample_size)
        clf.fit(X_sample, y_sample)
        train_errors.append(error(clf,X_sample,y_sample))
        test_errors.append(error(clf,X_test,y_test))
    train_error = sum(train_errors)/len(train_errors)
    test_error = sum(test_errors)/len(test_errors)
    return train_error, test_error

complexities = []
train_errors = []
test_errors = []
for max_depth in [2,4,8,16,32,None]:
    clf = DecisionTreeClassifier(max_depth=max_depth)
    sample_size = len(y_train)
    train_error, test_error = bootstrap_error(clf, X_train, y_train, X_dev, y_dev, sample_size)
    complexities.append(max_depth)
    train_errors.append(train_error)
    test_errors.append(test_error)
plt.plot(complexities, train_errors, c='b', label='Training error')
plt.plot(complexities, test_errors, c='r', label='Generalisation error')
plt.ylim(0,1)
plt.ylabel('Error')
plt.xlabel('Model complexity')
plt.title('Decision tree')
plt.legend()
plt.show()
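
The `subsample` helper in the cell above draws with replacement (one `random.choice` per draw), which is what makes `bootstrap_error` a bootstrap estimate. As a sketch of an alternative, `sklearn.utils.resample` performs the same kind of draw and accepts a seed. The toy arrays, the seed value and the variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.utils import resample

seed = 5  # arbitrary seed, chosen only for reproducibility

# Toy data: row i of X_demo is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Draw len(y_demo) rows with replacement, keeping X and y aligned
X_boot, y_boot = resample(X_demo, y_demo, replace=True,
                          n_samples=len(y_demo), random_state=seed)

unique_fraction = len(np.unique(y_boot)) / len(y_demo)
print('Fraction of distinct original rows in the sample: {:.2f}'.format(unique_fraction))
```

Because rows repeat, a bootstrap sample of the same size as the original contains only part of the distinct training instances, which is one reason training error on the sample can look optimistic.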

### TODO Assessing decision tree fit

- Does training or generalisation error level out first? Why?
- What is the best value of max_depth based on this plot?
- Why doesn't generalisation error increase on the right?

In [None]:
# 1 - Generalisation error levels out first.
#     This suggests that higher values of max_depth may lead to overfitting.

# 2 - max_depth=8. This gives the best generalisation error with lower 
#     model complexity and less risk of overfitting.

# 3 - The algorithm has other mechanisms to prevent overfitting.
#     And overfitting does not seem to hurt generalisation too much on this data.
#     Nevertheless, decision trees can overfit so use with caution.
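
One possible contributor to the flat right-hand side of the curve is that large `max_depth` values stop being binding: a tree stops splitting once its leaves are pure, so beyond some depth the limit no longer changes the fitted tree. The sketch below checks this with `get_depth()` (available in scikit-learn 0.21 and later). Fitting on the full digits data and fixing `random_state=0` are assumptions made to keep the example short and repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

depths = {}
for max_depth in [2, 4, 8, 16, 32, None]:
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(digits.data, digits.target)
    # get_depth() reports the depth the fitted tree actually reached
    depths[max_depth] = clf.get_depth()
    print('max_depth={!s:>4} -> actual depth {}'.format(max_depth, depths[max_depth]))
```

If the actual depth is the same for the largest settings, those settings produce effectively the same model, and their error rates should differ only through sampling noise.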

### Plot error vs complexity for random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

complexities = []
train_errors = []
test_errors = []
for n_estimators in [1,2,4,8,16,32,64]:
    clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=8)
    sample_size = len(y_train)
    train_error, test_error = bootstrap_error(clf, X_train, y_train, X_dev, y_dev, sample_size)
    complexities.append(n_estimators)
    train_errors.append(train_error)
    test_errors.append(test_error)
plt.plot(complexities, train_errors, c='b', label='Training error')
plt.plot(complexities, test_errors, c='r', label='Generalisation error')
plt.ylim(0,1)
plt.ylabel('Error')
plt.xlabel('Model complexity')
plt.title('Random forest')
plt.legend()
plt.show()

### Plot error vs number of training samples

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import random

def plot_error_curves(clf, X_train, y_train, X_test, y_test, num_train_sets=NUM_TRAIN_SETS, title=None):
    data_sizes = []
    train_errors = []
    test_errors = []
    for i in range(num_train_sets):
        sample_size = int(len(y_train) * (i+1)/num_train_sets)
        train_error, test_error = bootstrap_error(clf, X_train, y_train, X_test, y_test, sample_size)
        data_sizes.append(sample_size)
        train_errors.append(train_error)
        test_errors.append(test_error)
    plt.plot(data_sizes, train_errors, c='b', label='Training error')
    plt.plot(data_sizes, test_errors, c='r', label='Generalisation error')
    plt.ylim(0,1)
    plt.ylabel('Error')
    plt.xlabel('Number of training samples')
    if title:
        plt.title(title)
    plt.legend()
    plt.show()

# Note that we're passing dev data for estimating generalisation error here, not test data
dt = DecisionTreeClassifier(max_depth=8)
plot_error_curves(dt, X_train, y_train, X_dev, y_dev, title='Decision Tree')
rf = RandomForestClassifier(max_depth=8, n_estimators=16)
plot_error_curves(rf, X_train, y_train, X_dev, y_dev, title='Random Forest')

### TODO Comparing fit and data needed

- Which classifier would you use?
- Would it be useful to collect more training data?
- The decision tree has a larger spread between training and generalisation error. Why is this?
- Note we haven't yet used test data. When is it OK to use the held-out test data from our train/dev/test split?

In [None]:
# 1 - Random forest is clearly better (error rates under 10% vs 20% for decision tree)

# 2 - Yes, almost always. However, it looks like both classifiers are close to their asymptotes.
#     So the benefit might not be worth the cost.
#     The decision tree would benefit more from additional data.

# 3 - The decision tree suffers more from overfitting.
#     The random forest on this particular data has 0 training error.
#     This is a bit of a surprise as random forests tend to increase bias.
#     With high bias, we would expect underfitting which tends to be 
#     characterised by both high training and high generalisation error.
#     However, random forests generally also reduce variance enough
#     to cancel out any increase in bias.
#     Here we end up with a nice generalisation error plot that seems
#     to be close to its asymptote and not too different from the
#     training error.

# 4 - As little as possible. Ideally only once for our final 
#     generalisation error/accuracy calculation. 
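
As an illustration of that final step, the sketch below evaluates the chosen random forest once on the held-out test split. Pooling the train and dev data for the final fit, the `random_state=5` seed on the classifier and the names `final_clf` and `test_accuracy` are assumptions of this sketch, not something the exercise prescribes.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
# Same first split as above: 25% held out as test, the rest is train + dev
X_td, X_test, y_td, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=5)

# Model selection is finished, so train and dev data are pooled for the final fit
final_clf = RandomForestClassifier(max_depth=8, n_estimators=16, random_state=5)
final_clf.fit(X_td, y_td)

# The test split is touched exactly once, for the reported figure
test_accuracy = final_clf.score(X_test, y_test)
print('Held-out test accuracy: {:.3f}'.format(test_accuracy))
```

If this number then prompts further tuning, the test split has effectively become a second dev set and no longer gives an unbiased estimate.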