LAB 5 - Decision Trees
=========================

In this lab and the next two, we will explore different classification methods and how to compare their results. All three labs use the MNIST database of handwritten digits. This lab covers decision trees and random forests, Lab 6 covers neural networks, and Lab 7 covers Support Vector Machines.

Important note
---------------

For this lab and the next two, students must write a report (one report for the three labs) which will be used during the oral exam. This report should highlight the different methods used during the labs, but also how you validated each method and compared their results.

Lab objectives
---------------

* Classification with decision trees and random forests.
* Cross-validation and evaluation.

The Data
----

The dataset is the MNIST database of handwritten digits, from LeCun et al.: http://yann.lecun.com/exdb/mnist/
It contains a training set of 60,000 examples and a test set of 10,000 examples.

The dataset is split into a training set (for learning and cross-validation) and a test set (for evaluation of the model). Each 28x28-pixel image is flattened into a vector of length 784 ('images'), and the correct label is one-hot encoded in a vector of length 10 ('labels'). The following piece of code shows a sample of the training set with the correct labels.

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
from MNISTData import MNISTData

mnist = MNISTData(train_dir='MNIST_data', one_hot=True)

plt.gray()
for i in range(9):
    plt.subplot(3,3,i+1)
    plt.imshow(mnist.train['images'][i].reshape((28,28)))
    plt.title(mnist.train['labels'][i].argmax())
    plt.axis('off')
plt.show()
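
Because the labels are one-hot encoded, `argmax` is what recovers the digit in the plot titles above. The short sketch below shows the same conversion for a whole array at once. The `one_hot` array is made up for illustration and is not taken from the MNIST files. Integer class labels are usually the more convenient form for scikit-learn classifiers, since a 2-D label array is interpreted as a multi-output problem.

```python
import numpy as np

# Hypothetical one-hot labels for three samples (digits 3, 0 and 9).
one_hot = np.zeros((3, 10))
one_hot[0, 3] = 1
one_hot[1, 0] = 1
one_hot[2, 9] = 1

# argmax along axis 1 returns the position of the 1 in each row,
# i.e. the integer class of each sample.
class_ids = one_hot.argmax(axis=1)
print(class_ids)  # [3 0 9]
```

In the lab, the same call could be applied to `mnist.train['labels']` and `mnist.test['labels']`.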

1. Decision Trees
----

Create a simple Decision Tree classifier using scikit-learn and train it on the MNIST dataset. Use it to predict the classes of the test dataset. Evaluate the performance of the classifier.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [None]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
# --- your code here --- #
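
As a rough illustration of the fit / predict / evaluate cycle, here is a minimal sketch. It assumes scikit-learn's small bundled 8x8 digits dataset (`load_digits`) as a stand-in for MNIST, and a hold-out split made with `train_test_split`. Neither is part of the lab's data loader, so the numbers it prints say nothing about MNIST itself.

```python
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train a tree with default parameters, then predict the held-out samples.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy = fraction of test samples whose predicted class is correct.
test_accuracy = accuracy_score(y_test, y_pred)
print("test accuracy: %.3f" % test_accuracy)
```

Accuracy is only one possible measure. `sklearn.metrics.confusion_matrix(y_test, y_pred)` additionally shows which digits are confused with each other.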

**Use cross-validation** to improve the results of your classifier by changing its parameters.

In [None]:
import numpy as np

# Randomly split the data into n pieces for cross-validation
def randomsplit_data(x, y, n):
    l = x.shape[0]
    index = np.arange(0, l)
    np.random.shuffle(index)

    dataset = []
    for i in range(n):
        # Integer division: slice bounds must be ints in Python 3
        imin = i * l // n
        imax = (i + 1) * l // n
        dataset.append({'x': x[index[imin:imax]], 'y': y[index[imin:imax]]})

    return dataset

# Example use: split the data into 5 pieces
dataset = randomsplit_data(mnist.train['images'], mnist.train['labels'], 5)


# --- your code here --- #
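
One possible way to use such a split is sketched below: each piece serves once as the validation set while the remaining pieces form the training set, and the validation accuracies are averaged for each candidate parameter value. This is only a sketch under assumptions. It uses the small `load_digits` dataset instead of MNIST, builds its own folds with `np.array_split`, and tunes only `max_depth` over an arbitrary list of values.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_digits

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, y = load_digits(return_X_y=True)
n_folds = 5

# Shuffle the sample indices once, then cut them into n_folds pieces.
rng = np.random.RandomState(0)
folds = np.array_split(rng.permutation(X.shape[0]), n_folds)

def cross_val_accuracy(max_depth):
    """Mean validation accuracy over the folds for one max_depth value."""
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k])
        clf = tree.DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[val_idx], y[val_idx]))
    return float(np.mean(scores))

# Candidate depths are arbitrary illustration values; None means unlimited.
results = {depth: cross_val_accuracy(depth) for depth in (2, 5, 10, None)}
best_depth = max(results, key=results.get)
for depth, acc in results.items():
    print("max_depth=%s: %.3f" % (depth, acc))
print("best max_depth:", best_depth)
```

The test set is never touched in this loop. It stays reserved for the final evaluation of the selected parameters. scikit-learn also provides `sklearn.model_selection.cross_val_score`, which runs a similar loop in one call.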

**Evaluate your best classifier** on the test set. How does it compare to the classifier with default parameters?

In [None]:
# --- your code here --- #

2. Random Forests
----

Random Forest classifiers use multiple decision trees trained on sub-samples of the dataset, averaging the results so as to reduce over-fitting. 

Use scikit-learn to **create Random Forest classifiers** on the MNIST data. **Use cross-validation to test different parameters**, and **evaluate your best classifier on the test set**. Compare the results with the previous classifiers.

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from sklearn import ensemble

clf = ensemble.RandomForestClassifier()
# --- your code here --- #
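
To make the averaging idea concrete, the sketch below trains a single tree and a forest on the same split and compares their test accuracy. As before, it assumes the small `load_digits` dataset as a stand-in for MNIST. The value `n_estimators=100` is an arbitrary illustration, not a recommended setting.

```python
from sklearn import ensemble, tree
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn, not MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One fully grown tree versus an ensemble of 100 trees.
single_tree = tree.DecisionTreeClassifier(random_state=0)
single_tree.fit(X_train, y_train)
forest = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# score() returns the mean accuracy on the given samples.
tree_acc = single_tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print("single tree: %.3f, forest: %.3f" % (tree_acc, forest_acc))
```

The fitted trees are available in `forest.estimators_`, which can help when inspecting how individual trees differ from the ensemble prediction.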