## Training a decision tree

`sklearn.datasets.load_digits` loads a dataset of 8×8 grayscale images of handwritten digits (0 through 9).
In this assignment, you will train a decision tree classifier with sklearn and tune its parameters to improve accuracy.

In [1]:
# Run the following code to get your training data and test data
seed = 20190327
import sklearn.datasets
from sklearn.model_selection import train_test_split
digits = sklearn.datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data,
                                                    digits.target,
                                                    test_size=0.2,
                                                    random_state=seed)
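
As a quick sanity check on the split, the sketch below reloads the dataset and prints the array shapes. The variable name `digits` is only illustrative; the rest mirrors the cell above.

```python
import sklearn.datasets
from sklearn.model_selection import train_test_split

seed = 20190327
digits = sklearn.datasets.load_digits()

# Each 8x8 image is flattened into a row of 64 pixel intensities.
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)

# 20% of the samples are held out for testing.
print(X_train.shape, X_test.shape)
```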

### In this assignment, you are required to:

1. Train a model and test its accuracy

    ***Note***: Use `random_state=seed` as an argument of the model so as to get consistent results.
    
2. Tune the parameters to get better performance

    ***Note***: To get full marks, show how you chose the best parameters rather than just reporting what they are.

In [2]:
# 1. import model from sklearn
# Your code here
from sklearn import tree
clf = tree.DecisionTreeClassifier()

In [3]:
# 2. train your model with X_train and y_train
# Your code here
clf = clf.fit(X_train, y_train)

In [4]:
# 3. test your performance on X_test and y_test
# You can use accuracy_score to get the accuracy of your model. You may also compute the score manually.
# Your code here
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report
# y_predict = clf.predict(X_test)
print("The accuracy on this set is: ", '{0:.1f}%'.format(100 * clf.score(X_test, y_test)))
# print(classification_report(y_test, clf.predict(X_test)))

The accuracy on this set is:  86.4%
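
The cell above mentions that the score can also be computed manually. A minimal sketch of that calculation is shown below, assuming the same split as before; `manual_accuracy` is an illustrative name, and `random_state=seed` is passed to the classifier as the assignment note asks, so the number printed here may differ from the recorded output above.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

seed = 20190327
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)

clf = tree.DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy is the fraction of test samples whose predicted label is correct.
manual_accuracy = np.mean(y_pred == y_test)

# accuracy_score and clf.score compute the same quantity.
library_accuracy = accuracy_score(y_test, y_pred)
model_score = clf.score(X_test, y_test)

print("{0:.1f}%".format(100 * manual_accuracy))
```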


A decision tree model has several parameters to tune (e.g., `max_depth`, `max_features`, `max_leaf_nodes`, `min_samples_leaf`, `min_samples_split`). Try to tune your model by choosing values for one to three parameters using cross-validation. For example:

In [5]:
# 4. Try different max_depth and pick the best one
# Your code here
accu = {}
for i in range(1,30):
    clf = tree.DecisionTreeClassifier(max_depth = i)
    clf = clf.fit(X_train, y_train)
    accu[i] = clf.score(X_test, y_test)
    
max_key = max(accu, key=accu.get)
print("The best model has a depth of " + str(max_key) +  ". Its accuracy is " + '{0:.1f}%'.format(100 * accu[max_key]))

The best model has a depth of 18. Its accuracy is 86.7%
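
The loop above selects `max_depth` by its score on `X_test`, which means the test set is no longer an unbiased check of the final model. A minimal sketch of the cross-validated alternative follows, assuming 5-fold cross-validation on the training data only. The names `cv_accuracy`, `best_depth`, and `final_clf` are illustrative and not part of the original notebook.

```python
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split

seed = 20190327
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)

cv_accuracy = {}
for depth in range(1, 30):
    clf = tree.DecisionTreeClassifier(max_depth=depth, random_state=seed)
    # Mean accuracy over 5 folds of the training data; X_test is not touched.
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_accuracy[depth] = scores.mean()

best_depth = max(cv_accuracy, key=cv_accuracy.get)

# Refit on the full training set and evaluate once on the held-out test set.
final_clf = tree.DecisionTreeClassifier(max_depth=best_depth, random_state=seed)
final_clf.fit(X_train, y_train)
test_accuracy = final_clf.score(X_test, y_test)

print("best max_depth:", best_depth)
print("cross-validated accuracy: {0:.1f}%".format(100 * cv_accuracy[best_depth]))
print("test accuracy: {0:.1f}%".format(100 * test_accuracy))
```

The cross-validated accuracy is what guides the choice of depth; the test accuracy is reported only once, for the chosen model.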


In [6]:
%%time
# 4.1 You may choose more parameters to tune
# Tune max_depth and max_features together
accu = {}
for i in range(1,30):
    for j in range(1,64):
        clf = tree.DecisionTreeClassifier(max_depth = i, max_features = j)
        clf = clf.fit(X_train, y_train)
        # key each score by its (max_depth, max_features) pair
        accu[(i, j)] = clf.score(X_test, y_test)

max_key = max(accu, key=accu.get)
print("The best model has a depth of " + str(max_key[0]) + 
      " with " + str(max_key[1]) + " features. Its accuracy is " + '{0:.1f}%'.format(100 * accu[max_key]))

The best model has a depth of 21 with 43 features. Its accuracy is 90.3%
Wall time: 11 s
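
When `max_features` is smaller than the number of features, the tree considers a random subset of features at each split, so the accuracies collected by a loop like the one above can change from run to run unless `random_state` is fixed. The sketch below illustrates this under stated assumptions: the helper `fit_predict` is hypothetical, and the parameter values are simply the ones reported above.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

seed = 20190327
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)


def fit_predict(random_state):
    """Fit a tree with the given random_state and return its test predictions."""
    clf = tree.DecisionTreeClassifier(
        max_depth=21, max_features=43, random_state=random_state
    )
    return clf.fit(X_train, y_train).predict(X_test)


# The same random_state always produces the same tree and predictions.
same_seed_match = np.array_equal(fit_predict(seed), fit_predict(seed))
print(same_seed_match)  # True

# Different random states can give noticeably different test accuracies.
accuracies = [np.mean(fit_predict(s) == y_test) for s in range(5)]
print(["{0:.1f}%".format(100 * a) for a in accuracies])
```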


In [7]:
%%time
# Tune max_depth and max_leaf_nodes together
accu = {}
for i in range(1,30):
    for j in range(2,100):
        clf = tree.DecisionTreeClassifier(max_depth = i, max_leaf_nodes = j)
        clf = clf.fit(X_train, y_train)
        # key each score by its (max_depth, max_leaf_nodes) pair
        accu[(i, j)] = clf.score(X_test, y_test)

max_key = max(accu, key=accu.get)
print("The best model has a depth of " + str(max_key[0]) + 
      " with max_leaf_nodes of " + str(max_key[1]) + ". Its accuracy is " + '{0:.1f}%'.format(100 * accu[max_key]))

The best model has a depth of 14 with max_leaf_nodes of 95. Its accuracy is 86.7%
Wall time: 26.6 s


In [8]:
%%time
# Tune max_features and max_leaf_nodes together
accu = {}
for i in range(1,64):
    for j in range(2,100):
        clf = tree.DecisionTreeClassifier(max_features = i, max_leaf_nodes = j)
        clf = clf.fit(X_train, y_train)
        # key each score by its (max_features, max_leaf_nodes) pair
        accu[(i, j)] = clf.score(X_test, y_test)

max_key = max(accu, key=accu.get)
print("The best model has " + str(max_key[0]) + 
      " features with max_leaf_nodes of " + str(max_key[1]) + ". Its accuracy is " + '{0:.1f}%'.format(100 * accu[max_key]))

The best model has 45 features with max_leaf_nodes of 92. Its accuracy is 90.0%
Wall time: 35.6 s
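
The nested loops in the cells above can also be expressed with `sklearn.model_selection.GridSearchCV`, which runs the cross-validation the assignment asks for and keeps the test set out of the selection. This is a sketch rather than the notebook's method: the coarse grid values below are assumptions chosen to keep the run short.

```python
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split

seed = 20190327
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)

# Coarse, illustrative grid; a finer grid costs proportionally more fits.
param_grid = {
    "max_depth": [5, 10, 15, 20],
    "max_features": [16, 32, 48, 64],
}

search = GridSearchCV(
    tree.DecisionTreeClassifier(random_state=seed),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {0:.1f}%".format(100 * search.best_score_))

# search.score uses the best estimator, refit on all of X_train.
test_accuracy = search.score(X_test, y_test)
print("test accuracy: {0:.1f}%".format(100 * test_accuracy))
```

The full table of per-combination scores is available in `search.cv_results_`, which is one way to show how the best parameters were chosen.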


In [9]:
%%time
# Tune max_depth, max_features and max_leaf_nodes together
accu = {}
for i in range(1,30):
    for j in range(1,64):
        for k in range(2,100):
            clf = tree.DecisionTreeClassifier(max_depth = i, max_features = j, max_leaf_nodes = k)
            clf = clf.fit(X_train, y_train)
            # key each score by its (max_depth, max_features, max_leaf_nodes) triple
            accu[(i, j, k)] = clf.score(X_test, y_test)
    
max_key = max(accu, key=accu.get)

Wall time: 14min 52s
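
The exhaustive search above fits 29 × 63 × 98 = 179,046 trees. One cheaper option, sketched here as an assumption rather than as part of the original solution, is `RandomizedSearchCV`, which samples a fixed number of parameter combinations and scores each with cross-validation. The choice of `n_iter=50` is arbitrary.

```python
from scipy.stats import randint
from sklearn import tree
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV, train_test_split

seed = 20190327
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=seed
)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "max_depth": randint(1, 30),
    "max_features": randint(1, 65),
    "max_leaf_nodes": randint(2, 100),
}

search = RandomizedSearchCV(
    tree.DecisionTreeClassifier(random_state=seed),
    param_distributions,
    n_iter=50,           # number of sampled combinations (arbitrary choice)
    cv=5,
    random_state=seed,   # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy: {0:.1f}%".format(100 * search.best_score_))
```

With 50 candidates and 5 folds this fits 250 trees plus one final refit, at the cost of possibly missing the best combination in the full grid.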


In [11]:
# 5. Show your best result
# Your code here
print("The best model has a depth of " + str(max_key[0]) + 
      ", with " + str(max_key[1]) + " features and max_leaf_nodes of " + str(max_key[2]) + 
      ". Its accuracy is " + '{0:.1f}%'.format(100*accu[max_key]))

The best model has a depth of 9, with 39 features and max_leaf_nodes of 86. Its accuracy is 91.1%
