## In-class notebook: Decision tree visualization and hyperparameter tuning

In this notebook, we will learn how to visualize a trained decision tree classifier. We will also manually tune the `hyperparameters` of the tree and visualize the results of that tuning. 

First, run the cell below, which imports the libraries required for this exercise. We will use a Python package called pydotplus, so make sure it is installed in your Python distribution. You will also need to download the Graphviz software from https://graphviz.gitlab.io/download/.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz

import pydotplus

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


To make sure the Graphviz executables are on your PATH, run the following cell. Replace the value assigned to `graphviz_path` with the actual path to your Graphviz `bin` directory.

In [None]:
import os
graphviz_path = 'C:/Program Files (x86)/Graphviz2.38/bin/'
os.environ["PATH"] += os.pathsep + graphviz_path
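A quick way to confirm the PATH change took effect is to look for Graphviz's `dot` executable. The sketch below uses only the standard library; the helper name `find_dot_executable` is made up for illustration and is not part of Graphviz or pydotplus.

```python
import shutil


def find_dot_executable():
    """Return the full path of Graphviz's `dot` binary, or None if it is not on PATH."""
    return shutil.which("dot")


dot_path = find_dot_executable()
if dot_path is None:
    print("Graphviz 'dot' was not found on PATH; double-check graphviz_path.")
else:
    print("Graphviz 'dot' found at:", dot_path)
```

If `dot` is not found, the rendering cells that call Graphviz are likely to fail, so it is worth fixing the path before moving on.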

In [None]:
df = pd.read_csv('data/residency.csv')
df = df.drop('Unnamed: 0',axis=1)
y = df['Residency']
X= df.drop('Residency',axis=1)
df.head()

In [None]:
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X,y)

In [None]:
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

In [None]:
# Another way to draw the tree: plot_tree renders it with matplotlib.
# The cells after this one use tree.export_graphviz, which follows a similar process.
from sklearn import tree
tree.plot_tree(clf.fit(X, y))

In [None]:
# `preprocessor` and `categorical_features` are not defined in this notebook, so the
# one-hot feature-name lookup is commented out and the feature names are listed by hand.
# new_cat_features = preprocessor.transformers_[1][1]['onehot']\
#                          .get_feature_names(categorical_features)
# dot_data = tree.export_graphviz(clf, out_file=None,feature_names=list(numeric_features)+list(new_cat_features),class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)
dot_data = tree.export_graphviz(clf, out_file=None,feature_names=['Age','Salary','Degree'],class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)

In [None]:
import graphviz
graph = graphviz.Source(dot_data)


In [None]:
graph

In [None]:
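# `graph` is now a graphviz.Source (not a pydotplus graph), so render the PNG bytes with pipe()
Image(graph.pipe(format='png'))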


In [None]:
df = pd.read_csv("data/Heart_cleaned.csv")
df = df.drop('Unnamed: 0',axis=1)

y= df['AHD']
X = df.drop('AHD',axis=1)

In [None]:
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X,y)

In [None]:
#new_cat_features=preprocessor.transformers_[1][1]['onehot']\
                         #.get_feature_names(categorical_features)
    
new_cat_features = list(X.columns)
dot_data = tree.export_graphviz(clf, out_file=None,feature_names=new_cat_features,class_names=['No','Yes'],filled=True, rounded=True,special_characters=True)

In [None]:
import graphviz
graph = graphviz.Source(dot_data)


In [None]:
graph

## Exercises: Manual and automatic hyperparameter tuning

### 1. Grow a decision tree classifier, change its options one at a time, and visualize the tree to see the effect
- 1.1 `max_depth`
- 1.2 `min_samples_split`
- 1.3 `min_samples_leaf`
- 1.4 `max_features`
- 1.5 `min_impurity_decrease`    
See the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) for details.
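One way to see what each option does before drawing anything is to compare the depth and leaf count of the fitted trees. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`, and the specific values (3, 20, 10, 4, 0.01) are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# One setting per hyperparameter from the list; values are arbitrary demo choices.
settings = {
    "default": {},
    "max_depth=3": {"max_depth": 3},
    "min_samples_split=20": {"min_samples_split": 20},
    "min_samples_leaf=10": {"min_samples_leaf": 10},
    "max_features=4": {"max_features": 4},
    "min_impurity_decrease=0.01": {"min_impurity_decrease": 0.01},
}

sizes = {}
for name, params in settings.items():
    model = DecisionTreeClassifier(random_state=0, **params).fit(X_demo, y_demo)
    # Record (depth, number of leaves) so the trees can be compared at a glance.
    sizes[name] = (model.get_depth(), model.get_n_leaves())
    print(f"{name:28s} depth={sizes[name][0]:2d} leaves={sizes[name][1]:3d}")
```

Each fitted `model` can then be passed to `export_graphviz` or `plot_tree` to inspect how the structure changed.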
                   

In [None]:
clf = DecisionTreeClassifier()

In [None]:
clf.get_params() #check the default options

In [None]:
# max_depth is None by default, and the visualization above shows the tree growing deeper than 10 levels.
# So we can pick a max_depth smaller than 10; for example, let's pick 5.
# Note that the example numbers here are for demonstration purposes and are not necessarily the best choice.
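A minimal sketch of the `max_depth=5` idea follows. It assumes synthetic data from `make_classification` in place of the notebook's `X` and `y`, and the variable names (`full_tree`, `shallow_tree`, `shallow_dot`) are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data (an assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# A fully grown tree versus a tree stopped early at depth 5.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)

print("full tree depth:   ", full_tree.get_depth())
print("shallow tree depth:", shallow_tree.get_depth())

# Training accuracy only; it says nothing about performance on unseen data.
print("full tree training accuracy:   ", full_tree.score(X_demo, y_demo))
print("shallow tree training accuracy:", shallow_tree.score(X_demo, y_demo))

# DOT source for the shallow tree; pass it to graphviz.Source(...) to render it.
shallow_dot = export_graphviz(shallow_tree, out_file=None, filled=True, rounded=True)
```

The shallower tree is usually easier to read when rendered, at the cost of a lower training accuracy; whether it generalizes better is what the cross-validation in Exercise 2 is meant to check.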


### 2. Pick a performance metric (for classification) and optimize those tuning parameters. Does a tree perform better when fully grown or early stopped using those parameters?

In [None]:
# Specify the parameter space (max_depth, min_samples_split, min_samples_leaf, max_features)
param_space={
'max_depth' : [3, 5, 7, 9, None],
'min_samples_split' : [2, 3, 5],
'min_samples_leaf' : [1, 2, 3, 5],
'max_features' : [4, 8, None]}

In [None]:
# sklearn provides GridSearchCV, which performs a grid search together with cross-validation.
# See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html


# To use a different type of cv (e.g. stratified, which keeps the class proportions similar in every fold), construct a cv object and pass it in.
# See https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold
# and https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold
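One possible way to put these pieces together is sketched below: a `GridSearchCV` over the same parameter space, with a `StratifiedKFold` object passed as `cv`. The data is a synthetic stand-in (an assumption) with 10 features so that `max_features` values of 4 and 8 are valid, and accuracy is just one possible choice of metric.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

param_space = {
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2, 3, 5],
    "max_features": [4, 8, None],
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=param_space,
    scoring="accuracy",  # the chosen metric; other scorers can be swapped in
    cv=cv,
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)

# Baseline: a fully grown tree (default settings) scored on the same folds.
full_tree_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring="accuracy"
).mean()
print("fully grown tree accuracy:", full_tree_score)
```

Comparing `search.best_score_` with `full_tree_score` gives one answer to the exercise question of whether early stopping helps. Because the default settings are themselves part of the grid, the best score cannot be lower than the fully grown baseline on these folds.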