# Decision Trees
In this notebook we will learn to build a real Decision Tree classifier, based on the training data generated by the previous notebook. 

As the name suggests, decision trees are tree-like structures in which:
- The **nodes** represent the features on which the decision must be based
- The **branches** represent the values of the feature from which they derive
- The **leaves** represent the decisions
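
As a minimal sketch of this structure, a tree can be written as nested dictionaries and walked from the root down to a leaf. The features `A` and `B`, the values `food` and `empty`, and the tree itself are made up for illustration; they are not taken from the dataset used in this notebook.

```python
# Hypothetical toy tree: inner nodes test a feature, branches are feature values,
# leaves are decisions (plain strings).
toy_tree = {
    'feature': 'A',
    'branches': {
        'food': 'UP',
        'empty': {
            'feature': 'B',
            'branches': {'food': 'EAST', 'empty': 'DOWN'},
        },
    },
}


def decide(node, sample):
    """Follow the branch matching the sample's value at each node until a leaf is reached."""
    while isinstance(node, dict):
        value = sample[node['feature']]
        node = node['branches'][value]
    return node


print(decide(toy_tree, {'A': 'empty', 'B': 'food'}))
```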

So let's start!

## 1. Load the knowledge base
First, load the knowledge base, i.e. the training data contained in one of the files generated in the previous notebook. Set `m`, `N` and `num_of_matches` to select the right file.

To do this:

In [None]:
import pandas

# These parameters must be set to load the correct training set

m = 1
N = 10
num_of_matches = 10

# The following lines build the column names used to display the dataset: you can skip them and go on

feature_name = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
class_name = ['UP', 'DOWN', 'EAST', 'WEST']
if m == 2:
    others = ['I', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
    for i in others:
        feature_name.append(i)
        
feature_name.append('MOVES')
################################################################################################################
    
path = 'output/train_set_m{}/num_of_matches_{}.txt'.format(m, num_of_matches)
# The separator is passed by keyword: recent pandas versions accept only the path positionally
dataset = pandas.read_csv(path, sep=',', header=1, names=feature_name, index_col=False)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values



print("Dataset: " + path + '\n')
print(dataset)
print("\nx:")
print(X)
print("\ny:")
print(y)

#### Stranger Things!

Pay attention to the size of your training set! It contains many more moves than were actually generated during the games. Why?

The reason is symmetry: with a fixed frame of reference, every state has 3 other equivalent states, obtained by rotating the configuration by 90 degrees up to three times. Exploiting these symmetries multiplies the number of examples in the training set. Not bad, right?
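
The idea can be sketched as follows. This is only an illustration under assumptions: the 3x3 grid layout, the value `5` marking an interesting cell, and the helper `augment_with_rotations` are hypothetical, and the previous notebook may encode states differently. Each counterclockwise rotation of the grid must be paired with the same rotation of the move label.

```python
import numpy as np

# Rotating the world 90 degrees counterclockwise turns UP into WEST, WEST into DOWN, and so on.
ROTATE_MOVE_CCW = {'UP': 'WEST', 'WEST': 'DOWN', 'DOWN': 'EAST', 'EAST': 'UP'}


def augment_with_rotations(grid, move):
    """Return the original (grid, move) example plus its 3 rotated equivalents."""
    examples = []
    for _ in range(4):
        examples.append((grid.copy(), move))
        grid = np.rot90(grid)          # np.rot90 rotates counterclockwise by default
        move = ROTATE_MOVE_CCW[move]
    return examples


# Hypothetical state: something interesting (5) directly above the ant, so the move is UP.
grid = np.array([[0, 5, 0],
                 [0, 0, 0],
                 [0, 0, 0]])
examples = augment_with_rotations(grid, 'UP')
for rotated_grid, rotated_move in examples:
    print(rotated_move, rotated_grid.flatten().tolist())
```

One recorded move yields four training examples, which is why the training set is larger than the number of moves actually played.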



The images below show the ant's visual field for each orientation and explain the feature names used in the dataset for each value of the parameter m:
- $m = 1$
<img src="images/m1.png"  height="500" width="500">
- $m = 2$
<img src="images/m2.png"  height="500" width="500">

## 2.  Split your data!
Next, divide your data into a **training set** and a **test set**, so that the classifier can later be evaluated on examples it has not seen during training.

To do this, invoke these simple commands:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# print the shapes of the new X objects
print("\nTraining set dimensions (X_train):")
print(X_train.shape)
print("\nTest set dimensions (X_test):")
print(X_test.shape)

# print the shapes of the new y objects
print("\nTraining set dimensions (y_train):")
print(y_train.shape)
print("\nTest set dimensions (y_test):")
print(y_test.shape)


## 3.  Build the Classifier

Now we are ready to build our classifier: use the *fit* method to train it on the training data.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=2, random_state=4)
tree_clf.fit(X_train, y_train)

#### Notes
- The default values for the parameters controlling the size of the trees (e.g. `max_depth`, `min_samples_leaf`, etc.) lead to fully grown and unpruned trees, which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.
- The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and `max_features=n_features`, if the improvement of the criterion is identical for several splits enumerated during the search for the best split. To obtain deterministic behaviour during fitting, `random_state` has to be fixed.
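
The effect of the size parameters can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption; it is not the ant dataset), and the values `max_depth=3` and `min_samples_leaf=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 8 features and 4 classes, chosen only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=4,
                                     n_classes=4, random_state=4)

# Default size parameters: the tree grows until its leaves are pure
full_tree = DecisionTreeClassifier(criterion='entropy', random_state=4)
full_tree.fit(X_demo, y_demo)

# Constrained tree: limited depth and a minimum number of samples per leaf
small_tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                                    min_samples_leaf=5, random_state=4)
small_tree.fit(X_demo, y_demo)

print("full tree:  depth =", full_tree.get_depth(), " leaves =", full_tree.get_n_leaves())
print("small tree: depth =", small_tree.get_depth(), " leaves =", small_tree.get_n_leaves())
```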

## 3.  Evaluate the Classifier
This phase is very important and allows us to evaluate the model based on some standard metrics, such as **accuracy** and **confusion matrix**.
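
The two metrics are closely related: accuracy is the sum of the diagonal of the confusion matrix divided by the total number of examples. A minimal sketch on made-up labels (the `y_true_demo` and `y_pred_demo` lists are hypothetical, not results from this notebook):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ['UP', 'DOWN', 'EAST', 'WEST']

# Made-up true and predicted moves, for illustration only
y_true_demo = ['UP', 'UP', 'DOWN', 'EAST', 'WEST', 'WEST']
y_pred_demo = ['UP', 'DOWN', 'DOWN', 'EAST', 'WEST', 'EAST']

# Rows are true classes, columns are predicted classes, both in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Correct predictions lie on the diagonal
accuracy_from_cm = cm.trace() / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true_demo, y_pred_demo))
```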

In [None]:
import numpy as np
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score


# make prediction on the test set
y_pred = tree_clf.predict(X_test)

# compute classification accuracy
print("\nAccuracy (splitting training/test sets)")
print(metrics.accuracy_score(y_test, y_pred))

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("\nConfusion matrix:")
print(cnf_matrix)

# 10-fold cross validation
scores = cross_val_score(tree_clf, X, y, cv=10, scoring='accuracy')
print("\nAccuracy (10-fold cross validation):")
print(scores.mean())

## 4. Visualize your model!

Now that the model has been built and evaluated, you can also view it! A distinctive feature of decision trees is that the rule they learn is "easily" readable: just walk through the tree starting from the root.

The tree can easily be translated into a set of IF ... THEN ... ELSE clauses.
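
scikit-learn can print such rules as text with `export_text`. The sketch below trains a tiny tree on made-up data (two hypothetical binary features `A` and `B`, where only `B` determines the move) just to show the output format; the same call should also work on a fitted tree such as `tree_clf`, given a matching list of feature names.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the move depends only on the second feature
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = ['DOWN', 'UP', 'DOWN', 'UP']

toy_clf = DecisionTreeClassifier(criterion='entropy', random_state=4)
toy_clf.fit(X_toy, y_toy)

# Each indented line is one test; each "class:" line is a leaf
rules = export_text(toy_clf, feature_names=['A', 'B'])
print(rules)
```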

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

name_plot = 'Decision Tree (m = {}, num of matches = {})'.format(m, num_of_matches)
plt.figure(figsize=(7.0, 7.0), dpi=400, num=name_plot)
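# 'MOVES' is the target column, not a feature, so it is left out of the feature names
plot_tree(tree_clf, filled=True, feature_names=[name for name in feature_name if name != 'MOVES'], class_names=class_name)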
plt.show()


#### Are you having trouble viewing your tree? (Optional)

If $m = 2$, the tree is too large to display in full. Don't worry: you can export it to a PDF file that will be saved in your notebook's workspace.

In [None]:
# Export the tree to a PDF file
import graphviz
from sklearn import tree

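# Drop the target column name; filtering instead of slicing in place keeps this cell safe to re-run
tree_feature_names = [name for name in feature_name if name != 'MOVES']

dot_data = tree.export_graphviz(tree_clf, out_file=None,
                     feature_names=tree_feature_names,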
                     class_names=class_name,
                     filled=True, rounded=True,
                     special_characters=True)

graph = graphviz.Source(dot_data)
graph.render(name_plot)

## 5. Save your model

You have reached the last step of this notebook. If you are satisfied with the model, all that remains is to save it in the same folder that contains the training data. This is important because the model will be loaded in the next notebook. Good luck!
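
As a quick sanity check of the save and load round trip, the sketch below dumps a toy classifier to a temporary folder and reloads it with `joblib.load`. The toy data and the file name `demo_tree.joblib` are hypothetical; how the next notebook actually loads the model may differ.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Toy classifier trained on made-up data, for illustration only
X_demo = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_demo = ['DOWN', 'UP', 'DOWN', 'UP']
clf = DecisionTreeClassifier(random_state=4).fit(X_demo, y_demo)

# Save to a temporary folder and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_tree.joblib')
    dump(clf, demo_path)
    restored = load(demo_path)

# The restored model should make the same predictions as the original
same_predictions = list(restored.predict(X_demo)) == list(clf.predict(X_demo))
print(same_predictions)
```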

In [None]:
from joblib import dump

path_of_model = 'output/train_set_m{}/model_Tree_{}_{}.h5'.format(m, num_of_matches, num_of_matches)
dump(tree_clf, path_of_model)
print('model saved in {}'.format(path_of_model))

In [None]:
from IPython.display import Image

Image(url="images/Baby.png")