# Implementing a Decision Tree with scikit-learn

Dr J Rogel-Salazar

[j.rogel.datascience@gmail.com](mailto:j.rogel.datascience@gmail.com)

Let us import some libraries that we will use during the practice:

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd

We can read the data with the help of pandas using the `read_csv` function:

In [None]:
iris_data = pd.read_csv('./data/iris.csv') 

Let us look at the first 5 records (`head` returns five rows by default):

In [None]:
iris_data.head()

As we can see, the species is provided as a string, but the algorithms we are likely to use only take numerical values. 

Let us write a function that transforms the strings into numbers:

- `setosa`: 0
- `versicolor`: 1
- `virginica`: 2

In [None]:
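def get_num(x):
    """Map a species name to its integer code."""
    if x == 'setosa':
        y = 0
    elif x == 'versicolor':
        y = 1
    elif x == 'virginica':
        y = 2
    else:
        # Fail loudly instead of returning an unassigned `y`
        raise ValueError(f"Unknown species: {x!r}")
    return y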

In [None]:
iris_data['species'].value_counts()

We can now apply the function to the `species` field in our data:

In [None]:
iris_data['target']= iris_data['species'].apply(get_num)
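The same encoding can also be written with a dictionary and the pandas `Series.map` method. The following is only a minimal sketch: the small `demo` dataframe and the `species_codes` dictionary are made up for illustration and are not part of the iris file.

```python
import pandas as pd

# Hypothetical lookup table with the same codes used by get_num
species_codes = {"setosa": 0, "versicolor": 1, "virginica": 2}

# Tiny made-up dataframe standing in for iris_data
demo = pd.DataFrame({"species": ["setosa", "virginica", "versicolor", "setosa"]})

# map looks up every value of the column in the dictionary
demo["target"] = demo["species"].map(species_codes)
print(demo)
```

Note that `map` leaves a missing value (`NaN`) for any label that is not in the dictionary, so it is worth checking the result with `isna()` if the data may contain unexpected species names.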

In [None]:
iris_data

Just for kicks, let us look at the last 5 records:

In [None]:
iris_data.tail()

# Modelling the data

As we mentioned above, the algorithm we are going to use requires the data to be numerical and structured in arrays.

We can extract the values from the pandas dataframe:

- `X`: the iris attributes
- `Y`: target species

In [None]:
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris_data[feature_names].values
Y = iris_data['target'].values

Let us import the `tree` module from the scikit-learn library:

In [None]:
from sklearn import tree

Scikit-learn requires us to create an instance of the model. In this case we use the `DecisionTreeClassifier` class, with `entropy` as the criterion used to partition our data.

Entropy in information theory tells us how much information there is in an event. In general, the more uncertain or random the event is, the more information it will contain. The concept of information entropy was created by mathematician Claude Shannon.
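For a set of labels where class $i$ appears with proportion $p_i$, the Shannon entropy is $H = -\sum_i p_i \log_2 p_i$. As a rough illustration of what the criterion measures, here is a minimal sketch that computes it for a list of labels. The function name `shannon_entropy` and the example lists are our own and are not part of scikit-learn.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    counts = Counter(labels)
    total = len(labels)
    # Sum -p * log2(p) over the proportion p of each class
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Three equally frequent classes: the most uncertain case for three labels
balanced = [0] * 50 + [1] * 50 + [2] * 50
# A single class: no uncertainty at all
pure = [0] * 50

print(shannon_entropy(balanced))
print(shannon_entropy(pure))
```

The balanced list gives the largest value possible for three classes, $\log_2 3$, while the pure list gives zero. A decision tree using this criterion looks for splits that move the resulting groups towards the pure case.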


In [None]:
model = tree.DecisionTreeClassifier(criterion='entropy')

Once we have an instance of the model we can train it with the `fit` method by providing the inputs and target:

In [None]:
IrisTree = model.fit(X, Y)

Remember that we are interested in predicting the likely species of a flower based on its characteristics. 

We can obtain the predictions given by the model with the help of the `predict` method.

In [None]:
iris_pred = IrisTree.predict(X)

Finally, we can see how well we have done by comparing the predictions to the targets. A difference of zero means the prediction matches the target:

In [None]:
iris_pred - Y
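The comparison can be condensed into a single accuracy figure, and it is also worth scoring the tree on flowers it has not seen during training. The sketch below assumes that the copy of the iris data bundled with scikit-learn (`load_iris`) can stand in for the CSV file used above. The variable names, the 30% test size and `random_state=0` are illustrative choices, not part of the original practice.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data replaces ./data/iris.csv
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# Accuracy on the same data the tree was trained on
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_demo, y_demo)
train_accuracy = np.mean(clf.predict(X_demo) == y_demo)

# Accuracy on a held-out 30% of the flowers
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
held_out = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
held_out.fit(X_train, y_train)
test_accuracy = held_out.score(X_test, y_test)

print("Training accuracy:", train_accuracy)
print("Held-out accuracy:", test_accuracy)
```

Accuracy measured on the training data tends to be optimistic, because an unrestricted tree can memorise the flowers it was fitted on. The held-out figure is usually a fairer estimate of how the model would do on new measurements.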

# Looking at the rules

Don't worry too much at this stage about the details of the function below.

We are using it to take a look at the rules that the decision tree we implemented has generated.

In [None]:
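def get_code(model, feature_names):
    """Print the rules of a fitted decision tree as nested if/else pseudocode."""
    left      = model.tree_.children_left
    right     = model.tree_.children_right
    threshold = model.tree_.threshold
    # Leaves store the placeholder feature index -2, so only split nodes get a name
    features  = [feature_names[i] if i != -2 else None for i in model.tree_.feature]
    value = model.tree_.value

    def recurse(left, right, threshold, features, node):
        # scikit-learn marks a leaf with a threshold of -2 and child indices of -1
        if (threshold[node] != -2):
            print("if ( " + features[node] + " <= " + str(threshold[node]) + " ) {")
            if left[node] != -1:
                recurse(left, right, threshold, features, left[node])
            print("} else {")
            if right[node] != -1:
                recurse(left, right, threshold, features, right[node])
            print("}")
        else:
            # Leaf: print the class distribution stored at this node
            print("return " + str(value[node]))

    # Start the traversal at the root node (index 0)
    recurse(left, right, threshold, features, 0)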
    
def plot_tree(model, feature_names):
    """Generate a tree visualisation export

    Returns the full tree of the given sklearn model as PNG image data;
    use IPython.display.Image() to show it in Jupyter
    """
    import pydotplus
    
    dt_full = tree.export_graphviz(model, out_file = None,
                                   feature_names = feature_names)
    pydot_full = pydotplus.graph_from_dot_data(dt_full)
    
    return pydot_full.create_png()

Let us take a look at the rules:

In [None]:
get_code(IrisTree, feature_names)
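Recent versions of scikit-learn also ship a helper, `tree.export_text`, that prints the rules without a hand-written traversal. The sketch below is a self-contained illustration that assumes scikit-learn's bundled iris data (`load_iris`) in place of the CSV file. The names `clf` and `rules` are our own.

```python
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data replaces ./data/iris.csv
iris = load_iris()
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(iris.data, iris.target)

# export_text returns the rules as an indented, human-readable string
rules = tree.export_text(clf, feature_names=iris.feature_names)
print(rules)
```

Each indented line is either a test on one feature or a leaf reporting the predicted class.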

We need to install `pydotplus`, a Python library that interfaces with a graph description language (a tree is a type of graph).  
Here we use `conda`, the Anaconda package manager, to install this package.  

In [None]:
!conda install -y pydotplus

We also need to install Graphviz itself to be able to visualize the tree. The cell below does this with Homebrew, so it assumes Homebrew is available on your machine.  
Graphviz uses a notation, the DOT language, that describes the tree in a form that can be rendered as a picture.  

In [None]:
!brew install graphviz

The resulting graphical representation of this tree:

In [None]:
from IPython.display import Image
Image(plot_tree(model, feature_names))
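If installing Graphviz is not an option, scikit-learn's own `tree.plot_tree` function can draw the tree with matplotlib alone. Below is a minimal sketch, again assuming scikit-learn's bundled iris data (`load_iris`) instead of the CSV file. Note that this library function shares its name with the `plot_tree` helper defined earlier, so it is called through the `tree` module to avoid confusion. The figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data replaces ./data/iris.csv
iris = load_iris()
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(iris.data, iris.target)

# Draw the fitted tree on a matplotlib axis; no Graphviz or pydotplus needed
fig, ax = plt.subplots(figsize=(12, 8))
annotations = tree.plot_tree(
    clf,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,  # colour each node by its majority class
    ax=ax,
)
```

With `%matplotlib inline` active, the figure appears directly below the cell.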