# Decision Tree in Python

## Introduction


* Author: ntthanh.sdh221@hcmut.edu.vn

* Content:
  - Practice the DecisionTree classifier
  - Learn how to split the dataset into training and test sets
  - Evaluate the model with the accuracy score
* Major steps:
  - Load data from CSV, split into training set and test set
  - Building kNN model
  - Evaluating results
  - Training DecisionTree classifier model
  - Evaluating results
  - Visualize models

## Configuration

In [3]:
dataset_filename = "Iris.csv"

## Grab the data

Since our file is in CSV format, we use the pandas `read_csv` function to read it. Execute the following script to do so:

In [4]:
import pandas as pd  
dataset = pd.read_csv(dataset_filename)
dataset.shape

(150, 6)

In [5]:
# Overview about dataset
dataset.keys()

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [6]:
dataset =  dataset.drop(labels=['Id'], axis=1)

dataset.head()  

| | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
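
Before splitting, it can help to check for missing values and for class balance. The following is a minimal sketch, not part of the original notebook: it assumes scikit-learn's bundled Iris data (`load_iris(as_frame=True)`) as a stand-in for `Iris.csv`, so the column names and the integer `target` column differ from the CSV shown above.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
iris = load_iris(as_frame=True)
frame = iris.frame  # four feature columns plus an integer "target" column

# Count missing values per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)

# Count samples per class to see whether the classes are balanced.
class_counts = frame["target"].value_counts().sort_index()
print(class_counts)
```

With the CSV loaded as in the cells above, the equivalent calls would be `dataset.isna().sum()` and `dataset['Species'].value_counts()`.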


## Prepare train data and test set

The label is stored in the **Species** column, so we divide the dataset into attributes and labels.

In [7]:
X = dataset.drop('Species', axis=1)  
y = dataset['Species']  

Here the X variable contains all the columns from the dataset, except the **Species** column, which is the label.

The y variable contains the values from the **Species** column. 

The X variable is our attribute set and y variable contains corresponding labels.

### Dividing our data into training and test sets. 

We split the dataset into two sets: a training set and a test set. We put 20% of the data into the test set and keep 80% for training.

In [8]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, y_train.size)
print(X_test.shape, y_test.shape)

(120, 4) 120
(30, 4) (30,)
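
The split above is random, so the exact rows in each set (and therefore the scores further down) change on every run. A possible variant, sketched here under assumptions, fixes the seed with `random_state` and uses `stratify` so that each class keeps the same proportion in both sets. The seed value `42` and the use of scikit-learn's bundled Iris data instead of `Iris.csv` are choices made for this illustration only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: bundled Iris data stands in for the CSV used in the notebook.
X_demo, y_demo = load_iris(return_X_y=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.20,
    random_state=42,   # arbitrary seed, only there to make the split repeatable
    stratify=y_demo,   # keep class proportions equal in train and test
)

print(X_tr.shape, X_te.shape)
print(Counter(y_te.tolist()))  # per-class counts in the test set
```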


## Training with Decision Tree

Now, let's build a **Decision Tree** model
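
The two subsections that follow differ only in the `criterion` argument, which selects the impurity measure used to choose splits. As a rough sketch, Gini impurity is `1 - sum(p_k ** 2)` and entropy is `-sum(p_k * log2(p_k))`, where `p_k` is the fraction of samples of class `k` in a node. The helper names `gini` and `entropy` below are hypothetical, written only for this illustration; the class counts `[41, 39, 40]` and `[0, 39, 40]` are taken from the exported tree shown later in this document.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) of a node given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing, so they are skipped.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


root_counts = [41, 39, 40]   # root node of the exported tree
child_counts = [0, 39, 40]   # node left after the first class is split off
pure_counts = [41, 0, 0]     # a leaf containing a single class

print(round(gini(root_counts), 3))   # 0.667, matching the root node label
print(round(gini(child_counts), 3))  # 0.5
print(gini(pure_counts), entropy(pure_counts))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often produce similar trees.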

### a) criterion='gini'

In [9]:
from sklearn.tree import DecisionTreeClassifier  
dt = DecisionTreeClassifier(criterion='gini')  
dt.fit(X_train, y_train)  

In [10]:
y_pred_dt = dt.predict(X_test)  
y_pred_dt

array(['Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
       'Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'],
      dtype=object)

##### Evaluating the Algorithm



In [11]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 1.0
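
For a classifier, `score` returns the mean accuracy, that is, the fraction of test samples predicted correctly. A single number hides which classes get confused, so a confusion matrix can be a useful complement. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled Iris data and a fixed `random_state=0`, neither of which appears in the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data and a fixed seed, for a repeatable illustration.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)
pred = clf.predict(X_te)

# accuracy_score computes the same quantity as clf.score(X_te, y_te).
acc = accuracy_score(y_te, pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])

print(acc)
print(cm)
```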


### b) criterion='entropy'

In [13]:
from sklearn.tree import DecisionTreeClassifier  
dt2 = DecisionTreeClassifier(criterion='entropy')  
dt2.fit(X_train, y_train)  

In [14]:
y_pred_dt = dt2.predict(X_test)  
y_pred_dt

array(['Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-versicolor',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa',
       'Iris-virginica', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-virginica', 'Iris-virginica', 'Iris-versicolor',
       'Iris-setosa', 'Iris-virginica', 'Iris-versicolor', 'Iris-setosa'],
      dtype=object)

##### Evaluating the Algorithm



In [15]:
dt_score = dt2.score(X_test, y_test)  # score the entropy model (dt2), not the gini model (dt)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 1.0
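
A single 30-sample test set is small, so one accuracy value says little about whether `gini` or `entropy` works better. One way to get a steadier comparison is k-fold cross-validation. This is a sketch under assumptions rather than part of the original notebook: it uses scikit-learn's bundled Iris data, 5 folds, and `random_state=0`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

cv_results = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    # Five accuracy values, one per held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_results[criterion] = scores
    print(criterion, scores.mean(), scores.std())
```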


## Extra section

### Visualize decision tree

We can visualize the Decision Tree model using the [Graphviz](https://www.graphviz.org/) tool.

Graphviz is a simple tool for drawing graphs. For example, the code

```
digraph G {Hello->World}
```

will generate the following graph

![digraph](https://graphviz.gitlab.io/_pages/Gallery/directed/hello.png)

* More examples: https://www.graphviz.org/gallery/
* You can experiment at http://www.webgraphviz.com/ or https://dreampuf.github.io/GraphvizOnline/

So let's generate the graph.

In [16]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dt, out_file=None)
print(dot_data)

digraph Tree {
node [shape=box, fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="X[2] <= 2.6\ngini = 0.667\nsamples = 120\nvalue = [41, 39, 40]"] ;
1 [label="gini = 0.0\nsamples = 41\nvalue = [41, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[3] <= 1.75\ngini = 0.5\nsamples = 79\nvalue = [0, 39, 40]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="X[2] <= 4.95\ngini = 0.206\nsamples = 43\nvalue = [0, 38, 5]"] ;
2 -> 3 ;
4 [label="X[3] <= 1.65\ngini = 0.053\nsamples = 37\nvalue = [0, 36, 1]"] ;
3 -> 4 ;
5 [label="gini = 0.0\nsamples = 36\nvalue = [0, 36, 0]"] ;
4 -> 5 ;
6 [label="gini = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
4 -> 6 ;
7 [label="X[3] <= 1.55\ngini = 0.444\nsamples = 6\nvalue = [0, 2, 4]"] ;
3 -> 7 ;
8 [label="gini = 0.0\nsamples = 3\nvalue = [0, 0, 3]"] ;
7 -> 8 ;
9 [label="X[2] <= 5.45\ngini = 0.444\nsamples = 3\nvalue = [0, 2, 1]"] ;
7 -> 9 ;
10 [label="gini = 0.0\nsamples = 2\nvalue = [0

Copy the code above, paste it into an online Graphviz service such as http://viz-js.com/, and view the result.
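
The node labels in the export above refer to features by index (`X[2]`, `X[3]`) and to classes only by position in `value`. `export_graphviz` accepts `feature_names` and `class_names` to make the graph easier to read. The sketch below assumes scikit-learn's bundled Iris data, whose feature and class names differ from the CSV column names; with the notebook's variables, `feature_names=list(X.columns)` and `class_names=list(dt.classes_)` would be the likely equivalents.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Assumption: bundled Iris data and a fixed seed, for illustration only.
iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(iris.data, iris.target)

labeled_dot = export_graphviz(
    clf,
    out_file=None,                          # return the DOT source as a string
    feature_names=iris.feature_names,       # replaces X[i] with column names
    class_names=list(iris.target_names),    # adds a "class = ..." line per node
    filled=True,                            # color nodes by majority class
    rounded=True,
)

print(labeled_dot[:300])
```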

### Jupyter does it all

If we want to display the graph automatically inside Jupyter, we need to:
* Install the Graphviz binary
* Call Graphviz from inside Jupyter and grab the result.

**Note**: Google Colab does not let us install the Graphviz binary, so we cannot make this work with Google Colab.


In [17]:
# Install the Python library that calls the Graphviz binary.
# The binary itself (the `dot` executable) must be installed separately.

!pip install graphviz



In [18]:
import graphviz 
graph = graphviz.Source(dot_data)
graph

ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.sources.Source at 0x7f4c9be013a0>
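
The `ExecutableNotFound` error above means the Python `graphviz` package could not find the `dot` executable on the PATH. When installing the binary is not an option, one fallback that needs only scikit-learn is `export_text`, which prints the tree as indented rules. This is a sketch under assumptions: it uses the bundled Iris data and limits the depth to 3 (`max_depth=3`) purely to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: bundled Iris data; max_depth=3 only keeps the printout short.
iris = load_iris()
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Plain-text rendering of the fitted tree, no Graphviz binary required.
tree_text = export_text(clf, feature_names=iris.feature_names)
print(tree_text)
```

If matplotlib is available, `sklearn.tree.plot_tree` is another option that draws the tree without Graphviz.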