Decision Tree Example
Load Iris Data

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

#Loading the iris data
data = load_iris()
print('Classes to predict: ', data.target_names)


Classes to predict:  ['setosa' 'versicolor' 'virginica']


Split the data into attributes (features) and the target variable.

In [2]:
#Extracting data attributes
X = data.data
#Extracting target/class labels
y = data.target

print('Number of examples in the data:', X.shape[0])

Number of examples in the data: 150


Take a look at the data

In [3]:
#First four rows in the variable 'X'
X[:4]


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2]])
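
The raw array carries no column labels, so it is hard to tell which measurement is which. As a small sketch (the DataFrame and the `species` column name are illustrative additions, not part of the original notebook), the same rows can be viewed with their feature names using pandas:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Wrap the feature matrix in a DataFrame so each column carries its name.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# 'species' is an illustrative column name: map the integer labels
# (0, 1, 2) to the class names stored in iris.target_names.
df['species'] = iris.target_names[iris.target]

# First four rows, now with labelled columns.
print(df.head(4))

# Number of examples per class.
print(df['species'].value_counts())
```

This view also makes it easy to confirm that the three classes are equally represented before splitting.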

Split the dataset into training and test sets

In [4]:
#Using the train_test_split to create train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)
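
It can help to check what the split actually produced. The sketch below repeats the same call (same `random_state` and `test_size`) and prints the sizes of the two sets and how many examples of each class ended up in the test set. The variable names `X_tr`, `X_te`, `y_tr`, `y_te` and `test_counts` are chosen here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same split as in the notebook: 25% of the 150 examples are held out.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, random_state=47, test_size=0.25
)

# 0.25 * 150 = 37.5, which scikit-learn rounds up to 38 test examples,
# leaving 112 for training.
print('Training set shape:', X_tr.shape)
print('Test set shape:', X_te.shape)

# Count how many test examples belong to each of the three classes.
test_counts = np.bincount(y_te, minlength=3)
print('Test examples per class:', dict(zip(iris.target_names, test_counts)))
```

With only 38 test examples, each misclassified example changes the test accuracy by roughly 0.026, which is worth keeping in mind when comparing scores.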

Create the model

In [5]:
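#Importing the decision tree classifier from the sklearn library (already imported in In [1], repeated here for clarity).
from sklearn.tree import DecisionTreeClassifier
#Creating the classifier with entropy (information gain) as the split criterion.
clf = DecisionTreeClassifier(criterion = 'entropy')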

In [6]:
#Training the decision tree classifier. 
clf.fit(X_train, y_train)
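
Once the tree is fitted, it can be useful to look at the rules it learned. The following is a minimal sketch using `export_text` from `sklearn.tree`; the `random_state=0` on the classifier and the variable names are assumptions added here so the printout is repeatable, and are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, random_state=47, test_size=0.25
)

# random_state=0 is an added assumption: it fixes how ties between
# equally good splits are broken, so repeated runs give the same tree.
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_tr, y_tr)

# Render the fitted tree as indented if/else rules with readable feature names.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Size of the fitted tree: a deep tree with many leaves can memorise the training data.
print('Depth:', tree.get_depth())
print('Number of leaves:', tree.get_n_leaves())
```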

Validation (Apply to the Test dataset)

In [7]:
#Predicting labels on the test set.
y_pred = clf.predict(X_test)

Check the accuracy (we use accuracy as the first metric here and will switch to other validation methods later).

In [8]:
#Importing the accuracy metric from sklearn.metrics library

from sklearn.metrics import accuracy_score
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: ', accuracy_score(y_true=y_test, y_pred=y_pred))


Accuracy Score on train data:  1.0
Accuracy Score on test data:  0.9473684210526315
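
As a hedged preview of what other validation methods could look like, the sketch below shows two common options from scikit-learn: a confusion matrix on the held-out test set and 5-fold cross-validation on the full data. The choice of 5 folds, the `random_state=0` on the classifier, and the variable names are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, random_state=47, test_size=0.25
)

# random_state=0 is an added assumption to make the run repeatable.
tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_tr, y_tr)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, tree.predict(X_te), labels=[0, 1, 2])
print(cm)

# 5-fold cross-validation: fit and score the model on 5 different splits,
# which gives a less split-dependent estimate than a single test set.
scores = cross_val_score(
    DecisionTreeClassifier(criterion='entropy', random_state=0),
    iris.data, iris.target, cv=5
)
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

The confusion matrix shows which classes are confused with each other, while the cross-validation mean is less sensitive to the particular examples that landed in the test set.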


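An accuracy of about 0.95 on the test set, compared with 1.0 on the training set, suggests the tree may be overfitting. Let's tune a parameter. One of the parameters of DecisionTreeClassifier is min_samples_split, the minimum number of samples a node must contain before it can be split into 2 child nodes. The default value is 2, which lets the tree keep splitting until the leaves are nearly pure and may lead to overfitting. Let's change it to 50 and see...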

In [9]:
clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=50)
clf.fit(X_train, y_train)
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on the test data: ', accuracy_score(y_true=y_test, y_pred=clf.predict(X_test)))


Accuracy Score on train data:  0.9553571428571429
Accuracy Score on the test data:  0.9736842105263158


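The first model was a typical overfitting example: perfect accuracy on the training data (1.0) but weaker accuracy on the test data (about 0.947). After raising min_samples_split to 50, training accuracy drops to about 0.955 while test accuracy rises to about 0.974, so the gap between the two closes and the overfitting is reduced. Note that the test set has only 38 examples, so this improvement corresponds to a single additional correct prediction.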
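
To see how the parameter trades tree size against fit, one option is to try several values of min_samples_split and compare the results side by side. This is only a sketch: the candidate values, the `random_state=0` on the classifier, and the `results` list are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, random_state=47, test_size=0.25
)

# Candidate values are illustrative; 2 is the scikit-learn default.
results = []
for min_split in [2, 10, 25, 50, 100]:
    # random_state=0 is an added assumption to make the run repeatable.
    tree = DecisionTreeClassifier(
        criterion='entropy', min_samples_split=min_split, random_state=0
    )
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    test_acc = accuracy_score(y_te, tree.predict(X_te))
    results.append((min_split, tree.get_n_leaves(), train_acc, test_acc))

# Larger values stop splitting earlier, giving smaller trees.
for min_split, n_leaves, train_acc, test_acc in results:
    print(f'min_samples_split={min_split:3d}  leaves={n_leaves:2d}  '
          f'train={train_acc:.3f}  test={test_acc:.3f}')
```

Very large values stop the tree so early that it may underfit, so the useful range lies between the two extremes.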