## DECISION TREE USING IRIS DATASET
A decision tree classifier uses the Gini index to choose binary splits of the data.
  * Strengths: Can handle a large number of features and select those that best determine the targets.
  * Weaknesses: Tends to overfit, because by default the tree keeps splitting until every leaf is pure. Pruning leaves reduces overfitting, but it was not available in the sklearn version used here (cost-complexity pruning via `ccp_alpha` was added in scikit-learn 0.22). Small changes in the data can lead to different splits, so the tree may not reproduce well on future data (see random forest).
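
To make the splitting criterion concrete, here is a minimal sketch of the Gini impurity, 1 - sum(p_k^2), and of the size-weighted impurity of a binary split. The helper names `gini_impurity` and `weighted_gini` are illustrative and are not part of sklearn.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighting each child node by its size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


pure = ["setosa"] * 4
mixed = ["versicolor", "versicolor", "virginica", "virginica"]

print(gini_impurity(pure))         # 0.0, a pure node
print(gini_impurity(mixed))        # 0.5, two classes in equal proportion
print(weighted_gini(pure, mixed))  # 0.25
```

A tree grown with `criterion='gini'` prefers, at each node, the candidate split with the lowest weighted impurity.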

In [3]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import sklearn.metrics

In [4]:
# load iris dataset and dump into a dataframe
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [5]:
# Add target response into dataframe, i.e., species
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [6]:
# convert species into integer codes (factors) so that they can be compared later in graphviz,
# as sklearn encodes class labels as integers during processing (categorical labels can still be passed to the model)
df['species_factorize'], _ = pd.factorize(df['species'])

print(df.head(n=2))
print(df['species'].unique())
print(df['species_factorize'].unique())
# so we can see that 0 = setosa, 1 = versicolor, 2 = virginica

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   

  species  species_factorize  
0  setosa                  0  
1  setosa                  0  
[setosa, versicolor, virginica]
Categories (3, object): [setosa, versicolor, virginica]
[0 1 2]
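
`pd.factorize` numbers the categories in order of first appearance and returns the codes together with the unique values, so the code-to-label mapping can be recovered instead of read off by eye. A small sketch on a made-up series (the `species` values below are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical toy labels, only to illustrate the return values
species = pd.Series(["setosa", "versicolor", "setosa", "virginica"])

codes, uniques = pd.factorize(species)

# Position i in `uniques` is the label that was assigned code i
mapping = dict(enumerate(uniques))

print(codes)
print(mapping)
```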


In [7]:
# time to split the dataframe into two
# 1) the predictors 2) the response target
predictor = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
target = df['species']

print(predictor.head(n=2))
print(target.head(n=2))

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
0    setosa
1    setosa
Name: species, dtype: category
Categories (3, object): [setosa, versicolor, virginica]


In [8]:
# split the dataframe randomly using sklearn's train_test_split function
# test_size=0.25 holds out 25% of the rows for testing
train_predictor, test_predictor, train_target, test_target = train_test_split(predictor, target, test_size=0.25)

# print shapes
# the test sample is about 25% (38) of the total sample while the train sample is about 75% (112)
print(test_predictor.shape)
print(train_predictor.shape)

(38, 4)
(112, 4)
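
Because the split above is random, the confusion matrix and scores will differ from run to run. A minimal sketch of a repeatable split, assuming a fixed seed is acceptable for your purposes; `random_state=42` is an arbitrary choice, and `stratify` is an optional extra that keeps class proportions similar in both parts:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Same seed -> same rows in train and test on every run.
# stratify keeps the three species roughly balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=42, stratify=iris.target
)

print(X_test.shape)   # (38, 4)
print(X_train.shape)  # (112, 4)
```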


## (1) Fit the Model

In [9]:
# set the classifier
clf = DecisionTreeClassifier()
# fit (train) the model. Arguments (training predictor, training response target)
model = clf.fit(train_predictor, train_target)
model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
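
The defaults shown above (`max_depth=None`, `min_samples_leaf=1`) let the tree grow until every leaf is pure, which is where the overfitting risk comes from. As a hedged sketch, newer scikit-learn releases (0.22 and later) accept a `ccp_alpha` argument for cost-complexity pruning. The value `0.02` below is an arbitrary illustration, not a tuned setting, and this requires a newer sklearn than the one that produced the output above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

# Fully grown tree (default settings) versus a cost-complexity pruned tree.
# ccp_alpha=0.02 is an assumed, illustrative value.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_train, y_train)

print(full_tree.get_n_leaves(), pruned_tree.get_n_leaves())
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```

Larger `ccp_alpha` values prune more aggressively; a sensible value is usually chosen by cross-validation rather than fixed by hand.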

## (2) Test the Model

In [10]:
# put the test sample predictor in
predictions = model.predict(test_predictor)

## (3) Score the Model

In [11]:
# score the model using a confusion matrix and a percentage accuracy score
print(sklearn.metrics.confusion_matrix(test_target, predictions))
print(sklearn.metrics.accuracy_score(test_target, predictions)*100, '%')

[[14  0  0]
 [ 0 13  0]
 [ 0  1 10]]
97.3684210526 %
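
The two outputs are linked: accuracy is the sum of the diagonal of the confusion matrix (correct predictions) divided by the total count. A short check using the matrix printed above, where rows are actual classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[14, 0, 0],
               [0, 13, 0],
               [0, 1, 10]])

# Correct predictions sit on the diagonal: 14 + 13 + 10 = 37 out of 38
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy * 100, 2))  # 97.37

# Per-class recall: diagonal divided by the row totals (actual counts)
recall = np.diag(cm) / cm.sum(axis=1)
print(recall)
```

The single error is one virginica sample predicted as versicolor, which is why only the third class has recall below 1.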


In [12]:
# the pandas_confusion package prints a labelled confusion matrix with row and column totals
from pandas_confusion import ConfusionMatrix
ConfusionMatrix(test_target, predictions)



Predicted   setosa  versicolor  virginica  __all__
Actual                                            
setosa          14           0          0       14
versicolor       0          13          0       13
virginica        0           1         10       11
__all__         14          14         10       38
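
If `pandas_confusion` is not installed, a similar labelled table can be built with plain pandas. This is a sketch using `pd.crosstab` on a small made-up example; the `actual` and `predicted` series are hypothetical stand-ins for `test_target` and `predictions`.

```python
import pandas as pd

# Hypothetical labels standing in for test_target and predictions
actual = pd.Series(["setosa", "versicolor", "virginica", "virginica"], name="Actual")
predicted = pd.Series(["setosa", "versicolor", "versicolor", "virginica"], name="Predicted")

# margins=True adds an "All" row and column holding the totals
table = pd.crosstab(actual, predicted, margins=True)
print(table)
```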

In [16]:
# rank the importance of features
df2 = pd.DataFrame(model.feature_importances_, index=predictor.columns)
df2.sort_values(by=0, ascending=False)

# petal width is by far the most important feature, followed by petal length
# how the features are used is easier to see in graphical form, see the next section

| | 0 |
|---|---|
| petal width (cm) | 0.952542 |
| petal length (cm) | 0.029591 |
| sepal length (cm) | 0.017867 |
| sepal width (cm) | 0.0 |


## View your decision tree model in graphical form

In [47]:
# create a .dot file of the tree using graphviz
# arguments include your model, output name, and feature names (predictors) for labelling
from sklearn import tree
tree.export_graphviz(model, out_file = 'tree.dot', feature_names=iris.feature_names)

In [1]:
# convert .dot to .ps (postscript) so the file can be opened
# otherwise, visualise it online by pasting the .dot code at this link (http://www.webgraphviz.com)

import subprocess
subprocess.call(['dot', '-Tps', 'tree.dot', '-o', 'tree.ps'])
# a return value of 0 means success; a non-zero value means failure

0
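
If the Graphviz `dot` program is not installed, the tree can also be inspected as plain text. A minimal sketch, assuming scikit-learn 0.21 or later, where `sklearn.tree.export_text` is available; `max_depth=2` is set only to keep the printed rules short and is not part of the model fitted above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree so the printed rules stay readable (illustrative setting)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split threshold and per leaf, indented by depth
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```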