# Decision Tree Experimentation
First, we import all the relevant packages.

scikit-learn's train_test_split() splits the data into a training set and a test set. This is a quick shortcut for now. Before further processing, we should partition the data into training, cross-validation and test sets, using the same percentages throughout.

In [88]:
import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.model_selection import cross_val_score
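
The three-way partition mentioned above can be sketched by calling train_test_split() twice. This is a minimal illustration on synthetic data, not the notebook's actual pipeline: the 60/20/20 proportions, the use of stratify, and the helper name three_way_split are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(X, y, val_size=0.2, test_size=0.2, random_state=100):
    """Split X, y into train / validation / test sets (hypothetical helper)."""
    # First carve off the test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    # Then split the remainder. Rescale val_size so that it is a fraction
    # of the original data rather than of the remainder.
    rel_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=rel_val,
        random_state=random_state, stratify=y_rest
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in for the real data: 1000 rows, 4 balanced classes.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.repeat([1, 2, 3, 4], 250)

X_tr, X_va, X_te, y_tr, y_va, y_te = three_way_split(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))
```

Stratifying on the label keeps the class proportions roughly equal across the three sets, which matters when some Target classes are rare.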

## Loading dataset

Load the dataset and examine the initial features.

In [89]:
input_data = pd.read_csv('processed_train.csv', index_col=0)

In [90]:
print ("Dataset Length:: ", len(input_data))
print ("Dataset Shape: ", input_data.shape)
input_data.info()
input_data.head(5)

Dataset Length::  2973
Dataset Shape:  (2973, 217)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2973 entries, 0 to 2972
Columns: 217 entries, Id to age-range_
dtypes: bool(1), float64(13), int64(201), object(2)
memory usage: 4.9+ MB


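| | Id | idhogar | Target | hacdor | hacapo | v14a | refrig | paredblolad | paredzocalo | paredpreb | ... | escolari-min | escolari-max | escolari-sum | escolari-std | escolari-range_ | age-min | age-max | age-sum | age-std | age-range_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID_279628684 | 21eb7fcc1 | 4 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 10 | 10 | 10 | 0.0 | 0 | 43 | 43 | 43 | 0.0 | 0 |
| 1 | ID_f29eb3ddd | 0e5d7a658 | 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 12 | 12 | 12 | 0.0 | 0 | 67 | 67 | 67 | 0.0 | 0 |
| 2 | ID_68de51c94 | 2c7317ea8 | 4 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 11 | 11 | 11 | 0.0 | 0 | 92 | 92 | 92 | 0.0 | 0 |
| 3 | ID_ec05b1a7b | 2b58d945f | 4 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 2 | 11 | 33 | 4.272002 | 9 | 8 | 38 | 100 | 14.899664 | 30 |
| 4 | ID_1284f8aad | d6dae86b7 | 4 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 11 | 23 | 5.123475 | 11 | 7 | 30 | 76 | 11.690452 | 23 |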


We are not doing any feature selection yet. For now we only group the columns by dtype (boolean, integer, float, object) and leave the object-typed ID columns out of the model input.

In [91]:
#Split data into variables types - boolean, categorical, continuous, ID
bool_var = list(input_data.select_dtypes(['bool']))
cont_var = list(input_data.select_dtypes(['float64']))
cat_var = list(input_data.select_dtypes(['int64']))
id_var = list(input_data.select_dtypes(['object']))

#Get dataset with only categorical variables
cat_data = input_data[cat_var + bool_var]

#Get Continuous Variables from Data
cont_data = input_data[cont_var]

#Input Data can be from all except id details
final_input_data = input_data[cat_var + bool_var + cont_var]
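
Grouping by dtype treats every int64 column as categorical, which is only a rough proxy. The sketch below shows the same select_dtypes() calls on a toy frame and adds a distinct-value count to tell binary flags apart from integer-valued measurements. The toy values are made up, and the cutoff of two distinct values is an assumption for illustration.

```python
import pandas as pd

# Toy frame (made-up values) mimicking the mix of dtypes in the real data.
toy = pd.DataFrame({
    "Id": pd.Series(["ID_a", "ID_b", "ID_c", "ID_d"], dtype="object"),
    "Target": [4, 4, 2, 1],
    "refrig": [1, 1, 0, 1],
    "age-max": [43, 67, 92, 38],
    "escolari-std": [0.0, 0.0, 4.27, 5.12],
})

# Same dtype-based grouping as in the cell above.
int_cols = list(toy.select_dtypes(["int64"]))
float_cols = list(toy.select_dtypes(["float64"]))
id_cols = list(toy.select_dtypes(["object"]))

# int64 is only a proxy for "categorical": count distinct values per column
# to spot binary flags versus numeric quantities stored as integers.
n_unique = toy[int_cols].nunique()
binary_cols = list(n_unique[n_unique <= 2].index)

print("integer columns:", int_cols)
print("binary columns: ", binary_cols)
```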

In [92]:
final_input_data['Target'].head(5)

0    4
1    4
2    4
3    4
4    4
Name: Target, dtype: int64

Creating the X and Y variables.
Target is the label column. It is selected by name, so its position in final_input_data does not matter; every other column is a predictor variable.

In [93]:
X = final_input_data.loc[:, final_input_data.columns != 'Target'].values
Y = final_input_data['Target'].values
Y=Y.astype('int')

In [94]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, 
                                                    random_state = 100 )

## Decision Tree Modelling

### Creating basic Decision Tree using Gini Index as Criterion

In [95]:
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
                               max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

In [96]:
# predict() returns the predicted class label for each row of X_test
Y_predict_gini = clf_gini.predict(X_test)
print('testing acc for gini is %f' % accuracy_score(Y_test, Y_predict_gini))

testing acc for gini is 0.646861
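
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below compares a depth-3 tree with a classifier that always predicts the most frequent training class. It runs on synthetic, deliberately imbalanced data from make_classification(), so the class weights and the resulting numbers are assumptions and say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced 4-class data standing in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=8, n_classes=4,
    weights=[0.1, 0.15, 0.15, 0.6], random_state=100,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree_clf = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, random_state=100
).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree_clf.predict(X_te))
print("baseline acc: %f" % baseline_acc)
print("tree acc:     %f" % tree_acc)
```

If the tree's accuracy is close to the baseline's, the model is adding little beyond predicting the majority class.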


### Creating basic Decision Tree using Information Gain as Criterion

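In scikit-learn, information gain is selected with criterion="entropy": the split with the highest information gain is the one that most reduces the weighted entropy of the child nodes.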
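
To make the two criteria concrete, here is a small worked sketch that computes Gini impurity and entropy from a node's class counts, along with the impurity decrease of a candidate split. The class counts are invented and the function names are hypothetical; this illustrates the formulas, not scikit-learn's internal implementation.

```python
import math


def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def impurity_decrease(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Made-up node: 10 samples of class A and 10 of class B,
# split into two children of 10 samples each.
parent = [10, 10]
children = [[8, 2], [2, 8]]

print("gini(parent)     =", gini(parent))
print("entropy(parent)  =", entropy(parent))
print("gini decrease    =", impurity_decrease(parent, children, gini))
print("information gain =", impurity_decrease(parent, children, entropy))
```

When entropy is the impurity measure, the decrease computed here is what is usually called information gain. The two criteria often rank candidate splits the same way, which is consistent with both trees scoring identically in this section.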

In [97]:
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
                                     max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

In [98]:
# predict() returns the predicted class label for each row of X_test
Y_predict_entropy = clf_entropy.predict(X_test)
print('testing acc for entropy is %f' % accuracy_score(Y_test, Y_predict_entropy))

testing acc for entropy is 0.646861


## Tuning Parameters
Both criteria give the same test accuracy (0.646861) at max_depth=3, so the choice between information gain and Gini index makes no difference here, and we should try changing other parameters.
Let's increase the maximum depth and see how that changes the result.

In [99]:
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
                               max_depth=5, min_samples_leaf=5)
clf_gini.fit(X_train, Y_train)

clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100,
                                     max_depth=5, min_samples_leaf=5)
clf_entropy.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

In [100]:
# predict() returns the predicted class label for each row of X_test
Y_predict_gini = clf_gini.predict(X_test)
print('testing acc for gini is %f' % accuracy_score(Y_test, Y_predict_gini))
Y_predict_entropy = clf_entropy.predict(X_test)
print('testing acc for entropy is %f' % accuracy_score(Y_test, Y_predict_entropy))

testing acc for gini is 0.647982
testing acc for entropy is 0.626682


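With max_depth raised from 3 to 5, the Gini tree's test accuracy barely moves (0.646861 to 0.647982), while the entropy tree's drops to 0.626682. Gini therefore comes out about two percentage points ahead at this depth, although a gap measured on a single train/test split should be treated with caution.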
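
Rather than trying depths one at a time, the comparison can be run over a range of max_depth values for both criteria. The sketch below uses synthetic data from make_classification() as a stand-in, so its numbers say nothing about the real dataset; the depth range of 2 to 8 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=20, n_informative=8, n_classes=4,
    random_state=100,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)

results = {}
for criterion in ("gini", "entropy"):
    for depth in range(2, 9):
        clf = DecisionTreeClassifier(
            criterion=criterion, max_depth=depth,
            min_samples_leaf=5, random_state=100,
        )
        clf.fit(X_tr, y_tr)
        # Record both scores: a widening train/test gap suggests overfitting.
        train_acc = accuracy_score(y_tr, clf.predict(X_tr))
        test_acc = accuracy_score(y_te, clf.predict(X_te))
        results[(criterion, depth)] = (train_acc, test_acc)
        print("%-8s depth=%d  train=%.3f  test=%.3f"
              % (criterion, depth, train_acc, test_acc))
```

Reporting training accuracy next to test accuracy shows whether extra depth is helping the model generalise or only fitting the training set more closely.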

## Other Metrics
Let's compute the cross-validation scores.