# A New Decision Tree builder model for Python
**Authors:** Siddharth Verma, Nayan Chaudhary

# Introduction
Decision trees, and ensembles of them such as random forests, are widely used for both classification and regression tasks. Many packages in Python, R, and other programming languages provide built-in methods for creating decision trees.

One of the most widely used Python packages, _scikit-learn_, has its own implementations, _DecisionTreeClassifier_ and _DecisionTreeRegressor_. However, the current version of scikit-learn does not handle categorical data directly and requires categorical inputs to be encoded numerically.

In this project, we propose to create a DecisionTreeBuilder that can automatically handle categorical variables, without requiring the user to encode them. We also propose to show that the classification and regression results obtained from our approach are comparable to those obtained from the built-in sklearn models (with encoded categorical attributes wherever necessary), across various evaluation criteria.
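
To make the idea concrete, here is a minimal sketch of how a split test can treat numeric and categorical values differently, so that no encoding step is needed. The names `Question`, `match`, and `is_numeric` are hypothetical and chosen for illustration; this is not the SimpleTree implementation, and the sketch assumes rows are plain dictionaries.

```python
from numbers import Number


def is_numeric(value):
    """True for ints and floats, False for strings and booleans."""
    return isinstance(value, Number) and not isinstance(value, bool)


class Question:
    """A split test on one column (hypothetical helper for illustration)."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, row):
        # Numeric features are compared against a threshold; anything
        # else (for example a town name) is compared by equality.
        candidate = row[self.column]
        if is_numeric(candidate):
            return candidate >= self.value
        return candidate == self.value

    def __repr__(self):
        op = ">=" if is_numeric(self.value) else "=="
        return "Is {} {} {}?".format(self.column, op, self.value)


row = {"TOWN": "Nahant", "RM": 6.575}
print(Question("TOWN", "Nahant").match(row))  # True
print(Question("RM", 7.0).match(row))         # False
```

Because the comparison operator is chosen from the value's type, a column such as 'TOWN' can be split on directly.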

In [1]:
import SimpleTree as st
import pandas as pd
import numpy as np
import operator
from sklearn import tree
from sklearn import metrics
from sklearn import preprocessing

# Datasets
To analyze our model's performance, we considered three datasets (two for classification and one for regression). We used the **Iris Species**, **Boston Housing**, and **Breast Cancer Diagnosis** datasets, which have been widely used in academia as standard examples.

To observe the difference between our approach and encoding categorical variables with sklearn, we also compare our results on the **Car Evaluation** dataset, in which most of the variables are categorical.

All the datasets are available through the UCI Machine Learning Repository.

## Loading Iris species dataset

In [2]:
iris = pd.read_csv('./iris.csv')
iris.drop('Id', axis =1, inplace=True)

In [3]:
iris.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Loading Boston housing dataset

In [4]:
boston = pd.read_csv('./boston.csv', nrows=150)

In [5]:
boston.head()

Unnamed: 0,TOWN,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
0,Nahant,2011,-70.955,42.255,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3
1,Swampscott,2021,-70.95,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8
2,Swampscott,2022,-70.936,42.283,34.7,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8
3,Marblehead,2031,-70.928,42.293,33.4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7
4,Marblehead,2032,-70.922,42.298,36.2,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7


## Loading Breast cancer diagnosis dataset

In [6]:
cancer = pd.read_csv('./breast_cancer.csv', nrows=150)

In [7]:
y = cancer.diagnosis
cancer.drop('diagnosis', axis=1, inplace=True)
cancer['diagnosis'] = y

In [8]:
cancer.head()

Unnamed: 0,id,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,842302,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,M
1,842517,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,M
2,84300903,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,M
3,84348301,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,M
4,84358402,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,M


# Building Decision Trees: Using sklearn

In [9]:
iris_X = iris.loc[:, iris.columns!= 'Species']
boston_X = boston.loc[:, boston.columns!= 'PTRATIO']
cancer_X = cancer.loc[:, cancer.columns!= 'diagnosis']

We see that the response variable is categorical for the iris and cancer datasets, so we have to encode it as a numeric type for evaluation later on. We can do that with the LabelEncoder class provided by sklearn.
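
As a rough illustration of what label encoding does, the following pure-Python sketch maps each distinct label to an integer in sorted label order. The function `label_encode` is a hypothetical helper written for this illustration, not part of sklearn; the notebook itself uses `LabelEncoder`.

```python
def label_encode(values):
    """Map each distinct label to an integer, in sorted label order."""
    classes = sorted(set(values))
    to_code = {label: code for code, label in enumerate(classes)}
    return [to_code[v] for v in values], classes


species = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]
codes, classes = label_encode(species)
print(codes)    # [0, 2, 1, 0]
print(classes)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

Keeping the `classes` list around makes it possible to translate integer predictions back to the original label names.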

In [10]:
le = preprocessing.LabelEncoder()
le.fit(iris['Species'])
iris['Species'] = le.transform(iris['Species'])
le.fit(cancer['diagnosis'])
cancer['diagnosis'] = le.transform(cancer['diagnosis'])

In [11]:
iris_Y = iris.loc[:, iris.columns == 'Species']
boston_Y = boston.loc[:, boston.columns == 'PTRATIO']
cancer_Y = cancer.loc[:, cancer.columns == 'diagnosis']

In [12]:
iris_clf = tree.DecisionTreeClassifier()
boston_clf = tree.DecisionTreeRegressor()
cancer_clf = tree.DecisionTreeClassifier()

In [13]:
iris_clf.fit(iris_X[:120], iris_Y[:120])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [14]:
boston_clf.fit(boston_X[:120], boston_Y[:120])

ValueError: could not convert string to float: Malden

**Why did we get an error here?**

Because scikit-learn, in its current form, does not handle categorical variables automatically. As we saw earlier, the Boston housing dataset contains a 'TOWN' attribute, which is categorical.

**How can we resolve this issue?**

We need to look at ways to convert the categorical data into numeric data. One way would be to one-hot encode the data. One-hot encoding creates as many features as there are categories, where each category is now represented by a binary (0/1) feature indicating its presence.
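
The encoding described above can be sketched in plain Python as follows. The helper `one_hot` is hypothetical and only meant to show the shape of the result; the town names are taken from the first rows of the Boston data shown earlier.

```python
def one_hot(values):
    """Return (categories, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


towns = ["Nahant", "Swampscott", "Swampscott", "Marblehead"]
categories, rows = one_hot(towns)
print(categories)  # ['Marblehead', 'Nahant', 'Swampscott']
for town, row in zip(towns, rows):
    print(town, row)
# Nahant [0, 1, 0]
# Swampscott [0, 0, 1]
# Swampscott [0, 0, 1]
# Marblehead [1, 0, 0]
```

Each row contains exactly one 1, so the number of added columns grows with the number of distinct categories.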

In [15]:
boston = pd.concat([boston, pd.get_dummies(boston['TOWN'])], axis=1)

In [16]:
boston.drop(columns=['TOWN', 'PTRATIO'], inplace=True)
boston['PTRATIO'] = boston_Y

In [17]:
boston_X = boston.drop('PTRATIO', axis=1)
boston_Y = boston['PTRATIO']

In [18]:
boston.head()

Unnamed: 0,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,...,Somerville,Stoneham,Swampscott,Topsfield,Wakefield,Wenham,Wilmington,Winchester,Woburn,PTRATIO
0,2011,-70.955,42.255,24.0,0.00632,18.0,2.31,0,0.538,6.575,...,0,0,0,0,0,0,0,0,0,15.3
1,2021,-70.95,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,...,0,0,1,0,0,0,0,0,0,17.8
2,2022,-70.936,42.283,34.7,0.02729,0.0,7.07,0,0.469,7.185,...,0,0,1,0,0,0,0,0,0,17.8
3,2031,-70.928,42.293,33.4,0.03237,0.0,2.18,0,0.458,6.998,...,0,0,0,0,0,0,0,0,0,18.7
4,2032,-70.922,42.298,36.2,0.06905,0.0,2.18,0,0.458,7.147,...,0,0,0,0,0,0,0,0,0,18.7


We have completed the one-hot encoding, as shown by the addition of new columns containing only binary values. Now we should be able to fit the data to the model correctly.

In [19]:
boston_clf.fit(boston_X[:120], boston_Y[:120])

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [20]:
cancer_clf.fit(cancer_X[:120], cancer_Y[:120])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

# Prediction: Using sklearn

In [21]:
y_act_iris = iris_Y[120:]
y_pred_skl_iris = iris_clf.predict(iris_X[120:])

In [22]:
y_act_boston = boston_Y[120:]
y_pred_skl_boston = boston_clf.predict(boston_X[120:])

In [23]:
y_act_cancer = cancer_Y[120:]
y_pred_skl_cancer = cancer_clf.predict(cancer_X[120:])

# Building Decision Trees: Using our Approach

In [24]:
%%time
dt_iris = st.build_tree(iris[:120], 'Species')
st.print_tree(dt_iris)

The splitter condition is: PetalWidthCm < 1.0
|--> True:
  Predict Counter({0: 50})
|--> False:
  The splitter condition is: PetalWidthCm < 1.7
  |--> True:
    The splitter condition is: PetalLengthCm != 5.0
    |--> True:
      Predict Counter({1: 48})
    |--> False:
      Predict Counter({2: 1})
  |--> False:
    The splitter condition is: PetalLengthCm != 4.8
    |--> True:
      The splitter condition is: PetalWidthCm >= 1.8
      |--> True:
        Predict Counter({2: 18})
      |--> False:
        The splitter condition is: PetalLengthCm != 4.5
        |--> True:
          Predict Counter({1: 1})
        |--> False:
          Predict Counter({2: 1})
    |--> False:
      Predict Counter({1: 1})
CPU times: user 4.79 s, sys: 47.8 ms, total: 4.83 s
Wall time: 5.11 s


In [None]:
%%time
dt_boston = st.build_tree(boston[:120], 'PTRATIO')
st.print_tree(dt_boston)

In [None]:
%%time
dt_cancer = st.build_tree(cancer[:120], 'diagnosis')
st.print_tree(dt_cancer)

# Prediction: Using our Approach

In [None]:
def getModeClass(dictionary):
    # Return the key with the largest value; .items() works on Python 2 and 3
    return max(dictionary.items(), key=operator.itemgetter(1))[0]

In [None]:
y_pred_oa_iris = []
for index, obs in iris[120:].iterrows():
    y_pred_oa_iris.append(getModeClass(st.print_leaf(st.classify(obs, dt_iris))))

In [None]:
y_pred_oa_boston = []
for index, obs in boston[120:].iterrows():
    y_pred_oa_boston.append(getModeClass(st.print_leaf(st.classify(obs, dt_boston))))

In [None]:
y_pred_oa_cancer = []
for index, obs in cancer[120:].iterrows():
    y_pred_oa_cancer.append(getModeClass(st.print_leaf(st.classify(obs, dt_cancer))))

# Evaluation: An extensive comparison of the two methods
We look at various evaluation metrics (such as precision, recall, accuracy, and F-1 score) for the two methods to give more insight into how performance varies with the approach taken.

## Computing Evaluation Metrics
We compare the model predictions obtained from sklearn and our approach, and evaluate them using F-1 Score, Accuracy, MAE, MSE.
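
For reference, the four scores can be written out by hand. The sketch below assumes Python 3 and uses made-up toy values only to exercise the functions; the helper names are hypothetical, and the notebook itself relies on `sklearn.metrics`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred, pos_label):
    """Harmonic mean of precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy labels and values, only to exercise the functions.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(accuracy(y_true, y_pred))          # 0.75
print(f1(y_true, y_pred, pos_label=1))   # precision 1.0, recall 0.5
print(mae([15.0, 18.0], [16.0, 18.0]))   # 0.5
print(mse([15.0, 18.0], [16.0, 18.0]))   # 0.5
```

Accuracy and F-1 apply to the classification datasets, while MAE and MSE apply to the regression dataset.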
### Using sklearn

In [26]:
# Uncomment to compute a particular score

# sklearn metrics take the true labels first: metric(y_true, y_pred)
# metrics.f1_score(y_act_iris.Species, y_pred_skl_iris, pos_label=2)
# metrics.accuracy_score(y_act_iris.Species, y_pred_skl_iris)

# metrics.mean_absolute_error(y_act_boston, y_pred_skl_boston)
# metrics.mean_squared_error(y_act_boston, y_pred_skl_boston)

# metrics.f1_score(y_act_cancer.diagnosis, y_pred_skl_cancer, pos_label=1)
# metrics.accuracy_score(y_act_cancer.diagnosis, y_pred_skl_cancer)

### Using Our Approach

In [27]:
# Uncomment to compute a particular score

# sklearn metrics take the true labels first: metric(y_true, y_pred)
# metrics.f1_score(y_act_iris.Species, y_pred_oa_iris, pos_label=2)
# metrics.accuracy_score(y_act_iris.Species, y_pred_oa_iris)

# metrics.mean_absolute_error(y_act_boston, y_pred_oa_boston)
# metrics.mean_squared_error(y_act_boston, y_pred_oa_boston)

# metrics.f1_score(y_act_cancer.diagnosis, y_pred_oa_cancer, pos_label=1)
# metrics.accuracy_score(y_act_cancer.diagnosis, y_pred_oa_cancer)

In [28]:
%%html
<style>
table {float:left}
</style>

The final tables below show the scores for each dataset used in the comparison.

## Iris species Dataset
| Method | Accuracy | F-1 Score |
|----------|:---------:|:--------:|
| **sklearn**  | 0.733    | 0.846 | 
| **Our method** | 0.833  | 0.846 | 

## Boston housing prices Dataset

| Method | MAE | MSE |
|----------|:---------:|:--------:|
| **sklearn**  | 2.22   | 11.08 | 
| **Our method** | 2.83 | 8.74 | 


## Breast Cancer diagnosis Dataset

| Method | Accuracy | F-1 Score |
|----------|:---------:|:--------:|
| **sklearn**  | 0.800   | 0.769 | 
| **Our method** | 0.733  | 0.714 | 

# Conclusion
From the above experiments we observe the following:
* We are able to perform classification and regression tasks on a variety of datasets and obtain performance scores close to those of the sklearn Python package.
* Additionally, our approach handles categorical data automatically, while sklearn requires some form of encoding (one-hot encoding in this case).

# References
* Some of the tree builder methods and other function definitions are based on the guidelines provided by [Google Developers Tutorials.](https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb)
* The datasets for the experiments were used from the [UCI Machine Learning Repository.](https://archive.ics.uci.edu/ml/datasets.html)