## Decision Tree classifier implementation in Python with sklearn Library

http://dataaspirant.com/2017/02/01/decision-tree-algorithm-python-with-scikit-learn/

### Import Libraries
<p>In this tutorial, we use <b><i>numpy</i></b>, <b><i>pandas</i></b> and <b><i>sklearn</i></b>. The first step, of course, is to import these libraries.</p>
<p>
<div class="btn-group">
  <a class="btn btn-info" href="http://www.numpy.org" style="border-radius: 2px">numpy</a>
  <a class="btn btn-info" href="http://pandas.pydata.org" style="border-radius: 2px">pandas</a>
  <a class="btn btn-info" href="http://scikit-learn.org/stable/" style="border-radius: 2px">sklearn</a>
</div>
</p>

In [25]:
import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

### Downloading Data
We use the pandas <b>read_csv()</b> function to load the data file directly from the UCI repository URL into a pandas DataFrame. The file has no header row, so we pass <b>header=None</b>.

In [26]:
balance_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
                           sep= ',', header= None)

### Balance Scale Data Set Description

The Balance Scale data set consists of 5 attributes: 4 feature attributes and 1 target attribute. We will build a classifier that predicts the class attribute. The target attribute is the first column (index 0).

1. Target-Attribute: 3 values (L, B, R)
2. Left-Weight: 5 values (1, 2, 3, 4, 5)
3. Left-Distance: 5 values (1, 2, 3, 4, 5)
4. Right-Weight: 5 values (1, 2, 3, 4, 5)
5. Right-Distance: 5 values (1, 2, 3, 4, 5)
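The attribute list above can be checked without downloading anything. The sketch below assumes the rule given in the UCI description of this data set: the class is decided by comparing left weight × left distance with right weight × right distance. The helper name `balance_class` is hypothetical and used only for illustration.

```python
from collections import Counter
from itertools import product


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R', or 'B' by comparing the two weight * distance products."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if right > left:
        return "R"
    return "B"


# Enumerate every combination of the four attributes (each takes values 1..5).
rows = [
    (balance_class(lw, ld, rw, rd), lw, ld, rw, rd)
    for lw, ld, rw, rd in product(range(1, 6), repeat=4)
]

class_counts = Counter(row[0] for row in rows)
print(len(rows))      # 625 records, one per attribute combination
print(class_counts)   # 288 'L', 288 'R' and 49 'B'
```

Under this assumption the classes are unbalanced: 'B' covers only 49 of the 625 records, which is worth keeping in mind when reading the accuracy figures later on.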

### 1. To view the shape of dataset

In [27]:
#run code here to view the shape of dataset
balance_data.shape

(625, 5)

### 2. To view the first five records of dataset

In [28]:
#run code here to view what data looks like
balance_data.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |


### 3. Data Slicing

Data slicing splits the data into a training set and a test set. The training set is used to build the model; the test set must be kept out of model building entirely. This also applies to standardization: any scaling statistics should be computed from the training set alone, never from the test set.
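Keeping the test set out of standardization is usually taken to mean that the scaling statistics are learned from the training split only, and the same statistics are then applied to any other data. The following is a minimal sketch of that idea in plain Python. The feature values and the helper names `fit_scaler` and `transform` are hypothetical and are not part of this tutorial's pipeline.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn the mean and standard deviation from training values only."""
    return mean(train_column), pstdev(train_column)


def transform(column, mu, sigma):
    """Scale values with statistics that were learned elsewhere."""
    return [(value - mu) / sigma for value in column]


# Hypothetical feature values, used only for illustration.
train_values = [1, 2, 3, 4, 5]
test_values = [2, 5]

# The statistics come from the training split ...
mu, sigma = fit_scaler(train_values)
train_scaled = transform(train_values, mu, sigma)
# ... and the test split reuses them unchanged.
test_scaled = transform(test_values, mu, sigma)
```

Decision trees split on thresholds and do not depend on feature scale, so this tutorial skips standardization; the sketch only illustrates the rule stated above.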

The snippet below divides the data into a feature set and a target set.
The “X” set consists of the predictor variables, taken from the 2nd to the 5th column. The “Y” set consists of the outcome variable, taken from the 1st column. We use the DataFrame's <b>“.values”</b> attribute to convert the data into a numpy array before slicing.

In [29]:

X = balance_data.values[:,1:5]
Y = balance_data.values[:,0]


### 4. Let’s split our data into training and test set.

<p>We will use sklearn’s <b>train_test_split()</b> method.</p>

The snippet below splits the data into a training set and a test set. X_train and y_train are the training data; X_test and y_test belong to the test set.

The parameter test_size is set to 0.3, which means the test set will be 30% of the whole dataset and the training set the remaining 70%. The random_state parameter seeds the pseudo-random number generator used for random sampling, so the same split is reproduced on every run. Here we use random_state = 100.

In [30]:

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.3, random_state = 100)
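To make the effect of test_size and random_state concrete, here is a small sketch in plain Python. It assumes the test portion is rounded up when the fraction does not divide the data evenly. The seeded shuffle is a stand-in for illustration only and is not scikit-learn's own sampling procedure; `split_sizes` and `shuffled_split` are hypothetical helper names.

```python
import math
import random


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def shuffled_split(items, test_size, seed):
    """Shuffle a copy of items with a seeded generator, then cut off the test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, _ = split_sizes(len(shuffled), test_size)
    return shuffled[:n_train], shuffled[n_train:]


n_train, n_test = split_sizes(625, 0.3)
print(n_train, n_test)  # 437 188

# The same seed reproduces the same split.
train_a, test_a = shuffled_split(range(625), 0.3, seed=100)
train_b, test_b = shuffled_split(range(625), 0.3, seed=100)
print(train_a == train_b and test_a == test_b)  # True
```

With 625 records and test_size = 0.3, this gives 437 training records and 188 test records.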


### 5. Decision Tree Training

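Now we fit the decision tree algorithm on the training data, predict labels for the test set, and print the accuracy of the model for two different split criteria.

DecisionTreeClassifier(): This is the scikit-learn class that implements the decision tree classifier. The important parameters used here are criterion (the measure of split quality), max_depth (the maximum depth of the tree), min_samples_leaf (the minimum number of samples required in a leaf node), and random_state (the seed that makes the result reproducible).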

In [31]:
# Decision Tree Classifier with criterion gini index
clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,max_depth=3, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')

<button type="button" class="btn btn-success" data-toggle="collapse" data-target="#info1">
Click here for more information</button>
<div id="info1" class="collapse">
<p style="color:#339933"><b>clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,max_depth=3, min_samples_leaf=5)</b><br>
<small><small>- Defines the model.</small></small><br>
<small><small>- The <b>criterion</b> parameter accepts "gini" or "entropy", the impurity measures used to evaluate the quality of a split.</small></small><br>
<small><small>- Setting <b>min_samples_leaf</b> to 5 means that every leaf node must contain at least 5 samples.</small></small><br>
<b>clf_gini.fit(X_train, y_train)</b><br>
<small><small>- Fits the model: the tree is grown by repeatedly splitting the training data.</small></small><br>
</p>

</div>
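The two criteria can be computed by hand for a single node. The sketch below uses the standard textbook definitions, Gini impurity 1 − Σ p², and entropy Σ p · log2(1/p), where p is the proportion of each class in the node. The function names and the label lists are hypothetical; this is not scikit-learn's internal code.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the proportion of each class present in a node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * math.log2(1 / p) for p in class_proportions(labels))


mixed = ["L", "L", "R", "R"]  # evenly mixed node
pure = ["L", "L", "L", "L"]   # node with a single class

print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
```

Both measures are zero for a pure node and largest for an evenly mixed one. A split is preferred when it reduces the impurity of the resulting child nodes the most.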

In [32]:
# Decision Tree Classifier with criterion information gain (entropy)
clf_entropy = DecisionTreeClassifier(criterion = "entropy", random_state = 100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')

### 6. Prediction

We have now trained two classifiers: one with the Gini index and one with information gain as the criterion. We are ready to predict classes for our test set using the <b>predict()</b> method, which expects a 2D array with one row per record. Let’s start by predicting the target variable for the test set’s 1st record.

In [33]:
# Let’s try to predict target variable for test set’s 1st record.
# Recall that the 1st record is [4, 4, 3, 3]
clf_gini.predict([[4, 4, 3, 3]])

array(['R'], dtype=object)

In [34]:
# Prediction for Decision Tree classifier with criterion as gini index
y_pred = clf_gini.predict(X_test)
y_pred

array(['R', 'L', 'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L',
       'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
       'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'L',
       'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R',
       'L', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'L', 'L',
       'L', 'L', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
       'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'R',
       'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
       'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
       'R', 'L', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R',
       'L', 'R', 'R', 'R', 'R', 'R', 'L', 'L', 'R', 'R', 'R', 'R', 'L',
       'R', 'R', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L

In [35]:
# Prediction for Decision Tree classifier with criterion as information gain(entropy)
y_pred_en = clf_entropy.predict(X_test)
y_pred_en

array(['R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'L',
       'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'L', 'R', 'L',
       'R', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
       'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'R',
       'L', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R',
       'L', 'L', 'R', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
       'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'R',
       'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'R',
       'L', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L',
       'R', 'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'R', 'R', 'R',
       'L', 'R', 'R', 'R', 'R', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L',
       'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R', 'R

### 7. Calculating Accuracy Score
<p>The function <b>accuracy_score()</b> is used to compute the accuracy of the decision tree classifier. By accuracy, we mean the ratio of correctly predicted data points to all predicted data points. As a metric, accuracy helps us judge the effectiveness of the model. The function takes 4 parameters.</p>
<p style = "color:#cc6600">
    y_true,<br>
    y_pred,<br>
    normalize,<br>
    sample_weight.<br>
</p>
<p>Out of these 4, normalize & sample_weight are optional parameters. The parameter y_true  accepts an array of correct labels and y_pred takes an array of predicted labels that are returned by the classifier. It returns accuracy as a float value.</p>
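The ratio described above is simple enough to write out directly. The following is a minimal sketch of the same calculation in plain Python. The function `accuracy` and the label lists are hypothetical, and sample_weight is left out for brevity.

```python
def accuracy(y_true, y_pred, normalize=True):
    """Return the fraction of matching labels, or the raw count if normalize is False."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true) if normalize else correct


# Hypothetical labels, used only for illustration.
y_true_example = ["L", "R", "B", "R"]
y_pred_example = ["L", "R", "L", "L"]

print(accuracy(y_true_example, y_pred_example))                   # 0.5
print(accuracy(y_true_example, y_pred_example, normalize=False))  # 2
```

Multiplying the normalized value by 100, as the cells below do, turns the fraction into a percentage.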

In [36]:
# Accuracy for Decision Tree classifier with criterion as gini index
print ('accuracy is ')
accuracy_score(y_test,y_pred)*100

accuracy is 


73.40425531914893

In [37]:
# Accuracy for Decision Tree classifier with criterion as information gain
print ('accuracy is ')
accuracy_score(y_test,y_pred_en)*100

accuracy is 


70.744680851063833