# Decision Trees in Python # 

Killian McKee

### Overview ### 

1. [What are Decision Trees?](#section1)
2. [Key Terms](#section2) 
3. [Pros and Cons of Decision Trees](#section3)
4. [When to use Decision Trees](#section4)
5. [Key Parameters](#section5) 
6. [Walkthrough: Building a Classification Tree](#section6) 
7. [Walkthrough: Building a Regression Tree](#section7)
8. [Conclusion](#section8) 
9. [Additional Reading](#section9)

<a id='section1'></a>

### What are Decision Trees? ###

Decision trees are a family of non-parametric algorithms capable of handling both classification and regression tasks. A decision tree takes input features and splits input data recursively based on those features (see example below). These splits ultimately lead to a shape resembling a tree with roots, nodes, and leaves (more on these later). 

<img src='dt_ex_simple.png'>
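To make the structure concrete, here is a minimal sketch of a tree written by hand as nested dictionaries, with a function that walks from the top of the tree down to a prediction. The feature names, thresholds, and predictions are made up for illustration; a real tree learns them from data.

```python
# A hand-written toy tree (hypothetical features and thresholds, not learned from data).
# Internal nodes hold a feature name and a threshold; terminal nodes hold a prediction.
toy_tree = {
    "feature": "age",
    "threshold": 15,
    "left": {"prediction": "child"},            # taken when age <= 15
    "right": {                                  # taken when age > 15
        "feature": "income",
        "threshold": 50_000,
        "left": {"prediction": "adult, lower income"},
        "right": {"prediction": "adult, higher income"},
    },
}


def predict(node, sample):
    """Follow one branch per split until a terminal node is reached."""
    while "prediction" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]


print(predict(toy_tree, {"age": 9, "income": 0}))
print(predict(toy_tree, {"age": 40, "income": 72_000}))
```

Fitting a decision tree amounts to choosing these features and thresholds automatically, one split at a time.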

<a id='section2'></a>

### Key Terms ###

This section reviews the terminology we need to understand before diving more deeply into decision trees. 

**Root**: The top of the decision tree. It holds the first split, which is made on the variable that best separates the data for the question being asked. For example, if our question is 'did a certain passenger survive their voyage on the Titanic?', our root might split on the passenger's gender, since women were much more likely to survive than men.

**Branches**: The paths leading out of a split, one per outcome. Each branch leads either to a further split or to a leaf.

**Node**: Each split representing an 'either-or' scenario beneath the initial root. For example, in our Titanic example our root question was 'male or female?', but a node beneath that might ask 'age>15?' with the branches 'yes' and 'no', as young children were put into lifeboats before adults.

**Leaf**: A terminal node in the tree, where no further split is made and a prediction is returned.

**Pruning**: The act of removing nodes and branches from the tree in order to avoid overfitting and improve performance on unseen data.

**Purity**: Used by the algorithms that decide where to split a decision tree. A node is 100% pure when all of its data belongs to a single class and maximally impure when its data is split evenly across the classes. An effective split produces child nodes that are as pure as possible.

**Gini Index**: The Gini index measures the impurity of a node by calculating how often a randomly chosen element would be labeled incorrectly if it were labeled at random according to the class distribution in that node. A Gini index of 0 means the node is perfectly pure: all of its data belongs to one class.

**Information Gain**: The metric used to determine which feature to split on, and at which point, at each step in the tree. 
information gain = entropy(parent) - weighted sum of entropy(children). Repeatedly maximizing information gain until every leaf is pure leads to a model with very low bias at the expense of very high variance (it overfits the training data). To overcome this shortcoming we can employ pruning, boosting, and random forests (boosting and random forests are discussed in separate guides). 
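The definitions above can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard formulas, not the code scikit-learn uses internally; the function names and the toy labels are our own.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """entropy(parent) minus the size-weighted sum of entropy(children)."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["L", "L", "R", "R"]  # an evenly split (maximally impure) two-class node

print(gini(parent))                                        # 0.5
print(entropy(parent))                                     # 1.0
print(information_gain(parent, [["L", "L"], ["R", "R"]]))  # 1.0
print(information_gain(parent, [["L", "R"], ["L", "R"]]))  # 0.0
```

The third call is a split that separates the classes perfectly, so it recovers all of the parent's entropy. The fourth split leaves both children as mixed as the parent and gains nothing.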




<img src='dt_example.png'>



#### Refresher on the Bias/Variance Trade Off ####
<img src='bias_variance.png' width="400">

<a id='section3'></a>

### Pros and Cons of Decision Trees ###

Pros: 

1. **Easy to visualize**: Decision trees are easy to draw schematically and easier to display than many other algorithms. With a tree diagram we can quickly see how the data is broken down, where each decision is made, how changes to the parameters alter the model, and which features matter most. 
2. **Flexible**: Decision trees can be used in both classification and regression tasks, can (in some implementations) handle missing data more readily than many other algorithms, and can be easily pruned/tuned to alter model performance. 
3. **Fast**: Decision trees are fast to fit since they typically use greedy algorithms that pick the best split one step at a time, as opposed to evaluating every possible tree. 
4. **Easy to implement**: Decision trees are straightforward to fit thanks to packages like scikit-learn.

Cons: 

1. **Prone to overfitting**: Decision trees are prone to overfitting. We can overcome this by pruning our trees or by using a boosted decision tree or random forest (extensions of decision trees, discussed in separate guides). 
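As a rough sketch of how pruning counters overfitting, the example below fits an unpruned tree and a pruned tree on synthetic data and compares them. It assumes a recent version of scikit-learn that supports the `ccp_alpha` (cost-complexity pruning) argument; the dataset and the value 0.01 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Grown without limits: the tree keeps splitting until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ccp_alpha > 0 removes subtrees whose impurity reduction does not justify their size
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01)
pruned_tree.fit(X_train, y_train)

for name, model in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(name,
          "| leaves:", model.get_n_leaves(),
          "| train accuracy:", round(model.score(X_train, y_train), 3),
          "| test accuracy:", round(model.score(X_test, y_test), 3))
```

A large gap between training and test accuracy on the unpruned tree is the usual symptom of overfitting. Whether pruning actually improves test accuracy depends on the data and on the chosen `ccp_alpha`, so it is worth trying several values.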


<a id='section4'></a>

### When to Use Decision Trees ###

Decision trees are an appropriate starting point for most classification and regression tasks. They are especially valuable when you want to be able to visualize your model easily and understand its component pieces, or when you need a model that runs very quickly. If your primary concern is accuracy and you have a solid understanding of decision trees, random forests and boosted decision trees are usually more accurate extensions of the decision tree framework (covered in separate guides). 

<a id='section5'></a>

### Key Parameters ### 

There are five main levers we can move to influence our decision trees: 

1. Criterion: the function used to measure the quality of a split. The typical default is the Gini index, with entropy (information gain) as the usual alternative. 
2. Splitter: the strategy used to split a node, typically either 'best' or 'random'. 'Best' typically leads to more accurate models, but 'random' can help with overfitting. 
3. Max Depth: how deep the decision tree is allowed to grow. A deeper decision tree has lower bias but usually higher variance, since it can fit the training data more closely. 
4. Min Samples Split: the minimum number of samples required to split a node of our decision tree. Increasing this parameter constrains our decision tree by preventing it from splitting nodes that contain only a few samples. 
5. Min Samples Leaf: the minimum number of samples required to be at each leaf. Increasing this number can help smooth out the model (especially in regression), but constrains the growth of the model. 
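In scikit-learn, the five levers above correspond to constructor arguments of `DecisionTreeClassifier`. The sketch below sets all five on synthetic data and then checks that the fitted tree respects the constraints; the specific values are arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

clf = DecisionTreeClassifier(
    criterion="entropy",     # 1. Criterion: "gini" (default) or "entropy"
    splitter="best",         # 2. Splitter: "best" or "random"
    max_depth=3,             # 3. Max Depth
    min_samples_split=20,    # 4. Min Samples Split
    min_samples_leaf=10,     # 5. Min Samples Leaf
    random_state=0,
)
clf.fit(X, y)

# Leaves are the nodes without children (marked with -1 in the fitted tree arrays)
is_leaf = clf.tree_.children_left == -1
smallest_leaf = int(clf.tree_.n_node_samples[is_leaf].min())

print("depth:", clf.get_depth())
print("number of leaves:", clf.get_n_leaves())
print("samples in smallest leaf:", smallest_leaf)
```

Tightening any of the last three parameters (smaller `max_depth`, larger `min_samples_split` or `min_samples_leaf`) produces a smaller tree, trading some bias for lower variance.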

<a id='section6'></a>

### Building a Classification Decision Tree with Python ### 


We will build a classification model on the balance scale dataset, which was generated to model the results of psychological experiments. Our aim is to accurately classify whether the scale tips to the left (L), tips to the right (R), or stays balanced (B), based on the weights and distances on each side. More info is available on the data [here](http://archive.ics.uci.edu/ml/datasets/balance+scale). 

In [2]:
# import necessary packages 

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.metrics import classification_report, confusion_matrix  

In [3]:
# import the data 

balance_data = pd.read_csv(
'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data',
                           sep= ',', header= None)


In [4]:
# initial dataset examination

print("Dataset Length: ", len(balance_data))
print("Dataset Shape: ", balance_data.shape)
balance_data.head()

Dataset Length:  625
Dataset Shape:  (625, 5)


|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | B | 1 | 1 | 1 | 1 |
| 1 | R | 1 | 1 | 1 | 2 |
| 2 | R | 1 | 1 | 1 | 3 |
| 3 | R | 1 | 1 | 1 | 4 |
| 4 | R | 1 | 1 | 1 | 5 |


In [5]:
# next, we separate the features (columns 1-4) from the target variable (column 0, the class label)
# so that we can split both into training and testing data 

x = balance_data.values[:, 1:5]
y = balance_data.values[:,0]

In [6]:
# Now we perform the train-test split on the data 

X_train, X_test, y_train, y_test = train_test_split( x, y, test_size = 0.3, random_state = 100)

In [7]:
# here we fit the decision classifier with the gini criterion selected

clf_gini = DecisionTreeClassifier(criterion = "gini", random_state = 100,
                               max_depth=10, min_samples_leaf=5)
clf_gini.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=100,
            splitter='best')
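Since ease of visualization is one of the main selling points of decision trees, it is worth looking at the rules a fitted tree has learned. The sketch below uses `export_text` from `sklearn.tree` (available in recent scikit-learn versions). So that it runs without a download, it regenerates the balance scale data from the rule given in the UCI description (compare weight times distance on each side) rather than reading the file; that rule, the column names, and the shallow `max_depth=2` are assumptions made for this illustration.

```python
from collections import Counter
from itertools import product

from sklearn.tree import DecisionTreeClassifier, export_text

# Column names follow the UCI description; the downloaded file itself has no header
feature_names = ["left_weight", "left_distance", "right_weight", "right_distance"]


def balance_class(lw, ld, rw, rd):
    """Label a row by comparing weight * distance on the left and right sides."""
    left, right = lw * ld, rw * rd
    if left == right:
        return "B"
    return "L" if left > right else "R"


# Every combination of the four attributes taking values 1..5 (5**4 = 625 rows)
rows = list(product(range(1, 6), repeat=4))
labels = [balance_class(*row) for row in rows]
print(Counter(labels))

# A deliberately shallow tree so that the printed rules stay readable
small_tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=100)
small_tree.fit(rows, labels)

rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```

Each indented line in the printed rules is one split, and each `class:` line is a leaf. The same call works on `clf_gini`, although a tree of depth 10 prints far more rules.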

In [8]:
# let's try and use our classifier to predict a single value 

clf_gini.predict([X_test[0]])

array(['L'], dtype=object)

In [9]:
# now let's predict on the entire test set 

y_pred=clf_gini.predict(X_test)
y_pred

array(['L', 'L', 'R', 'L', 'R', 'L', 'R', 'L', 'L', 'L', 'L', 'L', 'R',
       'L', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'L', 'B', 'L', 'R', 'L',
       'L', 'L', 'R', 'L', 'L', 'L', 'R', 'L', 'L', 'L', 'L', 'R', 'L',
       'L', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'B', 'R', 'L', 'L', 'R',
       'L', 'L', 'R', 'L', 'R', 'R', 'L', 'B', 'L', 'R', 'L', 'L', 'L',
       'R', 'R', 'R', 'R', 'L', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'R', 'L', 'R', 'L', 'R', 'R', 'L', 'R', 'L', 'R',
       'R', 'L', 'L', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'L', 'R', 'R',
       'R', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'L', 'L', 'L', 'L', 'L', 'L', 'R',
       'R', 'R', 'R', 'L', 'R', 'R', 'R', 'L', 'L', 'R', 'L', 'L', 'L',
       'R', 'L', 'L', 'R', 'L', 'R', 'R', 'L', 'L', 'L', 'R', 'R', 'R',
       'L', 'R', 'R', 'B', 'R', 'R', 'B', 'L', 'R', 'R', 'L', 'R', 'R',
       'R', 'L', 'R', 'B', 'R', 'L', 'R', 'R', 'L', 'R', 'R', 'B

In [10]:
#let's see how the model performed by creating a confusion matrix and a classification report

print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))  

[[ 0  3 10]
 [ 2 77  6]
 [ 5  8 77]]
             precision    recall  f1-score   support

          B       0.00      0.00      0.00        13
          L       0.88      0.91      0.89        85
          R       0.83      0.86      0.84        90

avg / total       0.79      0.82      0.81       188



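From the cell above, we can see that our model had an accuracy of around 82% (154 of the 188 test samples were classified correctly), which is acceptable on a dataset this small and with minimal parameter tuning. Note, however, that none of the 13 samples in class B were classified correctly, so the overall accuracy hides a complete failure on the smallest class.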
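To see where the numbers in the classification report come from, we can recompute them by hand from the confusion matrix printed above. This sketch uses plain Python and copies the matrix values directly from that output; rows are the true classes and columns are the predicted classes, both in the order B, L, R.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes, in the order B, L, R
labels = ["B", "L", "R"]
cm = [
    [0, 3, 10],
    [2, 77, 6],
    [5, 8, 77],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))  # the diagonal
accuracy = correct / total
print(f"accuracy: {correct}/{total} = {accuracy:.3f}")

for i, label in enumerate(labels):
    actual = sum(cm[i])                    # row sum: true members of the class
    predicted = sum(row[i] for row in cm)  # column sum: times the class was predicted
    recall = cm[i][i] / actual if actual else 0.0
    precision = cm[i][i] / predicted if predicted else 0.0
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} support={actual}")
```

The first row of the matrix shows what happened to class B: of its 13 test samples, 3 were predicted as L and 10 as R.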

<a id='section7'></a>

### Building a Regression Decision Tree with Python ### 

In this walkthrough we will build a decision tree regressor to predict median home values in Boston based on data such as the crime rate, the age of the housing stock, property taxes, etc. 

In [11]:
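# import the necessary packages
# note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this walkthrough requires an older version of scikit-learn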

from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn import tree,metrics
import numpy as np 
import pandas as pd 

In [12]:
# Load the dataset 

boston=load_boston()

In [13]:
# examine the dataset

# sklearn documentation on the dataset
print(boston.DESCR)

# check out the column headers
print('these are the feature columns', boston.feature_names)

# check out the feature data 
print('\n this is the data:\n',boston['data'])

# check out the target data 
print('\n this is the target:',boston['target'])


Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [14]:
# split the data into our x and y columns 

x=boston.data
y=boston.target

In [15]:
# split the data into training and test sets 

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0) 

In [16]:
# create and fit the regressor 

regressor = DecisionTreeRegressor(min_samples_split=30, min_samples_leaf=10)  
regressor.fit(X_train, y_train)  

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=10,
           min_samples_split=30, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [17]:
#generate some predictions 

y_pred = regressor.predict(X_test)  

In [18]:
# comparing actual to predicted values, we see that most predictions are close, but some (e.g. row 1) miss badly

df=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})  
df  

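|   | Actual | Predicted |
|---|--------|-----------|
| 0 | 22.6 | 23.935714 |
| 1 | 50.0 | 20.063636 |
| 2 | 23.0 | 23.430769 |
| 3 | 8.3 | 9.265385 |
| 4 | 21.2 | 21.272727 |
| 5 | 19.9 | 19.490476 |
| 6 | 20.6 | 21.272727 |
| 7 | 18.7 | 19.490476 |
| 8 | 16.1 | 20.929630 |
| 9 | 18.6 | 21.272727 |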
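Several rows above share an identical predicted value (21.272727 appears three times). This is how a regression tree predicts: every sample that lands in the same leaf receives the same value, namely the mean of the training targets in that leaf. The sketch below illustrates this with a single greedy split on one feature, chosen to minimise the squared error. The data and function names are made up for illustration and are not taken from the Boston dataset.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_samples_leaf=1):
    """Greedy search for the threshold on one feature that minimises total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_samples_leaf, len(pairs) - min_samples_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            # each side predicts the mean of its own targets
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best


# Made-up data: number of rooms vs. price (illustrative values only)
rooms = [4.0, 4.5, 5.0, 6.5, 7.0, 7.5]
price = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]

cost, threshold, left_mean, right_mean = best_split(rooms, price)
print("split at rooms <=", threshold)      # 5.75
print("prediction on the left:", left_mean)    # 11.0
print("prediction on the right:", right_mean)  # 31.0
```

A full regression tree repeats this search over every feature and then recurses into each side, stopping when limits such as `min_samples_split` and `min_samples_leaf` are reached.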


In [19]:
# getting the model metrics
# we can see from the values below that our model did an okay, but not great, job at making predictions. 

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred))) 
print('R^2:',metrics.r2_score(y_test, y_pred))

Mean Absolute Error: 3.171497451350393
Mean Squared Error: 30.984776299315893
Root Mean Squared Error: 5.566397066264308
R^2: 0.6194845847927618
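For reference, the four metrics above can be computed by hand from their standard definitions. The sketch below does so in plain Python on the ten actual/predicted pairs shown earlier. Because it uses only those ten rows rather than the full test set, its results will differ from the values printed above.

```python
from math import sqrt

# The ten actual/predicted pairs displayed earlier (not the full test set)
actual = [22.6, 50.0, 23.0, 8.3, 21.2, 19.9, 20.6, 18.7, 16.1, 18.6]
predicted = [23.935714, 20.063636, 23.430769, 9.265385, 21.272727,
             19.490476, 21.272727, 19.490476, 20.929630, 21.272727]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = sqrt(mse)                        # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R^2:", r2)
```

Because the errors are squared, a single large miss (the home valued at 50.0 but predicted at about 20.1) dominates the MSE and RMSE while affecting the MAE much less.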


<a id='section8'></a>

### Conclusion ### 

In this guide we stepped through decision trees and learned that they are useful for both classification and regression tasks. We covered the key terminology we need to know to call ourselves proper data science arborists, some pros and cons (beware of overfitting!), and the key parameters. Lastly, we walked through a basic classification tree and a basic regression tree, classifying balance scale data and predicting housing prices respectively. Moving forward, check out the guides on boosted decision trees and random forests, which are more powerful extensions of the decision tree.

<a id='section9'></a>

### Additional Reading ### 

1. Math behind decision trees [here](https://medium.com/@srnghn/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3)
2. More on the sklearn documentation [here](https://scikit-learn.org/stable/modules/tree.html)

### Sources ### 
1. https://www.youtube.com/watch?v=BqOgaENTr08
2. https://towardsdatascience.com/random-forests-and-the-bias-variance-tradeoff-3b77fee339b4
3. http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf
4. https://chrisalbon.com/machine_learning/trees_and_forests/random_forest_classifier_example/ 
5. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
6. http://dataaspirant.com/2017/02/01/decision-tree-algorithm-python-with-scikit-learn/

