# Comparing Random Forest to Decision Trees -- Kristofer Schobert

Here, we will use a dataset from Kaggle to predict whether a patient has heart disease. We will first use a Random Forest Classifier, then try to beat its cross-validation score with a single decision tree. We will also compare the runtimes of the two models to get a sense of the computational demand of each.

The data can be found here:

https://www.kaggle.com/ronitf/heart-disease-uci/version/1

In [62]:
# importing necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
from sklearn import tree
import time
%matplotlib inline

In [2]:
# importing dataset
df = pd.read_csv('heart.csv')

In [61]:
# viewing the first few rows of the data
# The final column, target, is our output variable.
# It has values of 0 (patient does not have heart disease) and 1 (patient has heart disease).
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1       1
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2       1
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2       1
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2       1
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2       1


In [34]:
# Separating the outcome variable from the rest of the data.
# X is our model's input variables.
# Y is the outcome variable.
X = df.drop('target', axis=1)
Y = df['target']

## Random Forest Classifier

We will now run the random forest classifier. Its output differs from run to run because each tree in the forest is trained on a random sample of rows drawn from our dataset, and each split considers only a randomly selected subset of the features.
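
To make these two sources of randomness concrete, here is a minimal sketch of the idea, built by hand from single trees. It is not scikit-learn's internal implementation (which averages class probabilities rather than counting hard votes). The synthetic dataset from `make_classification`, the helper names `fit_toy_forest` and `predict_toy_forest`, and the choice of 25 trees are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

def fit_toy_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees trees, each on its own bootstrap sample of the rows."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw len(X) row indices with replacement.
        rows = rng.integers(0, len(X), size=len(X))
        # max_features="sqrt" makes every split look at a random feature subset.
        t = DecisionTreeClassifier(
            max_features="sqrt", random_state=int(rng.integers(0, 2**31 - 1))
        )
        trees.append(t.fit(X[rows], y[rows]))
    return trees

def predict_toy_forest(trees, X):
    """Majority vote over the trees' 0/1 predictions."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

toy_forest = fit_toy_forest(X_demo, y_demo)
toy_pred = predict_toy_forest(toy_forest, X_demo)
print("Training accuracy of the toy forest:", (toy_pred == y_demo).mean())
```

Changing `seed` changes both the bootstrap samples and the feature subsets, which is why repeated runs of a forest give slightly different scores.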

Let's run the model ten times. We will observe the runtime for the total of these ten models, and the accuracy of each. 

In [69]:
# Creating a for loop to run the model 10 times
# Calculating the runtime for the 10 random forest models
start_time = time.time()

for i in range(10):
    # rfc will be the name of our random forest classifier
    rfc = ensemble.RandomForestClassifier()
    
    # We will examine the mean and 2*sigma value of the cross validation scores 
    # for each of the 10 iterations of model. 
    score = cross_val_score(rfc, X, Y, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
    
print("\n--- %s seconds ---" % (time.time() - start_time))    


Accuracy: 0.80 (+/- 0.12)
Accuracy: 0.81 (+/- 0.10)
Accuracy: 0.81 (+/- 0.14)
Accuracy: 0.79 (+/- 0.16)
Accuracy: 0.80 (+/- 0.17)
Accuracy: 0.82 (+/- 0.13)
Accuracy: 0.82 (+/- 0.12)
Accuracy: 0.82 (+/- 0.16)
Accuracy: 0.81 (+/- 0.10)
Accuracy: 0.80 (+/- 0.15)

--- 1.4997000694274902 seconds ---


Our Random Forest Classifier has a mean accuracy of roughly 80%, with the ten runs ranging from 79% to 82%. The ten runs, each a 10-fold cross-validation, took about 1.5 seconds in total, which is a rather long time for a dataset containing only about 300 rows.

Let's see if we can create a Decision Tree that has a similar or better accuracy. We will also compare its runtime to that of the Random Forest above.

## Decision Tree Classifier 1

In [70]:
# Creating a for loop to run the model 10 times
# Calculating the runtime for the 10 decision trees
start_time = time.time()

for i in range(10):
    # we will at first use sklearn's default settings, apart from the entropy splitting criterion
    decision_tree = tree.DecisionTreeClassifier(
        criterion='entropy',
        #max_features=1,
        #max_depth=1,
    )
    
    # We will examine the mean and 2*sigma value of the cross validation scores 
    # for each of the 10 iterations of model. 
    
    score = cross_val_score(decision_tree, X, Y, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
print("\n--- %s seconds ---" % (time.time() - start_time)) 

Accuracy: 0.77 (+/- 0.13)
Accuracy: 0.76 (+/- 0.12)
Accuracy: 0.79 (+/- 0.14)
Accuracy: 0.78 (+/- 0.14)
Accuracy: 0.79 (+/- 0.11)
Accuracy: 0.79 (+/- 0.13)
Accuracy: 0.79 (+/- 0.09)
Accuracy: 0.78 (+/- 0.09)
Accuracy: 0.76 (+/- 0.10)
Accuracy: 0.76 (+/- 0.12)

--- 0.3110029697418213 seconds ---


The accuracy of these decision trees is close to, but not as high as, the random forest's. The mean accuracy over the ten runs here is roughly 78%. None of the ten runs has a mean accuracy at or above 80%. Also, the spread of the cross-validation scores (the +/- values) is a bit smaller for these decision trees than for the random forests.

The runtime for the decision trees (about 0.3 seconds) is far less than that for the random forests (about 1.5 seconds).
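
The timing pattern used in both loops can be wrapped in a small helper so that the two models are measured in exactly the same way. This is only a sketch: `time_cv` is a hypothetical helper name, `time.perf_counter` is swapped in for `time.time` because it is intended for measuring intervals, and the synthetic data stands in for `heart.csv`.

```python
import time
from sklearn import ensemble, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

def time_cv(make_model, X, y, repeats=2, cv=10):
    """Return (mean accuracy over the repeats, elapsed seconds)."""
    start = time.perf_counter()
    means = [cross_val_score(make_model(), X, y, cv=cv).mean() for _ in range(repeats)]
    elapsed = time.perf_counter() - start
    return sum(means) / len(means), elapsed

timings = {
    "random forest": time_cv(ensemble.RandomForestClassifier, X_demo, y_demo),
    "decision tree": time_cv(
        lambda: tree.DecisionTreeClassifier(criterion="entropy"), X_demo, y_demo
    ),
}
for name, (acc, secs) in timings.items():
    print(f"{name}: accuracy {acc:.2f}, {secs:.2f} s")
```

Absolute times depend on the machine and on the scikit-learn version, so only the ratio between the two models is worth comparing.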

We chose entropy (information gain) as the criterion for choosing each split.
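
As a rough illustration of what this criterion measures, the sketch below computes Shannon entropy and the information gain of one candidate split by hand. The function names and the eight toy labels are made up for this example; scikit-learn performs the equivalent calculation internally for every candidate split.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy labels (hypothetical): 4 patients with heart disease, 4 without.
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# A candidate split that sends three positives left and everything else right.
left, right = parent[:3], parent[3:]

print("parent entropy:", entropy(parent))  # a 50/50 node has 1 bit of entropy
print("gain of the candidate split:", information_gain(parent, left, right))
print("gain of a perfect split:", information_gain(parent, parent[:4], parent[4:]))
```

At each node the tree picks the split with the largest gain, so a split that separates the classes perfectly always wins over a partial one.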

The default for max_features (the number of features considered when forming a split) is to use all of the features.

The default for max_depth is no limit: the tree grows to whatever depth is necessary to form leaves that are pure.
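
These defaults can be checked on a fitted tree. The sketch below uses a synthetic 13-feature dataset as a stand-in for the heart disease data (an assumption), then reads back how many features each split considered and how deep the unconstrained tree grew.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 13 input features (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# No max_features and no max_depth: both are left at their defaults.
default_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
default_tree.fit(X_demo, y_demo)

print("features considered per split:", default_tree.max_features_)
print("depth reached:", default_tree.get_depth())
print("number of leaves:", default_tree.get_n_leaves())
```

The depth and leaf count depend on the data, so the numbers for `heart.csv` will differ from the ones printed here.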

Let's try setting some of these parameters to help increase the decision tree's accuracy.

## Decision Tree Classifier 2

In [96]:
# Creating a for loop to run the model 10 times
# Calculating the runtime for the 10 decision trees
start_time = time.time()

for i in range(10):
    # this time we limit max_features and max_depth
    decision_tree = tree.DecisionTreeClassifier(
        criterion='entropy',
        max_features=10,
        max_depth=3,
    )
    
    # We will examine the mean and 2*sigma value of the cross validation scores 
    # for each of the 10 iterations of model. 
    
    score = cross_val_score(decision_tree, X, Y, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
print("\n--- %s seconds ---" % (time.time() - start_time)) 

Accuracy: 0.79 (+/- 0.12)
Accuracy: 0.80 (+/- 0.11)
Accuracy: 0.79 (+/- 0.08)
Accuracy: 0.81 (+/- 0.13)
Accuracy: 0.80 (+/- 0.11)
Accuracy: 0.80 (+/- 0.11)
Accuracy: 0.80 (+/- 0.12)
Accuracy: 0.80 (+/- 0.11)
Accuracy: 0.79 (+/- 0.11)
Accuracy: 0.82 (+/- 0.13)

--- 0.2963099479675293 seconds ---


By lowering the number of features available at each split and limiting the tree to a depth of only 3, we have increased the accuracy of our decision tree! Using a max_depth of 2 or 4 lowers the accuracy, so three seems to be a good fit for our tree. It likely reduces overfitting: a depth of 2 or less seems to lead to underfitting, and 4 or more to overfitting.
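
A comparison across depths like this one can be automated with a short loop. The following is a sketch under stated assumptions: `depth_sweep` is a hypothetical helper, the synthetic data stands in for `heart.csv`, and `random_state=0` is fixed only so that the sweep is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

def depth_sweep(X, y, depths=range(1, 7), cv=10):
    """Map each max_depth to its mean cross-validated accuracy."""
    results = {}
    for depth in depths:
        model = DecisionTreeClassifier(
            criterion="entropy", max_features=10, max_depth=depth, random_state=0
        )
        results[depth] = cross_val_score(model, X, y, cv=cv).mean()
    return results

sweep = depth_sweep(X_demo, y_demo)
for depth, acc in sweep.items():
    print(f"max_depth={depth}: {acc:.2f}")
```

With the notebook's own data the call would be `depth_sweep(X, Y)`. The best depth found on synthetic data says nothing about the best depth for the heart disease data.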

While this is not as accurate as our random forest, it is very close! It is also much less computationally expensive, as is apparent from the runtime.

# Conclusion

While Random Forests are generally more accurate, since they average many Decision Trees each trained on a random sample of the data, they are more computationally expensive. The computer needs to build many Decision Trees and evaluate test data by running each data point through every one of these trees.
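
The cost described here is visible in a fitted model: every tree is stored in `estimators_`, and a prediction has to pass through all of them. The sketch below, on synthetic stand-in data with an arbitrarily chosen 20 trees (both assumptions), averages the per-tree class probabilities by hand and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart disease data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Run every data point through every tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([t.predict_proba(X_demo) for t in forest.estimators_])

# Averaging the per-tree probabilities reproduces the forest's prediction.
manual_proba = per_tree.mean(axis=0)
matches = np.allclose(manual_proba, forest.predict_proba(X_demo))

print("trees evaluated per prediction:", len(forest.estimators_))
print("manual average matches forest.predict_proba:", matches)
```

Both fitting and prediction therefore scale with the number of trees, which is the source of the longer runtime.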

On the other hand, Decision Trees are prone to overfitting the data. They branch based on the training data, so an unconstrained tree fits the training data perfectly. However, slight differences between the training data and the testing data can lead to misclassification of the testing data.
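
One way to see this gap is to hold out part of the data and score an unconstrained tree on both parts. This is a sketch on synthetic data with deliberately noisy labels (`flip_y=0.1`); the dataset and the 70/30 split are assumptions, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise (an assumption for this sketch).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# No depth limit: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
full_tree.fit(X_train, y_train)

train_acc = full_tree.score(X_train, y_train)
test_acc = full_tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

A perfect training score paired with a lower test score is the usual signature of overfitting.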

Limiting a Decision Tree to a certain depth can increase the accuracy of the tree. It limits the model's tendency to overfit the training data. By stopping the tree at a certain depth, the leaves represent general trends in the training data rather than an exact description of it.

