## If a tree falls in the Forest: Random Forest vs. Decision Tree [accuracy & cost test]
In this notebook, I compare the Decision Tree algorithm against Random Forest. A Random Forest classifier is an ensemble of decision trees, which generally makes it more robust than a single tree. For this comparison, I picked a Pulsar Star dataset from Kaggle. The goal is to compare the runtime and accuracy of the two algorithms.
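
As a rough illustration of the comparison described above, here is a minimal sketch that times cross-validation for both classifiers. The synthetic data from `make_classification`, the helper name `time_model`, and the settings (`n_samples`, `n_estimators`, `cv`) are assumptions for illustration only; they are not the pulsar data or the settings used in the cells below.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 continuous features, binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)


def time_model(model, X, y, cv=5):
    """Return (mean cross-validated accuracy, elapsed seconds) for one model."""
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    elapsed = time.perf_counter() - start
    return scores.mean(), elapsed


results = {
    "decision tree": time_model(DecisionTreeClassifier(random_state=0), X_demo, y_demo),
    "random forest": time_model(
        RandomForestClassifier(n_estimators=20, random_state=0), X_demo, y_demo
    ),
}

for name, (acc, secs) in results.items():
    print(f"{name}: accuracy={acc:.3f}, time={secs:.2f}s")
```

Timing only the cross-validation call keeps data loading and preprocessing out of the measurement, so the two numbers reflect the models themselves.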

In [53]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import ensemble
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeClassifier

import time

In [54]:
# load the dataset
df = pd.read_csv(r'C:\Users\hafeez_poldz\Desktop\Thinkful\Unit 3\data\pulsar_stars.csv')
df.head()               

|  | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.5625 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.88243 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.75 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.95928 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.17893 | 11.46872 | 14.269573 | 252.567306 | 0 |


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 Mean of the integrated profile                  17898 non-null float64
 Standard deviation of the integrated profile    17898 non-null float64
 Excess kurtosis of the integrated profile       17898 non-null float64
 Skewness of the integrated profile              17898 non-null float64
 Mean of the DM-SNR curve                        17898 non-null float64
 Standard deviation of the DM-SNR curve          17898 non-null float64
 Excess kurtosis of the DM-SNR curve             17898 non-null float64
 Skewness of the DM-SNR curve                    17898 non-null float64
target_class                                     17898 non-null int64
dtypes: float64(8), int64(1)
memory usage: 1.2 MB


Our dataset has 9 columns and 17,898 rows, with no missing values. The 8 input variables are continuous, and only the outcome variable (target_class) is binary: 1 for a pulsar star, 0 for a non-pulsar.
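
Because the outcome is binary, it may be worth checking how the two classes are distributed before reading too much into accuracy. The sketch below uses a small toy frame (an assumption; these are not values from the pulsar dataset) to show how `value_counts` gives the class shares and a majority-class baseline.

```python
import pandas as pd

# Toy labels (assumption), not values taken from the pulsar dataset.
toy = pd.DataFrame({"target_class": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]})

# Fraction of rows in each class; normalize=True returns proportions.
class_share = toy["target_class"].value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = class_share.max()
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

On the real data, the same two calls on `df['target_class']` would show how much of a model's accuracy is attributable to simply predicting the most common class.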

### Model 1. Random Forest Classifier

In [56]:
start_time = time.time()

# random forest classifier
rfc = ensemble.RandomForestClassifier()

# define input and outcome variables
X = df.drop('target_class', axis=1)
Y = df.target_class

# normalization
scalerX = MinMaxScaler(feature_range = (0,1))
X[X.columns] = scalerX.fit_transform(X[X.columns])

# fit the model
rfc.fit(X,Y)

# predicted outcome variable
Y_rfc = rfc.predict(X)

print('score: ', rfc.score(X,Y))
score = cross_val_score(rfc, X, Y, cv = 10)
print('R-squared: ', r2_score(Y, Y_rfc))
print("10 folds cross-validation Accuracy: %0.2f (+/- %0.2f)" % (score.mean(), score.std() * 2))
print("\n--- %s seconds ---" % (time.time() - start_time))



score:  0.9968711587886915
R-squared:  0.9623885786296197
10 folds cross-validation Accuracy: 0.98 (+/- 0.01)

--- 6.626705169677734 seconds ---


### Model 2. Decision Tree Classifier

In [57]:
start_time = time.time()

# decision tree classifier
dtc = DecisionTreeClassifier()

# define input and outcome variables
X = df.drop('target_class', axis=1)
Y = df.target_class

# normalization
scalerX = MinMaxScaler(feature_range = (0,1))
X[X.columns] = scalerX.fit_transform(X[X.columns])

# fit the model
dtc.fit(X,Y)

# predicted outcome variable
Y_dtc = dtc.predict(X)

print('score: ', dtc.score(X,Y))
score = cross_val_score(dtc, X, Y, cv = 10)
print('R-squared: ', r2_score(Y, Y_dtc))
print('10 folds cross-validation accuracy: %0.2f (+/-%0.2f)' % (score.mean(), score.std() * 2))
print('\n-- %s seconds --' %(time.time()- start_time))

score:  1.0
R-squared:  1.0
10 folds cross-validation accuracy: 0.97 (+/-0.01)

-- 3.913625955581665 seconds --


### Model 3. Decision Tree Classifier [playing with configuration]

The results above show that the Decision Tree classifier runs faster than the Random Forest. However, its training score and R-squared are both exactly 1 while its cross-validated accuracy is 0.97, which suggests overfitting. The model seems too complex and needs to be simplified. In the next model, I'll reduce the number of features considered at each split and limit the tree depth.
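
One way to make the overfitting concern concrete is to compare training accuracy with cross-validated accuracy for an unrestricted tree and a shallow one. This is a minimal sketch on synthetic data; the dataset from `make_classification`, the label-noise setting `flip_y=0.1`, and the depth of 3 are assumptions chosen for illustration, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption) with some label noise, so that an
# unrestricted tree can memorize the training set.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)

gaps = {}
for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Accuracy on the same data the tree was fitted on.
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    # Accuracy on held-out folds (cross_val_score refits clones of the tree).
    cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    gaps[depth] = train_acc - cv_acc
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A large gap between the training and cross-validated figures is the usual sign that the tree has fitted noise rather than structure.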

In [58]:
start_time = time.time()

# configured decision tree classifier 
dtc = DecisionTreeClassifier(criterion = 'entropy', max_features = 6, max_depth = 9, random_state = 42)

# define input variables and outcome variable
X = df.drop('target_class', axis=1)
Y = df.target_class

# normalization
scalerX = MinMaxScaler(feature_range = (0,1))
X[X.columns] = scalerX.fit_transform(X[X.columns])

# fit the model
dtc.fit(X,Y)

# predicted outcome variable
Y_dtc = dtc.predict(X)

print('score: ', dtc.score(X,Y))
score = cross_val_score(dtc, X, Y, cv = 10)
print('R-squared: ', r2_score(Y, Y_dtc))
print('10 folds cross-validation accuracy: %0.2f (+/-%0.2f)' % (score.mean(), score.std() * 2))
print('\n-- %s seconds --' %(time.time()- start_time))

score:  0.9875963794837412
R-squared:  0.8508975795674211
10 folds cross-validation accuracy: 0.98 (+/-0.01)

-- 3.0451254844665527 seconds --


After experimenting with the number of features and the depth of the tree, I settled on max_features = 6 and max_depth = 9 as the best configuration I found. The deeper the tree, the longer it takes to run, but this one is still about twice as fast as the Random Forest (roughly 3.0 vs. 6.6 seconds). Its training score is lower (0.988 vs. 0.997), while its 10-fold cross-validated accuracy matches the Random Forest's at 0.98.
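
The manual search described above could also be automated. The following is a minimal sketch using scikit-learn's `GridSearchCV`; the synthetic data and the candidate values in `param_grid` are assumptions for illustration and are not the search actually performed in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 continuous features, binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Hypothetical candidate values for the two parameters tuned by hand above.
param_grid = {"max_depth": [3, 6, 9], "max_features": [4, 6, 8]}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

The grid search scores every combination with cross-validation, so the chosen parameters are selected on held-out accuracy rather than on the training score.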