So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to that of the decision tree. This is an imperfect measure, but just go with it.
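Since runtime is the yardstick here, it can help to time each step the same way. The following is a minimal sketch of a timing helper, not part of the original notebook; the name `timed` and the stand-in workload are assumptions made for illustration.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload so the sketch runs without any data or model.
# In the notebook this would be something like:
#   scores, seconds = timed(cross_val_score, dt, X, Y, cv=10)
total, seconds = timed(sum, range(1_000_000))
print(total, f"{seconds:.4f} s")
```

Timing the fit and the cross-validation separately with a helper like this shows how much of the total each step accounts for.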

In [1]:
import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn import tree
from sklearn.model_selection import cross_val_score
import time
from IPython.display import Image
import pydotplus
import graphviz

In [2]:
df = pd.read_csv('data/waves.csv')
# Keep only the hour of day from the timestamp.
df['Date/Time'] = pd.to_datetime(df['Date/Time'])
df['Date/Time'] = df['Date/Time'].dt.hour.astype('int64')
# Binary target: 1 if Hs exceeds 1, else 0
# (vectorized equivalent of looping over the rows).
df['Hs_sort'] = (df['Hs'] > 1).astype(float)

# Drop the first two rows (index labels 0 and 1).
df.drop([0, 1], inplace = True)

In [3]:
df

Unnamed: 0,Date/Time,Hs,Hmax,Tz,Tp,Peak Direction,SST,Hs_sort
2,1,0.763,1.15,4.520,5.513,49.0,25.65,0.0
3,1,0.770,1.41,4.582,5.647,75.0,25.50,0.0
4,2,0.747,1.16,4.515,5.083,91.0,25.45,0.0
5,2,0.718,1.61,4.614,6.181,68.0,25.45,0.0
6,3,0.707,1.34,4.568,4.705,73.0,25.50,0.0
7,3,0.729,1.21,4.786,4.484,63.0,25.50,0.0
8,4,0.733,1.20,4.897,5.042,68.0,25.50,0.0
9,4,0.711,1.29,5.019,8.439,66.0,25.50,0.0
10,5,0.698,1.11,4.867,4.584,64.0,25.55,0.0
11,5,0.686,1.14,4.755,5.211,56.0,25.55,0.0


In [4]:
X = df.drop(columns = ['Hs', 'Hs_sort'])
Y = df.Hs_sort

In [5]:
start_time = time.time()
# Initialize and train our tree.
dt = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=4
)
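# Note: cross_val_score fits its own clones of dt, so this fit is not needed
# for the scores below, but it is included in the measured time.
dt.fit(X, Y)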
print(cross_val_score(dt, X, Y, cv = 10))
print("--- %s seconds ---" % (time.time() - start_time))

[0.92590899 0.96341185 0.63823462 0.9220215  0.85479076 0.82734965
 0.49279671 0.70912417 0.62708762 0.69045985]
--- 0.18919706344604492 seconds ---


In [7]:
start_time = time.time()
rfc = ensemble.RandomForestClassifier(n_estimators = 10)
rfc.fit(X, Y)
print(cross_val_score(rfc, X, Y, cv = 10))
print("--- %s seconds ---" % (time.time() - start_time))

[0.90921564 0.9515207  0.89389435 0.90784359 0.96409787 0.93002515
 0.79624971 0.91950606 0.93479753 0.92633265]
--- 3.6029088497161865 seconds ---


In [8]:
start_time = time.time()
rfc = ensemble.RandomForestClassifier(n_estimators = 50)
rfc.fit(X, Y)
print(cross_val_score(rfc, X, Y, cv = 10))
print("--- %s seconds ---" % (time.time() - start_time))

[0.91516122 0.95586554 0.90418477 0.91378916 0.96684198 0.93688543
 0.80676881 0.92179282 0.93319606 0.92953558]
--- 16.990887880325317 seconds ---


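The random forest classifier is considerably more accurate than the decision tree (mean 10-fold accuracy of about 0.91 versus about 0.77) and far more consistent across folds, but it is also roughly 19 times slower (3.6 s versus 0.19 s). Increasing the number of trees from 10 to 50 improves mean accuracy only slightly (to about 0.92) while increasing runtime nearly fivefold, to about 17 s. Decreasing the number of trees (not shown above) didn't appear to have much effect on time or accuracy.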
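To make the comparison concrete, the fold scores and timings printed in the cells above can be averaged and compared directly. This is a minimal sketch using only the standard library; the names `runs` and `summary` and the labels are chosen here for illustration, and the numbers are copied from the outputs above.

```python
from statistics import mean

# Fold scores and wall-clock times copied from the cell outputs above.
runs = {
    "decision tree": (
        [0.92590899, 0.96341185, 0.63823462, 0.9220215, 0.85479076,
         0.82734965, 0.49279671, 0.70912417, 0.62708762, 0.69045985],
        0.18919706344604492,
    ),
    "random forest, 10 trees": (
        [0.90921564, 0.9515207, 0.89389435, 0.90784359, 0.96409787,
         0.93002515, 0.79624971, 0.91950606, 0.93479753, 0.92633265],
        3.6029088497161865,
    ),
    "random forest, 50 trees": (
        [0.91516122, 0.95586554, 0.90418477, 0.91378916, 0.96684198,
         0.93688543, 0.80676881, 0.92179282, 0.93319606, 0.92953558],
        16.990887880325317,
    ),
}

baseline_time = runs["decision tree"][1]
summary = {}
for name, (scores, seconds) in runs.items():
    # Store (mean accuracy, runtime relative to the single tree).
    summary[name] = (mean(scores), seconds / baseline_time)
    print(f"{name}: mean accuracy {summary[name][0]:.3f}, "
          f"{summary[name][1]:.1f}x the tree's runtime")
```

Keep in mind that each timing is a single run that includes both the explicit fit and the 10-fold cross-validation, so the ratios are rough.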