## 3.2.6 Challenge: If a tree falls in the forest...

## Problem Statement

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes measure simplicity with runtime. Compare that to the runtime of the decision tree. This is imperfect but just go with it.

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import ensemble # Random Forest model
from sklearn.model_selection import cross_val_score
from sklearn import tree # Decision tree model
from IPython.display import Image # Display tree
import pydotplus # Render tree
import graphviz # Render tree
import time
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
data_GBP = '/home/mache/Desktop/Thinkful/Course/Unit 3/Forest/GBPUSD_15 Mins_Bid_2017.08.02_2018.04.26.csv'
df_gbp = pd.read_csv(data_GBP)
print (df_gbp.shape)
display(df_gbp.head(5))

(18379, 6)


Unnamed: 0,Time (ART),Open,High,Low,Close,Volume
0,2017.08.02 00:00:00,1.32039,1.32046,1.32001,1.32039,928.33
1,2017.08.02 00:15:00,1.32038,1.32044,1.31955,1.3198,885.6
2,2017.08.02 00:30:00,1.31978,1.32004,1.31952,1.31988,895.0
3,2017.08.02 00:45:00,1.31986,1.31997,1.31959,1.31979,492.38
4,2017.08.02 01:00:00,1.31979,1.32052,1.31965,1.32009,1461.16


In [3]:
data_UKX = '/home/mache/Desktop/Thinkful/Course/Unit 3/Forest/GBRIDXGBP_15 Mins_Bid_2017.08.02_2018.04.26.csv'
df_ukx = pd.read_csv(data_UKX)
print (df_ukx.shape)
display(df_ukx.head(5))

(18379, 6)


Unnamed: 0,Time (ART),Open,High,Low,Close,Volume
0,2017.08.02 00:00:00,7422.969,7422.969,7422.969,7422.969,0.0
1,2017.08.02 00:15:00,7422.969,7422.969,7422.969,7422.969,0.0
2,2017.08.02 00:30:00,7422.969,7422.969,7422.969,7422.969,0.0
3,2017.08.02 00:45:00,7422.969,7422.969,7422.969,7422.969,0.0
4,2017.08.02 01:00:00,7422.969,7422.969,7422.969,7422.969,0.0


In [4]:
#Merging both datasets

In [5]:
df = pd.merge(df_gbp, df_ukx, on=['Time (ART)'])

In [6]:
print(df.shape)
display(df.head(5))

(18379, 11)


Unnamed: 0,Time (ART),Open_x,High_x,Low_x,Close_x,Volume _x,Open_y,High_y,Low_y,Close_y,Volume _y
0,2017.08.02 00:00:00,1.32039,1.32046,1.32001,1.32039,928.33,7422.969,7422.969,7422.969,7422.969,0.0
1,2017.08.02 00:15:00,1.32038,1.32044,1.31955,1.3198,885.6,7422.969,7422.969,7422.969,7422.969,0.0
2,2017.08.02 00:30:00,1.31978,1.32004,1.31952,1.31988,895.0,7422.969,7422.969,7422.969,7422.969,0.0
3,2017.08.02 00:45:00,1.31986,1.31997,1.31959,1.31979,492.38,7422.969,7422.969,7422.969,7422.969,0.0
4,2017.08.02 01:00:00,1.31979,1.32052,1.31965,1.32009,1461.16,7422.969,7422.969,7422.969,7422.969,0.0


In [7]:
df.dtypes

Time (ART)     object
Open_x        float64
High_x        float64
Low_x         float64
Close_x       float64
Volume _x     float64
Open_y        float64
High_y        float64
Low_y         float64
Close_y       float64
Volume _y     float64
dtype: object

In [8]:
# Define X and Y
X = df[['High_x', 'Low_x', 'Close_x', 'Volume _x', 'Open_y', 'High_y', 'Low_y', 'Close_y', 'Volume _y']]
Y = df['Open_x']

## Decision Tree

In [9]:
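# Initialize and train our tree; the timer covers the fit plus 5-fold cross-validation.
dt_start_time = time.time()
decision_tree = tree.DecisionTreeRegressor(
    criterion='squared_error',  # this criterion was named 'mse' before scikit-learn 1.0
    max_features=9,
    max_depth=9,
    random_state=1337
)
decision_tree.fit(X, Y)  # the fitted tree is reused when rendering
display(cross_val_score(decision_tree, X, Y, cv=5))  # R^2 score per fold
print("Decision Tree runtime: {}".format(time.time() - dt_start_time))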

array([ 0.8134119 ,  0.99550563,  0.99562141,  0.99131817,  0.99788215])

Decision Tree runtime: 0.7497870922088623


In [10]:
# Render our tree (a regressor has no class names, so only feature names are passed).
dot_data = tree.export_graphviz(
    decision_tree, out_file=None,
    feature_names=list(X.columns),
    filled=True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.640318 to fit



IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.


## Random Forest

In [11]:
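# Random forest with default settings; the timer covers 5-fold cross-validation.
# (The default n_estimators was 10 before scikit-learn 0.22 and is 100 since.)
rf_start_time = time.time()
rfc = ensemble.RandomForestRegressor()
display(cross_val_score(rfc, X, Y, cv=5))  # R^2 score per fold
print("Random Forest runtime: {}".format(time.time() - rf_start_time))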

array([ 0.8014332 ,  0.99778579,  0.99768062,  0.99541322,  0.99811447])

Random Forest runtime: 3.685774087905884


In [12]:
# Forest-to-tree runtime ratio from hard-coded timings (not the values printed above).
print(3.370710611343384/0.6402187347412109)

5.264935917106349


## Write-up

The objective of this exercise was to predict the Open price of the GBPUSD foreign exchange pair on the 15-minute time frame.

I downloaded approximately nine months of GBPUSD data (2 August 2017 to 26 April 2018) and, for the same period and time frame, the FTSE 100 index prices.

Because the FTSE 100 is a share index that broadly consists of the 100 largest qualifying UK companies by full market value, I thought its data might correlate with the GBPUSD market price.

I merged both datasets on the Time (ART) column, then defined X as the feature dataframe and Y as the target (Open_x).
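The merge step can be sketched on a tiny example. The two frames below are hypothetical stand-ins for the CSV files, and the `suffixes` and `validate` arguments are an optional variation rather than what the notebook does: they give the overlapping columns readable names and raise an error if a timestamp appears more than once on either side.

```python
import pandas as pd

# Hypothetical miniature frames standing in for the two CSV files.
gbp = pd.DataFrame({
    "Time (ART)": ["2017.08.02 00:00:00", "2017.08.02 00:15:00"],
    "Open": [1.32039, 1.32038],
    "Close": [1.32039, 1.31980],
})
ukx = pd.DataFrame({
    "Time (ART)": ["2017.08.02 00:00:00", "2017.08.02 00:15:00"],
    "Open": [7422.969, 7422.969],
    "Close": [7422.969, 7422.969],
})

# Inner join on the timestamp; overlapping column names get explicit suffixes
# instead of the default _x / _y, and duplicate timestamps raise an error.
merged = pd.merge(
    gbp, ukx,
    on="Time (ART)",
    suffixes=("_gbp", "_ukx"),
    validate="one_to_one",
)
print(merged.columns.tolist())
```

With named suffixes, the feature and target selection reads as `merged["Open_gbp"]` rather than `df["Open_x"]`, which makes it harder to mix up the two markets.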

After merging, I built a regression decision tree, since the target consists of continuous values. I set max_features to 9, the number of columns in X, and, after trying several values, set max_depth to 9.
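One way to try different depths systematically is a small loop over candidate values, scored by cross-validation. This is only a sketch: the synthetic `X_demo` and `y_demo` arrays and the list of candidate depths are assumptions for illustration, not the notebook's data or the values actually tried.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 9 features, target driven by two of them.
rng = np.random.RandomState(1337)
X_demo = rng.uniform(size=(500, 9))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.01, size=500)

# Mean 5-fold R^2 for each candidate depth.
depth_scores = {}
for depth in (3, 5, 7, 9, 11):
    model = DecisionTreeRegressor(max_depth=depth, random_state=1337)
    depth_scores[depth] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_depth = max(depth_scores, key=depth_scores.get)
for depth, score in depth_scores.items():
    print(f"max_depth={depth}: mean R^2 = {score:.4f}")
print("best depth:", best_depth)
```

Comparing the mean score per depth makes the choice reproducible and shows whether deeper trees keep helping or start to overfit.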

I tried to visualize the decision tree, but the rendered image was too large: Graphviz had to scale it down, and the notebook hit its IOPub data rate limit. The results were interesting, with cross-validated R² scores of roughly 0.99 on four of the five folds and 0.81 on the first.
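A possible workaround for the oversized rendering is to export only the top levels of the tree. The sketch below uses the `max_depth` argument of `export_graphviz`, which truncates the drawing without changing the fitted model; the synthetic data and feature names are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Synthetic stand-in data (assumption), with hypothetical feature names.
rng = np.random.RandomState(1337)
X_demo = rng.uniform(size=(300, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.01, size=300)
feature_names = ["f0", "f1", "f2"]

# The model itself is still allowed to grow to depth 9.
deep_tree = DecisionTreeRegressor(max_depth=9, random_state=1337)
deep_tree.fit(X_demo, y_demo)

# Export only the first two levels below the root; deeper nodes are elided.
dot_top = export_graphviz(
    deep_tree, out_file=None,
    feature_names=feature_names,
    max_depth=2, filled=True,
)
# Full export, for size comparison only.
dot_full = export_graphviz(
    deep_tree, out_file=None,
    feature_names=feature_names,
    filled=True,
)
print(len(dot_top), len(dot_full))
```

The truncated DOT string can be passed to `pydotplus.graph_from_dot_data` in the same way as the full one, and it produces a much smaller image.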

I also built a random forest regressor, which scored similarly (R² of roughly 0.99 on four folds and 0.80 on the first). However, its runtime was roughly five times that of the decision tree (3.69 s versus 0.75 s in the run shown above).
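The runtime comparison can be made more symmetric by timing exactly the same operation for both models. The following is a minimal sketch under stated assumptions: `timed_cv` is a hypothetical helper, the data is synthetic, and `n_estimators=10` is chosen only to keep the example quick.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def timed_cv(model, X, y, cv=5):
    """Return (per-fold R^2 scores, elapsed seconds) for one cross-validation run."""
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores, time.perf_counter() - start


# Synthetic stand-in data (assumption): 9 features, as in the notebook's X.
rng = np.random.RandomState(1337)
X_demo = rng.uniform(size=(1000, 9))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.01, size=1000)

tree_scores, tree_secs = timed_cv(
    DecisionTreeRegressor(max_depth=9, random_state=1337), X_demo, y_demo
)
forest_scores, forest_secs = timed_cv(
    RandomForestRegressor(n_estimators=10, random_state=1337), X_demo, y_demo
)

print(f"tree:   mean R^2 = {tree_scores.mean():.4f}, {tree_secs:.3f} s")
print(f"forest: mean R^2 = {forest_scores.mean():.4f}, {forest_secs:.3f} s")
print(f"runtime ratio (forest / tree): {forest_secs / tree_secs:.2f}")
```

Computing the ratio from the measured variables, rather than pasting numbers from an earlier run, keeps the reported figure in step with the outputs shown.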