# Decision Tree Models
The Decision Tree model can capture complex non-linear relationships between variables and can be used for regression, classification or multi-output tasks. In this case I am using it for price prediction (a regression task) from a relatively small number of features.

Importing the dataset and dependencies is the first step.

In [1]:
import numpy as np
import sklearn as sk
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# drop rows with missing values
df = bitcoin.dropna(axis=0)

# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

First, initializing the linear model and fitting the regression line to the entire dataset (predictors X and labels y):

In [13]:
# instantiate model
lin_reg = linear_model.LinearRegression()

# fit model
lin_reg.fit(X, y)

LinearRegression()

Then the dataset is split into training and test sets.

In [14]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [15]:
# predict
y_pred = lin_reg.predict(X_train)

Printing the first 5 predictions from the fitted linear model:

In [16]:
print(y_pred[:5])

[14398.87824859 23615.43973771 43221.96880896 30312.88923232
 14945.24131451]


Measuring the RMSE and r-squared score for the linear model (based on training set):

In [17]:
y_pred = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

677.2239361854425
0.9990817412202545


And once again finding the RMSE and r-squared for the linear model, this time based on the test set:

In [18]:
y_pred = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

455.4471597043939
0.9994954639778587


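The test RMSE (about 455) is lower than the training RMSE (about 677) and r-squared stays above 0.999, so the linear model appears to perform well on the test data. Because the model was fitted on the entire dataset before the split, however, the test rows are not truly unseen, so this is not yet proof that it generalizes.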
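A leak-free version of this evaluation would split the data first and fit on the training rows only. The sketch below illustrates that ordering on synthetic data; X_demo, y_demo and demo_lin are hypothetical stand-ins (not the BTC_CAD data or the notebook's lin_reg), so the printed numbers will not match the outputs above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price data (an assumption, not the BTC_CAD file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1_000, 50_000, size=(500, 4))
y_demo = X_demo @ np.array([0.1, 0.5, 0.4, 0.0]) + rng.normal(0, 300, size=500)

# Split first, so the test rows stay hidden from the model
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit on the training rows only
demo_lin = LinearRegression().fit(Xd_train, yd_train)

# Report RMSE and r-squared for both partitions
for name, X_part, y_part in [("train", Xd_train, yd_train), ("test", Xd_test, yd_test)]:
    pred = demo_lin.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    print(name, "RMSE:", rmse, "R-squared:", r2_score(y_part, pred))
```

With this ordering the test-set scores measure performance on rows the model has never seen.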

# Decision Tree Model Selection
Next, it's time to apply a Decision Tree model to seek even further improvement.

In [19]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)

DecisionTreeRegressor()

In [20]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [23]:
# predict
y_pred = tree_reg.predict(X_train)

Printing the first 5 predictions from the fitted decision tree model:

In [24]:
print(y_pred[:5])

[14452.49 23303.57 43794.73 30525.81 14984.18]


Measuring the RMSE and r-squared score for the decision tree model (based on training set):

In [25]:
y_pred = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

0.0
1.0


With a perfect RMSE of 0.0 and r-squared of 1.0 on the training set, the tree is almost certainly overfitting. Let's see if there is a different outcome for the test data.

In [27]:
y_pred = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

0.0
1.0


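A perfect score on the test set is not credible either. The most likely explanation is that the tree was fitted on the entire dataset before the split, so it has already memorized the test rows. To get a more realistic estimate, I will divide the training set into several smaller training and validation folds and fit the decision tree on each. This is done using K-Fold Cross Validation.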
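One way to check whether a perfect test score comes from fitting before splitting is to compare a tree fitted on all rows with a tree fitted on the training rows only. The following is a minimal sketch on synthetic data; X_demo, y_demo, leaky_tree, honest_tree and rmse_on_test are hypothetical names introduced for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the price data (an assumption, not the BTC_CAD file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1_000, 50_000, size=(500, 4))
y_demo = 0.98 * X_demo[:, 1] + rng.normal(0, 300, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)


def rmse_on_test(model):
    """RMSE of a fitted model on the held-out test rows."""
    return np.sqrt(mean_squared_error(yd_test, model.predict(Xd_test)))


# Fitted on every row, including the rows later used for testing
leaky_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
# Fitted on the training rows only
honest_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

leaky_rmse = rmse_on_test(leaky_tree)
honest_rmse = rmse_on_test(honest_tree)
print("Test RMSE, fitted on all rows:", leaky_rmse)
print("Test RMSE, fitted on training rows only:", honest_rmse)
```

An unconstrained tree can memorize every row it was fitted on, so the first figure collapses towards zero while the second reflects genuine prediction error.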

## Cross Validation

In [30]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    print("R-Squared:", r2_test)
    
display_scores(tree_rmse_scores)

Scores: [1616.18461785 2617.82551977  942.67031066 1329.26504226 1111.90422042
 1607.42123308 1938.65847736 1569.79318207 1657.21080466 1144.4827816 ]
Mean 1553.5416189722325
Standard Deviation 456.5840831682563
R-Squared: 1.0


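Comparing the decision tree's cross-validation scores to those of the linear regression model. Note that the R-Squared line printed by display_scores is not cross-validated: it is the global r2_test computed from the tree's earlier test-set predictions, which is why it reads 1.0 for both models.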

In [31]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562772 978.49089723 708.25187258 389.29477414 908.56488605
 505.32560637 741.94581029 766.55343858 671.3117333  507.85427655]
Mean 705.1488922789055
Standard Deviation 181.51221092233794
R-Squared: 1.0
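To obtain an r-squared value that belongs to the same folds as the RMSE scores, both metrics can be requested from scikit-learn's cross_validate in a single call. The sketch below assumes synthetic data; cv_summary, X_demo and y_demo are hypothetical names, and the printed values will differ from the notebook's outputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the price data (an assumption, not the BTC_CAD file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1_000, 50_000, size=(500, 4))
y_demo = 0.98 * X_demo[:, 1] + rng.normal(0, 300, size=500)


def cv_summary(model, X, y, folds=10):
    """Cross-validate RMSE and r-squared on the same folds."""
    results = cross_validate(
        model, X, y, cv=folds,
        scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
    )
    rmse = -results["test_rmse"]  # scikit-learn returns negated errors
    r2 = results["test_r2"]
    return {"rmse_mean": rmse.mean(), "rmse_std": rmse.std(), "r2_mean": r2.mean()}


for name, model in [
    ("tree", DecisionTreeRegressor(random_state=0)),
    ("linear", LinearRegression()),
]:
    print(name, cv_summary(model, X_demo, y_demo))
```

Each model is cloned and refitted on every fold, so neither metric depends on predictions left over from an earlier cell.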


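The linear model's mean cross-validated RMSE (about 705) is less than half the decision tree's (about 1554). Next I will try the Random Forest Regressor model to see whether it can improve on these scores.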

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)

# predict on the training set
price_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, price_predictions)
forest_rmse_scores = np.sqrt(forest_mse)
display_scores(forest_rmse_scores)

Because this is a training-set score, a value at or near zero should be treated with some scepticism rather than taken as a perfect result. The next step is to tune the model's hyperparameters, starting with Grid Search CV. Saving the fitted model as a pickle file will ensure some consistency when comparing scores, parameters and hyperparameters, and will let me start where I left off!

I first need to import pickle and joblib.

In [None]:
!pip install joblib

In [None]:
import joblib
import pickle

# save the fitted decision tree to disk
joblib.dump(tree_reg, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
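The Grid Search CV step mentioned above could look roughly like the following. This is only a sketch under stated assumptions: the data is synthetic, and the parameter grid (n_estimators, max_features) is an illustrative choice rather than a tuned recommendation for the BTC_CAD data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the price data (an assumption, not the BTC_CAD file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1_000, 50_000, size=(200, 4))
y_demo = 0.98 * X_demo[:, 1] + rng.normal(0, 300, size=200)

# Hypothetical, deliberately small grid to keep the search fast
param_grid = {"n_estimators": [10, 30], "max_features": [2, 4]}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X_demo, y_demo)

print("Best parameters:", grid_search.best_params_)
# best_score_ is a negated MSE, so flip the sign before taking the root
print("Best cross-validated RMSE:", np.sqrt(-grid_search.best_score_))
```

The fitted grid_search.best_estimator_ can then be saved with joblib.dump in the same way as the decision tree.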