# Decision Tree Models
The Decision Tree model can be used to discover complex non-linear relationships between variables for regression, classification or multi-output tasks. In this case I am predicting price from a relatively small number of features.
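
To make the non-linear claim concrete, here is a minimal sketch of how a regression tree could pick a single split on one feature by minimising squared error, which is the default criterion of `DecisionTreeRegressor`. The data and the helper names `sse` and `best_split` are made up for illustration and are not part of this notebook or of scikit-learn.

```python
def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the best single split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up step-shaped data: a straight line fits it poorly, one split fits it well.
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 10.5, 50.0, 51.0, 49.5]

threshold, error = best_split(xs, ys)
print(threshold, round(error, 3))
```

A full tree repeats this search inside each resulting group, which is how it builds up a piecewise-constant approximation of a curve.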

Importing the dependencies and the dataset is the first step.

In [5]:
import numpy as np
import sklearn as sk
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# drop rows with missing values
df = bitcoin.dropna(axis=0)

# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

Initializing the linear model and fitting the regression line to the entire dataset first:

In [6]:
# instantiate model
lin_reg = linear_model.LinearRegression()

# fit model
lin_reg.fit(X, y)

LinearRegression()

Then the data needs to be split into training and test sets.

In [7]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
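
As a rough illustration of what `test_size = 0.3` and `random_state = 0` control, the sketch below does a seeded shuffle and cuts off 30% of the rows. The helper `holdout_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows the idea that a fixed seed gives the same split on every run.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of `rows` with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-ins for ten rows of price data
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```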

In [11]:
# predict on the training set
y_pred = lin_reg.predict(X_train)

Printing the predictions from the linear model:

In [12]:
print(y_pred)

[14397.48984    23621.11524338 43241.201114   30343.12275151
 14950.27576399 59976.83742331 68718.53704518 69597.91104639
 12562.80258658 10030.31754586 14054.52712612 12036.87270379
 44255.96407877 75935.65528688 24762.6096622  46736.8240224
 13649.27255236 68557.2749307  46667.70067986 36549.10052378
 12602.85520768 74172.13480586 12385.4637041  43387.52205204
 14901.11879612 12498.95446154 23645.66665522 22685.16554083
 79092.8460428  60830.22346878 72379.85762814 15589.84243587
 13019.08559787 72398.5429078  62788.0781389  12408.21959286
 13320.93295561 73903.73304631 14015.72470178 26907.09443059
 13351.3857297  50155.91974286 73059.88957928 59842.92951805
 64360.38872615 13732.84304359 50901.26559416 15011.20598847
 14266.52506071 14141.0128721  13354.72074968 46823.93739954
 14998.72570254 29848.98178711 35596.25176114 12258.61370432
 24490.02301387 23977.17966753 23461.87571172 23919.11165705
 13171.65527712 76950.21736567 15059.17014179 12765.92136064
 16010.79527868 14159.602

Measuring the RMSE for the linear model on the training set:

In [15]:
y_pred = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train, y_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

675.1045771626677
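
RMSE is expressed in the units of the target, so it is often read alongside the r-squared score, which `r2_score` (imported earlier) would compute as `r2_score(y_train, y_pred)`. The sketch below works both metrics out by hand on made-up numbers; the function names `rmse` and `r_squared` are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical prediction error in the target's units."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up closing prices and predictions, for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(rmse(actual, predicted))
print(r_squared(actual, predicted))
```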

Next, training a Decision Tree regressor on the same training data:

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

With the model trained, I evaluate it on the training set:

In [None]:
price_predictions = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, price_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

A training RMSE of zero looks like a perfect score, but it is far more likely that the model has overfitted the data. I will use cross-validation before evaluating on my hold-out (test) set to get a more reliable estimate of the root mean squared error.
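
Why a zero training error is suspicious can be seen in a small sketch. A one-feature tree grown until every leaf holds a single training point ends up with thresholds between neighbouring training values, so the code below builds that partition directly rather than running the full splitting algorithm. The noisy curve is made up, and the helper names are illustrative.

```python
import bisect
import math
import random

rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [x ** 2 + rng.gauss(0, 1) for x in xs]  # made-up noisy curve

# Alternate points between a training set and a held-out set.
train = list(zip(xs[0::2], ys[0::2]))
held_out = list(zip(xs[1::2], ys[1::2]))

# One leaf per training point: thresholds are midpoints between neighbours.
train_x = [x for x, _ in train]
train_y = [y for _, y in train]
thresholds = [(a + b) / 2 for a, b in zip(train_x, train_x[1:])]


def predict(x):
    """Return the value stored in the single-point leaf that contains x."""
    return train_y[bisect.bisect_left(thresholds, x)]


def rmse(pairs):
    """Root mean squared error of `predict` over (x, y) pairs."""
    return math.sqrt(sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs))


train_rmse = rmse(train)
held_out_rmse = rmse(held_out)
print(train_rmse, held_out_rmse)
```

The memorised training points are reproduced exactly, while the held-out points are not, which is the gap that cross-validation is meant to expose.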

### Cross Validation
This method evaluates the Decision Tree model by splitting the training set into several smaller folds, training on all but one fold and validating on the remaining one, then rotating until each fold has served as the validation set. This is the K-fold cross-validation technique; I have split the data into 10 folds with cv=10 (which can be changed).

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())
    
display_scores(tree_rmse_scores)
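
To make the fold mechanics concrete, the sketch below generates the train and validation indices for K contiguous folds. It is an illustration of the idea under simple assumptions (no shuffling, a hypothetical helper named `k_fold_indices`), not scikit-learn's own code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, val_idx in folds[:3]:
    print(len(train_idx), val_idx)
```

Each sample lands in exactly one validation fold, so every score is computed on rows the model did not see during that round of training.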

Comparing the scores from cross validation to those from the linear regression model:

In [None]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Next I will try the Random Forest Regressor to see whether it improves on these scores.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)

# evaluate the forest on its own training-set predictions
forest_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

Because a training score this close to perfect is too good to trust, I approach it with some scepticism and will tune the hyperparameters next, starting with Grid Search CV. Saving the model as a pickle file will ensure some consistency when comparing scores, parameters and hyperparameters, and lets me pick up where I left off.

I first need to install joblib and import it along with pickle.

In [None]:
!pip install joblib

In [None]:
import joblib
import pickle

# save the fitted decision tree
joblib.dump(tree_reg, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
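
The same save-and-reload round trip can be sketched with the standard-library `pickle` module. The dictionary below is a hypothetical stand-in for a fitted model, and the temporary directory is an assumption made so the example does not write into the project folder. The point it shows is that the load must use the same path as the dump.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: any picklable Python object works.
model_stub = {"name": "decision_tree", "max_depth": None, "thresholds": [3.5, 7.25]}

path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")

# Save the object to disk...
with open(path, "wb") as f:
    pickle.dump(model_stub, f)

# ...and later load it back from the same path.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stub)
```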