# Decision Tree Models
The Decision Tree model can be used to discover complex nonlinear relationships between variables for regression (numeric prediction), classification or multi-output tasks. In this case I am using it for regression: predicting the Bitcoin closing price from a small number of features.

Importing the dataset and dependencies:

In [None]:
import numpy as np
import sklearn as sk
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# select data subset
df = bitcoin[["Open", "High", "Low", "Close", "Volume"]]

# drop incomplete rows once so that X and y stay aligned, then select data for modeling
df = df.dropna(axis=0)
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

In [None]:
# instantiate model
lin_reg = linear_model.LinearRegression()

# fit model
lin_reg.fit(X, y)

First, the data needs to be split into training and test sets.

In [None]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
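
To make the split concrete, here is a minimal pure-Python sketch of a shuffled 70/30 hold-out split. The helper `simple_train_test_split` and the ten-row stand-in data are hypothetical and only illustrate the idea; this is not how `train_test_split` is implemented internally.

```python
import random

def simple_train_test_split(rows, test_size=0.3, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)          # seeded, so the split is reproducible
    n_test = int(round(len(rows) * test_size))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                            # stand-in for 10 rows of price data
train_rows, test_rows = simple_train_test_split(rows, test_size=0.3, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=0`: the same rows end up in the test set on every run, so scores stay comparable between runs.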

In [None]:
# instantiate 
model = linear_model.LinearRegression()

# fit
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

Having trained the model, I evaluate it on the training set:

In [None]:
price_predictions = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, price_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
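
As a worked illustration of what the RMSE measures, the sketch below computes it by hand on a few made-up prices. The helper `rmse` and the numbers are assumptions for illustration and are not taken from the BTC_CAD data.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up closing prices, for illustration only
actual = [100.0, 110.0, 120.0]
predicted = [103.0, 110.0, 116.0]

# Errors are 3, 0 and 4, so the MSE is (9 + 0 + 16) / 3
example_rmse = rmse(actual, predicted)
print(example_rmse)
```

Because of the square root, the RMSE is expressed in the same units as the target, here the closing price.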

This looks like a perfect error score, but it is much more likely that the model has overfitted the training data. Before evaluating on the hold-out test set, I will use cross-validation to get a more reliable estimate of the root mean squared error.
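
The toy sketch below shows why a zero training error is a warning sign rather than a success. `MemorizingModel` is a hypothetical stand-in for an unconstrained tree: it simply stores every training pair, so it reproduces the training targets exactly but has nothing useful to say about inputs it has not seen. The data is made up.

```python
import math

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

class MemorizingModel:
    """Toy stand-in for a fully grown tree: it stores every training pair."""

    def fit(self, X, y):
        self.lookup = dict(zip(X, y))
        self.fallback = sum(y) / len(y)           # mean target, used for unseen inputs
        return self

    def predict(self, X):
        return [self.lookup.get(x, self.fallback) for x in X]

X_seen, y_seen = [1, 2, 3, 4], [10.0, 20.0, 15.0, 30.0]
X_new, y_new = [5, 6], [25.0, 12.0]

memorizer = MemorizingModel().fit(X_seen, y_seen)
train_error = rmse(y_seen, memorizer.predict(X_seen))   # evaluated on data it has seen
new_error = rmse(y_new, memorizer.predict(X_new))       # evaluated on unseen data
print(train_error, new_error)
```

The gap between the two numbers is what cross-validation is meant to expose before the test set is touched.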

### Cross Validation
This method evaluates the Decision Tree model by splitting the training set into several folds, then repeatedly training on all but one fold and validating on the fold that was held out. This is the K-fold cross-validation technique; here I have split the data into 10 separate folds, cv=10 (which can be changed).
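
The fold mechanics can be sketched in a few lines of plain Python. The generator `k_fold_indices` is a hypothetical helper written for illustration; it splits consecutive indices without shuffling and is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, val_idx in folds[:2]:
    print(len(train_idx), val_idx)
```

Each of the 10 rounds trains on the other nine folds and scores on the remaining one, which gives 10 scores instead of a single estimate.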

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std())
    
display_scores(tree_rmse_scores)
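
The minus sign in `np.sqrt(-scores)` is needed because scikit-learn scorers follow a "greater is better" convention, so the MSE comes back negated. The sketch below walks through the conversion with made-up fold scores, using only the standard library.

```python
import math
import statistics

# Made-up values in the shape returned for scoring="neg_mean_squared_error":
# one negated MSE per fold
neg_mse_scores = [-4.0, -9.0, -16.0]

rmse_per_fold = [math.sqrt(-s) for s in neg_mse_scores]   # flip the sign, then take the root
mean_rmse = statistics.mean(rmse_per_fold)
std_rmse = statistics.pstdev(rmse_per_fold)               # population std, like NumPy's default .std()

print(rmse_per_fold, mean_rmse, std_rmse)
```

The mean summarises the expected error, while the standard deviation indicates how much that estimate varies from fold to fold.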

Comparing the scores from cross validation to those from the linear regression model:

In [None]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Next I will try the Random Forest Regressor model to see whether it improves on these scores.

In [None]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)

# predict with the forest itself; reusing the tree's price_predictions would only repeat the tree's score
forest_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

A very low score on the training set needs to be approached with some scepticism, because the model has already seen this data. The next step is to try out some alternative model configurations, starting with Grid Search CV. Saving the model as a pickle file will ensure some consistency when comparing scores, parameters and hyperparameters, and enable me to start where I left off!
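
As a rough sketch of what a grid search does, the code below tries every combination in a small parameter grid and keeps the one with the lowest score. `grid_search` and `toy_cv_rmse` are hypothetical helpers: the toy function stands in for "cross-validated RMSE of a model built with these settings" and its values are invented. `max_depth` and `min_samples_leaf` are real `DecisionTreeRegressor` parameters, but the grid values are arbitrary.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring function with an invented minimum at (5, 2)
def toy_cv_rmse(max_depth, min_samples_leaf):
    return (max_depth - 5) ** 2 + (min_samples_leaf - 2) ** 2 + 1.0

param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 2, 4]}
best_params, best_score = grid_search(param_grid, toy_cv_rmse)
print(best_params, best_score)
```

With three values per parameter the search evaluates nine combinations; in a real search each evaluation is a full cross-validation run, so the cost grows quickly as the grid gets larger.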

I first need to install and import joblib.

In [None]:
!pip install joblib

In [None]:
import joblib

# save the fitted decision tree
joblib.dump(tree_reg, "my_model.pkl")
# and later... load it back from the same file
my_model_loaded = joblib.load("my_model.pkl")
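
The same save-then-load pattern can be tried without a trained model, using the standard-library `pickle` module. In this sketch `fake_model` is a hypothetical stand-in for a fitted estimator, and the file is written to a temporary directory so nothing in the project folder is touched.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator; any picklable Python object works the same way
fake_model = {"name": "tree_reg", "max_depth": None}

model_path = os.path.join(tempfile.mkdtemp(), "my_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(fake_model, f)

# and later... load from the same path that was written
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

print(restored_model == fake_model)
```

The key detail is that the path passed to the load step matches the one used for the dump; pickle files should also only be loaded from sources you trust.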