# Decision Tree Models
A Decision Tree model can capture complex non-linear relationships between variables, and it can be used for regression, classification, or multi-output tasks. In this case I am using it for regression: predicting the closing price from a small number of features.
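
To illustrate what capturing a non-linear relationship means, here is a minimal sketch on synthetic data (a noisy sine curve, not the Bitcoin dataset). The names `x`, `y_demo`, `line` and `tree` and the choice of `max_depth=4` are assumptions made for illustration only. Both scores are computed on the data the models were fitted to, so they show flexibility rather than generalisation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, illustrative data: y follows a sine curve plus a little noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(x).ravel() + rng.normal(scale=0.1, size=200)

# A straight line cannot follow the curve; a shallow tree approximates it
# with a series of constant steps.
line = LinearRegression().fit(x, y_demo)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x, y_demo)

line_r2 = r2_score(y_demo, line.predict(x))
tree_r2 = r2_score(y_demo, tree.predict(x))
print("Linear regression R^2:", round(line_r2, 3))
print("Decision tree R^2:    ", round(tree_r2, 3))
```

On data like this the tree should fit the curve more closely than the line, which is the property the model is being chosen for.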

Importing the dataset and dependencies:

In [16]:
import numpy as np
import sklearn as sk
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# select data subset
df = bitcoin[["Open", "High", "Low", "Close", "Volume"]]

# select data for modeling; drop incomplete rows once so X and y stay aligned
df = df.dropna(axis=0)
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

In [17]:
# instantiate model
lin_reg = linear_model.LinearRegression()

# fit model
lin_reg.fit(X, y)

LinearRegression()

First, the data needs to be split into training and test sets.

In [18]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [19]:
# instantiate 
model = linear_model.LinearRegression()

# fit
model.fit(X_train, y_train)

# predict
y_pred = model.predict(X_test)

In [20]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train, y_train)

DecisionTreeRegressor()

Having trained the model, I first evaluate it on the training set:

In [21]:
price_predictions = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, price_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

0.0


An RMSE of 0.0 looks like a perfect score, but it is far more likely that the model has overfitted the training data. Before evaluating on my hold-out test set, I will use cross-validation to get a more realistic estimate of the root mean squared error.
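
The gap between training error and error on unseen data can be seen in a minimal sketch. It assumes synthetic stand-in data (not the BTC_CAD file), and the names `X_demo`, `y_demo`, `deep_tree` and the helper `rmse` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: a noisy linear target over three features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

def rmse(model, X, y):
    """Root mean squared error of a fitted model on (X, y)."""
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# With no depth limit the tree keeps splitting until it memorises the
# training rows, so the training error says nothing about new data.
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_rmse = rmse(deep_tree, X_tr, y_tr)
val_rmse = rmse(deep_tree, X_val, y_val)
print("Training RMSE:  ", train_rmse)
print("Validation RMSE:", val_rmse)
```

A training RMSE near zero alongside a clearly larger validation RMSE is the usual signature of overfitting.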

### Cross Validation
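This method evaluates the Decision Tree model by splitting the training set into several smaller folds, then training on all but one fold and validating on the remaining one, repeating until each fold has been used for validation once. This is the K-fold cross-validation technique; I have split the data into 10 folds with cv=10 (which can be changed). Because scikit-learn scorers treat higher values as better, the scoring function returns the negative MSE, which is why the scores are negated before taking the square root.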
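
What the K-fold procedure does can be written out by hand. This is a minimal sketch, assuming synthetic stand-in data; `X_demo`, `y_demo` and `fold_rmse` are hypothetical names. It uses an unshuffled `KFold` with 10 splits, which is what an integer `cv=10` resolves to for a regressor.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the BTC_CAD dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=1.0, size=200)

kfold = KFold(n_splits=10)
fold_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    # Fit a fresh model on nine folds, then score it on the held-out fold.
    fold_model = DecisionTreeRegressor(random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))

fold_rmse = np.array(fold_rmse)
print("Scores:", fold_rmse)
print("Mean:", fold_rmse.mean())
print("Standard Deviation:", fold_rmse.std())
```

Every row is used for validation exactly once, so the mean of the ten scores is a steadier estimate than a single train/validation split.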

In [27]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    
display_scores(tree_rmse_scores)

Scores: [1420.31462535 2723.96469484 1065.03027289 1445.51402165 1174.34537821
 1610.72845    1746.90226016 1578.34949088 1712.9068318  1152.48127694]
Mean 1563.053730271387
Standard Deviation 448.1471479891698


Comparing the scores from cross validation to those from the linear regression model:

In [28]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562772 978.49089723 708.25187258 389.29477414 908.56488605
 505.32560637 741.94581029 766.55343858 671.3117333  507.85427655]
Mean 705.1488922789055
Standard Deviation 181.51221092233794


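The linear regression model scores better here, with a mean RMSE of about 705 against about 1563 for the decision tree. Next I will try the Random Forest Regressor model to see whether it improves on these scores.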

In [29]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X_train, y_train)

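# predict with the forest itself rather than reusing the tree's predictions
forest_predictions = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)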
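
As with the decision tree, a training-set score says little about how the forest generalises, so it can be scored with the same 10-fold procedure. This is a minimal sketch, assuming synthetic stand-in data; `X_demo`, `y_demo`, `forest` and `n_estimators=50` are illustrative assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the BTC_CAD dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=1.0, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)

# The scorer returns negative MSE, so negate before taking the square root.
forest_scores = cross_val_score(
    forest, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10
)
forest_rmse_scores = np.sqrt(-forest_scores)

print("Scores:", forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean())
print("Standard Deviation:", forest_rmse_scores.std())
```

In the notebook the same call would take `forest_reg`, `X_train` and `y_train`, and the result could be passed to `display_scores` for a direct comparison with the two earlier models.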