# Decision Tree Models
The Decision Tree model can be used to discover complex nonlinear relationships between variables for regression, classification or multi-output tasks. In this case I am looking to predict price from a relatively small number of features.

Importing the dataset and dependencies is the first step.

In [2]:
import numpy as np
import sklearn as sk
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# drop rows with missing values (read_csv already returns a DataFrame)
df = bitcoin.dropna(axis=0)

# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

As a baseline, I first initialize a linear regression model and fit it to the entire dataset (predictors X and target y):

In [3]:
# instantiate model
lin_reg = linear_model.LinearRegression()

# fit model
lin_reg.fit(X, y)

LinearRegression()

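Then the dataset is split into training (70%) and test (30%) sets. Because the model was already fitted on the entire dataset above, the test rows are not truly unseen by it.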

In [4]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [5]:
# predict
y_pred = lin_reg.predict(X_train)

Printing the first five predictions from the linear model on the training set:

In [6]:
print(y_pred[:5])

[14398.87824859 23615.43973771 43221.96880896 30312.88923232
 14945.24131451]


Measuring the RMSE and r-squared score for the linear model (based on training set):

In [9]:
y_pred = lin_reg.predict(X_train)
lin_mse = mean_squared_error(y_train, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

677.2239361854425
0.9990817412202545
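
For reference, both metrics can be computed by hand. The sketch below uses four made-up values (an assumption for illustration, not rows from BTC_CAD.csv) and checks the manual results against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values (hypothetical, not taken from BTC_CAD.csv)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The same quantities from scikit-learn
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(rmse_manual, rmse_sklearn)
print(r2_manual, r2_sklearn)
```

RMSE is in the same units as the target (here, price), while r-squared is unitless, which is why the two are reported together.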


And once again finding the RMSE and r-squared for the linear model, this time based on the test set:

In [10]:
y_pred = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, y_pred)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
    
r2_test = r2_score(y_test, y_pred)
print(r2_test)

455.4471597043939
0.9994954639778587


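The linear model scores about as well on the test set (RMSE 455.4, r-squared 0.9995) as on the training set (RMSE 677.2, r-squared 0.9991). However, because the model was fitted on the entire dataset before the split, the test rows were seen during fitting, so these scores do not by themselves demonstrate generalization to unseen data.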
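
A leak-free variant of this evaluation fits the model on the training split only. The sketch below is a minimal illustration on synthetic data; the array names, coefficients and noise level are assumptions and stand in for the Open/High/Low/Volume features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four features and the target (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

# Split first, so the test rows stay unseen during fitting
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit on the training split only
demo_reg = LinearRegression()
demo_reg.fit(Xd_train, yd_train)

# Score each split separately
train_rmse = np.sqrt(mean_squared_error(yd_train, demo_reg.predict(Xd_train)))
test_rmse = np.sqrt(mean_squared_error(yd_test, demo_reg.predict(Xd_test)))
test_r2 = r2_score(yd_test, demo_reg.predict(Xd_test))

print("train RMSE:", train_rmse)
print("test RMSE:", test_rmse, "test R2:", test_r2)
```

With this ordering the test score is an out-of-sample estimate, so a gap between the training and test RMSE becomes meaningful.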

# Decision Tree Model Selection
Next, it's time to apply a Decision Tree model to the entire dataset before seeking further improvement.

In [11]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)

DecisionTreeRegressor()

In [12]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [13]:
# predict
y_pred = tree_reg.predict(X_train)

Printing the first five predictions from the decision tree model on the training set:

In [14]:
print(y_pred[:5])

[14452.49 23303.57 43794.73 30525.81 14984.18]


Measuring the RMSE and r-squared score for the decision tree model (based on training set):

In [15]:
y_pred = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

0.0
1.0


A perfect RMSE of 0.0 and r-squared of 1.0 on the training set points to overfitting. Let's see if there is a different outcome for the test data.

In [16]:
y_pred = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

0.0
1.0


Perfect scores on the test set are not credible either: the tree was fitted on the entire dataset, so it has already memorized the test rows. To get a more realistic estimate, I will divide the training data into several smaller training and validation sets and evaluate the decision tree on each. This is done using K-Fold Cross Validation.
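
To illustrate the memorization effect, the sketch below fits an unconstrained tree on the training split only. The data is synthetic (an assumption for illustration), so the numbers are not comparable to the Bitcoin results:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy nonlinear data (assumption, not the BTC_CAD data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 4))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.3, size=400)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unconstrained tree keeps splitting until every training row is fitted exactly
demo_tree = DecisionTreeRegressor(random_state=0)
demo_tree.fit(Xd_train, yd_train)

train_rmse = np.sqrt(mean_squared_error(yd_train, demo_tree.predict(Xd_train)))
test_rmse = np.sqrt(mean_squared_error(yd_test, demo_tree.predict(Xd_test)))

print("train RMSE:", train_rmse)  # effectively zero: the tree memorized the training rows
print("test RMSE:", test_rmse)    # clearly above zero on rows the tree never saw
```

A zero error on data the model was fitted on says little about its quality; the error on held-out rows is the informative number.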

## Cross Validation
This method evaluates the Decision Tree model by splitting the training set into several folds. The model is trained on all but one fold and validated on the held-out fold, and the process repeats so that each fold serves once as the validation set. I have split the data into 10 folds, cv=10 (which can be changed).

In [17]:
from sklearn.model_selection import cross_val_score

# scikit-learn returns negated MSE, so flip the sign before taking the root
scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
# note: this is the hold-out r-squared from the previous cell, not a cross-validated score
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    print("R-Squared:", r2_test)

display_scores(tree_rmse_scores)

Scores: [1632.34178217 2576.43835781 1179.73439576 1327.10807347 1038.38290389
 1608.54411368 1624.67770866 1672.65698066 1042.90466061 1081.41410478]
Mean 1478.4203081498893
Standard Deviation 442.5399601154709
R-Squared: 1.0
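
To make explicit what cross_val_score does with cv=10 for a regressor, the sketch below repeats the procedure by hand with KFold and compares the result to the helper. The data is synthetic and the names are hypothetical; the tree uses a fixed random_state so both runs are reproducible:

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption, not the BTC_CAD data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] - X_demo[:, 2] + rng.normal(scale=0.5, size=300)

demo_tree = DecisionTreeRegressor(random_state=0)

# Manual 10-fold loop: train on nine folds, validate on the remaining one
kfold = KFold(n_splits=10)
manual_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    fold_model = clone(demo_tree)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))
manual_rmse = np.array(manual_rmse)

# The helper performs the same splits when cv is an integer and the model is a regressor
cv_scores = cross_val_score(
    demo_tree, X_demo, y_demo, scoring="neg_mean_squared_error", cv=10
)
helper_rmse = np.sqrt(-cv_scores)

print(manual_rmse)
print(helper_rmse)
```

Each fold's model is refitted from scratch, so a tree fitted earlier on the full dataset does not leak into these scores.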


Comparing these cross-validation scores to those of the linear regression model (the R-Squared line is the same stored r2_test value, not a score for this model):

In [18]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562772 978.49089723 708.25187258 389.29477414 908.56488605
 505.32560637 741.94581029 766.55343858 671.3117333  507.85427655]
Mean 705.1488922789055
Standard Deviation 181.51221092233794
R-Squared: 1.0
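
The R-Squared line printed by display_scores comes from the stored r2_test variable rather than from the folds. One way to obtain a cross-validated r-squared next to the RMSE is cross_validate with two scorers. This is a minimal sketch on synthetic data, with hypothetical names:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data (assumption, not the BTC_CAD data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=300)

# Request both metrics per fold; results are keyed as "test_<name>"
results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    scoring={"mse": "neg_mean_squared_error", "r2": "r2"},
    cv=10,
)

cv_rmse = np.sqrt(-results["test_mse"])  # negated MSE -> RMSE per fold
cv_r2 = results["test_r2"]               # r-squared per fold

print("RMSE mean:", cv_rmse.mean(), "std:", cv_rmse.std())
print("R2 mean:", cv_r2.mean(), "std:", cv_r2.std())
```

Reporting both metrics from the same folds keeps the RMSE and r-squared figures tied to the same model and the same data.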


# Random Forest Model Selection
Next I will try the Random Forest Regressor model to try and improve on these scores. A Random Forest often gives more accurate predictions than a single tree because it averages the predictions of many individual decision trees, each trained on a bootstrap sample of the data.
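
The averaging can be checked directly: in scikit-learn, the fitted trees are exposed through the estimators_ attribute, and the forest's prediction is the mean of their predictions. The sketch below uses synthetic data and a small n_estimators, both chosen only for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption, not the BTC_CAD data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

demo_forest = RandomForestRegressor(n_estimators=25, random_state=0)
demo_forest.fit(X_demo, y_demo)

# Predict the first five rows with every individual tree, then average
sample = X_demo[:5]
per_tree = np.stack([tree.predict(sample) for tree in demo_forest.estimators_])
averaged = per_tree.mean(axis=0)

forest_pred = demo_forest.predict(sample)

print(averaged)
print(forest_pred)
```

The individual trees disagree with one another, and averaging smooths out part of that disagreement, which is where the variance reduction comes from.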

In [21]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X, y)

RandomForestRegressor()

In [22]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [23]:
# predict
y_pred = forest_reg.predict(X_train)

Printing the first five predictions from the random forest model on the training set:

In [24]:
print(y_pred[:5])

[14421.8195 23459.939  43544.2968 30478.4002 15015.1844]


Measuring the RMSE and r-squared score for the random forest model (based on training set):

In [28]:
y_pred = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, y_pred)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

401.0368526738507
0.9996779902240511


In [29]:
y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

235.68164177245325
0.9998648961601764


The test scores are close to the training scores, but the forest was also fitted on the entire dataset, so I will use the cross-validation method one more time for a more realistic estimate.

In [31]:
from sklearn.model_selection import cross_val_score

# scikit-learn returns negated MSE, so flip the sign before taking the root
scores = cross_val_score(forest_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)
# note: this is the hold-out r-squared from the previous cell, not a cross-validated score
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    print("R-Squared:", r2_test)

display_scores(forest_rmse_scores)

Scores: [ 938.17362663 1321.49445757  929.87085945  625.46036564  878.25550382
 1043.35416649 1543.07519308  834.44839068 1507.62039414  990.03364801]
Mean 1061.1786605513507
Standard Deviation 284.9154322249237
R-Squared: 0.9998648961601764


Compared with the single decision tree, the random forest lowered both the mean cross-validated RMSE (1061.2 versus 1478.4) and its standard deviation (284.9 versus 442.5). Once again, I am comparing these scores to those from the linear regression model as follows:

In [32]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562772 978.49089723 708.25187258 389.29477414 908.56488605
 505.32560637 741.94581029 766.55343858 671.3117333  507.85427655]
Mean 705.1488922789055
Standard Deviation 181.51221092233794
R-Squared: 0.9998648961601764


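So across the 10 folds of K-Fold Cross Validation, the linear regression model has the lowest mean RMSE (705.1) and the smallest standard deviation (181.5) of the three models, followed by the random forest (1061.2) and the single decision tree (1478.4).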

Saving the model as a pickle file will ensure some consistency when comparing scores, parameters and hyperparameters, and enable me to start where I left off!

I first need to import pickle (joblib is a common alternative but is not used in this cell).

In [None]:
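import pickle

# save the fitted model (assumes tree_reg, the decision tree fitted above, is the one to keep)
with open('decision_tree.pickle','wb') as f:
    pickle.dump(tree_reg, f)

# load it back; the context manager closes the file afterwards
with open('decision_tree.pickle','rb') as f:
    tree_reg = pickle.load(f)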
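
The same round trip can be done with joblib, which is often used for scikit-learn models. This is a minimal sketch under stated assumptions: the model is a small tree fitted on synthetic data, and the file name decision_tree.joblib is hypothetical. It writes to a temporary directory so nothing is left on disk:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Small model on synthetic data (assumption, stands in for the fitted tree)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 4))
y_demo = X_demo.sum(axis=1)
demo_tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "decision_tree.joblib")
    joblib.dump(demo_tree, model_path)
    restored_tree = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(
    demo_tree.predict(X_demo), restored_tree.predict(X_demo)
)
print(same_predictions)
```

As with pickle, a saved model should only be loaded from a trusted source and ideally with the same scikit-learn version that created it.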