# Decision Tree Models
A Decision Tree model can capture complex non-linear relationships between variables and can be used for regression, classification or multi-output tasks. In this case I am looking for price prediction (a regression task) given a relatively small number of features.

The first step involves importing the relevant packages and the Iris dataset.

In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:,2:]
y = iris.target
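The cell above imports `DecisionTreeClassifier` but does not fit it. As a minimal sketch of what that warm-up step could look like, the block below trains a shallow classifier on the two petal features. The `max_depth=2` and `random_state=42` settings and the `_iris` variable names are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_iris = iris.data[:, 2:]   # petal length and petal width only
y_iris = iris.target

# max_depth=2 and random_state=42 are illustrative choices (assumptions)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, y_iris)

# accuracy on the same data the tree was trained on
train_accuracy = tree_clf.score(X_iris, y_iris)
print(tree_clf.get_depth(), round(train_accuracy, 3))
```

Limiting the depth keeps the tree small enough to inspect and is one simple way to reduce overfitting.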



## Predictions
Looking at 'Actual' values for y:

In [17]:
y = df["Close"]
y.head()

0     9763.94
1    10096.28
2    10451.16
3    10642.81
4    10669.64
Name: Close, dtype: float64

Now look at the 'Predicted' values for y:

In [18]:
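# predict y from X with the fitted linear regression model
# (stored under a new name so the actual 'Close' values in y are not overwritten)
y_hat = lin_reg.predict(X)

# convert the predictions to a DataFrame
y_pred = pd.DataFrame(y_hat)

# view the first 5 rows
y_pred.head()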

              0
0   9792.447701
1  10045.060832
2  10382.778250
3  10609.762775
4  10693.638154


Below I have calculated y based on the following Multiple Regression formula:

    y = a + b1x1 + b2x2 + b3x3 + b4x4

So based on all the data contained within the matrix of values in 'X', how can I predict values in the vector 'y'? Once I have established the linear relationship between the dependent and independent variables, I can summarize it in the equation above. To illustrate, I have chosen the first row of values from the dataset, which shows the following data:

a = 104.14703757539974

b1 = -4.32747159e-01

x1 = 'Open' Price = 9718.07

b2 = 9.17019324e-01

x2 = 'High' Price = 9838.33

b3 = 5.08535341e-01

x3 = 'Low' Price = 9728.25

b4 = -1.62880798e-09

x4 = 'Volume' = 46,248,428,075

Remember, I am only using 'Open', 'High', 'Low' and 'Volume' features for values x1 to x4. The y (dependent variable) value for the 'Close' Price is based on these determinants (independent variables). So, plugging these values into the formula above produces an estimate for the 'Close' price I am using for the target variable.

In [19]:
y = 104.14703757539974 + (-4.32747159e-01*(9718.07)) + (9.17019324e-01*(9838.33)) + (5.08535341e-01*(9728.25)) + (-1.62880798e-09*(46248428075))
print(y)

9792.447702373423
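The same arithmetic can be written as a dot product between the coefficient vector and the feature row, which is closer to what a fitted linear model does internally. This is only a sketch: the numbers are the intercept, coefficients and first-row values quoted above, while the variable names (`intercept`, `coefs`, `first_row`) are my own.

```python
import numpy as np

# intercept (a) and coefficients (b1..b4) as quoted above
intercept = 104.14703757539974
coefs = np.array([-4.32747159e-01, 9.17019324e-01, 5.08535341e-01, -1.62880798e-09])

# first row of the dataset: Open, High, Low, Volume
first_row = np.array([9718.07, 9838.33, 9728.25, 46248428075])

# y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4
y_hat = intercept + coefs @ first_row

# difference between the actual 'Close' of the first row and the estimate
residual = 9763.94 - y_hat
print(y_hat, residual)
```

The residual shows how far the estimate is from the actual 'Close' price for this one row.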


Comparing the first entries of the actual 'y' value and the 'y_pred' estimate from Linear Regression gives me:

In [20]:
y = df["Close"]
print(y[:1])

0    9763.94
Name: Close, dtype: float64


And the predicted value:

In [21]:
y = lin_reg.predict(X)
y_pred = pd.DataFrame(y)
print(y[:1])

[9792.44770128]


Finally, comparing this linear regression prediction to the result of the multiple regression formula for the first entry in the target data column:

In [22]:
y = 104.14703757539974 + (-4.32747159e-01*(9718.07)) + (9.17019324e-01*(9838.33)) + (5.08535341e-01*(9728.25)) + (-1.62880798e-09*(46248428075))
print(y)

9792.447702373423


# Decision Tree Model Selection
Next, it's time to apply a Decision Tree model to the entire dataset before seeking further improvement.

In [29]:
from sklearn.tree import DecisionTreeRegressor

# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)

DecisionTreeRegressor()

In [30]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [31]:
# predict
y_pred = tree_reg.predict(X_train)

Now trying a prediction with the fitted decision tree model (first 5 values):

In [32]:
print(y_pred[:5])

[14452.49 23303.57 43794.73 30525.81 14984.18]


Measuring the RMSE and r-squared score for the decision tree model (based on training set):

In [33]:
# RMSE and r-squared of the decision tree on the training set
y_pred = tree_reg.predict(X_train)
tree_mse = mean_squared_error(y_train, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

0.0
1.0
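For reference, both metrics can be computed by hand. The sketch below uses plain NumPy; the helper names `rmse` and `r_squared` are hypothetical, and the sample values are the first five actual and linear-regression-predicted 'Close' prices shown earlier.

```python
import numpy as np

def rmse(y_true, y_hat):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))

def r_squared(y_true, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

actual = [9763.94, 10096.28, 10451.16, 10642.81, 10669.64]
predicted = [9792.447701, 10045.060832, 10382.77825, 10609.762775, 10693.638154]

print(rmse(actual, predicted), r_squared(actual, predicted))
# a model that reproduces every target exactly gives RMSE 0 and r-squared 1
print(rmse(actual, actual), r_squared(actual, actual))
```

An RMSE of exactly 0 together with an r-squared of exactly 1 therefore means every prediction matched its target.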


A perfect RMSE of 0.0 and r-squared of 1.0 on the training set strongly suggests overfitting: an unconstrained decision tree can memorize its training data. Let's see if there is a different outcome for the test data.

In [34]:
# RMSE and r-squared of the decision tree on the test set
y_pred = tree_reg.predict(X_test)
tree_mse = mean_squared_error(y_test, y_pred)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

0.0
1.0


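This is not a realistic result either. Because `tree_reg` was fitted on the entire dataset in cell 29, before the train/test split, the test rows were part of the training data and the tree simply reproduces their targets. To get a more realistic estimate I will divide the training data into several smaller training and validation sets and fit the decision tree on each. This is done using K-Fold Cross Validation.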
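A minimal sketch of the usual order of operations is shown here: split first, fit on the training rows only, then score on the held-out rows. It uses synthetic data rather than the price dataset, and the names `X_demo`, `y_demo` and `demo_tree` are placeholders I have introduced for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (assumption): 4 features, noisy linear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(300, 4))
y_demo = X_demo @ np.array([0.5, 1.0, -0.3, 0.2]) + rng.normal(0, 5, size=300)

# 1. split BEFORE fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# 2. fit on the training rows only
demo_tree = DecisionTreeRegressor(random_state=0)
demo_tree.fit(X_tr, y_tr)

# 3. evaluate on both sets
train_rmse = np.sqrt(mean_squared_error(y_tr, demo_tree.predict(X_tr)))
test_rmse = np.sqrt(mean_squared_error(y_te, demo_tree.predict(X_te)))
print(train_rmse, test_rmse)
```

With this ordering the unconstrained tree still memorizes the training set, but the test error is no longer zero, which makes the overfitting visible.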

## Cross Validation
This method evaluates the Decision Tree model by splitting the training set into several smaller training and validation sets, then training and evaluating the model on each split in turn. This is achieved with the K-fold cross validation technique; I have split the data into 10 folds with cv=10 (which can be changed).

In [35]:
from sklearn.model_selection import cross_val_score

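# 10-fold CV on the training set; scikit-learn returns negative MSE,
# so the sign is flipped before taking the square root
scores = cross_val_score(tree_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

# note: this is the hold-out r-squared from the previous cell, not a cross-validated score
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    print("R-Squared:", r2_test)  # prints the global r2_test, whatever it currently holds

display_scores(tree_rmse_scores)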

Scores: [1415.09508632 2714.23149461 1067.56332875 1407.46036783 1176.49410733
 1584.0568291  1880.06305708 1647.54752627 1004.42327796 1269.42876476]
Mean 1516.6363840000342
Standard Deviation 474.4537413956038
R-Squared: 1.0
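To make the K-fold procedure concrete, the sketch below runs the folds by hand with `KFold` and compares the result with `cross_val_score`. The data is synthetic, and `max_depth=3` and `random_state=0` are illustrative settings chosen so both runs are deterministic; none of these values come from the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data (assumption)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(0, 1, size=200)

kfold = KFold(n_splits=10)  # 10 consecutive folds, no shuffling

manual_rmse = []
for train_idx, val_idx in kfold.split(X_demo):
    # a fresh model is fitted on 9 folds and scored on the remaining fold
    fold_model = DecisionTreeRegressor(max_depth=3, random_state=0)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(mean_squared_error(y_demo[val_idx], fold_pred)))
manual_rmse = np.array(manual_rmse)

# the same evaluation through cross_val_score
model = DecisionTreeRegressor(max_depth=3, random_state=0)
neg_mse = cross_val_score(model, X_demo, y_demo, scoring="neg_mean_squared_error", cv=kfold)
auto_rmse = np.sqrt(-neg_mse)

print(manual_rmse.mean(), auto_rmse.mean())
```

Because every fold refits the model from scratch, the validation rows in each fold are never seen during that fold's training.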


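Comparing the scores from cross validation to those from the linear regression model (the `R-Squared` line below repeats the decision tree's test-set value from cell 34; it is not computed for the linear model):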

In [36]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562556 978.49089698 708.25187252 389.29477539 908.56489005
 505.32560623 741.94580962 766.5534399  671.31173392 507.85427701]
Mean 705.1488927197572
Standard Deviation 181.51221089972861
R-Squared: 1.0


# Random Forest Model Selection
Next I will try the Random Forest Regressor model to try to improve on these scores. A Random Forest model often provides a more accurate prediction than a single tree because it averages the predictions of many individual decision tree models.

In [37]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(X, y)

RandomForestRegressor()
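The idea that a random forest aggregates individual trees can be checked directly: for a `RandomForestRegressor`, the forest prediction is the mean of the predictions of the fitted trees in `estimators_`. The sketch below uses synthetic data, and `n_estimators=25` is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic stand-in data (assumption)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(150, 4))
y_demo = X_demo[:, 0] * 2.0 + X_demo[:, 1] + rng.normal(0, 1, size=150)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# one row of predictions per individual tree
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# averaging over the trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_pred = forest.predict(X_demo)
print(np.allclose(manual_average, forest_pred))
```

Each tree is trained on a bootstrap sample of the rows, so the individual predictions differ and the averaging smooths them out.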

In [38]:
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [39]:
# predict
y_pred = forest_reg.predict(X_train)

Now trying a prediction with the fitted random forest model (first 5 values):

In [40]:
print(y_pred[:5])

[14427.1194 23400.7579 43748.8539 30502.0843 15017.7404]


Measuring the RMSE and r-squared score for the random forest model (based on training set):

In [41]:
# RMSE and r-squared of the random forest on the training set
y_pred = forest_reg.predict(X_train)
forest_mse = mean_squared_error(y_train, y_pred)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

r2_train = r2_score(y_train, y_pred)
print(r2_train)

402.3427324133909
0.9996758897154432


In [42]:
# RMSE and r-squared of the random forest on the test set
y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_test, y_pred)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)

r2_test = r2_score(y_test, y_pred)
print(r2_test)

242.04766926917475
0.9998574989734604


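The test RMSE is lower than the training RMSE, but `forest_reg` was fitted on the full dataset (cell 37) before the split, so the test rows were seen during training and these scores do not show how well the model generalizes. I will use the cross-validation method one more time for a fairer estimate.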

In [43]:
from sklearn.model_selection import cross_val_score

# 10-fold CV of the random forest on the training set
scores = cross_val_score(forest_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)

# note: this is the hold-out r-squared from the previous cell, not a cross-validated score
r2_test = r2_score(y_test, y_pred)

def display_scores(scores):
    print("Scores:", scores)
    print("Mean", scores.mean())
    print("Standard Deviation", scores.std())
    print("R-Squared:", r2_test)

display_scores(forest_rmse_scores)

Scores: [1062.74035202 1326.1235471   942.74730148  755.70142066  872.07362614
 1047.05694763 1592.88011148  829.92991833 1560.36462776  929.46166385]
Mean 1091.9079516445825
Standard Deviation 284.0380638169419
R-Squared: 0.9998574989734604


Under 10-fold cross validation, the random forest lowers both the mean RMSE (about 1092 versus 1517) and the standard deviation (about 284 versus 474) relative to the single decision tree. Once again, I am comparing these scores to those from the linear regression model:

In [44]:
lin_scores = cross_val_score(lin_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

Scores: [873.89562556 978.49089698 708.25187252 389.29477539 908.56489005
 505.32560623 741.94580962 766.5534399  671.31173392 507.85427701]
Mean 705.1488927197572
Standard Deviation 181.51221089972861
R-Squared: 0.9998574989734604
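The `R-Squared` line printed by `display_scores` is a single hold-out value rather than a per-model cross-validated score. One possible way to get RMSE and r-squared from the same folds is `cross_validate` with two scorers, sketched below on synthetic data. The helper `cv_summary` is hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

def cv_summary(estimator, X, y, cv=10):
    """Mean/std of RMSE and mean r-squared over the same CV folds."""
    results = cross_validate(
        estimator, X, y, cv=cv,
        scoring=("neg_root_mean_squared_error", "r2"),
    )
    rmse = -results["test_neg_root_mean_squared_error"]  # flip the sign back
    r2 = results["test_r2"]
    return {"rmse_mean": rmse.mean(), "rmse_std": rmse.std(), "r2_mean": r2.mean()}

# synthetic stand-in data with a linear signal (assumption)
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(200, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 1, size=200)

lin_summary = cv_summary(LinearRegression(), X_demo, y_demo)
tree_summary = cv_summary(DecisionTreeRegressor(random_state=0), X_demo, y_demo)
print(lin_summary)
print(tree_summary)
```

On data with a mostly linear signal such as this, the linear model would be expected to score better than an unconstrained tree, which is in line with the comparison above.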


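So, of the three models evaluated with 10-fold cross validation, linear regression has produced the lowest mean RMSE (about 705) and the smallest standard deviation (about 182) so far, ahead of the random forest (about 1092) and the single decision tree (about 1517).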

Saving the trained model as a pickle file will ensure some consistency when comparing scores, parameters and hyperparameters, and enable me to start where I left off!

I first need to import pickle and joblib.

In [None]:
# pickle file to go here
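Until that cell is filled in, here is a minimal sketch of how saving and restoring a model with `joblib` might look. The model is trained on synthetic data, and the file name `tree_reg.joblib` and the temporary directory are hypothetical choices for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in data and model (assumption)
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 10, size=(100, 4))
y_demo = X_demo.sum(axis=1)
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "tree_reg.joblib")  # hypothetical file name
    joblib.dump(model, model_path)       # write the fitted model to disk
    restored = joblib.load(model_path)   # read it back

# the restored model should give identical predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A saved model should only be loaded with a compatible scikit-learn version, which is one reason for recording the dependencies in the next section.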

# Print Dependencies
Dependencies are fundamental to record the **computational environment**.

- Use [watermark](https://github.com/rasbt/watermark) to print the versions of Python, IPython and packages, and the characteristics of the computer

In [45]:
%load_ext watermark

# python, ipython, packages, and machine characteristics
%watermark -v -m -p wget,pandas,numpy,watermark,tarfile,urllib3,matplotlib,seaborn,sklearn 

# date
print (" ")
%watermark -u -n -t -z 

ModuleNotFoundError: No module named 'watermark'
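The error above means the `watermark` package is not installed in this environment; installing it (for example with `pip install watermark`) should let the cell run. As a fallback sketch that needs only the standard library, the block below prints versions with `importlib.metadata`. The helper `package_versions` is hypothetical, and note that scikit-learn is looked up under its distribution name `scikit-learn` rather than its import name `sklearn`.

```python
import platform
import sys
from importlib import metadata

def package_versions(names):
    """Return {distribution name: version string or 'not installed'}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# Python version and machine characteristics
print("python", sys.version.split()[0])
print("platform", platform.platform())

# package versions, looked up by distribution name
wanted = ["pandas", "numpy", "matplotlib", "seaborn", "scikit-learn"]
for name, version in package_versions(wanted).items():
    print(name, version)
```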