# Polynomial Regression
The relationship between the dependent and independent variables appears to be strongly linear, although the 'Volume' of bitcoin bought and sold each day is less strongly associated with the daily 'Close' price. Fitting a Linear Regression line to the data may be accurate enough in this case, with an R2 value of 0.9991392014437468 and an RMSE of 689.1925598643533. However, out of curiosity I decided to see whether a Polynomial function could fit the data slightly better and capture some of the remaining variance in this seemingly linear relationship.

In [1]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
df = pd.DataFrame(bitcoin).dropna(axis=0)

In [2]:
# all column names
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [3]:
# all column data types
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

## Feature Engineering and Model Selection
Assign the dependent (y) variable and the independent (X) variables for the modelling process.

In [4]:
# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

### Splitting the Data
The data for the Polynomial Features model is split into training and test sets with a 70-30 split. Make a copy of the dataframe first.

In [5]:
df = df.copy()

In [6]:
from sklearn.model_selection import train_test_split

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Training the model
The following method is one way for training and testing the polynomial function, but first import the relevant libraries and instantiate the model.

In [7]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import numpy as np

# initialize the model for a given degree
poly_features = PolynomialFeatures(degree=2, include_bias=False)
  
# transforms the existing features to higher degree features.
X_train_poly = poly_features.fit_transform(X_train)

Indexing can be used to return any row of X, for example the first row:

In [8]:
X[:1]

| | Open | High | Low | Volume |
|---|---|---|---|---|
| 0 | 9718.07 | 9838.33 | 9728.25 | 46248430000.0 |


Repeating this for the data contained in the 'X_train_poly' set returns 14 values, so the transformer has created 10 new features for a total of 14. The new features are the pairwise products of the original features (interaction terms) and the square of each feature. Both the original feature values for 'X1' to 'Xn' and the new second-degree values from 'X_train_poly' are returned in this example. Note that the first row of the shuffled training set is not the same observation as the first row of X, which is why the leading values differ.

In [9]:
X_train_poly[:1]

array([[1.44019200e+04, 1.45237600e+04, 1.42865400e+04, 3.47848792e+10,
        2.07415300e+08, 2.09170030e+08, 2.05753606e+08, 5.00969048e+14,
        2.10939605e+08, 2.07494278e+08, 5.05207238e+14, 2.04105225e+08,
        4.96955569e+14, 1.20998782e+21]])
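To make the count of 14 easier to see, here is a minimal sketch on a made-up row of four small numbers (the values `1, 2, 3, 4` and the names `toy_row` and `expander` are assumptions for illustration, not the bitcoin data). With `include_bias=False`, the number of terms of degree 1 and 2 in four variables is C(4 + 2, 2) - 1 = 14.

```python
import numpy as np
from math import comb
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy row with four features standing in for Open, High, Low, Volume
toy_row = np.array([[1.0, 2.0, 3.0, 4.0]])

expander = PolynomialFeatures(degree=2, include_bias=False)
expanded = expander.fit_transform(toy_row)

# Number of monomials of degree 1..2 in 4 variables: C(4 + 2, 2) - 1
n_features, degree = toy_row.shape[1], 2
expected_columns = comb(n_features + degree, degree) - 1

print(expanded.shape)      # (1, 14)
print(expected_columns)    # 14
# Order: the 4 originals, then x0^2, x0*x1, x0*x2, x0*x3, x1^2, x1*x2, ...
print(expanded[0])
```

The column order matches the array above: the four original values come first, followed by each square and pairwise product.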

Saving the expanded polynomial features to a CSV formatted file (or an Excel file) preserves them so that the newly expanded features can be viewed in a fresh table.

In [10]:
# wrap the expanded array in a dataframe so the polynomial features, not the original data, are written
pd.DataFrame(X_train_poly).to_csv(r'C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/X_train_poly.csv', index=False, header=True)

The first set of metrics shows a slightly higher RMSE for the training set than for the test set, but also a slightly higher R2 (degree of fit) with the polynomial features model.

Also, the bias term (intercept) and the coefficients are both attributes of the fitted LinearRegression() model, so I can examine them directly:

In [12]:
poly_model.intercept_
poly_model.coef_

array([ 2.04746753e-11,  1.34619966e-08,  7.63252662e-11,  3.95073614e-08,
       -6.83473603e-06, -5.82358084e-07, -8.40449336e-07, -1.49894184e-12,
        5.83557909e-06,  5.55122986e-06,  2.54559597e-12,  5.15929020e-06,
        2.50892933e-12, -4.61371567e-19])

## Applying Regularization
To avoid overfitting, regularization can be applied, although it requires some manual tuning of hyperparameters. To see if I can improve on the linear regression model's ability to predict the target variable, I have decided to use Ridge Regression.
The regularization term penalizes the model by adding the sum of the squared coefficients (so positive and negative values count equally), scaled by alpha, to the cost function, which helps reduce the parameter weights overall. This term is only used during the training phase, so the model is tested and evaluated with an unregularized performance measure.
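As a rough illustration of the penalty, the sketch below solves ridge regression in closed form with NumPy on synthetic data (the data, the helper names `ridge_weights` and `rmse`, and the omission of an intercept are all assumptions made to keep the example short). It shows the squared size of the weights shrinking as alpha grows, while the evaluation metric itself carries no penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only (not the bitcoin data)
X = rng.normal(size=(50, 3))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)   # two nearly collinear columns
y = X @ np.array([2.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)

def ridge_weights(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_cols = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_cols), X.T @ y)

def rmse(y_true, y_pred):
    """Unpenalized error, used for evaluation only."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

alphas = [0.0, 1.0, 10.0, 100.0]
norms = []
for alpha in alphas:
    w = ridge_weights(X, y, alpha)
    norms.append(np.sum(w ** 2))          # the quantity the penalty is built from
    print(alpha, w.round(3), round(rmse(y, X @ w), 4))
```

Setting alpha to 0 recovers ordinary least squares; larger values trade a little extra error for smaller weights.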

In [None]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
df = pd.DataFrame(bitcoin).dropna(axis=0)

Checking the first 2 rows:

In [None]:
df.head(2)

Assign the 'Close' prices to the dependent (target) variable.

In [None]:
y = df["Close"]
print(y)

In [None]:
# select features 
X = df[["Open", "High", "Low", "Volume"]]
print(X)

## Splitting the Data
Using a 70-30 split for the training and test sets.

In [None]:
df = df.copy()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Printing out the shape of the training sets gives:

In [None]:
print(X_train.shape)
print(y_train.shape)

And the shape of the test data:

In [None]:
print(X_test.shape)
print(y_test.shape)

## Model Selection
### Ridge Regression
Trying a slightly different type of regression model, with the regularization strength hyperparameter alpha set to 1.0, to see if I can reduce the error value (RMSE) and increase the accuracy score (R2). The purpose of using a ridge regression model is to shrink the coefficient values of the various features (especially where there is high multicollinearity between predictors); this increases bias slightly but should decrease variance. Ridge regression shrinks coefficients towards zero without setting them exactly to zero, and it tends to spread the weight across highly correlated predictors rather than assigning it all to one of them.

In [None]:
# instantiate model
ridge_regression = linear_model.Ridge(alpha=1.0)

# fit model
ridge_regression.fit(X_train, y_train)
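Rather than fixing alpha at 1.0 by hand, one option is to let cross-validation pick it from a short list. The sketch below uses `RidgeCV` on synthetic stand-ins for the four features; the data, the candidate list and the added `StandardScaler` step are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for Open, High, Low, Volume (hypothetical values)
open_ = rng.uniform(10_000, 80_000, size=200)
high = open_ * rng.uniform(1.00, 1.05, size=200)
low = open_ * rng.uniform(0.95, 1.00, size=200)
volume = rng.uniform(1e10, 1e11, size=200)
X_demo = np.column_stack([open_, high, low, volume])
y_demo = 0.5 * high + 0.5 * low + rng.normal(0, 100, size=200)

# Scaling first puts the penalty on comparable terms across features
candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = make_pipeline(StandardScaler(), RidgeCV(alphas=candidate_alphas))
search.fit(X_demo, y_demo)

best_alpha = search.named_steps["ridgecv"].alpha_
print("alpha chosen by cross-validation:", best_alpha)
```

Scaling matters here because 'Volume' is several orders of magnitude larger than the price columns, so an unscaled penalty would treat the features very unevenly.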

The regularization term should only be added to the cost function during the training phase. Now that the model has been fit to the training data, the next step is to measure its performance on the test set using an unregularized metric. First, making a prediction on the first 5 values in the test set.

In [None]:
price_predictions = ridge_regression.predict(X_test)
print("Predictions: ", ridge_regression.predict(X_test.iloc[:5]))

Now try a prediction by inputting my own values.

In [None]:
# predicting price based on Open = C$35,000, High = C$40,000, Low = C$32,000 and Volume = 100bn
ridge_regression.predict([[35000, 40000, 32000, 100000000000]])

This gives the ridge regression prediction for the target output variable, y, based on the input variables specified: a value of C$37,990.99.

## Model Validation Metrics
Now to measure the error score and accuracy of the line of fit.
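For reference, both metrics can be written out by hand. This is a small NumPy sketch with made-up numbers (the function names `rmse` and `r_squared` and the sample values are assumptions); scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the typical size of a prediction error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """R2: 1 minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: three perfect predictions and one that is off by 2
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]

print(rmse(actual, predicted))        # 1.0
print(r_squared(actual, predicted))   # about 0.2
```

RMSE is in the same units as the target (C$ here), whereas R2 is unitless, which is why the two can move differently between data sets.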

The polynomial expansion has produced a new data matrix containing the original features plus their squares and pairwise products, so the fitted parameter weights (or coefficients) now describe a quadratic equation. The linear regression model should now be applied to this newly expanded matrix of polynomial features.

In [11]:
# fit the transformed features to Linear Regression
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)
  
# predicting on training data-set
y_train_pred = poly_model.predict(X_train_poly)
  
# predicting on test data-set (reuse the transformer fitted on the training data)
y_test_pred = poly_model.predict(poly_features.transform(X_test))
  
# evaluating the model on training dataset
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)
  
# evaluating the model on test dataset
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)
  
print("The model performance for the training set")
print("-------------------------------------------")
print("RMSE of training set is {}".format(rmse_train))
print("R2 score of training set is {}".format(r2_train))
  
print("\n")
  
print("The model performance for the test set")
print("-------------------------------------------")
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))

The model performance for the training set
-------------------------------------------
RMSE of training set is 3087.9523217240453
R2 score of training set is 0.9809084291955809


The model performance for the test set
-------------------------------------------
RMSE of test set is 2941.2787067558293
R2 score of test set is 0.9789579261471753


In [None]:
print("R-squared: ", ridge_regression.score(X_test, y_test))

Producing a scatter plot of the data points to display actual vs predicted prices and the regression line of best fit.

In [None]:
import seaborn as sns
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, price_predictions)
print(mse)

rmse = np.sqrt(mse)
print(rmse)

# set the width and height of the plot
plt.figure(figsize=(6,6))

# visualizing the relationship between actual and predicted values for y
plt.scatter(y_test, price_predictions)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

# overlay a regression line of predicted on actual prices to show the linear relationship
sns.regplot(x=y_test, y=price_predictions)

In [None]:
print(ridge_regression.intercept_)
print(ridge_regression.coef_)

## Using a Pipeline
The next method involves placing the expanded polynomial features and linear regression of these within a pipeline which can be trained and used to predict the target variable.

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('reg', LinearRegression(normalize=True))
    ])

pipeline.fit(X_train, y_train)
y_train_pred = pipeline.predict(X_train)
print(y_train_pred)

[14451.8497609  23645.4431646  43301.74970473 30554.99600358
 15029.58660573 60042.6849235  68522.7234905  70031.13742336
 12498.2880599   9939.39361695 13657.77434716 11749.97202266
 43800.25035275 76050.50530402 24758.5564535  45121.02936203
 13818.00130113 68906.63435158 46736.12879176 36745.01394664
 12467.52975303 74031.13180218 12375.78903863 43236.84905855
 14784.26699683 12410.68817201 23710.75823537 22451.13791809
 78792.37811112 60853.4173909  72334.44551747 15544.24736262
 12988.90680838 71691.97288227 62558.40694316 12321.08862602
 13270.73561795 73851.90407614 14015.30226863 27043.84261296
 13345.17751371 49408.93141789 73175.32006717 59976.02708372
 65084.18982501 13787.60716279 49952.71031336 15063.04015456
 14231.86121981 14136.57949379 13380.78833418 46467.65757751
 14988.0839077  29755.92348715 35289.7715675  12190.98039995
 24786.27773596 24183.61320924 23619.1359942  23957.79504429
 13112.54649684 76766.97433793 15125.13946685 12746.8912965
 16082.55847323 14182.731
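In recent scikit-learn releases `LinearRegression` no longer accepts the `normalize` argument, so the cell above may raise an error there. A possible equivalent, sketched below on synthetic data, is to add a `StandardScaler` step between the polynomial expansion and the regression. The data and the names `X_demo`, `y_demo` and `scaled_pipeline` are assumptions for illustration, and because `StandardScaler` scales differently from the old `normalize` option, results may not match the outputs shown above exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for Open, High, Low, Volume (hypothetical values)
open_ = rng.uniform(10_000, 80_000, size=300)
high = open_ * rng.uniform(1.00, 1.05, size=300)
low = open_ * rng.uniform(0.95, 1.00, size=300)
volume = rng.uniform(1e10, 1e11, size=300)
X_demo = np.column_stack([open_, high, low, volume])
y_demo = 0.5 * high + 0.5 * low + rng.normal(0, 100, size=300)

scaled_pipeline = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),      # stands in for normalize=True
    ("reg", LinearRegression()),
])

scaled_pipeline.fit(X_demo, y_demo)
train_r2 = scaled_pipeline.score(X_demo, y_demo)
print("R2 on the data used for fitting:", train_r2)
```

Scaling after the expansion keeps terms such as Volume squared (around 1e21 in the array shown earlier) on a comparable footing with the price columns, which helps the least-squares solve numerically.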

Next I wanted to check the shapes of the arrays involved. The polynomial expansion happens inside the pipeline, so I want to make sure the features, targets and predictions all have the same number of rows (m), otherwise the metrics cannot be computed.

In [14]:
print(X_train.shape)

(253, 4)


In [15]:
print(y_train.shape)

(253,)


In [16]:
print(y_train_pred.shape)

(253,)


In [17]:
# evaluating the model on training dataset
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

# print the RMSE metric and R2 accuracy score
print("RMSE of training set is {}".format(rmse_train))
print("R2 score of training set is {}".format(r2_train))

RMSE of training set is 600.3385643743142
R2 score of training set is 0.9992784058980032


These are the metrics from using the 2nd order (quadratic) terms on the training set, a significant improvement on the training metrics from the first method (an RMSE of about 3088).

In [18]:
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('reg', LinearRegression(normalize=True))
    ])

pipeline.fit(X_test, y_test)
y_test_pred = pipeline.predict(X_test)
print(y_test_pred)

[17778.77021386 40990.59151141 13213.10020101 13576.72568197
 22115.23563654 21459.50220266 73861.87152871 14482.66725764
 15325.35778916 10921.55020283 14274.74834326 14438.43409003
 18386.97114914 12783.33172331 15507.84960687 13768.14202956
 62068.22550618 61523.26190959 18361.33212892 13799.51185826
 42156.59079595 24124.95329311 70694.4849419  64366.74010853
 23930.90300245 34098.65495034 12485.03323    29102.95124855
 19805.8878612  49418.14231437 15663.74751615 50940.73940904
 12929.15492768 12956.75113655 12923.50518806 10831.35739012
 12813.71474303 77223.50140964 65394.44079378 47005.1660213
 24362.35252415 15492.08697297 71587.49574098 14875.08844024
 12236.98139105 12649.13911789 12504.66919562 75497.99349968
 24610.69635306 12861.14651084 20401.00988547 15076.34799577
 12548.09785747 24769.10556438 12427.4554853  12911.48145233
 12624.16070844 74538.2174557  12543.91326383 30184.58493709
 65543.99852979 10963.76317933 22591.56408213 23293.72873672
 14522.89529343 12953.935

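The metrics for the quadratic equation on the test set data. Because the pipeline in In [18] is refit on the test set before predicting, these are in-sample scores for the test data rather than a measure of how the trained model generalizes.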

In [19]:
# evaluating the model on test dataset
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

# print the RMSE metric and R2 accuracy score
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))

RMSE of test set is 358.82848984421446
R2 score of test set is 0.999686822886171


In [20]:
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('reg', LinearRegression(normalize=True))
    ])

pipeline.fit(X_train, y_train)
y_train_pred = pipeline.predict(X_train)
print(y_train_pred)

# evaluating the model on training dataset
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

# print the RMSE metric and R2 accuracy score
print("\n")
print("RMSE of training set is {}".format(rmse_train))
print("R2 score of training set is {}".format(r2_train))

[14479.41696141 23819.02394734 43655.12187891 30396.14781215
 15035.22662708 60189.00993985 68511.85216449 70247.29157628
 12560.78809674 10099.53593345 13482.88754155 11888.85501207
 44075.1148766  76435.57188812 24963.162367   44189.78842616
 13469.08234185 68784.97140446 46661.39198147 36715.76453925
 12499.96718594 73948.31263382 12449.53953509 43600.57913168
 14945.02546396 12472.72113217 23630.8851514  22235.02016008
 79885.01627185 60939.35012036 72449.25051133 15522.93750821
 13034.4647546  71017.50383388 62032.5181194  12343.39844425
 13256.47144793 73868.00592164 14052.45186174 26860.76385027
 13347.46665167 48717.26976134 72925.00054193 59921.89277069
 64459.3874636  13779.11144849 50724.43876846 15074.83651556
 14150.85721539 14176.32823545 13421.15125324 45895.63598491
 15000.02074448 29803.06996917 35726.82401959 12238.36312613
 24634.79429473 24171.98879771 23563.36512013 23873.32699919
 13172.18287058 76622.26844757 15073.22177655 12814.10965542
 16044.45228986 14169.23

In [21]:
print(X_test.shape)

(109, 4)


In [22]:
print(y_test.shape)

(109,)


In [23]:
print(y_test_pred.shape)

(109,)


In [24]:
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('reg', LinearRegression(normalize=True))
    ])

pipeline.fit(X_test, y_test)
y_test_pred = pipeline.predict(X_test)
print(y_test_pred)

# evaluating the model on test dataset
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

# print the RMSE metric and R2 accuracy score
print("\n")
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))

[17999.87314506 41036.3726139  13056.40121838 13701.67317722
 22104.10186216 21714.3055008  73907.70050981 14422.50682
 15431.55230628 10866.27647892 14352.4288145  14352.70405383
 18384.19533799 12735.49927782 15701.27835738 13911.83559148
 62393.71791759 61668.81693837 18559.98004196 13732.5156679
 42587.85956154 24442.33674872 70804.13719409 64096.27929078
 24011.81019671 34328.62649579 12545.3336179  29153.90783245
 20217.50342343 48763.65288448 15675.04614768 50771.27927526
 13092.30456592 13055.79483562 12701.64023711 10779.658345
 12585.23232223 77044.25461137 65833.6725336  47346.27508013
 24519.62352237 15380.70780456 71901.52569645 14866.88567717
 12030.49586924 12450.2152917  12392.79375211 75513.34479543
 24737.82662495 12669.71064957 20397.82633078 15235.04099928
 12781.84989068 24784.40035776 12447.43422649 12898.16822852
 12419.72011144 74452.24768999 12424.16441931 30325.84505421
 66302.90614064 10765.99838817 22746.28375906 23522.44763902
 14360.30785163 12896.10323269

The metrics from using an equation with 3rd order cubic coefficients and terms on the test set produced the best overall scores so far.

## Model Validation
So the fit of the regression forecast line to the data has increased marginally in accuracy with the degree of the polynomial (the expansion of terms), and the RMSE has reduced fairly considerably. The R-squared score appears to increase with smaller sets of data such as the test sets, although this comparison is between pipelines that were fit separately on each set.
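One way to compare polynomial degrees on an equal footing is k-fold cross-validation, where every score comes from rows the model was not fit on. The sketch below does this for degrees 1 to 3 on synthetic data; the data, the 5-fold setting, the `StandardScaler` step and the names used are assumptions for illustration, not the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for Open, High, Low, Volume (hypothetical values)
open_ = rng.uniform(10_000, 80_000, size=300)
high = open_ * rng.uniform(1.00, 1.05, size=300)
low = open_ * rng.uniform(0.95, 1.00, size=300)
volume = rng.uniform(1e10, 1e11, size=300)
X_demo = np.column_stack([open_, high, low, volume])
y_demo = 0.5 * high + 0.5 * low + rng.normal(0, 100, size=300)

cv_rmse = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    # scikit-learn reports this score negated, so flip the sign back
    scores = cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[degree] = -scores.mean()

for degree, value in cv_rmse.items():
    print("degree", degree, "mean cross-validated RMSE:", value)
```

If a higher degree lowers the training RMSE but not the cross-validated RMSE, the extra terms are fitting noise rather than structure.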

I have decided to see if I can improve the model's predictive power by electing to use a Decision Tree Regression model: "https://github.com/lynstanford/machine-learning-projects/tree/master/machine-learning/decision_tree.ipynb".