# Polynomial Regression
The first question I needed to ask myself was why use Polynomial Regression, and what exactly does it do? According to the visualizations from the <a href="https://github.com/lynstanford/machine-learning-projects/machine-learning/multiple_regression.ipynb">Linear Regression</a> notebook, the relationship between the dependent and independent variables appears to be strongly linear, although the daily 'Volume' of bitcoin bought and sold is less closely associated with the daily 'Close' price.

Fitting a Linear Regression line to the data may be accurate in this case, with an R2 value of 0.9991392014437468 and RMSE of 689.1925598643533. However, out of curiosity I decided to see whether a polynomial function could fit the data slightly better, and whether a regularization technique could reduce the Mean Squared Error.

The r-squared value is used as the overall accuracy score and measures the proportion of the variance in the target variable that is explained by the predictors. The root mean squared error (RMSE) serves as the loss function, and my aim is to reduce its value as much as possible using regularization.
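As a minimal sketch of how these two measures are calculated, the snippet below works them out by hand with NumPy. The `actual` and `predicted` arrays are made-up illustrative values, not figures from the Bitcoin dataset.

```python
import numpy as np

# hypothetical actual and predicted values, for illustration only
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 13.0, 16.0])

# residual sum of squares: squared differences between actual and predicted
ss_res = np.sum((actual - predicted) ** 2)

# total sum of squares: squared differences between actual and its mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

# R-squared: share of the target's variance explained by the predictions
r2 = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared error, in the units of the target
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

print("R-squared:", r2)
print("RMSE:", rmse)
```

Note that each error is squared before the errors are summed; squaring the sum instead would let positive and negative errors cancel out.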

I know that Polynomial Regression is useful for capturing non-linear relationships between the independent variables and the dependent variable. Because the model is still linear in its coefficients (the polynomial terms are simply treated as extra features), it can be classified as a type of multiple linear regression. I can try to improve the fit of the prediction line to the data, and improve estimates, by changing the 'degree' parameter or by utilizing regularization.

## Import Data
Keeping the data loading simple this time will reduce the overall time it takes to retrieve.

In [None]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

This reads the entire dataset and then stores all the values in a single dataframe object.

## Feature Selection and Scaling
Checking to see what features are present within the current dataframe.

In [None]:
# all column names
bitcoin.columns

Now to remove the columns I don't need including 'Date' and 'Adj Close', selecting only the remaining ones to include in the X variable such as 'Open', 'High', 'Low' and 'Volume'. The remaining column of 'Close' in the same dataframe will be used as the target variable output, y.

In [None]:
# remove feature with string values - Date
del bitcoin['Date']

# remove adj close as I will not be using this
del bitcoin['Adj Close']

# see the remaining features
bitcoin.columns

Assigning features to X gives:

In [None]:
# select features as X dataframe
X = bitcoin[['Open','High','Low','Volume','Close']]
print(X)

Comparing the shape of the overall dataset before polynomial regression and before splitting gives:

In [None]:
X.shape

In [None]:
# select target as y series
y = bitcoin['Close']
print(y)

In [None]:
y.shape

Another way to select the right column vectors is by using positional indexing.

In [None]:
X = X.iloc[:,0:5].values
print(X[0:10])

In [None]:
y = y.iloc[:].values
print(y[0:10])

Does the dataset need re-scaling? In this particular case the (X) predictors include 3 price variables and 1 volume variable which is scaled differently. The (y) target variable is another price variable, so the data would benefit from re-scaling. The transformation I have chosen will shift the values to a range between 0 and 1 for each column. 
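To make the transformation concrete, here is a minimal sketch of min-max scaling written out with NumPy. The small `features` array is hypothetical and only stands in for a price column next to a much larger volume column.

```python
import numpy as np

# hypothetical feature matrix: a price column and a much larger volume column
features = np.array([
    [100.0, 1_000_000.0],
    [150.0, 3_000_000.0],
    [200.0, 2_000_000.0],
])

# min-max scaling per column: (value - column min) / (column max - column min)
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, both columns run from 0 to 1, so the volume column no longer dwarfs the price column purely because of its units.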

Make a copy of the dataframe first.

In [None]:
bitcoin = bitcoin.copy()

In [None]:
# import preprocessing from sci-kit learn
from sklearn import preprocessing

# define min max scaler
min_max_scaler = preprocessing.MinMaxScaler()

# transform data
X_scaled = min_max_scaler.fit_transform(X)

bitcoin_features = pd.DataFrame(X_scaled)

bitcoin_features.to_csv(r'C:\Users\lynst\Documents\GitHub\machine-learning-projects\machine-learning\bitcoin_features.csv', index = False, header = True)

print(bitcoin_features)

Plotting the scaled 'Open', 'Close' and 'Volume' columns against each other in three dimensions:

In [None]:
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

x = bitcoin_features.iloc[0:362,:1]
y = bitcoin_features.iloc[0:362,-1:]
z = bitcoin_features.iloc[0:362,3:4]

fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c='r', marker='o')
ax.set_xlabel("Open")
ax.set_ylabel("Close")
ax.set_zlabel("Volume")
plt.show()   

So for this particular comparison between the open and close prices and the volume of Bitcoin transactions, I can see a positive linear relationship between 'Open' and 'Close' prices. As one increases in value, so does the other. The relationship they have with 'Volume' also appears somewhat linear, except for an outlier when volume spiked on Feb 26th, 2021. This was apparently due to bets by Tesla and Mastercard and stood out significantly compared to anything seen previously this year.

Next I will perform both linear regression and polynomial regression to display prediction lines of best fit. I will only display the 'Close' price vs 'Date', so just two variables. I have chosen to manipulate the data for price and time and store them in two new arrays, 'x' and 'y'.

In [None]:
# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

I have decided to create new data arrays containing the date and the daily close price, having converted the dates to a straightforward number of days (ranging from 1 to 362), the same as the total number of 'non-null' entries.

In [None]:
# Visualising the Linear Regression results
figure = plt.figure(figsize=(14,8))

# day numbers 1 to 362, one per non-null daily entry
x = np.arange(1, 363)
y = np.array([9763.94,10096.28,10451.16,10642.81,10669.64,10836.68,10941.6,10917.12,12211.47,12083.24,12644.26,12820.88,12576.13,
              12551.25,12642.56,13128.23,13904.59,13795.97,13448.25,12194.49,12061.87,12397.94,13062.2,13659.96,13162.55,
              13228.95,13627.94,13559.83,13560.94,13241.13,12666.33,12857.9,12895.3,12293.22,12445.06,12177.47,12630.37,13121.03,
              13000.03,13359.9,13034.29,13812.49,12873.86,13031.1,13233.33,12972.79,12956.88,13071.93,13063.21,13147.35,13253.52,
              12711.01,12883.57,12876.0,12803.02,12791.0,12907.76,12871.53,12793.94,12640.25,12700.57,12668.61,13028.33,13041.35,
              12707.97,12634.47,12541.29,12380.43,12506.51,12552.25,12404.78,12543.14,12380.24,12313.75,12374.98,12296.15,
              12693.78,12588.0,12745.56,12599.85,12614.86,12556.24,12613.8,12580.1,12571.09,12417.11,12394.21,12428.28,12437.98,
              12469.29,12395.2,12604.21,12781.59,12843.04,12796.08,12982.28,13288.45,14662.38,14600.23,14809.4,14902.15,15186.41,
              15771.32,14809.83,15064.88,14900.92,15582.92,15690.25,15529.15,15733.03,15633.23,15862.41,15187.81,15342.8,
              15581.58,15614.35,15742.82,15761.58,16203.14,15775.92,15535.99,15648.99,15273.86,15391.56,15375.86,15563.07,
              14964.79,15097.17,14859.35,15119.23,15072.56,15319.24,15230.27,15626.5,14891.18,13462.52,13731.63,13284.56,
              13442.84,13579.41,13413.77,13471.4,13668.72,13705.81,13760.17,13609.86,14073.93,14244.94,14466.7,14398.2,14452.49,
              14650.47,14436.9,13916.7,14013.41,13690.02,14346.23,14325.02,14397.66,14417.81,14322.13,14520.83,14357.78,14115.09,
              14089.38,14063.2,14203.03,14325.21,14149.28,14160.03,14410.83,14818.74,14949.65,15031.95,15206.57,14936.71,
              14984.18,15137.27,15487.81,15634.23,16869.5,17032.64,16973.62,17205.31,17133.71,17267.45,18012.41,17666.01,
              17893.39,18045.32,18357.66,18364.88,17915.13,18279.6,18553.15,20383.97,20332.17,19375.87,20159.37,19945.73,
              19926.42,20504.46,21384.43,21423.16,21095.38,20937.96,21858.82,23126.07,23303.57,23320.17,24380.23,24407.62,
              24036.96,24010.26,24836.84,24356.4,22332.26,22226.46,23017.67,23600.83,25498.36,24319.8,24803.39,25023.04,23905.97,
              24486.96,24726.68,24572.39,23481.02,23800.7,23272.47,23059.65,24014.9,24409.37,24561.12,24658.5,27166.03,29036.47,
              29589.87,30525.81,30075.87,29308.0,30662.86,29851.39,30529.54,31754.68,34036.36,33737.04,34786.05,35085.89,
              36777.84,36919.19,37393.09,40898.17,41711.19,40865.06,43089.99,46641.22,49945.91,51769.03,51079.4,48817.74,45432.6,
              43108.25,47383.3,49563.35,46905.54,46081.14,45711.01,46701.33,45888.95,44892.29,38992.07,42028.71,40834.13,
              41094.11,41229.38,41341.23,39009.81,40608.72,43844.33,43794.73,42390.85,43093.59,45408.61,47908.07,47359.34,
              48683.77,50115.41,49642.27,58850.14,59023.47,57035.0,60845.81,60319.29,59816.94,61818.59,60583.38,62545.24,
              66229.13,65537.27,70533.62,70772.35,72505.56,68363.29,61493.78,62195.54,59404.52,59028.47,58830.69,57312.2,62739.6,
              61132.9,64055.45,61573.38,61913.1,61894.22,64684.29,66138.41,69333.05,70691.19,72463.92,71526.08,76406.88,73947.62,
              69760.45,70684.33,72931.8,72297.91,73079.59,73038.24,72043.42,68277.85,68917.98,66392.34,65160.43,69541.94,
              70596.59,70416.7,72713.56,74365.91,74029.63,74156.38,74675.77,72436.89,73827.42,73942.95,73156.52,70708.54,
              73273.84,72978.66,74918.53,75463.91,75243.42,79598.41,78995.98,79437.88,77018.32,75906.36,70374.91,69788.23,
              71477.76])

# The array of x values
reg_line = np.linspace(1, 362)

# Applying a linear fit
polymodel = np.poly1d(np.polyfit(x, y, deg=1))

plt.scatter(x, y, color = 'turquoise')
plt.plot(reg_line, polymodel(reg_line))
plt.show()

This represents a strongly positive linear relationship between price and time over the last year. A straight line is a high bias and low variance regression line, and here it is an 'under-fitted' model because it cannot follow the curvature in the data.

Checking the values by printing out the first 10 entries of each array.

In [None]:
print(x[0:10])

In [None]:
print(y[0:10])

Applying a polynomial degree of 2 this time provides a quadratic (parabolic) curve rather than a straight line. With degree=2, the highest-order term has an exponent of 2, or x squared. The curve used to capture these estimates has lower bias than the straight line while keeping variance low, which makes it more suitable for prediction.

In [None]:
figure = plt.figure(figsize=(14,8))

polymodel = np.poly1d(np.polyfit(x, y, deg=2))

reg_line = np.linspace(1, 362)

plt.scatter(x, y, color = 'turquoise')
plt.plot(reg_line, polymodel(reg_line))
plt.show()

Using a polyfit method with a degree of 50.

In [None]:
import warnings

# suppress the rank warning before fitting, since np.polyfit is what raises it
# (NumPy 2.0 moved RankWarning to np.exceptions)
rank_warning = getattr(np, 'RankWarning', None) or np.exceptions.RankWarning
warnings.simplefilter('ignore', rank_warning)

figure = plt.figure(figsize=(14,8))

polymodel = np.poly1d(np.polyfit(x, y, deg=50, rcond=None, full=False, w=None, cov=False))

reg_line = np.linspace(1, 362, 362)

plt.scatter(x, y, color = 'turquoise')
plt.plot(reg_line, polymodel(reg_line))
plt.show()

In essence, this represents 'over-fitting' and a low bias / high variance model.
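The three fits above can be compared numerically as well as visually. The sketch below is only an illustration on synthetic data (a noisy quadratic I made up, with x scaled to the range 0 to 1 to keep the high-degree fit well conditioned); it holds out every fourth point and reports the mean squared error on the training and held-out points for degrees 1, 2 and 9.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: a quadratic trend plus noise, with x scaled to [0, 1]
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)

# hold out every fourth point as a simple test set
test_mask = np.arange(x.size) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]

def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted."""
    return np.mean((actual - predicted) ** 2)

train_mse, test_mse = {}, {}
for degree in (1, 2, 9):
    model = np.poly1d(np.polyfit(x_train, y_train, deg=degree))
    train_mse[degree] = mse(y_train, model(x_train))
    test_mse[degree] = mse(y_test, model(x_test))
    print(degree, train_mse[degree], test_mse[degree])
```

The training error can only fall as the degree rises, because each higher-degree polynomial contains the lower-degree ones as special cases. The held-out error is the figure to watch: when it stops improving, or gets worse, while the training error keeps falling, the extra flexibility is fitting noise.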

## Select and Train a Model
Having introduced a pipeline to clean the data and apply machine learning algorithms automatically I've chosen to apply a Linear Regression model first. Once again, selecting the data to model for each variable, but this time using four independent variables as input predictors so the polynomial algorithm can be applied to try and fit the model slightly better than before.

In [None]:
# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

# remove feature with string values - Date
del bitcoin['Date']

# remove adj close as I will not be using this
del bitcoin['Adj Close']

# select features as X dataframe
X = bitcoin[['Open','High','Low','Volume']]
print(X)

# select target as y series
y = bitcoin['Close']
print(y)

## Scaling
Remember to apply the scaling because 'Volume' has different units of measurement.

In [None]:
# making a copy of the dataset
bitcoin = bitcoin.copy()

# import preprocessing from sci-kit learn
from sklearn import preprocessing

# define min max scaler
min_max_scaler = preprocessing.MinMaxScaler()

# transform data
X_scaled = min_max_scaler.fit_transform(X)
coin_features = pd.DataFrame(X_scaled)
coin_features.to_csv(r'C:\Users\lynst\Documents\GitHub\machine-learning-projects\machine-learning\bitcoin_features.csv', index = False, header = True)

print(coin_features)

Having established a more comparable set of values for the entire dataset, I can proceed with application of my Polynomial model.

Before using the polynomial model I will train and fit a Linear Regression algorithm to learn from the entire dataset before the prediction or validation phase.

## Fit the Linear Regression Model

In [None]:
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lin_reg = LinearRegression()
lin_reg.fit(X, y)

In [None]:
# printing the r2 score
print(lin_reg.score(X, y))

A score this high could indicate a degree of data leakage: 'Open', 'High' and 'Low' are recorded on the same day as 'Close' and are all extremely highly correlated with it, so the score may overstate how well the model would predict unseen prices.
To obtain a descriptive statistical summary I use the describe() function, which provides figures I can use to calculate the R-squared manually rather than relying only on the sklearn.metrics module.

In [None]:
coin_features.describe()

I can check the R-squared value and plug in the inputs manually to see if it works.

Defining the intercept and coefficient values for x1, x2 ... xn from the linear regression.

In [None]:
lin_reg.intercept_

In [None]:
lin_reg.coef_

Using the X values from the first row.

In [None]:
X[:1]

Given this is linear regression and there are no polynomial terms I can calculate a prediction based on the following:

y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4

Then, this will form the basis for my predictive model so I assign the values to a new variable called 'y_pred'.
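Before plugging in the fitted values, here is a small sketch of the same formula with made-up numbers. The `intercept`, `coefficients` and `row` values are hypothetical and chosen only so the arithmetic is easy to follow; it shows that the term-by-term sum and a dot product give the same prediction.

```python
import numpy as np

# hypothetical intercept and coefficients, not the fitted values from this notebook
intercept = 1.0
coefficients = np.array([0.5, -0.25, 2.0, 0.0])

# one hypothetical row of four feature values (x1..x4)
row = np.array([4.0, 8.0, 1.5, 100.0])

# y = a + b1*x1 + b2*x2 + b3*x3 + b4*x4, written term by term
y_terms = (intercept
           + coefficients[0] * row[0]
           + coefficients[1] * row[1]
           + coefficients[2] * row[2]
           + coefficients[3] * row[3])

# the same calculation as a dot product
y_dot = intercept + row @ coefficients

print(y_terms, y_dot)
```

Both expressions give 4.0 here (1 + 2 - 2 + 3 + 0).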

In [None]:
# intercept and coefficients read from the fitted model (rounded values in comments)
a = lin_reg.intercept_    # approx. 104.15
b1 = lin_reg.coef_[0]     # approx. -0.43
b2 = lin_reg.coef_[1]     # approx. 0.92
b3 = lin_reg.coef_[2]     # approx. 0.51
b4 = lin_reg.coef_[3]     # approx. -0.0000000016

# feature values and actual 'Close' price for the first row only
x1 = X['Open'].iloc[0]    # 9718.07
x2 = X['High'].iloc[0]    # 9838.33
x3 = X['Low'].iloc[0]     # 9728.25
x4 = X['Volume'].iloc[0]
y_first = y.iloc[0]       # 9763.94
y_mean = y.mean()         # approx. 29864.75

In [None]:
# prediction for the first row
y_pred_first = a + (b1*x1) + (b2*x2) + (b3*x3) + (b4*x4)
print(y_pred_first)

The actual y value for the first row in the table was 9763.94. To calculate the R-squared score for this single row:

In [None]:
r2_first = 1 - ((y_pred_first - y_first)**2 / (y_mean - y_first)**2)
print("The r-squared value is:", r2_first)

In [None]:
import numpy as np
y_mean = np.mean(y)
squared_errors_mean = np.sum((y - y_mean)**2)
squared_errors_model = np.sum((y - lin_reg.predict(X))**2)
r2 = 1 - (squared_errors_model / squared_errors_mean)
print(r2)

This is similar to the above score for the entire dataset, but remember it is only for the first row in the table. Repeating the process for the entire dataset needs no explicit loop: arithmetic on pandas columns is applied to every row at once, so the same formula can be wrapped in a function that returns a prediction for each row:

In [None]:
# intercept and coefficients read from the fitted model (rounded values in comments)
a = lin_reg.intercept_    # approx. 104.15
b1 = lin_reg.coef_[0]     # approx. -0.43
b2 = lin_reg.coef_[1]     # approx. 0.92
b3 = lin_reg.coef_[2]     # approx. 0.51
b4 = lin_reg.coef_[3]     # approx. -0.0000000016
x1 = X['Open']
x2 = X['High']
x3 = X['Low']
x4 = X['Volume']

# obtaining the predicted values for y, one per row
def prediction():
    return a + (b1*x1) + (b2*x2) + (b3*x3) + (b4*x4)

y_pred = prediction()
print(y_pred)

In [None]:
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# create actual and predicted values
y = bitcoin['Close']
actual = y
y_true = actual
y_pred = a + (b1*x1) + (b2*x2) + (b3*x3) + (b4*x4)
predicted = y_pred

In [None]:
actual[0:10]

In [None]:
predicted[0:10]

In [None]:
# create error values (actual minus predicted) for every row
errors = y_true - y_pred
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print("Mean Squared Error is:", mse)
print("Root Mean Squared Error is:", rmse)
print("\n")

# plot predicted values against actual values
fig = pyplot.figure(figsize=(10,10))
pyplot.scatter(y_true, y_pred, color = 'coral')
pyplot.xlabel('Actual Value')
pyplot.ylabel('Predicted Value')
pyplot.show()


Having trained my model on inputs and a labeled target dataset, and having established the root mean squared error as the 'Loss Function', I can use this measure to compare the predictions to the targets. An important point to note is that it's better to calculate scores on unseen test data that isn't used in the training phase. I can read the RMSE as a measure of risk: the typical size of the prediction errors, expressed in the same units as the 'Close' price. The r-squared score is the proportion of the variance in 'Close' explained by the model.

Because the loss function can be improved upon, the next phase involves using an 'Optimization Function' to re-balance the coefficient of each feature, iterating through the data in an attempt to reduce the loss.
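One common optimization function of this kind is gradient descent. The sketch below is an illustration only, on synthetic data I made up; the learning rate and number of iterations are hand-picked assumptions. scikit-learn's LinearRegression solves the least-squares problem directly rather than iterating, so this is not what happens inside the models used in this notebook.

```python
import numpy as np

# synthetic data following y = 3 + 2x with a little noise (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

def mse_loss(intercept, slope):
    """Mean squared error of the line intercept + slope * x."""
    return np.mean((y - (intercept + slope * x)) ** 2)

intercept, slope = 0.0, 0.0   # starting guess
learning_rate = 0.1           # step size, chosen by hand
initial_loss = mse_loss(intercept, slope)

for _ in range(2000):
    residual = y - (intercept + slope * x)
    # gradients of the mean squared error with respect to each coefficient
    grad_intercept = -2.0 * np.mean(residual)
    grad_slope = -2.0 * np.mean(residual * x)
    # step each coefficient in the direction that lowers the loss
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

final_loss = mse_loss(intercept, slope)
print(initial_loss, final_loss, intercept, slope)
```

Each pass nudges both coefficients a little way downhill on the loss surface, so the final loss is far below the starting loss and the coefficients end up close to the 3 and 2 used to generate the data.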

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Check the Accuracy Score for the Linear Regression Fit
To see how close the relationship between X and y is.

In [None]:
# Predicting accuracy result with Linear Regression
print('R2: %0.8f' %r2_score(y_test, lin_reg.predict(X_test)))

The linear regression model was fitted to the whole dataset before the split, so the test rows were already seen during training. Comparing it to the test set improves the R-squared score, but this is deceiving: the improvement is most likely the result of data leakage rather than a genuinely better fit.
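To avoid this, the model can be fitted to the training rows only and then scored on the held-out rows. The sketch below shows that order of operations on synthetic data; the arrays and the names `X_tr`, `X_te`, `y_tr` and `y_te` are made up for illustration and are not the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# synthetic features and target (illustrative, not the Bitcoin data)
n = 200
features = rng.normal(size=(n, 3))
target = 1.0 + features @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# split first, so the test rows are never seen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(features, target, test_size=0.3, random_state=0)

# fit on the training rows only
model = LinearRegression()
model.fit(X_tr, y_tr)

# score on the unseen test rows
test_r2 = r2_score(y_te, model.predict(X_te))
print("Test R-squared:", test_r2)
```

A score obtained this way reflects how the model handles rows it has not seen, which is what matters for prediction.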

## Fit the Polynomial Model

In [None]:
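# Fitting Polynomial Regression followed by Linear Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score
 
# expand the original features into polynomial terms up to degree 4
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)

lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

# the test features must be expanded with the same transformer before predicting
print('R2: %0.8f' %r2_score(y_test, lin_reg2.predict(poly.transform(X_test))))

Calling lin_reg2.predict(X_test) directly does not produce an r-squared score, because the model was trained on the expanded (non-linear) feature set, which has a different number of features than the original X_test. Transforming X_test with the same PolynomialFeatures object resolves the mismatch. As with the linear model, this fit uses the entire dataset, so the test rows have already been seen during training.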

## Split the Dataset
First I've decided to import the train_test_split function to split the dataset into respective training and test sets (for both X and y values in each set).

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Applying Regularization to the Model
In this particular instance I have chosen L2 regularization, which penalizes the loss (or cost) function by adding the sum of the squared coefficients to it. This shrinks the coefficients of features which exhibit strong collinearity towards similar values, while simultaneously reducing the effect of redundant coefficients. (See the collinearity matrix in the previous <a href="https://github.com/lynstanford/machine-learning-projects/machine-learning/multiple_regression.ipynb">Linear Regression</a> model).
    
Note, I have two hyperparameters I can adjust here: 'alpha' and 'normalize'. The 'alpha' value is the regularization strength, which controls how heavily large coefficients are penalized (it is not a learning rate), and the 'normalize' option re-scales the features before fitting by subtracting the mean and dividing by the L2 norm. The 'normalize' option was removed in scikit-learn 1.2, so newer versions need the features scaled beforehand instead. An r-squared score will be used to determine accuracy.
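To see what 'alpha' does to the coefficients, here is a minimal sketch of ridge regression in closed form with NumPy. It is an illustration under simplifying assumptions: the data is synthetic, there is no intercept, and the helper `ridge_coefficients` is a made-up name rather than part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic, strongly correlated features (illustrative only)
base = rng.normal(size=100)
features = np.column_stack([base, base + rng.normal(scale=0.05, size=100)])
target = features @ np.array([1.0, 1.0]) + rng.normal(scale=0.1, size=100)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without an intercept: (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# the size of the coefficient vector for increasing penalty strengths
norms = {}
for alpha in (0.0, 1.0, 100.0):
    coef = ridge_coefficients(features, target, alpha)
    norms[alpha] = np.linalg.norm(coef)
    print(alpha, coef, norms[alpha])
```

With alpha set to 0 this is ordinary least squares; as alpha grows, the coefficient vector shrinks towards zero, which is the trade-off the hyperparameter controls.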

In [None]:
# Fitting Ridge Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

# polynomial features are created here but are not used by the Ridge fit below
poly = PolynomialFeatures(degree=4)
X_poly = poly.fit_transform(X)
#poly.fit(X_poly, y)

from sklearn.linear_model import Ridge
# note: the 'normalize' argument was removed in scikit-learn 1.2
reg_regression = Ridge(alpha=0.1, normalize=True)
# fitted to the original features of the entire dataset
reg_regression.fit(X, y)

print('R2: %0.8f' %r2_score(y_test, reg_regression.predict(X_test)))

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

So the accuracy has decreased slightly after applying Ridge Regression to the entire dataset (note that the model above was fitted to the original features rather than the polynomial ones), but if I fit it only to the training set I get the following score:

In [None]:
# note: the 'normalize' argument was removed in scikit-learn 1.2
reg_regression = Ridge(alpha=0.1, normalize=True)
reg_regression.fit(X_train, y_train)
print('R2: %0.8f' %r2_score(y_test, reg_regression.predict(X_test)))

A marginal improvement, but I really need to fine-tune some of the hyperparameters in the ridge regression algorithm in order to improve this score. Removing the 'normalize' hyperparameter seems to help, as I already pre-processed the data using a MinMaxScaler above.

Adjusting the 'alpha' parameter doesn't appear to improve the score. The 'solver' parameter is written out explicitly below, but since it is set to its default value of 'auto', any change in the score comes from dropping 'normalize'.

In [None]:
reg_regression = Ridge(alpha=0.1, solver="auto")
reg_regression.fit(X_train, y_train)
print('R2: %0.8f' %r2_score(y_test, reg_regression.predict(X_test)))

I can deduce that the accuracy of the predictions appears to be highest for plain linear regression, before any polynomial modeling is applied. This is probably due to the strong association between the price variables, or their degree of collinearity.
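A quick way to gauge collinearity is a correlation matrix. The sketch below is illustrative only: the three price series are synthetic stand-ins I generated, not the Bitcoin columns, built so that 'high' and 'low' sit close to 'open' as daily prices tend to.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical daily prices: 'high' and 'low' sit close to 'open'
open_price = np.cumsum(rng.normal(size=250)) + 100.0
high_price = open_price + np.abs(rng.normal(scale=0.5, size=250))
low_price = open_price - np.abs(rng.normal(scale=0.5, size=250))

# pairwise Pearson correlations; values near 1 indicate strong collinearity
corr = np.corrcoef([open_price, high_price, low_price])
print(corr.round(4))
```

When the off-diagonal values are this close to 1, each price column carries almost the same information as the others, which leaves little for polynomial terms to add.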

## Predicting Against Test Set
Now to predict the test set results.

In [None]:
y_test_pred = lin_reg.predict(X_test)

# to see the first ten results
y_test_pred[0:10]

Comparing the first row of X values to predict the y value by copying from the dataset output from cell 1. Remember only to use the 'Open', 'High', 'Low' and 'Volume' values.

In [None]:
lin_reg.predict([[9718.07, 9838.33, 9728.25, 4.624843e+10]])

So the predicted y value is 9792.45 and the actual y value (from output of the first value for 'Close' price) is 9763.94, which is reasonably accurate.

## Model Evaluation
This is where I need to look into the root mean squared error and R-squared values to see how well the model is performing.

In [None]:
# import the score measure
from sklearn.metrics import r2_score
r2_score(y_test, y_test_pred)

### Actual vs Predicted Prices
Plotting the actual and predicted results for the 'Close' price provides a nice visual display of the relationship.

In [None]:
plt.figure(figsize=(15,10))
plt.scatter(y_test, y_test_pred, color="orange")
plt.xlabel("Actual")
plt.ylabel("Predicted")
plt.title("Actual vs. Predicted Prices")
plt.show()

## Predictions
Storing the actual and predicted values and their difference in a dictionary object becomes useful. The larger the error, the greater the variance and subsequent standard deviation which translates to less accurate predictions and a less reliable model. 

Why does this become important? The main purpose of this exercise is to see how useful my model would be in predicting the daily closing price. The first thing I can take from the scatter plot above is that the relationship between the independent and dependent variables is strong, but also incredibly linear. It may not pay off to apply Polynomial Regression in such circumstances, although increasing the polynomial degree may improve the fit to the data points.

In [None]:
predictions = pd.DataFrame({"Actual": y_test, "Predicted": y_test_pred, "Error": y_test - y_test_pred})
predictions[0:10]

Exploring the test set predictions and their degree of accuracy by fitting a linear regression line once more, but this time calling Seaborn's regression plot function (regplot).

In [None]:
import seaborn as sns

# set the width and height of the plot
plt.figure(figsize=(15,10))

# visualizing the relationship between actual and predicted values for y
plt.scatter(y_test, y_test_pred, color="orange")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

# fit a regression line between high and low values to show linear nature
sns.regplot(x=y_test, y=y_test_pred, color="orange")

The scatter plot explains how actual and expected prices do not diverge too far from one another, with the dots representing actual 'Close' prices and the line representing expected or predicted 'Close' prices.

Suppose I were to re-run the polynomial model with a degree of only 2 as the parameter. I can train on the dataset from the beginning using a new model.

## Training a new Polynomial model
Importing the relevant libraries and instantiating the polynomial function is the first step. 

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

# remove feature with string values - Date
del bitcoin['Date']

# remove adj close as I will not be using this
del bitcoin['Adj Close']

# select features as X dataframe
X = bitcoin[['Open','High','Low','Volume']]
print(X)

# select target as y series
y = bitcoin['Close']
print(y)

The first stage will produce an expanded feature set for the entire dataframe with quadratic terms (degree=2). I will attempt to tune this hyperparameter later to see if it can optimize the Loss function. The independent variables or features are selected as well as the target variable, unnecessary columns and rows are removed as part of the cleaning process.

In [None]:
# initialize the model for a given degree
poly_features = PolynomialFeatures(degree=2, include_bias=True)
  
# transforms the existing features to higher degree features.
poly_X = poly_features.fit_transform(X)

Using positional slicing to return the first row of X, that is, every row up to but not including index 1.

In [None]:
X[:1]

Repeating this for the data contained in the 'poly_X' set, I can see that 15 values are returned for the row: the transformer has created 11 new features for a total of 15. These are the bias term, the original feature values 'X1' to 'X4', each feature value squared, and the pairwise product of every two distinct features.
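The count of 15 can be checked by listing the terms. The sketch below uses only the standard library and is a counting exercise rather than a description of how scikit-learn builds the array: it enumerates every product of up to two of the four feature names, with the empty product standing in for the bias term.

```python
from itertools import combinations_with_replacement

feature_names = ["Open", "High", "Low", "Volume"]
degree = 2

# every product of up to `degree` features, starting with the empty product (bias)
terms = []
for d in range(degree + 1):
    for combo in combinations_with_replacement(feature_names, d):
        terms.append("*".join(combo) if combo else "1")

print(len(terms))
print(terms)
```

That gives 1 bias term, 4 original features, 4 squares and 6 cross products, for 15 in total.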

In [None]:
poly_X[:1]

Saving the current dataframe to an Excel workbook formatted file for preservation to view the degree of newly expanded features in a fresh table. Check the first row entries above against the new excel table below.

In [None]:
import pandas as pd
df = pd.DataFrame(poly_X)
df.to_excel("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/poly_X.xlsx", sheet_name='poly_X')

### Splitting the Data
Using this data for the Polynomial Features model and splitting it into training and test sets with a 70-30 split. 

In [None]:
# then split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(poly_X, y, test_size = 0.3, random_state = 42)

The number of entries (rows) by the number of features (cols):

In [None]:
print(X.shape)
print(y.shape)

The shape of the new dataset with all the extended features is much larger and I can see a distinctive change in dimensions. Printing out the shape of the training set from the polynomial data gives:

In [None]:
print(poly_X.shape)

Now for the training sets for X and y, 70% of the total values.

In [None]:
print(X_train.shape)
print(y_train.shape)

And the shape of the test data.

In [None]:
print(X_test.shape)
print(y_test.shape)

### Making a Prediction
Comparing predictions for training set and test set values. I am continuing to use the RMSE as the score measure to see if it can be optimized as far as possible for the purpose of this model.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import seaborn as sns

# fit the linear regression to the polynomial training features
lin_reg.fit(X_train, y_train)

# predictions for the training and test sets
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_test)

# set the width and height of the plot
plt.figure(figsize=(6,6))

# visualizing the relationship between actual and predicted values for y
plt.scatter(y_test, y_test_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")

# fit a regression line between actual and predicted values to show linear nature
sns.regplot(x=y_test, y=y_test_pred)

# evaluating the model on train dataset
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)
r2_train = r2_score(y_train, y_train_pred)

# evaluating the model on test dataset
mse_test = mean_squared_error(y_test, y_test_pred)
rmse_test = np.sqrt(mse_test)
r2_test = r2_score(y_test, y_test_pred)
  
print("\n")
  
print("The model performance for the train and test sets")
print("-------------------------------------------")

print("MSE of the train set is {}".format(mse_train))
print("RMSE of train set is {}".format(rmse_train))
print("R2 score of train set is {}".format(r2_train))

print("MSE of the test set is {}".format(mse_test))
print("RMSE of test set is {}".format(rmse_test))
print("R2 score of test set is {}".format(r2_test))

print("\n")

Interpreting the RMSE tells me the results seem too high.

In [None]:
# for polynomial regression (degree=2) the prediction is
# y = a + b1*t1 + b2*t2 + ... + b15*t15
# where t1..t15 are the expanded features: the bias column, x1..x4,
# their squares and the pairwise products x1*x2, x1*x3, ..., x3*x4
lin_reg.fit(X_train, y_train)
a = lin_reg.intercept_
b = lin_reg.coef_
y_train_pred = a + X_train @ b

# mean squared error: square each error first, then average
error = y_train - y_train_pred
mse = np.mean(error**2)
rmse = np.sqrt(mse)

def mse_train(actual, predicted):
    # same calculation wrapped in a function
    error = actual - predicted
    return np.mean(error**2)

mse_train(y_train, y_train_pred)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# actual values, as a plain array so positional indexing works
y_train_actual = np.asarray(y_train)

# predicted values from a model fitted to the training set
lin_reg.fit(X_train, y_train)
y_train_pred = lin_reg.predict(X_train)

# calculate the squared error for each training observation
errors = list()

for i in range(len(y_train_actual)):
    # calculate error
    err = (y_train_actual[i] - y_train_pred[i])**2
    # store error
    errors.append(err)
    # report error
    print('>%.1f, %.1f = %.3f' % (y_train_actual[i], y_train_pred[i], err))
# plot errors
plt.plot(errors)
plt.xlabel('Training observation')
plt.ylabel('Squared Error')
plt.show()

In [None]:
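# refit the ridge model on the polynomial training features before predicting
reg_regression.fit(X_train, y_train)
y_test_pred = reg_regression.predict(X_test)
print("Predictions: ", reg_regression.predict(X_test[:5]))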

The regularization term should only be added to the cost function during the training phase. Now the 'training' data has been fit, it becomes important to discover the performance measure on the unregularized test set. Making a prediction on the first 5 values in the test set first.

In [None]:
#lin_reg.predict([[363]])

In [None]:
#lin_reg2.predict(poly.fit_transform([[363]]))

## Pipelines
To better capture the different features I have used in my multiple regression I will employ the use of a Pipeline to include the different types of regression models, re-scaling of the features with different units of measurement, regularization techniques to improve the degree of overall fit before validation.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler


# transformers come first; the estimator must be the final step
pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scale', StandardScaler()),
    ('reg', LinearRegression()),
    ])

# cross-validated R-squared scores for the whole pipeline
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores)

pipeline.fit(X, y)
y_pred = pipeline.predict(X)
print(y_pred)