# Polynomial Regression
The first question I needed to ask myself was why use Polynomial Regression, and what exactly does it do? According to the visualizations from the <a href="https://github.com/lynstanford/machine-learning-projects/machine-learning/multiple_regression.ipynb">Linear Regression</a> notebook, the relationship between the dependent and independent variables appears strongly linear, although the daily 'Volume' of bitcoin bought and sold is less strongly associated with the daily 'Close' price.

Fitting a Linear Regression line to the data already appears accurate in this case, with an R2 value of 0.9991392014437468 and an RMSE of 689.1925598643533. However, out of curiosity I decided to see whether a polynomial function could fit the data slightly better, and whether a regularization technique could reduce the Mean Squared Error further.

The R-squared value serves as the overall accuracy score: it measures the proportion of the variance in the target variable that is explained by the predictors. The root mean squared error (RMSE) serves as the loss function, and my aim is to reduce it as much as possible using regularization.
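As a rough illustration of these two metrics, the sketch below computes R-squared and RMSE by hand with NumPy. The arrays `y_true` and `y_pred` are made-up values chosen for the example; they are not results from the Bitcoin data.

```python
import numpy as np

# Hypothetical values, for illustration only (not taken from BTC_CAD.csv)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares and total sum of squares
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

# R-squared: share of the variance in y explained by the predictions
r2 = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared residual, in the units of y
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2, rmse)
```

Because RMSE is expressed in the units of the target, it is easier to read against the price values than the squared error is.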

Polynomial Regression is useful for modelling non-linear relationships between the independent variables and the dependent variable. Because the model is still linear in its coefficients (the powers of each predictor are simply added as extra features), it can be classified as a type of multiple linear regression. I can try to improve the fit of the prediction line and the resulting estimates by changing the 'degree of fit' parameter, or by utilizing regularization.
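To make the "linear in its coefficients" point concrete, here is a minimal sketch using NumPy only. The data is synthetic, the names `x_demo`, `y_demo` and `A` are hypothetical, and `alpha = 1.0` is an arbitrary penalty strength. The ridge (L2) estimate is shown as one possible form of regularization, not necessarily the one used later in this project.

```python
import numpy as np

# Synthetic data for illustration: y = 1 + 2x + 3x^2, no noise
x_demo = np.linspace(-1, 1, 20)
y_demo = 1 + 2 * x_demo + 3 * x_demo ** 2

# Design matrix with columns [1, x, x^2]; the model is linear in the coefficients
A = np.column_stack([np.ones_like(x_demo), x_demo, x_demo ** 2])

# Ordinary least squares on the expanded features
beta_ols, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# Ridge (L2) estimate in closed form; alpha is an arbitrary choice and,
# for brevity, the intercept is penalized as well
alpha = 1.0
beta_ridge = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y_demo)

print(beta_ols)    # close to [1, 2, 3]
print(beta_ridge)  # shrunk relative to the least-squares estimate
```

The penalty trades a slightly worse fit on the training data for smaller coefficients, which matters more as the polynomial degree grows.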

## Import Data
Keeping the data loading simple this time reduces the time it takes to retrieve the data.

In [None]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import os

# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")

# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

This reads the entire dataset, drops any rows with missing values and stores the rest in a single dataframe object.

## Feature Selection and Scaling
Checking to see what features are present within the current dataframe.

In [None]:
# all column names
bitcoin.columns

Now to remove the columns I don't need, 'Date' and 'Adj Close'. Of the remaining columns, 'Open', 'High', 'Low' and 'Volume' are the features for the X variable, and 'Close' will be used as the target variable output, y.

In [None]:
# remove feature with string values - Date
del bitcoin['Date']

# remove adj close as I will not be using this
del bitcoin['Adj Close']

# see the remaining features
bitcoin.columns

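Assigning features to X gives the following. Note that 'Close' is also selected here so that it is scaled alongside the other columns for the plot further down; because it is the target, it should be dropped from X before fitting any model.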

In [None]:
# select features as X dataframe
X = bitcoin[['Open','High','Low','Volume','Close']]
print(X)

Comparing the shape of the overall dataset before polynomial regression and before splitting gives:

In [None]:
X.shape

In [None]:
# select target as y series
y = bitcoin['Close']
print(y)

In [None]:
y.shape

Another way to select the right column vectors is positional indexing with 'iloc'.

In [None]:
X = X.iloc[:,0:5].values
print(X[0:10])

In [None]:
# y is a one-dimensional Series, so no column index is needed
y = y.values
print(y[0:10])
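As a quick check on the positional selection above, the sketch below compares label-based and position-based selection. The frame `demo` is a tiny made-up example that only borrows the notebook's column names.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook
demo = pd.DataFrame({
    "Open": [1.0, 2.0], "High": [1.5, 2.5], "Low": [0.5, 1.5],
    "Volume": [100, 200], "Close": [1.2, 2.2],
})

by_label = demo[["Open", "High", "Low", "Volume"]].to_numpy()
by_position = demo.iloc[:, 0:4].to_numpy()  # columns 0..3, end index excluded
target = demo["Close"].to_numpy()           # 1-D array holding the target

print((by_label == by_position).all(), target.shape)
```

Selecting by label is less fragile if the column order ever changes, while `iloc` is convenient when only the positions are known.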

Does the dataset need re-scaling? In this case the (X) predictors include three price variables and one volume variable, which is on a very different scale. The (y) target variable is another price variable. The data would therefore benefit from re-scaling, and the min-max transformation I have chosen shifts the values of each column into the range 0 to 1.
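The transformation itself is (x - min) / (max - min), applied to each column separately, which is what scikit-learn's `MinMaxScaler` does with its default range of 0 to 1. A small sketch with made-up numbers (the array `demo` is hypothetical):

```python
import numpy as np

# Made-up columns on very different scales: a price and a traded volume
demo = np.array([[10000.0, 2.0e9],
                 [12000.0, 5.0e9],
                 [15000.0, 3.0e9]])

# Column-wise min-max scaling: (x - min) / (max - min)
col_min = demo.min(axis=0)
col_max = demo.max(axis=0)
scaled = (demo - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, the smallest value in each column is 0 and the largest is 1, so the price and volume columns become directly comparable.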

Make a copy of the dataframe first.

In [None]:
bitcoin = bitcoin.copy()

In [None]:
# import preprocessing from scikit-learn
from sklearn import preprocessing

# define min max scaler
min_max_scaler = preprocessing.MinMaxScaler()

# transform data
X_scaled = min_max_scaler.fit_transform(X)

bitcoin_features = pd.DataFrame(X_scaled)

bitcoin_features.to_csv(r'C:\Users\lynst\Documents\GitHub\machine-learning-projects\machine-learning\bitcoin_features.csv', index = False, header = True)

print(bitcoin_features)

Plotting the scaled 'Open', 'Close' and 'Volume' columns against each other in three dimensions:

In [None]:
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

x = bitcoin_features.iloc[0:362,:1]
y = bitcoin_features.iloc[0:362,-1:]
z = bitcoin_features.iloc[0:362,3:4]

fig = plt.figure(figsize=(16,10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x, y, z, c='r', marker='o')
ax.set_xlabel("Open")
ax.set_ylabel("Close")
ax.set_zlabel("Volume")
plt.show()   

For this particular comparison between the open and close prices and the volume of Bitcoin transactions, I can see a positive linear relationship between the 'Open' and 'Close' prices: as one increases in value, so does the other. Their relationship with 'Volume' also appears somewhat linear, except for an outlier when volume spiked on Feb 26th, 2021. This was apparently due to bets by Tesla and Mastercard and stood out significantly compared to anything seen previously this year.

Next I will perform both linear regression and polynomial regression to display prediction lines of best fit. I will only display the 'Close' price against 'Date', so just two variables, with the price and time values stored in new arrays.

In [None]:
# import data from filepath
btc_cad = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
# storing dataset from filepath into new dataframe
bitcoin = pd.DataFrame(btc_cad).dropna(axis=0)

I have decided to create new data arrays containing the date and the daily close price, having converted the dates to a simple day count (ranging from 1 to 362), which matches the total number of 'non-null' entries.
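One way to build such arrays without typing the values out is sketched below. The rows in `demo` are made up and merely stand in for BTC_CAD.csv; the date format and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for BTC_CAD.csv; the 'Date' format is an assumption
demo = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"],
    "Close": [100.0, None, 120.0, 130.0],
}).dropna(axis=0)

# Number the remaining (non-null) rows 1..n in date order
demo = demo.sort_values("Date").reset_index(drop=True)
day_number = np.arange(1, len(demo) + 1)
close = demo["Close"].to_numpy()

print(day_number, close)
```

Numbering the rows after dropping the missing entries is what makes the day count equal to the number of non-null rows.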

In [None]:
# Visualising the Linear Regression results
figure = plt.figure(figsize=(14,8))

# day numbers 1..362, one per non-null row
x = np.arange(1, 363)
y = np.array([9763.94,10096.28,10451.16,10642.81,10669.64,10836.68,10941.6,10917.12,12211.47,12083.24,12644.26,12820.88,12576.13,
              12551.25,12642.56,13128.23,13904.59,13795.97,13448.25,12194.49,12061.87,12397.94,13062.2,13659.96,13162.55,
              13228.95,13627.94,13559.83,13560.94,13241.13,12666.33,12857.9,12895.3,12293.22,12445.06,12177.47,12630.37,13121.03,
              13000.03,13359.9,13034.29,13812.49,12873.86,13031.1,13233.33,12972.79,12956.88,13071.93,13063.21,13147.35,13253.52,
              12711.01,12883.57,12876.0,12803.02,12791.0,12907.76,12871.53,12793.94,12640.25,12700.57,12668.61,13028.33,13041.35,
              12707.97,12634.47,12541.29,12380.43,12506.51,12552.25,12404.78,12543.14,12380.24,12313.75,12374.98,12296.15,
              12693.78,12588.0,12745.56,12599.85,12614.86,12556.24,12613.8,12580.1,12571.09,12417.11,12394.21,12428.28,12437.98,
              12469.29,12395.2,12604.21,12781.59,12843.04,12796.08,12982.28,13288.45,14662.38,14600.23,14809.4,14902.15,15186.41,
              15771.32,14809.83,15064.88,14900.92,15582.92,15690.25,15529.15,15733.03,15633.23,15862.41,15187.81,15342.8,
              15581.58,15614.35,15742.82,15761.58,16203.14,15775.92,15535.99,15648.99,15273.86,15391.56,15375.86,15563.07,
              14964.79,15097.17,14859.35,15119.23,15072.56,15319.24,15230.27,15626.5,14891.18,13462.52,13731.63,13284.56,
              13442.84,13579.41,13413.77,13471.4,13668.72,13705.81,13760.17,13609.86,14073.93,14244.94,14466.7,14398.2,14452.49,
              14650.47,14436.9,13916.7,14013.41,13690.02,14346.23,14325.02,14397.66,14417.81,14322.13,14520.83,14357.78,14115.09,
              14089.38,14063.2,14203.03,14325.21,14149.28,14160.03,14410.83,14818.74,14949.65,15031.95,15206.57,14936.71,
              14984.18,15137.27,15487.81,15634.23,16869.5,17032.64,16973.62,17205.31,17133.71,17267.45,18012.41,17666.01,
              17893.39,18045.32,18357.66,18364.88,17915.13,18279.6,18553.15,20383.97,20332.17,19375.87,20159.37,19945.73,
              19926.42,20504.46,21384.43,21423.16,21095.38,20937.96,21858.82,23126.07,23303.57,23320.17,24380.23,24407.62,
              24036.96,24010.26,24836.84,24356.4,22332.26,22226.46,23017.67,23600.83,25498.36,24319.8,24803.39,25023.04,23905.97,
              24486.96,24726.68,24572.39,23481.02,23800.7,23272.47,23059.65,24014.9,24409.37,24561.12,24658.5,27166.03,29036.47,
              29589.87,30525.81,30075.87,29308.0,30662.86,29851.39,30529.54,31754.68,34036.36,33737.04,34786.05,35085.89,
              36777.84,36919.19,37393.09,40898.17,41711.19,40865.06,43089.99,46641.22,49945.91,51769.03,51079.4,48817.74,45432.6,
              43108.25,47383.3,49563.35,46905.54,46081.14,45711.01,46701.33,45888.95,44892.29,38992.07,42028.71,40834.13,
              41094.11,41229.38,41341.23,39009.81,40608.72,43844.33,43794.73,42390.85,43093.59,45408.61,47908.07,47359.34,
              48683.77,50115.41,49642.27,58850.14,59023.47,57035.0,60845.81,60319.29,59816.94,61818.59,60583.38,62545.24,
              66229.13,65537.27,70533.62,70772.35,72505.56,68363.29,61493.78,62195.54,59404.52,59028.47,58830.69,57312.2,62739.6,
              61132.9,64055.45,61573.38,61913.1,61894.22,64684.29,66138.41,69333.05,70691.19,72463.92,71526.08,76406.88,73947.62,
              69760.45,70684.33,72931.8,72297.91,73079.59,73038.24,72043.42,68277.85,68917.98,66392.34,65160.43,69541.94,
              70596.59,70416.7,72713.56,74365.91,74029.63,74156.38,74675.77,72436.89,73827.42,73942.95,73156.52,70708.54,
              73273.84,72978.66,74918.53,75463.91,75243.42,79598.41,78995.98,79437.88,77018.32,75906.36,70374.91,69788.23,
              71477.76])

# The array of x values
myline = np.linspace(1, 362, 80000)

# Applying a linear fit
mymodel = np.poly1d(np.polyfit(x, y, deg=1))

plt.scatter(x, y, color = 'turquoise')
plt.plot(myline, mymodel(myline))
plt.show()

This shows a strongly positive linear trend in price over the last year. As a model, however, the straight line is a high-bias, low-variance fit, or an 'under-fitted' model.

To make sure the values are as expected, I print out the first 10 values of each array.

In [None]:
print(x[0:10])

In [None]:
print(y[0:10])

Applying a polynomial degree of 2 this time produces a quadratic (parabolic) curve. With degree=2, the highest-order term has an exponent of 2, or x squared.
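To show what a degree-2 fit returns, the sketch below fits made-up points that lie exactly on y = 2x^2 - 3x + 5 and recovers the three coefficients. The names `x_demo`, `y_demo` and `curve` are hypothetical.

```python
import numpy as np

# Made-up points lying exactly on y = 2x^2 - 3x + 5
x_demo = np.arange(1, 11)
y_demo = 2 * x_demo ** 2 - 3 * x_demo + 5

# np.polyfit returns coefficients from the highest power down: [a, b, c]
a, b, c = np.polyfit(x_demo, y_demo, deg=2)

# np.poly1d wraps them so the fitted curve can be evaluated at new x values
curve = np.poly1d([a, b, c])
print(a, b, c, curve(11))
```

The coefficient order (highest power first) is worth remembering when reading the output of `np.polyfit` for higher degrees.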

In [None]:
figure = plt.figure(figsize=(14,8))

mymodel = np.poly1d(np.polyfit(x, y, deg=2))

myline = np.linspace(1, 362, 80000)

plt.scatter(x, y, color = 'turquoise')
plt.plot(myline, mymodel(myline))
plt.show()

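Using the polyfit method with a degree of 50. A degree this high is poorly conditioned for x values up to 362, so NumPy may emit a RankWarning.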

In [None]:
figure = plt.figure(figsize=(14,8))

# the remaining polyfit arguments are left at their defaults
mymodel = np.poly1d(np.polyfit(x, y, deg=50))

myline = np.linspace(1, 362, 80000)

plt.scatter(x, y, color = 'turquoise')
plt.plot(myline, mymodel(myline))
plt.show()

In essence, this represents 'over-fitting': a low-bias / high-variance model.
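One way to see under-fitting and over-fitting in numbers is to hold back some points and compare the error on the fitted points with the error on the held-out points. The sketch below does this on synthetic data (a quadratic trend plus noise), not on the Bitcoin prices. The degrees 1, 2 and 15 and the every-third-point split are arbitrary choices, and `Polynomial.fit` is used instead of `np.polyfit` because it rescales x internally.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Synthetic data for illustration: a quadratic trend plus noise
x_demo = np.linspace(0, 10, 60)
y_demo = 0.5 * x_demo ** 2 - x_demo + rng.normal(scale=3.0, size=x_demo.size)

# Hold out every third point as a simple validation set
val_mask = np.arange(x_demo.size) % 3 == 0
x_train, y_train = x_demo[~val_mask], y_demo[~val_mask]
x_val, y_val = x_demo[val_mask], y_demo[val_mask]

def rmse(y_true, y_hat):
    """Root mean squared error between two arrays."""
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

results = {}
for degree in (1, 2, 15):
    # Fit on the training points only, then score both subsets
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_val, model(x_val)))
    print(degree, results[degree])
```

The training error can only fall as the degree rises, because each higher-degree model contains the lower-degree ones. An over-fitted model typically shows a held-out error that is noticeably worse than its training error.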