In [1]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
df = pd.DataFrame(bitcoin).dropna(axis=0)

In [2]:
# all column names
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [3]:
# all column data types
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

## Feature Engineering and Model Selection
Assign the (dependent) y variable and the (independent) X variables for the modelling process.

In [4]:
# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

### Splitting the Data
This data is used for the Polynomial Features model and is split into training and test sets with a 70-30 split. Make a copy of the dataframe first.

In [5]:
df = df.copy()

In [6]:
from sklearn.model_selection import train_test_split

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

## Training the model
Import the library and instantiate the model.

In [7]:
from sklearn.preprocessing import PolynomialFeatures

# initialize model
poly_model = PolynomialFeatures(degree=2, include_bias=False)

# fit the transformer on the training set and expand its features (y is not needed here)
X_train_poly = poly_model.fit_transform(X_train)

Slicing returns any row of X, for example the first row:

In [11]:
X[:1]

      Open     High      Low         Volume
0  9718.07  9838.33  9728.25  46248430000.0


The first thing to note is that the model is not trained on the whole of X, so it is more accurate to display the first row of X_train. The main point is that there are only 4 features.

In [12]:
X_train[:1]

         Open      High       Low         Volume
150  14401.92  14523.76  14286.54  34784880000.0


The first row of X_train carries index label 150: it is row 150 of the original dataframe, not row 0.
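Because train_test_split shuffles rows by default, the original index labels survive the split but their order changes. The sketch below shows this on a synthetic 10-row frame; the frame and the names ending in `_demo` or starting with `Xd_`/`yd_` are assumptions made for illustration, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the price data: 10 rows with a default 0..9 index
demo = pd.DataFrame({"Open": np.arange(10.0), "Close": np.arange(10.0) + 1.0})
X_demo = demo[["Open"]]
y_demo = demo["Close"]

# Same settings as the notebook: 30% test, fixed seed, shuffling on by default
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(len(Xd_train), len(Xd_test))   # 7 3
print(Xd_train.index.tolist())       # original row labels, in shuffled order
```

X and y are shuffled together, so `Xd_train` and `yd_train` keep matching index labels.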

In [10]:
X_train_poly[1]

array([2.31257900e+04, 2.40848000e+04, 2.26872900e+04, 6.42206718e+10,
       5.34802163e+08, 5.56980027e+08, 5.24661504e+08, 1.48515377e+15,
       5.80077591e+08, 5.46418842e+08, 1.54674204e+15, 5.14713128e+08,
       1.45699300e+15, 4.12429468e+21])

Repeating this for the data in 'X_train_poly' (index 1 selects its second row, since the array is zero-indexed) returns 14 values: the transformer has created 10 new features, for a total of 14. The new features are the 4 squared terms and the 6 pairwise products of the original features (without going into too much detail). Both the original feature values 'X1' to 'Xn' and the squared and cross-product values are returned in this example. Now that this expanded data matrix exists, each feature gets its own parameter weight (coefficient), and the linear regression model can be applied again to the expanded training data.
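To make the 14 values easier to read, the terms can be listed by name. This is a minimal standard-library sketch, not scikit-learn's own code; the helper `polynomial_term_names` is hypothetical. The ordering it produces (originals first, then degree-2 products) appears to match the array above, where the fifth value is Open squared and the last is Volume squared.

```python
from itertools import combinations_with_replacement


def polynomial_term_names(features, degree):
    """List every product of the input features up to `degree` (no bias term)."""
    names = []
    for d in range(1, degree + 1):
        # combinations_with_replacement allows a feature to repeat, giving squares
        for combo in combinations_with_replacement(features, d):
            names.append(" * ".join(combo))
    return names


terms = polynomial_term_names(["Open", "High", "Low", "Volume"], degree=2)
print(len(terms))  # 14
for term in terms:
    print(term)
```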

In [None]:
# import the linear regression library first
from sklearn.linear_model import LinearRegression

# initialize the model
linear_regression = LinearRegression()

# fit the model, this time using the expanded X_train_poly array
linear_regression.fit(X_train_poly, y_train)

The bias term (intercept) and the coefficients are both attributes of the fitted LinearRegression() model, so I can examine them directly:

In [None]:
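# a notebook cell only displays its last expression, so print both
print("Intercept: ", linear_regression.intercept_)
print("Coefficients: ", linear_regression.coef_)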

## Model Validation
Here are some predictions on the test set, before measuring their error and goodness of fit:

In [None]:
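# expand the test features with the transformer fitted on the training set
X_test_poly = poly_model.transform(X_test)

# prediction (the model expects the 14 expanded features, not the 4 raw ones)
y_pred = linear_regression.predict(X_test_poly)
print("Price Predictions: ", y_pred[:5])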

In [None]:
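# predicting price based on Open = C$35,000, High = C$40,000, Low = C$32,000 and Volume = 100bn
# the new point must go through the same polynomial expansion before prediction
# new_point = pd.DataFrame([[35000, 40000, 32000, 100000000000]], columns=X.columns)
# linear_regression.predict(poly_model.transform(new_point))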
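Any new input has to pass through the same polynomial expansion before it reaches the regression. One way to keep the two steps together is scikit-learn's `make_pipeline`, so that `predict` and `score` expand the raw features automatically. This is a sketch on synthetic data, not the notebook's approach; the data and the names `pipe`, `X_demo`, `Xd_*` and `yd_*` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data standing in for the four price features (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 4))
y_demo = (1.0 + X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(0, 0.01, 200))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline expands the features, then fits the linear model on the result
pipe = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                     LinearRegression())
pipe.fit(Xd_train, yd_train)

# Raw 4-column inputs go in; the expansion happens inside the pipeline
print(pipe.predict(Xd_test[:5]))
print("R-squared: ", pipe.score(Xd_test, yd_test))
```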

The polynomial regression fitted on the training data with 'degree=2' can be validated on the test set using the mean squared error and the R-squared score, as before.

In [None]:
from sklearn.metrics import r2_score

# model evaluation
score = r2_score(y_test, y_pred)
print(score)

In [None]:
# reuse the transformer fitted on the training set; do not refit it on the test set
X_test_poly = poly_model.transform(X_test)
print("R-squared: ", linear_regression.score(X_test_poly, y_test))

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
print(mse)

rmse = np.sqrt(mse)
print(rmse)
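For reference, the three metrics can be written out by hand. The sketch below uses four made-up values (not taken from the Bitcoin data) to show how MSE, RMSE and R-squared follow from the residuals.

```python
import numpy as np

# Small made-up example values, not taken from the Bitcoin data
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_hat

# Mean squared error and its square root (same units as the target)
mse_demo = np.mean(residuals ** 2)
rmse_demo = np.sqrt(mse_demo)

# R-squared: 1 minus residual variation over total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(mse_demo)   # 0.1875
print(rmse_demo)
print(r2_demo)    # 0.9625
```

An R-squared of 1 means the predictions match exactly; a model that predicts worse than the mean of `y_true` gives a negative value.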

The R-squared value for polynomial regression with 'degree=2' terms is not as good as the previous score. Having fitted the Polynomial Features model to the training set, I have decided to save the data to a csv file.

In [None]:
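# save the expanded training features rather than the original df (get_feature_names_out requires scikit-learn 1.0+)
pd.DataFrame(X_train_poly, columns=poly_model.get_feature_names_out()).to_csv(r'C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/X_poly.csv', index = False, header = True)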

What happens if I raise the degree of the polynomial terms to 'degree=3'?
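Raising the degree increases the number of features quickly. With n inputs and degree d, the number of polynomial terms including the bias is the binomial coefficient C(n + d, d). A small sketch of that count follows; the helper `n_polynomial_features` is hypothetical.

```python
from math import comb


def n_polynomial_features(n_inputs, degree, include_bias=False):
    """Count the monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)
    # Dropping the constant (bias) column removes exactly one term
    return total if include_bias else total - 1


for degree in (1, 2, 3, 4):
    print(degree, n_polynomial_features(4, degree))
# 1 4
# 2 14
# 3 34
# 4 69
```

For the four inputs used here, 'degree=3' therefore gives the regression 34 columns to fit instead of 14.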

## Training the model
Instantiate the model again, this time with 'degree=3'.

In [None]:
# initialize model
poly_model = PolynomialFeatures(degree=3, include_bias=False)

# fit the transformer on the training set and expand its features (y is not needed here)
X_train_poly = poly_model.fit_transform(X_train)

Applying the linear model once more:

In [None]:
# import the linear regression library first
from sklearn.linear_model import LinearRegression

# initialize the model
linear_regression = LinearRegression()

# fit the model, this time using the expanded X_train_poly array
linear_regression.fit(X_train_poly, y_train)

Intercept and coefficients:

In [None]:
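# a notebook cell only displays its last expression, so print both
print("Intercept: ", linear_regression.intercept_)
print("Coefficients: ", linear_regression.coef_)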

## Model Validation
This time, for 'degree=3' terms, the RMSE and R-squared measures on the test set are:

In [None]:
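# expand the test features with the transformer fitted on the training set
X_test_poly = poly_model.transform(X_test)

# prediction (the model expects the expanded features, not the 4 raw ones)
y_pred = linear_regression.predict(X_test_poly)
print("Price Predictions: ", y_pred[:5])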

In [None]:
# model evaluation
score = r2_score(y_test, y_pred)
print(score)

In [None]:
# reuse the transformer fitted on the training set; do not refit it on the test set
X_test_poly = poly_model.transform(X_test)
print("R-squared: ", linear_regression.score(X_test_poly, y_test))

In [None]:
mse = mean_squared_error(y_test, y_pred)
print(mse)

rmse = np.sqrt(mse)
print(rmse)

This causes a significant drop in the R-squared value, that is, in the goodness of fit, so 'degree=3' may not be the most accurate model to use. I have decided to see if I can improve the model's predictive power by using a Decision Tree Regression model: "https://github.com/lynstanford/machine-learning-projects/tree/master/machine-learning/decision_tree.ipynb".
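The degree comparison carried out over the sections above could also be condensed into a single loop. The sketch below does this on synthetic data with a known quadratic relationship; the data and all names are assumptions, and the scores it prints say nothing about the Bitcoin results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a true degree-2 relationship plus noise (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 4))
y_demo = (1.0 + X_demo[:, 0] ** 2 + 2.0 * X_demo[:, 1] * X_demo[:, 2]
          + rng.normal(0, 0.05, 200))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit one pipeline per degree and record the R-squared on the held-out test set
results = {}
for degree in (1, 2, 3):
    pipe = make_pipeline(PolynomialFeatures(degree=degree, include_bias=False),
                         LinearRegression())
    pipe.fit(Xd_train, yd_train)
    results[degree] = r2_score(yd_test, pipe.predict(Xd_test))
    print(f"degree={degree}: test R-squared = {results[degree]:.4f}")
```

Scoring on the test set, never the training set, is what lets a drop in fit at higher degrees show up.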