In [1]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# import data
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
df = bitcoin.dropna(axis=0)  # read_csv already returns a DataFrame

In [2]:
# all column names
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [3]:
# all column data types
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

## Feature Engineering and Model Selection
Assign the (dependent) y variable and the (independent) X variables for the modelling process.

In [4]:
# select data for modeling
X = df[["Open", "High", "Low", "Volume"]]
y = df["Close"]

### Splitting the Data
The data selected above is used for the Polynomial Features model and is split into training and test sets with a 70-30 split. Make a copy of the dataframe first.

In [5]:
df = df.copy()

In [6]:
from sklearn.model_selection import train_test_split

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
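The sizes produced by a 70-30 split can be checked on stand-in data. The sketch below assumes a dataset of 362 rows, a figure inferred from the sample counts (109 and 253) that appear in the error messages later in this notebook; the arrays `X_demo` and `y_demo` are hypothetical placeholders, not the Bitcoin data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 362 rows, 4 feature columns
n_rows = 362
X_demo = np.arange(n_rows * 4, dtype=float).reshape(n_rows, 4)
y_demo = np.arange(n_rows, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The test share is rounded up and the remainder goes to training
print(X_tr.shape, X_te.shape)  # (253, 4) (109, 4)
```

Because 0.3 x 362 is not a whole number, the test set is rounded up to 109 rows and the remaining 253 rows form the training set.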

## Training the model
Import the library and instantiate the model.

In [7]:
from sklearn.preprocessing import PolynomialFeatures

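# initialize the polynomial feature transformer
poly_model = PolynomialFeatures(degree=2, include_bias=False)

# fit the transformer on the training set and expand its features
# (the target is not needed: PolynomialFeatures ignores y)
X_train_poly = poly_model.fit_transform(X_train)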

Indexing can be used to return any row of X, for example the first row:

In [34]:
X[:1]

      Open     High      Low         Volume
0  9718.07  9838.33  9728.25  46248430000.0


Now do the same for the 'X_train_poly' set, this time returning its second row (index 1):

In [35]:
X_train_poly[1]

array([2.31257900e+04, 2.40848000e+04, 2.26872900e+04, 6.42206718e+10,
       5.34802163e+08, 5.56980027e+08, 5.24661504e+08, 1.48515377e+15,
       5.80077591e+08, 5.46418842e+08, 1.54674204e+15, 5.14713128e+08,
       1.45699300e+15, 4.12429468e+21, 1.23677225e+13, 1.28806031e+13,
       1.21332118e+13, 3.43453542e+19, 1.34147526e+13, 1.26363674e+13,
       3.57696315e+19, 1.19031477e+13, 3.36941142e+19, 9.53775727e+25,
       1.39710528e+13, 1.31603885e+13, 3.72529726e+19, 1.23967627e+13,
       3.50913851e+19, 9.93328125e+25, 1.16774460e+13, 3.30552228e+19,
       9.35690695e+25, 2.64864975e+32])

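Both the original feature values from 'X' and their squared values from 'X_train_poly' are returned in this example, along with the products of each pair of features. Note that the array shown has 34 entries, which is the size of a 'degree=3' expansion of four features, so this cell appears to have been run after the later refit; with 'degree=2' the expanded row has 14 entries (4 original, 4 squared and 6 pairwise products). The expansion increases the number of features and, with it, the number of parameter weights (or coefficients). Now that this new data matrix containing the additional features has been created, the linear regression model can be applied again to the expanded data.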
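To make the layout of an expanded row concrete, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation; the helper `expand_row` is a hypothetical name, and the term ordering (by degree, then by feature combination) is an assumption that agrees with the values printed in the array above.

```python
from itertools import combinations_with_replacement
from math import prod

def expand_row(row, degree):
    """Return every product of 1..degree entries of row (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(row)), d):
            terms.append(prod(row[i] for i in combo))
    return terms

# Raw feature values read from the first four entries of the array above
row = [23125.79, 24084.80, 22687.29, 6.42206718e10]

degree2 = expand_row(row, 2)
degree3 = expand_row(row, 3)

print(len(degree2), len(degree3))  # 14 34
print(degree2[4])                  # Open squared, roughly 5.348e8
```

The fifth entry is Open squared, which matches the fifth value in the printed array, and the first 14 entries of the cubic expansion are exactly the quadratic expansion.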

In [10]:
# import the linear regression library first
from sklearn.linear_model import LinearRegression

# initialize the model
linear_regression = LinearRegression()

# fit the model, this time on the polynomial-expanded training features
linear_regression.fit(X_train_poly, y_train)

LinearRegression()

The bias term (intercept) and the coefficients are both attributes of the fitted LinearRegression() model, so I can examine them directly. Only the last expression in a notebook cell is displayed, so the output shows the coefficients:

In [11]:
linear_regression.intercept_
linear_regression.coef_

array([ 2.04746753e-11,  1.34619966e-08,  7.63252662e-11,  3.95073614e-08,
       -6.83473603e-06, -5.82358084e-07, -8.40449336e-07, -1.49894184e-12,
        5.83557909e-06,  5.55122986e-06,  2.54559597e-12,  5.15929020e-06,
        2.50892933e-12, -4.61371567e-19])

## Model Validation
Here are some predictions using the test set data, before measuring their error and accuracy:

In [22]:
# prediction: expand the test features with the fitted transformer first,
# because the model was trained on 14 polynomial columns, not the 4 raw ones
# (passing the raw X_test raises a matmul dimension-mismatch ValueError)
X_test_poly = poly_model.transform(X_test)
y_pred = linear_regression.predict(X_test_poly)
# print("Price Predictions: ", linear_regression.predict(X_test_poly[:5]))
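One way to avoid a feature-count mismatch at prediction time is to wrap the expansion and the regression in a single scikit-learn pipeline, so that raw four-column input is expanded automatically. This is a minimal sketch on synthetic data, not the notebook's own method; `X_demo`, `y_demo` and `poly_reg` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical): 4 raw features, quadratic target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(200, 4))
y_demo = (
    1.5 * X_demo[:, 0] ** 2
    - 2.0 * X_demo[:, 1] * X_demo[:, 2]
    + X_demo[:, 3]
)

# The pipeline applies the expansion and the regression as one estimator
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_reg.fit(X_demo[:150], y_demo[:150])

# Raw 4-column input is accepted; the expansion happens inside predict()
y_hat = poly_reg.predict(X_demo[150:])
print(y_hat.shape)  # (50,)
print(poly_reg.score(X_demo[150:], y_demo[150:]))
```

With this design the same object handles both training and prediction, so there is no separately expanded array to keep in sync.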

In [21]:
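# predicting price based on Open = C$35,000, High = C$40,000, Low = C$32,000 and Volume = 100bn
# (the raw values must be expanded with the fitted transformer before predicting)
# linear_regression.predict(poly_model.transform([[35000, 40000, 32000, 100000000000]]))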

The 'degree=2' polynomial regression can be validated on the test data using the mean squared error score and the R-squared accuracy measure, as before.

In [19]:
from sklearn.metrics import r2_score

# model evaluation: predict from the expanded test features so that
# y_pred has one value per test sample (109), matching y_test
y_pred = linear_regression.predict(poly_model.transform(X_test))
score = r2_score(y_test, y_pred)
print(score)

In [24]:
# reuse the transformer fitted on the training set rather than refitting it on the test set
X_test_poly = poly_model.transform(X_test)
print("R-squared: ", linear_regression.score(X_test_poly, y_test))

R-squared:  0.9789579261471753


In [25]:
from sklearn.metrics import mean_squared_error
import numpy as np

# recompute the predictions from the expanded test features so that
# y_pred and y_test contain the same number of samples
y_pred = linear_regression.predict(poly_model.transform(X_test))

mse = mean_squared_error(y_test, y_pred)
print(mse)

rmse = np.sqrt(mse)
print(rmse)
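Both measures are simple functions of the residuals, which is why the predictions and the targets must have the same length. The sketch below computes them by hand with NumPy on made-up numbers; the helper names `r_squared` and `root_mean_squared_error` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared: 1 minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def root_mean_squared_error(y_true, y_pred):
    """RMSE: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets and predictions, for illustration only
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]

print(r_squared(y_true_demo, y_pred_demo))                # about 0.9625
print(root_mean_squared_error(y_true_demo, y_pred_demo))  # about 0.433
```

Here the squared residuals sum to 0.75 and the total sum of squares is 20, giving an R-squared of about 0.9625 and an RMSE of about 0.433.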

The R-squared value of about 0.979 is not as good as the previous score, even after adding the squared terms of a 'degree=2' polynomial regression. Having applied the Polynomial Features transformer and fitted it to the training set, I have decided to save the data to a csv file.

In [None]:
# save the polynomial-expanded training features
pd.DataFrame(X_train_poly).to_csv(r'C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/X_poly.csv', index = False, header = True)

What if I change the 'degree' parameter, which sets the highest power to which the features are multiplied together, to 'degree=3' in the polynomial equation?
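A compact way to ask this question for several degrees at once is to loop over the 'degree' parameter and record the test R-squared each time. The sketch below uses synthetic data with a known quadratic relationship, so its scores say nothing about the Bitcoin data; all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (hypothetical): quadratic signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(300, 4))
noise = rng.normal(0.0, 1.0, size=300)
y_demo = X_demo[:, 0] ** 2 + 3.0 * X_demo[:, 1] + noise

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_tr, y_tr)
    # Number of columns produced by the expansion at this degree
    n_cols = model.named_steps["polynomialfeatures"].n_output_features_
    results[degree] = (n_cols, model.score(X_te, y_te))
    print(degree, n_cols, round(results[degree][1], 4))
```

With four input features the expansion grows from 4 to 14 to 34 columns, so each extra degree gives the linear model many more coefficients to fit from the same number of training rows.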

## Training the model
Import the library and instantiate the model.

In [26]:
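# initialize the polynomial feature transformer
poly_model = PolynomialFeatures(degree=3, include_bias=False)

# fit the transformer on the training set and expand its features
# (the target is not needed: PolynomialFeatures ignores y)
X_train_poly = poly_model.fit_transform(X_train)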

Applying the linear model once more:

In [27]:
# import the linear regression library first
from sklearn.linear_model import LinearRegression

# initialize the model
linear_regression = LinearRegression()

# fit the model but using the X_poly dataframe this time
linear_regression.fit(X_train_poly, y_train)

LinearRegression()

Intercept and coefficients (only the coefficients, as the last expression in the cell, are displayed):

In [28]:
linear_regression.intercept_
linear_regression.coef_

array([ 1.57920482e-18, -6.91842112e-18, -1.69411691e-24, -5.36908367e-26,
        5.31897352e-26, -2.40654900e-26, -2.27653160e-26, -2.17648301e-20,
       -2.16346124e-26, -2.04355958e-26, -2.06217581e-20, -1.93355632e-26,
       -1.90482242e-20,  1.33547628e-18, -7.21477605e-22, -5.13064362e-22,
       -4.97906906e-22, -2.10148835e-16, -2.89871854e-22, -2.87502907e-22,
       -3.19332152e-17, -2.86778022e-22, -2.52546597e-17,  1.10083632e-22,
       -5.11386835e-23, -6.22584583e-23,  1.54660625e-16, -7.46155384e-23,
        1.58244526e-16, -7.67425545e-23, -8.79444001e-23,  1.61281158e-16,
       -1.06413159e-22,  3.29565132e-30])

## Model Validation
This time, for 'degree=3' terms, the RMSE score and R-squared measure are again calculated on the test set:

In [29]:
# prediction: expand the test features with the fitted transformer first,
# because the model was trained on 34 polynomial columns, not the 4 raw ones
# (passing the raw X_test raises a matmul dimension-mismatch ValueError)
X_test_poly = poly_model.transform(X_test)
y_pred = linear_regression.predict(X_test_poly)
print("Price Predictions: ", linear_regression.predict(X_test_poly[:5]))

In [30]:
# model evaluation: predict from the expanded test features so that
# y_pred has one value per test sample (109), matching y_test
y_pred = linear_regression.predict(poly_model.transform(X_test))
score = r2_score(y_test, y_pred)
print(score)

In [32]:
# reuse the transformer fitted on the training set rather than refitting it on the test set
X_test_poly = poly_model.transform(X_test)
print("R-squared: ", linear_regression.score(X_test_poly, y_test))

R-squared:  0.935684565908637


In [33]:
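# recompute the predictions from the expanded test features so that
# y_pred and y_test contain the same number of samples
y_pred = linear_regression.predict(poly_model.transform(X_test))

mse = mean_squared_error(y_test, y_pred)
print(mse)

rmse = np.sqrt(mse)
print(rmse)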

Moving to 'degree=3' causes a noticeable drop in the test R-squared value, from about 0.979 to about 0.936, so the cubic model fits unseen data less well and may not be the most accurate model to use. I have decided to see if I can improve the model's predictive power by using a Decision Tree Regression model: "https://github.com/lynstanford/machine-learning-projects/tree/master/machine-learning/decision_tree.ipynb".