# Model Selection
## Training the Data
Next I want to split the data into predictors and a target variable, with all my feature columns in one dataframe and the target in a column vector. First I import the libraries and load the data:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# import data (read_csv already returns a DataFrame)
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
# drop any rows that contain missing values
df = bitcoin.dropna(axis=0)

In [2]:
df.head(2)

         Date     Open      High      Low     Close  Adj Close        Volume
0  2020-04-21  9718.07   9838.33  9728.25   9763.94    9763.94  4.624843e+10
1  2020-04-22  9762.68  10125.87  9747.39  10096.28   10096.28  4.716635e+10


In [3]:
print(df)

           Date      Open      High       Low     Close  Adj Close  \
0    2020-04-21   9718.07   9838.33   9728.25   9763.94    9763.94   
1    2020-04-22   9762.68  10125.87   9747.39  10096.28   10096.28   
2    2020-04-23  10102.09  10533.73  10009.76  10451.16   10451.16   
3    2020-04-24  10457.43  10678.71  10457.43  10642.81   10642.81   
4    2020-04-25  10642.22  10773.18  10601.61  10669.64   10669.64   
..          ...       ...       ...       ...       ...        ...   
360  2021-04-16  79368.23  79831.47  75262.20  77018.32   77018.32   
361  2021-04-17  76964.70  78268.41  75503.00  75906.36   75906.36   
362  2021-04-18  75928.95  76373.72  66081.83  70374.91   70374.91   
363  2021-04-19  70344.11  71803.61  68107.27  69788.23   69788.23   
365  2021-04-21  71509.30  71556.76  71132.08  71477.76   71477.76   

           Volume  
0    4.624843e+10  
1    4.716635e+10  
2    6.119120e+10  
3    4.881932e+10  
4    4.643028e+10  
..            ...  
360  1.050000e+11  
361  8.272967e+10  
362  1.220000e+11  
363  8.183693e+10  
365  8.495634e+10  

[362 rows x 7 columns]

In [4]:
# all column names
df.columns

Index(['Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

In [5]:
# all column data types
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume       float64
dtype: object

Assign the dependent variable, y, for the modelling process.

In [6]:
y = df['Close']
print(y)

0       9763.94
1      10096.28
2      10451.16
3      10642.81
4      10669.64
         ...   
360    77018.32
361    75906.36
362    70374.91
363    69788.23
365    71477.76
Name: Close, Length: 362, dtype: float64


The features are a selection of columns used to predict 'y', also known as the independent variables. I am choosing to leave the 'Date' and 'Adj Close' columns out of this dataframe.

Note, I can either store the feature names in a variable that can be referenced later, or I can pass the feature names directly as a list when indexing the dataframe. For example:

In [7]:
bitcoin_features = ["Open","High","Low","Volume"]
print(bitcoin_features)

['Open', 'High', 'Low', 'Volume']


In [8]:
# select features
X = df[bitcoin_features]
print(X)

         Open      High       Low        Volume
0     9718.07   9838.33   9728.25  4.624843e+10
1     9762.68  10125.87   9747.39  4.716635e+10
2    10102.09  10533.73  10009.76  6.119120e+10
3    10457.43  10678.71  10457.43  4.881932e+10
4    10642.22  10773.18  10601.61  4.643028e+10
..        ...       ...       ...           ...
360  79368.23  79831.47  75262.20  1.050000e+11
361  76964.70  78268.41  75503.00  8.272967e+10
362  75928.95  76373.72  66081.83  1.220000e+11
363  70344.11  71803.61  68107.27  8.183693e+10
365  71509.30  71556.76  71132.08  8.495634e+10

[362 rows x 4 columns]


In [9]:
# an alternative way
X = df[["Open", "High", "Low", "Volume"]]
print(X)

         Open      High       Low        Volume
0     9718.07   9838.33   9728.25  4.624843e+10
1     9762.68  10125.87   9747.39  4.716635e+10
2    10102.09  10533.73  10009.76  6.119120e+10
3    10457.43  10678.71  10457.43  4.881932e+10
4    10642.22  10773.18  10601.61  4.643028e+10
..        ...       ...       ...           ...
360  79368.23  79831.47  75262.20  1.050000e+11
361  76964.70  78268.41  75503.00  8.272967e+10
362  75928.95  76373.72  66081.83  1.220000e+11
363  70344.11  71803.61  68107.27  8.183693e+10
365  71509.30  71556.76  71132.08  8.495634e+10

[362 rows x 4 columns]


A couple of important things to note here. Firstly, because I already dropped the rows with missing values from the dataframe, 362 complete rows remain. Their index labels still run from 0 to 365, with gaps where rows were dropped (label 364, for example, is missing). I don't need to call dropna() on X and y individually because both were selected from the cleaned dataframe (df).

Secondly, storing the feature names in a separate variable such as 'bitcoin_features' really only comes in handy when there are a large number of features, perhaps too many to type into a list each time; for the purpose of this exercise I prefer entering each feature name individually.
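To illustrate the first point, here is a minimal sketch on a toy dataframe. The values and the names `toy`, `clean`, `X_toy` and `y_toy` are made up for illustration and are not part of the BTC_CAD data. It shows that dropping rows once on the dataframe leaves gaps in the index labels, and that features and target selected afterwards stay aligned.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing entry in row 2.
toy = pd.DataFrame({
    "Open": [1.0, 2.0, np.nan, 4.0],
    "Close": [1.5, 2.5, 3.5, 4.5],
})

clean = toy.dropna(axis=0)   # drops row 2, keeps labels 0, 1 and 3
X_toy = clean[["Open"]]      # features taken from the cleaned frame
y_toy = clean["Close"]       # target taken from the same cleaned frame

print(clean.index.tolist())              # index labels keep their gap
print(X_toy.index.equals(y_toy.index))   # features and target line up
```

Because both selections come from the same cleaned frame, there is no need to clean them separately.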

## Splitting the Data
Split the data into training and test sets with a 70-30 split, after first making a copy of the dataframe.

In [10]:
df = df.copy()

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Printing out the shape of the training sets for both the X matrix and y vector gives:

In [12]:
print(X_train.shape)
print(y_train.shape)

(253, 4)
(253,)


And the shape of the test data:

In [13]:
print(X_test.shape)
print(y_test.shape)

(109, 4)
(109,)


So I can see that 253/362 * 100 ≈ 69.9% and 109/362 * 100 ≈ 30.1%, which matches the 70-30 split for the train and test sets respectively. Next I will fit a linear regression model to the training set.
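As a quick check on those row counts, the arithmetic can be reproduced by hand. This is a sketch under one assumption: that a fractional test_size is rounded up to a whole number of rows and the training set takes whatever is left, which is my understanding of how train_test_split behaves when only test_size is given.

```python
import math

n_samples = 362    # rows left after dropna, from the output above
test_size = 0.3

# Assumption: test rows are rounded up, training rows are the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
print(round(100 * n_train / n_samples, 1), round(100 * n_test / n_samples, 1))
```

The counts agree with the shapes printed above, and the percentages are close to, but not exactly, 70 and 30.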

In [14]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)

LinearRegression()

Now that the model has been fit to the training data, try making a prediction on the test set first.

In [15]:
price_predictions = linear_regression.predict(X_test)
print("Predictions: ", linear_regression.predict(X_test.iloc[:5]))

Predictions:  [14235.0672502  12655.46154078 13223.19509735 70402.98731737
 12747.62432462]


Now try a prediction by inputting my own values.

In [17]:
# predicting price based on Open = C$30,000, High = C$40,000, Low = C$29,000 and Volume = 100bn
linear_regression.predict([[30000, 40000, 29000, 100000000000]])

array([38754.37951656])
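When passing hand-typed values like this, the order must match the feature columns used for fitting. One way to make that explicit is to wrap the values in a dataframe with the same column names. The sketch below uses made-up numbers and hypothetical names (`toy_X`, `toy_y`, `toy_model`, `new_day`) rather than the model fitted above. As far as I know, recent scikit-learn versions also emit a warning when a model fitted on a dataframe is later given a plain list, which this approach avoids.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Open", "High", "Low", "Volume"]

# Made-up training rows, only so there is something to fit.
toy_X = pd.DataFrame(
    [[10.0, 12.0, 9.0, 100.0],
     [11.0, 13.0, 10.5, 120.0],
     [12.0, 15.0, 10.0, 90.0],
     [13.0, 14.0, 12.0, 110.0],
     [14.0, 17.0, 12.5, 130.0],
     [15.0, 16.0, 14.0, 95.0]],
    columns=features,
)
toy_y = pd.Series([11.0, 12.5, 13.0, 13.5, 15.5, 15.0])

toy_model = LinearRegression().fit(toy_X, toy_y)

# One hypothetical day, labelled with the same column names used for fitting.
new_day = pd.DataFrame([[16.0, 18.0, 15.0, 105.0]], columns=features)
print(toy_model.predict(new_day))
```

The result is still an array with one predicted price per input row.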

# Metrics
Checking the predicted values against the BTC_CAD.csv dataset, I can see they are not exact. One way to quantify the error is the MSE (mean squared error), but first I will check the R-squared measure to establish the overall goodness of fit on the test set.

In [None]:
print("R-squared: ", linear_regression.score(X_test, y_test))

This is a fairly high score, and I can see the same relationship in a scatter plot of actual prices against predicted prices. The next cell also computes the MSE and its square root, the RMSE.

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, price_predictions)
print(mse)

rmse = np.sqrt(mse)
print(rmse)

# visualizing the relationship between actual and predicted values for y
plt.scatter(y_test, price_predictions)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")
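
For reference, the three metrics used above can be computed by hand with NumPy. This is a minimal sketch on made-up numbers (the arrays `actual` and `predicted` are hypothetical, not taken from the test set), using the usual definitions: MSE is the mean of the squared residuals, RMSE is its square root, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only.
actual = np.array([100.0, 150.0, 200.0, 250.0])
predicted = np.array([110.0, 140.0, 210.0, 240.0])

residuals = actual - predicted
mse = np.mean(residuals ** 2)                    # mean squared error
rmse = np.sqrt(mse)                              # back in the units of the price
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(mse, rmse, r_squared)
```

Since the RMSE is in the same units as the target, it is usually the easier of the two error figures to compare against the prices themselves.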

To find the intercept and coefficients:

In [None]:
print(linear_regression.intercept_)
print(linear_regression.coef_)
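
To show what these two attributes mean, the sketch below fits a model on synthetic data and rebuilds its predictions as the intercept plus the dot product of the features with the coefficients. The data and the names `toy_X`, `toy_y`, `toy_model` and `manual` are assumptions made for illustration and are not the bitcoin model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features and an exactly linear target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(20, 2))
toy_y = 3.0 + 2.0 * toy_X[:, 0] - 1.0 * toy_X[:, 1]

toy_model = LinearRegression().fit(toy_X, toy_y)

# Rebuild the predictions from the fitted parameters.
manual = toy_model.intercept_ + toy_X @ toy_model.coef_

print(toy_model.intercept_, toy_model.coef_)
print(np.allclose(manual, toy_model.predict(toy_X)))
```

The same reading applies to the bitcoin model: each coefficient is the change in predicted 'Close' for a one-unit change in that feature, with the other features held fixed.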