# Model Selection
## Training the Data
Next I want to split the data into predictors and a target variable: all the feature columns go in one dataframe (X) and the target goes in a column vector (y). First, import the libraries and load the data:

In [None]:
# import libraries
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# import data and drop any rows that contain missing values
bitcoin = pd.read_csv("C:/Users/lynst/Documents/GitHub/machine-learning-projects/machine-learning/BTC_CAD.csv")
df = bitcoin.dropna(axis=0)  # read_csv already returns a DataFrame

Checking the data in the first couple of rows:

In [None]:
df.head(2)

In [None]:
print(df)

In [None]:
# all column names
df.columns

In [None]:
# all column data types
df.dtypes

Assign the dependent (y) variable for the modelling process.

In [None]:
y = df['Close']
print(y)

The features are a selection of columns used to predict 'y', also known as the independent variables. I am choosing to leave the 'Date' and 'Adj Close' columns out of this dataframe.

Note that I can either store the feature names in a list variable and use that variable to index the dataframe, or pass the list of feature names to the dataframe directly. For example:

In [None]:
bitcoin_features = ["Open","High","Low","Volume"]
print(bitcoin_features)

In [None]:
# select features
X = df[bitcoin_features]
print(X)

In [None]:
# an alternative way
X = df[["Open", "High", "Low", "Volume"]]
print(X)

A couple of important things to note here. Firstly, because I already dropped the 3 rows containing not-available entries, 362 of the original 365 rows remain, which is the expected count. I don't need to call dropna() on X and y individually because the operation was already applied to the dataframe (df) they were selected from.
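
The point about dropna() can be checked on a small example. The sketch below uses a made-up five-row frame (the name `toy` and its values are hypothetical, not taken from BTC_CAD.csv) and shows that features and target selected from the same cleaned dataframe stay aligned row for row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data: 5 rows, 2 with missing values
toy = pd.DataFrame({
    "Open":  [100.0, 101.0, np.nan, 103.0, 104.0],
    "Close": [100.5, np.nan, 102.5, 103.5, 104.5],
})

clean = toy.dropna(axis=0)   # drop every row that contains at least one NaN
X_toy = clean[["Open"]]      # features selected from the cleaned frame
y_toy = clean["Close"]       # target selected from the same cleaned frame

print(len(toy), len(clean))              # rows before and after dropping
print(X_toy.index.equals(y_toy.index))   # X and y share the same row labels
```

Here two of the five rows are dropped, and the last line prints True because both selections inherit their index from `clean`.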

Secondly, storing the feature names in a separate variable such as 'bitcoin_features' really only comes in handy when there are a large number of features, perhaps too many to type out each time they are needed; for the purpose of this exercise I prefer entering each feature name individually.
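
When there are many columns, the feature list does not have to be typed at all. As a rough sketch, it can be built by excluding the unwanted names from the full column list. The column order below is an assumption about the layout of BTC_CAD.csv; in the notebook the list would come from `df.columns`.

```python
# Assumed column names for BTC_CAD.csv; in the notebook use list(df.columns) instead
all_columns = ["Date", "Open", "High", "Low", "Close", "Adj Close", "Volume"]

# The target ('Close') plus the two columns deliberately left out
excluded = ["Date", "Close", "Adj Close"]

# Keep every column that is not excluded, preserving the original order
feature_names = [name for name in all_columns if name not in excluded]
print(feature_names)
```

This gives the same four names as the hand-typed list: Open, High, Low and Volume.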

## Splitting the Data
Split the data into training and test sets with a 70-30 split, but first make a copy of the dataframe.

In [None]:
df = df.copy()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Printing out the shape of the training sets for both the X matrix and y vector gives:

In [None]:
print(X_train.shape)
print(y_train.shape)

And the shape of the test data:

In [None]:
print(X_test.shape)
print(y_test.shape)
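
The row counts in these shapes can be anticipated from the 362 rows that remain after dropna(). The sketch below assumes that train_test_split rounds the test share up to a whole number of rows when test_size is a float and gives the remainder to the training set; that is my understanding of its behaviour rather than something shown in this notebook.

```python
import math

n_samples = 362   # rows left after dropping missing values
test_size = 0.3   # same value passed to train_test_split

n_test = math.ceil(test_size * n_samples)   # 108.6 rounds up to 109 (assumed rounding rule)
n_train = n_samples - n_test                # the remaining rows go to training

print(n_train, n_test)
print(round(100 * n_train / n_samples, 1), round(100 * n_test / n_samples, 1))
```

The split cannot be exactly 70-30 because 362 is not divisible that way, so the shares come out at roughly 69.9% and 30.1%.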

So I can see that 253/362 * 100 ≈ 70% and 109/362 * 100 ≈ 30% for the train and test sets respectively. Next I will instantiate a linear regression model and fit it to the training data.

In [None]:
# instantiate model
linear_regression = LinearRegression()

# fit model
linear_regression.fit(X_train, y_train)

Now that the model has been fit to the training data, try making a prediction on the first 5 rows of the test set.

In [None]:
price_predictions = linear_regression.predict(X_test)
print("Predictions: ", linear_regression.predict(X_test.iloc[:5]))

Now try a prediction by inputting my own values.

In [None]:
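# predicting price based on Open = C$30,000, High = C$40,000, Low = C$29,000 and Volume = 100bn
# wrap the values in a DataFrame with the training column names to avoid a feature-name warning
new_data = pd.DataFrame([[30000, 40000, 29000, 100000000000]], columns=X.columns)
linear_regression.predict(new_data)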

## Metrics
Checking the predicted values against the BTC_CAD.csv dataset, I can see they are not exactly accurate. One way to quantify the error is the MSE (mean squared error), but first I will check the R-squared measure to establish the overall goodness of fit.

In [None]:
print("R-squared: ", linear_regression.score(X_test, y_test))

This is a fairly high score, and I can see this relationship in a scatter plot of actual prices against predicted prices, drawn below together with the MSE and RMSE.
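
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. A minimal sketch of that calculation, using made-up numbers rather than the Bitcoin prices:

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 16.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot

print(r_squared)
```

With these numbers the residual sum of squares is 2 and the total sum of squares is 20, so R-squared is 0.9. A value close to 1 means the predictions account for most of the variation in the actual values.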

In [None]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, price_predictions)
print(mse)

rmse = np.sqrt(mse)
print(rmse)

# visualizing the relationship between actual and predicted values for y
plt.scatter(y_test, price_predictions)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs Predicted Prices")
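
To make the two error measures concrete, here is a small sketch that computes them by hand on made-up values (not the Bitcoin prices). MSE averages the squared differences between actual and predicted values, and RMSE takes the square root so the result is back in the same units as the price.

```python
import numpy as np

# Hypothetical actual and predicted values, chosen only for illustration
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([11.0, 12.0, 13.0, 16.0])

mse_manual = np.mean((y_true - y_pred) ** 2)   # average squared error, in squared units
rmse_manual = np.sqrt(mse_manual)              # same units as the original values

print(mse_manual)
print(rmse_manual)
```

Two of the four predictions are off by 1, so the MSE is 0.5 and the RMSE is about 0.707.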

To find the intercept and coefficients:

In [None]:
print(linear_regression.intercept_)
print(linear_regression.coef_)
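
The intercept and coefficients fully describe a fitted linear model: a prediction is the intercept plus the sum of each feature multiplied by its coefficient. The sketch below checks this on hypothetical noise-free data (the names `X_demo`, `y_demo` and `model` are illustrative and separate from the Bitcoin model).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 2 + 3*x1 - 1*x2
X_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 2 + 3 * X_demo[:, 0] - 1 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from the fitted parameters
manual = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual, model.predict(X_demo)))
```

The check prints True, and because the data contain no noise the fitted intercept and coefficients recover 2, 3 and -1 up to floating point error. For the Bitcoin model, the four coefficients line up with Open, High, Low and Volume in that order.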