# BTC Price Prediction (Preprocessing and Training)

### This notebook contains:
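- Use of "Standard Scaler" to put all features on a comparable scale for modeling (no large differences in magnitude). Note that the scaler here is fit on the full dataframe before splitting, so strictly preventing leakage would require fitting it on the training subset only
- Splitting of dataframe into "testing" and "training" subsets using "train_test_split"
- Overall preprocessing to prep for modeling, plus two baseline multiple linear regression (OLS) models ahead of the tree-based methods planned for the next stage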

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import linear_model, preprocessing
from sklearn.metrics import accuracy_score
from sklearn import metrics
import statsmodels.api as sm

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [2]:
training_data = pd.read_csv('data/train.csv', index_col='Date')

In [3]:
training_data.head()

Date,close,volume,ema_short,ema_long,atr,obv,tweet_sentiment,close_nextday
2019-09-02,10340.0,44740.25,10164.518939,10452.265343,530.693553,225053.863244,-1.0,10615.28
2019-09-03,10615.28,47998.38,10207.448563,10458.658074,528.572585,273052.240025,0.5,10567.02
2019-09-04,10567.02,43943.89,10241.693462,10462.907561,521.468114,229108.350999,0.5,10564.49
2019-09-05,10564.49,33970.96,10272.43599,10466.891187,516.363249,195137.39036,0.5,10298.73
2019-09-06,10298.73,58799.64,10274.940181,10460.29663,533.470874,136337.749401,0.0,10455.88


In [None]:
# check for nulls

print(pd.isnull(training_data).sum())

In [None]:
training_data.info()

In [None]:
training_data.describe()

In [None]:
print(training_data.shape)

### Scaling the dataframe

In [None]:
ss = StandardScaler()
df_scaled = pd.DataFrame(ss.fit_transform(training_data),
                         index=training_data.index,
                         columns=training_data.columns)

In [None]:
df_scaled.head()

### Dataframe split (training and testing)

In [None]:
X = df_scaled.drop(labels=['close_nextday'], axis=1)
y = df_scaled['close_nextday']
#X = sm.add_constant(X)

print(X.head(5))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42)
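
The cells above scale the full dataframe first and split afterwards, so the scaler's mean and standard deviation are computed from rows that later land in the test subset. A minimal sketch of the alternative ordering, splitting first and fitting the scaler on the training rows only, might look like the following. The synthetic dataframe and the names `demo_df`, `scaler`, `X_tr_scaled` and `X_te_scaled` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for training_data (synthetic values, not real BTC data)
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "close": rng.normal(10_000, 500, size=200),
    "volume": rng.normal(45_000, 8_000, size=200),
})
demo_df["close_nextday"] = demo_df["close"] + rng.normal(0, 100, size=200)

X_demo = demo_df.drop(columns=["close_nextday"])
y_demo = demo_df["close_nextday"]

# 1) Split first, with the same settings used in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# 2) Fit the scaler on the training rows only, then reuse it on the test rows
scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr),
                           index=X_tr.index, columns=X_tr.columns)
X_te_scaled = pd.DataFrame(scaler.transform(X_te),
                           index=X_te.index, columns=X_te.columns)

# Training columns have mean ~0 by construction; test columns are only close to 0
print(X_tr_scaled.mean().round(3))
print(X_te_scaled.mean().round(3))
```

With this ordering the test rows never influence the scaling parameters, which is the usual way to keep preprocessing from leaking information into the evaluation.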

#### Multiple Linear Regression Using OLS (Ordinary Least Squares)

In [None]:
lr = sm.OLS(y_train, X_train)

results = lr.fit()
results.summary()

From the summary results, we can see that the Adj. R-squared was 0.970. This shows that the model explains the variability of the target fairly well. Because no constant was added to X, statsmodels reports this as an uncentered R-squared. This metric alone does not describe the overall performance of the model, but it does suggest that our predictors are on the right track.
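
For reference, adjusted R-squared penalizes plain R-squared for the number of predictors. The sketch below uses the textbook formula for a model with an intercept; the helper name `adjusted_r2` and the sample inputs are hypothetical, and statsmodels uses a different degrees-of-freedom correction for models fit without a constant, so its reported value can differ slightly.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Textbook adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical inputs: R-squared of 0.97 on 300 observations
for k in (1, 7, 50):
    print(k, "predictors ->", round(adjusted_r2(0.97, n_obs=300, n_predictors=k), 4))
```

The penalty grows with the number of predictors, which is why the adjusted value is the more useful one when comparing models with different feature sets.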

In [None]:
y_pred = results.predict(X_test)

In [None]:
_ = plt.scatter(y_test, y_pred)
_ = plt.plot([-3, 5], [-3, 5], color='red')  # y = x reference line

_ = plt.title('Model Prediction vs Actual')
_ = plt.xlabel('actual values')
_ = plt.ylabel('predicted values')
plt.show()

In [None]:
def mean_absolute_percentage_error(y_test, y_pred):
    y_test, y_pred = np.array(y_test), np.array(y_pred)
    return np.mean(np.abs((y_test - y_pred) / y_test)) * 100

In [None]:
mape = mean_absolute_percentage_error(y_test, y_pred)
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print("Results of sklearn.metrics:\n")
print("MAPE:", mape)
print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
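
One caveat when reading the MAPE above: it is computed on standardized values, many of which sit close to zero, and dividing by near-zero actuals inflates the percentage. The sketch below illustrates the effect using the five "close" values from the `head()` output earlier; the prediction errors are made up, and the names `mape`, `actual` and `predicted` are assumptions for illustration.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# "close" values from the head() output; the prediction errors are hypothetical
actual = np.array([10340.0, 10615.28, 10567.02, 10564.49, 10298.73])
predicted = actual + np.array([50.0, -40.0, 30.0, -20.0, 10.0])

# Standardize both series with the mean and std of the actual values
mu, sigma = actual.mean(), actual.std()
actual_z = (actual - mu) / sigma
predicted_z = (predicted - mu) / sigma

mape_raw = mape(actual, predicted)
mape_scaled = mape(actual_z, predicted_z)
print("MAPE in price units:     ", mape_raw)
print("MAPE after standardizing:", mape_scaled)
```

The predictions are identical in both cases; only the scale of the denominator changes. For that reason MAPE on scaled data is better read as a relative comparison between models than as an absolute error rate.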

In [None]:
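_ = plt.figure(figsize=(10, 6))
# Sort by date so the lines follow chronological order after the shuffled split
_ = plt.plot(y_test.sort_index())
_ = plt.plot(y_pred.sort_index())
_ = plt.legend(["actual", "pred"])
plt.show()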

#### Avoiding redundancy in linear regression models

Since the feature "close" was already found to be highly correlated with our target, "close_nextday" (heatmap from the EDA stage), we can remove "close" to see how much the model actually depends on it.

In [None]:
# Pull in correlation data gathered from EDA correlation heatmap

# Use a context manager so the file is closed after reading
with open('data/highest_corr_target.txt', 'r') as f:
    file_contents = f.read()
print('Highest correlated variable with target:\n', "\n", file_contents)
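
The EDA notebook that produced this file is not shown here. As a rough sketch of how such a ranking could be computed, the example below correlates every feature with the target using `DataFrame.corr()`. The synthetic random-walk data and the names `demo` and `corr_with_target` are assumptions, not the original EDA code.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices (hypothetical, not real BTC data)
rng = np.random.default_rng(0)
close = 10_000 + np.cumsum(rng.normal(0, 100, size=301))

demo = pd.DataFrame({
    "close": close[:-1],
    "volume": rng.normal(45_000, 8_000, size=300),
    "tweet_sentiment": rng.choice([-1.0, 0.0, 0.5], size=300),
    "close_nextday": close[1:],  # target: the next day's close
})

# Absolute Pearson correlation of each feature with the target, strongest first
corr_with_target = (
    demo.corr()["close_nextday"]
    .drop("close_nextday")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)

top_feature = corr_with_target.index[0]
print("Highest correlated variable with target:", top_feature)
```

For a price series, today's close is almost always the strongest linear predictor of tomorrow's close, which is what makes removing it an informative test.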

In [None]:
X2 = df_scaled.drop(labels=['close_nextday', 'close'], axis=1)
y2 = df_scaled['close_nextday']
#X2 = sm.add_constant(X2)
print(X2.head(5))

In [None]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,
                                                        y2,
                                                        test_size=0.25,
                                                        random_state=42)

In [None]:
lr2 = sm.OLS(y2_train, X2_train)

results2 = lr2.fit()
results2.summary()

This time the summary statistics give us an Adj. R-squared of 0.907. This drop from the previous model was expected, as we removed one of our predictors ("close"), which is highly correlated with the target.

In [None]:
y2_pred = results2.predict(X2_test)

In [None]:
_ = plt.scatter(y2_test, y2_pred)
_ = plt.plot([-3, 5], [-3, 5], color='red')  # y = x reference line

_ = plt.title('Model 2 Prediction vs Actual')
_ = plt.xlabel('actual values')
_ = plt.ylabel('predicted values')
plt.show()

In [None]:
def mean_absolute_percentage_error(y2_test, y2_pred):
    y2_test, y2_pred = np.array(y2_test), np.array(y2_pred)
    return np.mean(np.abs((y2_test - y2_pred) / y2_test)) * 100

In [None]:
mape2 = mean_absolute_percentage_error(y2_test, y2_pred)
mae2 = metrics.mean_absolute_error(y2_test, y2_pred)
mse2 = metrics.mean_squared_error(y2_test, y2_pred)
rmse2 = np.sqrt(mse2)  # use the second model's MSE, not the first model's

print("Results of sklearn.metrics:\n")
print("MAPE:", mape2)
print("MAE:", mae2)
print("MSE:", mse2)
print("RMSE:", rmse2)

In [None]:
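_ = plt.figure(figsize=(10, 6))
# Sort by date so the lines follow chronological order after the shuffled split
_ = plt.plot(y2_test.sort_index())
_ = plt.plot(y2_pred.sort_index())
_ = plt.legend(["actual", "pred"])
plt.show()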

In [None]:
df_scaled.to_csv('data/btc_df_scaled.csv')

## Initial Findings and Conclusion

During this stage (preprocessing and training), we started by standardizing the numeric features of our original dataframe using _StandardScaler()_. This transformer rescales each column independently so that every feature has a mean of 0 and a standard deviation of 1. After transforming the dataframe, we split it into training and testing subsets using _train_test_split()_. From there, we created a baseline for future models by fitting two simple multiple linear regression models using ordinary least squares (OLS).

One iteration of our linear regression model used all available features for X, aside from the target (y). The other used all features except the target and "close", with the target (y) remaining the same. We then calculated the _MAPE_, _MAE_, _MSE_, and _RMSE_ for each. By comparison, the first model performed better: its mean absolute percentage error was roughly 53%, meaning the errors average about half the magnitude of the actual (scaled) values. The second model gave a _MAPE_ of roughly 198%, meaning its errors average about twice the magnitude of the actual values. Because the values are standardized and many lie close to zero, these percentages are best read as a relative comparison between the two models rather than as absolute error rates.

The difference can also be examined by comparing _RMSE_ to _MAE_ for each model. The first model gave an _RMSE_ of approx. 0.13 and an _MAE_ of approx. 0.096. Since the two measurements are relatively close to each other, the errors are fairly uniform in size: the model makes many mistakes, but the mistakes are "small" rather than dominated by a few large ones. The same comparison applies to the second model, where the metrics point in the opposite direction.
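
To make the RMSE versus MAE reasoning concrete, the sketch below compares two made-up error vectors with the same MAE: one with uniformly small errors and one dominated by a single large miss. The helper names and the error values are hypothetical and are not taken from the models above.

```python
import numpy as np

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(np.square(errors)))

# Hypothetical residuals: both sets have an MAE of 0.1
uniform_errors = np.array([0.1, -0.1] * 5)       # ten equally sized errors
spiky_errors = np.array([0.02] * 9 + [0.82])     # nine tiny errors, one large

for name, err in [("uniform", uniform_errors), ("spiky", spiky_errors)]:
    ratio = rmse(err) / mae(err)
    print(f"{name}: MAE={mae(err):.3f} RMSE={rmse(err):.3f} RMSE/MAE={ratio:.2f}")
```

A ratio near 1 indicates errors of similar size, while a larger ratio signals a few big misses. For the first model, 0.13 / 0.096 is roughly 1.35, which sits much closer to the uniform case.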

That being said, we can see that the feature "close" is fairly important in the regression model: when it was removed, the performance of the model decreased. These brief findings further support the original "hypothesis" formed during the EDA stage about the high correlation between "close_nextday" and "close".

During the next stage (modeling), we will examine other model architectures, such as tree-based ensemble methods, i.e. random forest regressors (RFR). We will tune these models using hyperparameter search methods like _GridSearchCV_. Tuning will also aid the analysis of features (i.e. feature "impact" or "importance") as well as interpretability as a whole.
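
As a preview of that workflow, a minimal sketch of tuning a random forest regressor with _GridSearchCV_ could look like this. The synthetic data, the small parameter grid, and the choice of `neg_mean_absolute_error` as the scoring metric are all assumptions for illustration; the actual modeling notebook may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: the target depends almost entirely on the first feature
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.1 * rng.normal(size=120)

# Deliberately tiny grid so the search runs quickly
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (less negative) is better
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)

# Impurity-based importances of the refit best model; they sum to 1
importances = search.best_estimator_.feature_importances_
print("Feature importances:", importances.round(3))
```

The `feature_importances_` attribute of the best estimator is one simple way to approach the feature "importance" analysis mentioned above.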