In [None]:
from sklearn.metrics import r2_score
%run helper_functions.py
%run prophet_helper.py #this runs the TS models for features
%run regression_ts_model.py #nested TS script 
%run btc_info_df.py #helps load up new BTC data
%autosave 120
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams["figure.figsize"] = (15,10)
plt.rcParams["xtick.labelsize"] = 16
plt.rcParams["ytick.labelsize"] = 16
plt.rcParams["axes.labelsize"] = 20
plt.rcParams['legend.fontsize'] = 20
pd.set_option('display.max_colwidth', None)  # None shows full column contents (pandas >= 1.0)

# Notebook Overview

In this notebook, I will construct:
- A naive model of bitcoin price prediction

- A **nested** time series model.

#### What do I mean by a nested time series model?

I will illustrate with a simple example.

Let's say that I wish to predict the `mkt_price` on `2016-10-30`. I could fit a Linear Regression on all the features from `2016-10-26` to `2016-10-29`. However, in order to predict the value of `mkt_price` on `2016-10-30`, I need to have values for the features on `2016-10-30`. This presents a problem, as all my features are time series! That is, I cannot simply plug in a value for each feature because I don't know what its value will be on this future date!

One possible remedy for this is to simply use the values of all the features on `2016-10-29`. In fact, it is well known that the best predictor of a variable tomorrow is often its current state today. However, I wish to be more rigorous.

Instead of simply plugging in `t-1` values for the features at time `t`, I construct a time series model for _each_ feature in order to **predict** its value at time `t` based on the **entire** history of data that I have for the features!

These predicted values are then passed as inputs to our linear regression models!

Thus, if I have N features, I am creating N time series models in order to make a single prediction with Linear Regression for the `mkt_price` variable.
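The nesting described above can be sketched in a few lines. This is only an illustration under stated assumptions: the straight-line trend in `forecast_next` is a stand-in for the per-feature time series model (the notebook's real feature models live in helper scripts that are not shown here), and every function name and number below is hypothetical.

```python
def forecast_next(series):
    """Stand-in feature forecaster: fit a straight line against time, extend it one step."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    return y_mean + slope * (n - t_mean)

def solve(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, target):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, target)) for i in range(k)]
    return solve(xtx, xty)

def nested_predict(features, target):
    """features: dict of name -> history list; target: price history of the same length."""
    names = sorted(features)
    rows = list(zip(*(features[name] for name in names)))
    beta = fit_ols(rows, target)
    # One forecast per feature (N small models), then a single regression prediction.
    next_x = [1.0] + [forecast_next(features[name]) for name in names]
    return sum(b * v for b, v in zip(beta, next_x))

# Toy data (made up): the price is exactly 2 + 3*a + 0.5*b.
features = {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 8.0, 9.0, 4.0, 7.0]}
target = [2 + 3 * a + 0.5 * b for a, b in zip(features["a"], features["b"])]
prediction = nested_predict(features, target)
```

With two features, two small forecasting models run before the regression is evaluated once, which is the N-models-per-prediction cost mentioned above.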

### Naive Baseline Model

I will construct a naive baseline model that will most likely outperform any other model I build below.

The model will work as follows:

When predicting the price on Day 91, I will take the average daily price change between Day 0 and Day 90, that is, (price on Day 90 - price on Day 0) / 90. Let's call this average price change _alpha_.

I will then take the price on Day 90 and add _alpha_ to it. This will serve as the 'predicted' price for Day 91.

In [None]:
df = unpickle_object("FINAL_DATAFRAME_PROJ_5.pkl")
df.head()

In [None]:
def linear_extrapolation(df, window):
    """Predict each of the last `window` prices as the last training price plus the average daily change."""
    pred_lst = []
    true_lst = []

    all_rows = df.shape[0]

    for cnt in range(window):
        # Rolling training window; the row immediately after it is the one being predicted.
        training_df = df.iloc[cnt:all_rows-window+cnt, :]
        testing_df = df.iloc[all_rows-window+cnt, :]

        start = training_df.index[0].date()
        end = training_df.index[-1].date()
        predicting = testing_df.name.date()

        print("---- Running model from {} to {} and predicting on {} ----".format(start, end, predicting))

        # Look the target up by name rather than by position.
        true_val = testing_df['mkt_price']

        first_row_value = training_df.iloc[0]['mkt_price']
        last_row_value = training_df.iloc[-1]['mkt_price']

        # Average daily change over the training window (90 rows for the data used here).
        alpha = (last_row_value - first_row_value) / len(training_df)

        prediction = last_row_value + alpha

        pred_lst.append(prediction)
        true_lst.append(true_val)

    return pred_lst, true_lst

In [None]:
pred_lst, true_lst = linear_extrapolation(df, 30)

In [None]:
r2_score(true_lst, pred_lst)

### Naïve Model Caveats

We can see above that we can use this extremely basic model to obtain an $R^2$ of 0.86. In fact, this should be the baseline model score that we need to beat!
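To help read this score, here is a minimal sketch of the standard $R^2$ definition (one minus the ratio of squared prediction errors to the squared deviations of the true values from their mean). The helper name and the numbers are made up for illustration; they are not the notebook's data.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Identical errors (every prediction is 10 too high), different spread in the true values.
wide = [900.0, 950.0, 1000.0, 1050.0, 1100.0]
narrow = [990.0, 995.0, 1000.0, 1005.0, 1010.0]
r2_wide = r_squared(wide, [v + 10 for v in wide])        # about 0.98
r2_narrow = r_squared(narrow, [v + 10 for v in narrow])  # about -1.0
```

The same absolute error gives a very different score depending on how much the true prices vary over the test window, so a score from only 30 test days should be read with care.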

Let me mention some caveats to this result:

- I only have 4 months of Bitcoin data. It should be obvious to the reader that such a naive model is NOT the appropriate way to forecast bitcoin price in general. For if it were this simple, we would all be millionaires.


- Since I have 120 days' worth of data, I am choosing to subset my data into 90-day periods; as such, I will produce 30 predictions. The variability of bitcoin prices around these 30 days will significantly impact the $R^2$ score. Again, more data is needed.


- While bitcoin data itself is not hard to come by, Twitter data is! It is the Twitter data that is limiting a deeper analysis. I hope that this notebook serves as a starting point for further investigation into the relationship between tweets and bitcoin price fluctuations.


- Lastly, I made this notebook in Sept. 2017. The data for this project spans Oct 2016 - Feb 2017. Since that timeframe, bitcoin grew to unprecedented highs of \$4k/coin. Furthermore, media sound bites from CEOs such as James Dimon of JPMorgan have sent bitcoin prices tumbling by as much as \$1k/coin. For me, this is what truly lies at the crux of the difficulty of cryptocurrency forecasting. I searched at great length for a **free**, searchable news API; however, I could not find one. I think a great next step for this project would be to incorporate the sentiment of news headlines concerning bitcoin!


- Furthermore, within the aforementioned timeframe, the overall bitcoin trend was upward. That is, there was not that much volatility in the price; as such, it is expected that the Naïve Model would outperform the nested time series model. The next step would, again, be to collect more data and re-run all the models.
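The 90-day windows and 30 predictions mentioned in the caveats above can be laid out explicitly. This is a small sketch with a hypothetical helper name; it only enumerates row positions and assumes 120 rows in total.

```python
def rolling_splits(n_rows, n_predictions):
    """Yield (train_start, train_end_exclusive, test_index) for each one-step-ahead prediction."""
    train_len = n_rows - n_predictions
    for step in range(n_predictions):
        yield step, step + train_len, step + train_len

splits = list(rolling_splits(120, 30))
first, last = splits[0], splits[-1]
# first covers training rows 0..89 and predicts row 90;
# last covers training rows 29..118 and predicts row 119.
```

Each window slides forward by one day, so every prediction is made from the 90 rows immediately before it.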

### Nested Time Series Model

In [None]:
df = unpickle_object("FINAL_DATAFRAME_PROJ_5.pkl")
df.head()

In [None]:
df.corr()

In [None]:
plot_corr_matrix(df)

In [None]:
beta_values, pred, true = master(df, 30)

In [None]:
r2_score(true, pred)#blows our Prophet TS only model away!

#### Nested TS VS. FB Prophet TS

We see from the above that our model has an $R^2$ of 0.75! This greatly outperforms our Prophet-only baseline, that is, using Facebook Prophet alone to forecast the price of bitcoin! The RMSE is 1.40.
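The cells shown here only call `r2_score`, so for reference this is one way an RMSE could be computed from lists of true and predicted values. It is a generic sketch with made-up numbers, not necessarily how the figure quoted above was produced.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 3, 4 and 0 give sqrt((9 + 16 + 0) / 3).
example = rmse([1000.0, 1010.0, 1020.0], [1003.0, 1006.0, 1020.0])
```

In the notebook this would be called as `rmse(true, pred)` on the output of `master`.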

This is quite impressive given that we only have 3 months of training data and are testing on one month!

The output above also shows regression output from statsmodels!

The following features were significant in all 30 models:

- Gold Price

- Ethereum Price

- Positive Sentiment (Yay!)

- Average Transactions Per Block

It is important, yet again, to note that this data does NOT take into account the wild fluctuations in price that bitcoin later experienced. We would need more data to affirm the significance of the above variables.

In [None]:
plt.plot(pred)
plt.plot(true)
plt.legend(["Prediction", 'Actual'], loc='upper left')
plt.xlabel("Prediction #")
plt.ylabel("Price")
plt.title("Nested TS - Price Prediction");

In [None]:
fig, ax = plt.subplots()
ax.scatter(true, pred, edgecolors=(0, 0, 0))
ax.plot([min(true), max(true)], [min(true), max(true)], 'k--', lw=3)
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')

In [None]:
plotting_dict_1 = {"eth_price": [], "pos_sent": [], "neg_sent": [], "unique_addr": [], "gold_price": [], "tot_num_trans": [], "mempool_trans":[], "hash_rate": [], "avg_trans_per_block":[]}

for sub_list in beta_values:
    for tup in sub_list:
        plotting_dict_1[tup[0]].append(tup[1])

In [None]:
plot_key(plotting_dict_1, "pos_sent")  # here we see the effect of positive sentiment through time!
plt.title("Positive Sentiment Effect on BTC Price")
plt.ylabel("Beta Value")
plt.xlabel("Model #")
plt.tight_layout()

In [None]:
plot_key(plotting_dict_1, "gold_price")
plt.title("Gold Price Effect on BTC Price")
plt.ylabel("Beta Value")
plt.xlabel("Model #")
plt.tight_layout()

In [None]:
plot_key(plotting_dict_1, "avg_trans_per_block")
plt.title("Avg. Trans per Block Effect on BTC Price")
plt.ylabel("Beta Value")
plt.xlabel("Model #")
plt.tight_layout()

## Percent change model!

I will now run the same nested TS model as above; however, this time my 'target' variable will be the percent change in bitcoin price. In order to make this a log-log model, I will use the percentage change of all features as inputs into the TS model and thus the linear regression!

Since percent change will 'shift' our dataframe by one row, I omit the first row (which is all NaN's).

Thus, if we were to predict a percent change of $0.008010$ on `2017-10-28`, the **predicted price** would be the price on `2017-10-27` multiplied by $(1 + predicted\_percent\_change)$, i.e. by $1.008010$.
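A short sketch of this arithmetic, using plain lists and made-up prices. The helper names are hypothetical; the notebook itself uses pandas' `pct_change` for the forward step.

```python
def pct_change(prices):
    """Fractional change from each price to the next; the first price has no predecessor."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]

def price_from_change(previous_price, predicted_change):
    """Turn a predicted fractional change back into a price."""
    return previous_price * (1 + predicted_change)

prices = [1000.0, 1010.0, 999.9]  # made-up prices
changes = pct_change(prices)      # one fewer value than prices
rebuilt = price_from_change(prices[0], changes[0])  # recovers the second price
forecast = price_from_change(1000.0, 0.008010)      # about 1008.01
```

Because the first row has no previous value, the percent-change series is one row shorter than the price series, which is why the first row is dropped.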

In [None]:
df_pct = df.copy(deep=True)
df_pct = df_pct.pct_change()
df_pct.rename(columns={"mkt_price": "percent_change"}, inplace=True)
df_pct = df_pct.iloc[1:, :] #first row is all NaN's
df_pct.head()

In [None]:
beta_values_p, pred_p, true_p = master(df_pct, 30)

In [None]:
r2_score(true_p, pred_p) # this is expected due to the range of values on the y-axis!

In [None]:
#very good!
plt.plot(pred_p)
plt.plot(true_p)
plt.legend(["Prediction", 'Actual'], loc='upper left')
plt.xlabel("Prediction #")
plt.ylabel("Percent Change")
plt.title("Nested TS - % Change Prediction");

From the above, it seems that our model is not tuned well enough to anticipate the large dip shown above. This is likely due to a lack of training data. However, while our model might not be the best at predicting **percent change**, how does it fare when we turn the percent change into **prices**?

In [None]:
fig, ax = plt.subplots()
ax.scatter(true_p, pred_p, edgecolors=(0, 0, 0))
ax.plot([min(true_p), max(true_p)], [min(true_p), max(true_p)], 'k--', lw=3)  # diagonal on the percent-change scale
ax.set_xlabel('Actual')
ax.set_ylabel('Predicted');

In [None]:
df.set_index('date', inplace=True)
# Each predicted change for day t is applied to the actual price on day t-1.
prices_to_be_multiplied = df.loc[pd.date_range(start="2017-01-23", end="2017-02-21"), "mkt_price"]
forecast_price_lst = []
for index, price in enumerate(prices_to_be_multiplied):
    growth_factor = 1 + float(pred_p[index])  # e.g. a change of 0.008 becomes 1.008
    forecasted_price = growth_factor * price
    forecast_price_lst.append(forecasted_price)
# Actual prices on the days being predicted (shifted one day forward).
ground_truth_prices = df.loc[pd.date_range(start="2017-01-24", end="2017-02-22"), "mkt_price"]
ground_truth_prices = list(ground_truth_prices)
r2_score(ground_truth_prices, forecast_price_lst)

We have an $R^2$ of 0.87!

This surpasses the baseline model and the nested TS model!

The caveats of the baseline model also apply here; however, it seems that the additional variables have helped us **slightly** improve the $R^2$.

In [None]:
plt.plot(forecast_price_lst)
plt.plot(ground_truth_prices)
plt.legend(["Prediction", 'Actual'], loc='upper left')
plt.xlabel("Prediction #")
plt.ylabel("Price")
plt.title("Nested TS - Price from % Change Prediction");