# Daily Price-Prediction of Bitcoin using ARIMA
***
Author: [Mourad Baptiste Karib](https://www.linkedin.com/in/datamouradkarib/)

In this notebook, I walk through a vanilla model for **daily price prediction of Bitcoin**.

I created a Python package containing the modules for the different parts of the data processing and ML design. You will find the code in the same folder.

This is a first draft, and I will progressively add new features. I drew inspiration from the work of Paul Lindquist and reused code from his [repository](https://github.com/paul-lindquist/cryptocurrency_time_series).
***

# Outline
These are the steps covered in this notebook:
1. Data acquisition
2. Data processing
3. Statistical tests
4. ARIMA prediction
5. Future work

In [3]:
import sys
import os

# Modules I wrote to run the different stages of the pipeline.
sys.path.append('..')
from MyTradingPlatform.DataPipeline.data_acquisition import get_historical_data, get_list_symbols, get_list_coins
from MyTradingPlatform.DataPipeline.feature_engineering import resample, add_cols
from MyTradingPlatform.DataPipeline.models import ARIMA_train, ARIMA_predict
from MyTradingPlatform.DataPipeline.utils import set_binance_key, get_binance_key, set_binance_secret, get_binance_secret

from binance import Client, ThreadedWebsocketManager, ThreadedDepthCacheManager


import pandas as pd
import numpy as np
import math
from datetime import datetime

import mlflow
mlflow.set_tracking_uri('http://localhost:5000')

import mlflow.statsmodels

# Data Acquisition
First, we call `get_list_symbols()`, which returns the list of all symbols available on Binance, to make sure we have the right symbol name:


In [4]:
'BTCUSDT' in get_list_symbols()

True

Then we can fetch the historical data with `get_historical_data`, choosing the sampling interval and the timestamp to start from. The function's parameters have the following default values:

- `interval='1m'`
- `symbol='BTCUSDT'`
- `first_ts=None` (all available data is fetched)
- `binance_api_key=None`
- `binance_api_secret=None`
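
As a rough illustration of what a helper like `get_historical_data` has to do after the download, the sketch below converts raw Binance kline rows into a DataFrame indexed by open time. The function name `klines_to_frame` and the field labels are hypothetical and not taken from the package; the `OpenTime`, `Open`, `Close` and `Volume` column names are assumed from the CSV used in this notebook. With python-binance, rows of this shape would typically come from `Client.get_historical_klines(symbol, interval, start_str)`.

```python
import pandas as pd

# Field order of a Binance kline row; the labels themselves are illustrative.
KLINE_FIELDS = [
    "OpenTime", "Open", "High", "Low", "Close", "Volume",
    "CloseTime", "QuoteVolume", "Trades",
    "TakerBuyBase", "TakerBuyQuote", "Ignore",
]

def klines_to_frame(rows):
    """Hypothetical helper: turn raw kline rows into a typed DataFrame."""
    df = pd.DataFrame(rows, columns=KLINE_FIELDS)
    # Open time arrives as milliseconds since the Unix epoch.
    df["OpenTime"] = pd.to_datetime(df["OpenTime"], unit="ms")
    # Prices and volumes arrive as strings.
    price_cols = ["Open", "High", "Low", "Close", "Volume"]
    for col in price_cols:
        df[col] = df[col].astype(float)
    return df.set_index("OpenTime")[price_cols]

# Two made-up 5-minute rows, only to show the shape of the input.
sample_rows = [
    [1520380800000, "10783.67", "10800.00", "10750.00", "10790.00", "12.5",
     1520381099999, "134875.0", 100, "6.0", "64740.0", "0"],
    [1520381100000, "10790.00", "10810.00", "10780.00", "10805.00", "8.0",
     1520381399999, "86440.0", 80, "4.0", "43220.0", "0"],
]
frame = klines_to_frame(sample_rows)
```

Saving the resulting frame to CSV once, as done in the next cell, avoids hitting the API on every run.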



**Remember to delete the API keys before sharing this notebook.**
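
One way to avoid pasting keys into the notebook at all is to read them from environment variables. The sketch below is a suggestion, not part of the package: the helper name `load_binance_credentials` and the variable names `BINANCE_API_KEY` and `BINANCE_API_SECRET` are assumptions.

```python
import os

def load_binance_credentials(env=os.environ):
    """Hypothetical helper: read API credentials from the environment."""
    key = env.get("BINANCE_API_KEY")
    secret = env.get("BINANCE_API_SECRET")
    if not key or not secret:
        raise RuntimeError("Set BINANCE_API_KEY and BINANCE_API_SECRET first.")
    return key, secret
```

The returned pair could then be passed as `binance_api_key` and `binance_api_secret`, so no secret ever appears in a saved cell.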

In [5]:
# One-off download (keys redacted); the result is cached as a CSV file.
# df = get_historical_data(binance_api_key='******',
#                          binance_api_secret='******',
#                          interval='5m')
# df.to_csv('data/btc_all_210422.csv')

# Load the cached data and index it by candle open time.
df = pd.read_csv('data/btc_all_210422.csv', parse_dates=['OpenTime'])
df = df.set_index('OpenTime')

# Data Processing

The first thing we need to do is **resample** the data to get one price per day.

In [6]:
daily = resample(df)
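
For readers without access to the package, here is a minimal sketch of how intraday bars can be resampled to daily bars with pandas. It is an assumption about what `resample` does, not its actual implementation: the name `resample_daily` is hypothetical, and the aggregation rules (first open, last close, summed volume) are a common convention rather than something confirmed by the source.

```python
import numpy as np
import pandas as pd

def resample_daily(df):
    """Hypothetical daily resampler; the package's `resample` may differ."""
    daily = df.resample("1D").agg(
        {"Open": "first", "Close": "last", "Volume": "sum"}
    )
    # Drop calendar days that contain no bars at all.
    return daily.dropna(subset=["Open", "Close"])

# Synthetic 5-minute bars covering two full days (288 bars per day).
idx = pd.date_range("2018-03-07", periods=576, freq="5min")
bars = pd.DataFrame(
    {"Open": np.arange(576.0), "Close": np.arange(576.0) + 0.5, "Volume": 1.0},
    index=idx,
)
daily_sketch = resample_daily(bars)
```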

Then we can **add the engineered features**. The full list is in *MyTradingPlatform.DataPipeline.feature_engineering*.

In [7]:
daily = add_cols(daily)
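
To give an idea of the kind of columns involved, the sketch below builds a few features with pandas: log prices, log returns, lagged closes and rolling volatility. The column names are modeled on those shown in the `daily.head()` output, but the function `add_basic_features`, the window lengths and the exact definitions are assumptions; the package's `add_cols` may compute them differently.

```python
import numpy as np
import pandas as pd

def add_basic_features(daily, lags=(1, 2), vol_windows=(7, 30)):
    """Hypothetical feature builder; windows and definitions are assumptions."""
    out = daily.copy()
    out["log_close"] = np.log(out["Close"])
    # Daily log return: difference of consecutive log closes.
    out["return"] = out["log_close"].diff()
    # Lagged log closes, so each row only sees past information.
    for k in lags:
        out[f"close_{k}_prior"] = out["log_close"].shift(k)
    # Rolling volatility: standard deviation of returns over the window.
    for w in vol_windows:
        out[f"volatility_{w}"] = out["return"].rolling(w).std()
    return out

prices = pd.DataFrame(
    {"Close": [100.0, 110.0, 121.0, 133.1]},
    index=pd.date_range("2018-03-07", periods=4, freq="D"),
)
features = add_basic_features(prices, lags=(1,), vol_windows=(2,))
```

The first rows of every lagged or rolling column are `NaN` until enough history is available, so they usually need to be dropped before modelling.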

# Statistical tests

List of tests we can run to characterize time-series data:

In [8]:
daily.head()

| OpenTime | Open | Close | Volume | log_open | log_close | return | close_1_prior | close_2_prior | close_3_prior | close_4_prior | ... | momentum_50 | momentum_200 | volatility_7 | volatility_30 | volatility_50 | volatility_200 | volume_7 | volume_14 | volume_30 | volume_50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-03-07 | 10783.67 | 9925.22 | 28955.654245 | 9.285788 | 9.202834 | -0.072939 | 9.285788 | 9.358728 | 9.343058 | 9.342247 | ... | -0.002963 | 0.004923 | 0.02557 | 0.02557 | 0.072872 | 0.06218 | 23219.705701 | 35351.060003 | 41052.305191 | 38394.790333 |
| 2018-03-08 | 9925.22 | 9233.24 | 50110.30714 | 9.202834 | 9.130565 | -0.082954 | 9.202834 | 9.285788 | 9.358728 | 9.343058 | ... | -0.004841 | 0.004989 | 0.040415 | 0.040415 | 0.073451 | 0.062089 | 26116.764431 | 33782.835602 | 40680.230817 | 38207.348955 |
| 2018-03-09 | 9233.23 | 9200.99 | 41772.807568 | 9.130564 | 9.127066 | -0.072269 | 9.130565 | 9.202834 | 9.285788 | 9.358728 | ... | -0.001455 | 0.004452 | 0.050812 | 0.050812 | 0.06526 | 0.062384 | 28384.117778 | 31926.500582 | 38722.940604 | 37572.93529 |
| 2018-03-10 | 9213.92 | 8829.99 | 62545.037522 | 9.128471 | 9.085909 | -0.003499 | 9.127066 | 9.130565 | 9.202834 | 9.285788 | ... | -0.002856 | 0.003999 | 0.048775 | 0.048775 | 0.066024 | 0.062611 | 33885.955187 | 32337.831776 | 38828.606836 | 37824.817052 |
| 2018-03-11 | 8825.0 | 9521.05 | 38671.503733 | 9.085344 | 9.16126 | -0.041157 | 9.085909 | 9.127066 | 9.130565 | 9.202834 | ... | -0.003506 | 0.004116 | 0.046938 | 0.046938 | 0.065864 | 0.062574 | 36316.029309 | 32132.364947 | 39923.463125 | 37891.679692 |


# ARIMA prediction

### Pick the best model
I tested different parameter combinations to determine which were optimal for our data. You can find those tests in the other notebook.

In [None]:
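# Load the logged runs; optionally pass filter_string="metrics.rmse < 1"
# to keep only runs with an RMSE below 1.
dfr = mlflow.search_runs(experiment_ids=[13])
# Show the five runs with the lowest RMSE.
dfr.sort_values('metrics.rmse')[
    ['metrics.aics', 'metrics.rmse', 'params.q', 'params.p', 'params.d', 'params.train_size']
].head()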

### Training

In [None]:
fitmodel = ARIMA_train(daily['Close'], p=3, d=0, q=2, train_size=26)

### Predicting

In [None]:
ARIMA_predict(daily['Close'], p=3, d=0, q=2)

# Error analysis & Future work

- The model fails to predict price changes.

- Other features might help predict drastic changes.

- The effect of the training size deserves further investigation.
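
One way to quantify the first point is to compare the model against a naive "tomorrow equals today" forecast and to measure how often the predicted direction of the move is right. The sketch below is a suggestion for such an analysis, not something done in this notebook; both function names are hypothetical and the numbers are made up for illustration.

```python
import numpy as np

def persistence_rmse(actual):
    """RMSE of the naive forecast 'tomorrow equals today'."""
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((actual[1:] - actual[:-1]) ** 2)))

def directional_accuracy(actual, predicted):
    """Share of days where the predicted move has the same sign as the real one.

    predicted[t] is the forecast for actual[t]; both moves are measured
    from the previous actual value.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    real_move = np.sign(actual[1:] - actual[:-1])
    pred_move = np.sign(predicted[1:] - actual[:-1])
    return float(np.mean(real_move == pred_move))

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [100.0, 101.0, 103.0, 104.0]
baseline = persistence_rmse(actual)                  # sqrt(7), about 2.65
hit_rate = directional_accuracy(actual, predicted)   # 2 of 3 moves correct
```

A model whose RMSE is no better than the persistence baseline, or whose hit rate is close to 50%, is effectively echoing the previous day's price.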

# Conclusion