# **VAR**


In this notebook, we implement a VAR (vector autoregression) model to predict daily Bitcoin closing prices from multivariate inputs.


This multivariate model resembles the ARIMA model in that it computes the next values of a time series from previous data points. The main difference is that it models as many time series as we want and predicts the future values of all of them at the same time. Because every series is forecast, longer-term predictions can be built on the earlier predictions of all series. In other words, the equation predicting the next value of one series is a function of the previous values of all series.
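As a minimal sketch of that idea, the snippet below runs a two-variable VAR with a single lag forward by hand using NumPy. The intercept `c`, the coefficient matrix `A` and the starting values are made-up illustrative numbers, not estimates from this notebook. The point is only that each step uses the previous values of all series, and that each forecast is fed back in to reach further ahead.

```python
import numpy as np

# Hypothetical VAR(1): y_t = c + A @ y_{t-1} (the noise term is dropped when forecasting)
c = np.array([0.1, 0.0])
A = np.array([[0.5, 0.2],
              [0.1, 0.3]])

def forecast_var1(last_obs, steps):
    """Iterate the VAR(1) equation, feeding each forecast back in as input."""
    path = []
    y = np.asarray(last_obs, dtype=float)
    for _ in range(steps):
        y = c + A @ y          # every series depends on the previous values of all series
        path.append(y)
    return np.array(path)

path = forecast_var1([1.0, 2.0], steps=3)
print(path)
```

The off-diagonal entries of `A` are what distinguish this from fitting one univariate model per series: setting them to zero would decouple the two equations.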


This second model is the more complex of the two: it includes inputs that we thought could be relevant for forecasting Bitcoin prices.

The inputs included are:
- The price of gold. This is a stable measure that can inform us about global economic health. Here, we are asking whether the Bitcoin price is affected by the global economy.
- The S&P 500, a stock market index measuring the stock performance of 500 large companies in the United States. We are using it as a general indicator of the health of the US market. Here, we are asking whether the Bitcoin price is affected by the US economy.
- The price of Ethereum (ETH), another cryptocurrency, which we are using as a comparison measure: we are checking whether cryptocurrencies influence each other.
- The exchange rate between the US dollar and the Chinese yuan, to check whether the health of these currencies and the geopolitical implications affect Bitcoin.

We initially wanted to include Twitter inputs to add qualitative data alongside the quantitative series, and to capture public opinion as well as expert opinion. However, we encountered several issues when retrieving the data, notably that Twitter did not allow us to scrape more than 100 data points, so the data covered time periods that were far too short.


# **Step 1: Data preparation**

Loading, cleaning and plotting the datasets

In [0]:
# Importing pandas for data manipulation, matplotlib for graphs, and seaborn for data visualization and correlation analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



**Bitcoin**

In [0]:
#Importing the bitcoin price dataset
bitcoin = pd.read_csv("/content/Bitcoin Historical Data - Investing.com (1).csv", sep = ",")

#Keeping only the Date and Price columns
bitcoin = bitcoin[["Date", "Price"]]

#Making the df easier to manipulate by changing the column names
bitcoin.rename(columns = {"Price" : "Bitcoin"}, inplace = True)

#putting the data in chronological order
bitcoin = bitcoin[::-1]
bitcoin.reset_index(drop = True, inplace = True)

#Converting to datetime
bitcoin["Date"] = pd.to_datetime(bitcoin['Date'])

#Setting the type to float64 to allow for plotting, analysis and comparison
bitcoin['Bitcoin'] = bitcoin['Bitcoin'].str.replace(',', '')
bitcoin['Bitcoin']=bitcoin["Bitcoin"].astype('float64')

bitcoin



In [0]:
#Plotting the data
bitcoin.plot(y = ['Bitcoin'], kind = 'line', figsize = (12,12) )

**Gold**

In [0]:
#Importing the gold dataset
gold = pd.read_csv("/content/gold-Current.csv", delimiter = ",")

#Keeping only the Date ("Market") and Price ("Perth Mint Spot.12") columns
gold = gold[["Market","Perth Mint Spot.12"]]
gold = gold.dropna()

#Getting rid of the descriptive rows
gold = gold.drop(gold.index[0:16])

#Converting to datetime
gold['Market'] = pd.to_datetime(gold['Market'])

#Making the df easier to manipulate by changing the column names
gold.rename(columns = {"Market" : "Date"}, inplace = True)
gold.rename(columns = {"Perth Mint Spot.12" : "Gold"}, inplace = True)

#Setting the type to float64 to allow for plotting, analysis and comparison
gold["Gold"]=gold["Gold"].astype('float64')
gold

In [0]:
#Plotting the data
gold.plot(y = ['Gold'], kind = 'line', figsize = (12,12) )

**Ethereum**

In [0]:
#Importing the ETH dataset
ETH = pd.read_csv("/content/ETH_USD Bitfinex Historical Data.csv", delimiter = ",")

#Keeping only the Date and Price columns
ETH = ETH [["Date", "Price"]]

#Making the df easier to manipulate by changing the column names
ETH.rename(columns = {"Price" : "ETH"}, inplace = True)

#putting the data in chronological order
ETH = ETH [::-1]
ETH.reset_index(drop = True, inplace = True)

#Converting to datetime
ETH["Date"] = pd.to_datetime(ETH['Date'])

ETH

In [0]:
#Plotting the data
ETH.plot(y = ['ETH'], kind = 'line', figsize = (12,12) )

**SP**

In [0]:
#Importing the SP dataset
SP = pd.read_csv("/content/^GSPC (1).csv")

#Keeping only the Date and Close columns
SP = SP[["Date", "Close"]]

#Converting to datetime
SP["Date"] = pd.to_datetime(SP["Date"])

#Setting the type to float64 to allow for plotting, analysis and comparison
SP["Close"]=SP["Close"].astype('float64')

#Making the df easier to manipulate by changing the column names
SP.rename(columns = {"Close" : "SP"}, inplace = True)

SP

In [0]:
#Plotting the data
SP.plot(y = ['SP'], kind = 'line', figsize = (12,12) )

**Dollar - Yuan exchange rate**

In [0]:
#Importing the Dollar to Yuan dataset
Dollar_to_Yuan = pd.read_csv("/content/USD_CNY Historical Data.csv")

#Keeping only the Date and Price columns
Dollar_to_Yuan = Dollar_to_Yuan[["Date","Price"]]

#Converting to datetime
Dollar_to_Yuan["Date"]= pd.to_datetime(Dollar_to_Yuan["Date"])

#Setting the type to float64 to allow for plotting, analysis and comparison
Dollar_to_Yuan["Price"]=Dollar_to_Yuan["Price"].astype('float64')

#Making the df easier to manipulate by changing the column names
Dollar_to_Yuan.rename(columns = {"Price" : "DtoY"}, inplace = True)

#putting the data in chronological order
Dollar_to_Yuan = Dollar_to_Yuan[::-1]

Dollar_to_Yuan

#Weekends are missing (markets closed)

In [0]:
#Plotting the data
Dollar_to_Yuan.plot(y = ['DtoY'], kind = 'line', figsize = (12,12) )
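The five loading cells above repeat the same steps: select two columns, rename them, parse the dates, convert prices to floats and put the rows in chronological order. As a rough sketch, those steps could be gathered into one helper. The function name `load_price_series`, its parameters and the two-row sample are hypothetical and only illustrate the pattern; the real files may need extra handling (for example the descriptive rows in the gold file).

```python
import io
import pandas as pd

def load_price_series(source, date_col, price_col, name, newest_first=False):
    """Load one CSV and return a tidy two-column frame: Date and `name`."""
    raw = pd.read_csv(source, usecols=[date_col, price_col])
    out = raw.rename(columns={date_col: "Date", price_col: name})
    out["Date"] = pd.to_datetime(out["Date"])
    # Strip thousands separators such as "7,344.9" before converting to float
    out[name] = out[name].astype(str).str.replace(",", "", regex=False).astype("float64")
    if newest_first:
        out = out[::-1]        # reverse so that the oldest row comes first
    return out.reset_index(drop=True)

# Made-up sample in a newest-first layout, for illustration only
sample = io.StringIO('Date,Price,Vol\n"Jan 03, 2020","7,344.9",1\n"Jan 02, 2020","6,967.0",2\n')
demo = load_price_series(sample, "Date", "Price", "Bitcoin", newest_first=True)
print(demo)
```

Applying the separator stripping to every series is harmless when the column is already numeric, and it avoids a series silently staying as text.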

# **Step 2: Creating and exploring the final data frame**

Merging the datasets and creating the final df

Exploring the correlations between the variables

**Merging the datasets**

In [0]:
# Combining the datasets by merging them two at a time, on the common "Date" feature
merge1 = pd.merge(gold,ETH, on = "Date")
merge2 = pd.merge(merge1,SP, on = "Date")
merge3 = pd.merge(merge2, Dollar_to_Yuan, on = "Date")

#Creating the final dataset
df = pd.merge(merge3, bitcoin, on = "Date")

#Setting date as the index
df.set_index('Date', inplace=True)
df = df.sort_index()
df

#Weekends are missing: pd.merge performs an inner join by default, so only dates present in every dataset are kept
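To make the effect of the merge on weekends concrete, here is a small sketch with toy data. The two frames and their values are invented for illustration; only the `Date`, `Bitcoin` and `SP` column names come from the notebook. It chains inner joins over a list of frames, which is equivalent to merging them two at a time.

```python
from functools import reduce
import pandas as pd

# Toy frames: a crypto series quoted every day and an index quoted on weekdays only
days = pd.date_range("2020-01-03", periods=5, freq="D")   # Friday to Tuesday
crypto = pd.DataFrame({"Date": days, "Bitcoin": [1.0, 2.0, 3.0, 4.0, 5.0]})
weekdays = days[days.dayofweek < 5]                        # drops Saturday and Sunday
stocks = pd.DataFrame({"Date": weekdays, "SP": [10.0, 11.0, 12.0]})

# Inner join on Date, applied pairwise across the list of frames
merged = reduce(lambda left, right: pd.merge(left, right, on="Date", how="inner"),
                [crypto, stocks])
print(merged)
```

The weekend rows of the crypto series disappear because the stock index has no matching dates, which is why the final data frame contains business days only.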

**Exploring the final df**

In [0]:
df.info()

In [0]:
df.describe()

**Displaying the correlations**

In [0]:
#Computing the correlations between the different values
df.corr()

In [0]:
f,ax = plt.subplots(figsize = (10,10))
sns.heatmap(df.corr(), annot = True, linewidths= 0.5, fmt = ".1f", ax = ax)
plt.show()
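One caveat when reading correlations computed on price levels: series that trend or wander can look strongly correlated even when they are unrelated. The sketch below illustrates this with two simulated, independent random walks (synthetic data, not the notebook's series) and compares the correlation of the levels with the correlation of the day-to-day changes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Two independent random walks: unrelated by construction
walks = pd.DataFrame(rng.normal(size=(2000, 2)), columns=["a", "b"]).cumsum()

level_corr = walks.corr()                    # can be large in magnitude purely by chance
change_corr = walks.diff().dropna().corr()   # close to zero for independent walks
print(level_corr.loc["a", "b"], change_corr.loc["a", "b"])
```

Under that reading, running `df.diff().dropna().corr()` alongside `df.corr()` may give a more cautious picture of how the inputs relate to Bitcoin.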

**Plotting the entire data**

In [0]:
df.plot(y = ['Gold','Bitcoin', 'SP', 'DtoY', 'ETH'], kind = 'line', figsize = (14,14) )
plt.xlabel('Date')
plt.ylabel('Price')

In [0]:
df[["Gold", "ETH","SP","DtoY","Bitcoin"]].plot(subplots=True, figsize=(12, 12)); plt.legend(loc = 2)

# **Step 3 : Creating the VAR Model**

Creating the data frames

Testing and adjusting for stationarity of the variables

Choosing the appropriate parameter


In [0]:
# Importing statsmodels for time series modelling tools (tests and forecasting methods) ; numpy for linear algebra
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic, bic
import numpy as np

**Creating the dataframes**

In [0]:
# Creating a training dataframe
df_train = df[0:350]
df_train

In [0]:
# Creating a testing dataframe with the remaining data (starting at row 350, the first row not in the training set)
df_test = df[350:]
df_test
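For time series the split has to be chronological, with the test set starting exactly where the training set ends. A minimal sketch on a toy frame follows; the helper name `chronological_split` and the toy data are illustrative only. Using the same cut point for both slices guarantees that no row is dropped or shared.

```python
import pandas as pd

def chronological_split(frame, n_train):
    """Split a time-ordered frame into consecutive train and test parts."""
    train = frame.iloc[:n_train]     # rows 0 .. n_train - 1
    test = frame.iloc[n_train:]      # rows n_train .. end
    return train, test

toy = pd.DataFrame({"x": range(10)}, index=pd.date_range("2020-01-01", periods=10))
train, test = chronological_split(toy, 7)
print(len(train), len(test))
```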

**Testing for stationarity and differencing the data**

In [0]:
#Augmented Dickey-Fuller (ADF) test on each component of the final data: this checks whether the series are non-stationary and need to be differenced
adfuller(bitcoin['Bitcoin'])
#Here, the ADF test statistic is higher than the critical values --> we cannot reject the unit-root hypothesis, so the data is treated as non-stationary

In [0]:
adfuller(gold['Gold'])
#Here, the ADF test statistic is higher than the critical values --> the data is treated as non-stationary

In [0]:
adfuller(ETH['ETH'])
#Here, the ADF test statistic is higher than the critical values --> the data is treated as non-stationary

In [0]:
adfuller(SP['SP'])
#Here, the ADF test statistic is higher than the critical values --> the data is treated as non-stationary

In [0]:
adfuller(Dollar_to_Yuan['DtoY'])
#Here, the ADF test statistic is higher than the critical values --> the data is treated as non-stationary
#We have thus concluded that all series are non-stationary --> we should difference them
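The comparison described in the comments above can be written out explicitly. With its default arguments, `adfuller` returns a tuple whose first element is the test statistic and whose fifth element is a dictionary of critical values keyed by `"1%"`, `"5%"` and `"10%"`. The helper below is a small sketch; its name is hypothetical, and the two example tuples are hand-written illustrative numbers, not output from this notebook.

```python
def adf_rejects_unit_root(adf_result, level="5%"):
    """Return True when the ADF statistic is below the critical value at `level`.

    `adf_result` is the tuple returned by statsmodels' adfuller with default
    arguments: (statistic, p-value, used lags, n obs, critical values, icbest).
    """
    statistic, p_value, _, _, critical_values, _ = adf_result
    return statistic < critical_values[level]

# Hand-written tuples with illustrative numbers (not output from this notebook)
levels_like = (-1.2, 0.67, 3, 400, {"1%": -3.45, "5%": -2.87, "10%": -2.57}, 0.0)
diffs_like = (-9.8, 0.0, 2, 400, {"1%": -3.45, "5%": -2.87, "10%": -2.57}, 0.0)
print(adf_rejects_unit_root(levels_like), adf_rejects_unit_root(diffs_like))
```

A statistic above the critical value means the unit-root hypothesis is not rejected (the series is treated as non-stationary); a statistic below it is taken as evidence of stationarity.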

In [0]:
# Differencing each of the components (the VAR model requires every component series to be stationary)
bitcoin_differenciated = bitcoin.diff().dropna()
adfuller(bitcoin_differenciated["Bitcoin"])

#Here, the ADF test statistic is lower than the critical values --> the series is now stationary

In [0]:
gold_differenciated = gold.diff().dropna()
adfuller(gold_differenciated["Gold"])

#Here, the ADF test statistic is lower than the critical values --> the series is now stationary

In [0]:
ETH_differenciated = ETH.diff().dropna()
adfuller(ETH_differenciated["ETH"])

#Here, the ADF test statistic is lower than the critical values --> the series is now stationary

In [0]:
SP_differenciated = SP.diff().dropna()
adfuller(SP_differenciated["SP"])

#Here, the ADF test statistic is lower than the critical values --> the series is now stationary

In [0]:
DtoY_differenciated = Dollar_to_Yuan.diff().dropna()
adfuller(DtoY_differenciated["DtoY"])

#Here, the ADF test statistic is lower than the critical values --> the series is now stationary

In [0]:
#Now that every component series is stationary after one differencing, we difference the training data once
df_differenced = df_train.diff().dropna()

#This is now the complete differenced training series

**Choosing the parameter for the model**

In [0]:
#Creating the model and comparing lag orders with the Akaike information criterion (AIC): lower values indicate a better trade-off between fit and model complexity
model = VAR(df_differenced)
for i in range(1, 10):
    result = model.fit(i)
    print('Lag Order =', i)
    print('AIC : ', result.aic)

# Lag orders 2 and 4 seem to yield the most interesting results
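To show what the AIC comparison is doing, here is a simplified NumPy sketch that fits a VAR(p) with an intercept by ordinary least squares and computes an AIC as the log-determinant of the residual covariance plus a penalty on the number of coefficients. The function name `var_aic`, the simulated data and its coefficients are assumptions made for illustration. The values are not guaranteed to match `result.aic` from statsmodels, which remains the reference in this notebook.

```python
import numpy as np

def var_aic(y, p):
    """AIC of a VAR(p) with intercept, fitted by ordinary least squares."""
    T, k = y.shape
    # Each regression row is [1, y[t-1], ..., y[t-p]] for t = p .. T-1
    lagged = [y[p - i:T - i] for i in range(1, p + 1)]
    X = np.hstack([np.ones((T - p, 1))] + lagged)
    Y = y[p:]
    coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
    resid = Y - X @ coefs
    n = T - p
    sigma = resid.T @ resid / n          # residual covariance (maximum-likelihood form)
    n_params = k * (k * p + 1)           # lag coefficients plus intercepts
    return np.log(np.linalg.det(sigma)) + 2.0 * n_params / n

# Simulated VAR(1) data with made-up coefficients, for illustration only
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1], [0.0, 0.3]])
y = np.zeros((500, 2))
for t in range(1, 500):
    y[t] = A @ y[t - 1] + rng.normal(size=2)

aic_by_lag = {p: var_aic(y, p) for p in range(1, 6)}
best_lag = min(aic_by_lag, key=aic_by_lag.get)   # lag order with the smallest AIC
print(aic_by_lag, best_lag)
```

Each extra lag adds `k * k` coefficients, so with five series the penalty grows quickly; this is the trade-off the AIC is weighing when lag orders are compared.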

# **Step 4 : Fitting and forecasting**

Fitting the model using the parameters

Predicting the test dataset

**Fitting the model**

In [0]:
#Fitting the model with a lag order of 30 (note that this is well above the lag orders 2 and 4 singled out by the AIC comparison)
model_fitted = model.fit(30)
model_fitted.summary()

**Forecasting**

In [0]:
# Creating the forecast input: the most recent differenced observations, which seed the forecast (at least as many rows as the lag order are needed; here we pass the last 50)
forecast_input = df_differenced.values[-50:]
forecast_input

In [0]:
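# Computing the forecast for the next 50 data points
nobs = 50
fc = model_fitted.forecast(y=forecast_input, steps=nobs)
# The forecast starts right after the training period, so it is indexed with the dates that follow df_train
# (this assumes df has at least nobs rows after the training set)
start = len(df_train)
df_forecast = pd.DataFrame(fc, index=df.index[start:start + nobs], columns=df.columns + '_1d')
df_forecast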

In [0]:
# Transforming the forecasts back to price levels (inverting the differencing): last training value plus the cumulative sum of the forecast changes
def invert_transformation(df_train, df_forecast):
    df_fc = df_forecast.copy()
    for col in df_train.columns:        
        df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_1d'].cumsum()
    return df_fc

df_results = invert_transformation(df_train, df_forecast)        
df_results.loc[:, ['Gold_forecast','ETH_forecast','SP_forecast','DtoY_forecast','Bitcoin_forecast']]
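The inversion relies on the fact that a cumulative sum undoes a first difference once a starting level is known. A tiny worked sketch with invented numbers:

```python
import numpy as np
import pandas as pd

levels = pd.Series([100.0, 103.0, 101.0, 106.0, 110.0])
diffs = levels.diff().dropna()                    # changes: 3, -2, 5, 4

# Rebuild the last three levels from the last known level plus cumulative changes
last_known = levels.iloc[1]                       # pretend the training data ends here
rebuilt = last_known + diffs.iloc[1:].cumsum()    # 103 + (-2, 3, 7)
print(rebuilt.tolist())
```

In `invert_transformation`, `df_train[col].iloc[-1]` plays the role of `last_known` and the `_1d` forecast columns play the role of the differences, so any error in an early forecast step carries through to all later levels.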

In [0]:
# Creating a final dataframe with the actual values and the forecasts
final = pd.merge(df_results, df, on = 'Date')
final

In [0]:
#Plotting the final results
final.plot(y = ['Bitcoin_forecast','Bitcoin'], kind = 'line', figsize = (14,14) )
plt.xlabel('Date')
plt.ylabel('Price')
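The plot gives a visual comparison; a numeric error measure such as the root mean squared error (RMSE) would complement it. The sketch below uses plain NumPy and made-up numbers. The function name and the naive "tomorrow equals today" baseline are illustrative additions, not part of the original analysis.

```python
import numpy as np

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared difference between two sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up numbers for illustration only
actual = [100.0, 102.0, 101.0]
predicted = [101.0, 100.0, 101.0]
score = root_mean_squared_error(actual, predicted)

# Naive baseline: predict each value with the previous actual value
naive = root_mean_squared_error(actual[1:], actual[:-1])
print(score, naive)
```

On the notebook's data, the same function could be applied to `final['Bitcoin']` and `final['Bitcoin_forecast']`; comparing the result with a naive baseline helps judge whether the VAR adds predictive value.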