# Time Series Model (Autoregressive) on Financial Data

#### Autoregressive(p) model:

An autoregressive (AR) model predicts future behavior based on past behavior. It is used for forecasting when values in a time series are correlated with the values that precede them. An AR(p) model is an autoregressive model in which the $p$ most recent lagged values of the series are used as predictor variables. A lag is an earlier time period whose value affects the current one.

<img src="https://otexts.com/fpp2/fpp_files/figure-html/arp-1.png" width="630" height="630"/>

For example, an AR(1) is a “first order autoregressive process”: the outcome variable at time $t$ is related only to its value one period earlier, at $t - 1$. An AR(p) model can be written as follows.

$$
X_t = c + \sum_{i = 1}^{p}\displaystyle \varphi _{i} X_{t - i} + \epsilon_t
$$

Here, $\displaystyle \varphi _{1}$, ..., $\displaystyle \varphi _{p}$ are the parameters of the model, $c$ is a constant, and $\epsilon_t$ is white noise. 
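
To make the AR(p) equation concrete, here is a minimal sketch that simulates it directly. The function name `simulate_ar`, its parameters, and the choice of Gaussian white noise with zero start-up values are assumptions made for this illustration; they are not part of the Litecoin analysis.

```python
import numpy as np

def simulate_ar(c, phi, n_steps, noise_std=1.0, seed=0):
    """Simulate X_t = c + sum_i phi[i-1] * X_{t-i} + eps_t with Gaussian white noise."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    x = np.zeros(n_steps + p)                      # first p entries are zero start-up values
    eps = rng.normal(0.0, noise_std, size=n_steps + p)
    for t in range(p, n_steps + p):
        # x[t - p:t][::-1] is (X_{t-1}, ..., X_{t-p}), matching phi_1, ..., phi_p
        x[t] = c + phi @ x[t - p:t][::-1] + eps[t]
    return x[p:]                                   # drop the start-up values

series = simulate_ar(c=1.0, phi=[0.8], n_steps=500)
print(series[:5])
print(series.mean())  # should be near c / (1 - phi_1) = 5 for this stationary AR(1)
```

With the noise switched off (`noise_std=0.0`), the recursion becomes deterministic, which is a convenient way to check the indexing by hand.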

#### Linear regression model:
A linear regression model is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. It can be written as ${\displaystyle {y} = X{{\beta }}+{{\varepsilon }}}$, where ${y}$ is a vector of observed values $\displaystyle y_{i}$, $X$ is a matrix whose rows are the vectors $\displaystyle {x} _{i}$ of explanatory variables, $\beta$ is a ${\displaystyle (p+1)}$-dimensional parameter vector for $p$ explanatory variables, with ${\displaystyle \beta _{0}}$ as the intercept term, and ${\displaystyle {{\varepsilon }}}$ is a vector of error terms ${\displaystyle \varepsilon _{i}}$.

In statistics, ordinary least squares (OLS) chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: it minimizes the sum of the squared differences between the observed values of the dependent variable in the given dataset and the values predicted by the linear function of the explanatory variables. Using OLS, $\hat{{\beta}}$ is obtained through the formula $\hat{{\beta}}=\left({X}^{\top}{X}\right)^{-1} {X}^{\top}{y}$. Accordingly, the fitted values are $\hat{y} = X\hat{\beta}$.
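
The OLS formula can be checked on a small synthetic problem. The sketch below assumes made-up data with known coefficients (`beta_true` is a hypothetical choice for illustration), applies the formula as written above, and confirms that the residuals are orthogonal to the columns of $X$, which is the defining property of the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# First column of ones carries the intercept beta_0; the other two are explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.0, 0.5])             # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)  # small noise around the linear model

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # the OLS formula
y_hat = X @ beta_hat                               # fitted values

print(beta_hat)             # close to beta_true
print(X.T @ (y - y_hat))    # residuals are orthogonal to every column of X, so close to 0
```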


# My Implementation on Litecoin Daily Closing Price

I extracted the Litecoin (LTC-USD) daily data from 1/1/2021 to 11/30/2021 from Yahoo Finance and then fitted an AR model to it.

In [183]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as web
import plotly.express as px

In [184]:
# Extract the Litecoin USD data from Yahoo Finance
data = web.DataReader('LTC-USD',
                      'yahoo',
                      start = '2021-01-01',
                      end = '2021-11-30')

Take a look at the data.

In [185]:
data.head()

Date,High,Low,Open,Close,Volume,Adj Close
2021-01-01,133.18576,123.328079,124.672768,126.230347,7326980728,126.230347
2021-01-02,140.372574,123.693619,126.272964,136.944885,10532067985,136.944885
2021-01-03,163.898636,135.739914,136.949402,160.190582,15385661271,160.190582
2021-01-04,173.027817,143.623962,160.271164,154.807327,13659785704,154.807327
2021-01-05,162.850189,147.40007,154.897552,158.594772,10192818976,158.594772


We are only interested in the closing price, so let's extract the closing price for the first 30 days of January.

In [186]:
train_df = pd.DataFrame(data['Close'].iloc[:30])
train_df.head()

Date,Close
2021-01-01,126.230347
2021-01-02,136.944885
2021-01-03,160.190582
2021-01-04,154.807327
2021-01-05,158.594772


Plot the time series of the closing price and see what it looks like.

In [187]:
px.line(train_df, title = 'Litecoin Daily Price for 2021 January')

Then, plot the price variables of the complete data from 01/01/2021 to 11/30/2021 to look for patterns.

In [188]:
px.line(data, y = ['High', 'Low', 'Close', 'Open'],
        title = 'High, Low, Closing and Opening Price for Litecoin')

Since we are interested in an AR(1) model, let's generate the 1-step lagged series.

In [189]:
train_df['lag_1'] = train_df['Close'].shift(1)
train_df.head()

Date,Close,lag_1
2021-01-01,126.230347,
2021-01-02,136.944885,126.230347
2021-01-03,160.190582,136.944885
2021-01-04,154.807327,160.190582
2021-01-05,158.594772,154.807327


Create columns for lag 2 through lag 7, so that lags 1 through 7 are all available.

In [190]:
for i in range(2, 8):
    train_df[f'lag_{i}'] = train_df['Close'].shift(i)
train_df.head(8)

Date,Close,lag_1,lag_2,lag_3,lag_4,lag_5,lag_6,lag_7
2021-01-01,126.230347,,,,,,,
2021-01-02,136.944885,126.230347,,,,,,
2021-01-03,160.190582,136.944885,126.230347,,,,,
2021-01-04,154.807327,160.190582,136.944885,126.230347,,,,
2021-01-05,158.594772,154.807327,160.190582,136.944885,126.230347,,,
2021-01-06,169.016922,158.594772,154.807327,160.190582,136.944885,126.230347,,
2021-01-07,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885,126.230347,
2021-01-08,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885,126.230347


Drop all rows that contain NaNs. These are the first seven days, which do not have a full set of lags.

In [191]:
train_df = train_df.dropna()
train_df.head()

Date,Close,lag_1,lag_2,lag_3,lag_4,lag_5,lag_6,lag_7
2021-01-08,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885,126.230347
2021-01-09,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885
2021-01-10,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582
2021-01-11,139.252228,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327
2021-01-12,132.63591,139.252228,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772
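
The shift-and-drop steps above can also be expressed with plain array slicing. This is a minimal NumPy sketch, not the code used in this notebook; the helper name `lag_matrix` is hypothetical, and the input is the first nine closing prices shown in the tables above.

```python
import numpy as np

def lag_matrix(values, n_lags):
    """Return (X, y): row k of X holds (y_{t-1}, ..., y_{t-n_lags}) for the target y_t."""
    values = np.asarray(values, dtype=float)
    y = values[n_lags:]                            # targets that have a full set of lags
    # Column i-1 is the series shifted by i steps, trimmed to the rows with no missing lag
    X = np.column_stack([values[n_lags - i:len(values) - i]
                         for i in range(1, n_lags + 1)])
    return X, y

closes = [126.230347, 136.944885, 160.190582, 154.807327, 158.594772,
          169.016922, 169.615952, 173.279877, 177.483932]
X, y = lag_matrix(closes, n_lags=7)
print(X.shape)   # (2, 7): two targets (01-08 and 01-09), seven lags each
print(X[0])      # lag_1 ... lag_7 for 2021-01-08
```

The first row reproduces the 2021-01-08 row of the table: lag_1 is the 01-07 close and lag_7 is the 01-01 close.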


Now, we need to obtain a weight for each lag. Equivalently, we regress the closing price on lag 1 through lag 7 and solve for the weights with the OLS formula.
$$\hat{{w}}=\left({X}^{\top}{X}\right)^{-1} {X}^{\top}{y}$$

In [192]:
cols = [f'lag_{i}' for i in range(1,8)]
X = train_df[cols].to_numpy()
y = train_df['Close'].to_numpy()
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y
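
Lag columns of a price series tend to be highly correlated, which can make $X^{\top}X$ poorly conditioned. As an alternative to the explicit inverse, the same weights can be obtained from a least-squares solver. The sketch below uses random data of the same shape as the lag matrix (an assumption for illustration, not the Litecoin values) to show that `np.linalg.lstsq` agrees with the formula on a well-conditioned problem.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(23, 7))    # same shape as the lag matrix: 23 rows, 7 lags
y = rng.normal(size=23)

w_inv = np.linalg.inv(X.T @ X) @ X.T @ y           # explicit normal-equation formula
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # least-squares solver, no explicit inverse

print(np.allclose(w_inv, w_lstsq))
```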

Get the predicted value $\hat{y} = X\hat{w}$.

In [193]:
train_df['predictions'] = X @ w_hat
train_df['predictions'].head()

Date
2021-01-08    159.247515
2021-01-09    175.428462
2021-01-10    172.934176
2021-01-11    164.080648
2021-01-12    138.500884
Name: predictions, dtype: float64
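
Each prediction above is a weighted sum of the seven previous closes, and the same rule gives a forecast for the next, unseen day. A minimal sketch follows; the helper `forecast_next` and the three example weights are hypothetical and chosen only so the arithmetic is easy to follow. They are not the fitted `w_hat`.

```python
import numpy as np

def forecast_next(history, weights):
    """One-step AR forecast: weights[0] multiplies the latest value, weights[1] the one before, ..."""
    weights = np.asarray(weights, dtype=float)
    p = len(weights)
    if len(history) < p:
        raise ValueError("history must contain at least len(weights) values")
    recent = np.asarray(history, dtype=float)[-p:][::-1]   # (y_t, y_{t-1}, ..., y_{t-p+1})
    return float(weights @ recent)

# Hypothetical weights for illustration only
example_weights = [0.5, 0.3, 0.2]
# 0.5 * 120 + 0.3 * 110 + 0.2 * 100 = 113
print(forecast_next([100.0, 110.0, 120.0], example_weights))
```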

Now, let's plot the real closing prices and the predicted closing prices.

In [196]:
px.line(train_df, y = ['Close', 'predictions'],
        title = 'Real Closing Values vs Predicted Values')

We see that the predicted values follow the trend of the real values, but lag slightly behind them.
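
One way to quantify this lag is to check whether each prediction sits closer to the previous day's close than to the close it is meant to predict. The sketch below does this for the five rows printed above (values copied from the tables); five rows are only an illustration, not evidence about the whole series.

```python
import numpy as np

# Values copied from the tables above, 2021-01-08 to 2021-01-12
close = np.array([173.279877, 177.483932, 171.114838, 139.252228, 132.63591])
lag_1 = np.array([169.615952, 173.279877, 177.483932, 171.114838, 139.252228])
pred = np.array([159.247515, 175.428462, 172.934176, 164.080648, 138.500884])

gap_to_prev = np.mean(np.abs(pred - lag_1))     # distance to yesterday's close
gap_to_actual = np.mean(np.abs(pred - close))   # distance to the close being predicted

print(gap_to_prev, gap_to_actual)
```

On these rows the mean gap to the previous close is roughly half the mean gap to the actual close, which matches the visual impression of a curve shifted one day to the right.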

#### Adding bias:

Now, let's create a column of ones, named ones, and append it to the original $X$ matrix so that the model can learn an intercept.

In [201]:
train_df['ones'] = np.ones(len(train_df['Close']))
train_df.head()

Date,Close,lag_1,lag_2,lag_3,lag_4,lag_5,lag_6,lag_7,predictions,ones
2021-01-08,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885,126.230347,159.247515,1.0
2021-01-09,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,136.944885,175.428462,1.0
2021-01-10,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327,160.190582,172.934176,1.0
2021-01-11,139.252228,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772,154.807327,164.080648,1.0
2021-01-12,132.63591,139.252228,171.114838,177.483932,173.279877,169.615952,169.016922,158.594772,138.500884,1.0


In [202]:
cols = [f'lag_{i}' for i in range(1, 8)]
cols.append('ones')
print(cols)

['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6', 'lag_7', 'ones']


Next, compute the weights with bias and the corresponding predictions. $\hat{w}_{bias}$ is calculated with the same OLS formula, applied to the original matrix extended with the column of ones.

In [205]:
X_bias = train_df[cols].to_numpy()
w_hat_bias = np.linalg.inv(X_bias.T @ X_bias) @ X_bias.T @ y
train_df['predictions_with_bias'] = X_bias @ w_hat_bias
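
Appending a column of ones lets the regression fit an intercept, and on the data used for fitting this can only lower the squared error or leave it unchanged. The sketch below illustrates this on synthetic data of the same shape as the lag matrix; the data, the intercept of 30, and the slopes of 0.1 are assumptions for illustration, and the weights are obtained with `np.linalg.lstsq` rather than the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(loc=150.0, scale=20.0, size=(23, 7))    # synthetic stand-in for the lag matrix
y = 30.0 + X @ np.full(7, 0.1) + rng.normal(scale=5.0, size=23)

X_bias = np.column_stack([X, np.ones(len(X))])         # append the column of ones

w, *_ = np.linalg.lstsq(X, y, rcond=None)
w_bias, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

sse = np.sum((y - X @ w) ** 2)                   # squared error without intercept
sse_bias = np.sum((y - X_bias @ w_bias) ** 2)    # squared error with intercept

print(sse, sse_bias)   # sse_bias cannot exceed sse, up to rounding
```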

Then, let's plot the real closing prices, predictions, and predictions with bias.

In [207]:
px.line(train_df, y = ['Close', 'predictions', 'predictions_with_bias'],
        title = 'Real Closing Values vs Predicted Values vs Predicted Values with Bias')

We can tell from the plot that the "predictions with bias", similar to the "predictions", follow the general trend of the real values but still lag slightly behind.

#### Introduce RMSE

In [150]:
y_hat1 = train_df['predictions'].to_numpy()
y_hat2 = train_df['predictions_with_bias'].to_numpy()

$$
RMSE = \sqrt{\frac{\sum_{i = 1}^{n}(y_i - \hat{y}_i)^2}{n}}
$$

In [208]:
def RMSE(labels, predictions):
    # Root mean squared error: average the squared residuals before taking the root
    residuals = labels - predictions
    return np.sqrt(np.mean(residuals ** 2))

In [209]:
print(f'RMSE without bias = {RMSE(y, y_hat1):.3f}')
print(f'RMSE with bias = {RMSE(y, y_hat2):.3f}')

RMSE without bias = 7.845
RMSE with bias = 7.289
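
As a quick check of the RMSE formula, here is a worked example with residuals small enough to verify by hand. The lowercase helper `rmse` is a standalone version written for this sketch.

```python
import numpy as np

def rmse(labels, predictions):
    """Root mean squared error: square root of the mean of the squared residuals."""
    residuals = np.asarray(labels, dtype=float) - np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Residuals are (0, 0, 3, -4): squares sum to 25, the mean is 6.25, the root is 2.5
print(rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 0.0, 8.0]))  # 2.5
```

Dividing by $n$ before taking the root keeps the value on the scale of a typical single-day error, so it stays comparable across samples of different lengths.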


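The RMSE with bias is slightly smaller than the RMSE without bias. This is expected on the training data: adding the column of ones can only lower the least-squares error on the data used for fitting, or leave it unchanged.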

Finally, let's plot the closing price again, this time for the full period. The price varies from 100 to 400.

In [211]:
px.line(data, y = 'Close')

Given this wide range, we might consider modelling the log of the closing price in further studies.

In [213]:
data['log_Close'] = np.log(data['Close'])
px.line(data, y = 'log_Close')

On the log scale, the variation is much smaller.
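
The reason the log helps is that it turns equal percentage moves into equal additive steps. A minimal sketch with three made-up prices, each double the previous one:

```python
import numpy as np

prices = np.array([100.0, 200.0, 400.0])   # made-up prices, each double the previous one
log_prices = np.log(prices)

print(np.diff(prices))       # [100. 200.]: the second doubling looks twice as large
print(np.diff(log_prices))   # both steps equal log(2), about 0.693
```

On the raw scale a move from 200 to 400 dominates a move from 100 to 200, although both are a 100% increase; on the log scale they have the same size.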

# References

Autoregressive model: definition & the AR process. Statistics How To. https://www.statisticshowto.com/

Autoregressive model. Wikipedia. https://en.wikipedia.org/wiki/Autoregressive_model

Linear regression. Wikipedia. https://en.wikipedia.org/wiki/Linear_regression