# Heterogeneous Autoregressive model

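The Heterogeneous Autoregressive (HAR) model assumes that volatility (or another financial variable) is driven by components operating at different time scales: short-term (daily), medium-term (weekly), and long-term (monthly). In log form, the one-day-ahead regression is: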

$$
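lnRV_{t+1} = \beta_0 + \beta_1 lnRV_t + \beta_2 lnRV^w_t + \beta_3 lnRV^m_t + \epsilon_{t+1}
$$
where:
- $lnRV_{t+1}$ is the log realised volatility of returns on day $t+1$, i.e. the one-day-ahead forecast target.
- $\beta_0, \beta_1, \beta_2, \beta_3$ are the parameters to be estimated.
- $lnRV_t$ is the daily component: the log realised volatility on day $t$.
- $lnRV^w_t$ is the weekly component: the average log realised volatility over the past week.
- $lnRV^m_t$ is the monthly component: the average log realised volatility over the past month.
- $\epsilon_{t+1}$ is the error term.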
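
One way to make the weekly and monthly terms concrete is to write them as backward-looking averages of the daily series, which matches the rolling means computed in the Create Lags step. The 7-day and 30-day windows are an assumption taken from the `(1, 7, 30)` lags used under Results; equity-market applications of HAR commonly use 5 and 22 trading days instead.

$$
lnRV^w_t = \frac{1}{7}\sum_{i=0}^{6} lnRV_{t-i}, \qquad
lnRV^m_t = \frac{1}{30}\sum_{i=0}^{29} lnRV_{t-i}
$$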

### Load Data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [2]:
file_path = "../data/ALGO_daily.csv"  # forward slashes avoid backslash escapes and work on every OS
df = pd.read_csv(file_path)
df.rename(columns={'timestamp': 'Date'}, inplace=True)
df.set_index('Date', inplace=True)
df = df.loc[:, ['lnRV']]
df.head(3)

| Date | lnRV |
|---|---|
| 2019-06-23 | -3.865058 |
| 2019-06-24 | -3.652501 |
| 2019-06-25 | -3.161881 |


### Create Lags

In [15]:
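def create_lags(df, lags):
    """Add one lagged rolling-mean column of lnRV per window length in `lags`."""
    df_copy = df.copy()
    for lag in lags:
        # Mean over the last `lag` days, shifted by one so row t only uses data up to t-1
        df_copy[f'lnRV_{lag}D_mean_lag'] = df_copy['lnRV'].rolling(window=lag).mean().shift(1)
    # Drop the leading rows where the longest window is not yet available
    df_copy.dropna(inplace=True)
    return df_copy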

create_lags(df, (1,3)).head(4)

| Date | lnRV | lnRV_1D_mean_lag | lnRV_3D_mean_lag |
|---|---|---|---|
| 2019-06-26 | -3.331801 | -3.161881 | -3.559813 |
| 2019-06-27 | -3.956779 | -3.331801 | -3.382061 |
| 2019-06-28 | -4.305468 | -3.956779 | -3.483487 |
| 2019-06-29 | -4.842811 | -4.305468 | -3.864683 |


In [11]:
# Manual check of a 3-day mean: three daily lnRV values averaged
(-4.305468-4.842811-4.888673)/3

-4.678984
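
The manual check above can also be done programmatically. The following is a minimal sketch on a hypothetical toy series (`toy` and its values are made up for illustration, not ALGO data) that reproduces the `rolling(...).mean().shift(1)` construction and confirms that each lagged mean uses only observations from earlier days.

```python
import pandas as pd

# Hypothetical toy series standing in for the lnRV column (not real data).
toy = pd.DataFrame(
    {"lnRV": [-3.0, -3.5, -4.0, -4.5, -5.0]},
    index=pd.date_range("2019-06-23", periods=5, freq="D"),
)

window = 3
# Same construction as create_lags: rolling mean, then shift by one day.
lagged_mean = toy["lnRV"].rolling(window=window).mean().shift(1)

# Manual value for the last row: mean of the three preceding observations.
manual = toy["lnRV"].iloc[1:4].mean()

print(lagged_mean)
print("manual:", manual)
```

The first `window` rows are `NaN` because a full window of past values is not yet available, which is why `create_lags` drops them.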

### Feature & Target Split

In [18]:
def split_f_t(df, lags):
    df_h = create_lags(df, lags)
    
    features = df_h.columns
    features = features[features != 'lnRV']

    X = df_h[features]
    y = df_h['lnRV']
    
    # Adds a constant term to the predictor
    X = sm.add_constant(X)
    return X, y

X, y = split_f_t(df, (1,3))
X.head(4)

| Date | const | lnRV_1D_mean_lag | lnRV_3D_mean_lag |
|---|---|---|---|
| 2019-06-26 | 1.0 | -3.161881 | -3.559813 |
| 2019-06-27 | 1.0 | -3.331801 | -3.382061 |
| 2019-06-28 | 1.0 | -3.956779 | -3.483487 |
| 2019-06-29 | 1.0 | -4.305468 | -3.864683 |


In [17]:
y[:4]

Date
2019-06-26   -3.331801
2019-06-27   -3.956779
2019-06-28   -4.305468
2019-06-29   -4.842811
Name: lnRV, dtype: float64

### Train & Test Split

In [5]:
# Split the data into training and testing sets by the cutoff date
def split_t_t(X, y, cutoff_date='2024-01-01'):
    X_train, X_test = X[X.index < cutoff_date], X[X.index >= cutoff_date]
    y_train, y_test = y[y.index < cutoff_date], y[y.index >= cutoff_date]
    return X_train, X_test, y_train, y_test

In [19]:
X_train, X_test, y_train, y_test = split_t_t(X, y, cutoff_date='2024-01-01')
X_train.head(4)

| Date | const | lnRV_1D_mean_lag | lnRV_3D_mean_lag |
|---|---|---|---|
| 2019-06-26 | 1.0 | -3.161881 | -3.559813 |
| 2019-06-27 | 1.0 | -3.331801 | -3.382061 |
| 2019-06-28 | 1.0 | -3.956779 | -3.483487 |
| 2019-06-29 | 1.0 | -4.305468 | -3.864683 |


In [20]:
y_train[:4]

Date
2019-06-26   -3.331801
2019-06-27   -3.956779
2019-06-28   -4.305468
2019-06-29   -4.842811
Name: lnRV, dtype: float64
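
The split above compares date strings, which works because ISO `YYYY-MM-DD` strings sort chronologically. A possible variant, sketched here on a hypothetical toy frame, parses the index to datetimes first so the comparison does not depend on the string format. `split_by_date`, `X_toy`, and `y_toy` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data; in the notebook X and y come from split_f_t.
idx = pd.to_datetime(["2023-12-30", "2023-12-31", "2024-01-01", "2024-01-02"])
X_toy = pd.DataFrame(
    {"const": 1.0, "lnRV_1D_mean_lag": [-3.1, -3.3, -3.9, -4.3]},
    index=idx,
)
y_toy = pd.Series([-3.3, -3.9, -4.3, -4.8], index=idx, name="lnRV")


def split_by_date(X, y, cutoff_date):
    """Chronological split: rows before the cutoff train, the rest test."""
    cutoff = pd.Timestamp(cutoff_date)
    train_mask = X.index < cutoff  # boolean array, True for training rows
    return X[train_mask], X[~train_mask], y[train_mask], y[~train_mask]


X_tr, X_te, y_tr, y_te = split_by_date(X_toy, y_toy, "2024-01-01")
print(len(X_tr), len(X_te))
```

Splitting by date rather than at random keeps every test observation strictly after the training period, which is what a forecasting evaluation needs.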

### Prediction

In [6]:
cutoff_dates = {1: '2024-01-02', 3: '2024-01-04', 7: '2024-01-08', 30: '2024-01-31'}

In [7]:
def get_pred(df, lags, h):

    # Split the data into features and target
    X, y = split_f_t(df, lags)

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = split_t_t(X, y, cutoff_dates[h])
    
    # Fit the model
    model = sm.OLS(y_train, X_train).fit()

    # Make predictions
    pred = model.predict(X_test)
    pred = pd.DataFrame(pred, columns=['Predicted'])
    pred.index = X_test.index

    # save predictions to csv
    pred.to_csv(f'../res/HAR{lags}_{h}D.csv')
    
    return pred, y_test, model
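
To see the fit-then-predict step in isolation, here is a rough sketch on simulated data. It swaps `sm.OLS` for `numpy.linalg.lstsq` purely to keep the example dependency-light; both solve the same least-squares problem. The AR(1) series, the 400-row split, and all variable names are assumptions for illustration, not the notebook's data or results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated AR(1) series standing in for lnRV (not real data).
n = 500
values = np.empty(n)
values[0] = -4.0
for t in range(1, n):
    values[t] = -1.6 + 0.6 * values[t - 1] + rng.normal(scale=0.3)
s = pd.Series(values, name="lnRV")

# HAR-style regressors: lagged daily value and lagged 7-day mean.
X = pd.DataFrame({
    "const": 1.0,
    "lag_1D": s.shift(1),
    "lag_7D": s.rolling(7).mean().shift(1),
}).dropna()
y = s.loc[X.index]

# Chronological split: first 400 usable rows train, the rest test.
X_train, X_test = X.iloc[:400], X.iloc[400:]
y_train, y_test = y.iloc[:400], y.iloc[400:]

# Ordinary least squares via numpy (the notebook itself uses sm.OLS).
beta, *_ = np.linalg.lstsq(X_train.to_numpy(), y_train.to_numpy(), rcond=None)

# Out-of-sample predictions, indexed like the test features.
pred = pd.Series(X_test.to_numpy() @ beta, index=X_test.index, name="Predicted")

# In-sample residuals; with a constant term they average to zero.
residuals = y_train.to_numpy() - X_train.to_numpy() @ beta
print(dict(zip(X.columns, beta.round(3))))
```

Each prediction here uses only regressors observed up to the previous day, so the test-set forecasts are one-step-ahead forecasts made with coefficients fixed at the training cutoff.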

### Plotting

In [8]:
def plot_pred(pred, actual, h, lags):
    plt.figure(figsize=(8, 3))
    plt.plot(actual, label='Actual')
    plt.plot(pred, label='Predicted')
    plt.xticks(actual.index[::100])
    plt.xlabel('Date')
    plt.ylabel('Log realised volatility (lnRV)')
    plt.title(f'HAR{lags}_{h}D-Ahead Forecast')
    plt.legend()
    plt.savefig(f'../res/HAR{lags}_{h}D-Ahead Forecast.png')

## Results

In [9]:
for h in [1, 3, 7, 30]:
    for lags in ((1,7,30),):
        pred, actual, model = get_pred(df, lags, h)
        print(h, lags, '\n', model.summary())

1 (1, 7, 30) 
                             OLS Regression Results                            
Dep. Variable:                   lnRV   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.680
Method:                 Least Squares   F-statistic:                     1151.
Date:                Fri, 11 Apr 2025   Prob (F-statistic):               0.00
Time:                        17:07:19   Log-Likelihood:                -1248.7
No. Observations:                1624   AIC:                             2505.
Df Residuals:                    1620   BIC:                             2527.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -0