# Volatility Measurement and Forecasting

The purpose of this tutorial is to introduce several estimators for measuring and forecasting volatility.  This material is closely related to the following white papers:

1. *Volatility Modeling and Trading* - Artur Sepp, 2016

2. *Measuring Historical Volatility* - Colin Bennett & Miguel Gil, 2012

Our main objective will be to implement the code for various historical volatility estimators.  To test our work, we will attempt to replicate some of Sepp's results for *weekly* volatility forecasts for SPY (see pp. 38-43).

## Loading Packages

Let's begin by loading the packages that we will need.

In [1]:
import pandas as pd
import pandas_datareader as pdr
import numpy as np
import sklearn
pd.options.display.max_rows = 10

## Reading-In SPY Data From Yahoo Finance

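Sepp's analysis covers data from January 1, 2005 through April 2, 2016.  Let's grab these SPY prices from Yahoo Finance using `pandas_datareader`.  We start the download one trading day early (12/31/2004) so that a return can be calculated for the first trading day of 2005.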

In [2]:
df_spy = pdr.get_data_yahoo('SPY', start = '2004-12-31', end = '2016-04-02').reset_index()
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.rename(columns = {'date':'trade_date'}, inplace = True)
df_spy.insert(0, 'ticker', 'SPY')
df_spy

Unnamed: 0,ticker,trade_date,high,low,open,close,volume,adj_close
0,SPY,2004-12-31,121.660004,120.800003,121.300003,120.870003,28648800.0,87.549049
1,SPY,2005-01-03,121.760002,119.900002,121.559998,120.300003,55748000.0,87.136215
2,SPY,2005-01-04,120.540001,118.440002,120.459999,118.830002,69167600.0,86.071442
3,SPY,2005-01-05,119.250000,118.000000,118.739998,118.010002,65667300.0,85.477493
4,SPY,2005-01-06,119.150002,118.260002,118.440002,118.610001,47814700.0,85.912079
...,...,...,...,...,...,...,...,...
2827,SPY,2016-03-28,203.860001,202.710007,203.610001,203.240005,62408200.0,184.875488
2828,SPY,2016-03-29,205.250000,202.399994,202.759995,205.119995,92922900.0,186.585632
2829,SPY,2016-03-30,206.869995,205.589996,206.300003,206.020004,86365300.0,187.404327
2830,SPY,2016-03-31,206.410004,205.330002,205.910004,205.520004,94584100.0,186.949509


## Calculating Daily Returns & Realized Volatility

The volatility estimators that we will implement are built from three kinds of daily log-returns: close-to-close, overnight (close-to-open), and intraday (open-to-close).  Let's calculate them in the following block of code.

In [3]:
# daily (close-to-close)
df_spy['dly_ret'] = np.log(df_spy['close']).diff()
# overnight (close-to-open)
df_spy['overnight'] = np.log(df_spy['open']) - np.log(df_spy['close']).shift(1)
# intraday (open-to-close)
df_spy['open_close'] = np.log(df_spy['close']) - np.log(df_spy['open'])

df_spy = df_spy[1:].reset_index(drop = True)
df_spy

Unnamed: 0,ticker,trade_date,high,low,open,close,volume,adj_close,dly_ret,overnight,open_close
0,SPY,2005-01-03,121.760002,119.900002,121.559998,120.300003,55748000.0,87.136215,-0.004727,0.005692,-0.010419
1,SPY,2005-01-04,120.540001,118.440002,120.459999,118.830002,69167600.0,86.071442,-0.012295,0.001329,-0.013624
2,SPY,2005-01-05,119.250000,118.000000,118.739998,118.010002,65667300.0,85.477493,-0.006925,-0.000758,-0.006167
3,SPY,2005-01-06,119.150002,118.260002,118.440002,118.610001,47814700.0,85.912079,0.005071,0.003637,0.001434
4,SPY,2005-01-07,119.230003,118.129997,118.970001,118.440002,55847700.0,85.788956,-0.001434,0.003031,-0.004465
...,...,...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,203.860001,202.710007,203.610001,203.240005,62408200.0,184.875488,0.000591,0.002409,-0.001819
2827,SPY,2016-03-29,205.250000,202.399994,202.759995,205.119995,92922900.0,186.585632,0.009208,-0.002365,0.011572
2828,SPY,2016-03-30,206.869995,205.589996,206.300003,206.020004,86365300.0,187.404327,0.004378,0.005736,-0.001358
2829,SPY,2016-03-31,206.410004,205.330002,205.910004,205.520004,94584100.0,186.949509,-0.002430,-0.000534,-0.001896
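
As a quick sanity check on these definitions, the overnight and intraday log-returns should add up to the close-to-close log-return.  The sketch below illustrates this on a tiny made-up price history (the `toy` DataFrame is hypothetical, not SPY data).

```python
import numpy as np
import pandas as pd

# Hypothetical two-day price history (not SPY data).
toy = pd.DataFrame({'open': [100.0, 102.0], 'close': [101.0, 100.5]})

dly_ret = np.log(toy['close']).diff()
overnight = np.log(toy['open']) - np.log(toy['close']).shift(1)
open_close = np.log(toy['close']) - np.log(toy['open'])

# The first row is NaN for dly_ret and overnight because there is no prior close,
# so we compare from the second row onward.
decomposition_holds = np.allclose((overnight + open_close).iloc[1:], dly_ret.iloc[1:])
print(decomposition_holds)
```

The same identity can be read off the output above: for 2005-01-03, `0.005692 + (-0.010419)` is `-0.004727`.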


## Organizing Dates for Backtest

Organizing dates is an important step in a historical analysis.  

We are performing a weekly analysis, which means that in later steps we will be aggregating daily calculations into weeks.  Therefore, we will need to add a column to `df_spy` that will allow us to group by weeks.

The key to our approach will be to use the `.dt.weekday` attribute of the `trade_date` column.  In the following code, the variable `weekday` is a `Series` that contains the weekday associated with each date.  Notice that Monday is encoded by `0` and Friday is encoded by `4`.

In [4]:
weekday = df_spy['trade_date'].dt.weekday
weekday

0       0
1       1
2       2
3       3
4       4
       ..
2826    0
2827    1
2828    2
2829    3
2830    4
Name: trade_date, Length: 2831, dtype: int64

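The following code is a simple `for`-loop that creates a week number for each trade date: the counter is incremented whenever the weekday code decreases from one row to the next, which signals the start of a new week.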

In [5]:
week_num = []
ix_week = 0
week_num.append(ix_week)
for ix in range(0, len(weekday) - 1):
    prev_day = weekday[ix]
    curr_day = weekday[ix + 1]
    if curr_day < prev_day:
        ix_week = ix_week + 1
    week_num.append(ix_week)
np.array(week_num) # I use the array function simply because it looks better when it prints

array([  0,   0,   0, ..., 586, 586, 586])
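
The same logic can also be written without a loop.  The following is a minimal sketch on made-up weekday codes (the `toy_weekday` Series is hypothetical); it assumes, like the loop above, that a new week begins wherever the weekday code drops.

```python
import pandas as pd

# Hypothetical weekday codes: a full week, a holiday-shortened week, then a Monday.
toy_weekday = pd.Series([0, 1, 2, 3, 4, 1, 2, 3, 4, 0])

# A new week starts wherever the weekday code drops relative to the prior row.
# The first row compares against NaN, which evaluates to False, so it stays in week 0.
toy_week_num = (toy_weekday.diff() < 0).cumsum()
print(toy_week_num.tolist())
```

Applied to the real data, this would correspond to `(weekday.diff() < 0).cumsum()`.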

Let's now insert the week numbers into `df_spy`.

In [6]:
df_spy.insert(2, 'week_num', week_num)
df_spy

Unnamed: 0,ticker,trade_date,week_num,high,low,open,close,volume,adj_close,dly_ret,overnight,open_close
0,SPY,2005-01-03,0,121.760002,119.900002,121.559998,120.300003,55748000.0,87.136215,-0.004727,0.005692,-0.010419
1,SPY,2005-01-04,0,120.540001,118.440002,120.459999,118.830002,69167600.0,86.071442,-0.012295,0.001329,-0.013624
2,SPY,2005-01-05,0,119.250000,118.000000,118.739998,118.010002,65667300.0,85.477493,-0.006925,-0.000758,-0.006167
3,SPY,2005-01-06,0,119.150002,118.260002,118.440002,118.610001,47814700.0,85.912079,0.005071,0.003637,0.001434
4,SPY,2005-01-07,0,119.230003,118.129997,118.970001,118.440002,55847700.0,85.788956,-0.001434,0.003031,-0.004465
...,...,...,...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,586,203.860001,202.710007,203.610001,203.240005,62408200.0,184.875488,0.000591,0.002409,-0.001819
2827,SPY,2016-03-29,586,205.250000,202.399994,202.759995,205.119995,92922900.0,186.585632,0.009208,-0.002365,0.011572
2828,SPY,2016-03-30,586,206.869995,205.589996,206.300003,206.020004,86365300.0,187.404327,0.004378,0.005736,-0.001358
2829,SPY,2016-03-31,586,206.410004,205.330002,205.910004,205.520004,94584100.0,186.949509,-0.002430,-0.000534,-0.001896


**Discussion Question:** The `pandas.Series.dt.week` attribute gives the *week-of-the-year* for a given trade-date.  My initial idea was to use `.dt.week` and `.dt.year` for my grouping, but I ran into an issue.  Can you think of what the issue was?

In [7]:
##> Weeks at the beginning and end of the year may be partial weeks.
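
To make the issue concrete, here is a small sketch using the trading days around New Year 2015.  It uses `.dt.isocalendar()`, which replaces `.dt.week` in recent versions of pandas (`.dt.week` was deprecated and later removed).  The `dates` Series is constructed by hand for illustration.

```python
import pandas as pd

# Trading days around New Year 2015 (January 1 was a market holiday).
dates = pd.Series(pd.to_datetime(['2014-12-29', '2014-12-30', '2014-12-31', '2015-01-02']))

iso = dates.dt.isocalendar()   # DataFrame with columns: year, week, day (ISO 8601)
calendar_year = dates.dt.year

print(iso['week'].tolist())      # ISO week-of-year for each date
print(calendar_year.tolist())    # calendar year for each date
print(iso['year'].tolist())      # ISO year for each date
```

All four dates fall in ISO week 1, but their calendar years differ, so grouping by calendar year and week would split this single trading week in two.  It would also lump December 29-31, 2014 together with the first trading days of January 2014, since both carry the label (2014, 1).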





We can now use `groupby()` to calculate the starting and ending dates for each week.

In [8]:
df_start_end = \
    (
    df_spy
        .groupby('week_num')['trade_date']
        # named aggregation: output column name = aggregation function
        .agg(week_start = 'min', week_end = 'max')
        .reset_index()
    )
df_start_end

Unnamed: 0,week_num,week_start,week_end
0,0,2005-01-03,2005-01-07
1,1,2005-01-10,2005-01-14
2,2,2005-01-18,2005-01-21
3,3,2005-01-24,2005-01-28
4,4,2005-01-31,2005-02-04
...,...,...,...
582,582,2016-02-29,2016-03-04
583,583,2016-03-07,2016-03-11
584,584,2016-03-14,2016-03-18
585,585,2016-03-21,2016-03-24


Let's merge this data into `df_spy`.

In [9]:
df_spy = df_spy.merge(df_start_end)
df_spy

Unnamed: 0,ticker,trade_date,week_num,high,low,open,close,volume,adj_close,dly_ret,overnight,open_close,week_start,week_end
0,SPY,2005-01-03,0,121.760002,119.900002,121.559998,120.300003,55748000.0,87.136215,-0.004727,0.005692,-0.010419,2005-01-03,2005-01-07
1,SPY,2005-01-04,0,120.540001,118.440002,120.459999,118.830002,69167600.0,86.071442,-0.012295,0.001329,-0.013624,2005-01-03,2005-01-07
2,SPY,2005-01-05,0,119.250000,118.000000,118.739998,118.010002,65667300.0,85.477493,-0.006925,-0.000758,-0.006167,2005-01-03,2005-01-07
3,SPY,2005-01-06,0,119.150002,118.260002,118.440002,118.610001,47814700.0,85.912079,0.005071,0.003637,0.001434,2005-01-03,2005-01-07
4,SPY,2005-01-07,0,119.230003,118.129997,118.970001,118.440002,55847700.0,85.788956,-0.001434,0.003031,-0.004465,2005-01-03,2005-01-07
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,586,203.860001,202.710007,203.610001,203.240005,62408200.0,184.875488,0.000591,0.002409,-0.001819,2016-03-28,2016-04-01
2827,SPY,2016-03-29,586,205.250000,202.399994,202.759995,205.119995,92922900.0,186.585632,0.009208,-0.002365,0.011572,2016-03-28,2016-04-01
2828,SPY,2016-03-30,586,206.869995,205.589996,206.300003,206.020004,86365300.0,187.404327,0.004378,0.005736,-0.001358,2016-03-28,2016-04-01
2829,SPY,2016-03-31,586,206.410004,205.330002,205.910004,205.520004,94584100.0,186.949509,-0.002430,-0.000534,-0.001896,2016-03-28,2016-04-01


## Calculating Weekly Realized Volatility

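Now that we have a `week_num` associated with each `trade_date`, we can use `groupby()` to calculate the realized volatility of each week as the annualized standard deviation of its daily returns.

These weekly realized volatilities are the labels that we will be predicting later in our analysis.  Each estimator calculated from week $n$ will serve as the forecast for the realized volatility of week $n+1$.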

In [10]:
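df_realized = \
    (
    df_spy
        # np.std() defaults to ddof = 0 (population standard deviation)
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']].agg(lambda x: np.std(x) * np.sqrt(252))
        .rename(columns = {'dly_ret':'realized_vol'})
    )
# drop week 0, since there is no earlier week from which to forecast it
df_realized = df_realized[1:]
df_realized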

Unnamed: 0,week_num,week_start,week_end,realized_vol
1,1,2005-01-10,2005-01-14,0.093295
2,2,2005-01-18,2005-01-21,0.126557
3,3,2005-01-24,2005-01-28,0.029753
4,4,2005-01-31,2005-02-04,0.069583
5,5,2005-02-07,2005-02-11,0.084567
...,...,...,...,...
582,582,2016-02-29,2016-03-04,0.159055
583,583,2016-03-07,2016-03-11,0.137591
584,584,2016-03-14,2016-03-18,0.057861
585,585,2016-03-21,2016-03-24,0.048135


## Close-to-Close Estimator

The first estimator that we will implement is the simple close-to-close estimator.

In [11]:
def close_to_close(r):
    T = r.shape[0]
    r_bar = r.mean()
    vol = np.sqrt((1 / (T - 1)) * ((r - r_bar) ** 2).sum()) * np.sqrt(252)
    return(vol)
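
For reference, the function above corresponds to the following formula, where $r_t$ is the daily close-to-close log-return, $T$ is the number of trading days in the week, and the factor $\sqrt{252}$ annualizes the daily figure:

$$
\sigma_{CC} = \sqrt{252} \cdot \sqrt{\frac{1}{T-1} \sum_{t=1}^{T} \left( r_t - \bar{r} \right)^2},
\qquad
\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t
$$

Note that this uses the sample standard deviation (denominator $T-1$), whereas the realized volatility above was calculated with `np.std()`, whose default is the population standard deviation (denominator $T$).  For a five-day week, the two differ by a factor of $\sqrt{5/4} \approx 1.118$.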

Notice that `close_to_close()` is an aggregation function that takes in an array of daily returns and returns a single number.  In order to calculate weekly estimates, we pass `close_to_close()` as the aggregation function to a `.groupby()`.

In [12]:
df_close_to_close = \
    (
    df_spy
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
        .agg(close_to_close)
        .rename(columns = {'dly_ret':'close_to_close'})
    )
# drop the final week, since there is no following week for it to forecast
df_close_to_close = df_close_to_close[0:-1]
df_close_to_close

Unnamed: 0,week_num,week_start,week_end,close_to_close
0,0,2005-01-03,2005-01-07,0.102492
1,1,2005-01-10,2005-01-14,0.104307
2,2,2005-01-18,2005-01-21,0.146136
3,3,2005-01-24,2005-01-28,0.033265
4,4,2005-01-31,2005-02-04,0.077796
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.175394
582,582,2016-02-29,2016-03-04,0.177829
583,583,2016-03-07,2016-03-11,0.153831
584,584,2016-03-14,2016-03-18,0.064691


**Discussion Question:** Verify that the `.groupby()` above works just fine without including `week_start` and `week_end`.  If that is the case, then why did I include them?

In [13]:
(
df_spy
    .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
    .agg(close_to_close)
    .rename(columns = {'dly_ret':'close_to_close'})
)

Unnamed: 0,week_num,week_start,week_end,close_to_close
0,0,2005-01-03,2005-01-07,0.102492
1,1,2005-01-10,2005-01-14,0.104307
2,2,2005-01-18,2005-01-21,0.146136
3,3,2005-01-24,2005-01-28,0.033265
4,4,2005-01-31,2005-02-04,0.077796
...,...,...,...,...
582,582,2016-02-29,2016-03-04,0.177829
583,583,2016-03-07,2016-03-11,0.153831
584,584,2016-03-14,2016-03-18,0.064691
585,585,2016-03-21,2016-03-24,0.055581


**Code Challenge:** Create an alternative version of our close-to-close function using `np.std()`.  Call the new function `close_to_close_std()`. Verify that your values match.

In [14]:
def close_to_close_std(r):
    vol = np.std(r, ddof = 1) * np.sqrt(252)
    return(vol)

df_std = \
    (
    df_spy
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
        .agg(close_to_close_std)
        .rename(columns = {'dly_ret':'close_to_close'})
    )
    
df_std = df_std[:-1]
print(df_std['close_to_close'].sum())
print(df_close_to_close['close_to_close'].sum())

90.6980642016463
90.6980642016463


In Sepp 2016, the author uses the $R^2$ between the forecasts and the realized labels as a means of assessing the quality of a particular estimator.  Let's utilize `sklearn` to do the same.

We begin by importing the `LinearRegression` constructor and instantiating a model.

In [15]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)

Next, let's organize our features and labels.

In [16]:
X = df_close_to_close[['close_to_close']]
y = df_realized['realized_vol']

We can now fit the model.

In [17]:
mdl_reg.fit(X, y)

LinearRegression()

The `.score()` method of a `LinearRegression` model returns the $R^2$.

In [18]:
mdl_reg.score(X, y)

0.4093645253435927

And we can examine the slope and intercept of our model as follows:

In [19]:
print(mdl_reg.intercept_)
print(mdl_reg.coef_)

0.04933384095053199
[0.57068844]


**Discussion Question:** How do our results compare to Sepp's?

Let's also measure the bias and efficiency of the close-to-close estimator.

In [20]:
# bias (pandas aligns the two Series on index labels, so this subtracts same-week values)
print(np.mean(df_close_to_close['close_to_close'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_close_to_close['close_to_close']))

0.01708041424884867
0.8919571135609281
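
One thing worth checking in the bias calculation: `sklearn` pairs `X` and `y` by position, but subtracting two pandas `Series` pairs values by index label.  Because the estimator DataFrame keeps labels starting at `0` and `df_realized` keeps labels starting at `1`, the subtraction appears to compare each estimate with the realized volatility of the *same* week rather than the next one.  The sketch below shows the difference on made-up numbers (the `forecast` and `realized` Series are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical values: forecasts made in weeks 0-2, realized vol for weeks 1-3.
forecast = pd.Series([0.10, 0.20, 0.30], index=[0, 1, 2])
realized = pd.Series([0.12, 0.18, 0.33], index=[1, 2, 3])

# Label-aligned: pairs forecast[1] with realized[1], and so on (same week).
# Labels that appear in only one Series produce NaN.
same_week = forecast - realized

# Position-aligned: pairs forecast[0] with realized[1], and so on (next week).
next_week = forecast.to_numpy() - realized.to_numpy()

print(same_week)
print(next_week)
```

If the intent is to measure the bias of the forecast against the following week, converting both columns with `.to_numpy()` (or resetting the index) before subtracting would be one way to get there.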


## Parkinson

The next estimator that we implement is the Parkinson estimator, which is based on the daily high and low prices.

In [21]:
def parkinson(hl):
    T = hl.shape[0]
    high = hl.high
    low = hl.low
    vol = np.sqrt(np.sum((np.log(high / low) ** 2)) * (1 / (4 * np.log(2))) / T) * np.sqrt(252)
    return(vol)
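
Written as a formula, with $H_t$ and $L_t$ denoting the daily high and low, the function above calculates:

$$
\sigma_{P} = \sqrt{252} \cdot \sqrt{\frac{1}{4 \ln 2} \cdot \frac{1}{T} \sum_{t=1}^{T} \left( \ln \frac{H_t}{L_t} \right)^2}
$$

The constant $4 \ln 2$ comes from the fact that, for a driftless Brownian motion, the expected squared log-range over a day is $4 \ln 2$ times the daily variance.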

Let's apply our function to a single week's worth of data in `df_spy`.

In [22]:
parkinson(df_spy.query('week_num == 0')[['high', 'low']])

0.12051770757840295

From a programming standpoint, the Parkinson estimate is a little bit different because it is an aggregation function that takes in two columns (`high` and `low`) and returns a single number.  

For this reason, we will need to use `.apply()` rather than `.agg()`.
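
The distinction can be seen on a small example.  In this sketch, the `toy` DataFrame and the `total_range()` function are made up for illustration; the point is only that `.apply()` hands each group's full sub-DataFrame to the function, while `.agg()` works on one column at a time.

```python
import pandas as pd

# Hypothetical data: two groups, two rows each.
toy = pd.DataFrame({
    'g':    ['a', 'a', 'b', 'b'],
    'high': [10.0, 12.0, 20.0, 21.0],
    'low':  [9.0, 10.0, 18.0, 20.5],
})

def total_range(hl):
    # Needs both columns at once, so it must receive the whole sub-DataFrame.
    return (hl['high'] - hl['low']).sum()

# .apply() passes each group's two-column DataFrame to the function.
by_apply = toy.groupby('g')[['high', 'low']].apply(total_range)

# .agg() aggregates each column separately, giving one result per column.
by_agg = toy.groupby('g')[['high', 'low']].agg('max')

print(by_apply)
print(by_agg)
```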

In [23]:
df_parkinson = \
    (
    df_spy.groupby(['week_num', 'week_start', 'week_end'])[['high', 'low']].apply(parkinson)
    .to_frame().reset_index()
    .rename(columns = {0:'parkinson'})
    )
df_parkinson = df_parkinson[:-1]
df_parkinson

Unnamed: 0,week_num,week_start,week_end,parkinson
0,0,2005-01-03,2005-01-07,0.120518
1,1,2005-01-10,2005-01-14,0.085756
2,2,2005-01-18,2005-01-21,0.107782
3,3,2005-01-24,2005-01-28,0.066081
4,4,2005-01-31,2005-02-04,0.073116
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.129379
582,582,2016-02-29,2016-03-04,0.126361
583,583,2016-03-07,2016-03-11,0.121619
584,584,2016-03-14,2016-03-18,0.083877


Next, let's fit a linear regression to the Parkinson forecasts and the realized volatilities.

In [24]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)
X = df_parkinson[['parkinson']]
y = df_realized['realized_vol']
mdl_reg.fit(X, y)

LinearRegression()

**Code Challenge:** Check the $R^2$ and coefficients and compare them with Sepp's.  How closely do we match?

In [25]:
print(mdl_reg.score(X, y))
print(mdl_reg.intercept_)
print(mdl_reg.coef_)

0.6121937433393481
0.013786304921137138
[0.93118615]


Let's also measure the bias and efficiency of the Parkinson estimator.

In [26]:
# bias
print(np.mean(df_parkinson['parkinson'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_parkinson['parkinson']))

-0.004732356458542568
1.1901235883927868


## Garman-Klass

The next estimator is the Garman-Klass.

In [27]:
def garman_klass(ohlc):
    T = ohlc.shape[0]
    o = ohlc.open
    h = ohlc.high
    l = ohlc.low
    c = ohlc.close
    vol = np.sqrt(np.sum((0.5 * np.log(h / l) ** 2) - ((2 * np.log(2) - 1) * np.log(c / o) ** 2)) / T) * np.sqrt(252)
    return(vol)
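
In formula form, with $O_t$, $H_t$, $L_t$, $C_t$ denoting the daily open, high, low, and close, the function above calculates:

$$
\sigma_{GK} = \sqrt{252} \cdot \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left[ \frac{1}{2} \left( \ln \frac{H_t}{L_t} \right)^2 - \left( 2 \ln 2 - 1 \right) \left( \ln \frac{C_t}{O_t} \right)^2 \right]}
$$

Here $2 \ln 2 - 1 \approx 0.386$.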

Let's check that the function works for a single week of data.

In [28]:
garman_klass(df_spy.query('week_num == 0')[['open', 'high', 'low', 'close']])

0.11506262004283113

The Garman-Klass estimator takes in four different columns to produce a single numeric estimate, thus we have to use `.apply()`.

In [29]:
df_garman_klass = \
    (
    df_spy.groupby(['week_num', 'week_start', 'week_end'])[['open', 'high', 'low', 'close']].apply(garman_klass)
    .to_frame().reset_index()
    .rename(columns = {0:'garman_klass'} )
    )
df_garman_klass = df_garman_klass[:-1]
df_garman_klass

Unnamed: 0,week_num,week_start,week_end,garman_klass
0,0,2005-01-03,2005-01-07,0.115063
1,1,2005-01-10,2005-01-14,0.087781
2,2,2005-01-18,2005-01-21,0.089608
3,3,2005-01-24,2005-01-28,0.074419
4,4,2005-01-31,2005-02-04,0.067922
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.122548
582,582,2016-02-29,2016-03-04,0.121252
583,583,2016-03-07,2016-03-11,0.134788
584,584,2016-03-14,2016-03-18,0.083794


Next, let's assess the quality of the forecasts by fitting a linear regression and calculating the $R^2$.

In [30]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)
X = df_garman_klass[['garman_klass']]
y = df_realized['realized_vol']
mdl_reg.fit(X, y)
mdl_reg.score(X, y)

0.6130244469714066

Let's also measure the bias and efficiency of the Garman-Klass estimator.

In [31]:
# bias
print(np.mean(df_garman_klass['garman_klass'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_garman_klass['garman_klass']))

-0.0027699614763221536
1.162669294935497


## Rogers-Satchell

**Code Challenge:** Implement the Rogers-Satchell model, and calculate the $R^2$ between the forecasts and the realized, and also the bias and efficiency.

In [32]:
def rogers_satchell(ohlc):
    T = ohlc.shape[0]
    o = ohlc.open
    h = ohlc.high
    l = ohlc.low
    c = ohlc.close
    vol = np.sqrt(np.sum((np.log(h / c) * np.log(h / o)) + (np.log(l / c) * np.log(l / o))) / T) * np.sqrt(252)
    return(vol)
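
For reference, the function above corresponds to the following formula, using the same notation for the daily open, high, low, and close ($O_t$, $H_t$, $L_t$, $C_t$):

$$
\sigma_{RS} = \sqrt{252} \cdot \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left[ \ln \frac{H_t}{C_t} \ln \frac{H_t}{O_t} + \ln \frac{L_t}{C_t} \ln \frac{L_t}{O_t} \right]}
$$

Since the high is at least as large as both the open and the close, and the low is no larger than either, both products in the sum are non-negative, so the square root is always defined.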

In [33]:
rogers_satchell(df_spy.query('week_num == 0')[['open', 'high', 'low', 'close']])

0.11015628389785498

In [34]:
df_rogers_satchell = \
    (
    df_spy.groupby(['week_num', 'week_start', 'week_end'])[['open', 'high', 'low', 'close']].apply(rogers_satchell)
    .to_frame().reset_index()
    .rename(columns = {0:'rogers_satchell'} )
    )
df_rogers_satchell = df_rogers_satchell[:-1]
df_rogers_satchell

Unnamed: 0,week_num,week_start,week_end,rogers_satchell
0,0,2005-01-03,2005-01-07,0.110156
1,1,2005-01-10,2005-01-14,0.089740
2,2,2005-01-18,2005-01-21,0.075553
3,3,2005-01-24,2005-01-28,0.082389
4,4,2005-01-31,2005-02-04,0.063900
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.117695
582,582,2016-02-29,2016-03-04,0.116514
583,583,2016-03-07,2016-03-11,0.142715
584,584,2016-03-14,2016-03-18,0.079714


In [35]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)
X = df_rogers_satchell[['rogers_satchell']]
y = df_realized['realized_vol']
mdl_reg.fit(X, y)
mdl_reg.score(X, y)

0.5942065182092937

In [36]:
# bias
print(np.mean(df_rogers_satchell['rogers_satchell'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_rogers_satchell['rogers_satchell']))

-0.001544745869980974
1.1175591828194589


## Yang-Zhang

And finally, let's repeat the same steps for the Yang-Zhang estimator.

In [37]:
def yang_zhang(ohlc_on_oc):
    T = ohlc_on_oc.shape[0]
    ohlc = ohlc_on_oc[['open', 'high', 'low', 'close']]
    on = ohlc_on_oc.overnight
    oc = ohlc_on_oc.open_close
    
    var_overnight = (close_to_close(on) / np.sqrt(252)) ** 2
    var_open_close = (close_to_close(oc) / np.sqrt(252)) ** 2
    var_rogers_satchell = (rogers_satchell(ohlc) / np.sqrt(252)) ** 2
    
    c = 0.34 / (1.34 + (T + 1)/(T - 1))
    
    vol = np.sqrt((var_overnight) + (c * var_open_close) + ((1 - c) * (var_rogers_satchell))) * np.sqrt(252)
    
    return(vol)
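
In formula form, the function above combines three variances.  The weight is written as $k$ here (it is the variable `c` in the code):

$$
\sigma_{YZ} = \sqrt{252} \cdot \sqrt{\sigma_{o}^2 + k \, \sigma_{c}^2 + (1 - k) \, \sigma_{RS}^2},
\qquad
k = \frac{0.34}{1.34 + \frac{T+1}{T-1}}
$$

where $\sigma_{o}^2$ is the sample variance of the overnight returns $\ln(O_t / C_{t-1})$, $\sigma_{c}^2$ is the sample variance of the open-to-close returns $\ln(C_t / O_t)$ (both with denominator $T-1$), and $\sigma_{RS}^2$ is the Rogers-Satchell variance.  All three are daily (unannualized) variances, which is why the code divides the annualized volatilities by $\sqrt{252}$ before squaring.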

Checking the function on a single week of data.

In [38]:
yang_zhang(df_spy.query('week_num == 0')[['open', 'high', 'low', 'close', 'overnight', 'open_close']])

0.11480516872375435

Calculating weekly forecasts using `.groupby()` and `.apply()`.

In [39]:
df_yang_zhang = \
    (
    df_spy.groupby(['week_num', 'week_start', 'week_end'])[['open', 'high', 'low', 'close', 'overnight', 'open_close']].apply(yang_zhang)
    .to_frame().reset_index()
    .rename(columns = {0:'yang_zhang'} )
    )
df_yang_zhang = df_yang_zhang[:-1]
df_yang_zhang

Unnamed: 0,week_num,week_start,week_end,yang_zhang
0,0,2005-01-03,2005-01-07,0.114805
1,1,2005-01-10,2005-01-14,0.097092
2,2,2005-01-18,2005-01-21,0.096201
3,3,2005-01-24,2005-01-28,0.083237
4,4,2005-01-31,2005-02-04,0.073564
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.170104
582,582,2016-02-29,2016-03-04,0.132710
583,583,2016-03-07,2016-03-11,0.171927
584,584,2016-03-14,2016-03-18,0.082335


Let's check the performance of Yang-Zhang by checking the $R^2$ of the fitted regression.

In [40]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)
X = df_yang_zhang[['yang_zhang']]
y = df_realized['realized_vol']
mdl_reg.fit(X, y)
mdl_reg.score(X, y)

0.577240027949232

Let's also measure the bias and efficiency of the Yang-Zhang estimator.

In [41]:
# bias
print(np.mean(df_yang_zhang['yang_zhang'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_yang_zhang['yang_zhang']))

0.026423462963608067
0.9314743785664658


**Code Challenge:** There is a short-hand identity for $R^2$ that would allow us to not have to bother with `sklearn`.  Google it and implement it.

In [42]:
np.corrcoef(df_yang_zhang['yang_zhang'], df_realized['realized_vol'])[0, 1] ** 2

0.5772400279492313
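
The identity at work here is that, for a linear regression with one feature and an intercept, the $R^2$ equals the squared Pearson correlation between the feature and the label.  The following sketch checks this on made-up numbers (the arrays `x` and `y` are hypothetical), using `np.polyfit()` for the least-squares fit instead of `sklearn`.

```python
import numpy as np

# Hypothetical forecasts (x) and realized values (y).
x = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
y = np.array([0.12, 0.18, 0.33, 0.37, 0.52])

# Ordinary least squares with an intercept; polyfit returns the slope first.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares).
r2_from_fit = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# R^2 from the shortcut: squared correlation.
r2_from_corr = np.corrcoef(x, y)[0, 1] ** 2

print(np.isclose(r2_from_fit, r2_from_corr))
```

This is consistent with the Yang-Zhang result above, where the `sklearn` score and the squared correlation agree to many decimal places.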