# The Close-to-Close Estimator

This material is closely related to the following white papers:

1. *Volatility Modeling and Trading* - Artur Sepp, 2016

2. *Measuring Historical Volatility* - Colin Bennett & Miguel Gil, 2012

Our main objective will be to implement the code for the close-to-close volatility estimator.  To test our work, we will attempt to replicate Sepp's results for *weekly* volatility forecasts for SPY (see pp. 38-43).

For your project you will be asked to implement code for other estimators.

## Loading Packages

Let's begin by loading the packages that we will need.

In [1]:
import pandas as pd
import pandas_datareader as pdr
import numpy as np
import sklearn
pd.options.display.max_rows = 10

## Reading-In SPY Data From Yahoo Finance

Sepp's analysis covers data starting from 1/1/2005 and ending on 4/2/2016.  Let's grab these SPY prices from Yahoo Finance using `pandas_datareader`.

In [2]:
df_spy = pdr.get_data_yahoo('SPY', start = '2004-12-31', end = '2016-04-02').reset_index()
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.rename(columns = {'date':'trade_date'}, inplace = True)
df_spy.insert(0, 'ticker', 'SPY')
df_spy

Unnamed: 0,ticker,trade_date,high,low,open,close,volume,adj_close
0,SPY,2004-12-31,121.660004,120.800003,121.300003,120.870003,28648800.0,86.040260
1,SPY,2005-01-03,121.760002,119.900002,121.559998,120.300003,55748000.0,85.634529
2,SPY,2005-01-04,120.540001,118.440002,120.459999,118.830002,69167600.0,84.588089
3,SPY,2005-01-05,119.250000,118.000000,118.739998,118.010002,65667300.0,84.004417
4,SPY,2005-01-06,119.150002,118.260002,118.440002,118.610001,47814700.0,84.431496
...,...,...,...,...,...,...,...,...
2827,SPY,2016-03-28,203.860001,202.710007,203.610001,203.240005,62408200.0,181.689377
2828,SPY,2016-03-29,205.250000,202.399994,202.759995,205.119995,92922900.0,183.369995
2829,SPY,2016-03-30,206.869995,205.589996,206.300003,206.020004,86365300.0,184.174637
2830,SPY,2016-03-31,206.410004,205.330002,205.910004,205.520004,94584100.0,183.727631


## Calculating Daily Returns & Realized Volatility

The close-to-close estimator is a function of daily log returns, so let's calculate those now.

In [3]:
df_spy['dly_ret'] = np.log(df_spy['close']).diff()
df_spy = df_spy[1:].reset_index(drop = True)
df_spy

Unnamed: 0,ticker,trade_date,high,low,open,close,volume,adj_close,dly_ret
0,SPY,2005-01-03,121.760002,119.900002,121.559998,120.300003,55748000.0,85.634529,-0.004727
1,SPY,2005-01-04,120.540001,118.440002,120.459999,118.830002,69167600.0,84.588089,-0.012295
2,SPY,2005-01-05,119.250000,118.000000,118.739998,118.010002,65667300.0,84.004417,-0.006925
3,SPY,2005-01-06,119.150002,118.260002,118.440002,118.610001,47814700.0,84.431496,0.005071
4,SPY,2005-01-07,119.230003,118.129997,118.970001,118.440002,55847700.0,84.310509,-0.001434
...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,203.860001,202.710007,203.610001,203.240005,62408200.0,181.689377,0.000591
2827,SPY,2016-03-29,205.250000,202.399994,202.759995,205.119995,92922900.0,183.369995,0.009208
2828,SPY,2016-03-30,206.869995,205.589996,206.300003,206.020004,86365300.0,184.174637,0.004378
2829,SPY,2016-03-31,206.410004,205.330002,205.910004,205.520004,94584100.0,183.727631,-0.002430
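
As a quick sanity check on the return calculation, here is a minimal sketch that recomputes the first two log returns from the first three closing prices shown in the tables above.  It also shows that differencing log prices is the same as taking the log of the price ratio.  The variable names `close`, `ret_diff`, and `ret_ratio` are just for illustration.

```python
import numpy as np
import pandas as pd

# First three SPY closes from the table above (2004-12-31 through 2005-01-04).
close = pd.Series([120.870003, 120.300003, 118.830002])

# Log return via differencing log prices (the approach used in the cell above).
ret_diff = np.log(close).diff()

# Equivalent form: log of the ratio of consecutive prices.
ret_ratio = np.log(close / close.shift(1))

# The first entry is NaN because there is no prior close to compare against.
print(ret_diff.round(6).tolist())
print(ret_ratio.round(6).tolist())
```

The leading `NaN` is the reason the first row of `df_spy` is dropped right after the returns are calculated.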


## Organizing Dates for Backtest

Organizing dates is an important step in a historical analysis.  

We are performing a weekly analysis, which means that in later steps we will be aggregating daily calculations grouped into weeks.  Therefore, we will need to add a column to `df_spy` that will allow us to group by weeks.

The key to our approach will be to use the `.dt.weekday` attribute of the `trade_date` column.  In the following code, the variable `weekday` is a `Series` that contains the weekday associated with each date.  Notice that Monday is encoded by `0` and Friday is encoded by `4`.

In [4]:
weekday = df_spy['trade_date'].dt.weekday
weekday

0       0
1       1
2       2
3       3
4       4
       ..
2826    0
2827    1
2828    2
2829    3
2830    4
Name: trade_date, Length: 2831, dtype: int64

The following code is a simple `for`-loop that has the effect of creating a week-number for each week.

In [5]:
week_num = []
ix_week = 0
week_num.append(ix_week)
for ix in range(0, len(weekday) - 1):
    prev_day = weekday[ix]
    curr_day = weekday[ix + 1]
    if curr_day < prev_day:
        ix_week = ix_week + 1
    week_num.append(ix_week)
np.array(week_num) # I use the array function simply because it looks better when it prints

array([  0,   0,   0, ..., 586, 586, 586])
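
The same week numbers can probably be produced without a loop.  The sketch below is one possible vectorized version, shown on a handful of hypothetical dates rather than on `df_spy`.  It relies on the same assumption as the loop: a new week begins wherever the weekday number drops relative to the previous row.  The names `dates`, `week_num_vec`, and `week_num_loop` are made up for this example.

```python
import pandas as pd

# A few trading dates spanning three weeks; Monday 2005-01-17 is skipped
# because it was a market holiday.
dates = pd.Series(pd.to_datetime([
    '2005-01-13', '2005-01-14',   # Thu, Fri
    '2005-01-18', '2005-01-19',   # Tue, Wed
    '2005-01-24',                 # Mon
]))
weekday = dates.dt.weekday

# Vectorized: flag rows where the weekday falls, then count the flags cumulatively.
week_num_vec = (weekday.diff() < 0).cumsum()

# Loop version, following the logic of the cell above, for comparison.
week_num_loop = [0]
for ix in range(0, len(weekday) - 1):
    bump = 1 if weekday[ix + 1] < weekday[ix] else 0
    week_num_loop.append(week_num_loop[-1] + bump)

print(week_num_vec.tolist())
print(week_num_loop)
```

The first element of `weekday.diff()` is `NaN`, and `NaN < 0` evaluates to `False`, so the first row lands in week `0` just as it does in the loop.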

Let's now insert the week numbers into `df_spy`.

In [6]:
df_spy.insert(2, 'week_num', week_num)
df_spy

Unnamed: 0,ticker,trade_date,week_num,high,low,open,close,volume,adj_close,dly_ret
0,SPY,2005-01-03,0,121.760002,119.900002,121.559998,120.300003,55748000.0,85.634529,-0.004727
1,SPY,2005-01-04,0,120.540001,118.440002,120.459999,118.830002,69167600.0,84.588089,-0.012295
2,SPY,2005-01-05,0,119.250000,118.000000,118.739998,118.010002,65667300.0,84.004417,-0.006925
3,SPY,2005-01-06,0,119.150002,118.260002,118.440002,118.610001,47814700.0,84.431496,0.005071
4,SPY,2005-01-07,0,119.230003,118.129997,118.970001,118.440002,55847700.0,84.310509,-0.001434
...,...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,586,203.860001,202.710007,203.610001,203.240005,62408200.0,181.689377,0.000591
2827,SPY,2016-03-29,586,205.250000,202.399994,202.759995,205.119995,92922900.0,183.369995,0.009208
2828,SPY,2016-03-30,586,206.869995,205.589996,206.300003,206.020004,86365300.0,184.174637,0.004378
2829,SPY,2016-03-31,586,206.410004,205.330002,205.910004,205.520004,94584100.0,183.727631,-0.002430


**Discussion Question:** The `pandas.Series.dt.week` attribute gives the *week-of-the-year* for a given trade date.  My initial idea was to use `.dt.week` and `.dt.year` for my grouping, but I ran into an issue.  Can you think of what the issue was?

In [7]:
##> Weeks at the beginning and end of the year may be partial weeks.
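
To make the issue concrete, here is a small sketch using three dates around the 2012/2013 year end.  It uses `.dt.isocalendar()`, which reports the ISO year and ISO week; in recent pandas versions `.dt.week` is no longer available, so this is the closest substitute.  The dates and the `summary` frame are purely illustrative.

```python
import pandas as pd

# Friday, then Monday and Wednesday of the following trading week.
dates = pd.Series(pd.to_datetime(['2012-12-28', '2012-12-31', '2013-01-02']))

iso = dates.dt.isocalendar()   # DataFrame with columns: year, week, day

summary = pd.DataFrame({
    'trade_date': dates,
    'calendar_year': dates.dt.year,
    'iso_year': iso['year'],
    'iso_week': iso['week'],
})
print(summary)
```

Here 2012-12-31 and 2013-01-02 belong to the same trading week, but grouping on calendar year and week number separates them into (2012, 1) and (2013, 1).  Worse, (2012, 1) would also collect the first trading days of January 2012.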





We can now use `groupby()` to calculate the starting and ending dates for each week.

In [8]:
df_start_end = \
    (
    df_spy
        .groupby('week_num', as_index = False)
        # named aggregation: first and last trade date within each week
        .agg(week_start = ('trade_date', 'min'), week_end = ('trade_date', 'max'))
    )
df_start_end

Unnamed: 0,week_num,week_start,week_end
0,0,2005-01-03,2005-01-07
1,1,2005-01-10,2005-01-14
2,2,2005-01-18,2005-01-21
3,3,2005-01-24,2005-01-28
4,4,2005-01-31,2005-02-04
...,...,...,...
582,582,2016-02-29,2016-03-04
583,583,2016-03-07,2016-03-11
584,584,2016-03-14,2016-03-18
585,585,2016-03-21,2016-03-24


Let's merge this data into `df_spy`.

In [9]:
df_spy = df_spy.merge(df_start_end)
df_spy

Unnamed: 0,ticker,trade_date,week_num,high,low,open,close,volume,adj_close,dly_ret,week_start,week_end
0,SPY,2005-01-03,0,121.760002,119.900002,121.559998,120.300003,55748000.0,85.634529,-0.004727,2005-01-03,2005-01-07
1,SPY,2005-01-04,0,120.540001,118.440002,120.459999,118.830002,69167600.0,84.588089,-0.012295,2005-01-03,2005-01-07
2,SPY,2005-01-05,0,119.250000,118.000000,118.739998,118.010002,65667300.0,84.004417,-0.006925,2005-01-03,2005-01-07
3,SPY,2005-01-06,0,119.150002,118.260002,118.440002,118.610001,47814700.0,84.431496,0.005071,2005-01-03,2005-01-07
4,SPY,2005-01-07,0,119.230003,118.129997,118.970001,118.440002,55847700.0,84.310509,-0.001434,2005-01-03,2005-01-07
...,...,...,...,...,...,...,...,...,...,...,...,...
2826,SPY,2016-03-28,586,203.860001,202.710007,203.610001,203.240005,62408200.0,181.689377,0.000591,2016-03-28,2016-04-01
2827,SPY,2016-03-29,586,205.250000,202.399994,202.759995,205.119995,92922900.0,183.369995,0.009208,2016-03-28,2016-04-01
2828,SPY,2016-03-30,586,206.869995,205.589996,206.300003,206.020004,86365300.0,184.174637,0.004378,2016-03-28,2016-04-01
2829,SPY,2016-03-31,586,206.410004,205.330002,205.910004,205.520004,94584100.0,183.727631,-0.002430,2016-03-28,2016-04-01


## Calculating Weekly Realized Volatility

Now that we have a `week_num` associated with each `trade_date`, we can use `.groupby()` to calculate the realized volatility.

These weekly realized volatilities are the labels that we will be predicting later in our analysis.

In [10]:
df_realized = \
    (
    df_spy
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']].agg(lambda x: np.std(x) * np.sqrt(252))
        .rename(columns = {'dly_ret':'realized_vol'})
    )
df_realized = df_realized[1:]
df_realized

Unnamed: 0,week_num,week_start,week_end,realized_vol
1,1,2005-01-10,2005-01-14,0.093295
2,2,2005-01-18,2005-01-21,0.126557
3,3,2005-01-24,2005-01-28,0.029753
4,4,2005-01-31,2005-02-04,0.069583
5,5,2005-02-07,2005-02-11,0.084567
...,...,...,...,...
582,582,2016-02-29,2016-03-04,0.159055
583,583,2016-03-07,2016-03-11,0.137591
584,584,2016-03-14,2016-03-18,0.057861
585,585,2016-03-21,2016-03-24,0.048135


## Close-to-Close Estimator

Let's now implement the close-to-close estimator.
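
For reference, the formula that the function below implements can be written as follows.  The notation is my own rather than Sepp's: $r_t$ is the daily log return, $T$ is the number of returns in the window, $\bar{r}$ is their mean, and 252 is the assumed number of trading days per year used for annualization.

$$
\hat{\sigma}_{cc} = \sqrt{252} \cdot \sqrt{\frac{1}{T - 1} \sum_{t=1}^{T} \left( r_t - \bar{r} \right)^2},
\qquad
\bar{r} = \frac{1}{T} \sum_{t=1}^{T} r_t
$$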

In [11]:
def close_to_close(r):
    T = r.shape[0]
    r_bar = r.mean()
    vol = np.sqrt((1 / (T - 1)) * ((r - r_bar) ** 2).sum()) * np.sqrt(252)
    return(vol)

Notice that `close_to_close()` is an aggregation function that takes in an array of daily returns and returns a single number.  In order to calculate weekly estimates we use `close_to_close()` as the aggregation function applied to a `.groupby()`.

In [12]:
df_close_to_close = \
    (
    df_spy
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
        .agg(close_to_close)
        .rename(columns = {'dly_ret':'close_to_close'})
    )
df_close_to_close = df_close_to_close[0:-1]
df_close_to_close

Unnamed: 0,week_num,week_start,week_end,close_to_close
0,0,2005-01-03,2005-01-07,0.102492
1,1,2005-01-10,2005-01-14,0.104307
2,2,2005-01-18,2005-01-21,0.146136
3,3,2005-01-24,2005-01-28,0.033265
4,4,2005-01-31,2005-02-04,0.077796
...,...,...,...,...
581,581,2016-02-22,2016-02-26,0.175394
582,582,2016-02-29,2016-03-04,0.177829
583,583,2016-03-07,2016-03-11,0.153831
584,584,2016-03-14,2016-03-18,0.064691
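
Notice the slicing: `df_realized[1:]` drops the first week and `df_close_to_close[0:-1]` drops the last week.  My reading is that this pairs each week's estimate with the *following* week's realized volatility, which is what a forecast needs.  The sketch below illustrates that pairing on a tiny hypothetical frame and shows an equivalent approach using `.shift(-1)`; `df_weekly` and its values are invented for the example.

```python
import pandas as pd

# Hypothetical weekly values, one row per week (not SPY data).
df_weekly = pd.DataFrame({
    'week_num': [0, 1, 2, 3],
    'estimate': [0.10, 0.12, 0.08, 0.15],
    'realized': [0.09, 0.11, 0.07, 0.13],
})

# Slicing approach: estimates from weeks 0..2, realized values from weeks 1..3.
X = df_weekly[['estimate']][0:-1]
y = df_weekly['realized'][1:]

# Shift approach: put next week's realized value on the same row as the estimate,
# then drop the last week, which has nothing to forecast.
df_weekly['realized_next'] = df_weekly['realized'].shift(-1)
df_pairs = df_weekly.dropna(subset = ['realized_next'])

print(df_pairs[['week_num', 'estimate', 'realized_next']])
```

The shift approach keeps the forecast and its target on the same row, which makes the pairing easier to inspect.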


**Discussion Question:** Verify that the `.groupby()` above works just fine without including `week_start` and `week_end`.  If that is the case, then why did I include them?

In [13]:
(
df_spy
    .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
    .agg(close_to_close)
    .rename(columns = {'dly_ret':'close_to_close'})
)

# It makes the code and the dataframe more readable.

Unnamed: 0,week_num,week_start,week_end,close_to_close
0,0,2005-01-03,2005-01-07,0.102492
1,1,2005-01-10,2005-01-14,0.104307
2,2,2005-01-18,2005-01-21,0.146136
3,3,2005-01-24,2005-01-28,0.033265
4,4,2005-01-31,2005-02-04,0.077796
...,...,...,...,...
582,582,2016-02-29,2016-03-04,0.177829
583,583,2016-03-07,2016-03-11,0.153831
584,584,2016-03-14,2016-03-18,0.064691
585,585,2016-03-21,2016-03-24,0.055581


**Code Challenge:** Create an alternative version of our close-to-close function using `np.std()`.  Call the new function `close_to_close_std()`. Verify that your values match.

In [14]:
def close_to_close_std(r):
    vol = np.std(r, ddof = 1) * np.sqrt(252)
    return(vol)

df_std = \
    (
    df_spy
        .groupby(['week_num', 'week_start', 'week_end'], as_index = False)[['dly_ret']]
        .agg(close_to_close_std)
        .rename(columns = {'dly_ret':'close_to_close'})
    )
    
df_std = df_std[:-1]
print(df_std['close_to_close'].sum())
print(df_close_to_close['close_to_close'].sum())

90.6980642016463
90.6980642016463
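
The `ddof` argument is what makes the two versions agree.  `np.std()` divides by $T$ by default (`ddof = 0`), while the close-to-close formula divides by $T - 1$.  The realized-volatility cell earlier calls `np.std()` without `ddof`, so it uses the $T$ divisor.  Here is a small sketch using the first week's returns from the tables above; the variable names are only illustrative.

```python
import numpy as np

# Daily log returns for 2005-01-03 through 2005-01-07, copied from the table above.
r = np.array([-0.004727, -0.012295, -0.006925, 0.005071, -0.001434])
T = len(r)

# Manual close-to-close calculation with the T - 1 divisor.
manual = np.sqrt(((r - r.mean()) ** 2).sum() / (T - 1)) * np.sqrt(252)

# Same thing via np.std with ddof = 1.
sample = np.std(r, ddof = 1) * np.sqrt(252)

# NumPy's default (ddof = 0) divides by T instead.
population = np.std(r) * np.sqrt(252)

print(manual, sample, population)
```

For a five-day week the two conventions differ by a factor of $\sqrt{4/5} \approx 0.894$, which is the gap between the same-week values in the `realized_vol` and `close_to_close` columns.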


In Sepp 2016, the author uses the $R^2$ between the forecasts and the realized labels as a means of assessing the quality of a particular estimator.  Let's utilize `sklearn` to do the same.

We begin by importing the `LinearRegression` constructor and instantiating a model.

In [15]:
from sklearn.linear_model import LinearRegression
mdl_reg = LinearRegression(fit_intercept = True)

Next, let's organize our features and labels.

In [16]:
X = df_close_to_close[['close_to_close']]
y = df_realized['realized_vol']

We can now fit the model.

In [17]:
mdl_reg.fit(X, y)

The `.score()` method of a `LinearRegression` model returns the $R^2$.

In [18]:
mdl_reg.score(X, y)

0.4093645253435927
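
To see what this number represents, here is a NumPy-only sketch of the $R^2$ calculation for a one-feature regression with an intercept.  It uses made-up forecast and realized values, not the SPY data, and `np.polyfit` stands in for the `sklearn` fit.

```python
import numpy as np

# Hypothetical forecast / realized pairs (not SPY data).
x = np.array([0.10, 0.12, 0.08, 0.15, 0.20, 0.11])
y = np.array([0.09, 0.13, 0.07, 0.12, 0.18, 0.12])

# Ordinary least squares fit of y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = ((y - y_hat) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

# With a single feature and an intercept, R^2 equals the squared correlation.
corr_squared = np.corrcoef(x, y)[0, 1] ** 2

print(r_squared, corr_squared)
```

So in this single-feature setting, the $R^2$ can be read as the squared correlation between the forecast and the realized volatility.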

And we can examine the slope and intercept of our model as follows:

In [19]:
print(mdl_reg.intercept_)
print(mdl_reg.coef_)

0.04933384095053199
[0.57068844]


**Discussion Question:** How do our results compare to Sepp's?

In [1]:
# They seem close enough that there probably isn't some error in my calculations.
# The differences probably come down to differences in data.
# I do wish the results were a bit closer to feel totally comfortable.

Let's also measure the bias and efficiency of the close-to-close estimator.

In [20]:
# bias
print(np.mean(df_close_to_close['close_to_close'] - df_realized['realized_vol']))

# efficiency
print(np.std(df_realized['realized_vol']) / np.std(df_close_to_close['close_to_close']))

0.01708041424884867
0.8919571135609281
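
One thing worth checking in the bias calculation: subtracting two pandas `Series` aligns them on their index labels, not on their positions.  Since `df_close_to_close` keeps labels starting at `0` and `df_realized` keeps labels starting at `1`, the subtraction may be comparing same-week values rather than each forecast with the following week's realized volatility.  The sketch below shows the difference on hypothetical data; `est` and `real` are stand-ins, not the actual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical series mimicking the slices above:
# estimates keep labels 0..2, realized values keep labels 1..3.
est = pd.Series([0.10, 0.12, 0.08], index = [0, 1, 2])
real = pd.Series([0.11, 0.07, 0.13], index = [1, 2, 3])

# Label-aligned: only labels 1 and 2 overlap, so same-week values are compared
# and the unmatched labels (0 and 3) become NaN.
diff_by_label = est - real

# Positional: each estimate is paired with the following week's realized value.
diff_by_position = est.to_numpy() - real.to_numpy()

print(diff_by_label.tolist())
print(diff_by_position.tolist())
```

If the forecast-versus-next-week comparison is the one intended, converting both columns with `.to_numpy()` before subtracting is one way to get it.  The regression is not affected, as far as I can tell, because `sklearn` works with the underlying arrays and ignores the index.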
