## Feature Engineering (Continued from `EDA_1`)
**Import required packages & check working directory**

In [1]:
import graph_fun
from model_fun import kendall_rank
import multiprocessing
from multiprocessing.pool import Pool
import numpy as np
import os
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from statsmodels.tsa.stattools import adfuller

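# When the notebook is run from the `src` directory, move up to the project root
# so that the relative data and image paths below resolve
if os.path.basename(os.getcwd()) == "src":
    os.chdir(os.path.dirname(os.getcwd()))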


**User defined parameters**


In [2]:
ticker = "CVX"
stock_data_path = "data/EDA/"
adj_daily_closing_path = "data/adjusted_daily_closing/"
adj_daily_dividend_path = "data/dividends/"
eda_data_path = "data/EDA/"
# lead_days = 5

**Load adjusted daily closing and dividends**

In [3]:
daily_closing = pd.read_csv(os.path.abspath(os.path.join(adj_daily_closing_path, (ticker + ".csv"))))
daily_closing["date"] = pd.to_datetime(daily_closing["date"]).dt.date

daily_dividend = pd.read_csv(os.path.abspath(os.path.join(adj_daily_dividend_path, (ticker + "_ts.csv"))))
daily_dividend["date"] = pd.to_datetime(daily_dividend["date"]).dt.date

adj_closing_df = daily_closing[["date", "close"]].merge(right=daily_dividend,
                                                        how="inner",
                                                        on="date")

adj_closing_df["adj closing"] = (adj_closing_df["close"] - adj_closing_df["amount"]).round(6)
adj_closing_df = adj_closing_df[["date", "adj closing"]]

**Load data from `EDA_1.ipynb`**

In [4]:
calls_df = pd.DataFrame()
puts_df = pd.DataFrame()

for n in ["calls", "puts"]:
    my_df = pd.read_csv(os.path.abspath(os.path.join(eda_data_path, (ticker + f"_{n}_EDA1.csv"))))
    my_df["date"] = pd.to_datetime(my_df["date"]).dt.date
    my_df["expiration date"] = pd.to_datetime(my_df["expiration date"]).dt.date
    if n == "calls":
        calls_df = my_df
    elif n == "puts":
        puts_df = my_df

**Splitting Data Into Training and Testing**

We split the call and put data chronologically: the first 80% of trading dates form the training set and the last 20% form the test set. The split is not shuffled because the data is a time series, so order matters.

In [5]:
all_dates = np.sort(calls_df["date"].unique())

dates_train, dates_test = train_test_split(all_dates, test_size=0.2, random_state=None, shuffle=False)

adj_closing_train = adj_closing_df[adj_closing_df["date"].isin(dates_train)]
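
As a rough illustration of what an unshuffled split does, the sketch below reproduces the idea with plain NumPy: the earliest dates go to training and the latest to testing, so no future observations end up in the training set. The helper `chronological_split` and the toy dates are hypothetical and are not part of the notebook.

```python
import numpy as np


def chronological_split(values, test_size=0.2):
    """Sort the values and hold out the last `test_size` fraction as the test set."""
    values = np.sort(np.asarray(values))
    n_test = int(np.ceil(len(values) * test_size))
    n_train = len(values) - n_test
    return values[:n_train], values[n_train:]


# Ten consecutive toy dates (hypothetical data, not the CVX trading calendar)
toy_dates = np.arange("2016-01-04", "2016-01-14", dtype="datetime64[D]")
toy_train, toy_test = chronological_split(toy_dates)

print(len(toy_train), len(toy_test))
# Every training date precedes every test date
print(toy_train.max() < toy_test.min())
```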

Use an augmented Dickey-Fuller (ADF) test to check whether a unit root is present in the end-of-day prices.

In [6]:
adfuller_og = adfuller(adj_closing_train["adj closing"], regression="c", autolag="AIC")

print('ADF Statistic: %f' % adfuller_og[0])
print('p-value: %f' % adfuller_og[1])
print('Critical Values:')
for key, value in adfuller_og[4].items():
    print('\t%s: %.3f' % (key, value))

ADF Statistic: -1.315451
p-value: 0.622074
Critical Values:
	1%: -3.464
	5%: -2.876
	10%: -2.575


A p-value of 0.622 means we cannot reject the null hypothesis that a unit root is present, so the price series is treated as non-stationary.
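
To make the test statistic less of a black box, here is a minimal sketch of the simple (non-augmented) Dickey-Fuller regression with a constant, written with NumPy only. It regresses the one-step change on the previous level and returns the t-statistic of that coefficient. Unlike `adfuller`, it adds no lagged differences and reports no p-value. The function `dickey_fuller_stat` and the simulated series are assumptions made for illustration.

```python
import numpy as np


def dickey_fuller_stat(y):
    """t-statistic of gamma in: y[t] - y[t-1] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])


rng = np.random.default_rng(0)
shocks = rng.normal(size=500)

# Random walk: has a unit root by construction
random_walk = np.cumsum(shocks)

# AR(1) with coefficient 0.5: stationary by construction
stationary = np.zeros(500)
for t in range(1, 500):
    stationary[t] = 0.5 * stationary[t - 1] + shocks[t]

stat_walk = dickey_fuller_stat(random_walk)
stat_stationary = dickey_fuller_stat(stationary)
print(f"random walk: {stat_walk:.3f}, stationary AR(1): {stat_stationary:.3f}")
```

A strongly negative statistic, well below the critical values printed above, is evidence against a unit root. The stationary series should produce a far more negative value than the random walk.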

Plotting time series, autocorrelation (ACF) and partial ACF to pick appropriate de-trending options.

In [7]:
acf_plot = graph_fun.ts_decompose(ts=adj_closing_train["adj closing"], nlags=30,
                                  dates=adj_closing_train["date"])

acf_plot.show(renderer="browser")

acf_plot.write_image("./img/EDA2_ACF.svg", width=750, height=800)


![ACF](../img/EDA2_ACF.svg)
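
`graph_fun.ts_decompose` belongs to this project and is not shown here. For reference, the quantity an ACF plot displays can be sketched in a few lines of NumPy. The function `sample_acf` below is a hypothetical helper, not the project's implementation. A trending series keeps a high autocorrelation at short lags, which is the usual visual hint that differencing is needed.

```python
import numpy as np


def sample_acf(x, nlags):
    """Sample autocorrelation of x at lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    denom = dev @ dev
    acf = [1.0]
    for k in range(1, nlags + 1):
        # Covariance between the series and itself shifted by k steps
        acf.append((dev[k:] @ dev[:-k]) / denom)
    return np.array(acf)


# A pure linear trend (toy data): autocorrelation stays close to 1 at short lags
trend_acf = sample_acf(np.arange(200.0), nlags=5)
print(np.round(trend_acf, 3))
```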

Next we look for an appropriate ARIMA model for this time series. The plots suggest the model would benefit most from first-order differencing (d = 1). There is no clear seasonality in the data, perhaps because we only have one year of data.

In [8]:
# Difference the series at lags 1 through 20: value at t minus value at t - delta
Y_TRAIN_STAT = adj_closing_train.reset_index(drop=True).copy()

for delta in range(1, 21):
    Y_TRAIN_STAT[f"delta {delta}"] = Y_TRAIN_STAT["adj closing"].diff(periods=delta)

Y_TRAIN_STAT.drop(columns=["adj closing"], inplace=True)

# Show ACF of delta = 1 residuals
delta = 1
acf_plot_l1 = graph_fun.ts_decompose(ts=Y_TRAIN_STAT[f"delta {delta}"].dropna(), nlags=30,
                                     dates=Y_TRAIN_STAT["date"][delta:],
                                     y_label=f"Delta {delta} Adj. Closing")

acf_plot_l1.show(renderer="browser")

acf_plot_l1.write_image("./img/EDA2_ACF2.pdf", width=750, height=800)

![ACF](../img/EDA2_ACF2.pdf)

Use the Dickey-Fuller test to examine stationarity at different differencing lags.

In [9]:
adfuller_df = pd.DataFrame(index=["ADF Stat.", "p-value",
                                  "critical value 1%", "critical value 5%", "critical value 10%"])

for delta in [1, 2, 5, 10, 15, 20]:
    temp_adfuller = adfuller(Y_TRAIN_STAT[f"delta {delta}"].dropna(), regression="c", autolag="AIC")

    adfuller_df[f"delta {delta}"] = [round(temp_adfuller[0], 5),
                                     round(temp_adfuller[1], 5),
                                     round(temp_adfuller[4]["1%"], 5),
                                     round(temp_adfuller[4]["5%"], 5),
                                     round(temp_adfuller[4]["10%"], 5)]

print(adfuller_df)

                     delta 1  delta 2  delta 5  delta 10  delta 15  delta 20
ADF Stat.          -17.17277 -4.82437 -3.40888  -3.87423  -3.40522  -3.22498
p-value              0.00000  0.00005  0.01066   0.00223   0.01078   0.01858
critical value 1%   -3.46382 -3.46601 -3.46742  -3.46742  -3.46660  -3.46763
critical value 5%   -2.87625 -2.87721 -2.87783  -2.87783  -2.87747  -2.87792
critical value 10%  -2.57461 -2.57512 -2.57545  -2.57545  -2.57526  -2.57550


By taking the daily change (d = 1) in end-of-day prices, the dependent variable becomes stationary.

With more data, we can try to find a seasonal ARMA model that fits the residuals.

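As the delta between adjusted EOD prices increases, the p-value tends to increase as well, although not monotonically (delta 10 has a lower p-value than delta 5). Every differenced series tested still rejects the unit-root null at the 5% level, so each is still considered stationary.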
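
One possible reason the longer deltas look less strongly stationary is that consecutive k-day changes overlap: each one is the sum of the k most recent daily changes, so neighbouring values share k - 1 daily moves. The sketch below checks that identity on toy prices (hypothetical numbers, not CVX data).

```python
import numpy as np
import pandas as pd

# Toy closing prices (hypothetical)
prices = pd.Series([100.0, 101.5, 101.0, 103.0, 102.0, 104.5, 105.0])

one_day = prices.diff(periods=1)
three_day = prices.diff(periods=3)

# A 3-day change equals the sum of the three most recent 1-day changes
rolled = one_day.rolling(window=3).sum()

print(pd.DataFrame({"delta 3": three_day, "sum of 3 daily changes": rolled}))
```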


### Feature Engineering

Creating features to help draw relationships between option data and residuals

In [10]:
CALL_FEATS = pd.DataFrame()
PUT_FEATS = pd.DataFrame()

for df in [[calls_df, 1], [puts_df, 2]]:
    temp_df = df[0].copy()
    temp_df = temp_df[temp_df["date"].isin(dates_train)]
    temp_df = temp_df[temp_df["delta interest"] != 0].reset_index(drop=True)
    temp_df["ask er"] = temp_df["ask price"] * temp_df["delta interest"]
    temp_df["bid er"] = temp_df["bid price"] * temp_df["delta interest"]

    if df[1] == 1:
        temp_df["moneyness"] = temp_df["adj closing"] - temp_df["adj strike"]
    else:
        temp_df["moneyness"] = temp_df["adj strike"] - temp_df["adj closing"]

    temp_df["sign"] = np.sign(temp_df["delta interest"])

    temp_df = temp_df[["date", "days till exp", "delta interest", "sign",
                       "moneyness", "ask er", "bid er"]]

    if df[1] == 1:
        CALL_FEATS = temp_df
    else:
        PUT_FEATS = temp_df
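
The sign convention for `moneyness` in the cell above is the only place where calls and puts are treated differently: positive values mean the option is in the money. A small sketch of that convention, using a hypothetical helper and made-up prices:

```python
import numpy as np


def moneyness(closing, strike, option_type):
    """In-the-money amount: closing - strike for calls, strike - closing for puts."""
    closing = np.asarray(closing, dtype=float)
    strike = np.asarray(strike, dtype=float)
    if option_type == "call":
        return closing - strike
    if option_type == "put":
        return strike - closing
    raise ValueError(f"unknown option type: {option_type!r}")


# Stock closes at 100; strikes at 95 and 110 (hypothetical numbers)
closing = [100.0, 100.0]
strikes = [95.0, 110.0]

call_moneyness = moneyness(closing, strikes, "call")
put_moneyness = moneyness(closing, strikes, "put")
print(call_moneyness, put_moneyness)
```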

Fitting linear regressions, one per date and option type, using different targets and weights:
1. Baseline: linear regression of moneyness on days until option expiry
2. `1` + the target is multiplied by the sign of the change in open contracts, so it reflects whether the total number of open contracts increased or decreased
3. `2` + weighted by absolute change in open contracts
4. `2` + weighted by "ask er" as defined above
5. `2` + weighted by "bid er" as defined above
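
Options 3 to 5 differ from option 2 only in the `sample_weight` passed to the fit, which corresponds to weighted least squares: each squared residual is multiplied by its weight before summing. The NumPy sketch below shows the idea by scaling each row by the square root of its weight. The helper `weighted_line_fit` and the data are hypothetical, and this is not a claim about how scikit-learn implements it internally.

```python
import numpy as np


def weighted_line_fit(x, y, weights):
    """Fit y = slope * x + intercept by minimising sum(w * residual**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    X = np.column_stack([x, np.ones_like(x)])
    # Scaling rows by sqrt(w) turns the weighted problem into ordinary least squares
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return coef[0], coef[1]


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 100.0]  # the last point is an outlier

slope_equal, intercept_equal = weighted_line_fit(x, y, [1, 1, 1, 1])
slope_down, intercept_down = weighted_line_fit(x, y, [1, 1, 1, 0])

print(slope_equal, intercept_equal)
print(slope_down, intercept_down)
```

With equal weights the outlier drags the slope up. With a zero weight on the outlier the fit follows the remaining three points, which is the same mechanism by which large changes in open interest dominate the weighted fits.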

In [11]:
LR_fits = []

for date in dates_train:
    for my_input in [[CALL_FEATS, "call"], [PUT_FEATS, "put"]]:
        temp_df = (my_input[0])[(my_input[0])["date"] == date]

        lr_base = linear_model.LinearRegression().fit(X=temp_df[["days till exp"]],
                                                      y=temp_df["moneyness"])

        lr_sign = linear_model.LinearRegression().fit(X=temp_df[["days till exp"]],
                                                      y=temp_df["moneyness"] * temp_df["sign"])

        lr_delta = linear_model.LinearRegression().fit(X=temp_df[["days till exp"]],
                                                       y=temp_df["moneyness"] * temp_df["sign"],
                                                       sample_weight=np.abs(
                                                           temp_df["delta interest"]))

        lr_er_ask = linear_model.LinearRegression().fit(X=temp_df[["days till exp"]],
                                                        y=temp_df["moneyness"] * temp_df["sign"],
                                                        sample_weight=np.abs(
                                                            temp_df["ask er"]))

        lr_er_bid = linear_model.LinearRegression().fit(X=temp_df[["days till exp"]],
                                                        y=temp_df["moneyness"] * temp_df["sign"],
                                                        sample_weight=np.abs(
                                                            temp_df["bid er"]))

        temp_fits = [date,
                     lr_base.coef_[0], lr_base.intercept_,
                     lr_sign.coef_[0], lr_sign.intercept_,
                     lr_delta.coef_[0], lr_delta.intercept_,
                     lr_er_ask.coef_[0], lr_er_ask.intercept_,
                     lr_er_bid.coef_[0], lr_er_bid.intercept_]

        # Tag the row with its option type ("call" or "put")
        temp_fits.append(my_input[1])
        LR_fits.append(temp_fits)

LR_fits = pd.DataFrame(LR_fits, columns=["date",
                                         "baseline_s", "baseline_i",
                                         "sign_s", "sign_i",
                                         "weighted_delta_s", "weighted_delta_i",
                                         "weighted_ask_er_s", "weighted_ask_er_i",
                                         "weighted_bid_er_s", "weighted_bid_er_i",
                                         "option type"])

In [12]:
LR_fits.head()

| | date | baseline_s | baseline_i | sign_s | sign_i | weighted_delta_s | weighted_delta_i | weighted_ask_er_s | weighted_ask_er_i | weighted_bid_er_s | weighted_bid_er_i | option type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-04 | -0.015845 | -3.106311 | -0.023524 | 0.97247 | -0.039414 | -2.574678 | -0.019102 | -3.563837 | -0.018223 | -3.605539 | call |
| 1 | 2016-01-04 | -0.010023 | -4.303141 | -0.00866 | -3.167798 | -0.036403 | -0.819443 | 0.010782 | -2.786647 | 0.012764 | -2.964281 | put |
| 2 | 2016-01-05 | -0.011184 | -2.410562 | 0.002616 | -1.060309 | 0.004163 | -2.79729 | 0.018259 | -0.835753 | 0.018324 | -0.718303 | call |
| 3 | 2016-01-05 | 0.010021 | -4.839111 | 0.009236 | -3.776612 | -0.076194 | 6.170687 | -0.061745 | 10.858818 | -0.062534 | 10.882176 | put |
| 4 | 2016-01-06 | -0.00984 | -3.651803 | -0.007106 | 0.89967 | -0.011108 | -1.816048 | 0.015329 | -2.094608 | 0.015464 | -2.102186 | call |


### Kendall rank correlations for each type of slope / intercept coefficient

In [13]:
input_list = []
lr_cols = list(LR_fits.columns)
lr_cols.remove("date")
lr_cols.remove("option type")
delta_cols = list(Y_TRAIN_STAT.columns)
delta_cols.remove("date")

for option_type in ["call", "put"]:
    for col1 in lr_cols:
        temp_lr = LR_fits[LR_fits["option type"] == option_type][["date", col1]]
        for col2 in delta_cols:
            temp_delta = Y_TRAIN_STAT[["date", col2]].dropna()
            temp_joined = temp_lr.merge(temp_delta, how="inner", on="date")
            temp_joined.drop(columns="date", inplace=True)
            input_list.append([temp_joined, option_type])


In [14]:
# The context manager shuts the worker pool down once the map has completed
with Pool(multiprocessing.cpu_count()) as my_pool:
    results = my_pool.map(kendall_rank, input_list)

# Make sure column names correspond to the order returned by the function
RESULTS_DF = pd.DataFrame(data=results, columns=["tau", "pval", "id1", "id2", "option type"])
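
`kendall_rank` lives in the project's `model_fun` module and is not shown here. As a reminder of what the statistic measures, the sketch below computes Kendall's tau-a in plain Python: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. It ignores the tie correction used by tau-b, so it is an illustration of the idea rather than a stand-in for the project's function.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1  # both variables move in the same direction
        elif product < 0:
            discordant += 1  # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


print(kendall_tau_a([1, 2, 3], [1, 2, 3]))
print(kendall_tau_a([1, 2, 3], [3, 2, 1]))
print(kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4]))
```

Because only the ordering of the values matters, the statistic is insensitive to the scale of the regression coefficients and of the price changes they are compared against.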

### Violin plot to visualize the efficacy of different metrics

In [15]:
tau_violin = go.Figure()

tau_violin.add_trace(go.Violin(x=RESULTS_DF["id1"][RESULTS_DF["option type"] == "call"],
                               y=RESULTS_DF["tau"][RESULTS_DF["option type"] == "call"],
                               name="call",
                               side="positive", line={"color": "orange"}))

tau_violin.add_trace(go.Violin(x=RESULTS_DF["id1"][RESULTS_DF["option type"] == "put"],
                               y=RESULTS_DF["tau"][RESULTS_DF["option type"] == "put"],
                               name="put",
                               side="negative", line={"color": "blue"}))

tau_violin.update_traces(meanline_visible=True)
tau_violin.update_layout(violingap=0.1, violinmode='overlay',
                         title="Kendall Tau Correlation Distributions of Various Metrics",
                         yaxis_title="Kendall Tau Correlation",
                         font={"size": 14})

tau_violin.show(renderer="browser")

tau_violin.write_image("./img/EDA2_tau_violin.pdf", width=1200, height=800)


![violin](../img/EDA2_tau_violin.pdf)

### Line plot to visualize the effects of increasing lag on correlation

In [16]:
tau_scatter = make_subplots(rows=1, cols=2, shared_yaxes=True,
                            subplot_titles=("Call", "Put"))

for option_type in ["call", "put"]:
    for metric in np.unique(RESULTS_DF["id1"]):
        if option_type == "call":
            ncol = 1
        else:
            ncol = 2
        tau_scatter.add_trace(
            go.Scatter(x=RESULTS_DF[(RESULTS_DF["option type"] == option_type) & (RESULTS_DF["id1"] == metric)]["id2"],
                       y=RESULTS_DF[(RESULTS_DF["option type"] == option_type) & (RESULTS_DF["id1"] == metric)]["tau"],
                       name=metric, mode="lines+markers"),
            row=1, col=ncol)


tau_scatter.update_layout(title="Kendall Tau Correlation vs. Lag of Various Metrics",
                          yaxis_title="Kendall Tau Correlation",
                          font={"size": 14})

tau_scatter.show(renderer="browser")

tau_scatter.write_image("./img/EDA2_tau_scatter.pdf", width=1600, height=800)


![scatter](../img/EDA2_tau_scatter.pdf)

From the above, we see that `baseline_i` (for both calls and puts), call `sign_i`, and put `baseline_s` perform best. Most features on the put side have an average tau of ~0.05.

We also notice that, aside from call `baseline_i`, the other three "significant" features correlate more strongly the further out we try to predict. This is counter-intuitive, because we would expect the predictive ability of the data to decay as the horizon grows.

The poor performance of metrics derived from engineered features (e.g. linear regression weighted by `|change in open contracts|`, or the more complicated `|change in open contracts * ask/bid price|`) could be due to systematic irregularities in how options are traded. Some possible explanations:
- Some options are bought to "lock in" gains in an investor's portfolio. For example, an investor who recently invested in stock `ABC` may want to cash out while avoiding short-term capital gains tax. They can achieve this by selling call options or buying put options. These trades are usually large (since the investor wants to cover all of their holdings) and do not reflect the short-term "sentiment" on stock `ABC`.
- Looking at the net change in open interest on various days, I noticed that the values are usually inflated shortly before ex-dividend dates.
    - For stock `CVX` (Chevron Corporation), there were spikes on 04-28, 05-16, 08-16 and 11-15 in 2016. The ex-dividend dates were 02-16, 05-17, 08-17 and 11-16. This happens with dividend-heavy stocks, as is the case with `CVX`.
    - This could be because investors want to collect the dividends, because other quant firms trade dividend events, etc.
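
The timing claim can be checked directly from the dates quoted above. The sketch below, using only the standard library, computes the number of calendar days from each open-interest spike to the next ex-dividend date. The helper `days_to_next_event` is a hypothetical name introduced for this check.

```python
from bisect import bisect_left
from datetime import date

# Dates as listed in the text above (all in 2016)
spikes = [date(2016, 4, 28), date(2016, 5, 16), date(2016, 8, 16), date(2016, 11, 15)]
ex_dividend = [date(2016, 2, 16), date(2016, 5, 17), date(2016, 8, 17), date(2016, 11, 16)]


def days_to_next_event(day, events):
    """Calendar days from `day` to the first event on or after it (None if none)."""
    idx = bisect_left(events, day)
    if idx == len(events):
        return None
    return (events[idx] - day).days


gaps = [days_to_next_event(spike, ex_dividend) for spike in spikes]
print(gaps)  # → [19, 1, 1, 1]
```

Three of the four spikes fall one calendar day before an ex-dividend date. The 04-28 spike is 19 days ahead of the next listed ex-dividend date, so as the dates are listed it does not fit the same pattern.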