# Option Volume: Feature Selection and Hyperparameter Tuning

In this chapter we continue our prediction of option volume rank with the data that we wrangled in the previous chapter.  In particular, we will:

1. Fit a linear regression to four features.
1. Fit a lasso regression to our features to find the ones that have predictive power.
1. Find the optimal `n_neighbors` for a `KNeighborsRegressor`.
1. Run our backtest with the `KNeighborsRegressor` using the optimal `n_neighbors`.
1. Find the optimal `max_depth` for a `RandomForestRegressor`.
1. Run our backtest with the `RandomForestRegressor` using the optimal `max_depth`.

## Import Packages

Let's begin by importing the packages that we will need.

In [None]:
import pandas as pd
import numpy as np

## Reading-In Data

Next, let's read in our training data and backtest data.

In [None]:
df_train = pd.read_csv('../data/option_train_2017.csv')
df_test = pd.read_csv('../data/option_test_2018.csv')

## Starting Features

We will start with the following four features.

In [None]:
features = ['iv_change_one_lag', 'scaled_return_one_lag', 'rank_one_lag', 'rank_change_one_lag']

## Linear Regression

We can now run a linear regression with these features.

In [None]:
from sklearn.linear_model import LinearRegression
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
linear_regression = LinearRegression()
linear_regression.fit(df_features, np.ravel(df_label.values))
linear_regression.score(df_features, df_label)

0.3654074977254579

Let's check the parameters and see if they make intuitive sense.

In [None]:
df_linear_regression_coefficients = \
    pd.DataFrame({
        'feature':features,
        'coefficient':linear_regression.coef_
    })
df_linear_regression_coefficients

                 feature  coefficient
0      iv_change_one_lag     0.010543
1  scaled_return_one_lag    -1.208765
2           rank_one_lag     0.680732
3    rank_change_one_lag    -0.274649


**Interpretation:**
1. `iv_change_one_lag` - a positive change in implied vol could be caused by the supply/demand effects of increased option buying, which could carry through to the following day.
1. `scaled_return_one_lag` - when a stock goes down, holders of long positions in the stock may become fearful (or greedy), and option buying increases.
1. `rank_one_lag` - if an underlying has a high rank one day, it will likely have a high rank the next day.
1. `rank_change_one_lag` - if an underlying has a jump in volume one day, it will usually revert to its previous level the next day.

## Feature Selection Using Lasso

Lasso regression is a linear regression technique that minimizes an objective function that involves residual-sum-of-squares and also the magnitude of the regression coefficients.

In particular, it penalizes the objective for the collective magnitude of the regression coefficients.  This has the effect of making the coefficients of the non-predictive features equal to zero.
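
For reference, the objective that scikit-learn's `Lasso` minimizes can be written as follows. The symbols are introduced here only for illustration: $X$ is the feature matrix, $y$ the label vector, $w$ the coefficient vector, $n_{\text{samples}}$ the number of training observations, and $\alpha$ the penalty weight.

$$
\min_{w} \; \frac{1}{2\, n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
$$

The first term is the (scaled) residual-sum-of-squares and the second is the penalty on the collective magnitude of the coefficients. Setting $\alpha = 0$ recovers ordinary least squares.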

Thus, lasso regression can be a way of weeding out non-predictive features.
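
To see this zeroing-out effect in isolation, here is a minimal sketch on synthetic data (not our option data). The names `signal`, `noise_feature`, and `lasso_demo` are made up for this illustration: the label depends only on `signal`, so we would expect the lasso to drop `noise_feature`.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# synthetic data (an assumption for illustration, not the option data):
# the label depends on `signal` only; `noise_feature` is unrelated to it
rng = np.random.default_rng(0)
n_obs = 2000
signal = rng.normal(size=n_obs)
noise_feature = rng.normal(size=n_obs)
X = np.column_stack([signal, noise_feature])
y = 2.0 * signal + rng.normal(size=n_obs)

# ordinary least squares gives the noise feature a small non-zero coefficient
ols = LinearRegression().fit(X, y)
# the lasso penalty shrinks both coefficients and sets the noise one to zero
lasso_demo = Lasso(alpha=0.10).fit(X, y)

print('OLS coefficients:  ', ols.coef_)
print('lasso coefficients:', lasso_demo.coef_)
```

Note that the lasso also shrinks the coefficient of the predictive feature slightly toward zero, which is the price paid for the penalty.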

Let's next fit a lasso regression to our four features.

In [None]:
from sklearn.linear_model import Lasso
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
lasso = Lasso(alpha=0.10)
lasso.fit(df_features, np.ravel(df_label.values))
lasso.score(df_features, df_label)

0.3654059332222376

We can now examine the coefficients.  Notice that `iv_change_one_lag` has a coefficient of 0, which suggests that it adds little predictive power beyond the other three features.

In [None]:
df_lasso_coefficients = \
    pd.DataFrame({
        'feature':features,
        'coefficient':lasso.coef_
    })
df_lasso_coefficients

                 feature  coefficient
0      iv_change_one_lag     0.000000
1  scaled_return_one_lag    -1.091635
2           rank_one_lag     0.680723
3    rank_change_one_lag    -0.274577


The `alpha` hyperparameter controls how much the coefficient sizes are penalized.  We can use cross-validation to choose the optimal level of `alpha`.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Lasso
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
alphas = np.linspace(0.1, 1, 10)
for ix_alpha in alphas:
    lasso = Lasso(alpha=ix_alpha)
    cvs = cross_val_score(lasso, df_features, df_label, cv=10)
    print(np.round(ix_alpha, 2), cvs.mean())

0.1 0.3565912774339949
0.2 0.35658669987577574
0.3 0.35657893041868144
0.4 0.35656800276703626
0.5 0.35655390432426176
0.6 0.35653663509035816
0.7 0.35651619506532517
0.8 0.3564925842491631
0.9 0.3564673297611635
1.0 0.3564481756197352


In our case, the cross-validated $R^2$ barely changes across this range of `alpha`, so we'll keep our original choice of `alpha=0.10`.
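
The same kind of search can also be written with scikit-learn's `GridSearchCV`, which runs the cross-validation for every candidate and records the best one. The sketch below uses a synthetic stand-in for our features and label (the arrays `X` and `y` and the coefficients used to generate them are assumptions for illustration), so its numbers will not match the ones above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for df_features / df_label (assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.0, -1.2, 0.7, -0.3]) + rng.normal(size=500)

# same grid of candidate alphas as in the loop above
alphas = np.linspace(0.1, 1, 10)

# 10-fold cross-validation for each alpha; the default score for a regressor is R^2
grid = GridSearchCV(Lasso(), param_grid={'alpha': alphas}, cv=10)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

The per-candidate averages are available in `grid.cv_results_['mean_test_score']`, which corresponds to the column of means that our loop prints.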

## Selecting Predictive Features

We can remove `iv_change_one_lag` as our lasso regression showed that it has low predictive power.

In [None]:
features = ['scaled_return_one_lag', 'rank_one_lag', 'rank_change_one_lag']
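
With only four features it is easy to drop one by hand, but the same selection can be done programmatically. Here is a small sketch that hard-codes the lasso coefficients printed earlier; the names `all_features`, `lasso_coef`, and `selected_features` are introduced just for this example.

```python
import numpy as np

# the four starting features and their lasso coefficients (copied from the output above)
all_features = ['iv_change_one_lag', 'scaled_return_one_lag', 'rank_one_lag', 'rank_change_one_lag']
lasso_coef = np.array([0.0, -1.091635, 0.680723, -0.274577])

# keep only the features whose lasso coefficient is non-zero
selected_features = [f for f, c in zip(all_features, lasso_coef) if c != 0]
print(selected_features)
```

In a live notebook, `lasso_coef` would presumably be replaced by the `coef_` attribute of the fitted model.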

## User Defined Functions

Let's create the user-defined functions we will need to use our top-$n$ metric in our backtest.  These functions were introduced in the previous chapter.

In [None]:
def top_n_volume(n):
    df_test = pd.read_csv("../data/option_test_2018.csv")
    df_top_n_volume = \
    (
    df_test
        .query('daily_volume_rank <= @n')
        .groupby(['quotedate'])[['totalvol']].sum()
        .reset_index()
        .rename(columns={'totalvol':'top_' + str(n) + '_volume'})
    )
    return(df_top_n_volume)

In [None]:
def calc_top_n_ratio(n, trade_date, df_test, model=None, features=[]):
    
    # grabbing top-n volume for each day in backtest
    df_top_n = top_n_volume(n)
    
    # grabbing feature observations for trade_date
    df_prediction = df_test.query('quotedate == @trade_date').copy()
    
    # selecting the feature columns that the model expects
    df_X = df_prediction[features]
    
    # calculating model predictions
    if model is not None:
        df_prediction['prediction'] = model.predict(df_X) # predictions based on model
    else:
        df_prediction['prediction'] = df_prediction['rank_one_lag'] # simple-rule based predictor
    
    # sorting by predicted rank
    df_prediction = df_prediction.sort_values(['prediction'])
    # calculating predicted top-n volume
    predicted_top_n_volume = df_prediction.head(n)['totalvol'].sum()
    # querying for actual top-n volume
    actual_top_n_volume = df_top_n.query('quotedate == @trade_date')['top_' + str(n) + '_volume'].values[0]
    
    # return the top-n-ratio
    return(predicted_top_n_volume / actual_top_n_volume)
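
To make the top-$n$ ratio concrete, here is a toy one-day calculation with $n = 2$. The tickers, volumes, and predictions are all made up for illustration; only the column names `totalvol` and `daily_volume_rank` come from our data.

```python
import pandas as pd

# hypothetical one-day example (tickers, volumes, and predictions are made up)
df_day = pd.DataFrame({
    'underlying': ['AAA', 'BBB', 'CCC', 'DDD'],
    'totalvol': [100, 80, 50, 20],
    'daily_volume_rank': [1, 2, 3, 4],   # actual rank (1 = most volume)
    'prediction': [1.2, 2.9, 2.1, 3.8],  # predicted rank from some model
})
n = 2

# volume captured by the n underlyings with the lowest predicted rank
predicted_top_n_volume = df_day.sort_values('prediction').head(n)['totalvol'].sum()
# volume of the n underlyings that actually had the most volume
actual_top_n_volume = df_day[df_day['daily_volume_rank'] <= n]['totalvol'].sum()

top_n_ratio = predicted_top_n_volume / actual_top_n_volume
print(round(top_n_ratio, 3))
```

Here the model picks AAA and CCC (volume 150) while the true top two are AAA and BBB (volume 180), so the ratio is 150 / 180. A perfect prediction of the top-$n$ set gives a ratio of 1.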

In [None]:
def backtest(n, df_test, model=None, features=[]):
    # all trade dates in backtest period
    trade_dates = df_test['quotedate'].unique().tolist()
    
    # calculating all top-n ratios
    top_n_ratios = []
    for ix_trade_date in trade_dates:
        top_n_ratios.append(calc_top_n_ratio(n, ix_trade_date, df_test, model, features))

    # creating a dataframe of daily top-n ratios
    df_daily = pd.DataFrame({
        'trade_date':trade_dates,
        'top_'+str(n)+'_volume': np.round(top_n_ratios, 3),
    })

    # calculating summary statistics of top-n ratios during backtest period
    df_stats = pd.DataFrame({
        'model':[str(model)],
        'average':[np.mean(top_n_ratios).round(3)],
        'std_dev':[np.std(top_n_ratios).round(3)],
        'minimum':[np.min(top_n_ratios).round(3)],
        'maximum':[np.max(top_n_ratios).round(3)],
    })

    return([df_daily, df_stats])

## K Nearest Neighbors

In this section we'll fit a `KNeighborsRegressor` to our training data and see how it performs during the backtest period.

First, let's use a 10-fold cross validation (using $R^2$ as the metric) to determine the optimal value of `n_neighbors`.
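
We don't pass a `scoring` argument to `cross_val_score` in this chapter. For a regressor, the default is the estimator's own `score` method, which is $R^2$. The sketch below checks this on synthetic data (the arrays `X` and `y` and the name `knn_demo` are assumptions for illustration) by comparing the default against an explicit `scoring='r2'`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data (assumption for illustration, not the option data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300)

knn_demo = KNeighborsRegressor(n_neighbors=20)

# default scoring: falls back to knn_demo.score, i.e. R^2 for regressors
default_scores = cross_val_score(knn_demo, X, y, cv=10)
# the same cross-validation with the metric named explicitly
r2_scores = cross_val_score(knn_demo, X, y, cv=10, scoring='r2')

print(np.allclose(default_scores, r2_scores))
```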

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
k = range(100, 1100, 100)
for ix_k in k:
    knn = KNeighborsRegressor(n_neighbors=ix_k)
    cvs = cross_val_score(knn, df_features, df_label, cv = 10)
    print(ix_k, cvs.mean())

100 0.36101494817841573
200 0.36386843260646606
300 0.3646175979349766
400 0.36480461178161694
500 0.3647867229860612
600 0.3646215082035253
700 0.3644970083233438
800 0.36423480801137564
900 0.36403894419896105
1000 0.3638359335614717


The model doesn't seem particularly sensitive to the value of `n_neighbors`, so let's just use 400 because it had the highest $R^2$ and the run-time seems reasonable.

Next, let's fit a `KNeighborsRegressor` to the entirety of our training data using `n_neighbors=400`.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
knn = KNeighborsRegressor(n_neighbors=400)
knn.fit(df_features, np.ravel(df_label.values))
knn.score(df_features, df_label)

0.37700389902243137

We can now use our fitted model to perform our backtest using the top-10 ratio as our metric for success.

In [None]:
backtest(10, df_test, knn, features)

[    trade_date  top_10_volume
 0   2018-01-05          0.831
 1   2018-01-08          0.577
 2   2018-01-09          0.659
 3   2018-01-10          0.421
 4   2018-01-11          0.532
 5   2018-01-12          0.309
 6   2018-01-16          0.599
 7   2018-01-17          0.467
 8   2018-01-18          0.398
 9   2018-01-19          0.625
 10  2018-01-22          0.708
 11  2018-01-23          0.657
 12  2018-01-24          0.496
 13  2018-01-25          0.616
 14  2018-01-26          0.763
 15  2018-01-29          0.608
 16  2018-01-30          0.471
 17  2018-01-31          0.690,
                                   model  average  std_dev  minimum  maximum
 0  KNeighborsRegressor(n_neighbors=400)    0.579    0.131    0.309    0.831]

As we can see, our KNN model actually performs worse than the simple rule-based predictor that we introduced in the previous chapter.

## Random Forest

In this section we'll run our backtest using a `RandomForestRegressor`.  I've already run a cross-validation analysis showing that `n_estimators=10` offers a good trade-off between performance and run time.

Let's find an optimal value of `max_depth` using a 10-fold cross validation.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
d = range(1, 21, 1)
for ix_d in d:
    random_forest = RandomForestRegressor(n_estimators=10, max_depth=ix_d)
    cvs = cross_val_score(random_forest, df_features, np.ravel(df_label.values), cv = 10)
    print(ix_d, cvs.mean())

1 0.21323542515349816
2 0.28667668385608386
3 0.32823705965447214
4 0.350204904212642
5 0.3604337950155293
6 0.3643107573903573
7 0.36517895534208716
8 0.3638805625055838
9 0.3627851611360925
10 0.3582779076013644
11 0.3539930070013009
12 0.34701968731147276
13 0.34101687956906235
14 0.3326469340556921
15 0.3248038486930812
16 0.31443952345817633
17 0.3059982711930008
18 0.29436642171242966
19 0.2866621073083838
20 0.27945630028663293


Based on our cross-validation analysis above, let's use `max_depth=7` to train our model.
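
The rise and then fall of the cross-validated $R^2$ as `max_depth` grows is the usual signature of overfitting: deeper trees keep fitting the training data better while generalizing worse. One way to see both sides at once is scikit-learn's `validation_curve`, sketched here on synthetic data (the arrays `X` and `y` and the list `depths` are assumptions for illustration, so the numbers will differ from ours).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# synthetic regression data (assumption for illustration, not the option data)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=400)

depths = [1, 3, 5, 10, 20]

# for each max_depth, returns the R^2 on the training folds and on the held-out folds
train_scores, cv_scores = validation_curve(
    RandomForestRegressor(n_estimators=10, random_state=0),
    X, y,
    param_name='max_depth',
    param_range=depths,
    cv=5,
)

# average over the 5 folds and print: depth, training R^2, cross-validated R^2
for depth, train_r2, cv_r2 in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(depth, round(train_r2, 3), round(cv_r2, 3))
```

A widening gap between the training score and the cross-validated score at large depths would suggest that the extra depth is fitting noise.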

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
random_forest = RandomForestRegressor(n_estimators = 10, max_depth=7)
random_forest.fit(df_features, np.ravel(df_label.values))
random_forest.score(df_features, df_label)

0.387368629069925

We can now run our backtest using the top-10 ratio as our measure of success.

In [None]:
backtest(10, df_test, random_forest, features)

[    trade_date  top_10_volume
 0   2018-01-05          0.831
 1   2018-01-08          0.577
 2   2018-01-09          0.694
 3   2018-01-10          0.421
 4   2018-01-11          0.532
 5   2018-01-12          0.338
 6   2018-01-16          0.599
 7   2018-01-17          0.511
 8   2018-01-18          0.398
 9   2018-01-19          0.625
 10  2018-01-22          0.728
 11  2018-01-23          0.571
 12  2018-01-24          0.496
 13  2018-01-25          0.661
 14  2018-01-26          0.763
 15  2018-01-29          0.576
 16  2018-01-30          0.471
 17  2018-01-31          0.690,
                                                model  average  std_dev  minimum  maximum
 0  RandomForestRegressor(max_depth=7, n_estimator...    0.582    0.128    0.338    0.831]

As we can see, our random forest model also underperforms relative to our simple rule-based model.