# Option Volume: **xgboost**

XGBoost is a general framework for constructing gradient boosted trees; the Python implementation is a package called **xgboost**.

This chapter is a continuation of the option volume prediction work we have been doing in previous chapters.  In particular, we show how to use **xgboost** in that context.

## Importing Packages

Let's begin by importing the packages that we will need.

In [None]:
import pandas as pd
import numpy as np
import sklearn.metrics  # ensures sklearn.metrics.r2_score is loaded for later use
from xgboost import XGBRegressor
import warnings
warnings.filterwarnings('ignore')

## Reading-In Data

Next, let's read in our training and testing data.

In [None]:
df_train = pd.read_csv('../data/option_train_2017.csv')
df_test = pd.read_csv('../data/option_test_2018.csv')

## Feature Selection

For this exercise we will use all the features available in our training data set.

In [None]:
features = ['iv_change_one_lag', 'iv_change_two_lag', 'scaled_return_one_lag', 
            'scaled_return_two_lag', 'rank_one_lag', 'rank_two_lag',
            'rank_change_one_lag', 'rank_change_two_lag',]

## User Defined Functions

In this section we define the three custom functions that are needed to execute our backtest.  These functions were introduced in a previous chapter.

In [None]:
def top_n_volume(n):
    # daily total volume of the rows whose actual daily_volume_rank is in the top n
    df_test = pd.read_csv("../data/option_test_2018.csv")
    df_top_n_volume = \
    (
    df_test
        .query('daily_volume_rank <= @n')
        .groupby(['quotedate'])[['totalvol']].sum()
        .reset_index()
        .rename(columns={'totalvol':'top_' + str(n) + '_volume'})
    )
    return(df_top_n_volume)

In [None]:
def calc_top_n_ratio(n, trade_date, df_test, model=None, features=[]):
    
    # grabbing top-n volume for each day in backtest
    df_top_n = top_n_volume(n)
    
    # grabbing feature observations for trade_date
    df_prediction = df_test.query('quotedate == @trade_date').copy()
    
    # selecting features from df_prediction
    df_X = df_prediction[features]
    
    # calculating model predictions
    if model is not None:
        df_prediction['prediction'] = model.predict(df_X) # predictions based on model
    else:
        df_prediction['prediction'] = df_prediction['rank_one_lag'] # simple-rule based predictor
    
    # sorting by predicted rank
    df_prediction = df_prediction.sort_values(['prediction'])
    # calculating predicted top-n volume
    predicted_top_n_volume = df_prediction.head(n)['totalvol'].sum()
    # querying for actual top-n volume
    actual_top_n_volume = df_top_n.query('quotedate == @trade_date')['top_' + str(n) + '_volume'].values[0]
    
    # return the top-n-ratio
    return(predicted_top_n_volume / actual_top_n_volume)

In [None]:
def backtest(n, df_test, model=None, features=[]):
    # all trade dates in backtest period
    trade_dates = df_test['quotedate'].unique().tolist()
    
    # calculating all top-n ratios
    top_n_ratios = []
    for ix_trade_date in trade_dates:
        top_n_ratios.append(calc_top_n_ratio(n, ix_trade_date, df_test, model, features))

    # creating a dataframe of daily top-n ratios
    df_daily = pd.DataFrame({
        'trade_date':trade_dates,
        'top_'+str(n)+'_volume': np.round(top_n_ratios, 3),
    })

    # calculating summary statistics of top-n ratios during backtest period
    df_stats = pd.DataFrame({
        'model':[str(model)],
        'average':[np.mean(top_n_ratios).round(3)],
        'std_dev':[np.std(top_n_ratios).round(3)],
        'minimum':[np.min(top_n_ratios).round(3)],
        'maximum':[np.max(top_n_ratios).round(3)],
    })

    return([df_daily, df_stats])
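To make the top-n ratio concrete, here is a minimal sketch on a hypothetical one-day snapshot. The column names `totalvol` and `daily_volume_rank` follow the functions above; the `underlying` column, the tickers, and all of the numbers are made up for illustration, and a lower `prediction` is taken to mean a higher predicted rank.

```python
import pandas as pd

# hypothetical single-day data; values are illustrative only
df_day = pd.DataFrame({
    'underlying': ['A', 'B', 'C', 'D'],    # hypothetical tickers
    'totalvol': [100, 80, 50, 20],         # actual volume
    'daily_volume_rank': [1, 2, 3, 4],     # actual rank by volume
    'prediction': [1.2, 3.1, 1.9, 3.5],    # predicted rank (lower = more volume)
})
n = 2

# volume of the true top-n rows: A and B
actual_top_n_volume = df_day.query('daily_volume_rank <= @n')['totalvol'].sum()

# volume of the rows the predictor placed in the top n: A and C
predicted_top_n_volume = df_day.sort_values('prediction').head(n)['totalvol'].sum()

top_n_ratio = float(predicted_top_n_volume / actual_top_n_volume)
print(round(top_n_ratio, 3))  # 0.833
```

A ratio of 1 would mean the predictor picked exactly the n highest-volume rows for that day; the backtest averages this ratio over all trade dates.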

## Hyperparameter Tuning

The `learning_rate` is the shrinkage factor applied to each new tree's contribution as successive trees are boosted; a lower `learning_rate` amounts to slower learning.
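The effect of the learning rate can be illustrated with a deliberately simplified boosting loop. This is a sketch, not **xgboost**'s actual algorithm: it boosts depth-1 trees (stumps) on squared error with plain NumPy, on synthetic data, and the helper names `fit_stump` and `boost` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on x for squared error: (threshold, left_mean, right_mean)."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_rounds, learning_rate):
    """Start from the mean, then repeatedly add a shrunken stump fit to the residuals."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        t, left_mean, right_mean = fit_stump(x, y - pred)
        pred += learning_rate * np.where(x <= t, left_mean, right_mean)
    return pred

# synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

# same number of boosting rounds, different learning rates
mse_slow = float(np.mean((y - boost(x, y, 10, 0.1)) ** 2))
mse_fast = float(np.mean((y - boost(x, y, 10, 0.5)) ** 2))
print(mse_slow, mse_fast)
```

With the same number of rounds, the smaller learning rate leaves a larger training error, which is the sense in which it learns more slowly. In practice a smaller learning rate is usually paired with more trees.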

Here we use 5-fold cross-validation to select the optimal `learning_rate`, with $R^2$ as our goodness-of-fit metric.
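For intuition about what a cross-validated $R^2$ measures, the sketch below computes it by hand. It is only an illustration under stated assumptions: the data are synthetic, the model is a straight-line fit via `np.polyfit` rather than a boosted tree model, and the folds are shuffled here, whereas `cross_val_score` with `cv=5` uses contiguous, unshuffled folds for regressors by default.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# synthetic data, for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 100)

# split the row indices into 5 folds
folds = np.array_split(rng.permutation(100), 5)

scores = []
for i, test_idx in enumerate(folds):
    # train on the other four folds, score on the held-out fold
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
    scores.append(r_squared(y[test_idx], slope * x[test_idx] + intercept))

mean_score = float(np.mean(scores))
print(mean_score)
```

The mean of the five held-out scores plays the same role as `cvs.mean()` in the tuning loop.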

In [None]:
from sklearn.model_selection import cross_val_score
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
# candidate learning rates: 0.1, 0.2, ..., 1.0
alphas = np.linspace(0.1, 1, 10)
for ix_alpha in alphas:
    xgb_model = XGBRegressor(n_estimators=25, max_depth=3, learning_rate=ix_alpha, random_state=0)
    # R^2 on each of the 5 held-out folds (the default score for regressors)
    cvs = cross_val_score(xgb_model, df_features, df_label, cv=5)
    print(np.round(ix_alpha, 2), cvs.mean())

0.1 0.36246615782763447
0.2 0.38969475875156323
0.3 0.3910816017808643
0.4 0.38967754013651945
0.5 0.38808813540767895
0.6 0.38692635332836656
0.7 0.38474370336623864
0.8 0.3841157442722449
0.9 0.38194061118381856
1.0 0.37935841896562783


As we can see, `learning_rate = 0.3` yields the highest cross-validated $R^2$ (about 0.391).
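Rather than reading the best value off the printout, the selection can also be done programmatically. The sketch below hard-codes the mean scores printed above into a hypothetical `cv_scores` dictionary; in the tuning loop one could collect them the same way instead of only printing.

```python
# mean 5-fold R^2 for each learning_rate, copied from the output above
cv_scores = {
    0.1: 0.36246615782763447,
    0.2: 0.38969475875156323,
    0.3: 0.3910816017808643,
    0.4: 0.38967754013651945,
    0.5: 0.38808813540767895,
    0.6: 0.38692635332836656,
    0.7: 0.38474370336623864,
    0.8: 0.3841157442722449,
    0.9: 0.38194061118381856,
    1.0: 0.37935841896562783,
}

# the key whose score is largest
best_learning_rate = max(cv_scores, key=cv_scores.get)
print(best_learning_rate, round(cv_scores[best_learning_rate], 3))  # 0.3 0.391
```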

## Fitting Model

Now we are ready to fit our model with `learning_rate = 0.3`.  Notice that we are increasing `n_estimators` from 25 to 500.

In [None]:
df_features = df_train[features]
df_label = df_train[['daily_volume_rank']]
xg_model = XGBRegressor(n_estimators=500, max_depth=3, learning_rate=0.3, random_state=0)
xg_model.fit(df_features, df_label)

In [None]:
sklearn.metrics.r2_score(df_label, xg_model.predict(df_features))

0.4621096942085411

## Backtest

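Let's run our backtest with the fitted model.  The function returns a list of two DataFrames: the daily top-25 ratios and their summary statistics.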

In [None]:
backtest(25, df_test, xg_model, features)

[    trade_date  top_25_volume
 0   2018-01-05          0.827
 1   2018-01-08          0.574
 2   2018-01-09          0.694
 3   2018-01-10          0.535
 4   2018-01-11          0.780
 5   2018-01-12          0.562
 6   2018-01-16          0.584
 7   2018-01-17          0.552
 8   2018-01-18          0.475
 9   2018-01-19          0.642
 10  2018-01-22          0.717
 11  2018-01-23          0.563
 12  2018-01-24          0.599
 13  2018-01-25          0.588
 14  2018-01-26          0.868
 15  2018-01-29          0.577
 16  2018-01-30          0.546
 17  2018-01-31          0.679,
                                                model  average  std_dev  minimum  maximum
 0  XGBRegressor(base_score=None, booster=None, ca...    0.631    0.105    0.475    0.868]

And we can compare our results to the simple rule-based strategy, which ranks by `rank_one_lag`.

In [None]:
backtest(25, df_test)

[    trade_date  top_25_volume
 0   2018-01-05          0.768
 1   2018-01-08          0.556
 2   2018-01-09          0.624
 3   2018-01-10          0.467
 4   2018-01-11          0.678
 5   2018-01-12          0.504
 6   2018-01-16          0.591
 7   2018-01-17          0.516
 8   2018-01-18          0.419
 9   2018-01-19          0.610
 10  2018-01-22          0.675
 11  2018-01-23          0.562
 12  2018-01-24          0.550
 13  2018-01-25          0.563
 14  2018-01-26          0.722
 15  2018-01-29          0.592
 16  2018-01-30          0.525
 17  2018-01-31          0.753,
   model  average  std_dev  minimum  maximum
 0  None    0.593    0.094    0.419    0.768]
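To compare the two backtests day by day, the sketch below copies the daily top-25 ratios from the two outputs above into NumPy arrays. The variable names are hypothetical; in the notebook the same arrays could be taken from the first DataFrame returned by each `backtest()` call.

```python
import numpy as np

# daily top-25 ratios copied from the two backtest outputs above
xgb_ratios = np.array([
    0.827, 0.574, 0.694, 0.535, 0.780, 0.562, 0.584, 0.552, 0.475,
    0.642, 0.717, 0.563, 0.599, 0.588, 0.868, 0.577, 0.546, 0.679,
])
rule_ratios = np.array([
    0.768, 0.556, 0.624, 0.467, 0.678, 0.504, 0.591, 0.516, 0.419,
    0.610, 0.675, 0.562, 0.550, 0.563, 0.722, 0.592, 0.525, 0.753,
])

diff = xgb_ratios - rule_ratios
days_better = int((diff > 0).sum())        # days on which the model beat the rule
mean_gain = round(float(diff.mean()), 3)   # average improvement in the ratio
print(days_better, len(diff), mean_gain)   # 15 18 0.038
```

Over this 18-day test window the fitted model beats the rule-based predictor on 15 days, and the average ratio rises from 0.593 to 0.631.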

---

**Code Challenge:** Search the documentation and find another model hyperparameter to tune.  Then see how the new model performs with that hyperparameter set to the optimal value that you found.

---