<a href="https://colab.research.google.com/github/nschantz21/studious-succotash/blob/develop/notebooks/1_1_ns_predictive_feature_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predictive Feature Exploration
last updated: 2021-12-20

After an initial exploratory data analysis, I will explore features that we may want to predict. As before, I will use a subsample of the data to limit overfitting in the modeling step.  

I will then fit a Lasso regression for each of these theoretical models to see whether it is worth pursuing further.  

Based on the available data, models for the following may be possible:
* Price Movement
* Liquidity
* Supply/Demand Pressure
* Price Volatility

**Executive Summary**  
The causal model for liquidity at the best bid and ask (i.e. bq0 - aq0) yielded a modest in-sample fit (R² of about 0.31). I will move forward with this model for the sake of this exercise.

In [None]:
# imports
import pandas as pd
import random
import matplotlib.pyplot as plt

In [None]:
# constants
# you could change these and re-run the entire notebook
input_data_fp = "../data/interim/20190612.csv"
p = 0.1  # fraction of lines in the file to sample
random.seed(42)

# import data
# keep the header, then take only about 10% of lines
# if a random draw from the [0, 1) interval is greater than p, the row is skipped
df = pd.read_csv(
         input_data_fp,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p)
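
The `skiprows` callable above always keeps the header (row 0) and keeps every other row with probability `p`. A minimal sketch of that predicate on its own, without pandas, assuming a hypothetical file of 100,000 data rows (the helper name `keep_row` and the row count are illustrative, not from the notebook):

```python
import random

def keep_row(i, p, rng):
    """Mirror of the skiprows logic: True means the row is kept."""
    skip = i > 0 and rng.random() > p
    return not skip

rng = random.Random(42)   # local generator so the sketch is reproducible
p = 0.1                   # keep roughly 10% of data rows
n_rows = 100_000          # hypothetical number of data rows, header excluded

kept = [i for i in range(n_rows + 1) if keep_row(i, p, rng)]
sampled_fraction = (len(kept) - 1) / n_rows   # exclude the header from the count

print(kept[0])            # 0, the header is always kept
print(sampled_fraction)   # close to p, not exactly p
```

Because each row is an independent draw, the sample size varies slightly from run to run unless the seed is fixed.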

In [None]:
# to avoid comparisons within the same timestep (a zero time difference), I will drop duplicate timestamps
df.drop_duplicates(subset="timestamp", inplace=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26751 entries, 0 to 27350
Data columns (total 25 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   timestamp  26751 non-null  int64  
 1   price      26751 non-null  int64  
 2   side       26751 non-null  object 
 3   bp0        26751 non-null  int64  
 4   bp1        26750 non-null  float64
 5   bp2        26749 non-null  float64
 6   bp3        26748 non-null  float64
 7   bp4        26748 non-null  float64
 8   bq0        26751 non-null  int64  
 9   bq1        26751 non-null  int64  
 10  bq2        26751 non-null  int64  
 11  bq3        26751 non-null  int64  
 12  bq4        26751 non-null  int64  
 13  ap0        26751 non-null  int64  
 14  ap1        26750 non-null  float64
 15  ap2        26750 non-null  float64
 16  ap3        26749 non-null  float64
 17  ap4        26748 non-null  float64
 18  aq0        26751 non-null  int64  
 19  aq1        26751 non-null  int64  
 20  aq2   

In [None]:
# calculate the bid-ask spreads and balance of quantities at each level
for x in range(5):
    df["spread{}".format(x)] = df["ap{}".format(x)] - df["bp{}".format(x)]
    # spread balance: positive is excess demand, negative is excess supply
    df["spread_balance_{}".format(x)] = df["bq{}".format(x)] - df["aq{}".format(x)]
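
As a sanity check on the sign conventions, the first sampled row (timestamp 0 in the `df.head()` output at the end of this notebook) can be worked by hand. The sketch below only repeats the level-0 arithmetic from the loop above with those values:

```python
# Level-0 quotes from the first sampled row (timestamp 0) of df.head()
bp0, ap0 = 10095, 10100   # best bid and best ask prices
bq0, aq0 = 1, 51          # quantities resting at those prices

spread0 = ap0 - bp0            # ask minus bid
spread_balance_0 = bq0 - aq0   # bid quantity minus ask quantity

print(spread0)            # 5
print(spread_balance_0)   # -50, i.e. more quantity offered than bid
```

Both values agree with the `spread0` and `spread_balance_0` columns shown for that row.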

# Price Movement
We may want to predict price movements.  
To normalize this feature, we will want to look at the change in price over a change in time.  

The theoretical causal model in this case would be forward price movement as a function of supply/demand pressure, price momentum, and market liquidity. The supply/demand pressure is measured here as the quantity of bids less the quantity of asks (i.e. a positive number is higher demand, a negative number is higher supply), momentum is price change over a defined period of time, and liquidity is the spread at each price level. 

In [None]:
# Forward average price change per microsecond over the next steps_forward rows
steps_forward = 5000
fwd_price_change = (df["price"].diff(steps_forward) / df["timestamp"].diff(steps_forward)).shift(-steps_forward).rename("fwd_price")
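
The expression above divides an n-row price difference by the matching timestamp difference and then shifts the result back by n rows, so the value stored at row t describes the move from row t to row t + n. A dependency-free sketch of the same alignment, using made-up prices and timestamps (the function name `forward_rate` is hypothetical):

```python
def forward_rate(prices, timestamps, n):
    """Price change per unit time from row t to row t + n; None where t + n is out of range."""
    out = []
    for t in range(len(prices)):
        if t + n < len(prices):
            dp = prices[t + n] - prices[t]
            dt = timestamps[t + n] - timestamps[t]
            out.append(dp / dt)
        else:
            out.append(None)   # pandas would hold NaN here
    return out

# Illustrative values only, not taken from the data set
prices = [100, 105, 95, 110, 120]
timestamps = [0, 10, 30, 60, 100]

fwd = forward_rate(prices, timestamps, n=2)
print(fwd)   # the last n entries are None
```

The division only works when the two timestamps differ, which is one reason duplicate timestamps were dropped earlier.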

In [None]:
supply_demand_pressure = df.filter(regex="spread_balance_[0-4]").diff(steps_forward)

In [None]:
momentum = df["price"].diff(steps_forward).rename("momentum")

In [None]:
spread = df.filter(regex="spread[0-4]").diff(steps_forward)

In [None]:
# ask vwap less bid vwap, smoothed with a rolling mean
vwap_spread = (
    (df["ask_vwap"] - df["bid_vwap"])
    .rename("vwap_spread")
    .rolling(steps_forward)
    .mean()
)
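
The `bid_vwap` and `ask_vwap` columns come from an earlier step that is not shown here. The values in `df.head()` are consistent with a quantity-weighted average of the quoted prices across the populated book levels. The sketch below assumes that definition (the helper `vwap` is hypothetical) and checks it against the row with timestamp 12:

```python
def vwap(prices, quantities):
    """Quantity-weighted average price over the populated book levels."""
    total_qty = sum(quantities)
    return sum(p * q for p, q in zip(prices, quantities)) / total_qty

# Populated levels from the df.head() row with timestamp 12
bid_vwap = vwap([10095, 10090], [4, 17])
ask_vwap = vwap([10100, 10105, 10110], [121, 8, 32])

print(round(bid_vwap, 6))    # 10090.952381, as in df.head()
print(round(ask_vwap, 6))    # 10102.236025, as in df.head()
print(ask_vwap - bid_vwap)   # instantaneous VWAP spread, before the rolling mean
```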

In [None]:
# concat the features
price_model_features = pd.concat(
    [
     fwd_price_change,
     supply_demand_pressure,
     momentum,
     spread,
     vwap_spread
    ],
    axis=1
).dropna()

In [None]:
price_model_correlations = price_model_features.corr()
price_model_correlations

Unnamed: 0,fwd_price,spread_balance_0,spread_balance_1,spread_balance_2,spread_balance_3,spread_balance_4,momentum,spread0,spread1,spread2,spread3,spread4,vwap_spread
fwd_price,1.0,0.022761,0.014014,-0.004272,0.030558,0.013282,-0.353078,0.012061,0.011599,0.011599,0.011599,0.011599,0.023994
spread_balance_0,0.022761,1.0,0.174422,0.065772,0.085057,-0.088058,-0.003318,-0.000466,-0.000317,-0.000317,-0.000317,-0.000317,0.001808
spread_balance_1,0.014014,0.174422,1.0,0.216832,-0.000796,-0.028053,-0.018626,0.00722,0.007679,0.007679,0.007679,0.007679,-0.041018
spread_balance_2,-0.004272,0.065772,0.216832,1.0,0.024383,0.135294,-0.029555,-0.003669,-0.001435,-0.001435,-0.001435,-0.001435,-0.05679
spread_balance_3,0.030558,0.085057,-0.000796,0.024383,1.0,-0.075604,-0.071895,-0.021453,-0.019218,-0.019218,-0.019218,-0.019218,-0.01688
spread_balance_4,0.013282,-0.088058,-0.028053,0.135294,-0.075604,1.0,-0.080799,0.022921,0.022444,0.022444,0.022444,0.022444,-0.012128
momentum,-0.353078,-0.003318,-0.018626,-0.029555,-0.071895,-0.080799,1.0,-0.008956,-0.010039,-0.010039,-0.010039,-0.010039,-0.070293
spread0,0.012061,-0.000466,0.00722,-0.003669,-0.021453,0.022921,-0.008956,1.0,0.994436,0.994436,0.994436,0.994436,-0.020215
spread1,0.011599,-0.000317,0.007679,-0.001435,-0.019218,0.022444,-0.010039,0.994436,1.0,1.0,1.0,1.0,-0.023434
spread2,0.011599,-0.000317,0.007679,-0.001435,-0.019218,0.022444,-0.010039,0.994436,1.0,1.0,1.0,1.0,-0.023434


Unfortunately, I do not see much of a pattern in the potential model features, aside from momentum, which has a correlation of about -0.35 with the forward price change.

## Modeling

In [None]:
from sklearn.linear_model import Lasso

In [None]:
reg = Lasso(alpha=1.0, max_iter=10000)
y = price_model_features.iloc[:, 0].values
X = price_model_features.iloc[:, 1:].values
reg.fit(X, y)

Lasso(max_iter=10000)

In [None]:
reg.coef_

array([ 0.,  0., -0.,  0.,  0., -0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [None]:
reg.score(X, y)

0.0
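
`Lasso.score` returns the coefficient of determination, R². A model that predicts the mean of `y` for every row scores exactly 0 by that definition, which is consistent with every coefficient having been shrunk to zero. A small sketch of the formula with made-up numbers:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 4.0, 5.0]                     # illustrative targets
constant_pred = [sum(y) / len(y)] * len(y)   # intercept-only model

print(r_squared(y, constant_pred))   # 0.0
print(r_squared(y, y))               # 1.0
```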

In [None]:
# every coefficient is zero, so the model just predicts a constant (the intercept)
reg.predict(X)

array([-8.61627915e-09, -8.61627915e-09, -8.61627915e-09, ...,
       -8.61627915e-09, -8.61627915e-09, -8.61627915e-09])
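
One plausible reason for the all-zero coefficients, offered as an assumption rather than a diagnosis, is scale. The constant prediction above is about -8.6e-9, which suggests the target (price change per microsecond) may be tiny compared with the penalty weight `alpha=1.0`. scikit-learn's Lasso minimizes (1 / (2n)) * ||y - Xw||² + alpha * ||w||₁, and for a single centered feature the solution has a closed form via soft-thresholding. The sketch uses synthetic numbers, and `lasso_1d` is a hypothetical helper, not the library's solver:

```python
def soft_threshold(z, alpha):
    """Shrink z toward zero by alpha; exactly zero when |z| <= alpha."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

def lasso_1d(x, y, alpha):
    """Closed-form Lasso slope for one feature, after centering x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    cov = sum(a * b for a, b in zip(xc, yc)) / n
    var = sum(a * a for a in xc) / n
    return soft_threshold(cov, alpha) / var

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_tiny = [3e-9 * v for v in x]    # perfectly linear, but on a tiny scale
y_large = [3.0 * v for v in x]    # same relationship, rescaled

print(lasso_1d(x, y_tiny, alpha=1.0))    # 0.0, the penalty swamps the signal
print(lasso_1d(x, y_large, alpha=1.0))   # 2.5, shrunk from the true slope of 3
```

If scale is the issue here, standardizing the target or searching over much smaller `alpha` values (for example with `LassoCV`) would be a reasonable next step.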

# Liquidity (Volume)

Liquidity in this case is the quantity of bids at the best (highest) bid price less the quantity of asks at the best (lowest) ask price.  

The causal model is that the forward change in liquidity (bq0 - aq0) is a function of past changes in liquidity across all levels, price volatility, and measures of price momentum.


In [None]:
liquidity_steps_forward = 5000
# target: forward change in the level-0 balance (bq0 - aq0)
balances = (
    df["spread_balance_0"]
    .diff(liquidity_steps_forward)
    .shift(-liquidity_steps_forward)
    .rename("fwd_liq")
)

In [None]:
other_balances = df.filter(regex="spread_balance_[0-4]").diff(liquidity_steps_forward)

# I did split the momentum up at one point, but it didn't help much
#bid_momentum = df.where(df["side"]=="b")["price"].diff(liquidity_steps_forward).rename("bid_momentum").fillna(0.0)
#ask_momentum = df.where(df["side"]=="a")["price"].diff(liquidity_steps_forward).rename("ask_momentum").fillna(0.0)

# volatility
price_volatility = df["price"].rolling(liquidity_steps_forward).std().rename("price_volatility")
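
`rolling(n).std()` computes the sample standard deviation (pandas defaults to `ddof=1`) over the trailing n rows and leaves the first n - 1 rows empty. A small standard-library sketch of the same calculation, using the five prices from `df.head()` and a window of 3 purely for illustration (`rolling_std` is a hypothetical helper):

```python
import statistics

def rolling_std(values, window):
    """Trailing sample standard deviation; None until a full window is available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)   # pandas would hold NaN here
        else:
            out.append(statistics.stdev(values[i + 1 - window : i + 1]))
    return out

prices = [10095, 10090, 10115, 10070, 10130]   # the price column from df.head()
vol = rolling_std(prices, window=3)
print(vol)   # first two entries are None
```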

In [None]:
#liquidity_features = balances.join(other_balances).join(bid_momentum).join(ask_momentum).join(price_volatility).dropna()
liquidity_features = pd.concat(
    [
        balances,
        other_balances,
        momentum,
        price_volatility,
        df["bid_vwap"].diff(liquidity_steps_forward),
        df["ask_vwap"].diff(liquidity_steps_forward),
    ],
    axis=1,
).dropna()
liquidity_features.corr()

Unnamed: 0,fwd_liq,spread_balance_0,spread_balance_1,spread_balance_2,spread_balance_3,spread_balance_4,momentum,price_volatility,bid_vwap,ask_vwap
fwd_liq,1.0,-0.555715,-0.069717,-0.096067,-0.089546,-0.033427,0.02264,-0.01519,0.054511,0.06255
spread_balance_0,-0.555715,1.0,0.140163,0.119778,0.092532,-0.01404,-0.030589,-0.01167,-0.02012,-0.023207
spread_balance_1,-0.069717,0.140163,1.0,0.274589,0.049585,0.111267,0.016474,0.024141,0.006573,-0.025773
spread_balance_2,-0.096067,0.119778,0.274589,1.0,0.133496,0.310591,-0.033338,0.029259,-0.045877,-0.06378
spread_balance_3,-0.089546,0.092532,0.049585,0.133496,1.0,0.163764,-0.079905,0.068325,-0.177581,-0.180629
spread_balance_4,-0.033427,-0.01404,0.111267,0.310591,0.163764,1.0,-0.122788,0.162085,-0.286988,-0.277309
momentum,0.02264,-0.030589,0.016474,-0.033338,-0.079905,-0.122788,1.0,-0.017172,0.16119,0.131859
price_volatility,-0.01519,-0.01167,0.024141,0.029259,0.068325,0.162085,-0.017172,1.0,-0.813067,-0.81957
bid_vwap,0.054511,-0.02012,0.006573,-0.045877,-0.177581,-0.286988,0.16119,-0.813067,1.0,0.969166
ask_vwap,0.06255,-0.023207,-0.025773,-0.06378,-0.180629,-0.277309,0.131859,-0.81957,0.969166,1.0
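
The -0.56 correlation between `fwd_liq` and `spread_balance_0` deserves a closer look. Writing the level-0 balance as b, the target is b[t+n] - b[t] and the feature is b[t] - b[t-n], so both contain b[t] with opposite signs. If the balance were pure noise around a stable mean, those two differences would already correlate at about -0.5. The simulation below uses synthetic Gaussian noise, not the order book data, to show the effect:

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
n = 50                                           # lag, analogous to liquidity_steps_forward
b = [rng.gauss(0, 100) for _ in range(20_000)]   # synthetic i.i.d. "balance" series

idx = range(n, len(b) - n)
past_change = [b[t] - b[t - n] for t in idx]     # feature: backward difference
fwd_change = [b[t + n] - b[t] for t in idx]      # target: forward difference

corr = pearson(past_change, fwd_change)
print(round(corr, 2))   # near -0.5
```

Part of the relationship in the table may therefore reflect this shared term rather than predictive structure, which is worth keeping in mind when reading the model fit.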


In [None]:
liquidity_features.describe()

Unnamed: 0,fwd_liq,spread_balance_0,spread_balance_1,spread_balance_2,spread_balance_3,spread_balance_4,momentum,price_volatility,bid_vwap,ask_vwap
count,16751.0,16751.0,16751.0,16751.0,16751.0,16751.0,16751.0,16751.0,16751.0,16751.0
mean,0.698167,0.921258,2.363023,3.541759,3.846099,1.136469,-2.392991,37.977189,-8.843506,-9.822971
std,96.911873,94.132133,72.00497,81.877473,98.830915,102.039654,51.105569,5.445878,26.359053,26.63722
min,-334.0,-359.0,-283.0,-315.0,-333.0,-301.0,-310.0,31.946416,-167.007792,-72.222988
25%,-66.0,-63.0,-43.0,-48.0,-64.0,-72.0,-25.0,33.086668,-36.337515,-38.511823
50%,0.0,1.0,3.0,4.0,5.0,1.0,-5.0,37.655575,-3.57113,-2.611945
75%,67.0,65.0,50.0,56.0,72.0,69.0,25.0,39.79492,11.68737,10.603571
max,337.0,337.0,256.0,276.0,301.0,376.0,275.0,50.874445,38.461849,34.729762
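
The count of 16,751 rows follows from the alignment, assuming no other missing values in these columns: 26,751 rows remain after de-duplication, the backward differences are undefined for the first 5,000 rows, and the forward-shifted target is undefined for the last 5,000. A quick check:

```python
n_rows = 26751                   # rows after drop_duplicates, from df.info()
liquidity_steps_forward = 5000

# dropna() removes the first n rows (backward diff) and the last n rows (forward shift)
usable_rows = n_rows - 2 * liquidity_steps_forward
print(usable_rows)   # 16751, matching the count in describe()
```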


In [None]:
reg1 = Lasso(alpha=1.0, max_iter=10000)
y1 = liquidity_features.iloc[:, 0].values
X1 = liquidity_features.iloc[:, 1:].values
reg1.fit(X1, y1)

Lasso(max_iter=10000)

In [None]:
reg1.coef_

array([-0.56821984,  0.02677982, -0.02525054, -0.02310299, -0.02030745,
       -0.01153164,  0.67561676, -0.23307728,  0.47907567])

In [None]:
# in-sample R²: not a good score, but an improvement over the price model
reg1.score(X1, y1)

0.31448416026347825

In [None]:
df.head()

Unnamed: 0,timestamp,price,side,bp0,bp1,bp2,bp3,bp4,bq0,bq1,bq2,bq3,bq4,ap0,ap1,ap2,ap3,ap4,aq0,aq1,aq2,aq3,aq4,bid_vwap,ask_vwap,spread0,spread_balance_0,spread1,spread_balance_1,spread2,spread_balance_2,spread3,spread_balance_3,spread4,spread_balance_4
0,0,10095,b,10095,,,,,1,0,0,0,0,10100,,,,,51,0,0,0,0,10095.0,10100.0,5,-50,,0,,0,,0,,0
4,12,10090,b,10095,10090.0,,,,4,17,0,0,0,10100,10105.0,10110.0,,,121,8,32,0,0,10090.952381,10102.236025,5,-117,15.0,9,,-32,,0,,0
7,30,10115,a,10095,10090.0,10085.0,,,4,18,17,0,0,10100,10105.0,10110.0,10115.0,,121,8,35,110,0,10088.333333,10107.445255,5,-117,15.0,10,25.0,-18,,-110,,0
9,58,10070,b,10095,10090.0,10085.0,10080.0,10075.0,4,18,17,57,6,10100,10105.0,10110.0,10115.0,10120.0,121,8,35,110,10,10078.651316,10110.092025,5,-117,15.0,10,25.0,-18,35.0,-53,45.0,-4
10,114,10130,a,10095,10090.0,10085.0,10080.0,10075.0,4,18,17,57,6,10100,10105.0,10110.0,10115.0,10120.0,121,8,35,110,10,10077.34104,10112.459459,5,-117,15.0,10,25.0,-18,35.0,-53,45.0,-4
