# Overview

We need to develop a model capable of predicting the closing price movements for hundreds of Nasdaq listed stocks using data from the order book and the closing auction of the stock. Information from the auction can be used to adjust prices, assess supply and demand dynamics, and identify trading opportunities.

### Description

Stock exchanges are fast-paced, high-stakes environments where every second counts. The intensity escalates as the trading day approaches its end, peaking in the critical final ten minutes. These moments, often characterised by heightened volatility and rapid price fluctuations, play a pivotal role in shaping the global economic narrative for the day.

Each trading day on the Nasdaq Stock Exchange concludes with the Nasdaq Closing Cross auction. This process establishes the official closing prices for securities listed on the exchange. These closing prices serve as key indicators for investors, analysts and other market participants in evaluating the performance of individual securities and the market as a whole.

Within this complex financial landscape operates Optiver, a leading global electronic market maker. Fueled by technological innovation, Optiver trades a vast array of financial instruments, such as derivatives, cash equities, ETFs, bonds, and foreign currencies, offering competitive, two-sided prices for thousands of these instruments on major exchanges worldwide.

In the last ten minutes of the Nasdaq exchange trading session, market makers like Optiver merge traditional order book data with auction book data. This ability to consolidate information from both sources is critical for providing the best prices to all market participants.

### Objective

In this competition, you are challenged to develop a model capable of predicting the closing price movements for hundreds of Nasdaq listed stocks using data from the order book and the closing auction of the stock. Information from the auction can be used to adjust prices, assess supply and demand dynamics, and identify trading opportunities.

Your model can contribute to the consolidation of signals from the auction and order book, leading to improved market efficiency and accessibility, particularly during the intense final ten minutes of trading. You'll also get firsthand experience in handling real-world data science problems, similar to those faced by traders, quantitative researchers and engineers at Optiver.

### Files 

[train/test].csv The auction data. The test data will be delivered by the API.

* stock_id - A unique identifier for the stock. Not all stock IDs exist in every time bucket.
* date_id - A unique identifier for the date. Date IDs are sequential & consistent across all stocks.
* imbalance_size - The amount unmatched at the current reference price (in USD).
* imbalance_buy_sell_flag - An indicator reflecting the direction of auction imbalance.
  * buy-side imbalance; 1
  * sell-side imbalance; -1
  * no imbalance; 0
* reference_price - The price at which paired shares are maximized, the imbalance is minimized and the distance from the bid-ask midpoint is minimized, in that order. Can also be thought of as being equal to the near price bounded between the best bid and ask price.
* matched_size - The amount that can be matched at the current reference price (in USD).
* far_price - The crossing price that will maximize the number of shares matched based on auction interest only. This calculation excludes continuous market orders.
* near_price - The crossing price that will maximize the number of shares matched based on auction and continuous market orders.
* [bid/ask]_price - Price of the most competitive buy/sell level in the non-auction book.
* [bid/ask]_size - The dollar notional amount on the most competitive buy/sell level in the non-auction book.
* wap - The weighted average price in the non-auction book:

  $$\frac{BidPrice \cdot AskSize + AskPrice \cdot BidSize}{BidSize + AskSize}$$
* seconds_in_bucket - The number of seconds elapsed since the beginning of the day's closing auction, always starting from 0.
* target - The 60 second future move in the wap of the stock, less the 60 second future move of the synthetic index. Only provided for the train set.
  * The synthetic index is a custom weighted index of Nasdaq-listed stocks constructed by Optiver for this competition.
  * The unit of the target is basis points, which is a common unit of measurement in financial markets. A 1 basis point price move is equivalent to a 0.01% price move.
  * Where t is the time at the current observation, we can define the target:

    $$Target = \left(\frac{StockWAP_{t+60}}{StockWAP_{t}} - \frac{IndexWAP_{t+60}}{IndexWAP_{t}}\right) \times 10000$$
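
As a quick numeric check of the wap and target definitions above, here is a minimal sketch. The function names and the toy numbers are assumptions made for illustration; they are not part of the competition data or API.

```python
def weighted_average_price(bid_price, ask_price, bid_size, ask_size):
    """WAP of the best bid/ask level: each price is weighted by the opposite side's size."""
    return (bid_price * ask_size + ask_price * bid_size) / (bid_size + ask_size)


def target_bps(stock_wap_t, stock_wap_t60, index_wap_t, index_wap_t60):
    """60-second stock WAP ratio minus 60-second index WAP ratio, in basis points."""
    return (stock_wap_t60 / stock_wap_t - index_wap_t60 / index_wap_t) * 10000


# Toy numbers, invented for illustration only.
wap = weighted_average_price(bid_price=0.9999, ask_price=1.0001,
                             bid_size=300.0, ask_size=100.0)
tgt = target_bps(stock_wap_t=1.0000, stock_wap_t60=1.0005,
                 index_wap_t=1.0000, index_wap_t60=1.0002)
print(wap, tgt)
```

With a larger bid size than ask size, the wap sits closer to the ask price. A stock that gains 5 basis points while the index gains 2 has a target of about 3.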

In [4]:
!pip install featuretools==0.27.0



In [6]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

In [21]:
df = pd.read_csv('train.csv')

In [None]:
import sys
sys.path.append('/path/to/directory/')

In [24]:
import public_timeseries_testing_util as pt_util

SyntaxError: invalid syntax (public_timeseries_testing_util.py, line 25)

### Import models 

### Features

In [14]:
from sklearn.base import BaseEstimator, TransformerMixin
import datetime

class Delta_prices(BaseEstimator, TransformerMixin):
    
    """
    Differences and ratios between prices, also combining sizes
    """
    def __init__(self, auction = True): 
        self.auction = auction
    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        X = X.eval("delta_bid_ask = (bid_price - ask_price) / (bid_price + ask_price)")
        X = X.eval("delta_wap_ref = (wap - reference_price) / (wap + reference_price)")
        X = X.eval("delta_bid_ask_size = (bid_size - ask_size) / (bid_size + ask_size + 1)")
        X = X.eval("ratio_bid_ask_matched_size = (bid_size - ask_size) / (matched_size + 1)")
        X = X.eval("imbalance_signed = imbalance_buy_sell_flag * imbalance_size")
        X = X.eval("delta_bid_ref = (bid_price - reference_price) / (bid_price + reference_price)")
        X = X.eval("delta_ask_ref = (ask_price - reference_price) / (ask_price + reference_price)")
        ##############--
        X = X.eval("delta_ask_wap = (ask_price - wap) / (ask_price + wap)")
        X = X.eval("delta_bid_wap = (bid_price - wap) / (bid_price + wap)")
        X = X.eval("imbalance_per_delta_bidask_price = (imbalance_signed) * (bid_price - ask_price)")
        X = X.eval("delta_imbalance_matched = (imbalance_signed - matched_size)/(matched_size + imbalance_signed)")
        X = X.eval("ratio_bid_ask_size = bid_size / ask_size")


        if self.auction:
            X = X.eval("delta_near_far = (near_price - far_price) / (near_price + far_price)")
            X = X.eval("delta_far_ref = (far_price - reference_price) / (far_price + reference_price)")
            X = X.eval("delta_near_wap = (near_price - wap) / (near_price + wap)")
            X = X.eval("delta_near_ref = (near_price - reference_price) / (near_price + reference_price)")
            X = X.eval("delta_near_far_on_matched = (near_price - far_price) / (matched_size + 1)* 10000") #26/10

        return X
    


class wawap_computer(BaseEstimator, TransformerMixin):
    """Computes the size-weighted average wap (w_a_wap) across all stocks at the same time instant,
    then subtracts it from the wap of each stock and scales the difference by 10000 (wap_less_wawap)."""
        
    def __init__(self, wawap = True): 
        self.wawap = wawap

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):

        if self.wawap:

            def compute_w_a_wap(wap, ask_size, bid_size):
                
                return (wap * (bid_size + ask_size)).sum() / (bid_size + ask_size).sum()


            _ = X.groupby(['date_id', 'seconds_in_bucket'])\
                .apply(lambda x : compute_w_a_wap(x.wap, x.ask_size, x.bid_size))\
                .reset_index().rename(columns = {0 : 'w_a_wap'})

            X = X.merge(_, on = ['date_id', 'seconds_in_bucket'], validate = 'm:1')\
                .assign(wap_less_wawap = lambda df_ : (df_.wap - df_.w_a_wap) * 10000)
           
        
        return X


class grouped_aggs(BaseEstimator, TransformerMixin):
    """Adds cross-sectional aggregates (e.g. mean, std) of each column, computed over all stocks
    at the same time instant (same date_id and seconds_in_bucket)."""
        
    def __init__(self, cols = ['ask_size', 'bid_size'], funcs = ['mean', 'std']): 
        self.cols = cols
        self.funcs = funcs

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):

        for func in self.funcs:
            
            newcols = [col + f'_{func}' for col in self.cols]
            
            X[newcols] = X.groupby(['date_id', 'seconds_in_bucket'])[self.cols].transform(func)

        X = X.fillna(0)
        
        return X


class grouped_aggs_less(BaseEstimator, TransformerMixin):
    """For each column, subtracts the cross-sectional aggregate (e.g. mean) over all stocks
    at the same time instant from the value of each stock."""
        
    def __init__(self, cols = ['ask_size', 'bid_size'], funcs = ['mean', 'std']): 
        self.cols = cols
        self.funcs = funcs

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        
        for col in self.cols:
            
            for func in self.funcs:

                newcol = f'{func}_{col}_less_{col}'

                X[newcol] = X[col] - X.groupby(['date_id', 'seconds_in_bucket'])[col].transform(func)

        return X


class Imbalancer_2(BaseEstimator, TransformerMixin):
    """Row-wise ratio features over column triplets: min / max (imb2) and (max - median) / (median - min) (imb3)."""

    def __init__(self, triplets_imb2 = [['ask_size', 'bid_size', 'imbalance_signed']], triplets_imb3 = [['wap', 'reference_price', 'far_price']]):
        self.triplets_imb2 = triplets_imb2
        self.triplets_imb3 = triplets_imb3


    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        
        for triplet in self.triplets_imb2:
            X[f'imb2_{triplet[0]}_{triplet[1]}_{triplet[2]}'] = X[triplet].min(axis = 1) / X[triplet].max(axis = 1)
            
        for triplet in self.triplets_imb3:
            X[f'imb3_{triplet[0]}_{triplet[1]}_{triplet[2]}'] = (X[triplet].max(axis = 1) -  X[triplet].median(axis = 1)) /\
                                                                (X[triplet].median(axis = 1) - X[triplet].min(axis = 1))

        
        return X
            




class Difference_Computer(BaseEstimator, TransformerMixin):
    """Per-stock percentage changes (deltafeats) and lagged values (old_features) for every step from 1 to `step`."""

    def __init__(self, step = 1, cols_lag = ['wap', 'reference_price'], cols_diff = ['wap', 'reference_price'], old_features = True, deltafeats = False, deltaold = True):
        self.step = step
        self.cols_lag = cols_lag
        self.cols_diff = cols_diff
        self.old_features = old_features
        self.deltafeats = deltafeats
        self.deltaold = deltaold

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        
        if self.deltafeats:

            for step in range(1, self.step + 1):
                new_cols = [feat + f'_delta_{step}' for feat in self.cols_diff]
                
                X[new_cols] = X.groupby([ 'stock_id'])[self.cols_diff].pct_change(periods = step)*100
  
        if self.old_features:

            for lag in range(1, self.step + 1):
                new_cols = [feat + f'_{lag}' for feat in self.cols_lag]
                X[new_cols] = X.groupby([ 'stock_id'])[self.cols_lag].shift(lag)
        
        return X
            

class Rolling_mean(BaseEstimator, TransformerMixin):
    def __init__(self,  cols = ['wap'], cols_range = ['wap'], macd_style = True, rolling_range = True):
        self.cols = cols
        self.cols_range = cols_range
        self.macd = macd_style
        self.rolling_range = rolling_range
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        if self.macd:
            
            newcols = [f'{col}_rolling_macd' for col in self.cols]

            X[newcols] = X.groupby([ 'stock_id'], as_index = False)[self.cols].rolling(8).mean()[self.cols] \
                - X.groupby(['stock_id'], as_index = False)[self.cols].rolling(16).mean()[self.cols] #12 e 22
            
            
        if self.rolling_range:

            newcols_min = [f'{col}_rolling_min' for col in self.cols_range]
            
            X[newcols_min] = X.groupby([ 'stock_id'], as_index = False)[self.cols_range].rolling(10).min()[self.cols_range]
            
            newcols_max = [f'{col}_rolling_max' for col in self.cols_range]
            
            X[newcols_max] = X.groupby([ 'stock_id'], as_index = False)[self.cols_range].rolling(10).max()[self.cols_range]
            
            for col in self.cols_range:
                X[f'{col}_rolling_range'] = X[f'{col}_rolling_max'] - X[f'{col}_rolling_min']
                del X[f'{col}_rolling_max'], X[f'{col}_rolling_min']
                
        return X 
    
class Expanding_feats(BaseEstimator, TransformerMixin):
    def __init__(self,  cols = ['wap'], funcs = ['mean', 'median', 'min'], subtract_to_feat = True):
        
        self.cols = cols
        self.funcs = funcs
        self.subtract = subtract_to_feat
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        newcols = [f'{col}_expanding_{func}' for col in self.cols for func in self.funcs]
        
        X[newcols] = X.groupby([ 'date_id','stock_id'])[self.cols].expanding().aggregate(self.funcs).droplevel([0,1])
        
        
        if self.subtract:
            
            for col in self.cols:
                for func in self.funcs:
                    X[f'{col}_expanding_{func}_delta'] = X[col] - X[f'{col}_expanding_{func}']
            
        
        X = X.fillna(0)

        return X
    
    
class Ewm_feats(BaseEstimator, TransformerMixin):
    def __init__(self,  cols_mean = ['wap'], cols_std = ['reference_price'], span = 2):
        
        self.cols_mean = cols_mean
        self.cols_std = cols_std
        self.span = span

    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):

        newcols = [f'{col}_ewm_mean' for col in self.cols_mean]
            
        X[newcols] = X.groupby([ 'stock_id'], as_index = False)[self.cols_mean].transform(lambda x : x.ewm(span = self.span).mean())

        newcols = [f'{col}_ewm_std' for col in self.cols_std]
            
        X[newcols] = X.groupby([ 'stock_id'], as_index = False)[self.cols_std].transform(lambda x : x.ewm(span = self.span).std())

        return X
    
class MACD_computer_v2(BaseEstimator, TransformerMixin):
    def __init__(self,  cols = ['wap']):
        self.cols = cols
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        newcols_MACD = [f'{col}_MACD' for col in self.cols]
        newcols_sigline = [f'{col}_sigline' for col in self.cols]

        X[newcols_MACD] = X.groupby([ 'stock_id'])[self.cols].transform(lambda x : x.ewm(span = 2).mean()) \
            - X.groupby(['stock_id'])[self.cols].transform(lambda x : x.ewm(span = 8).mean()) #12 e 22


        return X

stock_id_map = {0: 2, 1: 10, 2: 10, 3: 2, 4: 5, 5: 12, 6: 8, 7: 4,
 8: 5, 9: 6, 10: 3, 11: 8, 12: 1, 13: 4, 14: 11, 15: 5,
 16: 8, 17: 4, 18: 12, 19: 1, 20: 8, 21: 3, 22: 4, 23: 4,
 24: 3, 25: 3, 26: 6, 27: 4, 28: 6, 29: 2, 30: 2, 31: 10,
 32: 8, 33: 12, 34: 1, 35: 8, 36: 2, 37: 2, 38: 3, 39: 2,
 40: 5, 41: 2, 42: 8, 43: 3, 44: 3, 45: 10, 46: 10, 47: 5,
 48: 12, 49: 6, 50: 10, 51: 12, 52: 6, 53: 8, 54: 11, 55: 2,
 56: 12, 57: 10, 58: 11, 59: 10, 60: 12, 61: 10, 62: 10, 63: 6,
 64: 3, 65: 2, 66: 3, 67: 10, 68: 2, 69: 10, 70: 10, 71: 3,
 72: 4, 73: 3, 74: 10, 75: 11, 76: 3, 77: 1, 78: 10, 79: 3,
 80: 10, 81: 1, 82: 10, 83: 4, 84: 9, 85: 8, 86: 10, 87: 4,
 88: 10, 89: 4, 90: 8, 91: 12, 92: 8, 93: 6, 94: 12, 95: 2,
 96: 4, 97: 1, 98: 5, 99: 3, 100: 10, 101: 8, 102: 12, 103: 12,
 104: 2, 105: 2, 106: 8, 107: 10, 108: 7, 109: 2, 110: 3, 111: 7,
 112: 10, 113: 12, 114: 8, 115: 10, 116: 8, 117: 3, 118: 12, 119: 10,
 120: 3, 121: 2, 122: 5, 123: 1, 124: 8, 125: 1, 126: 4, 127: 12,
 128: 8, 129: 8, 130: 5, 131: 1, 132: 8, 133: 10, 134: 3, 135: 12,
 136: 8, 137: 3, 138: 10, 139: 5, 140: 4, 141: 4, 142: 4, 143: 10,
 144: 2, 145: 8, 146: 8, 147: 6, 148: 2, 149: 2, 150: 12, 151: 8,
 152: 3, 153: 8, 154: 12, 155: 6, 156: 8, 157: 11, 158: 1, 159: 8,
 160: 2, 161: 10, 162: 7, 163: 2, 164: 2, 165: 5, 166: 2, 167: 3,
 168: 8, 169: 11, 170: 3, 171: 3, 172: 10, 173: 8, 174: 8, 175: 9,
 176: 12, 177: 10, 178: 12, 179: 10, 180: 10, 181: 3, 182: 5, 183: 4,
 184: 12, 185: 3, 186: 2, 187: 3, 188: 7, 189: 2, 190: 10, 191: 8,
 192: 10, 193: 10, 194: 5, 195: 2, 196: 3, 197: 12, 198: 3, 199: 6}

class Sectors(BaseEstimator, TransformerMixin):
    def __init__(self,  map = stock_id_map, cols = ['imbalance_signed', 'matched_size', 'wap_less_wawap']):
        self.mappa = map  # use the mapping passed to the constructor, not the global
        self.cols = cols
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        X['sector'] = X['stock_id'].map(self.mappa)

        newcols = [f'{col}_sector' for col in self.cols]
        X[newcols] = X.groupby(['date_id', 'sector','seconds_in_bucket'])[self.cols].transform('mean')
        
        return X
    
class grouped_aggs_std(BaseEstimator, TransformerMixin):
    
        
    def __init__(self, cols = ['ask_size', 'bid_size']): 
        self.cols = cols

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        
        for col in self.cols:
            
            newcol = f'{col}_std_{col}'
            X[newcol] = X[col] / X.groupby(['date_id', 'seconds_in_bucket'])[col].transform('std')

        
        return X
    
class rankino(BaseEstimator, TransformerMixin):
        
    def __init__(self, cols = ['ask_size', 'bid_size']): 
        self.cols = cols

    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
            
        newcols = [col + f'_rank' for col in self.cols]
        X[newcols] = X.groupby([ 'stock_id'], as_index = False)[self.cols].rolling(5).rank(method = 'min')[self.cols].fillna(0).astype('int32')

        
        return X
    

class Fillna(BaseEstimator, TransformerMixin):
    def __init__(self):
        
        return None
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):

        X.replace([np.inf, -np.inf], [99999, -99999], inplace = True)
        X = X.fillna(0)

        return X
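
To make two of the transformers above concrete, the following sketch reproduces the cross-sectional de-meaning used in `grouped_aggs_less` and the triplet ratio used in `Imbalancer_2` on a toy snapshot. The three-stock frame and its values are invented for illustration; only the groupby/transform pattern and the formulas come from the classes above.

```python
import pandas as pd

# Toy snapshot: three stocks observed at one (date_id, seconds_in_bucket).
toy = pd.DataFrame({
    "date_id": [0, 0, 0],
    "seconds_in_bucket": [0, 0, 0],
    "stock_id": [0, 1, 2],
    "bid_size": [100.0, 200.0, 600.0],
    "bid_price": [0.9990, 0.9995, 0.9998],
    "wap": [0.9995, 1.0000, 1.0000],
    "ask_price": [1.0000, 1.0005, 1.0006],
})

# Cross-sectional de-meaning: value of each stock minus the mean over all
# stocks in the same time bucket (same naming scheme as grouped_aggs_less).
group_mean = toy.groupby(["date_id", "seconds_in_bucket"])["bid_size"].transform("mean")
toy["mean_bid_size_less_bid_size"] = toy["bid_size"] - group_mean

# Triplet ratio (imb3): (max - median) / (median - min), computed per row.
triplet = toy[["wap", "ask_price", "bid_price"]]
toy["imb3_wap_ask_price_bid_price"] = (
    (triplet.max(axis=1) - triplet.median(axis=1))
    / (triplet.median(axis=1) - triplet.min(axis=1))
)
print(toy[["stock_id", "mean_bid_size_less_bid_size", "imb3_wap_ask_price_bid_price"]])
```

An imb3 value of 1 means the wap sits midway between bid and ask; values above 1 mean it sits closer to the bid. When the median equals the minimum the ratio divides by zero, which is why the pipeline ends with a step that replaces infinities.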


### Pipeline Building 

In [15]:
from sklearn.pipeline import Pipeline

cols_for_diff = ['ask_price', 'bid_price',  'matched_size', 'imbalance_signed', 'reference_price',
                 'delta_bid_ask', 'tgt_1']

cols_for_lag = ['ask_price','matched_size', 'imbalance_signed', 'reference_price',
                'near_price', 'far_price',   'delta_bid_ask', 
                'wap_less_wawap']

triplets_imb2 = [['reference_price', 'far_price', 'near_price'], ['ask_size', 'bid_size', 'matched_size'], ['ask_size', 'bid_size', 'imbalance_signed']]

triplets_imb3 = [ ['wap', 'ask_price', 'bid_price'], ['ask_size', 'bid_size', 'matched_size'], ['ask_size', 'bid_size', 'imbalance_signed']]

cols_for_MACD = [ 'matched_size', 'imbalance_signed', 'reference_price',
                'near_price', 'delta_bid_ask', 
                 'delta_wap_ref']

cols_for_grouped_aggs = ['ask_size', 'bid_size']

cols_for_grouped_aggs_less = [ 'imbalance_signed', 'reference_price', 
                              'matched_size', 'ratio_bid_ask_size','delta_wap_ref', 'delta_ask_wap', 'delta_bid_ask_size']

cols_for_rolling = ['ask_price', 'bid_price', 'ask_size', 'bid_size', 'matched_size', 'imbalance_signed', 'reference_price', 
                    'near_price', 'far_price', 'wap', 'ratio_bid_ask_size', 'delta_bid_ask']

cols_for_rolling_range = ['wap', 'bid_price', 'ask_price', 'delta_bid_ask']

cols_for_expanding = ['bid_price', 'ask_size',  'matched_size', 'imbalance_signed', 'reference_price', 
                      'wap', 'ratio_bid_ask_size', 'delta_bid_ask', 'delta_bid_ask_size'] #'far_price','bid_size','ask_price', 

cols_for_rank = ['bid_price', 'ask_size',  'matched_size', 'imbalance_signed', 'reference_price', 
                      'wap', 'ratio_bid_ask_size', 'delta_bid_ask', 'delta_bid_ask_size', 'wap_less_wawap']

cols_grouped_std = [
 'matched_size', 'far_price',  'bid_size',
  'wap', 'delta_bid_ask',
 'delta_wap_ref', 'delta_bid_ask_size', 'ratio_bid_ask_matched_size', 'imbalance_signed',  
  'delta_ask_ref', 'delta_ask_wap', 'delta_bid_wap',
 'ratio_bid_ask_size']

cols_ewm_mean = ['ask_size', 'matched_size', 'ratio_bid_ask_size', 'delta_bid_ask', 'delta_bid_ask_size']

cols_ewm_std = ['bid_price', 'reference_price']

imbalancer = Delta_prices()

wawapper = wawap_computer()

diff_computer = Difference_Computer(step = 3, cols_lag = cols_for_lag, cols_diff = cols_for_diff, old_features=True,  deltafeats=True, deltaold = False)

imbalancer2 = Imbalancer_2(triplets_imb2 = triplets_imb2, triplets_imb3 = triplets_imb3)

rolling_computer = Rolling_mean(cols = cols_for_rolling, cols_range = cols_for_rolling_range, macd_style = False, rolling_range = True)

macd_computer = MACD_computer_v2(cols =cols_for_MACD )

# Instances get their own names so that re-running this cell does not overwrite the classes
grouped_aggs_step = grouped_aggs(cols = cols_for_grouped_aggs, funcs = ['mean'])

grouped_aggs_less_step = grouped_aggs_less(cols = cols_for_grouped_aggs_less, funcs = ['mean'])

expanding = Expanding_feats(cols = cols_for_expanding, subtract_to_feat = True, funcs = ['median'])

rankin = rankino(cols = cols_for_rank)

grouped_std = grouped_aggs_std(cols = cols_grouped_std)

sector = Sectors(map = stock_id_map, cols = ['ask_size', 'bid_size', 'matched_size', 'imbalance_signed', 'reference_price'])

ewm = Ewm_feats(cols_mean = cols_ewm_mean, cols_std = cols_ewm_std, span = 2)


pipeline = Pipeline([
    ('imbalancer', imbalancer),
    ('wawapper', wawapper),
    ('diff_comp', diff_computer),
    ('imbalancer2', imbalancer2),
    #('log_computer', log_return),
    ('macd_computer', macd_computer),
    ('sector', sector),
    ('ewm', ewm),
    ('expanding', expanding),
    ('grouped_aggs', grouped_aggs_step),
    ('grouped_aggs_less', grouped_aggs_less_step),
    ('rolling', rolling_computer),
    #('rank', rankin),
    ('grouped_std', grouped_std),
    ('fillna', Fillna())
    

])
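
The pipeline above chains stateless transformers, each of which takes a DataFrame and returns it with extra columns. The sketch below shows that composition pattern on a toy frame. The three step classes are hypothetical, simplified stand-ins for `Delta_prices` and `Fillna`, and the sketch calls `fit_transform` rather than `transform` because newer scikit-learn releases may warn when `transform` is called on a pipeline that was never fitted.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SpreadFeature(BaseEstimator, TransformerMixin):
    """Hypothetical step: normalised bid-ask difference, as in Delta_prices."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X, y=None):
        return X.eval("delta_bid_ask = (bid_price - ask_price) / (bid_price + ask_price)")


class SizeRatio(BaseEstimator, TransformerMixin):
    """Hypothetical step: bid/ask size ratio, infinite when ask_size is 0."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.assign(ratio_bid_ask_size=X["bid_size"] / X["ask_size"])


class ReplaceInf(BaseEstimator, TransformerMixin):
    """Hypothetical step mirroring Fillna: cap infinities, then fill NaN with 0."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.replace([np.inf, -np.inf], [99999, -99999]).fillna(0)


toy_pipeline = Pipeline([
    ("spread", SpreadFeature()),
    ("ratio", SizeRatio()),
    ("fillna", ReplaceInf()),
])

toy = pd.DataFrame({
    "bid_price": [0.99, 1.00],
    "ask_price": [1.01, 1.00],
    "bid_size": [100.0, 50.0],
    "ask_size": [0.0, 50.0],
})
out = toy_pipeline.fit_transform(toy)
print(out)
```

Order matters in such a chain: a step can only use columns created by earlier steps, which is why the infinity-replacing step comes last.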

In [16]:
cols_to_drop = ['date_id', 'row_id', 'imbalance_buy_sell_flag', 'imbalance_size', 'w_a_wap', 'time_id', 'target','mean_delta_wap_ref_less_delta_wap_ref', 'mean_delta_ask_wap_less_delta_ask_wap','mean_delta_bid_ask_size_less_delta_bid_ask_size'] 

In [17]:
df = df.astype({'stock_id' : np.int16, 'date_id' : np.int16, 'seconds_in_bucket':np.int16, 'imbalance_size':'float32',
       'imbalance_buy_sell_flag':np.int16, 'reference_price':'float32', 'matched_size':'float32',
       'far_price':'float32', 'near_price':'float32', 'bid_price':'float32', 'bid_size':'float32', 'ask_price':'float32',
       'ask_size':'float32', 'wap':'float32', 'target':'float32', 'time_id':np.int16})
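
The cast above trades precision for memory. A minimal sketch of the effect on a toy frame follows; the column values are invented, and the exact savings on the real training data will differ.

```python
import numpy as np
import pandas as pd

# Toy frame with explicit 64-bit dtypes, standing in for freshly loaded CSV data.
toy = pd.DataFrame({
    "stock_id": (np.arange(1000) % 200).astype("int64"),
    "wap": np.linspace(0.99, 1.01, 1000).astype("float64"),
})
before = toy.memory_usage(deep=True).sum()

# Same style of downcast as in the cell above.
small = toy.astype({"stock_id": np.int16, "wap": "float32"})
after = small.memory_usage(deep=True).sum()

# Largest rounding error introduced by storing wap as float32.
max_error = (small["wap"].astype("float64") - toy["wap"]).abs().max()
print(before, after, max_error)
```

Note that `int16` only holds values up to 32767, so the cast is safe only while every identifier column stays below that bound.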

### Submission 

In [18]:
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()

ModuleNotFoundError: No module named 'optiver2023'

In [19]:
df = df.query("date_id < 478 & date_id >= 100")

In [20]:
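counter = 0
last_date_id = 477
current_date_id = 478
last_retraining = 0
# Assumption: original_columns is used below but never defined in this notebook;
# here it is taken to be the raw training columns before any feature engineering
original_columns = df.columns.tolist()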

for (test, revealed_targets, sample_prediction) in iter_test:

    
    test = test.reset_index(drop = True)
    test = test.astype({'stock_id' : np.int16, 'date_id' : np.int16, 'seconds_in_bucket':np.int16, 'imbalance_size':'float32',
       'imbalance_buy_sell_flag':np.int16, 'reference_price':'float32', 'matched_size':'float32',
       'far_price':'float32', 'near_price':'float32', 'bid_price':'float32', 'bid_size':'float32', 'ask_price':'float32',
       'ask_size':'float32', 'wap':'float32'})
    currently_scored = test.pop('currently_scored')

    df = pd.concat([df, test], axis = 0, ignore_index = True)

    current_second = test.loc[0, 'seconds_in_bucket']

    if len(revealed_targets) > 10:
        revealed_targets = revealed_targets.reset_index(drop = True)
        current_date_id = revealed_targets.loc[0, 'date_id']
        revealed_targets = revealed_targets.drop('date_id', axis = 1)
        if counter == 0:
            
            tgts = revealed_targets.astype({'stock_id' : np.int16, 'revealed_date_id': np.int16, 'revealed_time_id': np.int16, 
                                            'seconds_in_bucket': np.int16, 'revealed_target' : 'float32'})
        else:
            
            tgts = pd.concat([tgts, revealed_targets]).astype({'stock_id' : np.int16, 'revealed_date_id': np.int16, 'revealed_time_id': np.int16, 
                                            'seconds_in_bucket': np.int16, 'revealed_target' : 'float32'})
            
        df = df.merge(tgts, how = 'left', left_on = ['stock_id', 'date_id',  'seconds_in_bucket'], 
                 right_on = ['stock_id', 'revealed_date_id', 'seconds_in_bucket'])
        
        df.target = df.target.fillna(df['revealed_target'])
        
        df = df.drop(['revealed_target', 'revealed_date_id', 'revealed_time_id', ], axis = 1)

        
    if (currently_scored.loc[0] == True) & ((current_date_id - last_retraining) > 22):
        
        last_retraining = int(current_date_id)  # works for both Python ints and numpy integers
        df = df[original_columns]
        df_tgt_agg = df.groupby(['date_id', 'stock_id'])['target'].agg(['mean', 'std', 'median', 'skew'])\
        .reset_index().rename(columns = {'mean':'tgt_1_mean', 'std':'tgt_1_std', 'median':'tgt_1_median', 'skew':'tgt_1_skew'})
        df_tgt_agg['date_id'] += 1
        df = df.merge(df_tgt_agg, how = 'left', on = ['date_id', 'stock_id'])
        df['tgt_1'] = df.groupby([ 'stock_id', 'seconds_in_bucket'])['target'].shift(1)

        
        transf = pipeline.transform(df)
        transf = transf.dropna(subset = 'target')

        pred_columns = [col for col in transf.columns if col not in cols_to_drop]
         
        transf = transf[pred_columns + ['target'] + ['date_id']]
        
        #lgbm retrain
        import lightgbm as lgb
        train_set = lgb.Dataset(data = transf.query(f'date_id >= 130')[pred_columns], label = transf.query(f'date_id >= 130')['target'])
        val_set = lgb.Dataset(data = transf.query(f'date_id < 130')[pred_columns], label = transf.query(f'date_id < 130')['target'])
        del transf

        
        lgb_params = {"objective" : "mae", "n_estimators" : 9999, "random_state":1021, 
                      "num_leaves" : 378, "subsample" : 0.4, "colsample_bytree" : 0.4, "learning_rate" : 0.008
                      }
        
        stopping = lgb.early_stopping(100, first_metric_only=False, verbose=True)
        lgbmregr = lgb.train(params = lgb_params, train_set = train_set, valid_sets = [val_set], callbacks = [stopping])
        del train_set, val_set
        df = df[original_columns]
        

        
    if currently_scored.loc[0] == True:
        df1 = df.query(f"date_id >= {current_date_id - 1}")
        df_tgt_agg = df1.groupby(['date_id', 'stock_id'])['target'].agg(['mean', 'std', 'median', 'skew'])\
        .reset_index().rename(columns = {'mean':'tgt_1_mean', 'std':'tgt_1_std', 'median':'tgt_1_median', 'skew':'tgt_1_skew'})
        df_tgt_agg['date_id'] += 1
        df1 = df1.merge(df_tgt_agg, how = 'left', on = ['date_id', 'stock_id'])
        df1['tgt_1'] = df1.groupby([ 'stock_id', 'seconds_in_bucket'])['target'].shift(1)
        df1 = df1.query(f"(date_id >= {current_date_id - 1} & seconds_in_bucket >= 350) | date_id == {current_date_id}")
        
        df1 = pipeline.transform(df1)
        X = df1.query(f"date_id == {current_date_id} & seconds_in_bucket == {current_second}")

        pred_columns = [col for col in X.columns if col not in cols_to_drop]

        X = X[pred_columns]
        

        preds = lgbmregr.predict(X)
        sample_prediction['target'] =  preds - preds.mean()

    else:
        sample_prediction['target'] = 0

    env.predict(sample_prediction)
    
    
    counter += 1 

NameError: name 'iter_test' is not defined
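
The `optiver2023` module only exists inside the Kaggle competition environment, which is why the loop above fails locally. For offline testing, a stand-in generator can replay the training data bucket by bucket. The sketch below is hypothetical and is not the `optiver2023` API: `mock_iter_test` is an invented name, and it assumes that the previous day's targets are revealed at the first bucket of each day. The column names of the yielded frames follow the ones the loop above reads.

```python
import pandas as pd


def mock_iter_test(frame):
    """Hypothetical local stand-in for env.iter_test(); not the optiver2023 API."""
    grouped = frame.groupby(["date_id", "seconds_in_bucket"], sort=True)
    for (date_id, seconds), bucket in grouped:
        test = bucket.drop(columns=["target"]).assign(currently_scored=True)
        if seconds == 0:
            # Assumption: the previous day's targets arrive with the first bucket of a day.
            previous = frame[frame["date_id"] == date_id - 1]
            revealed = (
                previous[["stock_id", "seconds_in_bucket", "date_id", "time_id", "target"]]
                .rename(columns={"date_id": "revealed_date_id",
                                 "time_id": "revealed_time_id",
                                 "target": "revealed_target"})
                .assign(date_id=date_id)
            )
        else:
            revealed = pd.DataFrame()
        sample_prediction = pd.DataFrame({"row_id": bucket["row_id"], "target": 0.0})
        yield (test.reset_index(drop=True),
               revealed.reset_index(drop=True),
               sample_prediction.reset_index(drop=True))


# Toy data: 2 days x 2 time buckets x 2 stocks, invented for illustration.
rows = []
time_id = 0
for date_id in (0, 1):
    for seconds in (0, 10):
        for stock_id in (0, 1):
            rows.append({"stock_id": stock_id, "date_id": date_id,
                         "seconds_in_bucket": seconds, "time_id": time_id,
                         "row_id": f"{date_id}_{seconds}_{stock_id}",
                         "wap": 1.0, "target": 0.5 * stock_id})
        time_id += 1
toy = pd.DataFrame(rows)

batches = list(mock_iter_test(toy))
for test, revealed, sample_prediction in batches:
    print(test["date_id"].iloc[0], test["seconds_in_bucket"].iloc[0], len(revealed))
```

A replay like this only checks that the loop logic runs end to end; it says nothing about the timing or memory limits of the real environment.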