## Preprocessing

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
'''
Welcome to the Optiver 2023 MoC Challenge! This notebook lets you explore the challenge data, then research and design your own factors
using the training data and the helper functions we built for you.
Feel free to play around with the data and get familiar with it!
'''

# Import the necessary libraries here. You can import any libraries you need for your beautiful factors!
from factor_design_ver2_beta import utils
from factor_design_ver2_beta import factor_design
from factor_design_ver2_beta import factor_backtest
# from joblib import Parallel, delayed  # still to be optimized
import numpy as np
import pandas as pd
from tqdm import tqdm

In [8]:
# DON'T RE-RUN THIS CELL TOO OFTEN! Loading the training data takes a while.
# Please make sure the training data file with this name is in the same folder as this notebook.
# You can play around with this file for your own research and factor design.
df_train = pd.read_csv('./research_train_set.csv')

# We built an index map (column name -> column index) to speed up factor design; you don't need to worry about this.
col2index_map = utils.load_json('./factor_design_ver2_beta/col2index_map.json')

## Factor Design

In [3]:
# The df_train you loaded has been transformed into a dictionary to speed up the research process; this cell loads that dictionary.
# You don't need to worry about this. This can take up to half a minute to run. DON'T RE-RUN THIS CELL TOO OFTEN!
df_train_dic_sorted = utils.load_json('./factor_design_ver2_beta/df_train_dic_sorted.json')

In [5]:
'''
In this part you will design your own factors! We have provided some helper functions to speed up the process.
We look forward to receiving your beautiful factors!
'''
# Before you start, we provide you with a sample factor design here.
# Please STRICTLY FOLLOW the instructions below to design your own factors!
'''
1. You can design your own factors based on the training data given. However, you CANNOT use the target variable (i.e. 'target') in your factor design!!
2. You can use any public libraries you want to design your factors.
3. Each factor should follow this format:
def your_factor_name(current_data: dict, hist_list: list) -> np.ndarray:
    The training data from the Kaggle competition contains stock features for the last 9 minutes (540 seconds) of multiple
    trading days. Within each trading day, the data is ordered by time bucket: there are 55 time buckets of 10 seconds each, and every bucket holds
    the features of all available stocks. For example, the first 10 seconds of the first trading day come first, then the second 10 seconds, and so on.
    
    We predict the target variable (i.e. 'target') for each 10-second bucket on each day (a time snapshot).
    
    At submission time, the test data arrives one time snapshot at a time, with the same features for all available stocks, and we can predict only from
    the snapshots seen so far. The current_data in the sample function is the data for the current time snapshot, for example, the data from 0s to 10s of the
    first trading day for all stocks. The hist_list defaults to an empty list and IS DESIGNED for you to STORE any previous data your factor
    needs. For example, you can build a time-series factor that uses data from previous time snapshots by storing that data in the
    hist_list.


    The return value is the factor value you calculated. It HAS TO BE an np.ndarray with shape (n_stocks, ).
'''
def origin_ask_size(current_data: dict, hist_list = []) -> np.ndarray:
    current_data = np.array(list(current_data.values()), dtype=float).T
    current_data = np.nan_to_num(current_data)
    return current_data[:, col2index_map['ask_size']]
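
The instructions above describe `hist_list` as a place to keep data from earlier snapshots. A minimal sketch of such a time-series factor follows. The name `spread_change`, the two-column `col2index_map`, and the toy snapshots are all assumptions made for illustration; the sketch also assumes the caller passes the same list on every call and that stocks appear in the same order in consecutive snapshots.

```python
import numpy as np

# Hypothetical column layout for this sketch; the notebook loads the real one
# from col2index_map.json.
col2index_map = {"bid_price": 0, "ask_price": 1}


def spread_change(current_data: dict, hist_list: list) -> np.ndarray:
    """Change in bid-ask spread since the previous snapshot (0 on the first call)."""
    data = np.array(list(current_data.values()), dtype=float).T
    spread = data[:, col2index_map["ask_price"]] - data[:, col2index_map["bid_price"]]
    spread[np.isnan(spread)] = 0

    # Compare with the previous snapshot only if one exists and has the same shape.
    if hist_list and hist_list[-1].shape == spread.shape:
        res = spread - hist_list[-1]
    else:
        res = np.zeros_like(spread)

    hist_list.append(spread)  # store the current spread for the next call
    return res


# Toy usage with two made-up snapshots of two stocks each.
history = []
snap1 = {"bid_price": [1.0, 2.0], "ask_price": [1.1, 2.3]}
snap2 = {"bid_price": [1.0, 2.0], "ask_price": [1.3, 2.2]}
first = spread_change(snap1, history)
second = spread_change(snap2, history)
```

Passing the list explicitly, as done here, keeps the stored history visible to the caller; how the challenge's own runner supplies `hist_list` is not shown in this notebook.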

In [4]:
def mid_price(current_data: dict, hist_list = []) -> np.ndarray:
    '''
    This will be the main function to design your factors for the competition. Please
    define only one factor here each time. We provide you with:

    current_data: a dictionary in the format of {column_name: column_value}, where column_name is from the original
    dataframe

    hist_list: A list for you to save the previous factor values (optional). For instance,
    if you are calculating a 100-day Moving Average (MA), then you can save the first calculated
    MA in hist_list, and then for the next MA calculation, you can use the saved ones.
    '''
    ###################### ADD YOUR CODE HERE FOR FACTORS DESIGN ######################
    # convert the current_data to your choice of numpy or pandas dataframe
    # current_data = pd.DataFrame(current_data)
    current_data = np.array(list(current_data.values()),
                            dtype=float).T  # this is faster than pd.DataFrame(current_data).values
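    # NOTE: despite the factor name, this computes the bid-ask spread (ask - bid), not the mid price (ask + bid) / 2.
    res = current_data[:, col2index_map['ask_price']] - current_data[:, col2index_map['bid_price']]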
    ############################## NAN/Inf handling ######################################
    # if you have nan in your factor value, please fill it reasonably
    # res = np.nan_to_num(res) # this is slow because it also checks for inf.
    # res = np.where(np.isnan(res), 0, res)  # this is slightly faster than np.nan_to_num
    res[np.isnan(res)] = 0  # this is the fastest way to fill nan with 0
    ############################## END OF YOUR CODE ##############################
    return res  # The return value MUST BE a numpy array, with no NaN value
    ####################################################################################
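
The `mid_price` cell above subtracts the bid from the ask, which is the bid-ask spread rather than a mid price. The sketch below puts the two quantities side by side. The function names `bid_ask_spread` and `true_mid_price`, the two-column `col2index_map`, and the toy snapshot are assumptions for illustration only, and filling NaN with 0 simply follows the convention used in this notebook.

```python
import numpy as np

# Hypothetical column layout for this sketch; the notebook loads the real one
# from col2index_map.json.
col2index_map = {"bid_price": 0, "ask_price": 1}


def _to_array(current_data: dict) -> np.ndarray:
    # Rows are stocks, columns follow the order of the dictionary keys.
    return np.array(list(current_data.values()), dtype=float).T


def bid_ask_spread(current_data: dict, hist_list=None) -> np.ndarray:
    data = _to_array(current_data)
    res = data[:, col2index_map["ask_price"]] - data[:, col2index_map["bid_price"]]
    res[np.isnan(res)] = 0
    return res


def true_mid_price(current_data: dict, hist_list=None) -> np.ndarray:
    data = _to_array(current_data)
    res = (data[:, col2index_map["ask_price"]] + data[:, col2index_map["bid_price"]]) / 2
    res[np.isnan(res)] = 0
    return res


# Toy snapshot of three stocks; the third has a missing bid.
snapshot = {"bid_price": [1.0, 2.0, np.nan], "ask_price": [1.2, 2.4, 3.0]}
spread = bid_ask_spread(snapshot)
mid = true_mid_price(snapshot)
```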


In [5]:
def s1_imbalance(current_data: dict, hist_list = []) -> np.ndarray:
    '''
    This will be the main function to design your factors for the competition. Please
    define only one factor here each time. We provide you with:

    current_data: a dictionary in the format of {column_name: column_value}, where column_name is from the original
    dataframe

    hist_list: A list for you to save the previous factor values (optional). For instance,
    if you are calculating a 100-day Moving Average (MA), then you can save the first calculated
    MA in hist_list, and then for the next MA calculation, you can use the saved ones.
    '''
    ###################### ADD YOUR CODE HERE FOR FACTORS DESIGN ######################
    # convert the current_data to your choice of numpy or pandas dataframe
    # current_data = pd.DataFrame(current_data)
    current_data = np.array(list(current_data.values()),
                            dtype=float).T  # this is faster than pd.DataFrame(current_data).values
    res = (current_data[:, col2index_map['bid_size']] - current_data[:, col2index_map['ask_size']])/\
          (current_data[:, col2index_map['bid_size']] + current_data[:, col2index_map['ask_size']])
    ############################## NAN/Inf handling ######################################
    # if you have nan in your factor value, please fill it reasonably
    # res = np.nan_to_num(res) # this is slow because it also checks for inf.
    # res = np.where(np.isnan(res), 0, res)  # this is slightly faster than np.nan_to_num
    res[np.isnan(res)] = 0  # this is the fastest way to fill nan
    ############################## END OF YOUR CODE ##############################
    return res  # The return value MUST BE a numpy array, with no NaN value
    ####################################################################################
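
The ratio in `s1_imbalance` divides by `bid_size + ask_size`, which produces NaN (and a NumPy warning) whenever both sizes are zero. One way to avoid the invalid division, sketched here under the assumption that 0 is an acceptable fill value, is to divide only where the denominator is usable. The helper name `safe_imbalance` is hypothetical and is not part of the challenge helpers.

```python
import numpy as np


def safe_imbalance(bid_size, ask_size) -> np.ndarray:
    """(bid - ask) / (bid + ask), with 0 wherever the ratio is undefined."""
    bid = np.asarray(bid_size, dtype=float)
    ask = np.asarray(ask_size, dtype=float)
    num = bid - ask
    den = bid + ask

    out = np.zeros_like(num)  # default value for undefined entries
    valid = np.isfinite(num) & np.isfinite(den) & (den != 0)
    # Divide only where `valid` is True; other entries keep the 0 from `out`.
    np.divide(num, den, out=out, where=valid)
    return out


# Toy usage: a normal quote, an empty book, and a missing bid size.
imbalance = safe_imbalance([3.0, 0.0, np.nan], [1.0, 0.0, 2.0])
```

Because the invalid entries are never divided, no post-hoc NaN filling is needed for this factor.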


In [6]:
'''
Each factor should be defined as a function as described above. Once you have designed your factors and are ready to test them,
simply ADD each factor name to new_factor_list below and run the cell. The backtest result will be printed out for you to see!
'''
# new_factor_list = ['origin_ask_size']
new_factor_list = ['mid_price', 's1_imbalance']

In [9]:
# DO NOT MODIFY THE FOLLOWING CODE
# Run this cell when you want to calculate your factor values and prepare to test your factors' performance!
new_factors = {factor_name: utils.flatten_factor_value(factor_design.run_factor_value(df_train_dic_sorted, eval(factor_name), factor_name), factor_name)[factor_name] for factor_name in tqdm(new_factor_list)}


  0%|          | 0/2 [00:00<?, ?it/s]

Start calculating factor mid_price
Finished calculating factor mid_price for 0 dates
Finished calculating factor mid_price for 100 dates
Finished calculating factor mid_price for 200 dates
Finished calculating factor mid_price for 300 dates
Finished calculating factor mid_price for 400 dates
Accepted!!: Used 6.67 seconds for calculation factors. The limit is 300 seconds.




Start calculating factor s1_imbalance
Finished calculating factor s1_imbalance for 0 dates


  res = (current_data[:, col2index_map['bid_size']] - current_data[:, col2index_map['ask_size']])/\


Finished calculating factor s1_imbalance for 100 dates
Finished calculating factor s1_imbalance for 200 dates
Finished calculating factor s1_imbalance for 300 dates
Finished calculating factor s1_imbalance for 400 dates
Accepted!!: Used 5.61 seconds for calculation factors. The limit is 300 seconds.


100%|██████████| 2/2 [00:14<00:00,  7.36s/it]


In [10]:
# Check that no factor in new_factors contains NaN; NaN values are not allowed in factor values.
for factor_name, factor_value in new_factors.items():
    assert not np.isnan(factor_value).any(), f'{factor_name} contains nan'
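
The assertion above catches NaN only. Since the factor cells skip `np.nan_to_num` because it also handles infinities, a slightly stricter check may be useful. The sketch below uses `np.isfinite`, which rejects NaN and +/-inf together. The helper name `find_non_finite_factors` and the demo dictionary are assumptions for illustration.

```python
import numpy as np


def find_non_finite_factors(factors: dict) -> list:
    """Return the names of factors that contain NaN or +/-inf."""
    bad = []
    for name, values in factors.items():
        arr = np.asarray(values, dtype=float)
        if not np.isfinite(arr).all():
            bad.append(name)
    return bad


# Toy factor dictionary with made-up values.
demo_factors = {
    "ok": np.array([0.1, -0.2]),
    "has_nan": np.array([np.nan, 1.0]),
    "has_inf": np.array([np.inf, 1.0]),
}
bad_names = find_non_finite_factors(demo_factors)
```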

## Factor Backtesting

In [11]:
'''
The factor_backtest module provides a Factor_Backtest object that can be used to backtest your factors. It takes three arguments:

    existed_factors: a dictionary of all factors that have already passed, with factor names as keys and factor values as values
    testing_factors: a dictionary of all the factors waiting to be tested
    factor_performance: a dictionary of each factor's performance score (the Pearson correlation coefficient
    between the factor values and the corresponding target vector)
'''
# First, load the existing factors and their performance scores.
# This takes about 12 seconds; you only need to run this cell once.
existed_factors = utils.load_json_factors('./factor_design_ver2_beta/existed_factors.json')
factor_performance = utils.load_json('./factor_design_ver2_beta/factor_performance.json')
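
The performance score described above is a Pearson correlation between factor values and the target vector. As a rough illustration of that score only, and not of how `Factor_Backtest` computes or thresholds it internally, the sketch below uses `np.corrcoef` on synthetic data. The helper name `pearson_score` and the toy arrays are assumptions.

```python
import numpy as np


def pearson_score(factor_values, target) -> float:
    """Pearson correlation coefficient between a factor and the target."""
    x = np.asarray(factor_values, dtype=float)
    y = np.asarray(target, dtype=float)
    # corrcoef returns a 2x2 matrix; the off-diagonal entry is corr(x, y).
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic target, plus one factor that moves with it and one that moves against it.
target = np.array([1.0, 2.0, 3.0, 4.0])
aligned = pearson_score(2 * target + 1, target)
opposed = pearson_score(-target, target)
```

A factor that is an increasing linear function of the target scores close to 1, and a decreasing one scores close to -1; real factors, like the ones tested in this notebook, land much nearer to 0.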

In [12]:
demo_backtest = factor_backtest.Factor_Backtest(existed_factors=existed_factors, testing_factors=new_factors, factor_performance=factor_performance)

In [13]:
demo_backtest.run_testing() # this will print out the in-sample performance of your factors

Factor mid_price failed in-sample performance check with correlation coefficient 0.007206974790691211
Factor s1_imbalance failed in-sample performance check with correlation coefficient -0.11723116519487425
Time used for checking in-sample performance: 0.7930006980895996 seconds
Number of factors passed this step: 0
Correlation coefficient of each test factor: {'mid_price': 0.007206974790691211, 's1_imbalance': -0.11723116519487425}
Factors passed this step: []
Factors failed this step: ['mid_price', 's1_imbalance']
Factor mid_price passed in-sample correlation check with all existed factors
Factor mid_price has in-sample performance {'origin_seconds_in_bucket': -0.15319060735379686, 'origin_imbalance_size': -0.07807498693573299, 'origin_imbalance_buy_sell_flag': 0.00909503691500761, 'origin_reference_price': 0.015928115729871267, 'origin_matched_size': -0.13553737979085623, 'origin_far_price': -0.08587093457834077, 'origin_near_price': -0.11991574616747172, 'origin_bid_price': -0.0269