# ML Pipeline: Feature Selection & Hyperparameter Tuning

This notebook walks through the core machine learning workflow of the trading research framework: ingesting data, generating several hundred technical indicators, reducing them with a multi-stage feature selection cascade (L1 regularization, Random Forest importance, MRMR, and Sequential Feature Selection), and iteratively tuning hyperparameters with Optuna.

In [1]:
import stockml as st
from stockml import dataset as st_dt
from stockml import optimizations as st_op
from stockml import utils as st_ut
from stockml.optimizations import select_features
import warnings
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
from datetime import date, datetime

%reload_ext autoreload
%autoreload 2
warnings.filterwarnings('ignore')
pio.templates.default = "plotly_dark"


### 1. Project Setup & Data Ingestion

The first step is to connect to the data sources. This involves pulling contract metadata and historical price data from the Interactive Brokers API and storing it in a local PostgreSQL database, managed via SQLAlchemy. This ensures data is persistent and efficiently accessible for subsequent research.

- `pull_ib_contract_list`: Fetches a list of available assets from the API.
- `get_all_rows(st.Stocks)`: Retrieves the stored asset metadata from the SQL database.
- `pull_ib_stock_data`: Downloads historical OHLCV data for a specific asset (UPRO in this case).
- `write_historical_data`: Writes the newly downloaded data into the `StockPrices30m` table in the database.
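
The write step can be pictured with a minimal sketch. It assumes SQLite from the standard library in place of PostgreSQL and SQLAlchemy, and a hypothetical `write_prices` helper and `stock_prices_30m` table that are not part of `stockml`. It builds the `price_id` key in the `ticker datasource timestamp` shape seen in the price tables of this notebook and skips rows that are already stored, so repeating a download does not create duplicates.

```python
import sqlite3

import pandas as pd


def write_prices(conn, bars, ticker, source):
    """Hypothetical helper: insert bars keyed by price_id and return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stock_prices_30m ("
        "price_id TEXT PRIMARY KEY, datasource TEXT, ticker TEXT, timestamp TEXT, "
        "open REAL, high REAL, low REAL, close REAL, volume REAL)"
    )
    columns = ["timestamp", "open", "high", "low", "close", "volume"]
    rows = [
        (f"{ticker} {source} {ts}", source, ticker, str(ts),
         float(o), float(h), float(l), float(c), float(v))
        for ts, o, h, l, c, v in bars[columns].itertuples(index=False, name=None)
    ]
    before = conn.execute("SELECT COUNT(*) FROM stock_prices_30m").fetchone()[0]
    # INSERT OR IGNORE skips rows whose price_id is already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO stock_prices_30m VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM stock_prices_30m").fetchone()[0]
    return after - before


bars = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2023-12-08 01:00:00", "2023-12-08 01:30:00"]),
        "open": [48.94, 49.07],
        "high": [49.12, 49.08],
        "low": [48.93, 49.06],
        "close": [49.07, 49.08],
        "volume": [17019.0, 1260.0],
    }
)

conn = sqlite3.connect(":memory:")
first_write = write_prices(conn, bars, "UPRO", "ibapi")
second_write = write_prices(conn, bars, "UPRO", "ibapi")  # same bars again: nothing new
stored_ids = [
    row[0] for row in conn.execute("SELECT price_id FROM stock_prices_30m ORDER BY timestamp")
]
```

Keying on `price_id` is what makes the overlap between successive downloads harmless, which matters when the API returns a rolling window of history.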

In [None]:
con_list = st.interactive_brokers.pull_ib_contract_list()
con_list.head(20)

symbol,con_id,company_name,scan_data,last_price,listing_exchange,sec_type
SPY,756733,SPDR S&P 500 ETF TRUST,5.910M,1.151,ARCA,STK
QQQ,320227571,INVESCO QQQ TRUST SERIES 1,3.483M,1.4247,NASDAQ.NMS,STK
IWM,9579970,ISHARES RUSSELL 2000 ETF,1.148M,1.0515,ARCA,STK
TQQQ,72539702,PROSHARES ULTRAPRO QQQ,285.711K,0.8635,NASDAQ.NMS,STK
SOXL,73340487,DIREXION DAILY SEMI BULL 3X,169.939K,0.4185,ARCA,STK
XLF,4215220,FINANCIAL SELECT SECTOR SPDR,162.904K,0.472,ARCA,STK
EWZ,10753244,ISHARES MSCI BRAZIL ETF,162.761K,0.4362,ARCA,STK
ARKK,172522644,ARK INNOVATION ETF,146.739K,0.4279,ARCA,STK
FXI,31421120,ISHARES CHINA LARGE-CAP ETF,144.247K,0.4436,ARCA,STK
KWEB,132310537,KRANESHARES CSI CHINA INTERN,122.530K,0.6799,ARCA,STK


In [None]:
con_details = st.contract_details(con_list.loc['SPY', 'con_id'])
st.write_stock_metadata(con_details, st.Stocks)

stocks_df = st.get_all_rows(st.Stocks)
stocks_df

,ticker,conid,currency,listing_exchange,country_code,name,asset_class,group,sector,sector_group,type,has_options
0,UDOW,72539713,USD,ARCA,US,PROSHARES ULTRAPRO DOW30,STK,,,,ETF,True
1,SMH,229725622,USD,NASDAQ,US,VANECK SEMICONDUCTOR ETF,STK,,,,ETF,True
2,UPRO,61228752,USD,ARCA,US,PROSHARES ULTRAPRO S&P 500,STK,,,,ETF,True
3,FAS,97276826,USD,ARCA,US,DIREXION DAILY FIN BULL 3X,STK,,,,ETF,True
4,XLY,4215215,USD,ARCA,US,CONSUMER DISCRETIONARY SELT,STK,,,,ETF,True
5,XLP,4215210,USD,ARCA,US,CONSUMER STAPLES SPDR,STK,,,,ETF,True
6,SPY,756733,USD,ARCA,US,SPDR S&P 500 ETF TRUST,STK,,,,ETF,True


In [None]:
# Request OHLCV data from the ibapi, specifically for UPRO (row 2 of stocks_df)
con_id = stocks_df.iat[2,1]
ticker = stocks_df.iat[2,0]

stock_data = st.pull_ib_stock_data(con_id, ticker)

#Write the OHLCV data into a Stock Price SQL Table
st.write_historical_data(stock_data, st.StockPrices30m)
stock_data

,price_id,datasource,ticker,timestamp,open,high,low,close,volume
0,UPRO ibapi 2023-12-08 01:00:00,ibapi,UPRO,2023-12-08 01:00:00,48.94,49.12,48.93,49.07,17019.0
1,UPRO ibapi 2023-12-08 01:30:00,ibapi,UPRO,2023-12-08 01:30:00,49.07,49.08,49.06,49.08,1260.0
2,UPRO ibapi 2023-12-08 02:00:00,ibapi,UPRO,2023-12-08 02:00:00,49.10,49.13,49.06,49.13,8200.0
3,UPRO ibapi 2023-12-08 02:30:00,ibapi,UPRO,2023-12-08 02:30:00,49.15,49.17,49.14,49.14,7522.0
4,UPRO ibapi 2023-12-08 03:00:00,ibapi,UPRO,2023-12-08 03:00:00,49.14,49.14,49.08,49.08,4579.0
...,...,...,...,...,...,...,...,...,...
8013,UPRO ibapi 2024-12-06 14:30:00,ibapi,UPRO,2024-12-06 14:30:00,99.42,99.42,99.42,99.42,0.0
8014,UPRO ibapi 2024-12-06 15:00:00,ibapi,UPRO,2024-12-06 15:00:00,99.39,99.39,99.20,99.20,3000.0
8015,UPRO ibapi 2024-12-06 15:30:00,ibapi,UPRO,2024-12-06 15:30:00,99.20,99.20,99.20,99.20,100.0
8016,UPRO ibapi 2024-12-06 16:00:00,ibapi,UPRO,2024-12-06 16:00:00,99.15,99.18,99.15,99.15,1555.0


In [None]:

#Get all of the historical data that we have stored for UPRO from the stock historical data table
stock_ml_df = st.get_filtered_data(st.StockPrices30m, 'ibapi', 'UPRO')
stock_ml_df

,price_id,datasource,ticker,timestamp,open,high,low,close,volume
0,UPRO ibapi 2023-10-16 01:00:00,ibapi,UPRO,2023-10-16 01:00:00,42.19,42.22,41.99,42.18,30851.0
1,UPRO ibapi 2023-10-16 01:30:00,ibapi,UPRO,2023-10-16 01:30:00,42.26,42.39,42.26,42.36,2474.0
2,UPRO ibapi 2023-10-16 02:00:00,ibapi,UPRO,2023-10-16 02:00:00,42.30,42.36,42.28,42.30,3900.0
3,UPRO ibapi 2023-10-16 02:30:00,ibapi,UPRO,2023-10-16 02:30:00,42.29,42.39,42.29,42.31,3300.0
4,UPRO ibapi 2023-10-16 03:00:00,ibapi,UPRO,2023-10-16 03:00:00,42.24,42.27,42.17,42.27,1792.0
...,...,...,...,...,...,...,...,...,...
8769,UPRO ibapi 2024-11-15 08:00:00,ibapi,UPRO,2024-11-15 08:00:00,89.93,90.35,89.84,90.29,52338.0
8770,UPRO ibapi 2024-11-15 08:30:00,ibapi,UPRO,2024-11-15 08:30:00,90.32,90.37,89.51,89.65,58416.0
8771,UPRO ibapi 2024-11-15 09:00:00,ibapi,UPRO,2024-11-15 09:00:00,89.72,89.72,89.38,89.44,49917.0
8772,UPRO ibapi 2024-11-15 09:30:00,ibapi,UPRO,2024-11-15 09:30:00,89.43,89.66,89.14,89.43,70932.0


### 2. Feature Engineering & Target Creation

#### 2.1. Generating the Feature Universe

With the raw price data loaded, we generate a comprehensive "feature universe." The `get_all_pandas_ta` function is a wrapper that calculates **all available technical indicators** from the `pandas-ta` library. This creates a high-dimensional dataset with hundreds of potential predictors, which will be the input for our feature selection process.
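
To make the idea of an indicator column concrete, here is a minimal sketch that computes two indicators by hand in pandas: a true range and a rolling z-score of the close. It is not how `get_all_pandas_ta` or `pandas-ta` compute them internally; the helper names and the window of 3 are assumptions chosen for a five-bar toy frame (the notebook's columns use windows such as 30).

```python
import pandas as pd

# Toy bars taken from the first rows of the target table shown later in this notebook.
ohlc = pd.DataFrame(
    {
        "high": [39.49, 39.50, 39.47, 39.67, 39.83],
        "low": [39.14, 39.38, 39.27, 39.27, 39.30],
        "close": [39.45, 39.46, 39.32, 39.65, 39.82],
    }
)


def true_range(df):
    """Largest of high-low, |high - previous close| and |low - previous close|."""
    prev_close = df["close"].shift(1)
    candidates = pd.concat(
        [df["high"] - df["low"], (df["high"] - prev_close).abs(), (df["low"] - prev_close).abs()],
        axis=1,
    )
    # max() skips NaN, so the first bar (no previous close) falls back to high - low.
    return candidates.max(axis=1)


def rolling_zscore(close, length):
    """Distance of the close from its rolling mean, in rolling standard deviations."""
    return (close - close.rolling(length).mean()) / close.rolling(length).std()


features = pd.DataFrame(
    {"true_range": true_range(ohlc), "zscore_3": rolling_zscore(ohlc["close"], 3)}
)
warmup_rows = int(features["zscore_3"].isna().sum())  # rows lost to the rolling window
```

Windowed indicators leave missing values at the start of the series, which is one reason to discard an initial stretch of rows before modelling, as the `[200:]` slice in the next section does.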


In [14]:
# Validate the data before calculating technical indicators by removing unneeded columns
ohlcv_df = st.validate(stock_ml_df)
print(ohlcv_df.duplicated().all().any(), ohlcv_df.isna().all().any())
print('\n')

#Calculate all of the technical indicators available in the pandas_ta library, store all the dataframes (one for each indicator category) in a dictionary
stock_tas = st.get_all_pandas_ta(ohlcv_df)

%store stock_tas

st_ut.tools.check_same_length_and_identify(stock_tas)

False False




3it [00:02,  1.10it/s]
1it [00:01,  1.19s/it]
39it [00:03, 10.17it/s]
32it [00:02, 12.10it/s]
2it [00:00,  2.25it/s]
10it [00:00, 10.67it/s]
14it [00:01,  9.68it/s]
14it [00:01, 12.39it/s]
14it [00:00, 15.54it/s]

Stored 'stock_tas' (dict)
All DataFrames have the same length.
All DataFrames have the same first and last index.





True

#### 2.2. Engineering the Target Variable

For this classification problem, our goal is to predict the direction of the next price move.

- **`generate_target_vars`**: Creates a `change_shift` feature, which is the percentage change of the *next* candle's closing price.
- **`eda.categorize_instances`**: This function discretizes the continuous `change_shift` into a categorical target with three classes: **-1 (Down), 0 (Neutral), and 1 (Up)**, based on specified quantile thresholds. This transforms the regression problem into a classification problem. In this instance we specify the thresholds to be at 33% and 66% in order to get a balanced set of classes.
- **`display_and_describe`**: A simple utility function that outputs a histogram and a statistical summary for a numerical column.
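
The two target steps above can be sketched in plain pandas. This is an illustration under assumptions, not the `stockml` implementation: the `categorize` helper is hypothetical, and whether values exactly on a threshold count as Neutral or not is a guess.

```python
import numpy as np
import pandas as pd

close = pd.Series([39.45, 39.46, 39.32, 39.65, 39.82, 39.75], name="close")

# Percent change of the NEXT candle's close relative to the current close.
# The last bar has no next candle, so it is dropped.
change_shift = (close.pct_change() * 100).shift(-1).dropna()


def categorize(values, lower_q=0.3333, upper_q=0.6666):
    """Hypothetical helper: -1 at or below the lower quantile, 1 at or above the upper, else 0."""
    lower, upper = values.quantile(lower_q), values.quantile(upper_q)
    encoded = np.select([values <= lower, values >= upper], [-1, 1], default=0)
    return pd.Series(encoded, index=values.index, name="change_encoded"), lower, upper


change_encoded, lower_bound, upper_bound = categorize(change_shift)
```

Because the thresholds are quantiles of the sample itself, the three classes come out roughly equal in size regardless of how volatile the instrument is.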

In [7]:
#Generate target variables (such as percent change in close or difference in closing prices, which are both shifted) using a dataframe that does not contain any technical indicators
X_df = stock_tas['df0']
y_df = st.generate_target_vars(st.validate(stock_ml_df[200:]))
y_df.head()

timestamp,open,high,low,close,volume,low_shift,open_shift,drop_shift,drop_percent_shift,high_shift,peak_shift,peak_percent_shift,change,high_diff,low_diff,diff,diff_shift,change_shift,close_shift
2023-10-24 05:00:00,39.36,39.49,39.14,39.45,91833.0,39.38,39.45,-0.07,-0.17744,39.5,0.05,0.126743,,,,,0.01,0.025349,39.46
2023-10-24 05:30:00,39.45,39.5,39.38,39.46,31439.0,39.27,39.47,-0.2,-0.506714,39.47,0.0,0.0,0.025349,0.01,0.24,0.01,-0.14,-0.35479,39.32
2023-10-24 06:00:00,39.47,39.47,39.27,39.32,78226.0,39.27,39.32,-0.05,-0.127162,39.67,0.35,0.890132,-0.35479,-0.03,-0.11,-0.14,0.33,0.839268,39.65
2023-10-24 06:30:00,39.32,39.67,39.27,39.65,823876.0,39.3,39.65,-0.35,-0.882724,39.83,0.18,0.453972,0.839268,0.2,0.0,0.33,0.17,0.428752,39.82
2023-10-24 07:00:00,39.65,39.83,39.3,39.82,745820.0,39.66,39.82,-0.16,-0.401808,39.93,0.11,0.276243,0.428752,0.16,0.03,0.17,-0.07,-0.175791,39.75


In [None]:
describe_dict = st.display_and_describe(y_df[['change_shift']])

,change_shift
nobs,8573.0
missing,0.0
mean,0.010355
std_err,0.004326
upper_ci,0.018834
lower_ci,0.001876
std,0.400551
iqr,0.301331
iqr_normal,0.223377
mad,0.246575
skew,-0.417509
kurtosis,23.136715
jarque_bera,145092.501342
jarque_bera_pval,0.0
mode,0.0
mode_freq,0.044558
median,0.0
1%,-1.198367
5%,-0.572781
10%,-0.360862


In [None]:
from stockml.dataset import eda
y_copy = eda.categorize_instances(y_df, 0.3333, 0.6666)
y_copy

Lower bound value:-0.0733//Upper bound value:0.1008


timestamp,change_shift,diff_shift,change_encoded
2023-10-24 05:00:00,0.025349,0.01,0
2023-10-24 05:30:00,-0.354790,-0.14,-1
2023-10-24 06:00:00,0.839268,0.33,1
2023-10-24 06:30:00,0.428752,0.17,1
2023-10-24 07:00:00,-0.175791,-0.07,-1
...,...,...,...
2024-11-15 07:30:00,0.444988,0.40,1
2024-11-15 08:00:00,-0.708827,-0.64,-1
2024-11-15 08:30:00,-0.234244,-0.21,-1
2024-11-15 09:00:00,-0.011181,-0.01,0


### 3. Advanced Feature Selection

A high-dimensional feature set can lead to overfitting and poor model performance. The following steps implement a multi-stage feature selection cascade to systematically reduce the feature space.

#### 3.1. Stage 1: L1 Regularization (Lasso)

The `select_l1_rf_features` function first fits an L1-regularized (Lasso-style) model to each indicator set. The L1 penalty drives the coefficients of weakly informative features to exactly zero, which gives an initial, broad reduction; in the run below it keeps, for example, 63 of 75 features in `df1` and 13 of 45 in `df4`.
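
As an illustration of the mechanism, the sketch below uses scikit-learn's `SelectFromModel` around an L1-penalized logistic regression on synthetic data. Which estimator and penalty strength `select_l1_rf_features` actually uses is not shown in this notebook, so the model choice and `C=0.02` are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 600
X = rng.normal(size=(n_samples, 8))
# Only columns 0 and 1 carry signal; the other six are pure noise.
y = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=n_samples) > 0).astype(int)

# L1 penalties are scale-sensitive, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# A small C means a strong penalty, which pushes weak coefficients to exactly zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.02)
selector = SelectFromModel(l1_model).fit(X_scaled, y)

kept = np.flatnonzero(selector.get_support())  # indices of surviving features
```

The strength of the penalty controls how aggressive this first pass is, which is why the number of survivors differs so much between indicator sets in the output below.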

#### 3.2. Stage 2: Random Forest Importance

The features that survive the L1 pass are then fed into a Random Forest model. We use the model's feature importance scores (typically impurity-based Gini importance) to rank the L1-reduced set and keep the most predictive features.
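
A minimal sketch of importance-based ranking with scikit-learn follows, on synthetic data. Treating `importance = 0.05` as a minimum-importance cut-off is an assumption about how the notebook's parameter is used.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])
# f0 drives the label most strongly, f3 less so; the rest are noise.
y = (X["f0"] + 0.5 * X["f3"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

importance_threshold = 0.05  # assumed to act as a cut-off, mirroring `importance = 0.05`
selected = importances[importances >= importance_threshold].index.tolist()
```

Impurity-based importances tend to favour continuous, high-cardinality features, so a ranking like this is best read as a coarse filter rather than a precise ordering.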

In [None]:
test_size = 1000
importance = 0.05

#Initialize a dataframe that contains all of the technical indicators which will be used in the lines below
merged_df = pd.concat(stock_tas.values(), axis=1)
df_cleaned = merged_df.loc[:, ~merged_df.columns.duplicated()]

l1_rf_selected_df = select_features.select_l1_rf_features(stock_tas, y_copy[['change_encoded']], df_cleaned, importance=importance, test_size=test_size)
l1_rf_selected_df

Performing L1-based feature selection
L1 selection results


df1 - Number of features chosen by L1: 63, Number of starting features: 75
df2 - Number of features chosen by L1: 4, Number of starting features: 6
df3 - Number of features chosen by L1: 67, Number of starting features: 76
df4 - Number of features chosen by L1: 13, Number of starting features: 45
df5 - Number of features chosen by L1: 4, Number of starting features: 7
df6 - Number of features chosen by L1: 17, Number of starting features: 21
df7 - Number of features chosen by L1: 21, Number of starting features: 28
df8 - Number of features chosen by L1: 20, Number of starting features: 37
df9 - Number of features chosen by L1: 20, Number of starting features: 24


Performing Random Forest-based feature selection on the L1-selected features
Feature ranking:
1. feature no:2 feature name:volume (0.191723)
2. feature no:59 feature name:low_Z_30_1 (0.087154)
3. feature no:60 feature name:close_Z_30_1 (0.078090)
4. feature no:58 fe

timestamp,open_Z_30_1,close_Z_30_1,DMN_14,EBSW_40_10,ZS_30,VHF_28,QS_10,HA_low,TRUERANGE_1,BBB_5_2.0,...,EFI_13,PVOL,KURT_30,VTXP_14,UI_14,NATR_14,ALMA_10_6.0_0.85,KVOs_34_55_13,MASSI_9_25,high
2023-10-24 05:00:00,0.749747,1.058217,25.195971,0.978588,1.058217,0.343849,0.048,39.14,0.35,0.647137,...,-8486.457199,3622811.85,0.045667,1.047337,1.123455,0.497720,39.194224,5985.495134,21.418169,39.49
2023-10-24 05:30:00,1.076563,1.039700,24.064652,0.969905,1.039700,0.359736,0.044,39.38,0.12,0.456346,...,-7229.193313,1240582.94,0.087632,1.168539,0.990623,0.483774,39.245974,6437.585538,21.430027,39.50
2023-10-24 06:00:00,1.097905,0.389997,26.371830,0.686230,0.389997,0.378472,0.030,39.27,0.20,0.571736,...,-7760.971411,3075846.32,0.360126,1.108247,0.826372,0.487150,39.295096,6654.221802,21.418282,39.47
2023-10-24 06:30:00,0.400556,1.709097,22.721894,0.683640,1.709097,0.351613,0.049,39.27,0.40,1.079222,...,32187.607362,32666683.40,0.196549,1.090517,0.610504,0.520648,39.329201,9443.775772,21.589385,39.67
2023-10-24 07:00:00,1.726347,2.166695,18.974597,0.675432,2.166695,0.333333,0.055,39.30,0.53,1.771948,...,45702.149167,29698552.40,0.038819,1.086643,0.345616,0.576465,39.360577,13955.634342,22.009362,39.83
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-11-15 07:30:00,-2.378079,-2.337215,51.905283,-0.962727,-2.337215,0.609859,-0.189,89.76,0.70,3.391148,...,-39759.020105,14402355.58,-0.207709,0.545620,1.976385,0.476131,92.018770,-3327.400153,23.252898,90.46
2024-11-15 08:00:00,-2.301847,-1.818870,47.547030,-0.990263,-1.818870,0.581208,-0.167,89.84,0.51,2.682081,...,-31088.417233,4725598.02,-0.370601,0.611670,1.923261,0.480509,92.041868,-3442.752940,23.541289,90.35
2024-11-15 08:30:00,-1.795842,-2.135665,46.333128,-0.980181,-2.135665,0.609333,-0.234,89.51,0.86,1.925388,...,-31988.106199,5236994.40,-0.336693,0.599647,1.927163,0.517893,91.986904,-3547.424679,23.727980,90.37
2024-11-15 09:00:00,-2.096041,-2.096316,45.901297,-0.999098,-2.096316,0.669468,-0.269,89.38,0.34,1.374872,...,-28915.886742,4464576.48,-0.380190,0.595278,1.963832,0.509183,91.803000,-3615.180008,23.741263,89.72


#### 3.3. Stage 3 (Alternative Path): MRMR Selection

As an alternative to the L1/RF cascade, we also run the Minimum Redundancy Maximum Relevance (MRMR) algorithm. It greedily selects features that are strongly associated with the target variable but weakly correlated with the features already chosen, which yields a relevant, non-redundant feature set. Here `K=5` features are requested per selection pass.
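
The greedy trade-off can be sketched in a few lines. This is a simplified variant, not the `mrmr` library's implementation: it uses absolute Pearson correlation for both relevance and redundancy and the difference form of the score, and `mrmr_select` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def mrmr_select(X, y, k):
    """Greedy mRMR, difference form: relevance minus mean redundancy with chosen features."""
    relevance = X.corrwith(y).abs()
    redundancy = X.corr().abs()
    selected = [relevance.idxmax()]  # start with the single most relevant feature
    while len(selected) < min(k, X.shape[1]):
        remaining = [c for c in X.columns if c not in selected]
        scores = {c: relevance[c] - redundancy.loc[c, selected].mean() for c in remaining}
        selected.append(max(scores, key=scores.get))
    return selected


rng = np.random.default_rng(3)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)
X = pd.DataFrame(
    {
        "a": a,
        "a_copy": a + 0.5 * rng.normal(size=n),  # nearly redundant with "a"
        "b": b,
        "noise": rng.normal(size=n),
    }
)
signal = a + 0.6 * b
y = pd.Series(np.select([signal < -0.3, signal > 0.3], [-1, 1], default=0))

picked = mrmr_select(X, y, k=2)
```

A ranking by relevance alone would place `a_copy` second; the redundancy penalty is what lets the weaker but independent `b` take that slot instead.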

In [None]:
#Get a separate set of features by using a different feature selection method that comes from a library called mRMR (which stands for minimum Redundancy - Maximum Relevance https://github.com/smazzanti/mrmr)
mrmr_selected_df = st_op.select_features.select_mrmr_features(stock_tas, y_copy[['change_encoded']], df_cleaned, K=5, test_size=test_size)
mrmr_selected_df

100%|██████████| 5/5 [00:00<00:00, 17.21it/s]
100%|██████████| 5/5 [00:02<00:00,  1.86it/s]
100%|██████████| 5/5 [00:00<00:00, 13.50it/s]
100%|██████████| 5/5 [00:00<00:00, 26.14it/s]
100%|██████████| 5/5 [00:02<00:00,  1.70it/s]
100%|██████████| 5/5 [00:00<00:00,  5.00it/s]
100%|██████████| 5/5 [00:01<00:00,  4.14it/s]
100%|██████████| 5/5 [00:00<00:00, 10.26it/s]
100%|██████████| 5/5 [00:00<00:00,  8.58it/s]
100%|██████████| 5/5 [00:02<00:00,  2.42it/s]


timestamp,EBSW_40_10,PVOh_12_26_9,CDL_THRUSTING,ZS_30,TRUERANGE_1,low,SQZPRO_ON_NARROW,HILOs_13_21,HILOl_13_21,PDIST,...,PCTRET_1,CDL_TAKURI,PVOL,KURT_30,VTXP_14,AROOND_14,open,MASSI_9_25,high,PSARr_0.02_0.2
2023-10-24 05:00:00,0.978588,-6.326469,0.0,1.058217,0.35,39.14,0,0.000000,39.073333,0.68,...,0.000507,0.0,3622811.85,0.045667,1.047337,50.000000,39.36,21.418169,39.49,0
2023-10-24 05:30:00,0.969905,-6.188582,0.0,1.039700,0.12,39.38,0,0.000000,39.084286,0.23,...,0.000253,0.0,1240582.94,0.087632,1.168539,42.857143,39.45,21.430027,39.50,1
2023-10-24 06:00:00,0.686230,-2.751633,0.0,0.389997,0.20,39.27,0,0.000000,39.088095,0.26,...,-0.003548,0.0,3075846.32,0.360126,1.108247,35.714286,39.47,21.418282,39.47,0
2023-10-24 06:30:00,0.683640,35.211529,0.0,1.709097,0.40,39.27,0,0.000000,39.107143,0.47,...,0.008393,0.0,32666683.40,0.196549,1.090517,28.571429,39.32,21.589385,39.67,0
2023-10-24 07:00:00,0.675432,44.326974,0.0,2.166695,0.53,39.30,0,0.000000,39.136667,0.89,...,0.004288,0.0,29698552.40,0.038819,1.086643,21.428571,39.65,22.009362,39.83,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-11-15 07:30:00,-0.962727,38.231062,0.0,-2.337215,0.70,89.76,0,91.772308,0.000000,1.16,...,-0.002663,0.0,14402355.58,-0.207709,0.545620,100.000000,90.13,23.252898,90.46,0
2024-11-15 08:00:00,-0.990263,27.167795,0.0,-1.818870,0.51,89.84,0,91.662308,0.000000,0.70,...,0.004450,0.0,4725598.02,-0.370601,0.611670,92.857143,89.93,23.541289,90.35,0
2024-11-15 08:30:00,-0.980181,19.392946,0.0,-2.135665,0.86,89.51,0,91.560000,0.000000,1.08,...,-0.007088,0.0,5236994.40,-0.336693,0.599647,100.000000,90.32,23.727980,90.37,0
2024-11-15 09:00:00,-0.999098,12.820623,0.0,-2.096316,0.34,89.38,0,91.401538,0.000000,0.47,...,-0.002342,0.0,4464576.48,-0.380190,0.595278,100.000000,89.72,23.741263,89.72,0


In [None]:
%store l1_rf_selected_df
%store mrmr_selected_df

from stockml.utils.config import models as st_models
from stockml.utils.config import model_names as st_names

#We get the names and actual model instances themselves from config.py
model_dict = st_models
model_names = st_names

l1_rf_dict = {}
mrmr_dict = {}

Stored 'l1_rf_selected_df' (DataFrame)
Stored 'mrmr_selected_df' (DataFrame)


#### 3.4. Stage 4: Sequential Feature Selection (SFS)

To achieve a final, compact feature set (e.g., 5-10 features), we run Sequential Feature Selection on the outputs of both the L1/RF and MRMR pipelines. SFS is a computationally intensive wrapper method that greedily adds or removes one feature at a time, refitting and scoring the model at each step. It finds a strong subset for a given model type, although a greedy search does not guarantee the globally best one.

The process was run for multiple online-learning models (the class names match those of the `river` library), and the final selected feature sets for the HoeffdingAdaptiveTreeClassifier and ARFClassifier were stored for the next phase. For example, the best-performing subset for the ARFClassifier on the L1/RF features was: ('VHF_28', 'TRUERANGE_1', 'volume', 'KVO_34_55_13', 'STDEV_30', 'SUPERT_7_3.0', 'ISB_26', 'VTXP_14').
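
For reference, forward selection can be sketched with scikit-learn's `SequentialFeatureSelector`. This is a stand-in under assumptions: `run_sfs_with_model` is not shown in the notebook, a logistic regression replaces the streaming classifiers, and scikit-learn selects a fixed number of features rather than a `(5,10)` range.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
# The label depends on columns 0 and 3 only.
y = (X[:, 0] + X[:, 3] > 0).astype(int)

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",             # start empty, add the best feature each round
    cv=TimeSeriesSplit(n_splits=3),  # folds respect time order, as price data requires
)
sfs.fit(X, y)

chosen = np.flatnonzero(sfs.get_support()).tolist()
```

Each round refits the model once per remaining candidate and per fold, which is why the method becomes expensive as the candidate pool grows and why it is applied only after the cheaper filters.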

In [None]:
for model in model_names:
    l1_rf_feat = select_features.run_sfs_with_model(l1_rf_selected_df, y_copy[['change_encoded']], test_size=test_size, model = model_dict[model], k_features=(5,10))
    l1_rf_dict[model] = l1_rf_feat

for model in model_names:
    mrmr_feat = select_features.run_sfs_with_model(mrmr_selected_df, y_copy[['change_encoded']], test_size=test_size, model = model_dict[model], k_features=(5,10))
    mrmr_dict[model] = mrmr_feat

In [117]:
print(l1_rf_dict)
#{'hoeffding_adaptive': ['HA_low', 'BBB_5_2.0', 'low', 'ISA_9', 'PDIST'], 'arfc': ['VHF_28', 'TRUERANGE_1', 'volume', 'KVO_34_55_13', 'STDEV_30', 'SUPERT_7_3.0', 'ISB_26', 'VTXP_14']}
%store l1_rf_dict

{'hoeffding_adaptive': ['HA_low', 'BBB_5_2.0', 'low', 'ISA_9', 'PDIST'], 'arfc': ['VHF_28', 'TRUERANGE_1', 'volume', 'KVO_34_55_13', 'STDEV_30', 'SUPERT_7_3.0', 'ISB_26', 'VTXP_14']}
Stored 'l1_rf_dict' (dict)


In [118]:
print(mrmr_dict)
#{'hoeffding_adaptive': ['HILOs_13_21', 'PDIST', 'MFI_14', 'SUPERTd_7_3.0', 'QQE_14_5_4.236_RSIMA', 'SUPERTl_7_3.0', 'KURT_30', 'MASSI_9_25'], 'arfc': ['TRUERANGE_1', 'low', 'SQZPRO_ON_NARROW', 'THERMOs_20_2_0.5', 'INC_1', 'QQE_14_5_4.236_RSIMA', 'ADOSC_3_10', 'PVOL']}
%store mrmr_dict

{'hoeffding_adaptive': ['HILOs_13_21', 'PDIST', 'MFI_14', 'SUPERTd_7_3.0', 'QQE_14_5_4.236_RSIMA', 'SUPERTl_7_3.0', 'KURT_30', 'MASSI_9_25'], 'arfc': ['TRUERANGE_1', 'low', 'SQZPRO_ON_NARROW', 'THERMOs_20_2_0.5', 'INC_1', 'QQE_14_5_4.236_RSIMA', 'ADOSC_3_10', 'PVOL']}
Stored 'mrmr_dict' (dict)


### 4. Iterative Hyperparameter Tuning with Optuna

With a final, reduced set of features, the next step is to find the optimal hyperparameters for our chosen model (a Hoeffding Adaptive Tree Classifier, logged in Section 5 as `BaggingClassifier(HoeffdingAdaptiveTreeClassifier)`). This is a multi-stage, iterative process.

#### 4.1. Defining the Optimization Objective

For this Optuna run, we are optimizing a classification problem that will try to predict if the next candle will close significantly higher (**buy**), significantly lower (**sell**), or somewhere in between (**no action**).

The optimization metric is a custom reward function. For each trial, Optuna evaluates a model's performance based on the total "reward" it accumulates. A reward is generated when the model's label prediction (buy/sell/no action) matches the actual outcome of the next candle. The study's goal is to find the set of hyperparameters that maximizes this total cumulative reward over the test period.
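
The notebook does not show the body of the reward function, so the following is only one plausible form, assumed for illustration: each buy or sell prediction earns the next candle's percent change in the predicted direction, and "no action" earns nothing. The `cumulative_reward` helper is hypothetical.

```python
import numpy as np


def cumulative_reward(predictions, next_changes):
    """Assumed reward: signed next-candle change summed over buy (+1) and sell (-1) calls."""
    predictions = np.asarray(predictions)
    next_changes = np.asarray(next_changes, dtype=float)
    acted = predictions != 0  # "no action" (0) contributes nothing
    per_trade = predictions[acted] * next_changes[acted]
    return float(per_trade.sum()), int(acted.sum())


# buy and price rises 0.5%, sell and price falls 0.2%, no action, buy and price falls 0.1%
total_reward, num_predictions = cumulative_reward([1, -1, 0, 1], [0.5, -0.2, 0.9, -0.1])
```

A reward of this kind weights each prediction by the size of the move, so a model that is right on large moves can outscore one with higher accuracy on small ones.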

#### 4.2. Initial Broad Search

An initial Optuna study is run with a wide search space across many parameters. The `plot_param_importances` visualization allows us to identify which hyperparameters have the most significant impact on the custom reward metric.

In [None]:
#We initialize certain variables one more time before running the main machine learning optimization which is done through optuna. 
model_name = model_names[0]
model_to_optimize = model_dict[model_name]
mrmr_hoeff_a_features = mrmr_dict[model_name]

X = mrmr_selected_df[mrmr_hoeff_a_features][-2000:]
y = y_copy[-2000:]

from stockml.optimizations.model_tuning import optuna_classification_reward
study = optuna_classification_reward(X_data=X, y_data=y, model=model_name, n_trials=50)

In [None]:
#We display the importances of our parameters in order to see what we could focus on in our next run
import optuna
param_importances = optuna.visualization.plot_param_importances(study)
param_importances.show()

#### 4.3. Focused Refined Search

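Based on the results of the initial search, we define a narrower search space. `new_params` overrides the ranges of selected parameters (here `leaf_prediction` is fixed to `'nb'`), and `exclude_params` removes parameters that showed low importance from the search. Running a second, longer Optuna study (500 trials instead of 50) on this refined space spends the trial budget on the parameters that matter. This improves the odds of finding a strong configuration, although it does not guarantee the global optimum.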
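
One way such a refinement could be applied internally is sketched below. The `refine_search_space` helper, the `base_space` contents and the `("int", low, high)` tuple shape are assumptions for illustration; only the `('categorical', [...])` form and the parameter names come from this notebook.

```python
def refine_search_space(base_space, param_ranges=None, exclude_params=()):
    """Hypothetical merge: drop excluded parameters, then apply overrides."""
    excluded = set(exclude_params)
    refined = {name: spec for name, spec in base_space.items() if name not in excluded}
    refined.update(param_ranges or {})
    return refined


# Assumed broad space for the first study (ranges are illustrative, not the notebook's).
base_space = {
    "grace_period": ("int", 10, 300),
    "leaf_prediction": ("categorical", ["mc", "nb", "nba"]),
    "binary_split": ("categorical", [True, False]),
    "bootstrap_sampling": ("categorical", [True, False]),
}

refined_space = refine_search_space(
    base_space,
    param_ranges={"leaf_prediction": ("categorical", ["nb"])},
    exclude_params=["bootstrap_sampling", "binary_split"],
)
```

Fixing a categorical parameter to a single choice and removing low-importance ones both shrink the space the sampler has to cover, so the same number of trials explores the remaining parameters more densely.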

In [None]:
#We change the values of some parameters and delete ones that had low importance before running optuna again
new_params = {'leaf_prediction': ('categorical', ['nb'])}
exclude_params = ['bootstrap_sampling', 'remove_poor_attrs', 'merit_preprune', 'binary_split']

# Timestamp for this study, truncated to the minute (used later in trial_id)
current_time = datetime.now().replace(second=0, microsecond=0)

#Run optuna again with the new parameter changes
study2 = optuna_classification_reward(X_data=X, 
                                      y_data=y,
                                      model=model_name, 
                                      n_trials=500, 
                                      param_ranges=new_params, exclude_params=exclude_params)


### 5. Results & Finalization

The `show_optuna_results` function displays a sorted DataFrame of the best-performing trials from the Optuna study, ranked by our custom reward metric. The final, best-performing set of hyperparameters is identified.

These results are then compiled into a DataFrame and written to the `StockOptunaResults30m` table in the PostgreSQL database, creating a persistent, queryable log of the best trials from each study.
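
What "queryable" buys can be shown with a small sketch. It assumes SQLite in place of PostgreSQL, a hypothetical `optuna_results_30m` table, and JSON serialization of the parameter dictionary; the actual schema written by `write_optuna_results` is not shown in this notebook.

```python
import json
import sqlite3

import pandas as pd

# Two rows taken from the results table in this section.
results = pd.DataFrame(
    {
        "trial_num": [108, 100],
        "trial_rewards": [24.08, 26.52],
        "trial_params": [
            {"grace_period": 81, "split_criterion": "gini"},
            {"grace_period": 117, "split_criterion": "gini"},
        ],
    }
)
# Dictionaries are not a SQL type, so they are stored as JSON text.
results["trial_params"] = results["trial_params"].apply(json.dumps)

conn = sqlite3.connect(":memory:")
results.to_sql("optuna_results_30m", conn, index=False, if_exists="append")

# Later sessions can pull the best trial back without rerunning the study.
best = pd.read_sql_query(
    "SELECT trial_num, trial_params FROM optuna_results_30m "
    "ORDER BY trial_rewards DESC LIMIT 1",
    conn,
)
best_params = json.loads(best.loc[0, "trial_params"])
```

Storing the parameters alongside ticker, timeframe and study time makes it possible to compare studies across instruments or dates with a single query.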

In [None]:
from stockml.optimizations.model_tuning import show_optuna_results
#Now we display the top 25 trials in terms of rewards and return a dataframe that contains details about the best runs
results_df = show_optuna_results(study2, 0, 25)
results_df

Trial 100 // Reward: 26.520000000000024
Trial 108 // Reward: 24.08000000000007
Trial 20 // Reward: 23.589999999999975
Trial 331 // Reward: 23.149999999999878
Trial 354 // Reward: 22.519999999999953
Trial 163 // Reward: 22.480000000000032
Trial 211 // Reward: 22.2500000000001
Trial 360 // Reward: 21.040000000000134
Trial 160 // Reward: 20.770000000000067
Trial 190 // Reward: 20.710000000000107
Trial 350 // Reward: 20.3100000000001
Trial 436 // Reward: 20.049999999999955
Trial 98 // Reward: 19.830000000000084
Trial 237 // Reward: 19.490000000000023
Trial 234 // Reward: 19.320000000000064
Trial 479 // Reward: 19.32000000000005
Trial 265 // Reward: 19.22999999999996
Trial 78 // Reward: 19.199999999999918
Trial 200 // Reward: 19.150000000000077
Trial 67 // Reward: 18.950000000000102
Trial 482 // Reward: 18.869999999999962
Trial 476 // Reward: 18.849999999999966
Trial 488 // Reward: 18.830000000000084
Trial 261 // Reward: 18.570000000000064
Trial 84 // Reward: 18.379999999999953


,trial_num,trial_rewards,trial_metric,trial_num_preds,trial_ave_profit,trial_ave_loss,trial_params
0,100,26.52,Precision: 38.29%,268.0,0.3491,-0.2287,"{'grace_period': 117, 'split_criterion': 'gini..."
1,108,24.08,Precision: 35.73%,273.0,0.3581,-0.2273,"{'grace_period': 81, 'split_criterion': 'gini'..."
2,20,23.59,Precision: 35.43%,237.0,0.3615,-0.215,"{'grace_period': 112, 'split_criterion': 'gini..."
3,331,23.15,Precision: 35.75%,242.0,0.3585,-0.2252,"{'grace_period': 58, 'split_criterion': 'gini'..."
4,354,22.52,Precision: 37.15%,237.0,0.3605,-0.2297,"{'grace_period': 50, 'split_criterion': 'info_..."
5,163,22.48,Precision: 34.67%,242.0,0.3666,-0.221,"{'grace_period': 234, 'split_criterion': 'gini..."
6,211,22.25,Precision: 36.62%,234.0,0.3693,-0.2244,"{'grace_period': 104, 'split_criterion': 'info..."
7,360,21.04,Precision: 35.94%,202.0,0.3869,-0.2302,"{'grace_period': 67, 'split_criterion': 'gini'..."
8,160,20.77,Precision: 35.78%,263.0,0.3578,-0.2312,"{'grace_period': 107, 'split_criterion': 'info..."
9,190,20.71,Precision: 38.00%,255.0,0.3392,-0.2314,"{'grace_period': 231, 'split_criterion': 'hell..."


In [None]:
#We can check the parameters of the best trial as well
study2.best_params

{'grace_period': 117,
 'split_criterion': 'gini',
 'delta': 0.02934919086783071,
 'tau': 0.03420910250735763,
 'leaf_prediction': 'nb',
 'nb_threshold': 89,
 'drift_window_threshold': 177,
 'switch_significance': 0.01642919578646868,
 'min_branch_fraction': 0.020437890996069415,
 'max_share_to_split': 0.5144247690012861,
 'adwin_delta': 0.06678408825164404,
 'adwin_clock': 42,
 'adwin_max_buckets': 7,
 'adwin_min_window_length': 16,
 'adwin_grace_period': 43}

In [None]:
# Initialize some variables that are needed before we can store the results in an SQL table
model = 'BaggingClassifier(HoeffdingAdaptiveTreeClassifier)'
source = 'ibapi'
timeframe = '30m'
trial_start = X.index[0]
trial_end = X.index[-1]


results_df['ml_model'] = model
results_df['study_datetime'] = current_time
results_df['ticker'] = ticker
results_df['source'] = source
results_df['timeframe'] = timeframe
results_df['trial_start'] = trial_start
results_df['trial_end'] = trial_end
results_df['trial_metric'] = results_df['trial_metric'].astype(str)

results_df['trial_id'] = (
    results_df['study_datetime'].astype(str) + ' ' +
    results_df['ticker'].astype(str) + ' ' +
    results_df['trial_num'].astype(str) + ' ' +
    results_df['trial_rewards'].astype(str)
)

%store results_df
%store study2

results_df

,trial_num,trial_rewards,trial_metric,trial_num_preds,trial_ave_profit,trial_ave_loss,trial_params,ml_model,study_datetime,ticker,source,timeframe,trial_start,trial_end,trial_id
0,100,26.52,Precision: 38.29%,268.0,0.3491,-0.2287,"{'grace_period': 117, 'split_criterion': 'gini...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 100 26.52
1,108,24.08,Precision: 35.73%,273.0,0.3581,-0.2273,"{'grace_period': 81, 'split_criterion': 'gini'...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 108 24.08
2,20,23.59,Precision: 35.43%,237.0,0.3615,-0.215,"{'grace_period': 112, 'split_criterion': 'gini...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 20 23.59
3,331,23.15,Precision: 35.75%,242.0,0.3585,-0.2252,"{'grace_period': 58, 'split_criterion': 'gini'...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 331 23.15
4,354,22.52,Precision: 37.15%,237.0,0.3605,-0.2297,"{'grace_period': 50, 'split_criterion': 'info_...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 354 22.52
5,163,22.48,Precision: 34.67%,242.0,0.3666,-0.221,"{'grace_period': 234, 'split_criterion': 'gini...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 163 22.48
6,211,22.25,Precision: 36.62%,234.0,0.3693,-0.2244,"{'grace_period': 104, 'split_criterion': 'info...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 211 22.25
7,360,21.04,Precision: 35.94%,202.0,0.3869,-0.2302,"{'grace_period': 67, 'split_criterion': 'gini'...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 360 21.04
8,160,20.77,Precision: 35.78%,263.0,0.3578,-0.2312,"{'grace_period': 107, 'split_criterion': 'info...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 160 20.77
9,190,20.71,Precision: 38.00%,255.0,0.3392,-0.2314,"{'grace_period': 231, 'split_criterion': 'hell...",BaggingClassifier(HoeffdingAdaptiveTreeClassif...,2024-11-16 14:26:00,UPRO,ibapi,30m,2024-08-20 02:30:00,2024-11-15 09:30:00,2024-11-16 14:26:00 UPRO 190 20.71


In [None]:
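import sys
# Finally, we store the results in an SQL table.
# Drop any cached copy of stockml.sql so the import below picks up the latest code;
# pop() with a default avoids a KeyError when the module has not been imported yet.
sys.modules.pop('stockml.sql', None)
import stockml.sql as ss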

ss.write_optuna_results(results_df, ss.StockOptunaResults30m)

Optuna results written successfully.


### 6. Conclusion

This notebook walked through an end-to-end machine learning research workflow. Key outcomes:
1.  **Systematic Feature Reduction:** A feature universe of several hundred indicators was reduced to 5 to 8 predictors per model through a multi-stage cascade of selection techniques.
2.  **Iterative Hyperparameter Optimization:** A two-stage Optuna study with a custom reward metric narrowed the search space and identified the best-scoring hyperparameters (trial 100, with a reward of 26.52).
3.  **Persistent Research Logging:** The top trials, including their parameters and performance metrics, were stored in a PostgreSQL database as a queryable log of the research process.

The selected features and hyperparameters can now be carried into backtesting, and from there into a live trading environment.