## Notebook to try sequential feature selection

### Approach
#### Setup
1. Take the predictions of each of the 3 CV x 36-target models on each of the 3 CV `test` sets.
2. Split each of the 3 CV test sets into a `validation` set (120 eras) and a `test` set (120 eras).
3. Further split the validation eras into `val1` and `val2` (60 eras each).
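
A minimal sketch of step 3, assuming a DataFrame with an `era` column. The helper name `split_validation_by_era` and the toy data are hypothetical; the `gap` argument mirrors the `median + 12` used in `xval` further down, which means the later half ends up with fewer eras than the earlier one.

```python
import numpy as np
import pandas as pd


def split_validation_by_era(val_df, era_col="era", gap=12):
    """Split a frame into an early half and a late half by era (hypothetical helper)."""
    median = np.median(val_df[era_col].unique())
    # Earlier half: every era strictly below the median era
    val1 = val_df[val_df[era_col] < median]
    # Later half: skip `gap` eras after the median before starting
    val2 = val_df[val_df[era_col] > median + gap]
    return val1, val2


# Toy data: 120 consecutive eras with two rows each
toy = pd.DataFrame({"era": np.repeat(np.arange(1, 121), 2), "pred": 0.5})
val1, val2 = split_validation_by_era(toy)
print(val1["era"].nunique(), val2["era"].nunique())
```

On 120 consecutive eras this gives 60 eras for the first half and 48 for the second, because the 12 eras after the median are dropped.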

#### Sequential feature selection
1. We start with the cyrus-only prediction.
2. We then run 35 regressions, each pairing cyrus with one of the 35 other targets, fitted on `val1` (60 eras).
3. We rank the 35 candidates by their performance on `val2` and select the best.
4. In the next round we add the best of the remaining 34 targets, and so on.
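
The loop above can be sketched on synthetic data as follows. This is an illustration under assumptions, not the notebook's `xval` pipeline: `forward_select` is a hypothetical helper, candidates are scored by plain Pearson correlation over all rows rather than per-era metrics, and `alpha=1.0` is a placeholder value.

```python
import numpy as np
from sklearn.linear_model import Ridge


def forward_select(X_fit, y_fit, X_score, y_score, names, start, max_feats):
    """Greedy forward selection: fit on one half, rank candidates on the other."""
    chosen = list(start)
    max_feats = min(max_feats, len(names))
    while len(chosen) < max_feats:
        scores = {}
        for cand in names:
            if cand in chosen:
                continue
            cols = [names.index(c) for c in chosen + [cand]]
            mdl = Ridge(alpha=1.0, fit_intercept=False).fit(X_fit[:, cols], y_fit)
            pred = mdl.predict(X_score[:, cols])
            # Assumed scoring rule: Pearson correlation on the held-out half
            scores[cand] = np.corrcoef(pred, y_score)[0, 1]
        chosen.append(max(scores, key=scores.get))
    return chosen


rng = np.random.default_rng(0)
names = ["a", "b", "c", "d"]
X = rng.normal(size=(2000, 4))
# Only "a" and "b" carry signal; "c" and "d" are pure noise
y = X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=2000)
selected = forward_select(X[:1000], y[:1000], X[1000:], y[1000:], names, ["a"], 2)
print(selected)
```

Because each round picks the maximum over many candidates on the same held-out half, the winning score is biased upward, which is one way the selection can overfit that half.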

### Observations
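Performance on `val2` improved with each addition, but performance on test did not follow: the test `cv_mean` slipped from 2.86% after the first addition to 2.81% after the seventh, so the selection overfits `val2` easily. Abandoned the approach.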

### Next steps and revisiting
If we revisit this, we should perhaps not split validation into `val1` and `val2`, but simply regress on the entire `validation` set and select features based on in-sample performance. That way we at least have 120 eras to regress over instead of 60.

In [1]:
%load_ext autoreload
%autoreload 2

from IPython.display import display, HTML, clear_output
display(HTML("<style>.container { width:100% !important; }</style>"))

from importlib import reload
import logging
reload(logging)
import logging
logging.basicConfig(level=logging.INFO)

import glob
import numpy as np
import datetime
import json
import os
import os.path
from os.path import join
import warnings
import flatdict
import pandas as pd
import mlflow
import gc
import plotly.graph_objects as go
import functools
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
import plotly.offline as pyo
import itertools as it
from typing import List
import time
import copy

import utils as ut

from pprint import pprint, pformat
from tqdm.notebook import tqdm

from utils import ERA_COL, TARGET_COL


pyo.init_notebook_mode()
pd.options.mode.chained_assignment = None  # default='warn'
# Silence all warnings; originally added to filter the setuptools UserWarning about distutils
warnings.filterwarnings("ignore")
DF = pd.DataFrame
sns.set_theme()

## 1. Constants

In [2]:
LOCAL = False
if LOCAL:
    ML_TRACKING_SERVER_URI = "http://127.0.0.1:5000"
    AWS_CREDENTIALS_FILE = "~/.aws/personal_credentials"
else:
    ML_TRACKING_SERVER_URI = "http://18.218.213.146:5500/"
    AWS_CREDENTIALS_FILE = "~/.aws/credentials"
EXPERIMENT_NAME = "ensemble_tgts_for_cyrus_2023-04-26_19h-36m"
DATA_PATH = "./data/"
VAL_PRED_S3_PATH = (
    "s3://numerai-v1/experiments/"
    "ensemble_tgts_for_cyrus_2023-04-26_19h-36m/"
    "ckpt1_cv_val_preds_no_feats.pkl"
)
ENSEMBLE_MODELS_S3_PATH = (
    "s3://numerai-v1/experiments/"
    "ensemble_tgts_for_cyrus_2023-04-26_19h-36m/"
    "ensemble_models/"
)

log = ut.Logger(root_dir="./")

EXPT_LOCAL_DIR = os.path.join(DATA_PATH, "experiments", EXPERIMENT_NAME)
MODEL_DIR = join(EXPT_LOCAL_DIR, "models")
for fld in [EXPT_LOCAL_DIR, MODEL_DIR]:
    os.makedirs(fld, exist_ok=True)
    log.info(f"Making {fld}")
    
log.info(f"{EXPERIMENT_NAME=}")
log.info(f"{VAL_PRED_S3_PATH=}")

[2023-04-28 19:46:25]  Making ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m
[2023-04-28 19:46:25]  Making ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/models
[2023-04-28 19:46:25]  EXPERIMENT_NAME='ensemble_tgts_for_cyrus_2023-04-26_19h-36m'
[2023-04-28 19:46:25]  VAL_PRED_S3_PATH='s3://numerai-v1/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ckpt1_cv_val_preds_no_feats.pkl'


In [3]:
log.info(f"{TARGET_COL=}")

[2023-04-28 19:46:26]  TARGET_COL='target_cyrus_v4_20'


## 2. Download data and load the model predictions on val data

In [4]:
# mlflow.set_tracking_uri(ML_TRACKING_SERVER_URI)
# try:
#     cv_expt_id = mlflow.create_experiment(name=EXPERIMENT_NAME)
# except Exception:
#     cv_expt_id = mlflow.get_experiment_by_name(name=EXPERIMENT_NAME).experiment_id
# log.info(f"{EXPERIMENT_NAME=}, {cv_expt_id=}")

In [5]:
ut.download_s3_file(
    local_path=EXPT_LOCAL_DIR,
    s3_path=VAL_PRED_S3_PATH,
    aws_credential_fl=AWS_CREDENTIALS_FILE,
)

[2023-04-28 19:46:29]  Loading aws credenitals from ~/.aws/credentials...
[2023-04-28 19:46:29]  Would have downloaded s3://numerai-v1/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ckpt1_cv_val_preds_no_feats.pkl to ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ckpt1_cv_val_preds_no_feats.pkl. But ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ckpt1_cv_val_preds_no_feats.pkl exists. Will not download again ...


In [6]:
cv_valpreds_orig = ut.unpickle_obj(fl=join(EXPT_LOCAL_DIR, "ckpt1_cv_val_preds_no_feats.pkl"))

In [7]:
# HACK: Accidentally duplicated the columns twice, only include them once instead
NUM_TARGETS = 36
cv_valpreds = copy.deepcopy(cv_valpreds_orig)
cv_valpreds["cv_predcols_map"] = [
    cv_pc[:NUM_TARGETS]
    for cv_pc in cv_valpreds["cv_predcols_map"]
]

In [8]:
gc.collect()

23810

In [9]:
split = 0
cv_valpreds["cv_to_val_test_map"][split]["val"].head()

id,target,target_nomi_v4_20,target_nomi_v4_60,target_tyler_v4_20,target_tyler_v4_60,target_victor_v4_20,target_victor_v4_60,target_ralph_v4_20,target_ralph_v4_60,target_waldo_v4_20,...,pred_target_ben_v4_20_cv0,pred_target_ben_v4_60_cv0,pred_target_alan_v4_20_cv0,pred_target_alan_v4_60_cv0,pred_target_paul_v4_20_cv0,pred_target_paul_v4_60_cv0,pred_target_george_v4_20_cv0,pred_target_george_v4_60_cv0,pred_target_william_v4_20_cv0,pred_target_william_v4_60_cv0
n001f768affa1cc2,1.0,1.0,0.75,0.75,0.75,1.0,0.75,1.0,0.75,1.0,...,0.493943,0.475514,0.494528,0.478933,0.45402,0.450485,0.455395,0.43663,0.499045,0.476133
n002cc5b29f8705f,0.5,0.5,0.5,0.25,0.5,0.5,0.5,0.5,0.5,0.5,...,0.495377,0.509305,0.496175,0.49055,0.474163,0.483902,0.486066,0.495735,0.498365,0.507437
n00361f031876c68,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,...,0.488254,0.486235,0.493248,0.48034,0.482077,0.454602,0.475474,0.466946,0.485363,0.479478
n00385e672d049e6,0.0,0.0,0.25,0.25,0.5,0.0,0.25,0.0,0.25,0.0,...,0.491807,0.488826,0.51142,0.501787,0.45579,0.444844,0.475221,0.449966,0.49438,0.498582
n00503d13b28d441,0.5,0.5,0.5,0.5,0.25,0.5,0.5,0.5,0.5,0.5,...,0.524468,0.498629,0.533116,0.521875,0.532201,0.516652,0.52554,0.507236,0.533293,0.513035


In [10]:
cv_valpreds["cv_to_val_test_map"][split]["test"].head()

id,target,target_nomi_v4_20,target_nomi_v4_60,target_tyler_v4_20,target_tyler_v4_60,target_victor_v4_20,target_victor_v4_60,target_ralph_v4_20,target_ralph_v4_60,target_waldo_v4_20,...,pred_target_ben_v4_20_cv0,pred_target_ben_v4_60_cv0,pred_target_alan_v4_20_cv0,pred_target_alan_v4_60_cv0,pred_target_paul_v4_20_cv0,pred_target_paul_v4_60_cv0,pred_target_george_v4_20_cv0,pred_target_george_v4_60_cv0,pred_target_william_v4_20_cv0,pred_target_william_v4_60_cv0
n00164cb9c597154,1.0,1.0,0.5,0.75,0.5,1.0,0.5,1.0,0.5,0.75,...,0.46214,0.453088,0.488272,0.476812,0.432102,0.393445,0.457779,0.419536,0.475697,0.45826
n0028609fde88b03,1.0,1.0,0.75,0.75,0.5,0.75,0.5,0.75,0.5,1.0,...,0.52529,0.538764,0.500456,0.50537,0.526002,0.543837,0.53227,0.560145,0.518491,0.522983
n002bfff507f118e,0.5,0.5,0.5,0.5,0.25,0.5,0.5,0.25,0.5,0.25,...,0.496298,0.508979,0.501143,0.494654,0.529,0.535165,0.514607,0.524918,0.508741,0.520494
n002d0d989a01142,0.75,0.75,0.5,0.75,0.5,0.75,0.5,0.75,0.5,0.75,...,0.50193,0.506025,0.514477,0.505899,0.552279,0.565865,0.508856,0.536651,0.522526,0.516137
n00620b3b0a59ab1,0.75,0.75,0.5,1.0,0.5,0.75,0.5,0.5,0.5,0.75,...,0.479739,0.485134,0.504286,0.501196,0.535674,0.544112,0.506822,0.511838,0.491546,0.476516


### 2.1 Download the previously trained ensemble models

In [11]:
ut.download_from_s3_recursively(
    local_path=MODEL_DIR,
    s3_path=ENSEMBLE_MODELS_S3_PATH,
    aws_credential_fl=AWS_CREDENTIALS_FILE,
)

[2023-04-28 19:46:35]  Loading aws credenitals from ~/.aws/credentials...
[2023-04-28 19:46:36]  Would have downloaded s3://numerai-v1/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ensemble_models/en__alpha_0.0001_l1_ratio_0.001_cv0.pkl.pkl to ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/models/en__alpha_0.0001_l1_ratio_0.001_cv0.pkl.pkl. But ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/models/en__alpha_0.0001_l1_ratio_0.001_cv0.pkl.pkl exists. Will not download again ...
[2023-04-28 19:46:36]  Would have downloaded s3://numerai-v1/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/ensemble_models/en__alpha_0.0001_l1_ratio_0.001_cv1.pkl.pkl to ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/models/en__alpha_0.0001_l1_ratio_0.001_cv1.pkl.pkl. But ./data/experiments/ensemble_tgts_for_cyrus_2023-04-26_19h-36m/models/en__alpha_0.0001_l1_ratio_0.001_cv1.pkl.pkl exists. Will not download again ...
[2023-04-28 19:46:36]  Would have 

## 3. Compute baseline metrics by simply averaging the predictions

In [12]:
def extract_cols_like(all_cols, match_cols):
    return [
        col for col in all_cols
        if any(partial in col for partial in match_cols)
    ]

In [13]:
import re


def refmt_predcols(col):
    """Use regex to extract the short column name, e.g.
    pred_target_arthur_v4_20_cv1 -> arthur_v4_20"""
    return re.search(r"pred_target_(\w+)_cv", col).group(1)
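
A small worked example of the two helpers on made-up column names. The block re-declares both functions so that it runs on its own; the column list is illustrative only.

```python
import re


def extract_cols_like(all_cols, match_cols):
    # Keep every column that contains at least one of the partial names
    return [col for col in all_cols if any(partial in col for partial in match_cols)]


def refmt_predcols(col):
    # Capture the part between "pred_target_" and "_cv"
    return re.search(r"pred_target_(\w+)_cv", col).group(1)


cols = [
    "pred_target_cyrus_v4_20_cv0",
    "pred_target_cyrus_v4_60_cv0",
    "pred_target_sam_v4_20_cv0",
]
picked = extract_cols_like(cols, ["cyrus_v4_20", "sam"])
short = [refmt_predcols(c) for c in picked]
print(picked)  # ['pred_target_cyrus_v4_20_cv0', 'pred_target_sam_v4_20_cv0']
print(short)   # ['cyrus_v4_20', 'sam_v4_20']
```

Because matching is by substring, a partial such as `cyrus` would select both the `_20` and the `_60` column.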

In [14]:
def score_baselines(
    cv_valpreds,
    predcols_subset=None,
    agg_fn=np.mean,
    baseline_name="",
):
    """
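    Score a simple aggregate (default: mean) of the prediction columns on each CV test set.

    :param predcols_subset: Optional list of partial column names used to pick a subset
        of prediction columns, e.g. ['arthur_v4_20', 'nomi_v4_60']. None uses all of them.
    """
    cv = len(cv_valpreds["cv_predcols_map"])
    # Per-split metrics and per-split describe() of the ensemble predictions
    bl_cv_metrics, cv_pred_descs = [], []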
    for split, predcols, val_test_map in tqdm(
        zip(
            range(cv),
            cv_valpreds["cv_predcols_map"],
            cv_valpreds["cv_to_val_test_map"],
        ),
        desc="CV split",
        total=cv,
    ):
        train_df, test_df = val_test_map["val"], val_test_map["test"]
        log.info(
            f"{split=}, {train_df[predcols].shape=}, {train_df.era.min()=}, "
            f"{train_df.era.max()=}, {train_df.era.nunique()=}",
        )
        log.info(
            f"{split=}, {test_df[predcols].shape=}, {test_df.era.min()=}, "
            f"{test_df.era.max()=}, {test_df.era.nunique()=}",
        )
        ensmbl_predcol = f"ensemble_{baseline_name}_cv{split}"
        if predcols_subset is None:
            chosen_fts = predcols
        else:
            chosen_fts = extract_cols_like(all_cols=predcols, match_cols=predcols_subset)
        log.info(f"Chosen {len(chosen_fts)} features: \n{chosen_fts}")
        test_df[ensmbl_predcol] = agg_fn(test_df[chosen_fts], axis=1)
        bl_cv_metrics.append(
            ut.validation_metrics(
                validation_data=test_df, pred_cols=[ensmbl_predcol], target_col=TARGET_COL
            ),
        )
        cv_pred_descs.append(test_df[ensmbl_predcol].describe())
    log.info("Prediction distribution")
    display(pd.concat(cv_pred_descs, axis=1))
    baseline_metrics = ut.to_cv_agg_df(bl_cv_metrics)
    display(ut.fmt_metrics_df(baseline_metrics))
    return baseline_metrics

### 3.1 Average all 36 target models

In [15]:
bl_allpred_mean_metrics = score_baselines(
    cv_valpreds=cv_valpreds,
    predcols_subset=None,
    agg_fn=np.mean,
    baseline_name="allpreds_mean",
)

CV split:   0%|          | 0/3 [00:00<?, ?it/s]

[2023-04-28 19:46:39]  split=0, train_df[predcols].shape=(587086, 36), train_df.era.min()=820, train_df.era.max()=935, train_df.era.nunique()=116
[2023-04-28 19:46:39]  split=0, test_df[predcols].shape=(590333, 36), test_df.era.min()=948, test_df.era.max()=1059, test_df.era.nunique()=112
[2023-04-28 19:46:39]  Chosen 36 features: 
['pred_target_arthur_v4_20_cv0', 'pred_target_arthur_v4_60_cv0', 'pred_target_thomas_v4_20_cv0', 'pred_target_thomas_v4_60_cv0', 'pred_target_cyrus_v4_20_cv0', 'pred_target_cyrus_v4_60_cv0', 'pred_target_caroline_v4_20_cv0', 'pred_target_caroline_v4_60_cv0', 'pred_target_sam_v4_20_cv0', 'pred_target_sam_v4_60_cv0', 'pred_target_xerxes_v4_20_cv0', 'pred_target_xerxes_v4_60_cv0', 'pred_target_nomi_v4_20_cv0', 'pred_target_nomi_v4_60_cv0', 'pred_target_tyler_v4_20_cv0', 'pred_target_tyler_v4_60_cv0', 'pred_target_victor_v4_20_cv0', 'pred_target_victor_v4_60_cv0', 'pred_target_ralph_v4_20_cv0', 'pred_target_ralph_v4_60_cv0', 'pred_target_waldo_v4_20_cv0', 'pred_t

Unnamed: 0,ensemble_allpreds_mean_cv0,ensemble_allpreds_mean_cv1,ensemble_allpreds_mean_cv2
count,590333.0,571414.0,507469.0
mean,0.499084,0.501019,0.499891
std,0.016239,0.018883,0.025984
min,0.403835,0.392828,0.361744
25%,0.488993,0.488816,0.483469
50%,0.49961,0.501,0.499875
75%,0.509779,0.513161,0.516336
max,0.573454,0.60221,0.649094


Unnamed: 0,mean,std,sharpe
ensemble_allpreds_mean_cv0,2.56%,2.14%,119.57%
ensemble_allpreds_mean_cv1,2.84%,1.85%,153.61%
ensemble_allpreds_mean_cv2,2.71%,1.78%,152.79%
cv_mean,2.71%,1.92%,141.99%
cv_low,2.55%,1.70%,120.01%
cv_high,2.86%,2.14%,163.97%
cv_std,0.14%,0.19%,19.42%
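
In the table above, `sharpe` is consistent with `mean / std` (for cv0, 2.56 / 2.14 ≈ 1.20, matching 119.57% up to rounding). A hedged sketch of such metrics follows. How `ut.validation_metrics` actually computes them is not shown in this notebook, so the per-era rank correlation, the helper name `era_metrics`, and the toy data are all assumptions.

```python
import numpy as np
import pandas as pd


def era_metrics(df, pred_col, target_col, era_col="era"):
    """Mean, std and mean/std of a per-era rank correlation (assumed definition)."""
    per_era = pd.Series({
        # Percentile-rank the predictions within the era, then correlate with the target
        era: grp[pred_col].rank(pct=True).corr(grp[target_col])
        for era, grp in df.groupby(era_col)
    })
    mean, std = per_era.mean(), per_era.std()  # pandas default: sample std (ddof=1)
    return {"mean": mean, "std": std, "sharpe": mean / std}


rng = np.random.default_rng(0)
n_eras, n_rows = 20, 200
toy = pd.DataFrame({
    "era": np.repeat(np.arange(n_eras), n_rows),
    "target": rng.normal(size=n_eras * n_rows),
})
# Synthetic prediction with a weak positive relationship to the target
toy["pred"] = 0.3 * toy["target"] + rng.normal(size=len(toy))
metrics = era_metrics(toy, "pred", "target")
print(metrics)
```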


In [16]:
len([c for c in cv_valpreds["cv_to_val_test_map"][0]["test"].columns if c.startswith("pred_target_")])

36

### 3.2 Average top 8 target models

In [142]:
bl_top8pred_mean_metrics = score_baselines(
    cv_valpreds=cv_valpreds,
    predcols_subset=[
    "pred_target_cyrus_v4_20",
    "pred_target_ralph_v4_20",
    "pred_target_sam_v4_20",
    "pred_target_xerxes_v4_20",
    "pred_target_caroline_v4_20",
    "pred_target_waldo_v4_20",
    "pred_target_nomi_v4_20",
    "pred_target_tyler_v4_20",
    ],
    agg_fn=np.mean,
    baseline_name="top8preds_mean",
)

CV split:   0%|          | 0/3 [00:00<?, ?it/s]

[2023-04-28 21:58:16]  split=0, train_df[predcols].shape=(587086, 36), train_df.era.min()=820, train_df.era.max()=935, train_df.era.nunique()=116
[2023-04-28 21:58:16]  split=0, test_df[predcols].shape=(590333, 36), test_df.era.min()=948, test_df.era.max()=1059, test_df.era.nunique()=112
[2023-04-28 21:58:16]  Chosen 8 features: 
['pred_target_cyrus_v4_20_cv0', 'pred_target_caroline_v4_20_cv0', 'pred_target_sam_v4_20_cv0', 'pred_target_xerxes_v4_20_cv0', 'pred_target_nomi_v4_20_cv0', 'pred_target_tyler_v4_20_cv0', 'pred_target_ralph_v4_20_cv0', 'pred_target_waldo_v4_20_cv0']
[2023-04-28 21:58:17]  split=1, train_df[predcols].shape=(571392, 36), train_df.era.min()=556, train_df.era.max()=671, train_df.era.nunique()=116
[2023-04-28 21:58:18]  split=1, test_df[predcols].shape=(571414, 36), test_df.era.min()=684, test_df.era.max()=795, test_df.era.nunique()=112
[2023-04-28 21:58:18]  Chosen 8 features: 
['pred_target_cyrus_v4_20_cv1', 'pred_target_caroline_v4_20_cv1', 'pred_target_sam_v4_2

Unnamed: 0,ensemble_top8preds_mean_cv0,ensemble_top8preds_mean_cv1,ensemble_top8preds_mean_cv2
count,590333.0,571414.0,507469.0
mean,0.499296,0.500676,0.50017
std,0.015129,0.017853,0.024679
min,0.410754,0.413525,0.355457
25%,0.489624,0.488847,0.484232
50%,0.49921,0.500145,0.49955
75%,0.509048,0.511997,0.515318
max,0.574484,0.608171,0.66813


Unnamed: 0,mean,std,sharpe
ensemble_top8preds_mean_cv0,2.60%,2.13%,122.04%
ensemble_top8preds_mean_cv1,2.97%,1.87%,159.35%
ensemble_top8preds_mean_cv2,3.05%,1.65%,184.33%
cv_mean,2.87%,1.88%,155.24%
cv_low,2.60%,1.61%,119.76%
cv_high,3.15%,2.15%,190.71%
cv_std,0.24%,0.24%,31.35%


### 3.3 Only cyrus

In [141]:
bl_cyrus_metrics = score_baselines(
    cv_valpreds=cv_valpreds,
    predcols_subset=[
    "pred_target_cyrus_v4_20",
    ],
    agg_fn=np.mean,
    baseline_name="top8preds_mean",  # NOTE: name carried over from 3.2; this run uses only cyrus
)

CV split:   0%|          | 0/3 [00:00<?, ?it/s]

[2023-04-28 21:58:06]  split=0, train_df[predcols].shape=(587086, 36), train_df.era.min()=820, train_df.era.max()=935, train_df.era.nunique()=116
[2023-04-28 21:58:06]  split=0, test_df[predcols].shape=(590333, 36), test_df.era.min()=948, test_df.era.max()=1059, test_df.era.nunique()=112
[2023-04-28 21:58:06]  Chosen 1 features: 
['pred_target_cyrus_v4_20_cv0']
[2023-04-28 21:58:08]  split=1, train_df[predcols].shape=(571392, 36), train_df.era.min()=556, train_df.era.max()=671, train_df.era.nunique()=116
[2023-04-28 21:58:08]  split=1, test_df[predcols].shape=(571414, 36), test_df.era.min()=684, test_df.era.max()=795, test_df.era.nunique()=112
[2023-04-28 21:58:08]  Chosen 1 features: 
['pred_target_cyrus_v4_20_cv1']
[2023-04-28 21:58:09]  split=2, train_df[predcols].shape=(500976, 36), train_df.era.min()=292, train_df.era.max()=408, train_df.era.nunique()=117
[2023-04-28 21:58:09]  split=2, test_df[predcols].shape=(507469, 36), test_df.era.min()=421, test_df.era.max()=531, test_df.era

Unnamed: 0,ensemble_top8preds_mean_cv0,ensemble_top8preds_mean_cv1,ensemble_top8preds_mean_cv2
count,590333.0,571414.0,507469.0
mean,0.499325,0.500647,0.499923
std,0.015416,0.018196,0.026042
min,0.399952,0.391883,0.340739
25%,0.489571,0.488724,0.483367
50%,0.499304,0.500166,0.499483
75%,0.509175,0.512116,0.515794
max,0.583978,0.622373,0.680007


Unnamed: 0,mean,std,sharpe
ensemble_top8preds_mean_cv0,2.40%,2.07%,115.92%
ensemble_top8preds_mean_cv1,2.92%,1.95%,149.20%
ensemble_top8preds_mean_cv2,2.98%,1.65%,180.92%
cv_mean,2.77%,1.89%,148.68%
cv_low,2.40%,1.65%,111.90%
cv_high,3.13%,2.14%,185.46%
cv_std,0.32%,0.22%,32.50%


In [19]:
ut.fmt_metrics_df(bl_cyrus_metrics.loc[["cv_mean"]], add_bar=False)

Unnamed: 0,mean,std,sharpe
cv_mean,2.77%,1.89%,148.68%


## 4.1 Sequential feature selection with Ridge regression

In [44]:
import sklearn.linear_model as sklin
import sklearn.base


In [127]:
# Strip the trailing "_cvN" (4 characters) to get split-independent column prefixes
PREDCOL_PREFIXES = [c[:-4] for c in cv_valpreds["cv_predcols_map"][0]]
log.info(PREDCOL_PREFIXES)

[2023-04-28 21:45:16]  ['pred_target_arthur_v4_20', 'pred_target_arthur_v4_60', 'pred_target_thomas_v4_20', 'pred_target_thomas_v4_60', 'pred_target_cyrus_v4_20', 'pred_target_cyrus_v4_60', 'pred_target_caroline_v4_20', 'pred_target_caroline_v4_60', 'pred_target_sam_v4_20', 'pred_target_sam_v4_60', 'pred_target_xerxes_v4_20', 'pred_target_xerxes_v4_60', 'pred_target_nomi_v4_20', 'pred_target_nomi_v4_60', 'pred_target_tyler_v4_20', 'pred_target_tyler_v4_60', 'pred_target_victor_v4_20', 'pred_target_victor_v4_60', 'pred_target_ralph_v4_20', 'pred_target_ralph_v4_60', 'pred_target_waldo_v4_20', 'pred_target_waldo_v4_60', 'pred_target_jerome_v4_20', 'pred_target_jerome_v4_60', 'pred_target_janet_v4_20', 'pred_target_janet_v4_60', 'pred_target_ben_v4_20', 'pred_target_ben_v4_60', 'pred_target_alan_v4_20', 'pred_target_alan_v4_60', 'pred_target_paul_v4_20', 'pred_target_paul_v4_60', 'pred_target_george_v4_20', 'pred_target_george_v4_60', 'pred_target_william_v4_20', 'pred_target_william_v4_6

In [132]:
def xval(
    cv_valpreds,
    untrained_mdl,
    model_nm_prefix="",
    pcols_like_list=None,
    overwrite_models=False,
    verbose=1,
    to_split_train=False,
):
    """Cross validates the ensemble model.
    
    :param verbose: > 0 prints everything, -1 just coefficients and average cv perf
        and < -1, nothing.
    :param pcols_like_list: A list of partial predcol names to select
        a subset of predcols. Example: `['pred_target_sam_v4_20']` will
        select `pred_target_sam_v4_20_cv0`, `pred_target_sam_v4_20_cv1`
        and `pred_target_sam_v4_20_cv2`.
    :param to_split_train: If True, split the validation set at its median era: fit on
        the earlier half and score on the later half (after skipping 12 eras). The test
        set is not used at all.
    """
    cv = len(cv_valpreds["cv_predcols_map"])
    ensmbl_cv_models, ensmbl_cv_predcols, ensmbl_cv_metrics = [], [], []
    cv_pred_descs, cv_num_feats, cv_coef_dfs = [], [], []
    verbose_pos = verbose > 0

    raw_iterand = zip(
        range(cv),
        cv_valpreds["cv_predcols_map"],
        cv_valpreds["cv_to_val_test_map"],
    )
    if verbose_pos:
        iterand = tqdm(raw_iterand, desc="CV split", total=cv)
    else:
        iterand = raw_iterand
    for split, predcols, val_test_map in iterand:
        model_nm = f"{model_nm_prefix}_cv{split}.pkl"
        if to_split_train:
            val_df = val_test_map["val"]
            median = np.median(val_df[ERA_COL].unique())
            train_df = val_df[val_df[ERA_COL] < median]
            test_df = val_df[val_df[ERA_COL] > median + 12]
        else:
            train_df, test_df = val_test_map["val"], val_test_map["test"]
        if pcols_like_list is None:
            chosen_cols = predcols
        else:
            chosen_cols = extract_cols_like(all_cols=predcols, match_cols=pcols_like_list)
        # Try to load the trained model
        loaded_mdl = ut.load_model(model_nm, model_folder=MODEL_DIR)
        train_st_tm = time.time()
        if loaded_mdl and not overwrite_models:
            if verbose_pos:
                log.info(f"Loaded saved model: `{model_nm}`")
            split_mdl = loaded_mdl
        else:
            if verbose_pos:
                log.info(f"Training new model. No model named `{model_nm}` saved...")
            split_mdl = sklearn.base.clone(untrained_mdl)
            # Use train and test df which have prediction columns from models trained on
            # each target, cval split
            if verbose_pos:
                log.info(
                    f"{split=}, {train_df[chosen_cols].shape=}, {train_df.era.min()=}, "
                    f"{train_df.era.max()=}, {train_df.era.nunique()=}",
                )
            # We don't have to filter out NAs as cyrus doesn't have NA values.
            if verbose_pos:
                log.info(f"Training model on {len(chosen_cols)} columns: {chosen_cols}")
            split_mdl.fit(X=train_df[chosen_cols], y=train_df[TARGET_COL])
            if verbose_pos:
                log.info(f"Saving the trained model `{model_nm}`...")
            ut.save_model(model=split_mdl, name=model_nm, model_folder=MODEL_DIR)
        # Model coefficients
        cv_coef_dfs.append(
            pd.DataFrame(
                {f"coef_cv{split}": np.concatenate(([split_mdl.intercept_], split_mdl.coef_))},
                index=["intercept"] + [refmt_predcols(c) for c in chosen_cols],
            )
        )
        ensmbl_predcol = f"ensemble_{model_nm_prefix}__cv{split}"
        if verbose_pos:
            log.info(f"Predicting column: {ensmbl_predcol}")
        test_df[ensmbl_predcol] = split_mdl.predict(X=test_df[chosen_cols])
        metrics_df = ut.validation_metrics(
            validation_data=test_df, pred_cols=[ensmbl_predcol], target_col=TARGET_COL
        )
        # Get stats on number of zeroed out features
        z_coef = split_mdl.coef_==0
        cv_num_feats.append((~z_coef).sum())
        if verbose_pos:
            log.info(f"Percent of zero columns: {z_coef.mean():.0%} ({z_coef.sum()}/{len(z_coef)})")
            log.info(f"Training time: {(time.time() - train_st_tm):.0f} seconds\n")
        # collect data for saving
        ensmbl_cv_models.append(split_mdl)
        ensmbl_cv_predcols.append(ensmbl_predcol)
        ensmbl_cv_metrics.append(metrics_df)
        cv_pred_descs.append(test_df[ensmbl_predcol].describe())
    if verbose_pos:
        log.info("Prediction distribution")
        display(pd.concat(cv_pred_descs, axis=1))
    cv_metrics = ut.to_cv_agg_df(ensmbl_cv_metrics)
    coefs_df = pd.concat(cv_coef_dfs, axis=1).transpose()
    coefs_df.loc['avg'] = coefs_df.mean(axis=0)
    if verbose == -1:
        display(coefs_df.style.bar(align="zero", color=["#d65f5f", "#74A662"]))
        display(cv_metrics.loc[["cv_mean"]])
    return {
        "cv_models": ensmbl_cv_models,
        "cv_pred_cols": ensmbl_cv_predcols,
        "cv_metrics": cv_metrics,
        "num_feats": int(np.mean(cv_num_feats)),
        "coefs": coefs_df,
    }

In [133]:
SFS_MAXCOLS = 8
SFS_RIDGE_PARAMS = dict(alpha=100., fit_intercept=False, random_state=42)
SFS_ESTIMATOR = sklin.Ridge(**SFS_RIDGE_PARAMS)
CHOSEN_METRIC = "sharpe"

In [134]:
def get_best_ft(ft_metric_map, metric):
    """Given a dict mapping key to scores, returns key with max score"""
    return sorted(ft_metric_map.items(), key=lambda kv: -kv[1].loc["cv_mean", metric])[0][0]
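
A short usage sketch for `get_best_ft`, re-declared here so that the block is self-contained. The feature names and metric values are made up purely to show that the winner depends on which metric is passed in.

```python
import pandas as pd


def get_best_ft(ft_metric_map, metric):
    """Given a dict mapping key to scores, returns key with max score"""
    return sorted(ft_metric_map.items(), key=lambda kv: -kv[1].loc["cv_mean", metric])[0][0]


# Hypothetical candidates: feat_a has the higher mean, feat_b the higher sharpe
toy_metrics = {
    "feat_a": pd.DataFrame({"mean": [0.031], "sharpe": [1.90]}, index=["cv_mean"]),
    "feat_b": pd.DataFrame({"mean": [0.030], "sharpe": [2.05]}, index=["cv_mean"]),
}
by_mean = get_best_ft(toy_metrics, metric="mean")
by_sharpe = get_best_ft(toy_metrics, metric="sharpe")
print(by_mean, by_sharpe)  # feat_a feat_b
```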

In [140]:
chosen_features = ["pred_target_cyrus_v4_20"]

num_feats_to_add = SFS_MAXCOLS-len(chosen_features)
for i in tqdm(range(num_feats_to_add), total=num_feats_to_add, desc="Rounds"):
    ft_opts = sorted(c for c in PREDCOL_PREFIXES if c not in chosen_features)
    ft_metric_map = {}
    ft_coef_map = {}
    for ft_to_try in tqdm(ft_opts, desc="Trying feature"):
        xval_res = xval(
            cv_valpreds=cv_valpreds,
            untrained_mdl=sklearn.base.clone(SFS_ESTIMATOR),
            model_nm_prefix=f"sfs_{hash(tuple(chosen_features + [ft_to_try]))}",
            pcols_like_list=chosen_features + [ft_to_try],
            overwrite_models=True,
            verbose=-2,
            to_split_train=True,
        )
        ft_metric_map[ft_to_try] = xval_res["cv_metrics"].loc[["cv_mean"]]
        ft_coef_map[ft_to_try] = xval_res['coefs']
        #display(ft_metric_map[ft_to_try])
    # NOTE: the selection metric is hard-coded to "mean"; CHOSEN_METRIC ("sharpe") is not used here
    best_ft = get_best_ft(ft_metric_map, metric="mean")
    best_metrics = ft_metric_map[best_ft]
    
    log.info(f"Best ft in round {i} is {best_ft} with validation metrics:")
    display(ut.fmt_metrics_df(best_metrics, add_bar=False))
    # With the new feature added, cross validate on out of sample test set
    chosen_features.append(best_ft)
    oos_metrics = xval(
        cv_valpreds=cv_valpreds,
        untrained_mdl=sklearn.base.clone(SFS_ESTIMATOR),
        model_nm_prefix=f"sfs_full_{hash(tuple(chosen_features))}",
        pcols_like_list=chosen_features,
        overwrite_models=True,
        verbose=-2,
        to_split_train=False,
    )["cv_metrics"].loc[["cv_mean"]]
    log.info(f"Best ft in round {i} is {best_ft} with test metrics:")
    display(ut.fmt_metrics_df(oos_metrics, add_bar=False))
    
    log.info(f"Adding {best_ft}...")
    display(ut.fmt_metrics_df(ft_coef_map[best_ft]))

Rounds:   0%|          | 0/7 [00:00<?, ?it/s]

Trying feature:   0%|          | 0/35 [00:00<?, ?it/s]

[2023-04-28 21:52:32]  Best ft in round 0 is pred_target_waldo_v4_20 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.09%,1.66%,197.32%


[2023-04-28 21:52:36]  Best ft in round 0 is pred_target_waldo_v4_20 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.86%,1.89%,154.02%


[2023-04-28 21:52:36]  Adding pred_target_waldo_v4_20...


Unnamed: 0,intercept,cyrus_v4_20,waldo_v4_20
coef_cv0,0.00%,53.39%,46.59%
coef_cv1,0.00%,52.75%,47.01%
coef_cv2,0.00%,52.46%,47.37%
avg,0.00%,52.87%,46.99%


Trying feature:   0%|          | 0/34 [00:00<?, ?it/s]

[2023-04-28 21:53:02]  Best ft in round 1 is pred_target_victor_v4_20 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.14%,1.61%,202.71%


[2023-04-28 21:53:05]  Best ft in round 1 is pred_target_victor_v4_20 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.83%,1.88%,152.76%


[2023-04-28 21:53:05]  Adding pred_target_victor_v4_20...


Unnamed: 0,intercept,cyrus_v4_20,victor_v4_20,waldo_v4_20
coef_cv0,0.00%,35.07%,35.78%,29.14%
coef_cv1,0.00%,37.15%,30.30%,32.33%
coef_cv2,0.00%,36.40%,30.75%,32.70%
avg,0.00%,36.21%,32.28%,31.39%


Trying feature:   0%|          | 0/33 [00:00<?, ?it/s]

[2023-04-28 21:53:30]  Best ft in round 2 is pred_target_tyler_v4_20 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.15%,1.64%,200.17%


[2023-04-28 21:53:33]  Best ft in round 2 is pred_target_tyler_v4_20 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.83%,1.87%,154.50%


[2023-04-28 21:53:33]  Adding pred_target_tyler_v4_20...


Unnamed: 0,intercept,cyrus_v4_20,tyler_v4_20,victor_v4_20,waldo_v4_20
coef_cv0,0.00%,27.73%,22.99%,28.31%,20.98%
coef_cv1,0.00%,29.29%,25.30%,21.84%,23.39%
coef_cv2,0.00%,28.35%,26.50%,21.76%,23.23%
avg,0.00%,28.46%,24.93%,23.97%,22.53%


Trying feature:   0%|          | 0/32 [00:00<?, ?it/s]

[2023-04-28 21:53:57]  Best ft in round 3 is pred_target_alan_v4_60 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.17%,1.68%,196.46%


[2023-04-28 21:54:00]  Best ft in round 3 is pred_target_alan_v4_60 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.82%,1.86%,153.85%


[2023-04-28 21:54:00]  Adding pred_target_alan_v4_60...


Unnamed: 0,intercept,cyrus_v4_20,tyler_v4_20,victor_v4_20,waldo_v4_20,alan_v4_60
coef_cv0,0.00%,21.53%,17.07%,21.12%,15.27%,25.04%
coef_cv1,0.00%,24.89%,20.93%,16.38%,19.39%,18.25%
coef_cv2,0.00%,25.56%,23.75%,18.22%,20.56%,11.77%
avg,0.00%,23.99%,20.58%,18.58%,18.41%,18.35%


Trying feature:   0%|          | 0/31 [00:00<?, ?it/s]

[2023-04-28 21:54:24]  Best ft in round 4 is pred_target_jerome_v4_60 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.18%,1.68%,196.48%


[2023-04-28 21:54:27]  Best ft in round 4 is pred_target_jerome_v4_60 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.82%,1.85%,154.44%


[2023-04-28 21:54:27]  Adding pred_target_jerome_v4_60...


Unnamed: 0,intercept,cyrus_v4_20,tyler_v4_20,victor_v4_20,waldo_v4_20,jerome_v4_60,alan_v4_60
coef_cv0,0.00%,20.67%,16.04%,20.24%,14.02%,5.11%,23.97%
coef_cv1,0.00%,25.09%,21.23%,16.61%,19.73%,-1.37%,18.55%
coef_cv2,0.00%,25.84%,24.08%,18.50%,20.96%,-1.82%,12.30%
avg,0.00%,23.87%,20.45%,18.45%,18.24%,0.64%,18.27%


Trying feature:   0%|          | 0/30 [00:00<?, ?it/s]

[2023-04-28 21:54:50]  Best ft in round 5 is pred_target_william_v4_20 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.19%,1.67%,198.08%


[2023-04-28 21:54:54]  Best ft in round 5 is pred_target_william_v4_20 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.80%,1.86%,151.86%


[2023-04-28 21:54:54]  Adding pred_target_william_v4_20...


Unnamed: 0,intercept,cyrus_v4_20,tyler_v4_20,victor_v4_20,waldo_v4_20,jerome_v4_60,alan_v4_60,william_v4_20
coef_cv0,0.00%,18.49%,14.14%,17.94%,12.00%,2.83%,22.51%,12.06%
coef_cv1,0.00%,22.01%,18.60%,13.42%,16.88%,-4.57%,16.70%,16.68%
coef_cv2,0.00%,22.01%,20.64%,14.38%,17.79%,-6.13%,11.09%,19.86%
avg,0.00%,20.84%,17.79%,15.25%,15.56%,-2.62%,16.77%,16.20%


Trying feature:   0%|          | 0/29 [00:00<?, ?it/s]

[2023-04-28 21:55:16]  Best ft in round 6 is pred_target_caroline_v4_20 with validation metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,3.20%,1.68%,198.35%


[2023-04-28 21:55:19]  Best ft in round 6 is pred_target_caroline_v4_20 with test metrics:


Unnamed: 0,mean,std,sharpe
cv_mean,2.81%,1.85%,153.05%


[2023-04-28 21:55:19]  Adding pred_target_caroline_v4_20...


Unnamed: 0,intercept,cyrus_v4_20,caroline_v4_20,tyler_v4_20,victor_v4_20,waldo_v4_20,jerome_v4_60,alan_v4_60,william_v4_20
coef_cv0,0.00%,15.09%,15.74%,11.92%,15.37%,9.46%,1.95%,20.69%,9.79%
coef_cv1,0.00%,17.52%,18.30%,16.05%,10.34%,13.82%,-5.18%,14.87%,14.03%
coef_cv2,0.00%,17.98%,14.82%,18.74%,11.85%,15.09%,-6.47%,9.73%,17.95%
avg,0.00%,16.86%,16.29%,15.57%,12.52%,12.79%,-3.24%,15.10%,13.92%
