<h2>Recall Optimization</h2>

In this notebook, we once again use Optuna. Instead of using NSGA-II for variable selection, however, we tune model hyperparameters to achieve the best possible Recall score.

We use the TPESampler as our primary optimization algorithm together with median pruning: if a trial's intermediate Recall is worse than the median of earlier trials at the same step, the trial is stopped early. We compare versions of XGBoost and LogisticRegression with tuned hyperparameters, different decision boundaries (probability thresholds), and a "balanced" mode that seeks to increase the influence of the rare positive class.
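
As a rough illustration of the median rule described above, the sketch below reimplements the comparison in plain Python. It is not Optuna's implementation: the function name `should_prune_by_median`, its parameters, and the recall values are hypothetical, and it only assumes that a trial is compared against the median of earlier trials' scores at the same step once enough trials have reported.

```python
from statistics import median


def should_prune_by_median(value, earlier_values, n_min_trials=1, maximize=True):
    """Return True when `value` is worse than the median of `earlier_values`.

    Hypothetical helper for illustration only; it mimics the idea behind
    median pruning, not Optuna's internal code.
    """
    # With too few earlier trials there is no reliable median to compare against.
    if len(earlier_values) < n_min_trials:
        return False
    cutoff = median(earlier_values)
    return value < cutoff if maximize else value > cutoff


# Made-up first-fold recall scores reported by four earlier trials.
earlier_recalls = [0.10, 0.40, 0.55, 0.62]

print(should_prune_by_median(0.20, earlier_recalls))  # below the median of 0.475
print(should_prune_by_median(0.60, earlier_recalls))  # above the median
print(should_prune_by_median(0.20, earlier_recalls, n_min_trials=5))  # too few trials
```

A large `n_min_trials`, such as the `search_iter // 2` used in this notebook, postpones any pruning until that many trials have reported a value at the same step.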

More detailed results can be found in our final report!

<h2>Note</h2>

These optimization loops can take a while, so please be patient! As you will see (and as we explain in the report), we stopped one of the optimization runs early because we felt we already had enough information and a sufficiently performant model.

In [1]:
import os
import sys
from pathlib import Path
import optuna
from optuna.samplers import TPESampler
from optuna.pruners import MedianPruner
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score,
    recall_score,
    precision_score,
    log_loss,
    roc_auc_score
)
from sklearn.dummy import DummyClassifier

src_path = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if src_path not in sys.path:
    sys.path.insert(0, src_path)

from capstone.expanding_scaler import global_expanding_standard_scaler_by_date

BASE_DIR = Path().resolve().parent

In [None]:
class OptimizerClassifier():
    def __init__(
            self,
            search_iter=5000,
            decision_threshold=0.5,
            scoring_metric='recall',
            xgb_objective='binary:logistic',
            random_state=42,
            n_jobs=-1
    ):
        self.search_iter = search_iter
        self.decision_threshold = decision_threshold
        self.scoring_metric = scoring_metric
        self.xgb_objective = xgb_objective
        self.random_state = random_state
        self.n_jobs = n_jobs
        self.base_params = {
            "random_state": self.random_state,
            "n_jobs": self.n_jobs,
        }
        self.scorer = self._make_scorer_cust(self.scoring_metric)
        self.binary_vars = None
        self.date_col = None
        self.cv = None
        self.scaled_data = {}
        self.best_estimator = None
        self.best_score = None
        self.best_params = None
        self.best_use_balance = None

    def _make_scorer_cust(self, scoring_metric: str):
        if scoring_metric in ['logloss', 'mlogloss']:
            return log_loss
        # Uses 1 - accuracy to align with XGBoost error.
        elif scoring_metric in ['error', 'merror']:
            return lambda y_true, y_pred: 1 - (accuracy_score(y_true, y_pred))
        elif scoring_metric == "recall":
            return recall_score
        elif scoring_metric == "precision":
            return precision_score
        elif scoring_metric == "auc":
            return roc_auc_score
        else:
            raise ValueError(f"Unsupported scoring metric: {scoring_metric}")
    
    def _get_scaled_train_test_groups(self, X, y):

        order_idx = X[self.date_col].sort_values().index
        X_sorted = X.loc[order_idx]

        cont_cols = [c for c in X.columns if c not in self.binary_vars and c != self.date_col]
                
        if self.cv is None:
            
            tscv = TimeSeriesSplit(
                n_splits=3,
                test_size=int(round(X_sorted.shape[0] * 0.10, 0)),
                gap=0,
            )
            self.cv = list(tscv.split(X_sorted))

            for split, (train_index, test_index) in enumerate(self.cv):
                X_train, X_test = X_sorted.iloc[train_index], X_sorted.iloc[test_index]

                X_train_scaled = X_train.copy()
                X_test_scaled  = X_test.copy()

                X_train_scaled[cont_cols] = X_train_scaled[cont_cols].astype(float)
                X_test_scaled[cont_cols]  = X_test_scaled[cont_cols].astype(float)

                train_for_scaler = X_train_scaled[cont_cols + [self.date_col]]
                train_scaled_full, train_scaler_state = global_expanding_standard_scaler_by_date(
                    train_for_scaler,
                    date_col=self.date_col,
                    merge_cols=[self.date_col],
                    min_periods=0,
                    return_stats=True,
                )
                X_train_scaled.loc[train_scaled_full.index, cont_cols] = train_scaled_full[cont_cols]

                test_for_scaler = X_test_scaled[cont_cols + [self.date_col]]
                test_scaled_full = global_expanding_standard_scaler_by_date(
                    test_for_scaler,
                    date_col=self.date_col,
                    merge_cols=[self.date_col],
                    min_periods=0,
                    stats=train_scaler_state,
                    return_stats=False,
                )
                X_test_scaled.loc[test_scaled_full.index, cont_cols] = test_scaled_full[cont_cols]

                X_train_lr = X_train_scaled.drop(columns=[self.date_col])
                X_test_lr = X_test_scaled.drop(columns=[self.date_col])

                self.scaled_data[f'train_{split}'] = X_train_lr
                self.scaled_data[f'test_{split}'] = X_test_lr

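            # Scale the continuous columns over the full data set for the final refit,
            # keeping the binary columns alongside them as in the per-fold frames.
            X_all_scaled = X.copy()
            X_all_scaled[cont_cols] = X_all_scaled[cont_cols].astype(float)
            X_group_scaled = global_expanding_standard_scaler_by_date(
                X_all_scaled[cont_cols + [self.date_col]],
                date_col=self.date_col,
                merge_cols=[self.date_col],
                min_periods=0
            )
            X_all_scaled.loc[X_group_scaled.index, cont_cols] = X_group_scaled[cont_cols]
            self.scaled_data['all'] = X_all_scaled.drop(columns=[self.date_col])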

    def _eval_classifier(self, X, y, model_params, trial):

        if self.cv is None:
            raise ValueError('self.cv is not set.')
        
        order_idx = X[self.date_col].sort_values().index
        y_sorted = y.loc[order_idx]
        
        fold_scores = []
        for split, (train_index, test_index) in enumerate(self.cv):
            X_train = self.scaled_data[f'train_{split}']
            X_test  = self.scaled_data[f'test_{split}']

            y_train, y_test = y_sorted.iloc[train_index], y_sorted.iloc[test_index]
            
            # We aren't using the pruning callback because it would interrupt the
            # k-fold cross-validation. Instead, we use early stopping as a parameter
            # of the model, and allow Optuna to then decide where to search next.
            model = self.ModelClass(**model_params)

            if self.ModelClass is XGBClassifier:
                model.fit(
                    X_train, y_train,
                    eval_set=[(X_test, y_test)],
                    verbose=False
                )
            else:
                model.fit(X_train,y_train)
            # During training with an eval_set and early_stopping_rounds,
            # XGBoost tracks the validation score at each boosting round.
            # When validation stops improving for early_stopping_rounds
            # consecutive rounds, training halts and best_iteration is then
            # set to the boosting round (0-based index) with the best validation score.
            best_iter = getattr(model, "best_iteration", None)
            use_proba = self.scoring_metric in ("logloss", "mlogloss", "auc")

            if best_iter is not None:
                y_proba_test = model.predict_proba(X_test, iteration_range=(0, best_iter + 1))[:, 1]
            else:
                y_proba_test = model.predict_proba(X_test)[:, 1]
            
            # Convert probabilities to hard labels at the configured threshold.
            yhat_test = (y_proba_test >= self.decision_threshold).astype(int)

            if use_proba:
                fold_score = self.scorer(y_test, y_proba_test)
            else:
                fold_score = self.scorer(y_test, yhat_test)

            if self.ModelClass is XGBClassifier:
                trial.report(fold_score, step=split)
                if trial.should_prune():
                    raise optuna.TrialPruned()

            fold_scores.append(fold_score)
        
        return float(np.mean(fold_scores))

    def _run_optimization(self, X, y):

        self._get_scaled_train_test_groups(X, y)

        def __objective(trial: optuna.Trial) -> float:

            if self.cv is None or not isinstance(self.cv, list):
                raise ValueError('cv_splits is not set.')
            if X is None:
                raise ValueError('X is not set.')
            if y is None:
                raise ValueError('y is not set.')
            if self.ModelClass is None:
                raise ValueError('ModelClass is not set.')
            
            if self.ModelClass is XGBClassifier:
                neg = (y == 0).sum()
                pos = (y == 1).sum()
                balance_eq = neg / pos

                use_balance = trial.suggest_categorical("use_balance_weight", [True, False])
                model_params = {
                    **self.base_params,
                    "n_estimators": 3000,
                    "early_stopping_rounds": 50,
                    "objective": self.xgb_objective,
                    "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
                    "max_depth": trial.suggest_int("max_depth", 3, 10),
                    "min_child_weight": trial.suggest_float("min_child_weight", 0.5, 20.0, log=True),
                    "subsample": trial.suggest_float("subsample", 0.6, 1.0),
                    "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
                    "gamma": trial.suggest_float("gamma", 1e-9, 10.0, log=True),
                }
                if use_balance:
                    model_params["scale_pos_weight"] = balance_eq
            else:
                model_params = {
                    **self.base_params,
                    "max_iter": 1000,
                    # "solver": "liblinear",
                    "class_weight": trial.suggest_categorical("class_weight", ["balanced", None]),
                    "C": trial.suggest_float("C", 1e-9, 10.0, log=True),
                    #"penalty": trial.suggest_categorical("penalty", ["l1", "l2"]),
                }
            

            score = self._eval_classifier(X, y, model_params, trial)

            return score
        
        if self.scoring_metric in ("logloss", "mlogloss", "error", "merror"):
            direction = "minimize"
        else:
            direction = "maximize"
        
        study = optuna.create_study(
            direction=direction,
            pruner=MedianPruner(n_min_trials=self.search_iter // 2),
            sampler=TPESampler(seed=self.random_state)
        )
        study.optimize(
            __objective,
            n_trials=self.search_iter
        )

        order_idx = X[self.date_col].sort_values().index
        X_all_scaled = self.scaled_data['all'].loc[order_idx]
        y_sorted = y.loc[order_idx]

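        study_params = dict(study.best_params)

        use_balance = study_params.pop("use_balance_weight", None)
        best_model_params = {**self.base_params, **study_params}
        # Carry over settings that were fixed during the search but are not
        # part of study.best_params, so the refit matches the tuned setup.
        if self.ModelClass is XGBClassifier:
            best_model_params["objective"] = self.xgb_objective
            if use_balance:
                best_model_params["scale_pos_weight"] = (y == 0).sum() / (y == 1).sum()
        else:
            best_model_params["max_iter"] = 1000
        best_estimator = self.ModelClass(**best_model_params)
        best_estimator.fit(X_all_scaled, y_sorted)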

        self.best_estimator = best_estimator
        self.best_score = study.best_value
        self.best_params = best_model_params
        self.best_use_balance = use_balance

    def fit_transform(
        self,
        X,
        y,
        date_col,
        binary_vars=None,
        model_type='xgb_clf'
    ):
        
        self.date_col = date_col
        self.binary_vars = binary_vars or []
        
        if model_type == 'xgb_clf':
            self.ModelClass = XGBClassifier
        elif model_type == 'lr':
            self.ModelClass = LogisticRegression
        else:
            raise ValueError(f"Model type {model_type} is not supported.")

        self._run_optimization(X, y)

        return (self.best_estimator, self.best_score, self.best_params)
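
The fold layout built in `_get_scaled_train_test_groups` can be pictured with a small plain-Python sketch. It assumes the behaviour of `TimeSeriesSplit` with `n_splits=3`, a fixed `test_size` of 10% of the rows, and `gap=0`: each test window is the next 10% block at the end of the date-sorted data, and the training window is everything before it. The helper `expanding_splits` and the row count of 100 are hypothetical.

```python
def expanding_splits(n_samples, n_splits=3, test_fraction=0.10):
    """Return (train_indices, test_indices) pairs for expanding-window CV.

    Illustrative stand-in for the split layout, not scikit-learn's code.
    """
    test_size = int(round(n_samples * test_fraction))
    # The test windows tile the last n_splits * test_size rows.
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for fold in range(n_splits):
        test_start = first_test_start + fold * test_size
        train_idx = list(range(0, test_start))  # everything before the test window
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


for fold, (train_idx, test_idx) in enumerate(expanding_splits(100)):
    print(f"fold {fold}: train rows 0-{train_idx[-1]}, test rows {test_idx[0]}-{test_idx[-1]}")
```

Because the training window always ends before the test window starts, the scaler statistics fitted on each training fold never see later dates.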

In [7]:
df = pd.read_csv(BASE_DIR / 'recall_optimized_modeling' / 'data' / 'natality_10yr_test_data_cat.csv')

X = df.drop(columns=['morbidity_reported'])
y = df['morbidity_reported']

binary_vars = [
    'dmar',
    'ca_anen',
    'ca_mnsb',
    'ca_cchd',
    'ca_cdh',
    'ca_omph',
    'ca_gast',
    'ca_limb',
    'ca_cleft',
    'ca_clpal',
    'ca_hypo',
    'sex',
    'ca_down',
    'precare_binary',
    'prior_dead_term_binary',
    'ca_disor',
    'smoking',
    'hospital_birth_binary'
 ]

In [8]:
dummy = DummyClassifier(strategy='uniform')
dummy2 = DummyClassifier(strategy='stratified')

dummy.fit(X.drop('date', axis=1), y)
dummy2.fit(X.drop('date', axis=1), y)

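# DummyClassifier ignores the feature values, but we predict on the same
# columns used for fitting to keep the calls consistent.
X_no_date = X.drop('date', axis=1)
y_pred = dummy.predict(X_no_date)
y_pred2 = dummy2.predict(X_no_date)

# average='macro' averages recall over both classes, so these baselines are
# not directly comparable with the positive-class recall optimized below.
print(f"Dummy recall:  {recall_score(y, y_pred,  average='macro')}")
print(f"Dummy2 recall: {recall_score(y, y_pred2, average='macro')}")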

Dummy recall:  0.500796931570985
Dummy2 recall: 0.49950864589961813
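
The dummy baselines above are scored with `average='macro'`, whereas the scorer inside `OptimizerClassifier` calls `recall_score` with its default settings, which report recall for the positive class only. The sketch below, written in plain Python with made-up labels and hypothetical helper names, shows how far apart the two numbers can be for the same predictions.

```python
def recall_for_class(y_true, y_pred, label):
    """Fraction of rows whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    actual = sum(1 for t in y_true if t == label)
    return hits / actual if actual else 0.0


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    labels = sorted(set(y_true))
    return sum(recall_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Made-up example: 8 negatives, 2 positives, and a model that never predicts 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

positive_recall = recall_for_class(y_true, y_pred, 1)
macro = macro_recall(y_true, y_pred)
print(f"positive-class recall: {positive_recall}")
print(f"macro recall:          {macro}")
```

A model that misses every positive case still reaches a macro recall of 0.5 here, because it gets every negative right. A macro score near 0.5 is therefore not evidence that half of the positive cases were found.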


In [5]:
optimizer = OptimizerClassifier(
    search_iter=500,
    random_state=42,
    decision_threshold=0.5,
    scoring_metric='recall',
    xgb_objective='binary:logistic',
)

best_model, best_score, best_params = optimizer.fit_transform(
    X,
    y,
    date_col='date',
    binary_vars=binary_vars,
    model_type='xgb_clf'
)

[I 2025-11-23 09:27:47,763] A new study created in memory with name: no-name-2be96f25-e49e-40d5-8d86-8bf1fa56d190
[I 2025-11-23 09:28:10,752] Trial 0 finished with value: 0.0003388465129506594 and parameters: {'use_balance_weight': False, 'learning_rate': 0.06504856968981275, 'max_depth': 7, 'min_child_weight': 0.8890398459575589, 'subsample': 0.662397808134481, 'colsample_bytree': 0.6232334448672797, 'gamma': 0.4589458612326473}. Best is trial 0 with value: 0.0003388465129506594.
[I 2025-11-23 09:35:09,305] Trial 1 finished with value: 0.0 and parameters: {'use_balance_weight': False, 'learning_rate': 0.001124579825911934, 'max_depth': 10, 'min_child_weight': 10.779361932748845, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'gamma': 6.824095540630416e-08}. Best is trial 0 with value: 0.0003388465129506594.
[I 2025-11-23 09:36:56,180] Trial 2 finished with value: 0.0005923114461669695 and parameters: {'use_balance_weight': False, 'learning_rate': 0.0117484395

In [10]:
optimizer2 = OptimizerClassifier(
    search_iter=500,
    random_state=42,
    decision_threshold=0.3,
    scoring_metric='recall',
    xgb_objective='binary:logistic',
)

best_model2, best_score2, best_params2 = optimizer2.fit_transform(
    X,
    y,
    date_col='date',
    binary_vars=binary_vars,
    model_type='xgb_clf'
)

[I 2025-11-24 11:36:53,428] A new study created in memory with name: no-name-e4e3fb9f-d760-4883-ae4c-a5ac16334911
[I 2025-11-24 11:37:16,891] Trial 0 finished with value: 0.004405499286918578 and parameters: {'use_balance_weight': False, 'learning_rate': 0.06504856968981275, 'max_depth': 7, 'min_child_weight': 0.8890398459575589, 'subsample': 0.662397808134481, 'colsample_bytree': 0.6232334448672797, 'gamma': 0.4589458612326473}. Best is trial 0 with value: 0.004405499286918578.
[I 2025-11-24 11:44:20,243] Trial 1 finished with value: 0.001019029616633373 and parameters: {'use_balance_weight': False, 'learning_rate': 0.001124579825911934, 'max_depth': 10, 'min_child_weight': 10.779361932748845, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'gamma': 6.824095540630416e-08}. Best is trial 0 with value: 0.004405499286918578.
[I 2025-11-24 11:46:09,025] Trial 2 finished with value: 0.005434678942882872 and parameters: {'use_balance_weight': False, 'learning_rate':

KeyboardInterrupt: 
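
Lowering `decision_threshold` from 0.5 to 0.3 can only keep or raise recall, because every row flagged at 0.5 is still flagged at 0.3. The toy example below, with made-up probabilities and a hypothetical helper `recall_and_false_positives`, sketches that effect together with the extra false positives it brings.

```python
def recall_and_false_positives(y_true, proba, threshold):
    """Threshold the probabilities, then count recall and false positives."""
    preds = [1 if p >= threshold else 0 for p in proba]
    true_pos = sum(1 for t, yhat in zip(y_true, preds) if t == 1 and yhat == 1)
    false_pos = sum(1 for t, yhat in zip(y_true, preds) if t == 0 and yhat == 1)
    return true_pos / sum(y_true), false_pos


# Made-up labels and predicted probabilities for the positive class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.55, 0.35, 0.32, 0.10, 0.40, 0.20, 0.05, 0.31]

for threshold in (0.5, 0.3):
    recall, false_pos = recall_and_false_positives(y_true, proba, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, false positives={false_pos}")
```

In this toy data, the lower threshold lifts recall from 0.25 to 0.75 but also flags two negatives, which is the trade-off to keep in mind when comparing the 0.5 and 0.3 runs.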

In [None]:
df2 = pd.read_csv(BASE_DIR / 'recall_modeling' / 'data' / 'natality_10yr_test_data_one_hot.csv')

X2 = df2.drop(columns=['morbidity_reported'])
y2 = df2['morbidity_reported']

binary_vars = [
 'ca_down',
 'precare_binary',
 'prior_dead_term_binary',
 'ca_disor',
 'rf_inftr',
 'rf_cesar',
 'hospital_birth_binary',
 'mracehisp_2',
 'mracehisp_3',
 'mracehisp_4',
 'mracehisp_5',
 'mracehisp_6',
 'mracehisp_7',
 'mracehisp_8',
 'meduc_2',
 'meduc_3',
 'meduc_4',
 'meduc_5',
 'meduc_6',
 'meduc_7',
 'meduc_8',
 'fracehisp_2',
 'fracehisp_3',
 'fracehisp_4',
 'fracehisp_5',
 'fracehisp_6',
 'fracehisp_7',
 'fracehisp_8',
 'feduc_2',
 'feduc_3',
 'feduc_4',
 'feduc_5',
 'feduc_6',
 'feduc_7',
 'feduc_8',
 'rf_pdiab_1',
 'rf_gdiab_1',
 'rf_phype_1',
 'rf_ghype_1',
 'cig_rec_0',
 'cig_rec_1',
 'rf_ehype_1',
 'rf_ppterm_1',
 'ip_gon_1',
 'ip_syph_1',
 'ip_chlam_1',
 'ip_hepb_1',
 'ip_hepc_1',
 'ld_indl_1',
 'ld_augm_1',
 'ld_ster_1',
 'ld_antb_1',
 'ld_chor_1',
 'ld_anes_1',
 'me_pres_2',
 'me_pres_3',
 'me_rout_1',
 'me_rout_2',
 'me_rout_3',
 'me_rout_4',
 'attend_1',
 'attend_2',
 'attend_3',
 'attend_4',
 'attend_5',
 'pay_1',
 'pay_3',
 'pay_4',
 'pay_5',
 'pay_6',
 'pay_8',
 'dplural_2',
 'dplural_3',
 'dplural_4',
 'dplural_5',
 'ab_aven1_1',
 'ab_aven6_1',
 'ab_nicu_1',
 'ab_surf_1',
 'ab_anti_1',
 'ab_seiz_1',
 'dob_mm_10',
 'dob_mm_11',
 'dob_mm_12',
 'dob_mm_2',
 'dob_mm_3',
 'dob_mm_4',
 'dob_mm_5',
 'dob_mm_6',
 'dob_mm_7',
 'dob_mm_8'
 ]

In [7]:
optimizer3 = OptimizerClassifier(
    search_iter=500,
    random_state=42,
    decision_threshold=0.5,
    scoring_metric='recall',
    xgb_objective='binary:logistic',
    n_jobs=1
)

best_model3, best_score3, best_params3 = optimizer3.fit_transform(
    X2,
    y2,
    date_col='date',
    binary_vars=binary_vars,
    model_type='lr'
)

[I 2025-11-24 07:36:19,922] A new study created in memory with name: no-name-04167ce7-689c-4b67-bab3-cded54246efb
[I 2025-11-24 07:36:32,328] Trial 0 finished with value: 0.0006805252770653881 and parameters: {'class_weight': None, 'C': 0.02089004704926668}. Best is trial 0 with value: 0.0006805252770653881.
[I 2025-11-24 07:36:38,898] Trial 1 finished with value: 0.6231534791757305 and parameters: {'class_weight': 'balanced', 'C': 3.630322466779864e-08}. Best is trial 1 with value: 0.6231534791757305.
[I 2025-11-24 07:36:52,850] Trial 2 finished with value: 0.0 and parameters: {'class_weight': None, 'C': 0.0010260065124896788}. Best is trial 1 with value: 0.6231534791757305.
[I 2025-11-24 07:37:30,457] Trial 3 finished with value: 0.5699918615258052 and parameters: {'class_weight': 'balanced', 'C': 5.001479828856933}. Best is trial 1 with value: 0.6231534791757305.
[I 2025-11-24 07:37:36,710] Trial 4 finished with value: 0.62145958878436 and parameters: {'class_weight': 'balanced', 'C
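
The gap between `class_weight=None` and `class_weight='balanced'` in the log above is consistent with how strongly the rare positive class gets reweighted. As a sketch, scikit-learn's documented "balanced" heuristic assigns each class the weight `n_samples / (n_classes * class_count)`, while the XGBoost branch of `OptimizerClassifier` uses `neg / pos` for `scale_pos_weight`. The helper names and the 95/5 label split below are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Per-class weights following n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def negative_to_positive_ratio(y):
    """Ratio used for scale_pos_weight in the XGBoost branch: neg / pos."""
    counts = Counter(y)
    return counts[0] / counts[1]


# Made-up labels: 95 negatives and 5 positives.
y_example = [0] * 95 + [1] * 5

weights = balanced_class_weights(y_example)
ratio = negative_to_positive_ratio(y_example)
print(f"balanced weights: {weights}")
print(f"positive/negative weight ratio: {weights[1] / weights[0]:.1f}")
print(f"neg / pos: {ratio:.1f}")
```

Both schemes end up weighting a positive row 19 times as heavily as a negative row in this example, so the two "balanced" modes are comparable in spirit even though the parameters differ.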

In [8]:
optimizer4 = OptimizerClassifier(
    search_iter=500,
    random_state=42,
    decision_threshold=0.3,
    scoring_metric='recall',
    xgb_objective='binary:logistic',
)

best_model4, best_score4, best_params4 = optimizer4.fit_transform(
    X2,
    y2,
    date_col='date',
    binary_vars=binary_vars,
    model_type='lr'
)

[I 2025-11-24 09:00:45,871] A new study created in memory with name: no-name-566689f4-dc60-4ca6-9149-42b3e74ac741
[I 2025-11-24 09:01:11,477] Trial 0 finished with value: 0.004590518566310295 and parameters: {'class_weight': None, 'C': 0.02089004704926668}. Best is trial 0 with value: 0.004590518566310295.
[I 2025-11-24 09:01:28,873] Trial 1 finished with value: 1.0 and parameters: {'class_weight': 'balanced', 'C': 3.630322466779864e-08}. Best is trial 1 with value: 1.0.
[I 2025-11-24 09:01:51,988] Trial 2 finished with value: 0.0005941171771830147 and parameters: {'class_weight': None, 'C': 0.0010260065124896788}. Best is trial 1 with value: 1.0.
[I 2025-11-24 09:02:42,439] Trial 3 finished with value: 0.9513527553233564 and parameters: {'class_weight': 'balanced', 'C': 5.001479828856933}. Best is trial 1 with value: 1.0.
[I 2025-11-24 09:02:57,335] Trial 4 finished with value: 1.0 and parameters: {'class_weight': 'balanced', 'C': 6.580360277501321e-08}. Best is trial 1 with value: 1.
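
A recall of exactly 1.0 for the most strongly regularized balanced models deserves a sanity check. One possible explanation, which is an assumption and not something the logs confirm, is that a very small `C` shrinks the coefficients toward zero so that every row receives a probability near 0.5; at a threshold of 0.3 every row is then flagged as positive. The sketch below uses made-up labels and a constant score to show what that would do to recall and precision; `recall_and_precision` is a hypothetical helper.

```python
def recall_and_precision(y_true, proba, threshold):
    """Threshold the probabilities and return (recall, precision)."""
    preds = [1 if p >= threshold else 0 for p in proba]
    true_pos = sum(1 for t, yhat in zip(y_true, preds) if t == 1 and yhat == 1)
    false_pos = sum(1 for t, yhat in zip(y_true, preds) if t == 0 and yhat == 1)
    false_neg = sum(1 for t, yhat in zip(y_true, preds) if t == 1 and yhat == 0)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    return recall, precision


# Made-up data: 2 positives in 100 rows, and a model that outputs 0.5 for every row.
y_true = [1] * 2 + [0] * 98
constant_proba = [0.5] * 100

recall, precision = recall_and_precision(y_true, constant_proba, threshold=0.3)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

If the tuned model behaves like this, its perfect recall carries no information, so it is worth reporting precision (or the share of rows flagged) next to recall before treating such a trial as the best one.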