# CS565-DS522 IoT Data Science Mini Project for K-EmoPhone dataset
*This material is a joint work of TAs from IC Lab at KAIST, including Panyu Zhang, Soowon Kang, and Woohyeok Choi. This work is licensed under CC BY-SA 4.0.*

## Instruction
In this mini-project, we build a model that predicts users' self-reported stress from features extracted from the K-EmoPhone dataset. The material mainly follows the public [repository](https://github.com/SteinPanyu/IndependentReproducibility) that conducts independent reproducibility experiments on the K-EmoPhone dataset. To save time, we provide the extracted features instead of starting from the raw data. We use a traditional machine learning model because the in-the-wild K-EmoPhone dataset has a limited number of labels and is multimodal.



## Guidance

1. Before running the code, please first download the extracted features from the following [link](https://drive.google.com/file/d/1HcyFvzWEzO21osyP5E8VpVmHROX1ew7q/view?usp=sharing).

2. Please change your runtime type to T4 GPU, or another runtime type with a GPU available, since we may use the GPU for xgboost execution later.

Install the latest version of xgboost (newer than 2.0.0).

In [None]:
!pip install xgboost
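
After installation, it can help to confirm that the installed version is recent enough, since the `device="cuda"` argument used later in this notebook was introduced in XGBoost 2.0. The helper below is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical names, not part of xgboost.

```python
def parse_version(version: str) -> tuple:
    """Turn '2.0.3' into (2, 0, 3); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '2.0' so that tuple comparison is fair
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.0.0") -> bool:
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.1.0"))  # True
print(meets_minimum("1.7.6"))  # False
```

In the notebook, the check could be called as `meets_minimum(xgboost.__version__)` after `import xgboost`.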



In [None]:
import pytz
import os
import pandas as pd
import numpy as np
import scipy.stats as st
import cloudpickle
from datetime import datetime
from contextlib import contextmanager
import warnings
import time
from typing import Optional
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import random
import torch


def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)

DEFAULT_TZ = pytz.FixedOffset(540)  # GMT+09:00; Asia/Seoul

RANDOM_STATE = 42


def log(msg: object):
    print('[{}] {}'.format(datetime.now().strftime('%y-%m-%d %H:%M:%S'), msg))

## 1. Preparation

### 1.1. Mount to Your Google Drive

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


### 1.2. Load Extracted Features

In [None]:
import pickle
import numpy as np

'''Please specify your dataset path in your Google Drive'''
PATH = '/content/drive/MyDrive/IoT_Data_Science/KEmoPhone/features_stress_fixed_K-EmoPhone.pkl'

# Open the pickle inside a context manager so the file handle is closed afterwards
with open(PATH, mode='rb') as f:
    X, y, groups, t, datetimes = pickle.load(f)



X holds the extracted features. The feature extraction process follows the public [repository](https://github.com/SteinPanyu/IndependentReproducibility), with the immediate past time window set to 15 minutes. y is the array of labels, and groups contains the user IDs.

Note that y is binarized with a theoretical threshold: the ESM stress label ranges over [-3, 3], and a label is set to 1 if ESM stress > 0 and to 0 otherwise.

Since features are already extracted, we do not need to work on preprocessing and feature extraction again.
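
The binarization rule described above can be sketched as follows. This is only an illustration of the rule; the function name `binarize_stress` and the sample ratings are hypothetical, and the provided pickle already contains the binarized labels.

```python
def binarize_stress(scores, threshold=0):
    """Map ESM stress ratings on the [-3, 3] scale to 1 when above the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]


raw_ratings = [-3, -1, 0, 1, 3]  # hypothetical ESM stress ratings
labels = binarize_stress(raw_ratings)
print(labels)  # [0, 0, 0, 1, 1]
```

Note that a rating of exactly 0 falls into the negative class under this rule.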

## 2. Feature Preparation


There exist multiple types of features. Please try different combinations of features to see if there is any model performance improvement.

In [None]:

#The following code is designed for reordering the data
#################################################
# Create a DataFrame with user_id and datetime

df = pd.DataFrame({'user_id': groups, 'datetime': datetimes, 'label': y})

# Merge the metadata with the features on the row index
df_merged = pd.merge(df, X, left_index=True, right_index=True)

# Sort the DataFrame by user and then by datetime
df_merged = df_merged.sort_values(by=['user_id', 'datetime'])

# Update groups and datetimes
groups = df_merged['user_id'].to_numpy()
datetimes = df_merged['datetime'].to_numpy()
y = df_merged['label'].to_numpy()

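#X with all the features, in the same (sorted) row order as y, groups, and datetimes
X_cleaned = df_merged.drop(columns=['user_id', 'datetime', 'label'])
#Keep X aligned with the reordered labels, since the feature groups below are sliced from X
X = X_cleaned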


#Divide the features into different categories
feat_current = X.loc[:,[('#VAL' in str(x)) or ('ESM#LastLabel' in str(x)) for x in X.keys()]]
feat_dsc = X.loc[:,[('#DSC' in str(x))  for x in X.keys()]]
feat_yesterday = X.loc[:,[('Yesterday' in str(x))  for x in X.keys()]]
feat_today = X.loc[:,[('Today' in str(x))  for x in X.keys()]]

feat_ImmediatePast = X.loc[:,[('ImmediatePast_15' in str(x))  for x in X.keys()]]

#################################################################################
#Below are the available features
#Divide the time window features into sensor/ESM self-report features
feat_current_sensor = X.loc[:,[('#VAL' in str(x))  for x in X.keys()]] #Current sensor features (value right before label)
feat_current_ESM = X.loc[:,[('ESM#LastLabel' in str(x)) for x in X.keys()]] #Current ESM features (value right before label)
feat_ImmediatePast_sensor = feat_ImmediatePast.loc[:,[('ESM' not in str(x)) for x in feat_ImmediatePast.keys()]] #Immediate past sensor features (in past 15 minutes before label)
feat_ImmediatePast_ESM = feat_ImmediatePast.loc[:,[('ESM'  in str(x)) for x in feat_ImmediatePast.keys()]]  #Immediate past ESM features
feat_today_sensor = feat_today.loc[:,[('ESM' not in str(x))  for x in feat_today.keys()]] #Today epoch sensor features
feat_today_ESM = feat_today.loc[:,[('ESM'  in str(x)) for x in feat_today.keys()]] #Today epoch ESM features
feat_yesterday_sensor = feat_yesterday.loc[:,[('ESM' not in str(x)) for x in feat_yesterday.keys()]] #Yesterday sensor features
feat_yesterday_ESM = feat_yesterday.loc[:,[('ESM'  in str(x)) for x in feat_yesterday.keys()]] #Yesterday ESM features

feat_sleep = X.loc[:,[('Sleep' in str(x))  for x in X.keys()]]
feat_time = X.loc[:,[('Time' in str(x))  for x in X.keys()]]
feat_pif = X.loc[:,[('PIF' in str(x))  for x in X.keys()]]
################################################################################

#Prepare the final feature set
feat_baseline = pd.concat([ feat_time,feat_dsc,feat_current_sensor, feat_ImmediatePast_sensor],axis=1)
feat_final = pd.concat([feat_baseline],axis=1)

################################################################################
#X for the baseline originally given in the notebook
X = feat_final
cats = X.columns[X.dtypes == bool]

In [None]:
feature_groups = {
    "feat_time": feat_time,
    "feat_dsc": feat_dsc,
    "feat_current_sensor": feat_current_sensor,
    "feat_current_ESM": feat_current_ESM,
    "feat_ImmediatePast_sensor": feat_ImmediatePast_sensor,
    "feat_ImmediatePast_ESM": feat_ImmediatePast_ESM,
    "feat_today_sensor": feat_today_sensor,
    "feat_today_ESM": feat_today_ESM,
    "feat_yesterday_sensor": feat_yesterday_sensor,
    "feat_yesterday_ESM": feat_yesterday_ESM,
    "feat_sleep": feat_sleep,
    "feat_pif": feat_pif,
}

feature_summary = {name: data.shape[1] for name, data in feature_groups.items()}
feature_summary

{'feat_time': 16,
 'feat_dsc': 55,
 'feat_current_sensor': 84,
 'feat_current_ESM': 1,
 'feat_ImmediatePast_sensor': 416,
 'feat_ImmediatePast_ESM': 0,
 'feat_today_sensor': 2496,
 'feat_today_ESM': 6,
 'feat_yesterday_sensor': 2496,
 'feat_yesterday_ESM': 6,
 'feat_sleep': 2,
 'feat_pif': 11}
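
The counts above make it easy to estimate how wide a candidate feature combination will be before running any training. The sketch below copies the summary printed above and enumerates combinations that add one or two extra groups to the baseline; it assumes the groups do not share columns, and the names `BASELINE`, `combined_width`, and `candidates` are illustrative only.

```python
from itertools import combinations

# Column counts copied from the feature_summary output above
feature_summary = {
    "feat_time": 16, "feat_dsc": 55, "feat_current_sensor": 84,
    "feat_current_ESM": 1, "feat_ImmediatePast_sensor": 416,
    "feat_ImmediatePast_ESM": 0, "feat_today_sensor": 2496,
    "feat_today_ESM": 6, "feat_yesterday_sensor": 2496,
    "feat_yesterday_ESM": 6, "feat_sleep": 2, "feat_pif": 11,
}

# Groups that make up feat_baseline in the cell above
BASELINE = ["feat_time", "feat_dsc", "feat_current_sensor", "feat_ImmediatePast_sensor"]


def combined_width(names, summary):
    """Total number of columns, assuming the groups do not overlap."""
    return sum(summary[name] for name in names)


baseline_width = combined_width(BASELINE, feature_summary)
print(baseline_width)  # 571

# Every combination that adds one or two of the remaining groups to the baseline
extras = [name for name in feature_summary if name not in BASELINE]
candidates = [BASELINE + list(extra) for r in (1, 2) for extra in combinations(extras, r)]
print(len(candidates))  # 36
```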

## 3. Model Training & Evaluation


Here is the revised XGBoost classifier. A random fraction (eval_size) of the training data is held out as an evaluation set for early stopping.

In [None]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier, DMatrix
from sklearn.base import BaseEstimator
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from typing import Union

#Function for revised xgboost classifier
class EvXGBClassifier(BaseEstimator):
    """
    Enhanced XGBClassifier with built-in validation set approach for early stopping.
    """
    def __init__(
        self,
        eval_size=None,
        eval_metric='logloss',
        early_stopping_rounds=10,
        random_state=None,
        **kwargs
        ):
        """
        Initializes the custom XGBoost Classifier.

        Args:
            eval_size (float): The proportion of the dataset to include in the evaluation split.
            eval_metric (str): The evaluation metric used for model training.
            early_stopping_rounds (int): The number of rounds to stop training if hold-out metric doesn't improve.
            random_state (int): Seed for the random number generator for reproducibility.
            **kwargs: Additional arguments to be passed to the underlying XGBClassifier.
        """
        self.random_state = random_state
        self.eval_size = eval_size
        self.eval_metric = eval_metric
        self.early_stopping_rounds = early_stopping_rounds
        # Initialize the XGBClassifier with specified arguments and GPU acceleration.
        self.model = XGBClassifier(
            random_state=self.random_state,
            eval_metric=self.eval_metric,
            early_stopping_rounds=self.early_stopping_rounds,
            tree_method = "hist", device = "cuda", #Use gpu for acceleration
            **kwargs
        )

    @property
    def feature_importances_(self):
        """ Returns the feature importances from the fitted model. """
        return self.model.feature_importances_

    @property
    def feature_names_in_(self):
        """ Returns the feature names from the input dataset used for fitting. """
        return self.model.feature_names_in_

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y: np.ndarray):
        """
        Fit the XGBoost model with optional early stopping using a validation set.

        Args:
            X (Union[pd.DataFrame, np.ndarray]): Training features.
            y (np.ndarray): Target values.
        """
        if self.eval_size:
            # Split data for early stopping evaluation if eval_size is specified.
            X_train_sub, X_val, y_train_sub, y_val = train_test_split(
                X, y, test_size=self.eval_size, random_state=self.random_state)
            # Fit the model with early stopping.
            self.model.fit(
                X_train_sub, y_train_sub,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
        else:
            # Fit the model without early stopping.
            self.model.set_params(early_stopping_rounds=None)
            self.model.fit(X, y, verbose=False)

        # Store the best iteration number for predictions.
        # Best iteration (safe fallback)
        booster = self.model.get_booster()
        self.best_iteration_ = (
            booster.best_iteration if hasattr(booster, "best_iteration") else None
        )
        return self

    def predict(self, X: pd.DataFrame):
        """
        Predict the classes for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        if self.best_iteration_ is not None:
            return self.model.predict(X, iteration_range=(0, self.best_iteration_ + 1))
        else:
            return self.model.predict(X)

    def predict_proba(self, X: pd.DataFrame):
        """
        Predict the class probabilities for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        if self.best_iteration_ is not None:
            return self.model.predict_proba(X, iteration_range=(0, self.best_iteration_ + 1))
        else:
            return self.model.predict_proba(X)
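
The early-stopping idea that the class above relies on can be illustrated without XGBoost. The sketch below is a simplified model of the rule, not XGBoost's implementation: training stops once the validation loss has not improved for `patience` rounds, and the best round is remembered so that prediction can use only the trees up to that round. The function name and the loss values are hypothetical.

```python
def best_round_with_early_stopping(val_losses, patience=10):
    """Return (best_round, last_round_run) for per-round validation losses."""
    best_round, best_loss = 0, float("inf")
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember the round
            best_round, best_loss = rnd, loss
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, rnd
    return best_round, len(val_losses) - 1


losses = [0.70, 0.65, 0.62, 0.63, 0.64, 0.66, 0.61]  # hypothetical validation log loss
best, stopped = best_round_with_early_stopping(losses, patience=3)
print(best, stopped)  # 2 5
```

With `best == 2`, the equivalent of `iteration_range=(0, best + 1)` uses the first three boosting rounds. The later loss of 0.61 is never seen because training stops at round 5, which is the usual trade-off of a small patience value.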

The following cell defines the functions for model training and model evaluation (cross-validation).

In [None]:
import os
import pandas as pd
import numpy as np
import time
import traceback
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, LeaveOneGroupOut, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from dataclasses import dataclass

@dataclass
class FoldResult:
    name: str
    metrics: dict
    duration: float

def log(message: str):
    print(message)  # Simple logging to stdout or enhance as needed

def train_fold(dir_result: str, fold_name: str, X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state):
    """
    Function to train and evaluate the model for a single fold.
    Args:
        dir_result (str): Directory to store results.
        fold_name (str): Name of the fold for identification.
        X_train, y_train (DataFrame, Series): Training data.
        X_test, y_test (DataFrame, Series): Testing data.
        C_cat, C_num (array): Lists of categorical and numeric feature names.
        estimator (estimator instance): The model to be trained.
        normalize (bool): Flag to apply normalization.
        select (SelectFromModel instance): Feature selection method.
        oversample (bool): Flag to apply oversampling.
        random_state (int): Random state for reproducibility.
    Returns:
        FoldResult: Object containing metrics and duration of the training.
    """
    try:
        start_time = time.time()
        if normalize:
            X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
            X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values
            # Standard scaler only applied to numeric data
            scaler = StandardScaler().fit(X_train_N)
            X_train_N = scaler.transform(X_train_N)
            X_test_N = scaler.transform(X_test_N)

            X_train = pd.DataFrame(
                np.concatenate((X_train_C, X_train_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )
            X_test = pd.DataFrame(
                np.concatenate((X_test_C, X_test_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )

        #Applying the LASSO feature selection method
        if select:

            if isinstance(select, SelectFromModel):
                select = [select]

            for i, s in enumerate(select):
                C = np.asarray(X_train.columns)
                M = s.fit(X=X_train.values, y=y_train).get_support()
                C_sel = C[M]
                C_cat = C_cat[np.isin(C_cat, C_sel)]
                C_num = C_num[np.isin(C_num, C_sel)]

                X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
                X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values


                X_train = pd.DataFrame(
                    np.concatenate((X_train_C, X_train_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )
                X_test = pd.DataFrame(
                    np.concatenate((X_test_C, X_test_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )

        if oversample:
            #If there is any categorical data, apply SMOTE-NC, otherwise just SMOTE
            if len(C_cat) > 0:
                sampler = SMOTENC(categorical_features=[X_train.columns.get_loc(c) for c in C_cat], random_state=random_state)
            else:
                sampler = SMOTE(random_state=random_state)
            X_train, y_train = sampler.fit_resample(X_train, y_train)

        estimator = clone(estimator).fit(X_train, y_train)
        y_pred = estimator.predict_proba(X_test)[:, 1]
        #Default average method for roc_auc_score is macro
        auc_score = roc_auc_score(y_test, y_pred, average=None)

        result = FoldResult(
            name=fold_name,
            metrics={'AUC': auc_score},
            duration=time.time() - start_time
        )
        log(f'Training completed for {fold_name} with AUC: {auc_score}')
        return result

    except Exception as e:
        log(f'Error in {fold_name}: {traceback.format_exc()}')
        return None

#We modify this function to include the category information
def perform_cross_validation(X, y, groups, estimator, normalize=False, select=None, oversample=False, random_state=None):
    """
    Function to perform cross-validation using StratifiedGroupKFold.
    Args:
        X, y (DataFrame, Series): The entire dataset.
        groups (array): Array indicating the group for each instance in X.
        estimator (estimator instance): The model to be trained.
        normalize, select, oversample (bool): Preprocessing options.
        random_state (int): Seed for reproducibility.
    Returns:
        list: A list containing FoldResult for each fold.
    """
    futures = []
    # Group-k cross validation
    splitter = StratifiedGroupKFold(n_splits=5, shuffle =True, random_state = 42)
    # Loop over all the stratified group k-fold splits
    for idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        C_cat = np.asarray(sorted(cats))
        C_num = np.asarray(sorted(X.columns[~X.columns.isin(C_cat)]))

        job = train_fold('path_to_results', f'Fold_{idx}', X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state)
        futures.append(job)

    return futures
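
The reason for splitting by group is that no user should contribute samples to both the training and the test side of a fold. The sketch below shows only that group constraint with a simple round-robin assignment; it omits the stratification and shuffling that `StratifiedGroupKFold` also performs, and the user IDs are hypothetical.

```python
def group_folds(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that each group lands in exactly one test fold."""
    unique_groups = sorted(set(groups))
    # Assign groups to folds in round-robin order
    fold_of = {g: i % n_splits for i, g in enumerate(unique_groups)}
    for k in range(n_splits):
        test_idx = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train_idx = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train_idx, test_idx


user_ids = ["P01", "P01", "P02", "P02", "P03", "P04", "P05", "P05", "P06"]
folds = list(group_folds(user_ids, n_splits=3))

for train_idx, test_idx in folds:
    train_users = {user_ids[i] for i in train_idx}
    test_users = {user_ids[i] for i in test_idx}
    # A user never appears on both sides of the same fold
    print(sorted(test_users), train_users.isdisjoint(test_users))
```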

Here, we define the feature selection method and the classifier, then execute the code. The reported AUC-ROC is the mean of the AUC-ROC values across all folds.

In [None]:
#Feature selection; you may want to change the feature selection method
SELECT_LASSO = SelectFromModel(
        estimator=LogisticRegression(
        penalty='l1'
        ,solver='liblinear'
        , C=1, random_state=RANDOM_STATE, max_iter=4000
    ),
    # This threshold may impact the model performance as well
    threshold = 0.005
)
#Classifier
#There could exist more parameters. Please search in your defined parameter
#space for model performance improvement
estimator = EvXGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic', #Binary classification instead of regression
    verbosity=0,
    learning_rate=0.01,
)

#Perform cross validation including model training and evaluation
results = perform_cross_validation(X, y, groups, estimator, normalize=True, select=[SELECT_LASSO], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)





Training completed for Fold_0 with AUC: 0.5760597892673365
Training completed for Fold_1 with AUC: 0.5194843617920542
Training completed for Fold_2 with AUC: 0.605396012438266
Training completed for Fold_3 with AUC: 0.5232254989908052
Training completed for Fold_4 with AUC: 0.5972818082858423
0.5642894941548608
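
To make the reported number concrete, the sketch below computes a binary AUC-ROC from its pairwise definition (the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half) and then averages the five fold values printed above. It is an illustration only; the notebook itself uses `roc_auc_score` from scikit-learn, and the small label/score lists are made up.

```python
from statistics import fmean


def roc_auc(y_true, scores):
    """Binary AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# Per-fold AUC values copied from the output above
fold_aucs = [0.5760597892673365, 0.5194843617920542, 0.605396012438266,
             0.5232254989908052, 0.5972818082858423]
mean_auc = fmean(fold_aucs)
print(round(mean_auc, 4))  # 0.5643
```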


# Assignment

## Assignment 1. Improve the model performance using different types of feature combinations. (20pts)

 Hint: Currently we are only using feat_baseline. You may want to try other feature combinations.

## Function Refactoring

First, we define a function that extracts the feature groups from the dataset. It follows almost all of the categorizations in the skeleton above, wrapped in a function for later reuse.

Second, we refactor the train_fold and perform_cross_validation functions so that they can benchmark multiple feature combinations, which makes the experiment section easier to follow.


In [None]:
def reorder_and_split_features(X, y, groups, datetimes):
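    """
    Reorder data by user and datetime and extract all categorized features.

    Parameters:
    X : Raw feature DataFrame loaded from the dataset pickle.
    y : Target labels.
    groups : User IDs, one per sample.
    datetimes : Label timestamps, one per sample.

    Returns:
    dict with the reordered X_cleaned, y, groups, datetimes, and each feature group.
    """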
    df = pd.DataFrame({'user_id': groups, 'datetime': datetimes, 'label': y})
    df_merged = pd.merge(df, X, left_index=True, right_index=True)
    df_merged = df_merged.sort_values(by=['user_id', 'datetime'])

    groups = df_merged['user_id'].to_numpy()
    datetimes = df_merged['datetime'].to_numpy()
    y = df_merged['label'].to_numpy()
    X_cleaned = df_merged.drop(columns=['user_id', 'datetime', 'label'])

    # Categorized features
    feat_current = X_cleaned.loc[:, [('#VAL' in str(x)) or ('ESM#LastLabel' in str(x)) for x in X_cleaned.columns]]
    feat_dsc = X_cleaned.loc[:, [('#DSC' in str(x)) for x in X_cleaned.columns]]
    feat_yesterday = X_cleaned.loc[:, [('Yesterday' in str(x)) for x in X_cleaned.columns]]
    feat_today = X_cleaned.loc[:, [('Today' in str(x)) for x in X_cleaned.columns]]
    feat_ImmediatePast = X_cleaned.loc[:, [('ImmediatePast_15' in str(x)) for x in X_cleaned.columns]]

    # Fine-grained subcategories
    feat_current_sensor = X_cleaned.loc[:, [('#VAL' in str(x)) for x in X_cleaned.columns]]
    feat_current_ESM = X_cleaned.loc[:, [('ESM#LastLabel' in str(x)) for x in X_cleaned.columns]]
    feat_ImmediatePast_sensor = feat_ImmediatePast.loc[:, [('ESM' not in str(x)) for x in feat_ImmediatePast.columns]]
    feat_ImmediatePast_ESM = feat_ImmediatePast.loc[:, [('ESM' in str(x)) for x in feat_ImmediatePast.columns]]
    feat_today_sensor = feat_today.loc[:, [('ESM' not in str(x)) for x in feat_today.columns]]
    feat_today_ESM = feat_today.loc[:, [('ESM' in str(x)) for x in feat_today.columns]]
    feat_yesterday_sensor = feat_yesterday.loc[:, [('ESM' not in str(x)) for x in feat_yesterday.columns]]
    feat_yesterday_ESM = feat_yesterday.loc[:, [('ESM' in str(x)) for x in feat_yesterday.columns]]

    feat_sleep = X_cleaned.loc[:, [('Sleep' in str(x)) for x in X_cleaned.columns]]
    feat_time = X_cleaned.loc[:, [('Time' in str(x)) for x in X_cleaned.columns]]
    feat_pif = X_cleaned.loc[:, [('PIF' in str(x)) for x in X_cleaned.columns]]

    # Baseline feature combination
    feat_baseline = pd.concat([feat_time, feat_dsc, feat_current_sensor, feat_ImmediatePast_sensor], axis=1)

    return {
        "X_cleaned": X_cleaned,
        "y": y,
        "groups": groups,
        "datetimes": datetimes,
        "feat_current": feat_current,
        "feat_dsc": feat_dsc,
        "feat_yesterday": feat_yesterday,
        "feat_today": feat_today,
        "feat_ImmediatePast": feat_ImmediatePast,
        "feat_current_sensor": feat_current_sensor,
        "feat_current_ESM": feat_current_ESM,
        "feat_ImmediatePast_sensor": feat_ImmediatePast_sensor,
        "feat_ImmediatePast_ESM": feat_ImmediatePast_ESM,
        "feat_today_sensor": feat_today_sensor,
        "feat_today_ESM": feat_today_ESM,
        "feat_yesterday_sensor": feat_yesterday_sensor,
        "feat_yesterday_ESM": feat_yesterday_ESM,
        "feat_sleep": feat_sleep,
        "feat_time": feat_time,
        "feat_pif": feat_pif,
        "feat_baseline": feat_baseline
    }


@dataclass
class FoldResult:
    """
    Data class to store results for each fold.
    """
    name: str
    metrics: dict
    duration: float

def log(message: str):
    """
    Simple logger function.
    """
    print(message)

def train_fold(fold_name, X_train, y_train, X_test, y_test, C_cat, C_num,
               estimator, normalize, select, oversample, random_state):
    """
    Trains and evaluates a model on a single fold.

    Args:
        fold_name: Name of the fold.
        X_train, y_train: Training data and labels.
        X_test, y_test: Test data and labels.
        C_cat: List of categorical feature names.
        C_num: List of numerical feature names.
        estimator: Model to train.
        normalize: Whether to normalize numeric features.
        select: Feature selector(s).
        oversample: Whether to apply oversampling.
        random_state: Random seed.

    Returns:
        FoldResult object with metrics and duration.
    """
    try:
        start_time = time.time()

        # Normalize numeric features if requested
        if normalize:
            X_train_N = X_train[C_num].values
            X_test_N = X_test[C_num].values
            X_train_C = X_train[C_cat].values
            X_test_C = X_test[C_cat].values

            scaler = StandardScaler().fit(X_train_N)
            X_train_N = scaler.transform(X_train_N)
            X_test_N = scaler.transform(X_test_N)

            # Concatenate categorical and normalized numeric features
            X_train = pd.DataFrame(
                np.concatenate((X_train_C, X_train_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )
            X_test = pd.DataFrame(
                np.concatenate((X_test_C, X_test_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )

        # Feature selection if requested
        if select:
            if isinstance(select, SelectFromModel):
                select = [select]

            for s in select:
                support = s.fit(X_train.values, y_train).get_support()
                selected_cols = X_train.columns[support]
                C_cat = np.intersect1d(C_cat, selected_cols)
                C_num = np.intersect1d(C_num, selected_cols)

                X_train = X_train[selected_cols]
                X_test = X_test[selected_cols]

        # Oversampling if requested
        if oversample:
            if len(C_cat) > 0:
                sampler = SMOTENC(
                    categorical_features=[X_train.columns.get_loc(c) for c in C_cat],
                    random_state=random_state
                )
            else:
                sampler = SMOTE(random_state=random_state)
            X_train, y_train = sampler.fit_resample(X_train, y_train)

        # Train the estimator and compute AUC
        estimator = clone(estimator).fit(X_train, y_train)
        y_pred = estimator.predict_proba(X_test)[:, 1]
        auc_score = roc_auc_score(y_test, y_pred)

        return FoldResult(
            name=fold_name,
            metrics={'AUC': auc_score},
            duration=time.time() - start_time
        )

    except Exception:
        log(f'Error in {fold_name}: {traceback.format_exc()}')
        return None

def perform_cross_validation(X, y, groups, estimator, cats, normalize=False, select=None, oversample=False, random_state=None):
    """
    Performs cross-validation using StratifiedGroupKFold.

    Args:
        X: Feature DataFrame.
        y: Target array.
        groups: Group labels for the samples.
        estimator: Model to train.
        cats: List of categorical feature names.
        normalize: Whether to normalize numeric features.
        select: Feature selector(s).
        oversample: Whether to apply oversampling.
        random_state: Random seed.

    Returns:
        List of FoldResult objects.
    """
    results = []

    #We use StratifiedGroupKFold for validation
    splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

    for idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        # Separate categorical and numeric columns
        C_cat = np.asarray(sorted(cats))
        C_num = np.asarray([col for col in X.columns if col not in C_cat])

        result = train_fold(f"Fold_{idx}", X_train, y_train, X_test, y_test,
                            C_cat, C_num, estimator, normalize, select, oversample, random_state)
        if result:
            results.append(result)

    # Print AUC for each fold
    for res in results:
        log(f"{res.name} - AUC: {res.metrics['AUC']:.4f} | Duration: {res.duration:.2f}s")

    return results
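
In `train_fold`, the scaler is fitted on the training fold only and then applied to both folds, so that no statistics from the test fold leak into training. The sketch below reproduces that idea for a single numeric column with the standard library; it uses the population standard deviation, which matches the convention of `StandardScaler`, and the helper names and values are hypothetical.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn mean and population standard deviation from the training column."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against constant columns to avoid division by zero
    return mean, (std if std > 0 else 1.0)


def transform(column, mean, std):
    """Standardize a column with previously learned statistics."""
    return [(value - mean) / std for value in column]


train_col = [2.0, 4.0, 6.0]  # hypothetical training-fold values
test_col = [8.0]             # hypothetical test-fold value

mean, std = fit_scaler(train_col)          # statistics come from the training fold only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)
print([round(v, 3) for v in train_scaled])  # [-1.225, 0.0, 1.225]
print(round(test_scaled[0], 3))             # 2.449
```

The test value is scaled with the training mean and standard deviation, so it can fall well outside the range seen during training.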

#### Feature Combination Experiment

We define the feature combinations and collect them in a dictionary of DataFrames so that the experiment can loop over them.

In [None]:
#Reload the dataset
with open(PATH, mode='rb') as f:
    X_raw, y, groups, t, datetimes = pickle.load(f)

#Categorize and sort the features for easy access later on.
features = reorder_and_split_features(X_raw, y, groups, datetimes)


#Selecting combination of features set for running the experiment
feature_sets = {
    "baseline": features["feat_baseline"],

    # all features (every column of X_cleaned)
    "all_sensors": features["X_cleaned"],


    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_pif (11) + feat_sleep (2)
    "baseline+today+current_esm+pid+sleep" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_pif"])
        .join(features["feat_sleep"]),

    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_sleep (2)
    "baseline+today+current_esm+sleep" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_sleep"]),

    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_yesterday_sensor (2496) + feat_yesterday_ESM (6) + feat_ImmediatePast_ESM (0)
    "baseline+today+current_esm+yesterday+immediate_past_esm" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_yesterday_sensor"])
        .join(features["feat_yesterday_ESM"])
        .join(features["feat_ImmediatePast_ESM"]),

    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_yesterday_sensor (2496) + feat_yesterday_ESM (6) + feat_pif (11) + feat_sleep (2)
    "baseline+today+current_esm+yesterday+pid+sleep" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_yesterday_sensor"])
        .join(features["feat_yesterday_ESM"])
        .join(features["feat_pif"])
        .join(features["feat_sleep"]),

    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_yesterday_sensor (2496) + feat_yesterday_ESM (6) - feat_ImmediatePast_sensor (416, removed from baseline)
    "baseline+today+current_esm+yesterday-no_immediate_past" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_yesterday_sensor"])
        .join(features["feat_yesterday_ESM"])
        .drop(columns=features["feat_ImmediatePast_sensor"].columns),

    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1) + feat_yesterday_sensor (2496) + feat_yesterday_ESM (6)
    "baseline+today+current_esm+yesterday" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"])
        .join(features["feat_yesterday_sensor"])
        .join(features["feat_yesterday_ESM"]),


    # baseline + feat_today_sensor (2496) + feat_today_ESM (6) + feat_current_ESM (1)
    "baseline+today+current_esm" : features["feat_baseline"]
        .join(features["feat_today_sensor"])
        .join(features["feat_today_ESM"])
        .join(features["feat_current_ESM"]),

    # current+ImmediatePast
    "current+ImmediatePast": features["feat_current_sensor"]
        .join(features["feat_ImmediatePast_sensor"]),

    # current
    "current": features["feat_current"],

    # dsc
    "dsc": features["feat_dsc"],

    # sensor+time
    "sensor+time": features["feat_current_sensor"].join(features["feat_time"]),

}



#### Running the Experiment

In [None]:
for name, X_feat in feature_sets.items():
    print(f"\n--- {name} ---")

    # Use boolean columns as categorical features
    cat_cols = X_feat.columns[X_feat.dtypes == bool]

    # Running the Validation
    results = perform_cross_validation(
        X_feat, features["y"], features["groups"],
        estimator=estimator,
        cats=cat_cols,
        normalize=True,
        select=[SELECT_LASSO],
        oversample=True,
        random_state=42
    )

    auc_scores = [r.metrics["AUC"] for r in results]
    avg_auc = np.mean(auc_scores)

    print(f"Mean AUC: {avg_auc:.4f}")

    # Save results
    pd.DataFrame({
        "AUC Scores": auc_scores,
        "Mean AUC": [avg_auc] * len(auc_scores)
    }).to_csv(f"assignment1_{name}_cv_results.csv", index=False)


--- baseline ---
Fold_0 - AUC: 0.6111 | Duration: 6.07s
Fold_1 - AUC: 0.5195 | Duration: 4.46s
Fold_2 - AUC: 0.5945 | Duration: 4.92s
Fold_3 - AUC: 0.5064 | Duration: 5.43s
Fold_4 - AUC: 0.5734 | Duration: 4.20s
Mean AUC: 0.5610

--- all_sensors ---
Fold_0 - AUC: 0.5982 | Duration: 12.93s
Fold_1 - AUC: 0.5827 | Duration: 9.47s
Fold_2 - AUC: 0.5949 | Duration: 12.29s
Fold_3 - AUC: 0.5330 | Duration: 11.02s
Fold_4 - AUC: 0.5828 | Duration: 12.04s
Mean AUC: 0.5783

--- baseline+today+current_esm+pid+sleep ---
Fold_0 - AUC: 0.6194 | Duration: 14.46s
Fold_1 - AUC: 0.5768 | Duration: 11.30s
Fold_2 - AUC: 0.5850 | Duration: 14.30s
Fold_3 - AUC: 0.5355 | Duration: 10.81s
Fold_4 - AUC: 0.5917 | Duration: 13.36s
Mean AUC: 0.5817

--- baseline+today+current_esm+sleep ---
Fold_0 - AUC: 0.5810 | Duration: 15.67s
Fold_1 - AUC: 0.5318 | Duration: 10.95s
Fold_2 - AUC: 0.6411 | Duration: 14.41s
Fold_3 - AUC: 0.5230 | Duration: 12.16s
Fold_4 - AUC: 0.6184 | Duration: 12.61s
Mean AUC: 0.5790

--- baseli

## Assignment 1: Discussion

The feature combinations were initially selected based on intuition. However, we observed that the baseline feature set contained no ESM data at all. Therefore, as a first step, we tested feature sets that include the ESM data and observed a performance improvement.

One thing to note is that the sleep data and thermal data (dsc) did not have much impact on the results, either as standalone feature sets or when combined with other features.
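
To compare the feature sets side by side, the per-set CSV files written by the loop above can be collected into one ranked table. The following is a minimal sketch, assuming the files keep the `AUC Scores` column used above; the helper name `summarize_cv_results` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def summarize_cv_results(pattern: str) -> pd.DataFrame:
    """Collect per-feature-set CV result files into one table ranked by mean AUC."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path)
        rows.append({
            "file": os.path.basename(path),
            "mean_auc": df["AUC Scores"].mean(),   # average over folds
            "std_auc": df["AUC Scores"].std(),     # spread across folds
            "n_folds": len(df),
        })
    summary = pd.DataFrame(rows, columns=["file", "mean_auc", "std_auc", "n_folds"])
    return summary.sort_values("mean_auc", ascending=False).reset_index(drop=True)
```

Calling `summarize_cv_results("assignment1_*_cv_results.csv")` in the notebook's working directory would list the feature sets from best to worst mean AUC, together with the fold-to-fold spread.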

## Assignment 2. Please try different feature selection methods (20pts)

Hint: Currently, we are using a LASSO filter for feature selection. Please consider using an embedded method as well (the same model for both feature selection and model training). Besides, the threshold for the LASSO filter may also affect the performance. **Specifically, there is a method called 'mean', which uses the mean of the feature importances of all features as the threshold.** Please try both different feature selection methods and different thresholds for filtering features to improve model performance.

We adopt the following three feature selection methods:


*   LASSO (Linear Model): Used in the baseline; we select feature sets with AUC ≥ 0.579 from the baseline feature combination experiments.
*   SHAP (Model-Agnostic): All features were used to avoid pre-filtering; SHAP captures global and local importance.
*   Random Forest (Non-Linear Model): All features included; RF ranks features based on non-linear interactions.

These are not just three methods, but they represent three fundamentally different types of feature selection approaches, which we found interesting to compare.

## LASSO Selection Approach

We select feature sets with AUC ≥ 0.579 from the baseline feature combination experiments.



In [None]:
feature_sets = {
    "feat_baseline": features['feat_baseline'],

    "baseline+today+current_esm": features['feat_baseline']
        .join(features['feat_today_sensor'])
        .join(features['feat_today_ESM'])
        .join(features['feat_current_ESM']),

    "baseline+today+current_esm+yesterday": features['feat_baseline']
        .join(features['feat_today_sensor'])
        .join(features['feat_today_ESM'])
        .join(features['feat_current_ESM'])
        .join(features['feat_yesterday_sensor'])
        .join(features['feat_yesterday_ESM']),

    "baseline+today+current_esm+yesterday-no_immediate_past": features['feat_baseline']
        .join(features['feat_today_sensor'])
        .join(features['feat_today_ESM'])
        .join(features['feat_current_ESM'])
        .join(features['feat_yesterday_sensor'])
        .join(features['feat_yesterday_ESM'])
        .drop(columns=features['feat_ImmediatePast_sensor'].columns),

    "baseline+today+current_esm+yesterday+immediate_past_esm": features['feat_baseline']
        .join(features['feat_today_sensor'])
        .join(features['feat_today_ESM'])
        .join(features['feat_current_ESM'])
        .join(features['feat_yesterday_sensor'])
        .join(features['feat_yesterday_ESM'])
        .join(features['feat_ImmediatePast_ESM']),
}

We perform **LASSO-based feature selection** using different configurations of:

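- **C values**: `[0.1, 1.0, 10.0]` – the inverse of the regularization strength in scikit-learn's `LogisticRegression` (a smaller C means a stronger L1 penalty and a sparser model).
- **Thresholds**: `[0.001, 0.005, 'mean']` – the minimum importance (absolute coefficient) a feature needs in order to be kept; `'mean'` uses the mean importance over all features.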

We create a list of `SelectFromModel` selectors by pairing each C value with each threshold.  
These help us explore how different levels of sparsity impact model performance.

After selecting features, we use the consistent downstream model (`EvXGBClassifier`) to evaluate performance.

The aim is to identify which LASSO configuration best balances simplicity (fewer features) with predictive performance.
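
To make the role of the threshold concrete, the sketch below mimics the selection rule with plain NumPy: a feature is kept when its importance is at least the cutoff, and `'mean'` sets the cutoff to the mean importance. The `select_mask` helper and the coefficient values are made up for illustration; the real `SelectFromModel` derives the importances from the fitted estimator.

```python
import numpy as np


def select_mask(importances, threshold):
    """Return a boolean mask of features whose importance reaches the cutoff."""
    importances = np.asarray(importances, dtype=float)
    # 'mean' uses the average importance as the cutoff; otherwise use the number given
    cutoff = importances.mean() if threshold == "mean" else float(threshold)
    return importances >= cutoff


# Made-up absolute coefficients of an L1-penalized model (illustration only)
abs_coefs = np.array([0.0, 0.0005, 0.003, 0.02, 0.4])

for thr in [0.001, 0.005, "mean"]:
    kept = int(select_mask(abs_coefs, thr).sum())
    print(f"threshold={thr}: keeps {kept} of {abs_coefs.size} features")
```

With these toy values, a lower numeric threshold keeps more features, while `'mean'` is dominated by the single large coefficient and keeps only that one.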

In [None]:
# Trying different values of C, the inverse regularization strength
# (smaller C = stronger L1 penalty, so this list goes from strong to weak regularization)
C_values = [0.1, 1.0, 10.0]

# Trying different thresholds to control feature selection sparsity
thresholds = [0.001, 0.005, 'mean']

# Setting up multiple LASSO selectors with different C and threshold combinations
# This helps us tune how strict or loose feature selection should be
selectors = [
    SelectFromModel(
        estimator=LogisticRegression(
            penalty='l1',            # LASSO penalty to shrink irrelevant features
            solver='liblinear',      # Works with L1 penalty
            C=c,                     # Inverse regularization strength (smaller = stronger penalty)
            random_state=42,
            max_iter=4000            # Ensure the model converges
        ),
        threshold=thresh            # Controls which features are kept
    )
    for c in C_values
    for thresh in thresholds
]

# Classifier used after LASSO-based feature selection
# EvXGBClassifier = XGBoost-based model used for final predictions
estimator = EvXGBClassifier(
    random_state=42,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic',
    verbosity=0,
    learning_rate=0.01
)

In [None]:
for fs_name, X_feat in feature_sets.items():
    cat_cols = X_feat.columns[X_feat.dtypes == bool]

    for selector in selectors:
        sel_id = f"C={selector.estimator.C}_thresh={selector.threshold}"
        print(f"\n>>> Feature set: {fs_name} | Selector: {sel_id}")

        results = perform_cross_validation(
            X=X_feat,
            y=features["y"],
            groups=features["groups"],
            estimator=estimator,
            cats=cat_cols,
            normalize=True,
            select=[selector],
            oversample=True,
            random_state=42
        )

        auc_scores = [r.metrics["AUC"] for r in results]
        avg_auc = np.mean(auc_scores)

        print(f"AUC Scores: {auc_scores}")
        print(f"Mean AUC : {avg_auc:.4f}")

        pd.DataFrame({
            "AUC Scores": auc_scores,
            "Mean AUC": [avg_auc] * len(auc_scores)
        }).to_csv(f"assignment2_lasso{fs_name}__{sel_id}.csv", index=False)



>>> Feature set: feat_baseline | Selector: C=0.1_thresh=0.001
Fold_0 - AUC: 0.6071 | Duration: 1.11s
Fold_1 - AUC: 0.4882 | Duration: 1.00s
Fold_2 - AUC: 0.5008 | Duration: 1.11s
Fold_3 - AUC: 0.5548 | Duration: 1.05s
Fold_4 - AUC: 0.5771 | Duration: 1.05s
AUC Scores: [np.float64(0.6071306052438128), np.float64(0.4882139838183794), np.float64(0.5007865374062557), np.float64(0.5547628392016147), np.float64(0.5770954728821156)]
Mean AUC : 0.5456

>>> Feature set: feat_baseline | Selector: C=0.1_thresh=0.005
Fold_0 - AUC: 0.6100 | Duration: 1.18s
Fold_1 - AUC: 0.5241 | Duration: 1.02s
Fold_2 - AUC: 0.5561 | Duration: 0.89s
Fold_3 - AUC: 0.5611 | Duration: 1.07s
Fold_4 - AUC: 0.5740 | Duration: 1.02s
AUC Scores: [np.float64(0.6100465572163686), np.float64(0.5240671416495593), np.float64(0.556118529357966), np.float64(0.5610843238394259), np.float64(0.5740379074085932)]
Mean AUC : 0.5651

>>> Feature set: feat_baseline | Selector: C=0.1_thresh=mean
Fold_0 - AUC: 0.6016 | Duration: 1.06s
Fo

### Lasso: Discussion

The highest AUC (0.6113) was achieved when immediate past ESM features were excluded, suggesting potential overfitting from including temporally close data. In general, lower thresholds like 0.001, which let more weakly weighted features through the filter, tended to underperform, while the weakest regularization setting (C=10) increased computation time without consistent gains. This again underlines the importance of carefully balancing regularization strength and feature inclusion for optimal performance.

### SHAP-Based Feature Selection

We apply SHAP (SHapley Additive exPlanations) to understand feature importance and guide selection.

Two helper functions are used:

- `get_shap_feature_ranking(X, y, max_samples=1000, random_state=42)`  
  → Computes SHAP values for a trained model using a subset of samples for efficiency.  
  → Returns features ranked by their mean absolute SHAP value.

- `select_top_features_by_shap(X, y, top_n: int = 20)`  
  → Selects the top-N most influential features based on SHAP ranking.  
  → Returns a reduced feature matrix for model training or evaluation.

We use `EvXGBClassifier` as the model for SHAP-based exploration.

SHAP is model-agnostic and especially effective for interpreting tree-based models like XGBoost. It helps us select features that contribute most to model decisions.

In [None]:
import shap

def get_shap_feature_ranking(X, y, max_samples=1000, random_state=42):
    """
    Compute SHAP-based feature importance scores using a fitted XGBoost model.

    Parameters:
        X (pd.DataFrame): Input feature matrix.
        y (array-like): Target variable.
        max_samples (int): Maximum number of samples to use for computing SHAP values (for efficiency).
        random_state (int): Seed for reproducibility.

    Returns:
        pd.Series: Features ranked by mean absolute SHAP values (descending order).
    """
    model = EvXGBClassifier(eval_size=None, random_state=random_state)
    model.fit(X, y)

    explainer = shap.Explainer(model.model)
    sample_X = X.iloc[:max_samples]
    shap_values = explainer(sample_X).values
    mean_abs_shap = np.mean(np.abs(shap_values), axis=0)
    return pd.Series(mean_abs_shap, index=X.columns).sort_values(ascending=False)


def select_top_features_by_shap(X, y, top_n: int = 20) -> pd.DataFrame:
    """
    Select the top-N most important features based on SHAP values.

    Parameters:
        X (pd.DataFrame): Input feature matrix.
        y (array-like): Target variable.
        top_n (int): Number of top features to select based on SHAP importance.

    Returns:
        pd.DataFrame: Reduced feature matrix containing only the top-N SHAP-ranked features.
    """
    ranking = get_shap_feature_ranking(X, y)
    top_features = ranking.head(top_n).index
    return X[top_features]

All features were used to avoid pre-filtering, since SHAP captures both global and local importance. However, any other feature combination can be added to `feature_sets` for testing.

In [None]:
feature_sets = {
    # "baseline": features["feat_baseline"], #We can also consider the original baseline just for comparison
    "all_features": features["X_cleaned"],
}


# Define classifier (Baseline)
estimator = EvXGBClassifier(
    random_state=42,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic',
    verbosity=0,
    learning_rate=0.01
)

#### Data Leakage Caution

SHAP values are powerful for interpreting model predictions, but computing them on the entire dataset before splitting risks data leakage, where information from test samples inadvertently influences feature selection.

* **Cross-Validation Alignment**: We mimicked the exact splits used during evaluation by running `StratifiedGroupKFold` with `n_splits=5` and a fixed `random_state=42`, so that partitioning is consistent across both SHAP computation and final model evaluation.

* **Merged-Training SHAP Extraction**: For SHAP feature ranking, we merged the training indices from all five folds. This merged training set was used to fit a model and compute SHAP values.

* **Top-N Feature Selection**: From the SHAP values, we selected the top-N most important features (based on mean absolute SHAP value) and used these same features for all test folds.

* **Remaining Leakage Risk**: Every sample falls in the training portion of four of the five folds, so the merged training set covers the whole dataset and the ranking still sees the samples of each test fold. The AUCs reported for the selected features may therefore be optimistic; a strictly leakage-free evaluation would compute the ranking inside each fold, using only that fold's training split.

We evaluate SHAP-based feature selection by progressively keeping the top-K features from all 5,589.

In [None]:
TOP_N_list = [10, 20, 30, 40, 45, 50, 60, 65, 70, 80, 90, 100, 200, 300, 500]

#directory for saving values
result_dir = "assignment2_shap/shap"
os.makedirs(result_dir, exist_ok=True)

# Loop over each TOP_N value and feature set

for TOP_N in TOP_N_list:
    for fs_name, X_base in feature_sets.items():

        # Skip if not enough features
        if X_base.shape[1] < TOP_N:
            continue

        print(f"\n Feature Set: {fs_name}")

        # ----------------------------
        # Collect the training indices of every fold from StratifiedGroupKFold.
        # Note: each sample is in the training split of 4 of the 5 folds, so the
        # merged set covers the whole dataset (test folds are not held out here).
        # ----------------------------
        cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
        train_indices = []

        for train_idx, _ in cv.split(X_base, features["y"], features["groups"]):
            train_indices.extend(train_idx)

        # Remove duplicates from aggregated training indices
        unique_train_indices = sorted(set(train_indices))

        # Extract merged training data for SHAP feature selection
        X_train_merged = X_base.iloc[unique_train_indices]
        y_train_merged = features["y"][unique_train_indices]

        # Run SHAP feature selection on merged training data
        X_selected = select_top_features_by_shap(X_train_merged, y_train_merged, top_n=TOP_N)
        selected_features = X_selected.columns.tolist()
        selected_features.sort()

        print(f"Selected {len(selected_features)} features from merged training data.")
        print(f"SHAP feature selection for TOP_N = {TOP_N}")

        # Apply selected features to full dataset for evaluation
        X_shap = X_base[selected_features]

        # Identify categorical columns
        cat_cols = X_shap.columns[X_shap.dtypes == bool]

        # Run model evaluation using existing CV pipeline
        results = perform_cross_validation(
            X_shap, features["y"], features["groups"],
            estimator=estimator,
            cats=cat_cols,
            normalize=True,
            select=None,  # SHAP already selects
            oversample=True,
            random_state=42
        )

        # Save evaluation results
        auc_scores = [r.metrics["AUC"] for r in results]
        avg_auc = np.mean(auc_scores)

        print(f"\n AUC Scores: {auc_scores}")
        print(f" Mean AUC: {avg_auc:.4f}")

        pd.DataFrame({
            "AUC Scores": auc_scores,
            "Mean AUC": [avg_auc] * len(auc_scores)
        }).to_csv(os.path.join(result_dir, f"{fs_name}__top{TOP_N}.csv"), index=False)

        # Save selected feature names
        pd.Series(selected_features).to_csv(
            os.path.join(result_dir, f"{fs_name}__top{TOP_N}_features.csv"),
            index=False, header=False
        )


 Feature Set: all_features
Selected 10 features from merged training data.
SHAP feature selection for TOP_N = 10
Fold_0 - AUC: 0.6384 | Duration: 0.11s
Fold_1 - AUC: 0.6108 | Duration: 0.10s
Fold_2 - AUC: 0.5858 | Duration: 0.10s
Fold_3 - AUC: 0.6176 | Duration: 0.12s
Fold_4 - AUC: 0.6068 | Duration: 0.15s

 AUC Scores: [np.float64(0.6384097035040431), np.float64(0.6107595701002295), np.float64(0.5858423266873971), np.float64(0.6176342789863198), np.float64(0.6067906768265352)]
 Mean AUC: 0.6119

 Feature Set: all_features
Selected 20 features from merged training data.
SHAP feature selection for TOP_N = 20
Fold_0 - AUC: 0.6592 | Duration: 0.15s
Fold_1 - AUC: 0.6068 | Duration: 0.16s
Fold_2 - AUC: 0.6266 | Duration: 0.13s
Fold_3 - AUC: 0.6264 | Duration: 0.16s
Fold_4 - AUC: 0.6397 | Duration: 0.17s

 AUC Scores: [np.float64(0.6592011761823083), np.float64(0.6068469991546914), np.float64(0.6265502103530274), np.float64(0.6263596097779771), np.float64(0.6396555036178524)]
 Mean AUC: 0.6

### SHAP: Discussion

The SHAP-based feature selection showed a clear trend of improved performance with the inclusion of more top-ranked features, reaching optimal AUC scores in the range of 40 to 100 features. Performance peaked around 60–200 features, after which it slightly declined, suggesting possible overfitting or inclusion of less relevant features. This demonstrates the importance of selecting an appropriate subset rather than relying on all available features. We revisit this discussion in a later section.

SHAP’s model-agnostic interpretability and ability to capture non-linear relationships contributed significantly to identifying impactful predictors. SHAP values were computed on the merged training folds, as described in the Data Leakage Caution above.

### Random Forest-Based Feature Selection

In this section, we use a Random Forest classifier to rank features by their importance scores. Random Forest is a powerful tree-based ensemble method that can capture nonlinear relationships and interactions between features, suitable for importance estimation.

Feature importance is computed on the **merged training folds** derived from `StratifiedGroupKFold` with a fixed random seed, following the same procedure as the SHAP-based selection above.

We then select the top-N features based on the learned importances and evaluate model performance using cross-validation.


In [None]:
from sklearn.ensemble import RandomForestClassifier

def get_rf_feature_ranking(X, y, max_samples=1000, random_state=42):
    """
    Train a Random Forest classifier and rank features by their importance scores.

    Parameters:
        X (pd.DataFrame): Input feature matrix.
        y (array-like): Target labels.
        max_samples (int): Unused here but included for API consistency.
        random_state (int): Seed for reproducibility.

    Returns:
        pd.Series: Features ranked by importance (descending order) based on Random Forest.
    """
    rf = RandomForestClassifier(n_estimators=100, random_state=random_state, n_jobs=-1)
    rf.fit(X, y)
    importances = rf.feature_importances_
    return pd.Series(importances, index=X.columns).sort_values(ascending=False)


def select_top_features_by_rf(X, y, top_n: int = 20) -> pd.DataFrame:
    """
    Select the top-N most important features using Random Forest importance scores.

    Parameters:
        X (pd.DataFrame): Input feature matrix.
        y (array-like): Target labels.
        top_n (int): Number of top features to select.

    Returns:
        pd.DataFrame: Reduced feature matrix with top-N Random Forest-ranked features.
    """
    ranking = get_rf_feature_ranking(X, y)
    top_features = ranking.head(top_n).index
    return X[top_features]

In [None]:
TOP_N_list = [10, 20, 25, 30, 35, 40, 45, 50, 100, 200, 300]

#directory for saving values
result_dir = "assignment2_rf"
os.makedirs(result_dir, exist_ok=True)

# Loop over each TOP_N value and feature set

for TOP_N in TOP_N_list:
    for fs_name, X_base in feature_sets.items():

        # Skip if not enough features
        if X_base.shape[1] < TOP_N:
            continue

        print(f"\n Feature Set: {fs_name}")

        # ----------------------------
        # Collect the training indices of every fold from StratifiedGroupKFold.
        # Note: each sample is in the training split of 4 of the 5 folds, so the
        # merged set covers the whole dataset (test folds are not held out here).
        # ----------------------------
        cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
        train_indices = []

        for train_idx, _ in cv.split(X_base, features["y"], features["groups"]):
            train_indices.extend(train_idx)

        # Remove duplicates from aggregated training indices
        unique_train_indices = sorted(set(train_indices))

        # Extract merged training data for RF feature selection
        X_train_merged = X_base.iloc[unique_train_indices]
        y_train_merged = features["y"][unique_train_indices]

        # Run RF feature selection on merged training data
        X_selected = select_top_features_by_rf(X_train_merged, y_train_merged, top_n=TOP_N)
        selected_features = X_selected.columns.tolist()
        selected_features.sort()

        print(f"Selected {len(selected_features)} features from merged training data.")
        print(f"RF feature selection for TOP_N = {TOP_N}")

        # Apply selected features to full dataset for evaluation
        X_shap = X_base[selected_features]

        # Identify categorical columns
        cat_cols = X_shap.columns[X_shap.dtypes == bool]

        # Run model evaluation using existing CV pipeline
        results = perform_cross_validation(
            X_shap, features["y"], features["groups"],
            estimator=estimator,
            cats=cat_cols,
            normalize=True,
            select=None,  # RF ranking already selected the features
            oversample=True,
            random_state=42
        )

        # Save evaluation results
        auc_scores = [r.metrics["AUC"] for r in results]
        avg_auc = np.mean(auc_scores)

        print(f"\n AUC Scores: {auc_scores}")
        print(f" Mean AUC: {avg_auc:.4f}")

        pd.DataFrame({
            "AUC Scores": auc_scores,
            "Mean AUC": [avg_auc] * len(auc_scores)
        }).to_csv(os.path.join(result_dir, f"{fs_name}__top{TOP_N}.csv"), index=False)

        # Save selected feature names
        pd.Series(selected_features).to_csv(
            os.path.join(result_dir, f"{fs_name}__top{TOP_N}_features.csv"),
            index=False, header=False
        )


 Feature Set: all_features
Selected 10 features from merged training data.
RF feature selection for TOP_N = 10
Fold_0 - AUC: 0.6121 | Duration: 0.14s
Fold_1 - AUC: 0.5725 | Duration: 0.10s
Fold_2 - AUC: 0.6290 | Duration: 0.09s
Fold_3 - AUC: 0.5746 | Duration: 0.11s
Fold_4 - AUC: 0.6419 | Duration: 0.16s

 AUC Scores: [np.float64(0.6120803724577308), np.float64(0.5724550175099625), np.float64(0.6290012804097311), np.float64(0.5746103386409509), np.float64(0.6418966510853557)]
 Mean AUC: 0.6060

 Feature Set: all_features
Selected 20 features from merged training data.
RF feature selection for TOP_N = 20
Fold_0 - AUC: 0.6382 | Duration: 0.13s
Fold_1 - AUC: 0.6027 | Duration: 0.12s
Fold_2 - AUC: 0.6024 | Duration: 0.12s
Fold_3 - AUC: 0.5776 | Duration: 0.11s
Fold_4 - AUC: 0.6236 | Duration: 0.12s

 AUC Scores: [np.float64(0.6381524136241118), np.float64(0.6026989494022461), np.float64(0.6023779037863545), np.float64(0.5776098901098901), np.float64(0.6236232951271051)]
 Mean AUC: 0.6089


### Random Forest Feature Selection: Discussion

Random Forest-based feature selection was applied to the full feature set, and performance was evaluated across different top-N feature thresholds. The highest AUC (0.6261) was observed with the top 30 features, but overall, performance remained fairly stable around the 0.60–0.61 range.

Unlike SHAP, RF did not show consistent improvement with more features, possibly due to noise accumulation. As with SHAP, feature importance was computed on the merged training folds.
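
One way to see how differently the two rankings behave is to measure the overlap between the feature lists that the loops above saved. The sketch below is an illustration only: `load_feature_list` and `jaccard` are hypothetical helper names, and it assumes the one-column, header-less CSV layout written with `index=False, header=False`.

```python
import pandas as pd


def load_feature_list(path: str) -> list:
    """Read a one-column CSV of feature names that was saved without a header."""
    return pd.read_csv(path, header=None)[0].astype(str).tolist()


def jaccard(a, b) -> float:
    """Jaccard similarity between two collections of feature names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)
```

For example, `jaccard(load_feature_list("assignment2_shap/shap/all_features__top30_features.csv"), load_feature_list("assignment2_rf/all_features__top30_features.csv"))` would give the share of features that the SHAP and RF top-30 lists have in common, assuming both files exist under those paths.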


## Assignment 3. Please try using hyperopt for model hyperparameter tuning (20 pts)

Hint: Please be aware that for revised xgboost classifier EvXGBClassifier, there exist other parameters other than default XGBClassifier parameters such as eval_size.

For hyperparameter tuning, we will use 20% of training set as validation set to avoid data leakage.

If it is too time-consuming to run the code in Colab, please run the code locally and consider using [ray tune](https://docs.ray.io/en/latest/tune/index.html) if needed.
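
Since samples from the same participant are correlated, one option for the 20% validation set is to hold out 20% of the participants rather than 20% of the rows. The sketch below shows this with scikit-learn's `GroupShuffleSplit`; the helper name `group_validation_split` and the synthetic `groups` array are assumptions for illustration, not part of the provided pipeline.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def group_validation_split(groups, val_size=0.2, random_state=42):
    """Return (train_idx, val_idx) such that no group appears on both sides."""
    groups = np.asarray(groups)
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=random_state)
    dummy_X = np.zeros((len(groups), 1))  # only the number of rows matters for splitting
    train_idx, val_idx = next(splitter.split(dummy_X, groups=groups))
    return train_idx, val_idx


# Synthetic example: 10 users with 5 samples each
groups = np.repeat(np.arange(10), 5)
train_idx, val_idx = group_validation_split(groups)
print(f"train rows: {len(train_idx)}, validation rows: {len(val_idx)}")
```

Inside a tuning objective, the returned indices would be applied to the training fold only, so the validation score reflects users the model has not seen.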

In [None]:
feature_sets = {
    "all_features": features["X_cleaned"],
}

In [None]:
from hyperopt import STATUS_OK, Trials, hp, fmin, tpe, base

# Load the SHAP top-60 feature names saved in Assignment 2 and select them from X_cleaned
# (the file was written without a header row, so header=None keeps the first feature name)
shap_features = pd.read_csv("/content/assignment2_shap/all_features__top60_features.csv", header=None, index_col=0)


X_selected = features["X_cleaned"].loc[:, shap_features.index].copy()



# define your outer CV
OUTER_CV = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

def objective(params):
    val_scores = []

    # outer loop: split into train_full / test (we will only use train_full for tuning)
    for train_full_idx, _ in OUTER_CV.split(X_selected, y, groups):
        X_train_full = X_selected.iloc[train_full_idx]
        y_train_full = y[train_full_idx]

        # split 20% of the *training fold* into a validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full,
            test_size=0.20,
            stratify=y_train_full,
            random_state=42
        )

        # Normalize
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled   = scaler.transform(X_val)

        # (Optional) Oversample on *training only*
        if np.any(X_train_scaled[:, -1] < 1):
            smote = SMOTENC(
                categorical_features=[X_train_scaled.shape[1]-1],
                random_state=int(params['random_state'])
            )
        else:
            smote = SMOTE(random_state=int(params['random_state']))
        X_train_os, y_train_os = smote.fit_resample(X_train_scaled, y_train)

        # Feature selection disabled
        X_train_sel = X_train_os
        X_val_sel   = X_val_scaled

        # Train & score on *validation only*
        clf = EvXGBClassifier(
          random_state=int(params['random_state']),
          eval_metric='logloss',
          max_depth=params['max_depth'],
          learning_rate=params['learning_rate'],
          min_child_weight=params['min_child_weight'],
          subsample=params['subsample'],
          colsample_bytree=params['colsample_bytree'],
          gamma=params['gamma'],
          n_estimators=int(params['n_estimators']),
          reg_alpha=params['reg_alpha'],
          reg_lambda=params['reg_lambda'],
        )

        clf.fit(X_train_sel, y_train_os)


        y_val_prob = clf.predict_proba(X_val_sel)[:, 1]
        val_scores.append(roc_auc_score(y_val, y_val_prob))

    # Hyperopt minimizes “loss”, so negate AUC
    return {'loss': -np.mean(val_scores), 'status': STATUS_OK}


# Define hyperparameter space
# define your search space (fill in any missing parameters e.g. max_depth)
space = {
    'max_depth': hp.choice('max_depth', list(range(3, 10))),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.uniform('subsample', 0.6, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
    'gamma': hp.uniform('gamma', 0, 5),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.3)),
    'n_estimators': hp.quniform('n_estimators', 50, 300, 10),
    'reg_alpha': hp.uniform('reg_alpha', 0, 1),
    'reg_lambda': hp.uniform('reg_lambda', 0, 1),
    'random_state': 42
}

# run hyperopt
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials
)

#save the parameters in a file
best_params = {k: v for k, v in best.items() if k != 'random_state'}
best_params['max_depth'] = [3, 4, 5, 6, 7, 8, 9][best_params['max_depth']]
best_params['n_estimators'] = int(best_params['n_estimators'])
best_params['min_child_weight'] = int(best_params['min_child_weight'])
best_params['random_state'] = 42
result_dir = os.path.join("assignment3", "hyperopt")
os.makedirs(result_dir, exist_ok=True)
pd.DataFrame([best_params]).to_csv(os.path.join(result_dir, "best_hyperparameters.csv"), index=False)

100%|██████████| 100/100 [04:10<00:00,  2.50s/trial, best loss: -0.7371891925366012]


In [None]:
# Load the tuned parameters saved by the hyperopt cell
param_df = pd.read_csv('/content/assignment3/hyperopt/best_hyperparameters.csv')

# Print the parameters as a list of records
print(param_df.to_dict(orient='records'))

# Take the single row so that each parameter is a scalar rather than a pandas Series
param = param_df.iloc[0]

cat_cols = X_selected.columns[X_selected.dtypes == bool]

# Hyperparameter-tuned model
base_estimator = EvXGBClassifier(
    random_state=42,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic',
    verbosity=0,
    learning_rate=float(param['learning_rate']),
    colsample_bytree=float(param['colsample_bytree']),
    gamma=float(param['gamma']),
    max_depth=int(param['max_depth']),
    min_child_weight=int(param['min_child_weight']),
    n_estimators=int(param['n_estimators']),
    reg_alpha=float(param['reg_alpha']),
    reg_lambda=float(param['reg_lambda']),
    subsample=float(param['subsample'])
)

#Perform cross validation including model training and evaluation
results = perform_cross_validation(
        X, features["y"], features["groups"],
        estimator=base_estimator,
        cats=cat_cols,
        normalize=True,
        select=None,
        oversample=True,
        random_state=42
    )
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

[{'colsample_bytree': 0.8463935920953023, 'gamma': 0.2144097629496495, 'learning_rate': 0.0736917612203914, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 280, 'reg_alpha': 0.4775914535854724, 'reg_lambda': 0.6862888539325457, 'subsample': 0.7290628198195405, 'random_state': 42}]
Fold_0 - AUC: 0.5927 | Duration: 0.88s
Fold_1 - AUC: 0.5278 | Duration: 0.78s
Fold_2 - AUC: 0.5604 | Duration: 0.72s
Fold_3 - AUC: 0.5458 | Duration: 1.21s
Fold_4 - AUC: 0.6047 | Duration: 0.79s
0.5663015221318979


## Assignment 4. Please consider replacing the previous traditional machine learning model with deep learning models designed for **tabular data** to improve model performance. (20 pts)

Hint: Since features are already extracted manually, it is impossible to use end-to-end deep learning models. Instead, try replacing xgboost with deep learning models designed for **tabular data** and see if there is performance improvement.

You may need to change runtime to TPU first to use torch or other packages you may want to use.




Please compare it with your previous XGBoost model performance and think about why it is higher or lower than XGBoost.

### TabNet
We replace the previously used XGBoost model with a deep learning model specifically designed for tabular data: TabNet. As our features are already manually extracted, end-to-end deep learning pipelines are not applicable. Instead, TabNet is employed as a drop-in replacement for XGBoost to assess if a deep learning-based architecture can improve classification performance on manually engineered tabular features.

In [None]:
#######Your code for deep learning model#########
!pip install pytorch-tabnet


We first convert the features into the numeric array format that TabNet expects. We try three feature sets:

*   SHAP Top 60: the 60 features selected by SHAP values (EvXGBoost); this set had the best pre-tuning AUC.
*   All Features: based on the assumption that deep learning handles feature selection implicitly via regularization (neuron dropout/pruning).
*   Baseline Features



In [None]:
#Feature selection
from sklearn.preprocessing import StandardScaler

X_deep = X_selected.copy(deep=True) #Shap Top 60 features
# X_deep = X_cleaned.copy(deep=True)
# X_deep = X.copy(deep=True) #The full feature set
y_deep = y

# Boolean columns are treated as categorical; everything else is numeric
cat_cols = list(X_deep.columns[X_deep.dtypes == bool])
num_cols = list(X_deep.columns[~X_deep.columns.isin(cat_cols)])

# Standardize numeric columns to zero mean and unit variance
scaler = StandardScaler()
X_deep[num_cols] = scaler.fit_transform(X_deep[num_cols])

# Convert categorical booleans to float
X_deep[cat_cols] = X_deep[cat_cols].astype(float)

# Convert to NumPy arrays so that the fold indices below are positional
X_np = X_deep.to_numpy()
y_np = np.asarray(y_deep)
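
One point worth keeping in mind about the scaling step: when the scaler is fit on all rows before cross-validation, each test fold contributes to the mean and variance used to scale its own features. A stricter variant computes the scaling statistics from the training rows of each fold only. The sketch below illustrates that idea with the standard library on made-up numbers; `fit_standardizer` and `apply_standardizer` are hypothetical helpers that are not part of this notebook. In practice, calling `StandardScaler.fit` on `X_train` and `transform` on both splits achieves the same thing.

```python
import statistics


def fit_standardizer(rows):
    """Return per-column (means, stds) computed from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population std (ddof=0), as StandardScaler uses; fall back to 1.0 for constant columns.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_standardizer(rows, means, stds):
    """Scale rows with statistics that were fit elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy data (assumption): two features, four training rows, one held-out row.
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
test_rows = [[5.0, 50.0]]

means, stds = fit_standardizer(train_rows)                 # statistics from train only
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)   # reuse the train statistics
```

The held-out row is scaled with the training statistics, so it can legitimately fall outside the range seen during training.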

### Running the model for training

In [None]:
from pytorch_tabnet.tab_model import TabNetClassifier

# Initialize a cross-validation strategy that maintains class distribution and group integrity
crossvalidation = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

# List to store evaluation results from each fold
fold_results = []

# Loop over each fold provided by the cross-validation split
for fold, (train_idx, test_idx) in enumerate(crossvalidation.split(X_np, y_np, groups)):
    print(f"\n Fold {fold + 1}")

    # Split the dataset into training and testing sets based on current fold indices
    X_train, X_test = X_np[train_idx], X_np[test_idx]
    y_train, y_test = y_np[train_idx], y_np[test_idx]

    # Initialize the TabNetClassifier with a small, hand-picked configuration (not the library defaults)
    model = TabNetClassifier(
        device_name='cuda',
        n_d=4,
        n_a=4,
        n_steps=5,
        gamma=1.5,
        lambda_sparse=1e-4,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        scheduler_params=dict(gamma=0.95, step_size=20),
        mask_type='sparsemax',
        verbose=0,
        seed=42
    )

    # Train the model with early stopping and evaluation metrics
    model.fit(
        X_train=X_train, y_train=y_train,
        eval_set=[(X_test, y_test)],
        eval_name=["val"],
        eval_metric=["auc"],
        max_epochs=200,
        patience=20,
        batch_size=1024,
        virtual_batch_size=128,
    )

    # Generate predictions and predicted probabilities for the test set
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics: Accuracy, F1 Score, and AUC
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)

    # Display fold-specific evaluation metrics
    print(f"Accuracy: {acc:.4f} | F1: {f1:.4f} | AUC: {auc:.4f}")

    # Store the metrics for this fold
    fold_results.append(dict(Accuracy=acc, F1=f1, AUC=auc))

# Create a DataFrame from the results and display the average performance across all folds
df = pd.DataFrame(fold_results)

save_dir = './assignment4/tabnet'
os.makedirs(save_dir, exist_ok=True)
df.to_csv(os.path.join(save_dir, 'tabnet_results.csv'), index=False)


print("\n Final Mean Scores:")
print(df.mean())


 Fold 1

Early stopping occurred at epoch 55 with best_epoch = 35 and best_val_auc = 0.70635
Accuracy: 0.7136 | F1: 0.5349 | AUC: 0.7063

 Fold 2





Early stopping occurred at epoch 75 with best_epoch = 55 and best_val_auc = 0.6087
Accuracy: 0.5763 | F1: 0.1570 | AUC: 0.6087

 Fold 3





Early stopping occurred at epoch 61 with best_epoch = 41 and best_val_auc = 0.67783
Accuracy: 0.7078 | F1: 0.3889 | AUC: 0.6778

 Fold 4





Early stopping occurred at epoch 63 with best_epoch = 43 and best_val_auc = 0.63436
Accuracy: 0.6533 | F1: 0.3475 | AUC: 0.6344

 Fold 5





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_auc = 0.66853
Accuracy: 0.6415 | F1: 0.2629 | AUC: 0.6685

 Final Mean Scores:
Accuracy    0.658499
F1          0.338252
AUC         0.659155
dtype: float64




In [None]:
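# Evaluate the tuned hyperparameters

# Load the hyperparameters saved by the Hyperopt search (single-row CSV);
# taking the first row gives scalar values and avoids Series-to-scalar warnings
param = pd.read_csv('/content/assignment4/hyperopt/best_hyperparameters.csv').iloc[0]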

# Initialize a cross-validation strategy that maintains class distribution and group integrity
crossvalidation = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

# List to store evaluation results from each fold
fold_results = []

# Loop over each fold provided by the cross-validation split
for fold, (train_idx, test_idx) in enumerate(crossvalidation.split(X_np, y_np, groups)):
    print(f"\n Fold {fold + 1}")

    # Split the dataset into training and testing sets based on current fold indices
    X_train, X_test = X_np[train_idx], X_np[test_idx]
    y_train, y_test = y_np[train_idx], y_np[test_idx]

    # Initialize the TabNetClassifier with the tuned hyperparameters
    model = TabNetClassifier(
        device_name='cuda',
        n_d=int(param['n_d']),
        n_a=int(param['n_a']),
        n_steps=int(param['n_steps']),
        gamma=float(param['gamma']),
        lambda_sparse=float(param['lambda_sparse']),
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=float(param['lr'])),
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        scheduler_params=dict(gamma=0.95, step_size=20),
        mask_type='sparsemax',
        verbose=0,
        seed=int(param['random_state'])
    )


    # Train the model with early stopping and evaluation metrics
    model.fit(
        X_train=X_train, y_train=y_train,
        eval_set=[(X_test, y_test)],
        eval_name=["val"],
        eval_metric=["auc"],
        max_epochs=200,
        patience=20,
        batch_size=1024,
        virtual_batch_size=128,
    )

    # Generate predictions and predicted probabilities for the test set
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    # Calculate evaluation metrics: Accuracy, F1 Score, and AUC
    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)

    # Display fold-specific evaluation metrics
    print(f"Accuracy: {acc:.4f} | F1: {f1:.4f} | AUC: {auc:.4f}")

    # Store the metrics for this fold
    fold_results.append(dict(Accuracy=acc, F1=f1, AUC=auc))

# Create a DataFrame from the results and display the average performance across all folds
df = pd.DataFrame(fold_results)

save_dir = './assignment4/tabnet'
os.makedirs(save_dir, exist_ok=True)
# Use a separate file name so the untuned results saved earlier are not overwritten
df.to_csv(os.path.join(save_dir, 'tabnet_tuned_results.csv'), index=False)


print("\n Final Mean Scores:")
print(df.mean())


 Fold 1





Early stopping occurred at epoch 33 with best_epoch = 13 and best_val_auc = 0.62079
Accuracy: 0.6468 | F1: 0.4436 | AUC: 0.6208

 Fold 2





Early stopping occurred at epoch 65 with best_epoch = 45 and best_val_auc = 0.58334
Accuracy: 0.5849 | F1: 0.0242 | AUC: 0.5833

 Fold 3





Early stopping occurred at epoch 48 with best_epoch = 28 and best_val_auc = 0.58251
Accuracy: 0.7154 | F1: 0.0625 | AUC: 0.5825

 Fold 4





Early stopping occurred at epoch 43 with best_epoch = 23 and best_val_auc = 0.5665
Accuracy: 0.6202 | F1: 0.3550 | AUC: 0.5665

 Fold 5





Early stopping occurred at epoch 30 with best_epoch = 10 and best_val_auc = 0.54518
Accuracy: 0.5407 | F1: 0.4205 | AUC: 0.5452

 Final Mean Scores:
Accuracy    0.621592
F1          0.261174
AUC         0.579665
dtype: float64




## Assignment 5. Please try combining all the above methods to push the model performance. (20 pts)

## Ensemble: EvXGBoost X TabNet

The top 60 most important features were selected based on SHAP values from Assignment 2. These features capture the most influential patterns identified by prior model explanations.

**Modeling Approach:**
A soft voting ensemble is used to combine predictions from two different classifiers:

* XGBoost (with hyperparameters optimized via cross-validation)

* TabNet (also tuned using Hyperopt for best performance)

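Instead of combining the final class predictions, we take a weighted average of the predicted positive-class probabilities:
  
  $$
  p_{\text{final}} = w_1 \cdot p_{\text{xgb}} + w_2 \cdot p_{\text{tabnet}}, \qquad w_1 + w_2 = 1
  $$

Here $p_{\text{xgb}}$ and $p_{\text{tabnet}}$ correspond to `xgb_proba` and `tabnet_proba` in the code, and $p_{\text{final}}$ to `final_proba`.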

Why Soft Voting?

Soft voting averages the predicted probabilities from each model rather than their final class labels. This approach:

* Takes into account model confidence

* Often improves overall predictive performance

* Reduces variance by balancing the strengths of different model types

This ensemble leverages both tree-based and deep learning-based decision processes to create a more robust classifier.
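
To make the difference between soft and hard voting concrete, here is a minimal sketch on made-up probabilities, using only the standard library. The helpers `soft_vote` and `hard_vote` are hypothetical names introduced for illustration; the 0.55/0.45 weights mirror the ones used in the ensemble cell, and the numbers are not results from the dataset.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model positive-class probabilities."""
    total = sum(weights)
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


def hard_vote(probas, threshold=0.5):
    """Majority vote over thresholded labels; ties count as the negative class."""
    n_models = len(probas)
    n_samples = len(probas[0])
    votes = [sum(p[i] >= threshold for p in probas) for i in range(n_samples)]
    return [int(v * 2 > n_models) for v in votes]


# Made-up probabilities for three samples from two models (assumption).
xgb_proba = [0.90, 0.45, 0.20]
tabnet_proba = [0.40, 0.60, 0.10]

final_proba = soft_vote([xgb_proba, tabnet_proba], weights=[0.55, 0.45])
final_preds = [int(p >= 0.5) for p in final_proba]
hard_preds = hard_vote([xgb_proba, tabnet_proba])
```

With only two models, hard voting ties whenever the models disagree, whereas soft voting lets the more confident model decide. This is one reason a probability average is the natural choice for a two-model ensemble.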

### Assignment 5 – Final Ensemble Model

In the final stage, we employed a **soft voting ensemble** combining predictions from two models: `EvXGBoost` and `TabNet`.

- We evaluated several weight combinations. The best performance was achieved with:
  - **XGBoost Weight:** 0.55
  - **TabNet Weight:** 0.45
  - **Resulting mean AUC:** **0.682**

This ensemble approach improved performance over the individual models, which suggests a benefit from combining complementary learners.


In [None]:
# Cross-validation
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

results = []

for fold, (train_idx, test_idx) in enumerate(cv.split(X_np, y_np, groups)):
    print(f"\nFold {fold + 1}")
    X_train, X_test = X_np[train_idx], X_np[test_idx]
    y_train, y_test = y_np[train_idx], y_np[test_idx]

    # XGBoost model
    xgb_model = EvXGBClassifier(
      random_state=42,
      eval_metric='logloss',
      eval_size=0.2,
      early_stopping_rounds=10,
      objective='binary:logistic',
      verbosity=0,
      learning_rate=0.01
    )
    xgb_model.fit(X_train, y_train)
    xgb_proba = xgb_model.predict_proba(X_test)[:, 1]

    # TabNet model
    tabnet_model = TabNetClassifier(
        device_name='cuda',
        n_d=64,
        n_a=64,
        n_steps=5,
        gamma=1.5,
        lambda_sparse=1e-4,
        optimizer_fn=torch.optim.Adam,
        optimizer_params=dict(lr=2e-2),
        scheduler_fn=torch.optim.lr_scheduler.StepLR,
        scheduler_params=dict(gamma=0.95, step_size=20),
        mask_type='sparsemax',
        verbose=0,
        seed=42
    )
    tabnet_model.fit(
        X_train=X_train, y_train=y_train,
        eval_set=[(X_test, y_test)],
        eval_name=["val"],
        eval_metric=["auc"],
        max_epochs=200,
        patience=20,
        batch_size=512,
        virtual_batch_size=128,
    )
    tabnet_proba = tabnet_model.predict_proba(X_test)[:, 1]

    # Soft voting ensemble
    final_proba = 0.55 * xgb_proba + 0.45 * tabnet_proba  # Ensemble weights; adjust to try other combinations
    final_preds = (final_proba >= 0.5).astype(int)

    acc = accuracy_score(y_test, final_preds)
    f1 = f1_score(y_test, final_preds)
    auc = roc_auc_score(y_test, final_proba)

    print(f"Accuracy: {acc:.4f} | F1: {f1:.4f} | AUC: {auc:.4f}")
    results.append(dict(Fold=fold + 1, Accuracy=acc, F1=f1, AUC=auc))

# Summary
df = pd.DataFrame(results)
print("\nFinal Mean Scores:")
print(df.mean())



Fold 1

Early stopping occurred at epoch 42 with best_epoch = 22 and best_val_auc = 0.7051
Accuracy: 0.6516 | F1: 0.1412 | AUC: 0.7424

Fold 2





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_auc = 0.63382
Accuracy: 0.6244 | F1: 0.3540 | AUC: 0.6594

Fold 3





Early stopping occurred at epoch 58 with best_epoch = 38 and best_val_auc = 0.65334
Accuracy: 0.7249 | F1: 0.2564 | AUC: 0.6641

Fold 4





Early stopping occurred at epoch 23 with best_epoch = 3 and best_val_auc = 0.60205
Accuracy: 0.6882 | F1: 0.3142 | AUC: 0.6529

Fold 5





Early stopping occurred at epoch 43 with best_epoch = 23 and best_val_auc = 0.68313
Accuracy: 0.6628 | F1: 0.3507 | AUC: 0.6918

Final Mean Scores:
Fold        3.000000
Accuracy    0.670342
F1          0.283298
AUC         0.682125
dtype: float64




## Discussion

By averaging the predicted probabilities of EvXGBoost and TabNet (rather than using hard votes), we allowed the ensemble to capture richer confidence information from both models. We experimented with different weight combinations and found that assigning **0.55 to XGBoost and 0.45 to TabNet** yielded the best performance, achieving a **mean AUC of 0.682**.

This suggests that ensemble learning, even with only two models, can improve robustness and generalization by leveraging diverse modeling perspectives.
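
The weight search described above can be sketched as a small grid search over the blend weight. The example below is a standard-library illustration on made-up out-of-fold probabilities, not the procedure actually run for this notebook; `roc_auc`, `best_weight`, `xgb_oof`, and `tabnet_oof` are hypothetical names.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_weight(y_true, proba_a, proba_b, candidates):
    """Return (weight_for_a, auc) maximizing the AUC of w * a + (1 - w) * b."""
    scored = []
    for w in candidates:
        blend = [w * a + (1 - w) * b for a, b in zip(proba_a, proba_b)]
        scored.append((w, roc_auc(y_true, blend)))
    return max(scored, key=lambda pair: pair[1])


# Made-up out-of-fold labels and probabilities (assumption, not notebook results).
y_true = [1, 0, 1, 0, 1, 0]
xgb_oof = [0.80, 0.30, 0.55, 0.60, 0.40, 0.20]
tabnet_oof = [0.60, 0.50, 0.30, 0.20, 0.70, 0.40]

candidates = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
w_xgb, auc = best_weight(y_true, xgb_oof, tabnet_oof, candidates)
```

Because the grid includes 0 and 1, the selected blend can never score below either single model on the data used for the search. If the weight is picked on the same folds that are reported, the gain is partly a selection effect, so a separate validation split gives a fairer estimate.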

## Appendix

#### Hyperparameter Tuning for TabNet
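
The search space in the cell below puts log-uniform priors on `lr` and `lambda_sparse`. According to the hyperopt documentation, `hp.loguniform(label, low, high)` returns `exp(uniform(low, high))`, so a sampled value is already on the original scale and needs no back-transformation. The sketch below reproduces that sampling rule with the standard library only; `sample_loguniform` is a hypothetical stand-in for illustration, not hyperopt's implementation.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw exp(u) with u ~ Uniform(low, high); the result lies in [exp(low), exp(high)]."""
    return math.exp(rng.uniform(low, high))


rng = random.Random(42)

# Same bounds as the `lr` entry of the search space: log(1e-3) to log(5e-2).
lr_samples = [
    sample_loguniform(rng, math.log(1e-3), math.log(5e-2)) for _ in range(1000)
]

# Every draw is directly usable as a learning rate.
lr_min, lr_max = min(lr_samples), max(lr_samples)

# Exponentiating again (for example 10 ** lr) would map every draw to roughly
# 1.00-1.12, which is far too large for an Adam learning rate.
double_transformed = [10 ** lr for lr in lr_samples]
```

For `hp.choice` entries the situation is different: `fmin` reports the index of the chosen option, which is why those entries need to be mapped back to the option list.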

In [None]:

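from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import StratifiedGroupKFold, train_test_split

# Outer CV setup
OUTER_CV = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)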

# Define objective function for Hyperopt
def objective(params):
    val_scores = []

    for train_full_idx, _ in OUTER_CV.split(X_np, y_np, groups):
        X_train_full = X_np[train_full_idx]
        y_train_full = y_np[train_full_idx]

        # Inner split: train/val from training fold only
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full,
            test_size=0.20,
            stratify=y_train_full,
            random_state=42
        )

        # Initialize TabNet model with hyperparameters
        model = TabNetClassifier(
            n_d=int(params["n_d"]),
            n_a=int(params["n_a"]),
            n_steps=int(params["n_steps"]),
            gamma=params["gamma"],
            lambda_sparse=params["lambda_sparse"],
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=params["lr"]),
            scheduler_fn=torch.optim.lr_scheduler.StepLR,
            scheduler_params={"step_size": 10, "gamma": 0.9},
            mask_type="entmax",
            verbose=0,
            device_name="cuda"
        )

        # Train on training data, validate on val split
        model.fit(
            X_train=X_train,
            y_train=y_train,
            eval_set=[(X_val, y_val)],
            eval_metric=["auc"],
            max_epochs=200,
            patience=20,
            batch_size=512,
            virtual_batch_size=128
        )

        # Evaluate on validation set
        y_val_pred = model.predict_proba(X_val)[:, 1]
        val_auc = roc_auc_score(y_val, y_val_pred)
        val_scores.append(val_auc)

    return {'loss': -np.mean(val_scores), 'status': STATUS_OK}


# Define search space
space = {
    "n_d": hp.choice("n_d", [32, 64, 128]),
    "n_a": hp.choice("n_a", [32, 64, 128]),
    "n_steps": hp.choice("n_steps", [3, 5, 7]),
    "gamma": hp.uniform("gamma", 1.0, 2.0),
    "lambda_sparse": hp.loguniform("lambda_sparse", np.log(1e-6), np.log(1e-2)),
    "lr": hp.loguniform("lr", np.log(1e-3), np.log(5e-2)),
}

# Run hyperparameter tuning with Hyperopt
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=30,
    trials=trials
)

# Map the raw fmin output back to actual hyperparameter values.
# fmin returns list indices for hp.choice entries, while hp.loguniform
# entries are already on the original scale, so no 10 ** x step is applied.
from hyperopt import space_eval

best_params = dict(space_eval(space, best))
best_params["random_state"] = 42

#saving parameters
result_dir = os.path.join("assignment4", "hyperopt")
os.makedirs(result_dir, exist_ok=True)
pd.DataFrame([best_params]).to_csv(os.path.join(result_dir, "best_hyperparameters.csv"), index=False)
print(list(best_params.values()))


Early stopping occurred at epoch 35 with best_epoch = 15 and best_val_0_auc = 0.65599
  0%|          | 0/30 [00:04<?, ?trial/s, best loss=?]





Early stopping occurred at epoch 29 with best_epoch = 9 and best_val_0_auc = 0.64368
  0%|          | 0/30 [00:07<?, ?trial/s, best loss=?]





Early stopping occurred at epoch 44 with best_epoch = 24 and best_val_0_auc = 0.63917
  0%|          | 0/30 [00:12<?, ?trial/s, best loss=?]





Early stopping occurred at epoch 63 with best_epoch = 43 and best_val_0_auc = 0.70551
  0%|          | 0/30 [00:20<?, ?trial/s, best loss=?]





Early stopping occurred at epoch 52 with best_epoch = 32 and best_val_0_auc = 0.65807
  3%|▎         | 1/30 [00:26<12:39, 26.20s/trial, best loss: -0.6604850658656904]





Early stopping occurred at epoch 27 with best_epoch = 7 and best_val_0_auc = 0.61946
  3%|▎         | 1/30 [00:29<12:39, 26.20s/trial, best loss: -0.6604850658656904]





Early stopping occurred at epoch 67 with best_epoch = 47 and best_val_0_auc = 0.66357
  3%|▎         | 1/30 [00:37<12:39, 26.20s/trial, best loss: -0.6604850658656904]





Early stopping occurred at epoch 34 with best_epoch = 14 and best_val_0_auc = 0.66796
  3%|▎         | 1/30 [00:41<12:39, 26.20s/trial, best loss: -0.6604850658656904]





Early stopping occurred at epoch 45 with best_epoch = 25 and best_val_0_auc = 0.72213
  3%|▎         | 1/30 [00:46<12:39, 26.20s/trial, best loss: -0.6604850658656904]





Early stopping occurred at epoch 52 with best_epoch = 32 and best_val_0_auc = 0.66717
  7%|▋         | 2/30 [00:52<12:19, 26.42s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 52 with best_epoch = 32 and best_val_0_auc = 0.61998
  7%|▋         | 2/30 [01:04<12:19, 26.42s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 30 with best_epoch = 10 and best_val_0_auc = 0.61723
  7%|▋         | 2/30 [01:11<12:19, 26.42s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 54 with best_epoch = 34 and best_val_0_auc = 0.65567
  7%|▋         | 2/30 [01:23<12:19, 26.42s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 27 with best_epoch = 7 and best_val_0_auc = 0.62123
  7%|▋         | 2/30 [01:29<12:19, 26.42s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 56 with best_epoch = 36 and best_val_0_auc = 0.63211
 10%|█         | 3/30 [01:42<16:46, 37.28s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 53 with best_epoch = 33 and best_val_0_auc = 0.64371
 10%|█         | 3/30 [01:55<16:46, 37.28s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 28 with best_epoch = 8 and best_val_0_auc = 0.60714
 10%|█         | 3/30 [02:02<16:46, 37.28s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 72 with best_epoch = 52 and best_val_0_auc = 0.64413
 10%|█         | 3/30 [02:18<16:46, 37.28s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 30 with best_epoch = 10 and best_val_0_auc = 0.65239
 10%|█         | 3/30 [02:25<16:46, 37.28s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 25 with best_epoch = 5 and best_val_0_auc = 0.61877
 13%|█▎        | 4/30 [02:31<18:05, 41.74s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 35 with best_epoch = 15 and best_val_0_auc = 0.64587
 13%|█▎        | 4/30 [02:39<18:05, 41.74s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 61 with best_epoch = 41 and best_val_0_auc = 0.6836
 13%|█▎        | 4/30 [02:53<18:05, 41.74s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 75 with best_epoch = 55 and best_val_0_auc = 0.69885
 13%|█▎        | 4/30 [03:11<18:05, 41.74s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 24 with best_epoch = 4 and best_val_0_auc = 0.67041
 13%|█▎        | 4/30 [03:17<18:05, 41.74s/trial, best loss: -0.6680572630115097]





Early stopping occurred at epoch 54 with best_epoch = 34 and best_val_0_auc = 0.65525
 17%|█▋        | 5/30 [03:29<19:52, 47.71s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 45 with best_epoch = 25 and best_val_0_auc = 0.61165
 17%|█▋        | 5/30 [03:35<19:52, 47.71s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 53 with best_epoch = 33 and best_val_0_auc = 0.67554
 17%|█▋        | 5/30 [03:41<19:52, 47.71s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 44 with best_epoch = 24 and best_val_0_auc = 0.6966
 17%|█▋        | 5/30 [03:46<19:52, 47.71s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 43 with best_epoch = 23 and best_val_0_auc = 0.68739
 17%|█▋        | 5/30 [03:52<19:52, 47.71s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 43 with best_epoch = 23 and best_val_0_auc = 0.6548
 20%|██        | 6/30 [03:57<16:20, 40.84s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 58 with best_epoch = 38 and best_val_0_auc = 0.66261
 20%|██        | 6/30 [04:04<16:20, 40.84s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 85 with best_epoch = 65 and best_val_0_auc = 0.65779
 20%|██        | 6/30 [04:14<16:20, 40.84s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 34 with best_epoch = 14 and best_val_0_auc = 0.58109
 20%|██        | 6/30 [04:18<16:20, 40.84s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 66 with best_epoch = 46 and best_val_0_auc = 0.65514
 20%|██        | 6/30 [04:26<16:20, 40.84s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 73 with best_epoch = 53 and best_val_0_auc = 0.64913
 23%|██▎       | 7/30 [04:34<15:14, 39.76s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 34 with best_epoch = 14 and best_val_0_auc = 0.63993
 23%|██▎       | 7/30 [04:39<15:14, 39.76s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 35 with best_epoch = 15 and best_val_0_auc = 0.64469
 23%|██▎       | 7/30 [04:43<15:14, 39.76s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 31 with best_epoch = 11 and best_val_0_auc = 0.66305
 23%|██▎       | 7/30 [04:46<15:14, 39.76s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 28 with best_epoch = 8 and best_val_0_auc = 0.6951
 23%|██▎       | 7/30 [04:50<15:14, 39.76s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 30 with best_epoch = 10 and best_val_0_auc = 0.68408
 27%|██▋       | 8/30 [04:54<12:10, 33.20s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 55 with best_epoch = 35 and best_val_0_auc = 0.61618
 27%|██▋       | 8/30 [05:06<12:10, 33.20s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 68 with best_epoch = 48 and best_val_0_auc = 0.66566
 27%|██▋       | 8/30 [05:21<12:10, 33.20s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 125 with best_epoch = 105 and best_val_0_auc = 0.66097
 27%|██▋       | 8/30 [05:49<12:10, 33.20s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 53 with best_epoch = 33 and best_val_0_auc = 0.67329
 27%|██▋       | 8/30 [06:01<12:10, 33.20s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 77 with best_epoch = 57 and best_val_0_auc = 0.65057
 30%|███       | 9/30 [06:18<17:16, 49.38s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 90 with best_epoch = 70 and best_val_0_auc = 0.65435
 30%|███       | 9/30 [06:38<17:16, 49.38s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 60 with best_epoch = 40 and best_val_0_auc = 0.64238
 30%|███       | 9/30 [06:52<17:16, 49.38s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 64 with best_epoch = 44 and best_val_0_auc = 0.64123
 30%|███       | 9/30 [07:07<17:16, 49.38s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 61 with best_epoch = 41 and best_val_0_auc = 0.71841
 30%|███       | 9/30 [07:21<17:16, 49.38s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.6564
 33%|███▎      | 10/30 [07:32<18:59, 56.97s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 58 with best_epoch = 38 and best_val_0_auc = 0.67345
 33%|███▎      | 10/30 [07:39<18:59, 56.97s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 26 with best_epoch = 6 and best_val_0_auc = 0.67748
 33%|███▎      | 10/30 [07:43<18:59, 56.97s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 38 with best_epoch = 18 and best_val_0_auc = 0.66022
 33%|███▎      | 10/30 [07:48<18:59, 56.97s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 38 with best_epoch = 18 and best_val_0_auc = 0.69884
 33%|███▎      | 10/30 [07:52<18:59, 56.97s/trial, best loss: -0.6707953708246024]





Early stopping occurred at epoch 38 with best_epoch = 18 and best_val_0_auc = 0.68011
 37%|███▋      | 11/30 [07:57<14:52, 46.98s/trial, best loss: -0.6780188434624901]





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_0_auc = 0.66551
 37%|███▋      | 11/30 [08:08<14:52, 46.98s/trial, best loss: -0.6780188434624901]





Early stopping occurred at epoch 64 with best_epoch = 44 and best_val_0_auc = 0.69301
 37%|███▋      | 11/30 [08:22<14:52, 46.98s/trial, best loss: -0.6780188434624901]





Early stopping occurred at epoch 87 with best_epoch = 67 and best_val_0_auc = 0.68905
 37%|███▋      | 11/30 [08:42<14:52, 46.98s/trial, best loss: -0.6780188434624901]





Early stopping occurred at epoch 71 with best_epoch = 51 and best_val_0_auc = 0.72392
 37%|███▋      | 11/30 [08:58<14:52, 46.98s/trial, best loss: -0.6780188434624901]





Early stopping occurred at epoch 33 with best_epoch = 13 and best_val_0_auc = 0.69118
 40%|████      | 12/30 [09:06<16:06, 53.70s/trial, best loss: -0.692533635985155] 





Early stopping occurred at epoch 82 with best_epoch = 62 and best_val_0_auc = 0.64781
 40%|████      | 12/30 [09:15<16:06, 53.70s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_0_auc = 0.66992
 40%|████      | 12/30 [09:21<16:06, 53.70s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 41 with best_epoch = 21 and best_val_0_auc = 0.6837
 40%|████      | 12/30 [09:26<16:06, 53.70s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 48 with best_epoch = 28 and best_val_0_auc = 0.6959
 40%|████      | 12/30 [09:32<16:06, 53.70s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 76 with best_epoch = 56 and best_val_0_auc = 0.64334
 43%|████▎     | 13/30 [09:41<13:35, 48.00s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_0_auc = 0.62916
 43%|████▎     | 13/30 [09:46<13:35, 48.00s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 48 with best_epoch = 28 and best_val_0_auc = 0.67372
 43%|████▎     | 13/30 [09:52<13:35, 48.00s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 37 with best_epoch = 17 and best_val_0_auc = 0.72304
 43%|████▎     | 13/30 [09:56<13:35, 48.00s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 28 with best_epoch = 8 and best_val_0_auc = 0.69188
 43%|████▎     | 13/30 [10:00<13:35, 48.00s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 37 with best_epoch = 17 and best_val_0_auc = 0.66484
 47%|████▋     | 14/30 [10:04<10:50, 40.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 93 with best_epoch = 73 and best_val_0_auc = 0.69099
 47%|████▋     | 14/30 [10:26<10:50, 40.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 44 with best_epoch = 24 and best_val_0_auc = 0.65155
 47%|████▋     | 14/30 [10:36<10:50, 40.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 38 with best_epoch = 18 and best_val_0_auc = 0.65921
 47%|████▋     | 14/30 [10:44<10:50, 40.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 73 with best_epoch = 53 and best_val_0_auc = 0.68775
 47%|████▋     | 14/30 [11:01<10:50, 40.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 46 with best_epoch = 26 and best_val_0_auc = 0.68281
 50%|█████     | 15/30 [11:11<12:08, 48.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 51 with best_epoch = 31 and best_val_0_auc = 0.63679
 50%|█████     | 15/30 [11:23<12:08, 48.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 33 with best_epoch = 13 and best_val_0_auc = 0.65899
 50%|█████     | 15/30 [11:31<12:08, 48.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 40 with best_epoch = 20 and best_val_0_auc = 0.6131
 50%|█████     | 15/30 [11:40<12:08, 48.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 72 with best_epoch = 52 and best_val_0_auc = 0.66077
 50%|█████     | 15/30 [11:57<12:08, 48.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 35 with best_epoch = 15 and best_val_0_auc = 0.6513
 53%|█████▎    | 16/30 [12:05<11:40, 50.03s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 31 with best_epoch = 11 and best_val_0_auc = 0.59871
 53%|█████▎    | 16/30 [12:12<11:40, 50.03s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 48 with best_epoch = 28 and best_val_0_auc = 0.68354
 53%|█████▎    | 16/30 [12:23<11:40, 50.03s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.63883
 53%|█████▎    | 16/30 [12:34<11:40, 50.03s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 26 with best_epoch = 6 and best_val_0_auc = 0.6454
 53%|█████▎    | 16/30 [12:40<11:40, 50.03s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 23 with best_epoch = 3 and best_val_0_auc = 0.61773
 57%|█████▋    | 17/30 [12:45<10:11, 47.07s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 33 with best_epoch = 13 and best_val_0_auc = 0.5594
 57%|█████▋    | 17/30 [12:52<10:11, 47.07s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 23 with best_epoch = 3 and best_val_0_auc = 0.53637
 57%|█████▋    | 17/30 [12:58<10:11, 47.07s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 26 with best_epoch = 6 and best_val_0_auc = 0.5661
 57%|█████▋    | 17/30 [13:04<10:11, 47.07s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 22 with best_epoch = 2 and best_val_0_auc = 0.56174
 57%|█████▋    | 17/30 [13:09<10:11, 47.07s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 79 with best_epoch = 59 and best_val_0_auc = 0.59798
 60%|██████    | 18/30 [13:26<09:04, 45.33s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 44 with best_epoch = 24 and best_val_0_auc = 0.59507
 60%|██████    | 18/30 [13:34<09:04, 45.33s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 68 with best_epoch = 48 and best_val_0_auc = 0.65687
 60%|██████    | 18/30 [13:45<09:04, 45.33s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 78 with best_epoch = 58 and best_val_0_auc = 0.6717
 60%|██████    | 18/30 [13:58<09:04, 45.33s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 46 with best_epoch = 26 and best_val_0_auc = 0.63702
 60%|██████    | 18/30 [14:07<09:04, 45.33s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 61 with best_epoch = 41 and best_val_0_auc = 0.66204
 63%|██████▎   | 19/30 [14:17<08:36, 46.98s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 79 with best_epoch = 59 and best_val_0_auc = 0.6605
 63%|██████▎   | 19/30 [14:31<08:36, 46.98s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 45 with best_epoch = 25 and best_val_0_auc = 0.71312
 63%|██████▎   | 19/30 [14:39<08:36, 46.98s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.70445
 63%|██████▎   | 19/30 [14:48<08:36, 46.98s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.68458
 63%|██████▎   | 19/30 [14:57<08:36, 46.98s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 69 with best_epoch = 49 and best_val_0_auc = 0.67459
 67%|██████▋   | 20/30 [15:09<08:03, 48.36s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 42 with best_epoch = 22 and best_val_0_auc = 0.63884
 67%|██████▋   | 20/30 [15:16<08:03, 48.36s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 52 with best_epoch = 32 and best_val_0_auc = 0.69337
 67%|██████▋   | 20/30 [15:25<08:03, 48.36s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 24 with best_epoch = 4 and best_val_0_auc = 0.6705
 67%|██████▋   | 20/30 [15:29<08:03, 48.36s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 51 with best_epoch = 31 and best_val_0_auc = 0.70538
 67%|██████▋   | 20/30 [15:38<08:03, 48.36s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_0_auc = 0.67899
 70%|███████   | 21/30 [15:47<06:47, 45.23s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 41 with best_epoch = 21 and best_val_0_auc = 0.65257
 70%|███████   | 21/30 [15:54<06:47, 45.23s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 72 with best_epoch = 52 and best_val_0_auc = 0.69375
 70%|███████   | 21/30 [16:07<06:47, 45.23s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 70 with best_epoch = 50 and best_val_0_auc = 0.68565
 70%|███████   | 21/30 [16:19<06:47, 45.23s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.675
 70%|███████   | 21/30 [16:28<06:47, 45.23s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.6593
 73%|███████▎  | 22/30 [16:37<06:14, 46.76s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 46 with best_epoch = 26 and best_val_0_auc = 0.67855
 73%|███████▎  | 22/30 [16:45<06:14, 46.76s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 51 with best_epoch = 31 and best_val_0_auc = 0.68175
 73%|███████▎  | 22/30 [16:54<06:14, 46.76s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 55 with best_epoch = 35 and best_val_0_auc = 0.67859
 73%|███████▎  | 22/30 [17:04<06:14, 46.76s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 63 with best_epoch = 43 and best_val_0_auc = 0.70621
 73%|███████▎  | 22/30 [17:15<06:14, 46.76s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 63 with best_epoch = 43 and best_val_0_auc = 0.68281
 77%|███████▋  | 23/30 [17:26<05:32, 47.53s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.65182
 77%|███████▋  | 23/30 [17:35<05:32, 47.53s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 36 with best_epoch = 16 and best_val_0_auc = 0.59837
 77%|███████▋  | 23/30 [17:41<05:32, 47.53s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 30 with best_epoch = 10 and best_val_0_auc = 0.5987
 77%|███████▋  | 23/30 [17:47<05:32, 47.53s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.65291
 77%|███████▋  | 23/30 [17:55<05:32, 47.53s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.65827
 80%|████████  | 24/30 [18:04<04:27, 44.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 68 with best_epoch = 48 and best_val_0_auc = 0.67917
 80%|████████  | 24/30 [18:16<04:27, 44.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 82 with best_epoch = 62 and best_val_0_auc = 0.69529
 80%|████████  | 24/30 [18:30<04:27, 44.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 66 with best_epoch = 46 and best_val_0_auc = 0.69731
 80%|████████  | 24/30 [18:41<04:27, 44.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.68674
 80%|████████  | 24/30 [18:49<04:27, 44.60s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 60 with best_epoch = 40 and best_val_0_auc = 0.65802
 83%|████████▎ | 25/30 [19:00<03:59, 47.99s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 34 with best_epoch = 14 and best_val_0_auc = 0.65585
 83%|████████▎ | 25/30 [19:06<03:59, 47.99s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 49 with best_epoch = 29 and best_val_0_auc = 0.67817
 83%|████████▎ | 25/30 [19:14<03:59, 47.99s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 35 with best_epoch = 15 and best_val_0_auc = 0.66513
 83%|████████▎ | 25/30 [19:20<03:59, 47.99s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 32 with best_epoch = 12 and best_val_0_auc = 0.71927
 83%|████████▎ | 25/30 [19:26<03:59, 47.99s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 23 with best_epoch = 3 and best_val_0_auc = 0.65485
 87%|████████▋ | 26/30 [19:30<02:50, 42.73s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 47 with best_epoch = 27 and best_val_0_auc = 0.65246
 87%|████████▋ | 26/30 [19:38<02:50, 42.73s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 50 with best_epoch = 30 and best_val_0_auc = 0.67573
 87%|████████▋ | 26/30 [19:47<02:50, 42.73s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 90 with best_epoch = 70 and best_val_0_auc = 0.68348
 87%|████████▋ | 26/30 [20:02<02:50, 42.73s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 59 with best_epoch = 39 and best_val_0_auc = 0.71878
 87%|████████▋ | 26/30 [20:12<02:50, 42.73s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 59 with best_epoch = 39 and best_val_0_auc = 0.67949
 90%|█████████ | 27/30 [20:22<02:16, 45.47s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 45 with best_epoch = 25 and best_val_0_auc = 0.63695
 90%|█████████ | 27/30 [20:30<02:16, 45.47s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 28 with best_epoch = 8 and best_val_0_auc = 0.65106
 90%|█████████ | 27/30 [20:35<02:16, 45.47s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 59 with best_epoch = 39 and best_val_0_auc = 0.65108
 90%|█████████ | 27/30 [20:45<02:16, 45.47s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 57 with best_epoch = 37 and best_val_0_auc = 0.61609
 90%|█████████ | 27/30 [20:55<02:16, 45.47s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 37 with best_epoch = 17 and best_val_0_auc = 0.64293
 93%|█████████▎| 28/30 [21:02<01:27, 43.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 66 with best_epoch = 46 and best_val_0_auc = 0.65676
 93%|█████████▎| 28/30 [21:13<01:27, 43.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 63 with best_epoch = 43 and best_val_0_auc = 0.70058
 93%|█████████▎| 28/30 [21:23<01:27, 43.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 44 with best_epoch = 24 and best_val_0_auc = 0.65877
 93%|█████████▎| 28/30 [21:31<01:27, 43.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 33 with best_epoch = 13 and best_val_0_auc = 0.65514
 93%|█████████▎| 28/30 [21:37<01:27, 43.63s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 43 with best_epoch = 23 and best_val_0_auc = 0.66538
 97%|█████████▋| 29/30 [21:44<00:43, 43.37s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 80 with best_epoch = 60 and best_val_0_auc = 0.6893
 97%|█████████▋| 29/30 [22:02<00:43, 43.37s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 59 with best_epoch = 39 and best_val_0_auc = 0.70587
 97%|█████████▋| 29/30 [22:15<00:43, 43.37s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 74 with best_epoch = 54 and best_val_0_auc = 0.71591
 97%|█████████▋| 29/30 [22:32<00:43, 43.37s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 78 with best_epoch = 58 and best_val_0_auc = 0.72028
 97%|█████████▋| 29/30 [22:49<00:43, 43.37s/trial, best loss: -0.692533635985155]





Early stopping occurred at epoch 37 with best_epoch = 17 and best_val_0_auc = 0.65
100%|██████████| 30/30 [22:58<00:00, 45.94s/trial, best loss: -0.6962740720665268]
[np.float64(1.0992675338862758), np.float64(1.000960344555067), np.float64(1.0269832911054881), 128, 128, 7, 42]
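
The log above prints five early-stopping messages per trial, and the final `best loss` appears to be the negative mean of the five `best_val_0_auc` values from the last trial. The objective function is not shown in this output, so treat that reading as an assumption. A minimal sketch that checks it against the last five log lines (copied verbatim from above; `AUC_PATTERN` and the variable names are illustrative only):

```python
import re
from statistics import mean

# The five early-stopping lines of the final trial, copied from the log above.
log_lines = [
    "Early stopping occurred at epoch 80 with best_epoch = 60 and best_val_0_auc = 0.6893",
    "Early stopping occurred at epoch 59 with best_epoch = 39 and best_val_0_auc = 0.70587",
    "Early stopping occurred at epoch 74 with best_epoch = 54 and best_val_0_auc = 0.71591",
    "Early stopping occurred at epoch 78 with best_epoch = 58 and best_val_0_auc = 0.72028",
    "Early stopping occurred at epoch 37 with best_epoch = 17 and best_val_0_auc = 0.65",
]

# Hypothetical helper pattern: pulls the validation AUC out of each line.
AUC_PATTERN = re.compile(r"best_val_0_auc = ([0-9.]+)")

fold_aucs = [float(AUC_PATTERN.search(line).group(1)) for line in log_lines]
mean_auc = mean(fold_aucs)

# Value printed on the final progress-bar line of the log.
reported_best_loss = -0.6962740720665268

print(f"fold AUCs: {fold_aucs}")
print(f"mean AUC:  {mean_auc:.6f}")
print(f"gap to -best_loss: {abs(mean_auc + reported_best_loss):.2e}")
```

The small remaining gap is consistent with the per-fold AUCs being printed to only five decimal places. Under this reading, the best trial reaches a mean validation AUC of roughly 0.696 across the five folds.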



