### <span style="color:darkred"> 1. Introduction

This notebook presents a solution to the [Playground Series - Season 5, Episode 8](https://www.kaggle.com/competitions/playground-series-s5e8/overview) Kaggle competition, held in August 2025. The goal is to predict the probability that a client subscribes to a bank term deposit. Submissions are evaluated on the area under the ROC curve (ROC AUC) between the predicted probabilities and the observed target.

The workflow begins by importing the necessary libraries for the project. We then load and inspect the first few rows of the training and testing datasets. Next, we prepare the features for modeling: the target is stored in a separate variable, and string features are converted to the categorical data type.

Two models are trained: XGBoost and LightGBM, both with tuned hyperparameters. We use stratified 5-fold cross-validation, which ensures each fold maintains the original target distribution. In each fold, both models are trained and used to generate out-of-fold predictions, as well as predictions on the test set, the latter averaged across folds.

The cross-validation ROC AUC score is then calculated for each model by comparing its out-of-fold predictions with the true targets. Predictions from two public notebooks, both using LightGBM with advanced hyperparameter tuning, are loaded and combined with the mean of our two models' test predictions through a weighted average.

Finally, we generate a CSV file formatted for submission to the competition.

### <span style="color:darkred"> 2. Import Libraries

In this step, we import the libraries used for data manipulation, model training, cross-validation, and evaluation. NumPy and Pandas handle data manipulation. LGBMClassifier from LightGBM and XGBClassifier from XGBoost are imported for model training. From scikit-learn, StratifiedKFold is imported to perform stratified cross-validation and roc_auc_score to evaluate model performance.

In [1]:
# ===== Import Libraries =====
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

### <span style="color:darkred"> 3. Load and Inspect Data

The following code reads the CSV files containing the training and testing datasets and loads them into pandas DataFrames. It also displays the first few rows of each to better understand their structure and contents.

In [2]:
# ===== Load and Inspect Data =====
X = pd.read_csv("/kaggle/input/playground-series-s5e8/train.csv")
X_test = pd.read_csv("/kaggle/input/playground-series-s5e8/test.csv")

for name, df in [('TRAINING DATA', X), ('TESTING DATA', X_test)]:
    print('-' * 50)
    print(name + ':')
    display(df.head())

--------------------------------------------------
TRAINING DATA:


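| | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 42 | technician | married | secondary | no | 7 | no | no | cellular | 25 | aug | 117 | 3 | -1 | 0 | unknown | 0 |
| 1 | 1 | 38 | blue-collar | married | secondary | no | 514 | no | no | unknown | 18 | jun | 185 | 1 | -1 | 0 | unknown | 0 |
| 2 | 2 | 36 | blue-collar | married | secondary | no | 602 | yes | no | unknown | 14 | may | 111 | 2 | -1 | 0 | unknown | 0 |
| 3 | 3 | 27 | student | single | secondary | no | 34 | yes | no | unknown | 28 | may | 10 | 2 | -1 | 0 | unknown | 0 |
| 4 | 4 | 26 | technician | married | secondary | no | 889 | yes | no | cellular | 3 | feb | 902 | 1 | -1 | 0 | unknown | 1 |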


--------------------------------------------------
TESTING DATA:


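| | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750000 | 32 | blue-collar | married | secondary | no | 1397 | yes | no | unknown | 21 | may | 224 | 1 | -1 | 0 | unknown |
| 1 | 750001 | 44 | management | married | tertiary | no | 23 | yes | no | cellular | 3 | apr | 586 | 2 | -1 | 0 | unknown |
| 2 | 750002 | 36 | self-employed | married | primary | no | 46 | yes | yes | cellular | 13 | may | 111 | 2 | -1 | 0 | unknown |
| 3 | 750003 | 58 | blue-collar | married | secondary | no | -1380 | yes | yes | unknown | 29 | may | 125 | 1 | -1 | 0 | unknown |
| 4 | 750004 | 28 | technician | single | secondary | no | 1950 | yes | no | cellular | 22 | jul | 181 | 1 | -1 | 0 | unknown |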
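
Beyond `head()`, a few quick checks on data types, missing values, and target balance can confirm what the preview suggests. The sketch below is only an illustration: `toy` is a hypothetical stand-in frame built from a few columns of the preview rows above, since the competition files are only available inside the Kaggle environment.

```python
import pandas as pd

# Hypothetical stand-in for the training frame (values taken from the preview rows)
toy = pd.DataFrame({
    "age": [42, 38, 36, 27, 26],
    "job": ["technician", "blue-collar", "blue-collar", "student", "technician"],
    "balance": [7, 514, 602, 34, 889],
    "y": [0, 0, 0, 0, 1],
})

# Column dtypes: numeric columns are integers, text columns use a string-like dtype
print(toy.dtypes)

# Number of missing values per column
missing = toy.isna().sum()
print(missing)

# Share of each target class; a skewed target is one reason to stratify the folds
target_share = toy["y"].value_counts(normalize=True)
print(target_share)
```

On the real data, the same three calls can be applied to `X` directly after loading.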


### <span style="color:darkred"> 4. Prepare Features

In this section, the features are prepared for training. First, the id column is set as the index of both the training and testing datasets to uniquely identify rows. The target variable is then stored in a pandas Series called y and removed from the training dataset. Finally, columns with object data types are identified and converted to the categorical type, which enables efficient handling by models such as LightGBM and XGBoost.

In [3]:
# ===== Prepare Features =====
X = X.set_index('id')
X_test = X_test.set_index('id')

y = X['y']
X = X.drop(['y'], axis=1)

categorical_cols = [col for col in X.columns if X[col].dtype == 'object']
for col in categorical_cols:
    X[col] = X[col].astype("category")
    X_test[col] = X_test[col].astype("category")
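
One thing to keep in mind with this conversion: `astype("category")` builds the category list from the values present in each frame, so if a label occurs in only one of the two files, the same label can receive different integer codes in train and test. If every column contains the same labels in both files, the result is identical either way. The sketch below, using small hypothetical frames, shows one possible way to guard against a mismatch by sharing a single `CategoricalDtype`.

```python
import pandas as pd

# Hypothetical frames: "student" occurs only in train, "admin." only in test
train = pd.DataFrame({"job": ["technician", "student", "blue-collar"]})
test = pd.DataFrame({"job": ["blue-collar", "technician", "admin."]})

# Independent conversion: each frame gets its own (sorted) category list
ind_train = train["job"].astype("category")
ind_test = test["job"].astype("category")
ind_train_code = ind_train.cat.codes[ind_train == "blue-collar"].iloc[0]
ind_test_code = ind_test.cat.codes[ind_test == "blue-collar"].iloc[0]

# Shared conversion: one dtype built from the union of labels, applied to both
all_labels = sorted(set(train["job"]) | set(test["job"]))
job_dtype = pd.CategoricalDtype(categories=all_labels)
train["job"] = train["job"].astype(job_dtype)
test["job"] = test["job"].astype(job_dtype)

# With the shared dtype, a given label has the same code in both frames
train_code = train["job"].cat.codes[train["job"] == "technician"].iloc[0]
test_code = test["job"].cat.codes[test["job"] == "technician"].iloc[0]
```

Whether mismatched codes actually affect predictions depends on the library and version in use, so this is best treated as a defensive habit rather than a required fix.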

### <span style="color:darkred"> 5. XGBoost Model

Now we define an XGBoost model with tuned hyperparameters. The objective is set to binary logistic because this is a binary classification problem and we want to predict probabilities. The evaluation metric is logloss, which works well for probabilistic binary classification tasks. The number of estimators is set to 10,000, which acts as an upper bound: because early stopping is enabled, training can end before that limit is reached.

We use a low learning rate of 0.02 to ensure slower and more stable learning, which typically improves generalization. Each tree’s maximum depth is set to 7 to control model complexity and reduce the risk of overfitting. We use 70% of the features for each tree to introduce randomness and promote better generalization. Categorical feature support is enabled for training. Early stopping rounds are set to 250, so training stops if the logloss on the validation set does not improve for 250 rounds. Finally, a random seed is set to ensure reproducibility.

In [4]:
# ===== Define XGBoost Model =====
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    n_estimators=10000,
    learning_rate=0.02,
    max_depth=7,
    colsample_bytree=0.70,
    enable_categorical=True,
    early_stopping_rounds=250,
    random_state=42
)
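
To make the early-stopping rule concrete, here is a simplified, library-independent sketch of the patience logic. The function `best_iteration` and the loss values are hypothetical, and XGBoost's own implementation may differ in details such as tie handling.

```python
def best_iteration(valid_losses, patience):
    """Return (best_index, stop_index) for a sequence of validation losses.

    Training stops at the first round where the loss has gone `patience`
    rounds without improving on the best value seen so far.
    """
    best_idx = 0
    for i, loss in enumerate(valid_losses):
        if loss < valid_losses[best_idx]:
            best_idx = i  # new best round
        elif i - best_idx >= patience:
            return best_idx, i  # patience exhausted
    return best_idx, len(valid_losses) - 1  # ran to the end


# Hypothetical validation log loss per boosting round
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]
best, stopped = best_iteration(losses, patience=3)
print(best, stopped)
```

With `patience=250`, as in the model above, a temporary plateau in validation log loss is tolerated for quite a long time before training is cut off.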

### <span style="color:darkred"> 6. LightGBM Model

In this step, we define the hyperparameters for the LightGBM model. As with XGBoost, the number of estimators is set to 10,000 as an upper bound, and early stopping determines how many trees are actually used. The learning rate is 0.075, higher than the 0.02 used for XGBoost, so each tree contributes more and fewer trees are typically needed. Each tree's maximum depth is set to 9, which allows fairly complex trees while still capping their size.

L1 and L2 regularization are applied through reg_alpha=0.85 and reg_lambda=3.50, the scikit-learn API names for LightGBM's lambda_l1 and lambda_l2 parameters. They penalize large leaf output values and help reduce overfitting. Each tree uses 50% of the features (colsample_bytree), introducing randomness for better generalization. The subsample parameter is set to 0.80 to request 80% of the rows per tree; note that LightGBM applies row subsampling only when subsample_freq is positive, and it is left at its default of 0 here. Early stopping rounds are set to 250 to stop training once the validation metric no longer improves. A random state is used for reproducibility, and verbosity is set to -1 to suppress output during training.

In [5]:
# ===== Define LightGBM Model =====
lgbm = LGBMClassifier(
    n_estimators=10000,
    learning_rate=0.075,
    max_depth=9,
    reg_alpha=0.85,
    reg_lambda=3.50,
    colsample_bytree=0.50,
    subsample=0.80,
    early_stopping_rounds=250,
    random_state=42,
    verbosity=-1
)
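
As a rough illustration of how L1 and L2 penalties shrink leaf outputs, the sketch below uses the common second-order leaf formula from gradient boosting, w = -sign(G) * max(|G| - alpha, 0) / (H + lambda), where G and H are the sums of gradients and hessians in the leaf. The function `leaf_value` and the input numbers are hypothetical, and the real library applies additional constraints not modeled here.

```python
def leaf_value(grad_sum, hess_sum, l1=0.0, l2=0.0):
    """Regularized leaf output under the usual second-order approximation."""
    # L1 soft-thresholds the gradient sum toward zero
    shrunk = max(abs(grad_sum) - l1, 0.0)
    sign = 1.0 if grad_sum > 0 else -1.0
    # L2 is added to the hessian sum in the denominator
    return -sign * shrunk / (hess_sum + l2)


# Hypothetical gradient and hessian sums for one leaf
plain = leaf_value(-4.0, 2.0)                         # no regularization
penalized = leaf_value(-4.0, 2.0, l1=0.85, l2=3.50)   # penalties used above
print(plain, penalized)
```

In this toy case the penalties reduce the leaf output to well under half of its unregularized value, which is the sense in which they "penalize large leaf output values".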

### <span style="color:darkred"> 7. Stratified 5-Fold Cross-Validation

The following code uses stratified 5-fold cross-validation to train both of our models and generate predictions. First, two dictionaries are created: valid_preds, which will store out-of-fold predictions, and test_preds, which will store averaged predictions on the test set. Each dictionary has one key per model, with the values initialized as NumPy arrays of appropriate lengths filled with zeros.

Next, an instance of StratifiedKFold is created with 5 splits. Shuffling is enabled to introduce randomness, and a fixed random state is set to ensure reproducibility.

For each fold, the training data is split into training and validation sets using stratification to preserve the class distribution. Both the XGBoost and LightGBM models are then fitted on the training set, with the validation set used for early stopping. Verbose output for XGBoost is suppressed.

Finally, we loop through each fitted model to generate out-of-fold predictions, which will be used for model evaluation, and test set predictions, which are averaged across folds. The predictions correspond to the estimated probabilities of the positive class, representing the likelihood that each client subscribes to the bank term deposit.

In [6]:
# ===== Stratified 5-Fold Cross-Validation =====
valid_preds = {'XGBOOST': np.zeros(len(X)), 'LIGHTGBM': np.zeros(len(X))}
test_preds = {'XGBOOST': np.zeros(len(X_test)), 'LIGHTGBM': np.zeros(len(X_test))}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for (train_idx, valid_idx) in skf.split(X, y):
    X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
    y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

    xgb.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
    lgbm.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])

    for name, model in [('XGBOOST', xgb), ('LIGHTGBM', lgbm)]:
        valid_preds[name][valid_idx] = model.predict_proba(X_valid)[:, 1]
        test_preds[name] += model.predict_proba(X_test)[:, 1] / skf.n_splits
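
The out-of-fold pattern can be tried in isolation on synthetic data. The sketch below is a stand-in under stated assumptions: it swaps the boosted models for scikit-learn's `LogisticRegression` to stay lightweight, and `X_demo`, `y_demo`, and `X_new` are hypothetical arrays, not the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data standing in for the training set
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.88], random_state=0
)
X_new = X_demo[:50]  # pretend test set

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X_demo))
new_preds = np.zeros(len(X_new))
fold_pos_rates = []

for train_idx, valid_idx in skf_demo.split(X_demo, y_demo):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Each row is predicted exactly once, by a model that never saw it
    oof[valid_idx] = model.predict_proba(X_demo[valid_idx])[:, 1]
    # Test predictions accumulate into a mean over the folds
    new_preds += model.predict_proba(X_new)[:, 1] / skf_demo.n_splits
    # Track the positive rate per validation fold to see the stratification
    fold_pos_rates.append(y_demo[valid_idx].mean())

print(fold_pos_rates)
```

The per-fold positive rates stay close to one another, which is the effect stratification is meant to provide.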

### <span style="color:darkred"> 8. Model Evaluation

In this step, we calculate the cross-validation ROC AUC score of each model, XGBoost and LightGBM, by comparing the true targets with that model's out-of-fold predictions.

In [7]:
# ===== Model Evaluation =====
for model in ['XGBOOST', 'LIGHTGBM']:
    print(f'{model} - ROC AUC SCORE: {roc_auc_score(y, valid_preds[model]):.6f}')

XGBOOST - ROC AUC SCORE: 0.968388
LIGHTGBM - ROC AUC SCORE: 0.969291
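
For intuition about the metric: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below checks this on a tiny hypothetical example against `roc_auc_score`; the helper `pairwise_auc` is illustrative and, because it builds every pair, is only suitable for small arrays.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Matrix of score differences for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


# Hypothetical labels and predicted probabilities
y_small = [0, 0, 1, 1]
p_small = [0.1, 0.4, 0.35, 0.8]

auc_pairs = pairwise_auc(y_small, p_small)
auc_sklearn = roc_auc_score(y_small, p_small)
print(auc_pairs, auc_sklearn)
```

Because only the ordering of the scores matters, ROC AUC is unchanged by any strictly increasing transformation of the predictions.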


### <span style="color:darkred"> 9. Blending

In this section, we blend predictions from two public notebooks with those generated by our own models to improve final prediction accuracy and leaderboard performance. Specifically, the public notebooks are [Playground S5E8: Enhanced LightGBM](https://www.kaggle.com/code/molozhenko/playground-s5e8-enhanced-lightgbm) and [Bank Term Deposit: Single LightGBM](https://www.kaggle.com/code/bizen250/bank-term-deposit-single-lightgbm), both of which employ LightGBM models with advanced hyperparameter tuning and basic feature engineering. Their cross-validation ROC AUC scores are 0.973362 and 0.971080, respectively.

The code loads their test set predictions into NumPy arrays named sub1 and sub2. Before blending, we average the test predictions from our XGBoost and LightGBM models. We then combine the three sets of predictions with a weighted average, assigning 15% weight to our averaged predictions, 50% to sub1, and 35% to sub2. This is blending rather than stacking, since no meta-model is fitted on the predictions. The weights were chosen heuristically to balance contributions from each model.

Blending leverages the complementary strengths of different models to improve prediction robustness and overall performance. It often reduces overfitting and captures different patterns in the data, leading to better generalization. The public notebook predictions are used in accordance with their licenses and with attribution.

In [8]:
# ===== Blending =====
sub1 = np.load('/kaggle/input/playground-s5e8-enhanced-lightgbm/pred_conservative.npy')
sub2 = np.load('/kaggle/input/k/bizen250/bank-term-deposit-single-lightgbm/pred.npy')

mean_preds = (test_preds['XGBOOST'] + test_preds['LIGHTGBM']) / 2
final_preds = 0.15 * mean_preds + 0.5 * sub1 + 0.35 * sub2
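
A weighted blend is easy to get subtly wrong if the arrays differ in shape or the weights do not sum to 1. The sketch below wraps the same arithmetic in a small hypothetical helper, `weighted_blend`, with basic checks; the probabilities are made up for illustration. It assumes every array is already in the same row order as the test ids, which the helper cannot verify.

```python
import numpy as np


def weighted_blend(preds, weights):
    """Weighted average of equally shaped prediction arrays."""
    preds = [np.asarray(p, dtype=float) for p in preds]
    weights = np.asarray(weights, dtype=float)
    if len(preds) != len(weights):
        raise ValueError("need one weight per prediction array")
    if any(p.shape != preds[0].shape for p in preds):
        raise ValueError("prediction arrays must share one shape")
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1 so outputs stay in [0, 1]")
    return sum(w * p for w, p in zip(weights, preds))


# Hypothetical probabilities for three test rows from three sources
ours = np.array([0.10, 0.60, 0.90])
first = np.array([0.20, 0.50, 0.95])
second = np.array([0.15, 0.55, 0.85])

blended = weighted_blend([ours, first, second], [0.15, 0.50, 0.35])
print(blended)
```

Applied to the notebook's variables, the call would be `weighted_blend([mean_preds, sub1, sub2], [0.15, 0.5, 0.35])`.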

### <span style="color:darkred"> 10. Create Submission File

The final step is creating a CSV file containing the test set predictions, ready for submission to the competition.

In [9]:
# ===== Create Submission File =====
output = pd.DataFrame({'id': X_test.index, 'y': final_preds})
output.to_csv('submission.csv', index=False)
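
Before uploading, it can be worth reading the file back and checking its structure. The sketch below shows the idea with hypothetical ids and probabilities written to an in-memory buffer instead of `submission.csv`; the column names `id` and `y` follow the code above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_test.index and final_preds
ids = np.array([750000, 750001, 750002])
probs = np.array([0.02, 0.47, 0.91])

# Write to an in-memory buffer and read it back, as a file round trip would
buffer = io.StringIO()
pd.DataFrame({"id": ids, "y": probs}).to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic structural checks on the round-tripped file
assert list(check.columns) == ["id", "y"]
assert check["id"].is_unique
assert check["y"].between(0, 1).all()
assert not check["y"].isna().any()
print(check)
```

The same checks can be run on the real file by replacing the buffer with `pd.read_csv('submission.csv')`.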