# 3. Bank - Feature Engineering

### 3.1 Introduction

So far, we explored the dataset through EDA and trained baseline models using the raw features. While this gave us a first idea of performance, it is unlikely that the original variables in their current form are optimal for modeling. Feature engineering is the step where we refine and transform the data to make it more informative for our algorithms. This can include scaling numerical features, creating interactions, handling categorical variables, or extracting new features that capture hidden patterns through different encoding techniques. The goal is to give the models a richer and cleaner representation of the data, which often leads to better predictive power.

These transformations are particularly beneficial for gradient boosting algorithms such as LightGBM and XGBoost. Unlike simpler models like logistic regression, which rely heavily on linear relationships, gradient boosting can capture complex interactions between features. Well-engineered features make these interactions clearer and easier to exploit, often leading to significant gains in performance. Compared to bagging ensembles or neural networks, boosting algorithms also tend to scale well to large datasets, to be more sample-efficient, and to need less preprocessing, which makes them a strong choice when working with structured data.

(This has been inspired by *Chris Deotte's* [Notebook](https://www.kaggle.com/code/cdeotte/train-more-xgb-nn-lb-0-9774))

In [1]:
import os, warnings, joblib
import numpy as np
import pandas as pd
from glob import glob
from itertools import combinations
from sklearn import set_config
import shap
from sklearn.base import clone

warnings.filterwarnings("ignore")
set_config(transform_output="pandas") 

from bank_functions import *
from bank_feat_engineer import *
from sklearn.model_selection import StratifiedKFold, cross_val_predict

In [2]:
model = "lgb"

result_folder = "Results"
model_folder = os.path.join(result_folder, model)
final_folder = os.path.join(model_folder, "final_models")
shap_folder = os.path.join(result_folder, "Shap")

for f in [result_folder, model_folder, final_folder, shap_folder]:
    os.makedirs(f, exist_ok=True)

df_train = pd.read_csv("Data/train.csv")
df_test  = pd.read_csv("Data/test.csv")
df_orig = pd.read_csv("Data/bank-full.csv", delimiter=";")

target = "y"
X_cat = ["job","month","poutcome","education","contact","marital","loan","housing","default"]
X_num = ["balance","duration","pdays","age","campaign","previous","day"]

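# Placeholder target for the test rows (never used for training); it only keeps
# the column non-null once train, test and orig are concatenated below
df_test[target] = np.random.randint(0, 2, len(df_test))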
df_orig[target] = df_orig[target].map({"yes": 1, "no": 0})
df_orig["id"] = (np.arange(len(df_orig)) + 1e6).astype("int")
df_orig = df_orig.set_index("id")

y = df_train[target]
N_FOLDS = 5

### 3.2 Feature Engineering Strategies

There are several techniques we are going to use to enrich the feature space:
- *Create categorical twins of the numeric variables*: by giving each numeric column a label-encoded copy in which every distinct value is its own category, we allow tree-based models to more easily capture non-linear thresholds and category-like behavior.

- *Create pairs of each categorical variable*: combining categories two by two can reveal interaction effects that are not visible when looking at single features in isolation.

- *(Global) Count Encoding*: replacing categories with their frequency across the dataset provides an ordinal signal that helps models distinguish between common and rare values.

- *(OOF) Target Encoding*: computing target-based encodings out-of-fold means that each row is encoded with statistics from the other folds only, which avoids leakage and gives the model additional predictive power without compromising validity.

Together, these techniques aim to create richer and more expressive features that gradient boosting algorithms can leverage effectively, while keeping training times manageable.
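
The out-of-fold idea can be sketched on a toy frame. This is only a minimal illustration, not the encoder used inside `run_feature_set` (whose implementation is not shown in this notebook): the helper name `oof_target_encode`, the toy data and the fold-mean fallback for unseen categories are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Out-of-fold mean-target encoding of `col` (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=df.index, dtype="float64")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in skf.split(df, df[target]):
        fit_part = df.iloc[fit_idx]
        # Category means are computed on the fitting folds only...
        means = fit_part.groupby(col)[target].mean()
        # ...and applied to the held-out fold, so no row sees its own label.
        # Categories absent from the fitting folds fall back to the fold mean.
        held_out = df[col].iloc[enc_idx].map(means).fillna(fit_part[target].mean())
        encoded.iloc[enc_idx] = held_out.to_numpy()
    return encoded


toy = pd.DataFrame({
    "job": ["a", "a", "a", "a", "b", "b", "b", "b", "c", "c"],
    "y":   [1,   0,   1,   0,   0,   0,   1,   0,   1,   0],
})
toy["TE_job"] = oof_target_encode(toy, "job", "y", n_splits=2, seed=0)
print(toy)
```

Encoding the whole column with means computed on the whole column would let each row contribute to its own feature, which is exactly the leakage the out-of-fold scheme is meant to prevent.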

#### Create categorical twins for each numeric column

In [3]:
# label-encode all categoricals into their original columns
# create a "c2" categorical twin for each numeric column
combine = pd.concat([df_train, df_test, df_orig], axis=0)
X_num_as_cat = []
cat_card = {}
for c in X_num + X_cat:
    n = c
    if c in X_num:
        n = f"{c}2"           # twin treated as categorical
        X_num_as_cat.append(n)
    # factorize into integer codes (NaN -> -1)
    codes, uniques = pd.factorize(combine[c], sort=False)
    combine[n] = codes.astype("int32")
    cat_card[n] = int(combine[n].max()) + 1

    # for consistency with the original notebook
    # - numeric columns stay numeric int32
    # - categorical columns have just been overwritten by their integer codes
    combine[c] = combine[c].astype("int32")

print("New CATS:", X_num_as_cat)
print("Cardinality of all CATS:", cat_card)

New CATS: ['balance2', 'duration2', 'pdays2', 'age2', 'campaign2', 'previous2', 'day2']
Cardinality of all CATS: {'balance2': 8590, 'duration2': 1824, 'pdays2': 628, 'age2': 78, 'campaign2': 52, 'previous2': 54, 'day2': 31, 'job': 12, 'month': 12, 'poutcome': 4, 'education': 4, 'contact': 3, 'marital': 3, 'loan': 2, 'housing': 2, 'default': 2}
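
To make the twin construction concrete, here is a small sketch of what `pd.factorize` does to a numeric column. The toy `age` values are made up for illustration; the point is that every distinct value receives its own integer code, in order of first appearance.

```python
import pandas as pd

toy = pd.DataFrame({"age": [30, 45, 30, 61, 45]})

# One integer code per distinct value, assigned in order of first appearance
codes, uniques = pd.factorize(toy["age"], sort=False)
toy["age2"] = codes.astype("int32")

print(toy["age2"].tolist())          # [0, 1, 0, 2, 1]
print(uniques.tolist())              # [30, 45, 61]
print(int(toy["age2"].max()) + 1)    # cardinality: 3
```

This is why high-resolution columns such as `balance2` end up with thousands of categories in the output above.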


#### Create Pairwise Categorical Combinations

In [4]:
pairs = combinations(X_cat + X_num_as_cat, 2)
new_cols = {}
X_pairs = []

for c1, c2 in pairs:
    name = "_".join(sorted((c1, c2)))
    new_cols[name] = (combine[c1].astype(np.int64) * int(cat_card[c2])) + combine[c2].astype(np.int64)
    X_pairs.append(name)

if new_cols:
    new_df = pd.DataFrame(new_cols, index=combine.index)
    combine = pd.concat([combine, new_df], axis=1)

print(f"Created {len(X_pairs)} new CAT columns")

Created 120 new CAT columns
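
The pair code used above is `code(c1) * cardinality(c2) + code(c2)`. A minimal sketch on made-up label codes shows why this gives every pair of values a distinct integer, assuming all codes are non-negative (a `-1` code produced by `pd.factorize` for a missing value would break that guarantee).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "job":     [0, 0, 1, 2, 2],   # label codes, cardinality 3
    "marital": [0, 1, 1, 0, 1],   # label codes, cardinality 2
})
card = {"job": 3, "marital": 2}

# Same formula as in the cell above, applied to one pair of columns
toy["job_marital"] = (
    toy["job"].astype(np.int64) * card["marital"] + toy["marital"].astype(np.int64)
)
print(toy["job_marital"].tolist())  # [0, 1, 3, 4, 5]

# The mapping is invertible, so two different pairs never share a code
job_back = toy["job_marital"] // card["marital"]
marital_back = toy["job_marital"] % card["marital"]
```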


#### Count Encoding for all categorical columns

In [5]:
CE = []
CC = X_cat + X_num_as_cat + X_pairs
combine["i"] = np.arange(len(combine))  # for restoring order later

print(f"Processing {len(CC)} columns... ", end="")
for i, c in enumerate(CC):
    if i % 10 == 0:
        print(f"{i}, ", end="")
    tmp = combine.groupby(c)["y"].count().astype("int32").rename(f"CE_{c}")
    CE.append(f"CE_{c}")
    # merge by key on left column and right index (Series)
    combine = combine.merge(tmp, left_on=c, right_index=True, how="left")
combine = combine.sort_values("i")
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
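
As a side note, the same count encoding can be sketched without `merge`, which also keeps the row order untouched. This is an alternative formulation on toy data, not the code that produced the results above; it matches `groupby(c)["y"].count()` as long as `y` has no missing values.

```python
import pandas as pd

toy = pd.DataFrame(
    {"contact": ["cellular", "cellular", "unknown", "telephone", "cellular"]}
)

# Frequency of each category across the whole frame
counts = toy["contact"].value_counts()

# Map every row to the frequency of its category; the index is left as-is
toy["CE_contact"] = toy["contact"].map(counts).astype("int32")
print(toy["CE_contact"].tolist())  # [3, 3, 1, 1, 3]
```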


In [6]:
# Split combine back into train/test/orig (preserve original sizes)
N_train, N_test, N_orig = len(df_train), len(df_test), len(df_orig)
df_train = combine.iloc[:N_train].copy()
df_test  = combine.iloc[N_train:N_train + N_test].copy()
df_orig  = combine.iloc[N_train + N_test:].copy()
del combine
print("Train shape", df_train.shape, "Test shape", df_test.shape, "Original shape", df_orig.shape)

Train shape (750000, 282) Test shape (250000, 282) Original shape (45211, 282)


### 3.3 Data and Feature Augmentation

On top of that, we will also leverage the original dataset **orig** as an additional source of information, applying two augmentation strategies separately in order to avoid leakage:
- *Data Augmentation*: we expand the training set by concatenating the rows from **orig**, increasing sample size and diversity.
- *Feature Augmentation*: we enrich the dataset with new features computed through target encoding on **orig**, providing extra predictive signals.

To measure the impact of these approaches, we will keep the base dataset without augmentation as a benchmark.

#### Target Encoding with Original Data as Columns

In [7]:
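# Target encoding from the original data: per-category mean of y computed on
# df_orig only (no smoothing), then merged onto train and test as new columns.
# No K-fold split is needed because the encoded rows do not contribute to the means.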

TE = []
print(f"Processing {len(CC)} columns... ", end="")
for i, c in enumerate(CC):
    if i % 10 == 0:
        print(f"{i}, ", end="")
    NAME = f"TE_ORIG_{c}"
    tmp = df_orig.groupby(c)["y"].mean().astype("float32").rename(NAME)
    TE.append(NAME)
    # merge series by index
    df_train = df_train.merge(tmp, left_on=c, right_index=True, how="left")
    df_test  = df_test.merge(tmp, left_on=c, right_index=True, how="left")
# sorting back to original order
df_train = df_train.sort_values("i")
df_test  = df_test.sort_values("i")
ORIG_TE_COLS = [c for c in df_train.columns if c.startswith("TE_ORIG_")]
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
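
One detail worth keeping in mind is what happens to categories that never occur in **orig**. The sketch below uses made-up frames to show that they end up as missing values after the left join (written here with `map`, which is equivalent for a single key). LightGBM handles missing values natively, so leaving them as `NaN` is a reasonable default; filling with the global mean of **orig** is shown only as one possible alternative.

```python
import pandas as pd

orig = pd.DataFrame({"job": ["a", "a", "b", "b"], "y": [1, 0, 0, 0]})
train = pd.DataFrame({"job": ["a", "b", "c"]})  # "c" never occurs in orig

# Per-category target mean, computed on orig only
te = orig.groupby("job")["y"].mean().astype("float32")

# Same effect as the left merge above: unseen categories become NaN
train["TE_ORIG_job"] = train["job"].map(te)
print(train["TE_ORIG_job"].tolist())  # [0.5, 0.0, nan]

# Optional alternative (an assumption, not used above): global-mean fallback
filled = train["TE_ORIG_job"].fillna(orig["y"].mean())
```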


In [16]:
# Save the features
CAT_CANDIDATES = set(X_cat) | set(X_num_as_cat) | set(X_pairs)
features = {
    "X_cat": X_cat,
    "X_num": X_num,
    "X_num_as_cat": X_num_as_cat,
    "X_pairs": X_pairs,
    "CE": CE,
    "ORIG_TE_COLS": ORIG_TE_COLS,
    "CAT_CANDIDATES": CAT_CANDIDATES,
}
joblib.dump(features, os.path.join(model_folder, "features.joblib"), compress=3)

['Results\\lgb\\features.joblib']

### 3.4 Run Models

For each feature setup, we will train a LightGBM model using *N_FOLDS*-fold cross-validation and produce two types of outputs:
- **OOF predictions**, used to evaluate the performance of each setup on the training data.
- **Test set predictions**, averaged across folds and submitted to Kaggle, allowing us to compare the different models directly on the competition leaderboard.
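
The two outputs can be sketched as follows. This is a simplified illustration and not the body of `run_feature_set`, which lives in `bank_feat_engineer` and is not shown here: the synthetic data, the `LogisticRegression` stand-in for LightGBM and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the train/test split of the competition data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, y_train, X_test = X[:200], y[:200], X[200:]

n_folds = 5
oof = np.zeros(len(X_train))        # one prediction per training row
test_pred = np.zeros(len(X_test))   # running average over the fold models
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)

for fit_idx, val_idx in skf.split(X_train, y_train):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[fit_idx], y_train[fit_idx])
    # Each training row is predicted by the one model that never saw it
    oof[val_idx] = clf.predict_proba(X_train[val_idx])[:, 1]
    # Every fold model predicts the full test set; contributions are averaged
    test_pred += clf.predict_proba(X_test)[:, 1] / n_folds
```

Because every OOF value comes from a model that did not train on that row, scoring `oof` against `y_train` gives an estimate of generalization that is comparable across feature setups.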

In [8]:
PASSES = [
    ("vanilla_", False, None, 0),     # Pass 1: vanilla runs for benchmark
    ("origrow_", False, df_orig, 1),  # Pass 2: Data augmentation
    ("origcol_", True,  None, 0),     # Pass 3: Feature augmentation (TE_ORIG_*)
]

# ===============================
# Define and run the feature sets
# ===============================
BASE_SETUPS = [
    ("base",                      X_cat + X_num),
    ("with_num_as_cat",           X_cat + X_num + X_num_as_cat),
    ("with_pairs",                X_cat + X_num + X_pairs),
    ("with_ce",                   X_cat + X_num + CE),
    ("with_num_as_cat_and_pairs", X_cat + X_num + X_num_as_cat + X_pairs),
    ("with_num_as_cat_and_ce",    X_cat + X_num + X_num_as_cat + CE),
    ("with_pairs_and_ce",         X_cat + X_num + X_pairs + CE),
    ("with_all",                  X_cat + X_num + X_num_as_cat + X_pairs + CE),
]

# Expand each into the (no_te, te) variants
SETUPS = [
    (name, features, False)
    for name, features in BASE_SETUPS
] + [
    (name + "_te", features, True)
    for name, features in BASE_SETUPS
]

In [9]:
for prefix, add_orig_cols, train_more_orig, append_orig_times in PASSES:
    for set_name, feature_list, use_te in SETUPS:
        base_feats = feature_list
        if add_orig_cols:
            base_feats = list(dict.fromkeys(feature_list + ORIG_TE_COLS))
        run_feature_set(
            train_df=df_train,
            test_df=df_test,
            y=y,
            base_features=base_feats,
            cat_candidates=CAT_CANDIDATES,
            set_name=f"{prefix}{set_name}",
            use_target_encoding=use_te,
            save_folder=os.path.join(result_folder,model),
            model=model,
            te_inner_splits=5,
            seed=SEED,
            n_folds=N_FOLDS,
            train_more_orig=train_more_orig,
            append_orig_times=append_orig_times,
        )

[vanilla_base] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_num_as_cat] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_pairs] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_ce] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_num_as_cat_and_pairs] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_num_as_cat_and_ce] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_pairs_and_ce] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_all] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_base_te] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_num_as_cat_te] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_pairs_te] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_ce_te] Models, OOF and predictions all exist. Skipping entirely.
[vanilla_with_nu

The next step is to run the two full models: one with row augmentation and the other with column augmentation, using the complete set of engineered features. This will allow us to directly compare how each augmentation strategy performs when the feature space is maximized.

For both models, we are also going to compute the SHAP values in order to analyze feature importance. This will help us understand which engineered features and augmentation strategies contribute the most to the predictions, and whether the gains in performance come from specific transformations or from the combination of all features together.

In [10]:
row_aug = "origrow_with_all_te"
col_aug = "origcol_with_all_te"

# ---------------------------
# 1) origrow_with_all_te
#    - augment rows with df_orig
#    - add TE on the augmented matrix
# ---------------------------
with_all = X_cat + X_num + X_num_as_cat + X_pairs + CE
X_tr_base = df_train[with_all].reset_index(drop=True)
y_tr_base = y.reset_index(drop=True)

X_aug = pd.concat([X_tr_base, df_orig[with_all].reset_index(drop=True)], axis=0, ignore_index=True)
y_aug = pd.concat([y_tr_base, df_orig["y"].reset_index(drop=True)], axis=0, ignore_index=True)

X_aug_te, used_feats_row = build_full_te_matrix(
    df_X=X_aug,
    y_vec=y_aug,
    base_features=with_all,
    cat_candidates=CAT_CANDIDATES
)

clf_row, used_feats, best_iter = fit_with_holdout(X_aug_te, y_aug, used_features = used_feats_row, 
                                                  model="lgb", n_splits=N_FOLDS, seed=SEED)

# Matrix for SHAP explanations (training portion only for interpretability)
X_explain_row = X_aug_te.iloc[:len(df_train)]

joblib.dump(clf_row,os.path.join(final_folder, f"{row_aug}.joblib"),compress=3)

Training until validation scores don't improve for 250 rounds
Early stopping, best iteration is:
[1560]	valid_0's binary_logloss: 0.0140083


['Results\\lgb\\final_models\\origrow_with_all_te.joblib']

In [11]:
# ---------------------------
# 2) origcol_with_all_te
#    - append ORIG_TE_COLS, then add TE on train
# ---------------------------
with_all_and_origcol = list(dict.fromkeys(with_all + ORIG_TE_COLS))
X_col = df_train[with_all_and_origcol].reset_index(drop=True)
y_col = y.reset_index(drop=True)

X_col_te, used_feats_col = build_full_te_matrix(
    df_X=X_col,
    y_vec=y_col,
    base_features=with_all_and_origcol,
    cat_candidates=CAT_CANDIDATES
)

clf_col, used_feats, best_iter = fit_with_holdout(X_col_te, y_col, used_features=used_feats_col, 
                                                  model="lgb", n_splits=N_FOLDS, seed=SEED)

joblib.dump(clf_col,os.path.join(final_folder, f"{col_aug}.joblib"),compress=3)

Training until validation scores don't improve for 250 rounds
Early stopping, best iteration is:
[1277]	valid_0's binary_logloss: 0.0132933


['Results\\lgb\\final_models\\origcol_with_all_te.joblib']

In [20]:
# SHAP ---------------------------
pairs = [
    (row_aug, clf_row, X_explain_row),
    (col_aug, clf_col, X_col_te),
]
N_SAMPLE = 25000
for name, clf, X in pairs:
    path = os.path.join(shap_folder, f"shap_{name}.joblib")
    expl = shap.TreeExplainer(clf)
    sample = X.sample(n=min(len(X), N_SAMPLE), random_state=SEED)
    sv = expl(sample, check_additivity=False)
    joblib.dump(sv, path, compress=3)
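
Once the SHAP values are saved, a simple way to compare groups of engineered features is to sum the mean absolute SHAP value per feature family. The sketch below works on a small made-up matrix rather than on the saved objects; with a real `shap.Explanation`, the matrix and names would typically come from its `values` and `feature_names` attributes, whose exact shape can depend on the `shap` version. The `family` helper and its prefix rules are assumptions based on the column names created in this notebook.

```python
import numpy as np
import pandas as pd


def family(name):
    """Assign a feature to a family from its name prefix (hypothetical rule)."""
    if name.startswith("TE_ORIG_"):
        return "orig target encoding"
    if name.startswith("CE_"):
        return "count encoding"
    return "other"


# Made-up SHAP matrix: rows are samples, columns are features
values = np.array([[0.2, -0.1, 0.4],
                   [-0.4, 0.3, 0.0]])
names = ["duration", "CE_job", "TE_ORIG_job"]

# Mean absolute SHAP value per feature, then summed within each family
importance = pd.Series(np.abs(values).mean(axis=0), index=names)
by_family = (
    importance.groupby(importance.index.map(family)).sum().sort_values(ascending=False)
)
print(by_family)
```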

### 3.5 Predictions

Our models are ready, so we can generate both the cross-validation OOF predictions for the post-model analysis and the test predictions for the Kaggle submissions. We will generate two setups:
- A *blend*, which is the average of the two passes trained on **every added feature**: one with *Data (rows) Augmentation* and one with *Feature (columns) Augmentation*.
- A final *stacking model*, which takes all the models that we computed both here and in the [Previous Notebook](2_Bank_ML.ipynb).
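
The difference between the two setups can be sketched on synthetic OOF columns. Everything here is illustrative: the two noisy "base models" are made up, and `LogisticRegression` is only a stand-in for the meta-learner.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)

# Two synthetic OOF probability columns that are correlated with y
oof = pd.DataFrame({
    "model_a": np.clip(0.5 + 0.3 * (y - 0.5) + rng.normal(0, 0.2, 400), 0, 1),
    "model_b": np.clip(0.5 + 0.2 * (y - 0.5) + rng.normal(0, 0.2, 400), 0, 1),
})

# Blend: plain row-wise average, nothing is learned
blend = oof.mean(axis=1)

# Stack: a meta-learner is trained on the OOF columns; its own predictions
# are again produced out-of-fold so they can be evaluated fairly
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stack = cross_val_predict(
    LogisticRegression(), oof, y, cv=cv, method="predict_proba"
)[:, 1]
```

A blend gives every model the same weight, while a stack learns how much to trust each one, at the cost of one more model to validate.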

#### Cross-Validation Out-of-Fold Predictions

In [21]:
base_oof = pd.read_csv(os.path.join(result_folder, "cv_oof.csv"))
augments = ["row","col"]
df_fe = pd.DataFrame({
    f"{model}_{aug}_augments": pd.read_csv(f).squeeze("columns")
    for aug in augments
    for f in [os.path.join(result_folder, model, "oof", f"orig{aug}_with_all_te_oof.csv")]
})

# Write the OOF file for the blend model
blend_oof = df_fe.mean(axis = 1)
pd.DataFrame({"oof":blend_oof}).to_csv(os.path.join(final_folder,"fe_blend_oof.csv"), index=False)

In [35]:
df_base = pd.read_csv(os.path.join(result_folder, "cv_oof.csv")).iloc[:, :-1]
df_merged_oof = pd.concat([df_base,df_fe], axis=1)

# Train final meta-learner on all data
final_meta, all_features, best_iter = fit_with_holdout(df_merged_oof, y)

# Write the OOF file for the final stack model
cv = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
est_for_cv = clone(final_meta).set_params(n_estimators = best_iter)
final_oof = cross_val_predict(est_for_cv, df_merged_oof, y, cv=cv, n_jobs=8, method="predict_proba")[:, 1]

pd.DataFrame({"oof":final_oof}).to_csv(os.path.join(final_folder, "fe_stack_oof.csv"), index=False)
joblib.dump(final_meta,os.path.join(final_folder, "fe_stack.joblib"),compress=3)

Training until validation scores don't improve for 250 rounds
Early stopping, best iteration is:
[663]	valid_0's binary_logloss: 0.125121


['Results\\lgb\\final_models\\fe_stack.joblib']

#### Predictions

In [38]:
all_te_pred = pd.DataFrame({
    f"lgb_{aug}_augments": pd.read_csv(f)["y"]
    for aug in augments
    for f in [os.path.join(result_folder, model, "predictions", f"predictions_orig{aug}_with_all_te.csv")]
})
base_pred = pd.DataFrame({
    os.path.splitext(os.path.basename(f))[0][len("predictions_"):]: pd.read_csv(f).squeeze("columns").iloc[:,1]
    for f in sorted(glob(os.path.join(result_folder, "predictions", "*"))) 
})

base_pred = pd.concat([base_pred,all_te_pred], axis = 1)
base_pred = base_pred[df_merged_oof.columns]

final_pred = pd.DataFrame({"id":df_test["id"],
                           "y":final_meta.predict_proba(base_pred)[:,1]})
final_pred.to_csv(os.path.join(final_folder, "predictions_fe_stack.csv"), index=False)

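# Copy first so that final_pred (the stacking predictions) is not modified in place
df_blend = final_pred.copy()
# Both frames follow the test row order, so assign by position rather than by index
df_blend[target] = all_te_pred.mean(axis=1).to_numpy()
df_blend.to_csv(os.path.join(final_folder, "predictions_fe_blend.csv"), index=False)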

Now that our models are ready, we are going to analyze them in the [Final Analysis Notebook](4_Bank_Final_Analysis.ipynb).