## BankTermPredict: Predicting Client Subscription to Term Deposits from Campaign Data

Feature Engineering  
Source:  https://archive.ics.uci.edu/dataset/222/bank+marketing  
Greg Gibson Sept. 2025

### Feature Exclusion / Transformation

- 'duration' removed (leakage)
- Outliers capped per EDA: balance, campaign, previous, pdays
- pdays transformed to never_contacted + recency
- Ordinal encoding for education
- One-hot encoding for categorical features
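
The capping step listed above does not appear in the cells that follow. A minimal sketch of one common way to do it is shown here, assuming percentile-based clipping. The 1st/99th percentile bounds and the helper name `cap_outliers` are illustrative assumptions, not the thresholds chosen in the EDA.

```python
import pandas as pd

def cap_outliers(df, cols, lower_q=0.01, upper_q=0.99):
    """Clip each listed column to its [lower_q, upper_q] quantile range.

    Hypothetical helper: the quantile bounds are placeholders, not EDA values.
    """
    capped = df.copy()
    for col in cols:
        lo, hi = capped[col].quantile([lower_q, upper_q])
        capped[col] = capped[col].clip(lower=lo, upper=hi)
    return capped

# Toy data with one extreme value per column (not the bank dataset)
toy = pd.DataFrame({
    "balance": [0, 100, 200, 300, 100_000],
    "campaign": [1, 1, 2, 3, 60],
})
toy_capped = cap_outliers(toy, ["balance", "campaign"])
print(toy_capped.max())
```

In a leakage-safe workflow the quantile bounds would be computed on the training split and reused unchanged on the test split.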

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.utils.class_weight import compute_class_weight
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, recall_score, precision_score, f1_score
from imblearn.over_sampling import SMOTE

In [3]:
# Load data
df = pd.read_csv("../data/bank-full.csv", sep=";")

In [4]:
# Drop leakage column
if "duration" in df.columns:
    df = df.drop(columns=["duration"])

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  campaign   45211 non-null  int64 
 12  pdays      45211 non-null  int64 
 13  previous   45211 non-null  int64 
 14  poutcome   45211 non-null  object
 15  y          45211 non-null  object
dtypes: int64(6), object(10)
memory usage: 5.5+ MB


In [6]:
# Split features and target
X = df.drop(columns=["y"])
y = df["y"].map({"yes": 1, "no": 0}).astype(bool)  # Binary target

Domain Features

In [7]:
# Never contacted flag
X["never_contacted"] = (X["pdays"] == -1).astype(bool)

Aggregations / Ratios / Interactions

In [8]:
# Age buckets (ordinal groupings from EDA)
X["age_group"] = pd.cut(
    X["age"],
    bins=[17, 25, 35, 50, 65, 100],
    labels=["18-25", "26-35", "36-50", "51-65", "66+"]
)

In [9]:
# Campaign frequency ratio
# Ratio of campaign calls to previous contacts (avoid div/0)
X["campaign_per_previous"] = X["campaign"] / (X["previous"] + 1)

In [10]:
# Interaction: balance × housing loan
# Hypothesis: higher balance + mortgage might influence product interest
X["balance_x_housing"] = X["balance"] * (X["housing"].map({"yes": 1, "no": 0}))

Temporal / Recency Features

In [11]:
# Recency of contact (from pdays)
# Lower pdays = more recent. Transform to "days_since_contact"
X["days_since_contact"] = X["pdays"].replace(-1, np.nan)

Binary / One Hot / Ordinal Encodings

In [12]:
# Column groups
yes_no_cols = ['default', 'housing', 'loan']
cat_cols = ['job', 'marital', 'contact', 'month', 'poutcome', 'age_group']
num_cols = ['age', 'balance', 'day', 'campaign', 'pdays', 'previous', 'campaign_per_previous', 'days_since_contact', 'balance_x_housing']

In [13]:
# yes/no to True/False
for col in yes_no_cols:
    X[col] = X[col].map({"yes": 1, "no": 0}).astype(bool)

In [14]:
# Ordinal encoding for education
order_map = {
    "unknown": 0,
    "primary": 1,
    "secondary": 2,
    "tertiary": 3
}

X['education_ord'] = X['education'].map(order_map)

In [15]:
# One-hot encode categorical columns in X
X = pd.get_dummies(X, columns=cat_cols, drop_first=True)
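
With `drop_first=True`, `pd.get_dummies` omits the first level of each encoded column, so that level becomes the implicit baseline. A small sketch on toy values of `contact` (the toy frame is illustrative, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({"contact": ["cellular", "telephone", "unknown", "cellular"]})
encoded = pd.get_dummies(toy, columns=["contact"], drop_first=True)

# "cellular" sorts first, so it is dropped and becomes the baseline level
print(list(encoded.columns))
```

A row where every remaining `contact_*` column is false therefore represents the baseline level, here `cellular`.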

In [16]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 48 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   age                    45211 non-null  int64  
 1   education              45211 non-null  object 
 2   default                45211 non-null  bool   
 3   balance                45211 non-null  int64  
 4   housing                45211 non-null  bool   
 5   loan                   45211 non-null  bool   
 6   day                    45211 non-null  int64  
 7   campaign               45211 non-null  int64  
 8   pdays                  45211 non-null  int64  
 9   previous               45211 non-null  int64  
 10  never_contacted        45211 non-null  bool   
 11  campaign_per_previous  45211 non-null  float64
 12  balance_x_housing      45211 non-null  int64  
 13  days_since_contact     8257 non-null   float64
 14  education_ord          45211 non-null  int64  
 15  jo

In [17]:
X[['education_ord', 'month_nov', 'poutcome_other', 'job_blue-collar', 'age_group_51-65', 'contact_unknown']].tail()

       education_ord  month_nov  poutcome_other  job_blue-collar  age_group_51-65  contact_unknown
45206              3       True           False            False             True            False
45207              1       True           False            False            False            False
45208              2       True           False            False            False            False
45209              2       True           False             True             True            False
45210              2       True            True            False            False            False


In [18]:
# Drop converted columns
X = X.drop(columns=['education', 'pdays'])

Feature Documentation (What / Why / How)

In [19]:
feature_docs = {
    "campaign_per_previous": "Ratio: campaign / (previous+1). High values → aggressive contact strategy.",
    "days_since_contact": "Numeric recency. Derived from pdays (with -1 → NaN).  SimpleImputer used later.",
    "never_contacted": "Binary flag if customer was never contacted before.",
    "balance_x_housing": "Interaction term capturing combined effect of balance and housing loan.",
    "age_group": "Binned age categories based on domain knowledge."
}
for k,v in feature_docs.items():
    print(f"{k}: {v}")

campaign_per_previous: Ratio: campaign / (previous+1). High values → aggressive contact strategy.
days_since_contact: Numeric recency. Derived from pdays (with -1 → NaN).  SimpleImputer used later.
never_contacted: Binary flag if customer was never contacted before.
balance_x_housing: Interaction term capturing combined effect of balance and housing loan.
age_group: Binned age categories based on domain knowledge.


### Imbalance Handling

In [20]:
# Check imbalance
print("Class distribution:\n", y.value_counts(normalize=True))

Class distribution:
 y
False    0.883015
True     0.116985
Name: proportion, dtype: float64


In [21]:
# Train-test split with stratification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [22]:
# Impute missing days_since_contact: fit the median on train only, then apply it to test
imputer = SimpleImputer(strategy="median")

X_train_imputed = X_train.copy()
X_test_imputed = X_test.copy()

X_train_imputed["days_since_contact"] = imputer.fit_transform(X_train[["days_since_contact"]])
X_test_imputed["days_since_contact"] = imputer.transform(X_test[["days_since_contact"]])

Strategy Rationale (SMOTE vs. Class Weights)

In [23]:
# Option A: Class weights (logistic regression, tree models)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y.values.ravel())
class_weight_dict = dict(zip(classes, weights))
print("\nComputed class weights:", class_weight_dict)


Computed class weights: {np.False_: np.float64(0.566241671258955), np.True_: np.float64(4.274059368500661)}
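
The `balanced` heuristic assigns each class the weight `n_samples / (n_classes * n_class_samples)`. The arithmetic can be reproduced by hand. The class counts below (39,922 "no" and 5,289 "yes") are inferred from the 45,211 rows and the class proportions printed earlier, so treat this as a check rather than new data.

```python
n_samples = 45211
counts = {False: 39922, True: 5289}  # inferred from the printed class proportions
n_classes = len(counts)

# balanced weight = n_samples / (n_classes * number of samples in that class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
print(weights)
```

The minority class ends up weighted roughly 7.5 times more heavily than the majority class, which mirrors the class ratio.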


In [24]:
# Train Logistic Regression with class weights
clf_weighted = LogisticRegression(class_weight="balanced", random_state=42, max_iter=500)
clf_weighted.fit(X_train_imputed, y_train)

y_pred_weighted = clf_weighted.predict(X_test_imputed)

print("Classification Report:")
print(classification_report(y_test, y_pred_weighted))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_weighted))
print(f"\nRecall (Sensitivity): {recall_score(y_test, y_pred_weighted):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_weighted):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_weighted):.4f}")
print(f"Accuracy: {clf_weighted.score(X_test_imputed, y_test):.4f}")

Classification Report:
              precision    recall  f1-score   support

       False       0.94      0.69      0.79      7985
        True       0.22      0.68      0.33      1058

    accuracy                           0.68      9043
   macro avg       0.58      0.68      0.56      9043
weighted avg       0.86      0.68      0.74      9043


Confusion Matrix:
[[5470 2515]
 [ 340  718]]

Recall (Sensitivity): 0.6786
Precision: 0.2221
F1-Score: 0.3347
Accuracy: 0.6843
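
Each reported metric follows directly from the confusion matrix above, whose rows are the actual classes and whose columns are the predicted classes. A short check of the arithmetic:

```python
# Confusion matrix from the class-weighted model: rows = actual, columns = predicted
tn, fp = 5470, 2515
fn, tp = 340, 718

recall = tp / (tp + fn)          # share of real subscribers that were found
precision = tp / (tp + fp)       # share of predicted subscribers that were real
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Recall: {recall:.4f}  Precision: {precision:.4f}  F1: {f1:.4f}  Accuracy: {accuracy:.4f}")
```

The low precision means that most customers flagged as likely subscribers are false positives, which is the price paid for the higher recall.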


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=500).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
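
The convergence warning above suggests scaling the features. One way to do that without leaking test statistics is to wrap imputation, scaling, and the classifier in a scikit-learn `Pipeline`, so every step is fit on the training split only. The sketch below uses synthetic data from `make_classification` as a stand-in for the bank features. The pipeline layout is an assumption, not the notebook's final model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (roughly 88% / 12%), not the bank dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=42
)
X_demo[:, 0] *= 1000     # put one feature on a much larger scale, like balance
X_demo[::7, 1] = np.nan  # inject missing values, like days_since_contact

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=500, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # imputer and scaler statistics come from the training split only

n_iter = int(pipe.named_steps["clf"].n_iter_[0])
print(f"Solver iterations used: {n_iter} (limit 500)")
```

On standardized inputs the solver typically stops well before the iteration limit, so the warning should not appear.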


In [25]:
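# Option B: Oversampling with SMOTE
# Caveats: select_dtypes(np.number) keeps only int/float columns, so the boolean
# flags and one-hot columns are dropped; NaNs are filled with 0; and resampling
# runs on the full dataset before the train-test split.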
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X.select_dtypes(include=[np.number]).fillna(0), y)
print("\nAfter SMOTE:", y_res.value_counts(normalize=True))


After SMOTE: y
False    0.5
True     0.5
Name: proportion, dtype: float64


In [26]:
# Train-test split SMOTE with stratification
X_train_res, X_test_res, y_train_res, y_test_res = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42, stratify=y_res
)

In [27]:
# Median imputation (a no-op here: NaNs were already filled with 0 before SMOTE)
imputer_res = SimpleImputer(strategy="median")

X_train_res_imputed = X_train_res.copy()
X_test_res_imputed = X_test_res.copy()

X_train_res_imputed["days_since_contact"] = imputer_res.fit_transform(X_train_res[["days_since_contact"]])
X_test_res_imputed["days_since_contact"] = imputer_res.transform(X_test_res[["days_since_contact"]])

In [28]:
# Train classifier
clf_smote = LogisticRegression(random_state=42, max_iter=500)
clf_smote.fit(X_train_res_imputed, y_train_res)

y_pred_smote = clf_smote.predict(X_test_res_imputed)

print("Classification Report:")
print(classification_report(y_test_res, y_pred_smote))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_res, y_pred_smote))
print(f"\nRecall (Sensitivity): {recall_score(y_test_res, y_pred_smote):.4f}")
print(f"Precision: {precision_score(y_test_res, y_pred_smote):.4f}")
print(f"F1-Score: {f1_score(y_test_res, y_pred_smote):.4f}")
print(f"Accuracy: {clf_smote.score(X_test_res_imputed, y_test_res):.4f}")

Classification Report:
              precision    recall  f1-score   support

       False       0.69      0.82      0.75      7985
        True       0.78      0.63      0.70      7984

    accuracy                           0.73     15969
   macro avg       0.73      0.73      0.72     15969
weighted avg       0.73      0.73      0.72     15969


Confusion Matrix:
[[6578 1407]
 [2970 5014]]

Recall (Sensitivity): 0.6280
Precision: 0.7809
F1-Score: 0.6961
Accuracy: 0.7259


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=500).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Chosen Approach & Justification
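Although SMOTE scored higher on accuracy, precision, and F1, the goal is to identify as many willing customers as possible, so recall matters more than precision. On recall, class weights edge out SMOTE by about 5 percentage points (0.6786 vs. 0.6280). Note that the SMOTE metrics were computed on a resampled, roughly balanced test set (7,985 vs. 7,984 rows per class), so its accuracy and precision are not directly comparable with the class-weights results on the original test set.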

In [29]:
# Side-by-side comparison
comparison_results = pd.DataFrame({
    'Class Weights': [
        clf_weighted.score(X_test_imputed, y_test),
        recall_score(y_test, y_pred_weighted),
        precision_score(y_test, y_pred_weighted),
        f1_score(y_test, y_pred_weighted)
    ],
    'SMOTE': [
        clf_smote.score(X_test_res_imputed, y_test_res),
        recall_score(y_test_res, y_pred_smote),
        precision_score(y_test_res, y_pred_smote),
        f1_score(y_test_res, y_pred_smote)
    ]
}, index=['Accuracy', 'Recall', 'Precision', 'F1-Score'])

print("=== SIDE-BY-SIDE COMPARISON ===")
print(comparison_results.round(4))

# Highlight the better approach for each metric
print("\n=== WINNER BY METRIC ===")
for metric in comparison_results.index:
    if comparison_results.loc[metric, 'Class Weights'] > comparison_results.loc[metric, 'SMOTE']:
        winner = 'Class Weights'
        margin = comparison_results.loc[metric, 'Class Weights'] - comparison_results.loc[metric, 'SMOTE']
    else:
        winner = 'SMOTE'
        margin = comparison_results.loc[metric, 'SMOTE'] - comparison_results.loc[metric, 'Class Weights']
    
    print(f"{metric}: {winner} (margin: {margin:.4f})")

=== SIDE-BY-SIDE COMPARISON ===
           Class Weights   SMOTE
Accuracy          0.6843  0.7259
Recall            0.6786  0.6280
Precision         0.2221  0.7809
F1-Score          0.3347  0.6961

=== WINNER BY METRIC ===
Accuracy: SMOTE (margin: 0.0416)
Recall: Class Weights (margin: 0.0506)
Precision: SMOTE (margin: 0.5588)
F1-Score: SMOTE (margin: 0.3615)
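
The two columns are not measured on the same test set: the class-weights model is scored on 9,043 original rows (1,058 positives), the SMOTE model on 15,969 resampled rows that are about half positive. Precision depends on class balance, so a rough way to put the SMOTE figure on the same footing is to keep its true-positive and false-positive rates and recompute precision at the original class counts. This is only a back-of-the-envelope sketch. It assumes the rates measured on the resampled test set would carry over to real customers, which is not guaranteed.

```python
# SMOTE model confusion matrix (resampled test set): rows = actual, columns = predicted
tn, fp = 6578, 1407
fn, tp = 2970, 5014

tpr = tp / (tp + fn)  # recall on the resampled positives
fpr = fp / (fp + tn)  # false-positive rate on the negatives

# Class counts of the original test set, taken from the class-weights report
real_pos, real_neg = 1058, 7985

# Expected counts if the same rates held at the original class balance (assumption)
expected_tp = tpr * real_pos
expected_fp = fpr * real_neg
adjusted_precision = expected_tp / (expected_tp + expected_fp)

print(f"TPR: {tpr:.4f}  FPR: {fpr:.4f}  Adjusted precision: {adjusted_precision:.4f}")
```

Under that assumption the SMOTE model's precision lands near 0.32 rather than 0.78, which is much closer to the class-weights figure of 0.22.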


Sanity Checks (no leakage, applied only to train)

In [30]:
# Ensure no target leakage, and engineered features are only from train set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

assert "duration" not in X_train.columns, "Leakage variable still present!"
print("✅ Sanity checks passed: no leakage columns in train/test.")

✅ Sanity checks passed: no leakage columns in train/test.


A view of scaled data

In [34]:
# Separate age and the other numeric columns
age_scaler = StandardScaler()
robust_scaler = RobustScaler()

# Columns to scale
cols_to_standard = ["age"]
cols_to_robust = ["balance", "campaign", "previous", "campaign_per_previous", "days_since_contact", "balance_x_housing"]

# Copy only these columns
X_train_scaled_subset = X_train[cols_to_standard + cols_to_robust].copy()

# Scale separately, reusing the scalers created above so they can later transform X_test
X_train_scaled_subset[cols_to_standard] = age_scaler.fit_transform(X_train[cols_to_standard])
X_train_scaled_subset[cols_to_robust] = robust_scaler.fit_transform(X_train[cols_to_robust])

# Show preview
print("Scaled X_train header:\n")
print(X_train_scaled_subset.head())
print(X_train_scaled_subset.describe())

Scaled X_train header:

            age   balance  campaign  previous  campaign_per_previous  \
24001 -0.460434  0.302304       0.0       0.0                   0.00   
43409 -1.589641  2.709677       1.0       7.0                  -0.75   
20669  0.292371 -0.152627       1.0       0.0                   1.00   
18810  0.668773 -0.332535       4.5       0.0                   4.50   
23130 -0.272233 -0.143041       4.0       0.0                   4.00   

       days_since_contact  balance_x_housing  
24001                 NaN           0.000000  
43409           -0.051282           0.000000  
20669                 NaN           0.472868  
18810                 NaN           0.000000  
23130                 NaN           0.000000  
                age       balance      campaign      previous  \
count  3.616800e+04  36168.000000  36168.000000  36168.000000   
mean  -1.214099e-16      0.674281      0.381967      0.581730   
std    1.000014e+00      2.262521      1.552080      2.408766   
m