## **Data Preprocessing and Feature Engineering**

This notebook continues from the Exploratory Data Analysis (EDA) phase of the Bank Marketing prediction project. Having gained an understanding of the dataset’s structure, data quality, and key relationships, the focus now shifts to preparing the data for machine learning.

The objectives are to:

- Encode categorical variables into numerical representations

- Scale or normalize numeric features for model compatibility

- Apply transformations to correct feature skewness where necessary

- Split the dataset into training and testing subsets for fair evaluation


### Import Packages 

In [5]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# File path handling
from pathlib import Path

# Machine learning - preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    PowerTransformer, 
    OrdinalEncoder, 
    OneHotEncoder, 
    StandardScaler
)

# Model persistence
import joblib

### Load dataset 

In [7]:
# Move one level up (to the project root), then into data/raw
data_dir = Path().resolve().parent / "data" / "raw"

# Load dataset
df = pd.read_csv(data_dir / "train.csv")

print(f" Dataset loaded successfully from: {data_dir}")
df.head()


 Dataset loaded successfully from: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\data\raw


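|   | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 42 | technician | married | secondary | no | 7 | no | no | cellular | 25 | aug | 117 | 3 | -1 | 0 | unknown | 0 |
| 1 | 1 | 38 | blue-collar | married | secondary | no | 514 | no | no | unknown | 18 | jun | 185 | 1 | -1 | 0 | unknown | 0 |
| 2 | 2 | 36 | blue-collar | married | secondary | no | 602 | yes | no | unknown | 14 | may | 111 | 2 | -1 | 0 | unknown | 0 |
| 3 | 3 | 27 | student | single | secondary | no | 34 | yes | no | unknown | 28 | may | 10 | 2 | -1 | 0 | unknown | 0 |
| 4 | 4 | 26 | technician | married | secondary | no | 889 | yes | no | cellular | 3 | feb | 902 | 1 | -1 | 0 | unknown | 1 |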


### Feature Engineering 

In [9]:
# Drop the 'id' column since it carries no predictive information
df = df.drop(columns=['id'])


#### Handling the `pdays` variable 

`pdays` represents the number of days since the client was last contacted, with -1 indicating no prior contact.
Left as-is, the -1 sentinel would be read by a model as an ordinary numeric value, so we decompose the variable into two separate features:

Transformation Logic:

- `was_contacted_before`: binary flag → 1 if `pdays != -1`, else 0

- `log_pdays`: log-transformed version of `pdays`, computed as log(pdays + 1) where `pdays > 0`

    - Assign 0 to clients who were never contacted (`pdays == -1`)

Finally, we drop the original `pdays` column.

In [11]:
# Special feature treatment: pdays

# Create binary flag
df['was_contacted_before'] = np.where(df['pdays'] != -1, 1, 0)

# Compute log(pdays + 1) where pdays > 0, else 0.
# np.where evaluates both branches, so the log(0) warning for pdays == -1 is silenced.
with np.errstate(divide='ignore'):
    df['log_pdays'] = np.where(
        df['pdays'] > 0,
        np.log(df['pdays'] + 1),
        0
    )

# Drop original column
df.drop(columns='pdays', inplace=True)

# Quick check
df[['was_contacted_before', 'log_pdays']].head()


|   | was_contacted_before | log_pdays |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |
| 3 | 0 | 0.0 |
| 4 | 0 | 0.0 |
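The first five rows all belong to clients who were never contacted, so the quick check shows only zeros. As a minimal sketch on a hypothetical toy frame (the `toy` values are made up for illustration and are not taken from the dataset), the same logic yields nonzero values for previously contacted clients:

```python
import numpy as np
import pandas as pd

# Hypothetical pdays values; -1 means the client was never contacted
toy = pd.DataFrame({"pdays": [-1, 1, 9, 99]})

toy["was_contacted_before"] = np.where(toy["pdays"] != -1, 1, 0)

# np.where evaluates both branches, so log(0) for pdays == -1 is silenced
with np.errstate(divide="ignore"):
    toy["log_pdays"] = np.where(toy["pdays"] > 0, np.log(toy["pdays"] + 1), 0)

print(toy)
```

Only the never-contacted row keeps `log_pdays` at 0; the other rows receive log(pdays + 1).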


#### Train-Test Split

To ensure unbiased model evaluation and prevent data leakage, the dataset is divided into training and testing subsets using stratified sampling based on the target variable (y).
This preserves the proportion of positive and negative classes in both sets.
The training set will be used to fit all preprocessing transformers (e.g., scaling, encoding, and Yeo–Johnson), which will then be applied consistently to the test set.

In [14]:
# Separate features and target
X = df.drop(columns=['y'])
y = df['y']

# Stratified split to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,             # 80/20 split
    stratify=y,                # preserve target distribution
    random_state=42            # ensure reproducibility
)

# Check resulting shapes
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
print("Target distribution in training set:")
print(y_train.value_counts(normalize=True))
print("\nTarget distribution in test set:")
print(y_test.value_counts(normalize=True))

Training set shape: (600000, 17)
Test set shape: (150000, 17)
Target distribution in training set:
y
0    0.87935
1    0.12065
Name: proportion, dtype: float64

Target distribution in test set:
y
0    0.879347
1    0.120653
Name: proportion, dtype: float64
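As a small illustration of what `stratify` does, the sketch below splits a hypothetical toy target with 12% positives (synthetic data, not the project dataset) and compares the positive share in each subset. The names `X_toy` and `y_toy` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target: 120 positives out of 1000 rows (12%)
y_toy = np.array([1] * 120 + [0] * 880)
X_toy = np.arange(1000).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(f"train positive share: {y_tr.mean():.3f}")
print(f"test positive share:  {y_te.mean():.3f}")
```

Both subsets should show a positive share close to the original 12%, mirroring the near-identical proportions printed for the real split.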


### Dual Preprocessing Pipelines

To compare two distinct model families, a tree-based model (XGBoost) and a gradient-trained model (a neural network), two separate preprocessing workflows were implemented, each tailored to the input requirements of its algorithm. Both were fitted only on the training data, so that no information from the test set leaks into the learned transformations.

#### The XGBoost Pipeline (Optimized for Trees)

Tree-based models split on the ordering of feature values rather than on their magnitude, so they do not need scaled inputs. This pipeline therefore reduces skew and encodes categories compactly, without scaling:

- Numeric Features: The Yeo–Johnson transformation was applied to the skewed features (`balance` and `duration`) to mitigate skewness; standard scaling was skipped because it provides no performance benefit for tree ensembles.

- Categorical Features: Ordinal Encoding was used, converting each category into an integer code. This keeps the feature count unchanged, and trees can split on such codes directly.
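The point that trees do not benefit from scaling can be checked on toy data. The sketch below is an illustration under stated assumptions: it uses a single scikit-learn `DecisionTreeClassifier` as a stand-in for XGBoost, and the feature and labels are synthetic. It fits the same tree on a raw and a standard-scaled copy of one feature and compares the training predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, right-skewed feature and an alternating block label
base = np.arange(1, 201)
X_raw = (base ** 2).reshape(-1, 1).astype(float)
y_toy = (base // 25) % 2

# Same feature after standard scaling (the order of values is unchanged)
X_scaled = StandardScaler().fit_transform(X_raw)

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_toy)

# Compare predictions on the training rows of each version
same = np.array_equal(tree_raw.predict(X_raw), tree_scaled.predict(X_scaled))
print("Identical training predictions:", same)
```

Because scaling preserves the order of the values, the tree finds the same partitions of the training rows in both cases; only the numeric thresholds differ.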

In [18]:
# Create copies for XGBoost
X_train_xgb = X_train.copy()
X_test_xgb = X_test.copy()

In [19]:
numeric_cols_xgb = [
    'balance', 'duration', 'age', 'campaign', 
    'previous', 'log_pdays', 'day'
]
binary_cols_xgb = ['was_contacted_before']

categorical_cols_xgb = [
    col for col in X_train_xgb.columns 
    if col not in numeric_cols_xgb + binary_cols_xgb
]

print(f"\n Feature Types Identified:")
print(f"   Numeric: {len(numeric_cols_xgb)} features")
print(f"   Categorical: {len(categorical_cols_xgb)} features")
print(f"   Binary: {len(binary_cols_xgb)} features")


 Feature Types Identified:
   Numeric: 7 features
   Categorical: 9 features
   Binary: 1 features


In [20]:
print(f"\n Applying Power Transform to skewed features...")

pt_xgb = PowerTransformer(method='yeo-johnson', standardize=False)
num_skewed_xgb = ['balance', 'duration']

X_train_xgb[num_skewed_xgb] = pt_xgb.fit_transform(X_train_xgb[num_skewed_xgb])
X_test_xgb[num_skewed_xgb] = pt_xgb.transform(X_test_xgb[num_skewed_xgb])




 Applying Power Transform to skewed features...


In [21]:
# Save transformer

# Create path to project root
project_root = Path().resolve().parent

# Create /transformers directory inside project root 
transformers_dir = project_root / "transformers"
transformers_dir.mkdir(parents=True, exist_ok=True)

xgb_transformer_path = transformers_dir / "xgb_yeo_johnson_transformer.pkl"
joblib.dump(pt_xgb, xgb_transformer_path)
print(f" XGBoost transformer saved successfully at: {xgb_transformer_path}")


 XGBoost transformer saved successfully at: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\transformers\xgb_yeo_johnson_transformer.pkl
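A persisted transformer can later be reloaded with `joblib.load` and applied to new data. The sketch below is a round trip on synthetic data in a temporary directory; the file name `demo_transformer.pkl` and the array `X_demo` are hypothetical and stand in for the project's own artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_demo = rng.exponential(scale=100.0, size=(500, 2))  # synthetic skewed data

pt = PowerTransformer(method="yeo-johnson", standardize=False).fit(X_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_transformer.pkl"
    joblib.dump(pt, path)
    pt_loaded = joblib.load(path)

# The reloaded transformer should reproduce the original output
round_trip_ok = np.allclose(pt.transform(X_demo), pt_loaded.transform(X_demo))
print("Round trip matches:", round_trip_ok)
```

At inference time the reloaded object should only be used with `transform`, never refitted, so that new data is mapped with the parameters learned on the training set.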


In [22]:
print(f"\n Applying Ordinal Encoding...")

ordinal_encoder_xgb = OrdinalEncoder(
    handle_unknown='use_encoded_value',
    unknown_value=-1
)

X_train_xgb[categorical_cols_xgb] = ordinal_encoder_xgb.fit_transform(
    X_train_xgb[categorical_cols_xgb]
)
X_test_xgb[categorical_cols_xgb] = ordinal_encoder_xgb.transform(
    X_test_xgb[categorical_cols_xgb]
)

# Create /encoders directory at project root 
encoders_dir = project_root / "encoders"
encoders_dir.mkdir(parents=True, exist_ok=True)

# --- Save encoder ---
xgb_encoder_path = encoders_dir / "xgb_ordinal_encoder.pkl"
joblib.dump(ordinal_encoder_xgb, xgb_encoder_path)

print(f" XGBoost Ordinal Encoder saved successfully at: {xgb_encoder_path}")


 Applying Ordinal Encoding...
 XGBoost Ordinal Encoder saved successfully at: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\encoders\xgb_ordinal_encoder.pkl
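To make the `handle_unknown` setting concrete, here is a minimal sketch on hypothetical toy frames (the category values, including the unseen `"astronaut"`, are invented for illustration). A category that was not present when the encoder was fitted is mapped to `unknown_value`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_toy = pd.DataFrame({"job": ["admin", "student", "technician"]})
test_toy = pd.DataFrame({"job": ["student", "astronaut"]})  # "astronaut" is unseen

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_toy)

# Seen categories get their learned code; unseen ones get -1
codes = enc.transform(test_toy)
print(codes.ravel())
```

Since unseen categories become -1 rather than NaN, a NaN-based sanity check will not flag them; counting -1 values in the encoded test columns is one way to spot them.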


In [23]:
print(f"\n XGBoost Preprocessing Complete!")
print(f"   X_train_xgb: {X_train_xgb.shape}")
print(f"   X_test_xgb:  {X_test_xgb.shape}")
print(f"   Total features: {X_train_xgb.shape[1]}")

# Sanity check
assert X_train_xgb.shape[0] == y_train.shape[0], " Train shape mismatch!"
assert X_test_xgb.shape[0] == y_test.shape[0], " Test shape mismatch!"
assert X_train_xgb.isnull().sum().sum() == 0, " NaN values found in train!"
assert X_test_xgb.isnull().sum().sum() == 0, " NaN values found in test!"


 XGBoost Preprocessing Complete!
   X_train_xgb: (600000, 17)
   X_test_xgb:  (150000, 17)
   Total features: 17


In [24]:
# Check class distribution
class_dist = y_train.value_counts(normalize=True)
print(f"\n Class Distribution (Training):")
print(f"   Class 0 (No):  {class_dist[0]:.2%}")
print(f"   Class 1 (Yes): {class_dist[1]:.2%}")

# Calculate imbalance ratio for scale_pos_weight
imbalance_ratio = (y_train == 0).sum() / (y_train == 1).sum()
print(f"   Imbalance ratio: {imbalance_ratio:.2f}")


 Class Distribution (Training):
   Class 0 (No):  87.94%
   Class 1 (Yes): 12.06%
   Imbalance ratio: 7.29
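The printed ratio can be reproduced from the class counts implied by the split (600,000 training rows with a 0.12065 positive share). In the sketch below the counts are derived from that printed output rather than read from the data, and the `params` dictionary is a hypothetical illustration of where the value would typically be passed; `scale_pos_weight` is XGBoost's parameter for weighting the positive class.

```python
n_train = 600_000
n_pos = round(n_train * 0.12065)   # positives implied by the printed share
n_neg = n_train - n_pos            # remaining rows are negatives

imbalance_ratio = n_neg / n_pos
print(f"Imbalance ratio: {imbalance_ratio:.2f}")

# Hypothetical parameter dictionary for a later XGBoost model
params = {"scale_pos_weight": imbalance_ratio}
```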


In [25]:
# Create a 'data/processed' folder at the project root
processed_dir = project_root / "data" / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)

# Save datasets to the project root's processed folder
joblib.dump(X_train_xgb, processed_dir / "X_train_xgb.pkl")
joblib.dump(X_test_xgb, processed_dir / "X_test_xgb.pkl")
joblib.dump(y_train, processed_dir / "y_train.pkl")
joblib.dump(y_test, processed_dir / "y_test.pkl")

print(f"Processed XGBoost datasets saved in: {processed_dir}")


Processed XGBoost datasets saved in: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\data\processed


#### The Neural Network Pipeline (Optimized for Stability)


Neural networks train faster and more stably via backpropagation when their inputs are standardized and bounded:

- Numeric Features: The skewed features (`balance` and `duration`) first received the Yeo–Johnson transformation to make their distributions more symmetric, and all numeric features were then standardized (mean 0, variance 1). Finally, values were clipped to $\pm 5$ standard deviations so that extreme outliers cannot produce very large activations and gradients.

- Categorical Features: One-Hot Encoding (OHE) was used to avoid introducing a false ordering between categories, which the network would otherwise treat as numeric magnitude.

In [27]:
# Create copies for the neural network
X_train_nn = X_train.copy()
X_test_nn = X_test.copy()


In [28]:
# Identify Feature Types
numeric_cols_nn = [
    'balance', 'duration', 'age', 'campaign', 
    'previous', 'log_pdays', 'day'
]
binary_cols_nn = ['was_contacted_before']

categorical_cols_nn = [
    col for col in X_train_nn.columns 
    if col not in numeric_cols_nn + binary_cols_nn
]

print(f"\nFeature Types Identified:")
print(f"   Numeric: {len(numeric_cols_nn)} features")
print(f"   Categorical: {len(categorical_cols_nn)} features")
print(f"   Binary: {len(binary_cols_nn)} features")


Feature Types Identified:
   Numeric: 7 features
   Categorical: 9 features
   Binary: 1 features


In [29]:
# Power Transform Skewed Numeric Features

print(f"\n Applying Power Transform with standardization...")

pt_nn = PowerTransformer(method='yeo-johnson', standardize=True)
num_skewed_nn = ['balance', 'duration']

X_train_nn[num_skewed_nn] = pt_nn.fit_transform(X_train_nn[num_skewed_nn])
X_test_nn[num_skewed_nn] = pt_nn.transform(X_test_nn[num_skewed_nn])

# Save transformer
nn_transformer_path = transformers_dir / "nn_yeo_johnson_transformer.pkl"
joblib.dump(pt_nn, nn_transformer_path)
print(f" Neural Network transformer saved successfully at: {nn_transformer_path}")



 Applying Power Transform with standardization...
 Neural Network transformer saved successfully at: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\transformers\nn_yeo_johnson_transformer.pkl


In [30]:
print(f"\n Applying One-Hot Encoding...")

# Initialize OneHotEncoder
onehot_encoder_nn = OneHotEncoder(
    handle_unknown='ignore',
    sparse_output=False
)

# Fit and transform
X_train_cat_ohe = onehot_encoder_nn.fit_transform(X_train_nn[categorical_cols_nn])
X_test_cat_ohe = onehot_encoder_nn.transform(X_test_nn[categorical_cols_nn])

# Convert to DataFrames
ohe_feature_names = onehot_encoder_nn.get_feature_names_out(categorical_cols_nn)
X_train_cat_ohe = pd.DataFrame(
    X_train_cat_ohe, 
    columns=ohe_feature_names, 
    index=X_train_nn.index
)
X_test_cat_ohe = pd.DataFrame(
    X_test_cat_ohe, 
    columns=ohe_feature_names, 
    index=X_test_nn.index
)

# Drop original categorical columns and concatenate one-hot encoded
X_train_nn = pd.concat(
    [X_train_nn.drop(columns=categorical_cols_nn), X_train_cat_ohe], 
    axis=1
)
X_test_nn = pd.concat(
    [X_test_nn.drop(columns=categorical_cols_nn), X_test_cat_ohe], 
    axis=1
)

# Align columns (ensure test has same columns as train)
X_test_nn = X_test_nn.reindex(columns=X_train_nn.columns, fill_value=0)

# --- Create /encoders directory at project root (reuse if already exists) ---
encoders_dir = project_root / "encoders"
encoders_dir.mkdir(parents=True, exist_ok=True)

# --- Save One-Hot Encoder ---
nn_encoder_path = encoders_dir / "nn_onehot_encoder.pkl"
joblib.dump(onehot_encoder_nn, nn_encoder_path)

print(f" Neural Network One-Hot Encoder saved successfully at: {nn_encoder_path}")
print(f"   Created {len(ohe_feature_names)} one-hot encoded features")



 Applying One-Hot Encoding...
 Neural Network One-Hot Encoder saved successfully at: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\encoders\nn_onehot_encoder.pkl
   Created 44 one-hot encoded features


In [31]:
print(f"\n Applying Standard Scaling to numeric features...")

scaler_nn = StandardScaler()

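# Scale all numeric columns (not binary).
# balance and duration were already standardized by the power transformer,
# so rescaling them here has practically no effect.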
X_train_nn[numeric_cols_nn] = scaler_nn.fit_transform(X_train_nn[numeric_cols_nn])
X_test_nn[numeric_cols_nn] = scaler_nn.transform(X_test_nn[numeric_cols_nn])

# Optional: Clip extreme values to prevent gradient explosion
CLIP_VALUE = 5
X_train_nn[numeric_cols_nn] = X_train_nn[numeric_cols_nn].clip(-CLIP_VALUE, CLIP_VALUE)
X_test_nn[numeric_cols_nn] = X_test_nn[numeric_cols_nn].clip(-CLIP_VALUE, CLIP_VALUE)

# --- Create /scalers directory at project root ---
scalers_dir = project_root / "scalers"
scalers_dir.mkdir(parents=True, exist_ok=True)

# --- Save Scaler ---
scaler_path = scalers_dir / "nn_standard_scaler.pkl"
joblib.dump(scaler_nn, scaler_path)

print(f" Neural Network Standard Scaler saved successfully at: {scaler_path}")
print(f"   Clipped values to range: [{-CLIP_VALUE}, {CLIP_VALUE}]")



 Applying Standard Scaling to numeric features...
 Neural Network Standard Scaler saved successfully at: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\scalers\nn_standard_scaler.pkl
   Clipped values to range: [-5, 5]
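The effect of clipping after standardization can be seen on a synthetic column with one extreme outlier. This is only a sketch: the array `x_toy` is made up, and the bound of 5 mirrors the `CLIP_VALUE` used in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic column: 999 zeros and one extreme outlier
x_toy = np.zeros((1000, 1))
x_toy[0, 0] = 1000.0

z = StandardScaler().fit_transform(x_toy)
z_clipped = np.clip(z, -5, 5)

print(f"max z-score before clipping: {z.max():.1f}")
print(f"max z-score after clipping:  {z_clipped.max():.1f}")
```

Standardization alone leaves the outlier far from the bulk of the data; clipping caps it at the chosen bound while leaving all other values untouched.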


In [32]:
# Final neural network dataset check

print(f"\n Neural Network Preprocessing Complete!")
print(f"   X_train_nn: {X_train_nn.shape}")
print(f"   X_test_nn:  {X_test_nn.shape}")
print(f"   Total features: {X_train_nn.shape[1]}")

# Sanity checks
assert X_train_nn.shape[0] == y_train.shape[0], "Train shape mismatch!"
assert X_test_nn.shape[0] == y_test.shape[0], " Test shape mismatch!"
assert X_train_nn.isnull().sum().sum() == 0, " NaN values found in train!"
assert X_test_nn.isnull().sum().sum() == 0, " NaN values found in test!"


 Neural Network Preprocessing Complete!
   X_train_nn: (600000, 52)
   X_test_nn:  (150000, 52)
   Total features: 52


In [33]:
# Save Neural Network datasets to project root's processed folder 
joblib.dump(X_train_nn, processed_dir / "X_train_nn.pkl")
joblib.dump(X_test_nn, processed_dir / "X_test_nn.pkl")
joblib.dump(y_train, processed_dir / "y_train.pkl")  # already saved for XGBoost, can overwrite safely
joblib.dump(y_test, processed_dir / "y_test.pkl")    # already saved for XGBoost, can overwrite safely

print(f" Processed Neural Network datasets saved successfully in: {processed_dir}")


 Processed Neural Network datasets saved successfully in: C:\Users\hp\Documents\DA projects\Bank Marketing Prediction\data\processed


With preprocessing complete, two datasets are ready, each tailored to a different modeling approach. The XGBoost dataset uses ordinal encoding and a power transformation of the skewed features, with no scaling, while the Neural Network dataset uses one-hot encoding, standardization, and clipping to support stable gradient-based training.