## Feature Engineering Implementation

This section derives ratio, behavioral, stability, and binned features from the raw banking columns so they can serve as model inputs for credit risk assessment.

### Feature Categories Created:

**Financial Health Indicators:**
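- `Debt_to_Income_Ratio`: Monthly credit card spending (`CCAvg`) relative to monthly income (`Income / 12`)
- `Savings_Rate`: Fraction of annual income remaining after annualized credit card spending (`CCAvg * 12`)
- `Credit_Usage_Intensity`: Credit card spending relative to monthly income; as implemented, it uses the same formula as `Debt_to_Income_Ratio`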

**Behavioral Features:**
- `Digital_Engagement`: Combined online banking and credit card usage
- `Investment_Profile`: Combined securities and CD account ownership

**Stability Indicators:**
- `Career_Stage`: Professional experience relative to age
- `Family_Financial_Stress`: Family size relative to income level

**Categorical Groupings:**
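- `Income_Bin`: Income segmentation into Low, Medium, High, Very High (bin edges 0, 50, 100, 200, 500)
- `Age_Group`: Life stage segmentation into Young, Adult, Middle, Senior (bin edges 0, 30, 45, 60, 100)
- `CCAvg_Level`: Credit card spending level, Low, Medium, High, Very High (bin edges 0, 1, 3, 6, 10)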
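
The binned features rely on `pd.cut`, whose intervals are right-closed by default, so a value exactly equal to the lowest edge falls outside every bin. The sketch below uses made-up spending values (not taken from the bank dataset) with the `CCAvg_Level` edges to show that behavior and how `include_lowest=True` changes it.

```python
import pandas as pd

# Toy spending values (hypothetical, chosen to hit the bin edges).
ccavg = pd.Series([0.0, 0.5, 1.0, 2.5, 10.0, 12.0])
edges = [0, 1, 3, 6, 10]
labels = ['Low', 'Medium', 'High', 'Very High']

# Default: intervals are (0, 1], (1, 3], (3, 6], (6, 10], so 0.0 matches no bin.
default_bins = pd.cut(ccavg, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 1].
inclusive_bins = pd.cut(ccavg, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({
    'CCAvg': ccavg,
    'default': default_bins,
    'include_lowest': inclusive_bins,
}))
```

In both variants the value 12.0 stays unbinned (NaN) because it lies above the top edge, so out-of-range inputs still need separate handling.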

In [0]:
import pandas as pd
import numpy as np

def create_real_time_features(df):
    """Create features suitable for real-time scoring"""
    df_enhanced = df.copy()
    
    # Financial ratios (real-time calculable); 1e-6 guards against division by zero
    df_enhanced['Debt_to_Income_Ratio'] = df_enhanced['CCAvg'] / (df_enhanced['Income']/12 + 1e-6)
    df_enhanced['Savings_Rate'] = (df_enhanced['Income'] - df_enhanced['CCAvg'] * 12) / (df_enhanced['Income'] + 1e-6)
    # NOTE: same formula as Debt_to_Income_Ratio, so the two columns are identical
    df_enhanced['Credit_Usage_Intensity'] = df_enhanced['CCAvg'] / (df_enhanced['Income']/12 + 1e-6)
    
    # Behavioral features
    df_enhanced['Digital_Engagement'] = df_enhanced['Online'] + df_enhanced['CreditCard']
    df_enhanced['Investment_Profile'] = df_enhanced['Securities Account'] + df_enhanced['CD Account']
    
    # Stability indicators
    df_enhanced['Career_Stage'] = df_enhanced['Experience'] / (df_enhanced['Age'] + 1e-6)
    df_enhanced['Family_Financial_Stress'] = df_enhanced['Family'] / (df_enhanced['Income']/1000 + 1e-6)
    
    # Bin continuous variables for categorical encoding.
    # include_lowest=True keeps values equal to the lowest edge (e.g. CCAvg == 0)
    # in the first bin; values above the top edge still become NaN.
    df_enhanced['Income_Bin'] = pd.cut(df_enhanced['Income'],
                                      bins=[0, 50, 100, 200, 500],
                                      labels=['Low', 'Medium', 'High', 'Very High'],
                                      include_lowest=True)
    
    df_enhanced['Age_Group'] = pd.cut(df_enhanced['Age'],
                                     bins=[0, 30, 45, 60, 100],
                                     labels=['Young', 'Adult', 'Middle', 'Senior'],
                                     include_lowest=True)
    
    df_enhanced['CCAvg_Level'] = pd.cut(df_enhanced['CCAvg'],
                                       bins=[0, 1, 3, 6, 10],
                                       labels=['Low', 'Medium', 'High', 'Very High'],
                                       include_lowest=True)
    
    return df_enhanced

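# Load the source table (`spark` is the session provided by the notebook runtime)
df = spark.table("personal_catalog.default.bank_loan_modelling")

# Convert to pandas (collects the full table to the driver), then apply feature engineering
pandas_df = df.toPandas()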
enhanced_pandas_df = create_real_time_features(pandas_df)
print("Enhanced features created successfully!")
print("Available features:", enhanced_pandas_df.columns.tolist())
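
Before passing ratio features to a model, it can help to confirm that none of them contain NaN or infinite values. The following is a minimal sketch on a toy frame: `report_non_finite` is a hypothetical helper, the numbers are invented, and the division is deliberately left unguarded to show what the check catches.

```python
import numpy as np
import pandas as pd

def report_non_finite(frame, columns):
    """Count NaN and +/-inf values per column (hypothetical helper)."""
    values = frame[columns].astype(float)
    return pd.DataFrame({
        'nan': values.isna().sum(),
        'inf': np.isinf(values).sum(),
    })

# Toy rows (hypothetical); the first row has zero income on purpose.
toy = pd.DataFrame({'Income': [0.0, 60.0, 120.0], 'CCAvg': [0.5, 1.0, 2.0]})

# Unguarded denominator: zero income yields -inf instead of raising an error.
toy['Savings_Rate'] = (toy['Income'] - toy['CCAvg'] * 12) / toy['Income']

# Guarded denominator: the small epsilon keeps the result finite.
toy['Debt_to_Income_Ratio'] = toy['CCAvg'] / (toy['Income'] / 12 + 1e-6)

report = report_non_finite(toy, ['Savings_Rate', 'Debt_to_Income_Ratio'])
print(report)
```

The same helper could be pointed at `enhanced_pandas_df` and the engineered column names, since `StandardScaler` rejects infinite inputs at fit time.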

## Model Preparation with scikit-learn Preprocessing

This section prepares the engineered features for machine learning by defining the preprocessing pipeline and splitting the data for model training.

### Feature Categorization:

**Categorical Features** (for One-Hot Encoding):
- `Income_Bin`, `Age_Group`, `CCAvg_Level`, `Education`

**Numerical Features** (for Standard Scaling):
- Original columns: `Age`, `Experience`, `Income`, `Family`, `CCAvg`, `Mortgage`
- Engineered features: `Debt_to_Income_Ratio`, `Savings_Rate`, `Credit_Usage_Intensity`, `Digital_Engagement`, `Investment_Profile`, `Career_Stage`, `Family_Financial_Stress`

**Target Variable:**
- `Personal Loan` (binary classification)

### Preprocessing Strategy:

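The ColumnTransformer applies:
- **StandardScaler** to numerical features (zero mean, unit variance, with statistics learned when the pipeline is fitted)
- **OneHotEncoder** to categorical features (`drop='first'` removes one level per feature to avoid multicollinearity, which matters mainly for the logistic regression option)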

In [0]:
# Import required libraries for model training
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline as SkPipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Define features for the model
categorical_features = ['Income_Bin', 'Age_Group', 'CCAvg_Level', 'Education']
numerical_features = [
    'Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Mortgage',
    'Debt_to_Income_Ratio', 'Savings_Rate', 'Credit_Usage_Intensity',
    'Digital_Engagement', 'Investment_Profile', 'Career_Stage', 
    'Family_Financial_Stress'
]

# Target variable
target = 'Personal Loan'

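# Create preprocessing pipeline with One-Hot Encoding.
# With drop='first' and handle_unknown='ignore', a category unseen during fit is
# encoded as all zeros, the same encoding as the dropped first category.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_features)
    ])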

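# Create the full pipeline with multiple model options
from sklearn.base import clone

def create_sklearn_pipeline(model_type='random_forest'):
    """Return a preprocessing + classifier pipeline for the given model type."""
    if model_type == 'random_forest':
        model = RandomForestClassifier(
            n_estimators=100,
            max_depth=10,
            random_state=42,
            class_weight='balanced'
        )
    elif model_type == 'gradient_boosting':
        model = GradientBoostingClassifier(
            n_estimators=100,
            max_depth=6,
            random_state=42
        )
    elif model_type == 'logistic_regression':
        model = LogisticRegression(
            random_state=42,
            class_weight='balanced',
            max_iter=1000
        )
    else:
        # Fail loudly on typos instead of silently falling back to a default model
        raise ValueError(f"Unknown model_type: {model_type!r}")
    
    # Clone so each pipeline fits its own copy of the preprocessor
    return SkPipeline(steps=[
        ('preprocessor', clone(preprocessor)),
        ('classifier', model)
    ])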

# Prepare data for training
X = enhanced_pandas_df[numerical_features + categorical_features]
y = enhanced_pandas_df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
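
Because the split passes `stratify=y`, the positive-class rate should be nearly identical in the training and test sets. The sketch below checks this on synthetic labels; the 10% positive rate and the `*_demo` names are assumptions for illustration, not properties of the bank dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical): 100 positives out of 1,000 rows.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same split settings as the cell above, applied to the synthetic data.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate, train: {y_tr.mean():.3f}")
print(f"Positive rate, test:  {y_te.mean():.3f}")
```

The same comparison can be run on `y_train` and `y_test` to confirm that the class balance of `Personal Loan` carried over to both partitions.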