# Tree-Based Classification
Non-Linear Credit Risk Modeling in Finance & Banking

## Objective

This notebook introduces tree-based classification models using a synthetic credit default dataset, focusing on:

- Decision Trees as rule-based risk engines

- Random Forests for variance reduction

- Gradient Boosting for complex risk patterns

- Feature importance and model governance

- Comparison with logistic regression

It answers:

> When linear risk models fail, how do banks safely use trees?

## Business Context – Why Trees in Credit Risk?

Linear models such as logistic regression assume that each feature has an additive, monotonic effect on the log-odds of default.

However, real credit risk exhibits:

- Threshold effects (e.g., utilization > 80%)

- Feature interactions (income × tenure)

- Non-linear penalties (defaults spike after tipping points)

Tree models naturally capture these patterns.
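As a minimal sketch of the threshold effect listed above, the snippet below simulates a hypothetical utilization feature whose default rate jumps once utilization passes 80%, then checks where a depth-1 tree places its single split. The data-generating rule, the default rates, and all variable names are assumptions made for illustration; none of them come from the notebook's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: default rate jumps from 5% to 60% above 80% utilization
utilization = rng.uniform(0, 1, size=5000)
p_default = np.where(utilization > 0.8, 0.60, 0.05)
default_flag = rng.binomial(1, p_default)

# A depth-1 tree (decision stump) can place exactly one threshold
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(utilization.reshape(-1, 1), default_flag)

# tree_.threshold[0] is the split value chosen at the root node
learned_threshold = stump.tree_.threshold[0]
print(f"Learned rule: utilization <= {learned_threshold:.3f}")
```

A linear model would have to approximate this jump with a smooth slope, whereas the stump only needs one cut point.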

## Imports and Dataset

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv("D:/GitHub/Data-Science-Techniques/datasets/Supervised-classification/synthetic_credit_default_classification.csv")

df.head()


| | customer_id | age | annual_income | credit_utilization | debt_to_income | loan_amount | loan_term_months | num_past_defaults | employment_years | credit_score | default |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 59 | 23283.682822 | 0.187813 | 0.245248 | 20232.165654 | 24 | 0 | 4.575844 | 689.627408 | 1 |
| 1 | 2 | 49 | 61262.608063 | 0.291774 | 0.396763 | 26484.067591 | 36 | 0 | 3.317515 | 697.770541 | 1 |
| 2 | 3 | 35 | 60221.74316 | 0.230557 | 0.122859 | 27142.522594 | 24 | 1 | 11.871955 | 713.721429 | 0 |
| 3 | 4 | 63 | 93603.112731 | 0.157906 | 0.635484 | 1000.0 | 12 | 0 | 2.256651 | 655.306417 | 1 |
| 4 | 5 | 28 | 71674.557271 | 0.167549 | 0.422446 | 15254.246561 | 48 | 0 | 6.97127 | 644.247643 | 0 |


## Dataset Description (Synthetic)

| Feature | Description |
|------|------------|
| customer_id | Unique identifier |
| age | Applicant age |
| annual_income | Declared income |
| credit_utilization | Credit used / credit limit |
| debt_to_income | Debt burden ratio |
| loan_amount | Requested loan value |
| loan_term_months | Loan duration |
| num_past_defaults | Historical defaults |
| employment_years | Job stability |
| credit_score | Synthetic credit score |
| default | Target (1 = default) |


## Target and Feature Definition


In [2]:
target = "default"

# Drop the target and the identifier column (an ID carries no risk signal)
X = df.drop(columns=[target, "customer_id"])
y = df[target]


## Train/Test Split (Stratified)

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
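To illustrate what `stratify=y` does, here is a small sketch on a hypothetical target with a 35% default rate. It compares the default rate in the train and test parts with and without stratification. The arrays `X_demo` and `y_demo` are placeholders invented for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 350 defaults, 650 non-defaults
y_demo = np.array([1] * 350 + [0] * 650)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Plain random split: proportions can drift between train and test
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("stratified   train/test default rate:", y_tr_strat.mean(), y_te_strat.mean())
print("unstratified train/test default rate:", y_tr_plain.mean(), y_te_plain.mean())
```

With an imbalanced target, keeping the default rate equal across splits makes test metrics such as recall on defaults easier to compare with training behaviour.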

## Why Trees Need Less Preprocessing

Tree models:

- `[pro]` Do not require feature scaling

- `[pro]` Handle skewed distributions

- `[pro]` Are unaffected by monotonic feature transformations (e.g., a log transform)

- `[con]` Remain sensitive to data leakage

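The only preprocessing step here is median imputation of missing values, fitted on the training set alone so that no test-set statistics leak into training.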

In [4]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
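The claim that trees need no scaling can be checked directly. The sketch below fits the same small tree on raw, standardized, and log-transformed versions of a hypothetical integer-valued dataset and compares the predictions on the training rows. Because a tree only uses the ordering of each feature, the three fits should partition the training data identically. The data and the target rule are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical integer-valued features (e.g. amounts in whole currency units)
X_raw = rng.integers(1, 1000, size=(600, 3)).astype(float)
# Hypothetical target built from two thresholds combined with XOR
y_toy = (X_raw[:, 0] > 700).astype(int) ^ (X_raw[:, 1] < 200).astype(int)


def fit_predict(X):
    """Fit a small tree and return its predictions on the training rows."""
    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)
    return tree.fit(X, y_toy).predict(X)


pred_raw = fit_predict(X_raw)
pred_scaled = fit_predict(StandardScaler().fit_transform(X_raw))  # linear rescaling
pred_log = fit_predict(np.log(X_raw))  # monotonic, non-linear transform

print("same as raw after scaling:", np.array_equal(pred_raw, pred_scaled))
print("same as raw after log:    ", np.array_equal(pred_raw, pred_log))
```

The comparison is made on training rows only: thresholds are placed at midpoints between observed values, so unseen points that fall between two training values may land on different sides after a non-linear transform.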


## Decision Tree Classifier
Why Start with a Single Tree?

- Maximum interpretability

- Rule-based reasoning

- Baseline non-linearity

In [5]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=50,
    class_weight="balanced",
    random_state=42
)

dt.fit(X_train_imp, y_train)
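The "rule-based reasoning" of a single tree can be made explicit with scikit-learn's `export_text`, which prints a fitted tree as nested if/else rules. The sketch below uses a hypothetical two-feature dataset; the feature names mirror the notebook's columns, but the data and the default rule are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 2000

# Hypothetical features and a hypothetical rule-based default flag
credit_utilization = rng.uniform(0, 1, n)
debt_to_income = rng.uniform(0, 1, n)
X_rules = np.column_stack([credit_utilization, debt_to_income])
y_rules = ((credit_utilization > 0.8) | (debt_to_income > 0.6)).astype(int)

rule_tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=50, random_state=42)
rule_tree.fit(X_rules, y_rules)

# Render the fitted tree as human-readable decision rules
rules = export_text(
    rule_tree, feature_names=["credit_utilization", "debt_to_income"]
)
print(rules)
```

The same call should work on the notebook's `dt` object, passing `list(X.columns)` as `feature_names`, since the imputer returns a plain array without column names.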


### Decision Tree Evaluation

In [6]:
from sklearn.metrics import roc_auc_score, classification_report

dt_prob = dt.predict_proba(X_test_imp)[:, 1]
dt_pred = dt.predict(X_test_imp)

print(classification_report(y_test, dt_pred))
roc_auc_score(y_test, dt_prob)


              precision    recall  f1-score   support

           0       0.83      0.72      0.77       653
           1       0.58      0.73      0.65       347

    accuracy                           0.72      1000
   macro avg       0.71      0.72      0.71      1000
weighted avg       0.74      0.72      0.73      1000



np.float64(0.8054004792776412)

### Interpretation

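- `[pro]` Captures thresholds
- `[pro]` Easy to explain
- `[con]` High variance: small changes in the training data can change the tree structure
- `[con]` Limited performance (test ROC-AUC of about 0.81, with precision of only 0.58 on defaults)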

Single trees are rarely deployed alone.
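One way to see the variance problem is to refit a model on bootstrap resamples of the training data and measure how much its predicted probabilities move for the same evaluation rows. The sketch below does this for a single tree and for a small forest on synthetic data from `make_classification`. The dataset, the helper `bootstrap_spread`, and the hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the notebook's credit dataset)
X_var, y_var = make_classification(
    n_samples=1500, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_fit, y_fit = X_var[:1000], y_var[:1000]
X_eval = X_var[1000:]

rng = np.random.default_rng(0)


def bootstrap_spread(make_model, n_rounds=10):
    """Mean std of predicted default probability across bootstrap refits."""
    preds = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(X_fit), len(X_fit))  # resample with replacement
        model = make_model().fit(X_fit[idx], y_fit[idx])
        preds.append(model.predict_proba(X_eval)[:, 1])
    return np.std(preds, axis=0).mean()


tree_spread = bootstrap_spread(
    lambda: DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
)
forest_spread = bootstrap_spread(
    lambda: RandomForestClassifier(
        n_estimators=50, max_depth=4, min_samples_leaf=50, random_state=0
    )
)

print(f"single tree spread: {tree_spread:.3f}")
print(f"forest spread:      {forest_spread:.3f}")
```

A smaller spread means the score assigned to a given applicant depends less on which sample the model happened to be trained on.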

## Random Forest (Industry Workhorse)
Why Random Forests?

- Ensemble of decorrelated trees

- Strong out-of-sample stability

- Robust to noise and outliers

In [7]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=8,
    min_samples_leaf=30,
    class_weight="balanced",
    random_state=42,
    n_jobs=-1
)

rf.fit(X_train_imp, y_train)


### Random Forest Evaluation

In [8]:
rf_prob = rf.predict_proba(X_test_imp)[:, 1]
rf_pred = rf.predict(X_test_imp)

print(classification_report(y_test, rf_pred))
roc_auc_score(y_test, rf_prob)


              precision    recall  f1-score   support

           0       0.88      0.82      0.85       653
           1       0.70      0.79      0.75       347

    accuracy                           0.81      1000
   macro avg       0.79      0.81      0.80      1000
weighted avg       0.82      0.81      0.81      1000



np.float64(0.8921757704410149)

### Why Random Forests Work Well in Credit

- `[pro]` Non-linear
- `[pro]` Stable
- `[pro]` Resistant to overfitting as more trees are added
- `[con]` Less interpretable than logistic regression

Random Forests are often used as challenger models.
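A Random Forest also offers a built-in stability check that does not touch the test set: the out-of-bag (OOB) estimate, which scores each training row using only the trees that did not see it during fitting. The sketch below enables it with `oob_score=True` on synthetic data; the dataset and the reduced `n_estimators` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with roughly 35% positives (not the notebook's dataset)
X_oob, y_oob = make_classification(
    n_samples=2000,
    n_features=9,
    n_informative=4,
    weights=[0.65, 0.35],
    random_state=42,
)

rf_oob = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    min_samples_leaf=30,
    class_weight="balanced",
    oob_score=True,  # score each row with trees that did not train on it
    random_state=42,
    n_jobs=-1,
)
rf_oob.fit(X_oob, y_oob)

# oob_score_ is an accuracy estimate by default
print(f"OOB accuracy: {rf_oob.oob_score_:.3f}")
```

If the OOB score and the held-out test score diverge noticeably, that is usually worth investigating before relying on the model.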

### Feature Importance (Risk Drivers)

In [9]:
importance_df = pd.DataFrame({
    "feature": X.columns,
    "importance": rf.feature_importances_
}).sort_values(by="importance", ascending=False)

importance_df


| | feature | importance |
|---|---|---|
| 2 | credit_utilization | 0.364809 |
| 8 | credit_score | 0.223287 |
| 3 | debt_to_income | 0.206237 |
| 1 | annual_income | 0.102658 |
| 7 | employment_years | 0.04369 |
| 4 | loan_amount | 0.028472 |
| 6 | num_past_defaults | 0.019573 |
| 0 | age | 0.008301 |
| 5 | loan_term_months | 0.002973 |


### Interpretation Warning 

Impurity-based tree importance (`feature_importances_`, as used above) is:

- Biased toward high-cardinality features

- Not causal

- Must be audited carefully

For production explanations, use SHAP values rather than raw impurity importances.
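SHAP is a separate package, so as a dependency-free cross-check this sketch uses scikit-learn's `permutation_importance` on held-out data instead. It builds a hypothetical dataset with one real risk driver and one high-cardinality column of pure noise, then prints impurity-based and permutation importances side by side. The feature names and the data-generating rule are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000

# Hypothetical data: one real driver and one unique-valued noise column
signal = rng.uniform(0, 1, n)
noise_id = rng.permutation(n).astype(float)  # unrelated to the target
X_imp = np.column_stack([signal, noise_id])
y_imp = rng.binomial(1, np.where(signal > 0.7, 0.7, 0.1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imp, y_imp, test_size=0.3, stratify=y_imp, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance: drop in test ROC-AUC when one column is shuffled
perm = permutation_importance(
    forest, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)

for name, imp_mdi, imp_perm in zip(
    ["signal", "noise_id"], forest.feature_importances_, perm.importances_mean
):
    print(f"{name:10s} impurity={imp_mdi:.3f}  permutation={imp_perm:.3f}")
```

Impurity importance is computed on training data and may give the noise column a non-trivial share, while permutation importance on the test set should leave it close to zero.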

## Gradient Boosting (Advanced Risk Modeling)
Why Boosting?

- Sequential error correction

- Captures subtle interactions

- High predictive power

### Gradient Boosting

In [10]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=3,
    random_state=42
)

gb.fit(X_train_imp, y_train)


### Gradient Boosting Evaluation

In [11]:
gb_prob = gb.predict_proba(X_test_imp)[:, 1]
roc_auc_score(y_test, gb_prob)


np.float64(0.902061423445768)
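The "sequential error correction" behind boosting can be observed by scoring the ensemble after each added tree. The sketch below uses `staged_predict_proba`, which yields predictions after 1, 2, ..., `n_estimators` trees, on synthetic data. The dataset and the smaller `n_estimators` are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the notebook's credit dataset)
X_gb, y_gb = make_classification(
    n_samples=2000, n_features=9, n_informative=4, random_state=42
)
X_gb_tr, X_gb_te, y_gb_tr, y_gb_te = train_test_split(
    X_gb, y_gb, test_size=0.2, stratify=y_gb, random_state=42
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.05, max_depth=3, random_state=42
)
gb_demo.fit(X_gb_tr, y_gb_tr)

# One test ROC-AUC value per boosting stage
staged_auc = [
    roc_auc_score(y_gb_te, proba[:, 1])
    for proba in gb_demo.staged_predict_proba(X_gb_te)
]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:3d} trees: test ROC-AUC = {staged_auc[n_trees - 1]:.3f}")
```

Plotting `staged_auc` against the number of trees is a simple way to judge whether more estimators still help or the curve has flattened.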

## Model Comparison (ROC-AUC)

| Model               | ROC-AUC (test) |
| ------------------- | -------------- |
| Logistic Regression | Baseline (not fitted in this notebook) |
| Decision Tree       | 0.805 |
| Random Forest       | 0.892 |
| Gradient Boosting   | 0.902 |
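Since no logistic regression is fitted in this notebook, the following is a sketch of how such a baseline could be compared against a forest. It uses hypothetical data with a deliberately non-monotonic risk pattern (elevated default rates at both very low and very high utilization), which is an assumption made to show where a linear model struggles; results on the real dataset may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 4000

# Hypothetical features; risk is U-shaped in utilization by construction
util = rng.uniform(0, 1, n)
dti = rng.uniform(0, 1, n)
p_default = np.where((util > 0.8) | (util < 0.1), 0.6, 0.1)
y_cmp = rng.binomial(1, p_default)
X_cmp = np.column_stack([util, dti])

X_a, X_b, y_a, y_b = train_test_split(
    X_cmp, y_cmp, test_size=0.25, stratify=y_cmp, random_state=42
)

# Linear baseline: scaling matters for logistic regression, unlike for trees
logit = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
forest_cmp = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=30, random_state=42
)

lr_auc = roc_auc_score(y_b, logit.fit(X_a, y_a).predict_proba(X_b)[:, 1])
rf_auc = roc_auc_score(y_b, forest_cmp.fit(X_a, y_a).predict_proba(X_b)[:, 1])

print(f"Logistic regression ROC-AUC: {lr_auc:.3f}")
print(f"Random forest ROC-AUC:       {rf_auc:.3f}")
```

On data where risk really is additive and monotonic, the gap can shrink or reverse, which is why the linear model stays in the comparison as a baseline.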


## Pipelines (Leakage-Safe)

In [12]:
from sklearn.pipeline import Pipeline

rf_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(
        n_estimators=300,
        max_depth=8,
        min_samples_leaf=30,
        class_weight="balanced",
        random_state=42
    ))
])

rf_pipeline.fit(X_train, y_train)
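A pipeline is leakage-safe because the imputer is refitted inside every training fold during cross-validation. The sketch below shows this with `cross_val_score` on synthetic data containing artificially injected missing values. The dataset, the 5% missingness rate, and the reduced `n_estimators` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the notebook's credit dataset)
X_cv, y_cv = make_classification(
    n_samples=1500, n_features=9, n_informative=4, random_state=42
)

# Knock out about 5% of values to mimic missing data (hypothetical rate)
rng = np.random.default_rng(42)
X_cv[rng.random(X_cv.shape) < 0.05] = np.nan

cv_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(
        n_estimators=50,
        max_depth=8,
        min_samples_leaf=30,
        class_weight="balanced",
        random_state=42,
    )),
])

# The imputer medians are learned from each training fold only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=cv, scoring="roc_auc")

print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Imputing the full dataset before splitting would let validation rows influence the medians, which is the kind of leakage the pipeline avoids.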


## When NOT to Use Tree Models

- `[con]` Strict regulatory constraints on model form
- `[con]` Coefficient-level explanations are required
- `[con]` Very small datasets
- `[con]` Full policy transparency is required



## Key Takeaways

- Trees capture non-linear credit risk

- Random Forests provide stability

- Boosting maximizes performance

- Interpretability must be managed

- Logistic regression remains the baseline




## Next Notebook
04_Supervised_Learning/

└── [03_ensemble_classification](03_ensemble_classification.ipynb)





# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)

    - [Regression Models](https://github.com/lei-soares/Data-Science-Techniques/tree/49de369e0600a513b54445e8cb4196b26ce71853/04_Supervised_Learning/01_regression_models)
    
    - [Classification Models](https://github.com/lei-soares/Data-Science-Techniques/tree/49de369e0600a513b54445e8cb4196b26ce71853/04_Supervised_Learning/02_classification_models)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)





[Pantufo Dados](www.pantufodados.com)


[Pantufo Dados - YouTube Channel](https://www.youtube.com/@pantufodados)