# Linear and Probabilistic Models
     Credit Risk, Default Probability, and Decision Theory


## Objective

This notebook provides a rigorous introduction to **linear and probabilistic classification models** using a
**finance & banking risk context**, focusing on:

- Logistic Regression for credit default prediction
- Probabilistic interpretation of outputs
- Odds, log-odds, and risk scoring
- Regularization and model stability
- Decision thresholds and cost-sensitive policies

It answers:

    How do banks estimate default probability and translate it into lending decisions?


## Business Context – Credit Risk Modeling

Banks estimate the **Probability of Default (PD)** to:

- Approve or reject loans
- Price interest rates
- Allocate capital (Basel accords)
- Control portfolio risk

Logistic regression remains a **regulatory-standard baseline** due to:
- Interpretability
- Stability
- Auditability



## Imports and Dataset




In [8]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv("D:/GitHub/Data-Science-Techniques/datasets/Supervised-classification/synthetic_credit_default_classification.csv")

df.head()


Unnamed: 0,customer_id,age,annual_income,credit_utilization,debt_to_income,loan_amount,loan_term_months,num_past_defaults,employment_years,credit_score,default
0,1,59,23283.682822,0.187813,0.245248,20232.165654,24,0,4.575844,689.627408,1
1,2,49,61262.608063,0.291774,0.396763,26484.067591,36,0,3.317515,697.770541,1
2,3,35,60221.74316,0.230557,0.122859,27142.522594,24,1,11.871955,713.721429,0
3,4,63,93603.112731,0.157906,0.635484,1000.0,12,0,2.256651,655.306417,1
4,5,28,71674.557271,0.167549,0.422446,15254.246561,48,0,6.97127,644.247643,0


## Dataset Description (Synthetic)

| Feature | Description |
|------|------------|
| customer_id | Unique identifier |
| age | Applicant age |
| annual_income | Declared income |
| credit_utilization | Credit used / credit limit |
| debt_to_income | Debt burden ratio |
| loan_amount | Requested loan value |
| loan_term_months | Loan duration |
| num_past_defaults | Historical defaults |
| employment_years | Job stability |
| credit_score | Synthetic credit score |
| default | Target (1 = default) |


#  Target and Feature Definition


In [9]:
target = "default"

X = df.drop(columns=[target, "customer_id"])
y = df[target]


### Default Rate (Class Balance)


In [10]:
y.value_counts(normalize=True)


default
0    0.6526
1    0.3474
Name: proportion, dtype: float64

### Why Class Imbalance Matters

Defaults are **rare but costly** events.

- Accuracy is misleading
- Recall for default (target) class is critical
- Probabilistic ranking matters more than labels


# – Train/Test Split (Stratified)


In [12]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(
    X,y, 
    test_size=0.2,
    random_state=2010,
    stratify=y
)

## Feature Scaling

Logistic regression assumes features are on comparable scales.


In [14]:
from sklearn.preprocessing import StandardScaler


scaler =  StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# MODELLING

##  Logistic Regression (Probability of Default Model)


In [15]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,
    #class_weight='balanced'
    class_weight={0: 0.6526, 1: 0.3474}
)

log_reg.fit(X_train, y_train)


pd_pred = log_reg.predict(X_test_scaled)
pd_prob = log_reg.predict_proba(X_test_scaled)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### – Evaluation (Risk-Focused Metrics)


In [19]:
from sklearn.metrics import classification_report, roc_auc_score

print(classification_report(y_test, pd_pred))
roc_auc_score(y_test, pd_pred)


              precision    recall  f1-score   support

           0       0.90      0.55      0.68       653
           1       0.51      0.88      0.65       347

    accuracy                           0.67      1000
   macro avg       0.70      0.72      0.66      1000
weighted avg       0.76      0.67      0.67      1000



np.float64(0.7151321102779898)


### Why ROC-AUC Is Preferred in Credit Risk

ROC-AUC measures **ranking quality**, not threshold decisions.

It answers:
> Can we rank risky clients above safe ones?



### Interpreting Coefficients

Coefficients represent changes in **log-odds of default**.


In [20]:
coef_df = pd.DataFrame({
    "feature": X.columns,
    "coefficient": log_reg.coef_[0]
}).sort_values(by="coefficient", ascending=False)

coef_df


Unnamed: 0,feature,coefficient
2,credit_utilization,7.597599
3,debt_to_income,5.774911
6,num_past_defaults,1.138679
0,age,0.007578
5,loan_term_months,0.003283
4,loan_amount,5.1e-05
1,annual_income,-3.1e-05
8,credit_score,-0.011792
7,employment_years,-0.114078


In [17]:
y_test

233     0
4377    0
2181    0
2734    0
1112    0
       ..
1672    1
470     0
4422    1
194     0
2938    1
Name: default, Length: 1000, dtype: int64

### Odds Ratios (Business Interpretation)

Odds ratio > 1 → increases default risk  
Odds ratio < 1 → protective effect


In [29]:
coef_df["odds_ratio"] = np.exp(coef_df["coefficient"])
coef_df


Unnamed: 0,feature,coefficient,odds_ratio
2,credit_utilization,7.597599,1993.40471
3,debt_to_income,5.774911,322.115679
6,num_past_defaults,1.138679,3.122639
0,age,0.007578,1.007607
5,loan_term_months,0.003283,1.003288
4,loan_amount,5.1e-05,1.000051
1,annual_income,-3.1e-05,0.999969
8,credit_score,-0.011792,0.988277
7,employment_years,-0.114078,0.892188


##  Regularization for Stability


- **L2**

In [32]:
log_l2 = LogisticRegression(
    penalty="l2",
    C=0.3,
    #class_weight='balanced'
    class_weight={0: 0.6526, 1: 0.3474},
    max_iter=1000
)

log_l2.fit(X_train_scaled, y_train)


In [35]:
pd_pred_l2 = log_l2.predict(X_test_scaled)
pd_prob_l2 = log_l2.predict_proba(X_test_scaled)

print(classification_report(y_test, pd_pred_l2))
roc_auc_score(y_test, pd_pred_l2)

              precision    recall  f1-score   support

           0       0.82      0.95      0.88       653
           1       0.87      0.61      0.72       347

    accuracy                           0.83      1000
   macro avg       0.85      0.78      0.80      1000
weighted avg       0.84      0.83      0.83      1000



np.float64(0.782414129422616)

In [34]:
coef_df_l2 = pd.DataFrame({
    "feature_l2": X.columns,
    "coefficient_l2": log_l2.coef_[0]
}).sort_values(by="coefficient_l2", ascending=False)

coef_df_l2





coef_df_l2["odds_ratio_l2"] = np.exp(coef_df_l2["coefficient_l2"])
coef_df_l2


Unnamed: 0,feature_l2,coefficient_l2,odds_ratio_l2
2,credit_utilization,1.510476,4.528887
3,debt_to_income,1.210261,3.354359
6,num_past_defaults,0.510284,1.665763
4,loan_amount,0.37201,1.450647
0,age,0.019882,1.020081
5,loan_term_months,-0.001986,0.998016
7,employment_years,-0.639263,0.527681
1,annual_income,-0.841267,0.431164
8,credit_score,-1.27267,0.280083


## L1 Regularization (Sparse Risk Factors)

- **L1**

In [38]:
log_l1 = LogisticRegression(
    penalty="l1",
    solver="liblinear",
    C=0.3,
    #class_weight='balanced'
    class_weight={0: 0.6526, 1: 0.3474},
    max_iter=1000
)

log_l1.fit(X_train_scaled, y_train)


In [39]:
pd_pred_l1 = log_l1.predict(X_test_scaled)
pd_prob_l1 = log_l1.predict_proba(X_test_scaled)

print(classification_report(y_test, pd_pred_l1))
roc_auc_score(y_test, pd_pred_l1)

              precision    recall  f1-score   support

           0       0.82      0.95      0.88       653
           1       0.87      0.61      0.72       347

    accuracy                           0.83      1000
   macro avg       0.84      0.78      0.80      1000
weighted avg       0.84      0.83      0.82      1000



np.float64(0.7816484326385426)

In [40]:
coef_df_l1 = pd.DataFrame({
    "feature_l1": X.columns,
    "coefficient_l1": log_l1.coef_[0]
}).sort_values(by="coefficient_l1", ascending=False)

coef_df_l1





coef_df_l1["odds_ratio_l1"] = np.exp(coef_df_l1["coefficient_l1"])
coef_df_l1


Unnamed: 0,feature_l1,coefficient_l1,odds_ratio_l1
2,credit_utilization,1.510964,4.531098
3,debt_to_income,1.208296,3.347776
6,num_past_defaults,0.500259,1.649148
4,loan_amount,0.359894,1.433178
0,age,0.003157,1.003162
5,loan_term_months,0.0,1.0
7,employment_years,-0.627648,0.533846
1,annual_income,-0.834855,0.433937
8,credit_score,-1.270818,0.280602


##  Decision Thresholds and Lending Policy


In [42]:
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, pd_pred)

pd.DataFrame({
    "threshold": thresholds,
    "precision": precision[:-1],
    "recall": recall[:-1]
}).head()


Unnamed: 0,threshold,precision,recall
0,0,0.347,1.0
1,1,0.510033,0.878963


## Cost-Sensitive View

| Error | Business Cost |
|----|-------------|
| False Positive (reject good) | Lost revenue |
| False Negative (approve bad) | Capital loss |

Threshold selection encodes risk appetite.


## Pipelines (Leakage-Safe and Auditable)


In [45]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

credit_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(
        #class_weight='balanced'
        class_weight={0: 0.6526, 1: 0.3474},
        max_iter=1000
    ))
])

credit_pipeline.fit(X_train, y_train)


## When Logistic Regression Is Not Enough

- Strong non-linear risk interactions
- Behavioral sequences
- Fraud detection
- Complex portfolio effects


## Model Summary

| Aspect | Logistic Regression |
|-----|--------------------|
| Interpretability | Excellent |
| Calibration | Strong |
| Regulatory acceptance | High |
| Non-linearity | Weak |


## Key Takeaways

- Logistic regression estimates Probability of Default
- Coefficients explain risk drivers
- Regularization ensures stability
- Thresholds encode credit policy
- Pipelines support governance


## Next Notebook

04_Supervised_Learning/

└── [02_tree_based_classification.ipynb](02_tree_based_classification.ipynb)


<br><br><br><br><br>



# Complete: [Data Science Techniques](https://github.com/lei-soares/Data-Science-Techniques)

- [00_Data_Generation_and_Simulation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/00_Data_Generation_and_Simulation)


- [01_Exploratory_Data_Analysis_(EDA)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/01_Exploratory_Data_Analysis_(EDA))


- [02_Data_Preprocessing](https://github.com/lei-soares/Data-Science-Techniques/tree/main/02_Data_Preprocessing)


- [03_Feature_Engineering](https://github.com/lei-soares/Data-Science-Techniques/tree/main/03_Feature_Engineering)


- [04_Supervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/04_Supervised_Learning)

    - [Regression Models](https://github.com/lei-soares/Data-Science-Techniques/tree/49de369e0600a513b54445e8cb4196b26ce71853/04_Supervised_Learning/01_regression_models)
    
    - [Classification Models](https://github.com/lei-soares/Data-Science-Techniques/tree/49de369e0600a513b54445e8cb4196b26ce71853/04_Supervised_Learning/02_classification_models)


- [05_Unsupervised_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/05_Unsupervised_Learning)


- [06_Model_Evaluation_and_Validation](https://github.com/lei-soares/Data-Science-Techniques/tree/main/06_Model_Evaluation_and_Validation)


- [07_Model_Tuning_and_Optimization](https://github.com/lei-soares/Data-Science-Techniques/tree/main/07_Model_Tuning_and_Optimization)


- [08_Interpretability_and_Explainability](https://github.com/lei-soares/Data-Science-Techniques/tree/main/08_Interpretability_and_Explainability)


- [09_Pipelines_and_Workflows](https://github.com/lei-soares/Data-Science-Techniques/tree/main/09_Pipelines_and_Workflows)


- [10_Natural_Language_Processing_(NLP)](https://github.com/lei-soares/Data-Science-Techniques/tree/main/10_Natural_Language_Processing_(NLP))


- [11_Time_Series](https://github.com/lei-soares/Data-Science-Techniques/tree/main/11_Time_Series)


- [12_Anomaly_and_Fraud_Detection](https://github.com/lei-soares/Data-Science-Techniques/tree/main/12_Anomaly_and_Fraud_Detection)


- [13_Imbalanced_Learning](https://github.com/lei-soares/Data-Science-Techniques/tree/main/13_Imbalanced_Learning)


- [14_Deployment_and_Production_Concepts](https://github.com/lei-soares/Data-Science-Techniques/tree/main/14_Deployment_and_Production_Concepts)


- [15_Business_and_Experimental_Design](https://github.com/lei-soares/Data-Science-Techniques/tree/main/15_Business_and_Experimental_Design)




<br><br><br><br><br>

[Panfugo Dados](www.pantufodados.com)


[Pantufo Dados - YouTube Channel](https://www.youtube.com/@pantufodados)