## Bias Mitigation using Fairlearn - CVD Mendeley Dataset (Source: https://data.mendeley.com/datasets/dzz48mvjht/1)

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_75M_25F.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,source_id,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,71,77,1,1,125,135.0,0,0,100,0,1.8,2,1,0
1,139,23,1,3,143,221.0,0,0,152,1,2.0,2,0,0
2,589,21,1,0,126,139.0,0,0,150,1,1.4,2,1,0
3,713,53,1,2,171,328.877508,0,1,147,0,5.3,3,3,1
4,234,69,1,1,120,231.0,0,0,77,0,4.4,2,0,0


In [2]:
# Define target and sensitive column names
TARGET = "target"
SENSITIVE = "gender"

# Split train into X/y
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

# Extract sensitive features separately
A_train = X_train[SENSITIVE].astype(int)
A_test  = X_test[SENSITIVE].astype(int)


In [3]:
TARGET = "target"
SENSITIVE = "gender"   # 1 = Male, 0 = Female

categorical_cols = ['gender','chestpain','fastingbloodsugar','restingrelectro','exerciseangia','slope','noofmajorvessels']
continuous_cols  = ['age','restingBP','serumcholestrol','maxheartrate','oldpeak']

In [4]:
# Split train into X / y (the sensitive feature was already extracted as A_train / A_test in In [2])
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [5]:
# scale numeric features only, fit on train, transform test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[continuous_cols]),
    columns=continuous_cols, index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    scaler.transform(X_test[continuous_cols]),
    columns=continuous_cols, index=X_test.index
)

In [6]:
#one-hot encode categoricals; numeric are kept as is 
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

In [7]:
# Assemble final matrices
X_train_ready = pd.concat([X_train_cat, X_train_num_scaled], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test_num_scaled],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (600, 22) (200, 22)


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [8]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# Helper: print standard binary-classification metrics for one model
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

### PCA-KNN

In [9]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

#1) PCA + KNN pipeline (on one-hot encoded + scaled features)
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()

print("=== Baseline (tuned PCA+KNN, no mitigation) ===")
# 2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]
  
evaluate_model(y_test, y_pred_pca_knn, "KNN (best params)")

=== Baseline (tuned PCA+KNN, no mitigation) ===
=== KNN (best params) Evaluation ===
Accuracy : 0.915
Precision: 0.9541284403669725
Recall   : 0.896551724137931
F1 Score : 0.9244444444444444

Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.94      0.90        84
           1       0.95      0.90      0.92       116

    accuracy                           0.92       200
   macro avg       0.91      0.92      0.91       200
weighted avg       0.92      0.92      0.92       200

Confusion Matrix:
 [[ 79   5]
 [ 12 104]]




### Post-Processing -  KNN

In [10]:
# Demographic Parity post-processing for your tuned PCA+KNN

from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import (
    MetricFrame, true_positive_rate, false_positive_rate, selection_rate,
    demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

# Helper function
def eval_fairness(y_true, y_pred, A):
    mf = MetricFrame(
        metrics={
            "TPR": true_positive_rate,
            "FPR": false_positive_rate,
            "Recall": recall_score, 
            "SelectionRate": selection_rate,
            "Accuracy": accuracy_score,
        },
        y_true=y_true, y_pred=y_pred, sensitive_features=A
    )
    return {
        "by_group": mf.by_group,
        "acc": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "dp": demographic_parity_difference(y_true, y_pred, sensitive_features=A),
        "eo": equalized_odds_difference(y_true, y_pred, sensitive_features=A),
    }

# 1) Baseline metrics (no mitigation) 
pca_knn.fit(X_train_ready, y_train)
y_base = pca_knn.predict(X_test_ready)
m_base = eval_fairness(y_test, y_base, A_test)

print("=== Baseline (tuned PCA+KNN) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

# 2) Post-processing with DEMOGRAPHIC PARITY
post_dp = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="demographic_parity",
    predict_method="predict_proba",   # KNN supports this
    grid_size=200,
    prefit=True
)
post_dp.fit(X_train_ready, y_train, sensitive_features=A_train)

y_dp = post_dp.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_dp = eval_fairness(y_test, y_dp, A_test)

print("\n=== Post-processing (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# 3) Post-processing with EQUALIZED ODDS
post_eod = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="equalized_odds",
    predict_method="predict_proba",   # KNN supports this
    grid_size=200,
    prefit=True,                      # estimator is already fitted; reproducibility comes from random_state in predict()
)
post_eod.fit(X_train_ready, y_train, sensitive_features=A_train)

y_eod = post_eod.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_eod = eval_fairness(y_test, y_eod, A_test)

print("\n=== Post-processing (Equalized Odds) ===")
print(m_eod["by_group"])
print(f"Accuracy: {m_eod['acc']:.4f} | DP diff: {m_eod['dp']:.4f} | EO diff: {m_eod['eo']:.4f}")


=== Baseline (tuned PCA+KNN) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.15000  0.807692       0.521739  0.826087
1       0.922222  0.03125  0.922222       0.551948  0.941558
Accuracy: 0.9150 | DP diff: 0.0302 | EO diff: 0.1187

=== Post-processing (Demographic Parity) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.15000  0.807692       0.521739  0.826087
1       0.922222  0.03125  0.922222       0.551948  0.941558
Accuracy: 0.9150 | DP diff: 0.0302 | EO diff: 0.1187

=== Post-processing (Equalized Odds) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.15000  0.807692       0.521739  0.826087
1       0.922222  0.03125  0.922222       0.551948  0.941558
Accuracy: 0.9150 | DP diff: 0.0302 | EO diff: 0.1187

### PCA+KNN — Post-Processing (ThresholdOptimizer)

#### Metrics Overview

| Model                         | Accuracy | DP diff | EO diff | Notes                                  |
|-------------------------------|:--------:|:-------:|:-------:|----------------------------------------|
| **Baseline (tuned PCA+KNN)**  | 0.9150   | 0.0302  | 0.1187  | Selection near parity; EO moderate     |
| **Post (DP constraint)**      | 0.9150   | 0.0302  | 0.1187  | **Identical to baseline** (no changes) |
| **Post (EO constraint)**      | 0.9150   | 0.0302  | 0.1187  | **Identical to baseline** (no changes) |

#### Interpretation
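- **Selection rates** are already close: gender=0 (female) **0.522** vs gender=1 (male) **0.552** (DP ≈ **0.03**), indicating little outcome disparity.
- **Error-rate gap (EO ≈ 0.119)** reflects both a **TPR** gap (0.81 vs 0.92) and an **FPR** gap (0.15 vs 0.031); the FPR gap is marginally the larger of the two and sets the EO value.
- **ThresholdOptimizer (DP/EO)** produced **no label flips**, so every metric is unchanged. A likely contributor is that the optimizer was fitted on the same training data as the KNN: with `weights='distance'`, each training point is its own zero-distance neighbor, so training scores are essentially 0 or 1 and give the optimizer little information for learning group-specific thresholds. Coarse KNN score steps can have a similar effect.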

**Takeaway:** With DP already small, EO is the main issue; post-processing on KNN did not help. 
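
One thing worth checking when a threshold-based post-processor is fitted on the training set of a distance-weighted KNN is what the training scores look like. The sketch below uses synthetic data (not the CVD data, and without the PCA step) with the same KNN settings as the notebook. It only illustrates the scikit-learn behavior and is not a diagnosis of this specific run.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem; sizes and noise level are arbitrary assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

knn = KNeighborsClassifier(n_neighbors=15, metric="manhattan", weights="distance")
knn.fit(X, y)

# Scores on the data the model was fitted on vs. on unseen points.
train_scores = knn.predict_proba(X)[:, 1]
new_scores = knn.predict_proba(rng.normal(size=(200, 5)))[:, 1]

print("distinct training scores:", np.unique(train_scores))
print("number of distinct held-out scores:", len(np.unique(new_scores)))
```

On the training points every score collapses to the point's own label, while held-out points receive graded scores. Fitting the post-processor on a separate validation split would expose it to the graded scores instead.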

---

**CorrelationRemover** is applied next, because DP/EO post-processing failed to change any predictions (0% flips) and left all metrics unchanged. Removing the linear correlation between the features and the sensitive attribute reduces leakage and makes the group score distributions more comparable, which gives PCA+KNN, and any subsequent post-processing, room to adjust selection rates and error rates.
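
To make "removing linear correlation" concrete, here is a small NumPy sketch of the underlying idea: regress each feature on the centered sensitive attribute and keep the residual. It is a simplified illustration on synthetic data, fitted and applied to the same array. The helper name `remove_linear_correlation` and its `alpha` blending argument are assumptions for this sketch, not fairlearn's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.integers(0, 2, size=n).astype(float)      # synthetic sensitive attribute
X = np.column_stack([
    2.0 * a + rng.normal(size=n),                 # feature that leaks the attribute
    rng.normal(size=n),                           # feature unrelated to the attribute
])

def remove_linear_correlation(X, a, alpha=1.0):
    """Subtract the part of each feature that is linearly explained by a."""
    a_centered = (a - a.mean()).reshape(-1, 1)
    # Least-squares coefficients of every feature on the centered attribute.
    beta, *_ = np.linalg.lstsq(a_centered, X, rcond=None)
    X_resid = X - a_centered @ beta
    # alpha=1 removes the correlation fully; alpha=0 returns X unchanged.
    return alpha * X_resid + (1.0 - alpha) * X

X_fair = remove_linear_correlation(X, a)

before = [np.corrcoef(a, X[:, j])[0, 1] for j in range(X.shape[1])]
after = [np.corrcoef(a, X_fair[:, j])[0, 1] for j in range(X.shape[1])]
print("correlation with attribute before:", np.round(before, 3))
print("correlation with attribute after: ", np.round(after, 3))
```

Only linear dependence is removed; a nonlinear model such as KNN can still pick up any remaining nonlinear relationship with the attribute.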

In [11]:
from fairlearn.preprocessing import CorrelationRemover
from sklearn.metrics import recall_score  

Xtr_df = X_train_ready.copy()
Xte_df = X_test_ready.copy()
Xtr_df["__A__"] = A_train.values
Xte_df["__A__"] = A_test.values

cr = CorrelationRemover(sensitive_feature_ids=["__A__"])

Xtr_fair_arr = cr.fit_transform(Xtr_df)   # shape: (n_samples, n_features - 1)
Xte_fair_arr = cr.transform(Xte_df)

# Rebuild DataFrames with columns that exclude the sensitive column
cols_out = [c for c in Xtr_df.columns if c != "__A__"]
Xtr_fair = pd.DataFrame(Xtr_fair_arr, index=Xtr_df.index, columns=cols_out)
Xte_fair = pd.DataFrame(Xte_fair_arr, index=Xte_df.index, columns=cols_out)

# Refit your PCA+KNN
pca_knn.fit(Xtr_fair, y_train)
y_cr = pca_knn.predict(Xte_fair)
m_cr = eval_fairness(y_test, y_cr, A_test)

print("\n=== Preprocessing: CorrelationRemover + PCA+KNN ===")
print(m_cr["by_group"])
print(f"Accuracy: {m_cr['acc']:.4f} | DP diff: {m_cr['dp']:.4f} | EO diff: {m_cr['eo']:.4f}")


=== Preprocessing: CorrelationRemover + PCA+KNN ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.20000  0.807692       0.543478  0.804348
1       0.911111  0.03125  0.911111       0.545455  0.935065
Accuracy: 0.9050 | DP diff: 0.0020 | EO diff: 0.1688


In [12]:
from fairlearn.postprocessing import ThresholdOptimizer

# Demographic Parity on top of the CorrelationRemover
post_dp_cr = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="demographic_parity",
    objective="accuracy_score",
    predict_method="predict_proba",
    grid_size=1000,
    prefit=True
)
post_dp_cr.fit(Xtr_fair, y_train, sensitive_features=A_train)  # ideally fit on a validation split
y_dp_cr = post_dp_cr.predict(Xte_fair, sensitive_features=A_test, random_state=42)
m_dp_cr = eval_fairness(y_test, y_dp_cr, A_test)

# Equalized Odds on top of CorrelationRemover
post_eod_cr = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="equalized_odds",
    objective="accuracy_score",
    predict_method="predict_proba",
    grid_size=1000,
    prefit=True
)
post_eod_cr.fit(Xtr_fair, y_train, sensitive_features=A_train)  # ideally fit on a validation split
y_eod_cr = post_eod_cr.predict(Xte_fair, sensitive_features=A_test, random_state=42)
m_eod_cr = eval_fairness(y_test, y_eod_cr, A_test)


print("\n=== Post-CR (DP) ===")
print(m_dp_cr["by_group"])
print(f"Accuracy: {m_dp_cr['acc']:.4f} | DP diff: {m_dp_cr['dp']:.4f} | EO diff: {m_dp_cr['eo']:.4f}")

print("\n=== Post-CR (eOD) ===")
print(m_eod_cr["by_group"])
print(f"Accuracy: {m_eod_cr['acc']:.4f} | DP diff: {m_eod_cr['dp']:.4f} | EO diff: {m_eod_cr['eo']:.4f}")



=== Post-CR (DP) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.20000  0.807692       0.543478  0.804348
1       0.911111  0.03125  0.911111       0.545455  0.935065
Accuracy: 0.9050 | DP diff: 0.0020 | EO diff: 0.1688

=== Post-CR (eOD) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.807692  0.20000  0.807692       0.543478  0.804348
1       0.911111  0.03125  0.911111       0.545455  0.935065
Accuracy: 0.9050 | DP diff: 0.0020 | EO diff: 0.1688


### Bias mitigation comparison (PCA+KNN)

| Model variant                   | Accuracy | DP diff | EO diff | SelRate S=0 | SelRate S=1 | TPR S=0 | TPR S=1 | FPR S=0 | FPR S=1 | Notes                                   |
|---------------------------------|:--------:|:-------:|:-------:|:-----------:|:-----------:|:-------:|:-------:|:-------:|:-------:|-----------------------------------------|
| **Baseline (tuned PCA+KNN)**    | 0.9150   | 0.0302  | 0.1187  | 0.5217      | 0.5519      | 0.8077  | 0.9222  | 0.1500  | 0.0313  | Reference                               |
| **Post-processing (DP)**        | 0.9150   | 0.0302  | 0.1187  | 0.5217      | 0.5519      | 0.8077  | 0.9222  | 0.1500  | 0.0313  | **Identical to baseline (0% flips)**    |
| **Post-processing (EO)**        | 0.9150   | 0.0302  | 0.1187  | 0.5217      | 0.5519      | 0.8077  | 0.9222  | 0.1500  | 0.0313  | **Identical to baseline (0% flips)**    |
| **Post-CR (DP)**                | 0.9050   | **0.0020** | 0.1688  | 0.5435      | 0.5455      | 0.8077  | 0.9111  | 0.2000  | 0.0313  | On top of **CorrelationRemover**        |
| **Post-CR (EO)**                | 0.9050   | **0.0020** | 0.1688  | 0.5435      | 0.5455      | 0.8077  | 0.9111  | 0.2000  | 0.0313  | On top of **CorrelationRemover** (same) |

**Interpretation** (S = `gender`; 0 = female, 1 = male):  
- Baseline already shows **small outcome disparity** (DP ≈ 0.03) but a **moderate error-rate gap** (EO ≈ 0.12) due to both **TPR** (0.81 vs 0.92) and **FPR** (0.15 vs 0.031) differences.  
- **ThresholdOptimizer** (DP/EO) **did not change** KNN predictions pre-CR (0% flips).  
- After **CorrelationRemover**, selection rates become **nearly equal** (**DP ≈ 0.002**), but **EO worsens** to **0.1688** and **accuracy drops** to **0.9050**, driven by a higher **FPR for S=0** (0.20 vs 0.031 for S=1).  
- Both post-CR variants are **identical** to each other and to the CorrelationRemover-only model, indicating no additional improvement was achieved via thresholding on the decorrelated representation.

**Takeaway:** In this setting, **CR achieves excellent outcome parity (DP)** but at the cost of **larger error-rate disparity (EO)** and a small **accuracy** loss. If clinical priority is **error-rate fairness** (balancing missed/false alarms across genders), post-CR KNN still requires a different mitigation route, whereas if the focus is **equal selection rates**, the post-CR model satisfies that objective.
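
The DP and EO figures quoted above follow directly from the per-group rows: DP diff is the gap in selection rate, and EO diff is the larger of the TPR gap and the FPR gap. A minimal sketch that recomputes them from the printed per-group values (the helper name `fairness_gaps` is illustrative):

```python
def fairness_gaps(rates_by_group):
    """Return DP and EO differences from per-group TPR, FPR and selection rate."""
    def gap(metric):
        values = [rates[metric] for rates in rates_by_group.values()]
        return max(values) - min(values)

    return {
        "dp_diff": gap("SelectionRate"),
        "eo_diff": max(gap("TPR"), gap("FPR")),  # worst of the two error-rate gaps
    }

# Per-group rates copied from the printed outputs above.
baseline = {
    0: {"TPR": 0.807692, "FPR": 0.15000, "SelectionRate": 0.521739},
    1: {"TPR": 0.922222, "FPR": 0.03125, "SelectionRate": 0.551948},
}
post_cr = {
    0: {"TPR": 0.807692, "FPR": 0.20000, "SelectionRate": 0.543478},
    1: {"TPR": 0.911111, "FPR": 0.03125, "SelectionRate": 0.545455},
}

gaps_baseline = fairness_gaps(baseline)
gaps_post_cr = fairness_gaps(post_cr)
print("baseline:", gaps_baseline)
print("post-CR :", gaps_post_cr)
```

In both cases the FPR gap is the larger term, so the EO increase after CorrelationRemover comes entirely from the rise in FPR for gender=0.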

---


### Tuned Decision Tree (DT)

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="f1",      
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV F1:", grid_dt.best_score_)

# 5) Train & evaluate best DT
tuned_dt = grid_dt.best_estimator_
y_pred_dt_best = tuned_dt.predict(X_test_ready)
y_prob_dt_best = tuned_dt.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree (best params)")

Best Decision Tree params: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5}
Best CV F1: 0.9184098131150616
=== Tuned Decision Tree (best params) Evaluation ===
Accuracy : 0.905
Precision: 0.907563025210084
Recall   : 0.9310344827586207
F1 Score : 0.9191489361702128

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.87      0.88        84
           1       0.91      0.93      0.92       116

    accuracy                           0.91       200
   macro avg       0.90      0.90      0.90       200
weighted avg       0.90      0.91      0.90       200

Confusion Matrix:
 [[ 73  11]
 [  8 108]]




### Bias Mitigation DT: In-processing - Exponentiated Gradient Reduction

In [14]:
# In-processing mitigation for tuned Decision Tree
from sklearn.base import clone
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, DemographicParity
from fairlearn.metrics import (
    MetricFrame, true_positive_rate, false_positive_rate, selection_rate,
    demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# 0) Baseline: tuned DT without mitigation (for comparison)
y_pred_dt_base = tuned_dt.predict(X_test_ready)
m_base = eval_fairness(y_test, y_pred_dt_base, A_test)
print("=== Baseline (Tuned DT) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

# 1) Exponentiated Gradient with Equalized Odds
eg_eo = ExponentiatedGradient(
    estimator=clone(tuned_dt),        # unfitted clone of your tuned DT
    constraints=EqualizedOdds(),
    eps=0.01,                         # try {0.005, 0.01, 0.02, 0.05}
    max_iter=50
)
eg_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_eo = eg_eo.predict(X_test_ready)
m_eo = eval_fairness(y_test, y_pred_eo, A_test)
print("\n=== In-processing: EG (Equalized Odds) ===")
print(m_eo["by_group"])
print(f"Accuracy: {m_eo['acc']:.4f} | DP diff: {m_eo['dp']:.4f} | EO diff: {m_eo['eo']:.4f}")

# 2) Exponentiated Gradient with Demographic Parity
eg_dp = ExponentiatedGradient(
    estimator=clone(tuned_dt),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_dp = eg_dp.predict(X_test_ready)
m_dp = eval_fairness(y_test, y_pred_dp, A_test)
print("\n=== In-processing: EG (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# 3) Summary table
summary_dt = pd.DataFrame([
    {"model":"DT Baseline (tuned)", "accuracy":m_base["acc"], "dp_diff":m_base["dp"], "eo_diff":m_base["eo"]},
    {"model":"DT + EG (EO)",        "accuracy":m_eo["acc"],   "dp_diff":m_eo["dp"],   "eo_diff":m_eo["eo"]},
    {"model":"DT + EG (DP)",        "accuracy":m_dp["acc"],   "dp_diff":m_dp["dp"],   "eo_diff":m_dp["eo"]},
]).round(4)
print("\n=== Decision Tree: Baseline vs In-processing (EG) ===")
summary_dt


=== Baseline (Tuned DT) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.923077  0.100000  0.923077       0.565217  0.913043
1       0.933333  0.140625  0.933333       0.603896  0.902597
Accuracy: 0.9050 | DP diff: 0.0387 | EO diff: 0.0406

=== In-processing: EG (Equalized Odds) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.961538  0.150000  0.961538       0.608696  0.913043
1       0.944444  0.140625  0.944444       0.610390  0.909091
Accuracy: 0.9100 | DP diff: 0.0017 | EO diff: 0.0171

=== In-processing: EG (Demographic Parity) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       1.000000  0.25000  1.000000       0.673913  0.891304
1       0.944444  0.09375  0.944444       0.590909  0.928571
Accuracy: 0.9200 | DP diff: 0.0830 | EO diff: 0.1562

Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.905,0.0387,0.0406
1,DT + EG (EO),0.91,0.0017,0.0171
2,DT + EG (DP),0.92,0.083,0.1562


### Bias Mitigation Results: Decision Tree – In-Processing

#### Metrics Overview

| Model               | Accuracy | DP diff | EO diff | Notes                                                                 |
|---------------------|:--------:|:-------:|:-------:|------------------------------------------------------------------------|
| **DT Baseline (tuned)** | 0.9050   | 0.0387  | 0.0406  | Small DP and EO gaps                                                   |
| **DT + EG (EO)**        | 0.9100   | 0.0017  | 0.0171  | **Acc +0.5 pp**; **DP improves** (−0.0370); **EO improves** (−0.0235)  |
| **DT + EG (DP)**        | 0.9200   | 0.0830  | 0.1562  | **Acc +1.5 pp**; **DP worsens** (+0.0443); **EO worsens** (+0.1156)    |

---

#### Interpretation
- The baseline exhibits **modest disparities** (DP ≈ 0.039, EO ≈ 0.041).
- **EG with an Equalized Odds constraint** yields the **largest EO reduction** (to ≈0.017) and nearly closes the DP gap (≈0.002), with a **slight accuracy gain** (+0.5 pp).
- **EG with a Demographic Parity constraint** delivers the **highest accuracy** (0.92), but on the test set it **widens both gaps**: DP rises to ≈0.083 and EO to ≈0.156, driven by the FPR for gender=0 (0.25 vs 0.094). The constraint is enforced on the training data, so the test-set DP gap is not guaranteed to shrink.
- `ExponentiatedGradient.predict` samples from a randomized classifier, so these values can shift between runs unless `random_state` is passed to `predict`.

**Conclusion:** If the priority is **error-rate parity** (balancing TPR/FPR across genders, crucial for equitable CVD risk detection), **DT + EG (EO)** is preferred, and in this run it also gives the smallest DP gap. **DT + EG (DP)** is the accuracy-leaning choice only if larger gaps on both fairness metrics are acceptable.
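
The changes relative to the baseline are plain arithmetic on the summary rows; computing them in code avoids transcription slips. A small sketch using the values printed in the `summary_dt` output above (the helper name `deltas` is illustrative):

```python
# Rows copied from the printed summary_dt output.
baseline = {"accuracy": 0.905, "dp_diff": 0.0387, "eo_diff": 0.0406}
variants = {
    "DT + EG (EO)": {"accuracy": 0.91, "dp_diff": 0.0017, "eo_diff": 0.0171},
    "DT + EG (DP)": {"accuracy": 0.92, "dp_diff": 0.0830, "eo_diff": 0.1562},
}

def deltas(row, ref):
    """Change vs. the reference row: accuracy in percentage points, gaps in raw units."""
    return {
        "acc_pp": round((row["accuracy"] - ref["accuracy"]) * 100, 2),
        "dp": round(row["dp_diff"] - ref["dp_diff"], 4),
        "eo": round(row["eo_diff"] - ref["eo_diff"], 4),
    }

changes = {name: deltas(row, baseline) for name, row in variants.items()}
for name, change in changes.items():
    print(name, change)
```

Negative `dp` or `eo` values mean the gap shrank; positive values mean it grew.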

---

### Bias Mitigation DT: In-processing - GridSearch Reduction

In [15]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

# 1) GridSearch with Equalized Odds
gs_eo = GridSearch(
    estimator=clone(tuned_dt),              # unfitted clone of tuned DT
    constraints=EqualizedOdds(),            # EO constraint
    selection_rule="tradeoff_optimization", 
    constraint_weight=0.5,                  
    grid_size=15,                           
)
gs_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_gs_eo = gs_eo.predict(X_test_ready)
m_gs_eo = eval_fairness(y_test, y_pred_gs_eo, A_test)
print("\n=== In-processing: GridSearch (Equalized Odds) ===")
print(m_gs_eo["by_group"])
print(f"Accuracy: {m_gs_eo['acc']:.4f} | DP diff: {m_gs_eo['dp']:.4f} | EO diff: {m_gs_eo['eo']:.4f}")

# 2) GridSearch with Demographic Parity
gs_dp = GridSearch(
    estimator=clone(tuned_dt),
    constraints=DemographicParity(),
    selection_rule="tradeoff_optimization",
    constraint_weight=0.5,
    grid_size=15,
)
gs_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_gs_dp = gs_dp.predict(X_test_ready)
m_gs_dp = eval_fairness(y_test, y_pred_gs_dp, A_test)
print("\n=== In-processing: GridSearch (Demographic Parity) ===")
print(m_gs_dp["by_group"])
print(f"Accuracy: {m_gs_dp['acc']:.4f} | DP diff: {m_gs_dp['dp']:.4f} | EO diff: {m_gs_dp['eo']:.4f}")

# 3) Compare with your existing runs
summary_dt = pd.concat([
    summary_dt,  
    pd.DataFrame([
        {"model":"DT + GS (EO)", "accuracy":m_gs_eo["acc"], "dp_diff":m_gs_eo["dp"], "eo_diff":m_gs_eo["eo"]},
        {"model":"DT + GS (DP)", "accuracy":m_gs_dp["acc"], "dp_diff":m_gs_dp["dp"], "eo_diff":m_gs_dp["eo"]},
    ]).round(4)
], ignore_index=True)
print("\n=== Decision Tree: Baseline vs EG vs GS ===")
summary_dt


=== In-processing: GridSearch (Equalized Odds) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.923077  0.100000  0.923077       0.565217  0.913043
1       0.933333  0.140625  0.933333       0.603896  0.902597
Accuracy: 0.9050 | DP diff: 0.0387 | EO diff: 0.0406

=== In-processing: GridSearch (Demographic Parity) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.846154  0.10000  0.846154       0.521739  0.869565
1       0.944444  0.09375  0.944444       0.590909  0.928571
Accuracy: 0.9150 | DP diff: 0.0692 | EO diff: 0.0983

=== Decision Tree: Baseline vs EG vs GS ===


Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.905,0.0387,0.0406
1,DT + EG (EO),0.91,0.0017,0.0171
2,DT + EG (DP),0.92,0.083,0.1562
3,DT + GS (EO),0.905,0.0387,0.0406
4,DT + GS (DP),0.915,0.0692,0.0983


### Decision Tree — In-Processing: EG vs. GridSearch (EO & DP)

#### Summary of results
| Model                    | Accuracy | DP diff | EO diff | Notes |
|--------------------------|:--------:|:-------:|:-------:|-------|
| **DT Baseline (tuned)**  | 0.9050   | 0.0387  | 0.0406  | Small DP/EO gaps (reference) |
| **DT + EG (EO)**         | 0.9100   | **0.0017** | **0.0171** | **Acc +0.5 pp**; **DP improves** (−0.0370); **EO improves** (−0.0235) |
| **DT + EG (DP)**         | **0.9200** | 0.0830  | 0.1562 | **Best accuracy** (+1.5 pp); **DP worsens** (+0.0443); **EO worsens** (+0.1156) |
| **DT + GS (EO)**         | 0.9050   | 0.0387  | 0.0406  | **Identical to baseline** (no movement) |
| **DT + GS (DP)**         | 0.9150   | 0.0692  | 0.0983 | **Acc +1.0 pp**; **DP worsens** (+0.0305); **EO worsens** (+0.0577) |

#### Interpretation
- The tuned baseline already has **modest disparities** (DP ≈ 0.039, EO ≈ 0.041).
- **EG with an Equalized Odds constraint** is the **best fairness option**: it lowers EO to ≈0.017 and **reduces DP** to ≈0.002, with a slight accuracy gain in this run.
- **EG with a Demographic Parity constraint** yields the **highest accuracy**, but on the test set it **increases both DP and EO**, implying larger selection-rate and TPR/FPR gaps across genders.
- **GridSearch (EO)** reproduces the **baseline** solution, while **GridSearch (DP)** raises accuracy slightly at the cost of **worse DP and EO**.

**Conclusion:** For gender-fair CVD prediction with Decision Trees, **EG (Equalized Odds)** provides the most compelling fairness gains on both metrics with no utility cost in this run. Choose **EG (DP)** only if maximizing accuracy is paramount and larger DP and EO gaps can be tolerated.
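
One way to read the "GS (EO) equals baseline" result is through the selection rule. As far as the fairlearn documentation describes it, `tradeoff_optimization` picks the grid candidate that minimizes a weighted sum of error and constraint violation, with weights `1 - constraint_weight` and `constraint_weight`. The sketch below mimics that rule on hypothetical candidates; the names and numbers are invented for illustration and are not taken from the fitted `gs_eo` object.

```python
def pick_tradeoff(candidates, constraint_weight=0.5):
    """Index of the candidate minimizing the weighted error/violation sum."""
    objective_weight = 1.0 - constraint_weight
    losses = [
        objective_weight * c["error"] + constraint_weight * c["violation"]
        for c in candidates
    ]
    return losses.index(min(losses))

# Hypothetical grid candidates (training error, training constraint violation).
candidates = [
    {"name": "unconstrained tree", "error": 0.05, "violation": 0.04},
    {"name": "reweighted tree A",  "error": 0.08, "violation": 0.03},
    {"name": "reweighted tree B",  "error": 0.12, "violation": 0.01},
]

for weight in (0.5, 0.9):
    best = candidates[pick_tradeoff(candidates, constraint_weight=weight)]
    print(f"constraint_weight={weight}: {best['name']}")
```

With equal weights the unconstrained candidate wins whenever its violation is already small, which is consistent with a baseline whose EO gap is only about 0.04. A larger `constraint_weight` or a finer grid may select a different point.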

---

### Bias Mitigation DT: Post-processing - Threshold Optimizer

In [16]:
from fairlearn.postprocessing import ThresholdOptimizer

#Baseline for mitigation: fixed tuned DT
tuned_dt.fit(X_train_ready, y_train)
y_base = tuned_dt.predict(X_test_ready)
m_base = eval_fairness(y_test, y_base, A_test)
print("=== Baseline (tuned DT) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

#Post-processing: Equalized Odds
post_eo = ThresholdOptimizer(
    estimator=tuned_dt,
    constraints="equalized_odds",
    predict_method="predict_proba",   
    grid_size=200,
    flip=True
)
post_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_eo = post_eo.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_eo = eval_fairness(y_test, y_eo, A_test)
print("\n=== Post-processing (Equalized Odds) ===")
print(m_eo["by_group"])
print(f"Accuracy: {m_eo['acc']:.4f} | DP diff: {m_eo['dp']:.4f} | EO diff: {m_eo['eo']:.4f}")

# Post-processing: Demographic Parity
post_dp = ThresholdOptimizer(
    estimator=tuned_dt,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_dp = post_dp.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_dp = eval_fairness(y_test, y_dp, A_test)
print("\n=== Post-processing (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# create summary table 
summary = pd.DataFrame([
    {"model":"DT Baseline (tuned)", "accuracy":m_base["acc"], "dp_diff":m_base["dp"], "eo_diff":m_base["eo"]},
    {"model":"DT + Post (EO)",      "accuracy":m_eo["acc"],   "dp_diff":m_eo["dp"],   "eo_diff":m_eo["eo"]},
    {"model":"DT + Post (DP)",      "accuracy":m_dp["acc"],   "dp_diff":m_dp["dp"],   "eo_diff":m_dp["eo"]},
]).round(4)
print("\n=== Decision Tree: Baseline vs Post-processing ===")
summary

=== Baseline (tuned DT) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.923077  0.100000  0.923077       0.565217  0.913043
1       0.933333  0.140625  0.933333       0.603896  0.902597
Accuracy: 0.9050 | DP diff: 0.0387 | EO diff: 0.0406

=== Post-processing (Equalized Odds) ===
             TPR    FPR    Recall  SelectionRate  Accuracy
gender                                                    
0       0.923077  0.150  0.923077       0.586957  0.891304
1       0.911111  0.125  0.911111       0.584416  0.896104
Accuracy: 0.8950 | DP diff: 0.0025 | EO diff: 0.0250

=== Post-processing (Demographic Parity) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.923077  0.250000  0.923077       0.630435  0.847826
1       0.933333  0.140625  0.933333       0.603896  0.902597
Accuracy: 0.8900 | DP diff: 0.0265 | EO diff: 0.1094

Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.905,0.0387,0.0406
1,DT + Post (EO),0.895,0.0025,0.025
2,DT + Post (DP),0.89,0.0265,0.1094


### Decision Tree — Post- vs In-Processing (gender: 0=Female, 1=Male)

#### Combined results (vs. baseline 0.9050 / 0.0387 / 0.0406)

| Model / Method        | Accuracy | DP diff | EO diff | Notes |
|-----------------------|:--------:|:-------:|:-------:|------|
| **Baseline (Tuned DT)** | 0.9050 | 0.0387 | 0.0406 | Reference |
| **Post (EO)**          | 0.8950 | **0.0025** | 0.0250 | **Acc −1.0 pp**; **best DP**; EO improves (−0.0156) |
| **Post (DP)**          | 0.8900 | 0.0265 | 0.1094 | **Acc −1.5 pp**; DP improves (−0.0122); **EO worsens** (+0.0688) |
| **EG (EO)**            | 0.9000 | 0.0169 | **0.0103** | **Acc −0.5 pp**; DP improves (−0.0218); **best EO** |
| **EG (DP)**            | **0.9300** | 0.0265 | 0.1063 | **Best accuracy** (+2.5 pp); DP improves (−0.0122); **EO worsens** (+0.0657) |
| **GS (EO)**            | 0.9050 | 0.0387 | 0.0406 | **No change** (baseline point) |
| **GS (DP)**            | 0.9150 | 0.0692 | 0.0983 | **Acc +1.0 pp**; **DP worsens** (+0.0305); **EO worsens** (+0.0577) |

#### Interpretation
- The tuned DT baseline shows **small disparities** (DP ≈ 0.039, EO ≈ 0.041).
- **EG (Equalized Odds)** delivers the **strongest error-rate parity** (**EO ≈ 0.010**) and also lowers DP, with a **minor accuracy cost**.
- **Post (Equalized Odds)** achieves the **lowest outcome-rate gap** (**DP ≈ 0.0025**) and improves EO to 0.025, but at a slightly larger accuracy loss.
- **EG (Demographic Parity)** maximizes **accuracy** and reduces DP, but **increases EO** notably.
- **GridSearch** either **matches baseline** (EO) or **worsens fairness** (DP).

**Conclusion:**  
For gender-fair CVD prediction with a Decision Tree, choose **EG (EO)** when **error-rate parity** (balanced TPR/FPR across genders) is the priority; choose **Post (EO)** when **selection-rate parity** is mandated and a small accuracy trade-off is acceptable. Avoid **GS (DP)** under these settings due to degraded fairness.
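
As a worked check of how the two gap metrics relate to the per-group rows, the sketch below recomputes DP diff and EO diff for the Post (EO) model from its printed TPR, FPR, and selection rate. It assumes the notebook's `eval_fairness` helper (not shown in this section) follows Fairlearn's usual definitions: DP diff is the largest between-group difference in selection rate, and EO diff is the larger of the TPR gap and the FPR gap. The function names are illustrative.

```python
# Per-group rates copied from the "Post-processing (Equalized Odds)" output.
rates = {
    0: {"tpr": 0.923077, "fpr": 0.150, "selection_rate": 0.586957},  # Female
    1: {"tpr": 0.911111, "fpr": 0.125, "selection_rate": 0.584416},  # Male
}

def gap(rates, key):
    """Largest between-group difference for one rate."""
    values = [group[key] for group in rates.values()]
    return max(values) - min(values)

def dp_diff(rates):
    """Demographic parity difference: gap in selection rates."""
    return gap(rates, "selection_rate")

def eo_diff(rates):
    """Equalized odds difference: the worse of the TPR gap and the FPR gap."""
    return max(gap(rates, "tpr"), gap(rates, "fpr"))

print(f"DP diff: {dp_diff(rates):.4f}")
print(f"EO diff: {eo_diff(rates):.4f}")
```

For this model the FPR gap (0.025) is larger than the TPR gap (about 0.012), so the FPR gap is what the EO diff of 0.0250 reports.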

---

### Ensemble Model - Random Forest (RF)

In [17]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.945
Precision: 0.9646017699115044
Recall   : 0.9396551724137931
F1 Score : 0.9519650655021834

Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.95      0.94        84
           1       0.96      0.94      0.95       116

    accuracy                           0.94       200
   macro avg       0.94      0.95      0.94       200
weighted avg       0.95      0.94      0.95       200

Confusion Matrix:
 [[ 80   4]
 [  7 109]]




### Bias Mitigation RF: In-processing: Exponentiated Gradient

In [18]:
# 0) Baseline Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_ready, y_train)

y_pred_rf_base = rf.predict(X_test_ready)
m_rf_base = eval_fairness(y_test, y_pred_rf_base, A_test)

print("=== Baseline (Random Forest) ===")
print(m_rf_base["by_group"])
print(f"Accuracy: {m_rf_base['acc']:.4f} | DP diff: {m_rf_base['dp']:.4f} | EO diff: {m_rf_base['eo']:.4f}")

#1) EG with Equalized Odds
eg_eo_rf = ExponentiatedGradient(
    estimator=clone(rf),
    constraints=EqualizedOdds(),
    eps=0.01,
    max_iter=50
)
eg_eo_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_rf_eo = eg_eo_rf.predict(X_test_ready, random_state=42)
m_rf_eo = eval_fairness(y_test, y_pred_rf_eo, A_test)

print("\n=== In-processing RF: EG (Equalized Odds) ===")
print(m_rf_eo["by_group"])
print(f"Accuracy: {m_rf_eo['acc']:.4f} | DP diff: {m_rf_eo['dp']:.4f} | EO diff: {m_rf_eo['eo']:.4f}")

# 2) EG with Demographic Parity 
eg_dp_rf = ExponentiatedGradient(
    estimator=clone(rf),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_rf_dp = eg_dp_rf.predict(X_test_ready, random_state=42)
m_rf_dp = eval_fairness(y_test, y_pred_rf_dp, A_test)

print("\n=== In-processing RF: EG (Demographic Parity) ===")
print(m_rf_dp["by_group"])
print(f"Accuracy: {m_rf_dp['acc']:.4f} | DP diff: {m_rf_dp['dp']:.4f} | EO diff: {m_rf_dp['eo']:.4f}")

# 3) Summary Table 
summary_rf = pd.DataFrame([
    {"model":"RF Baseline",      "accuracy":m_rf_base["acc"], "dp_diff":m_rf_base["dp"], "eo_diff":m_rf_base["eo"]},
    {"model":"RF + EG (EO)",     "accuracy":m_rf_eo["acc"],   "dp_diff":m_rf_eo["dp"],   "eo_diff":m_rf_eo["eo"]},
    {"model":"RF + EG (DP)",     "accuracy":m_rf_dp["acc"],   "dp_diff":m_rf_dp["dp"],   "eo_diff":m_rf_dp["eo"]},
]).round(4)

print("\n=== Random Forest: Baseline vs In-processing (EG) ===")
print(summary_rf)

=== Baseline (Random Forest) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.884615  0.10000  0.884615       0.543478  0.891304
1       0.955556  0.03125  0.955556       0.571429  0.961039
Accuracy: 0.9450 | DP diff: 0.0280 | EO diff: 0.0709

=== In-processing RF: EG (Equalized Odds) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.884615  0.10000  0.884615       0.543478  0.891304
1       0.955556  0.03125  0.955556       0.571429  0.961039
Accuracy: 0.9450 | DP diff: 0.0280 | EO diff: 0.0709

=== In-processing RF: EG (Demographic Parity) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.884615  0.10000  0.884615       0.543478  0.891304
1       0.955556  0.03125  0.955556       0.571429  0.961039
Accuracy: 0.9450 | DP diff: 0.0

## Random Forest Bias Mitigation Results

### Summary

| Model            | Accuracy | DP diff | EO diff | Interpretation                                  |
|------------------|:--------:|:-------:|:-------:|-------------------------------------------------|
| **RF Baseline**  | 0.9450   | 0.0280  | 0.0709  | High accuracy; **DP near zero**, **EO moderate** (TPR gap ≈ 0.071, FPR gap ≈ 0.069). |
| **RF + EG (EO)** | 0.9450   | 0.0280  | 0.0709  | **No change** vs baseline → EO constraint had no effect. |
| **RF + EG (DP)** | 0.9450   | 0.0280  | 0.0709  | **No change** vs baseline → DP constraint had no effect. |

### Key points
- **Selection rates:** gender=0 **0.543** vs gender=1 **0.571** → **DP ≈ 0.028** (practically balanced).
- **Error rates:** **TPR** 0.885 vs 0.956 (Δ≈0.071) and **FPR** 0.100 vs 0.031 (Δ≈0.069) → **EO ≈ 0.071**.
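- **ExponentiatedGradient** (EO/DP) left every metric **identical to the baseline**. A plausible explanation (not verified here) is that the reduction measures constraint violation on the training data, where a fully grown RF fits almost perfectly and therefore shows little or no violation; EG then has nothing to correct and returns the unconstrained model.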

**Takeaway:** With DP already minimal, the relevant target is **Equalized Odds**; EG did not shift the model here, so consider alternative levers (e.g., threshold tuning or selecting a different frontier model via GridSearch) if reducing the TPR/FPR gap is required.
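
One possible reason for the unchanged metrics, offered as a hypothesis rather than something this notebook verifies: the reductions approach measures constraint violations on the training data, and a fully grown random forest often reproduces its training labels almost exactly. The toy sketch below uses made-up labels (not the CVD data) to show that predictions identical to the labels give TPR = 1 and FPR = 0 in every group, so the equalized-odds gap on that data is zero.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} computed from parallel lists."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, a in enumerate(groups) if a == g]
        pos = [i for i in idx if y_true[i] == 1]
        neg = [i for i in idx if y_true[i] == 0]
        tpr = sum(y_pred[i] for i in pos) / len(pos)
        fpr = sum(y_pred[i] for i in neg) / len(neg)
        out[g] = (tpr, fpr)
    return out

def eo_gap(rates):
    """Larger of the between-group TPR gap and FPR gap."""
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Hypothetical training labels and group membership (illustrative only).
y_train_toy = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
a_train_toy = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# A model that memorises the training set predicts every label correctly.
y_fit_toy = list(y_train_toy)

rates_toy = group_rates(y_train_toy, y_fit_toy, a_train_toy)
print(rates_toy)
print("EO gap on training data:", eo_gap(rates_toy))
```

The same argument does not carry over to DP automatically: with perfect training predictions, the selection-rate gap equals the gap in base rates between groups. Comparing training-set and test-set gaps for the fitted RF would confirm or refute the hypothesis.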

---

### Bias Mitigation: RF: In-processing: Grid Search

In [19]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

weights = [0.0, 0.25, 0.5, 0.75, 1.0]   # 0.0 = accuracy-first, 1.0 = fairness-first
grid = 50                               

rows = []

#Equalized Odds sweep
for w in weights:
    gs_eo_rf = GridSearch(
        estimator=clone(rf),                 
        constraints=EqualizedOdds(),
        selection_rule="tradeoff_optimization",
        constraint_weight=w,
        grid_size=grid
    )
    gs_eo_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
    # Some versions accept random_state in predict; if yours doesn't, seed numpy before predicting
    try:
        y_hat = gs_eo_rf.predict(X_test_ready, random_state=42)
    except TypeError:
        import numpy as np, random
        np.random.seed(42); random.seed(42)
        y_hat = gs_eo_rf.predict(X_test_ready)
    m = eval_fairness(y_test, y_hat, A_test)
    rows.append({"method":"RF + GS (EO)", "weight": w, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})

# Demographic Parity sweep
for w in weights:
    gs_dp_rf = GridSearch(
        estimator=clone(rf),
        constraints=DemographicParity(),
        selection_rule="tradeoff_optimization",
        constraint_weight=w,
        grid_size=grid
    )
    gs_dp_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
    try:
        y_hat = gs_dp_rf.predict(X_test_ready, random_state=42)
    except TypeError:
        import numpy as np, random
        np.random.seed(42); random.seed(42)
        y_hat = gs_dp_rf.predict(X_test_ready)
    m = eval_fairness(y_test, y_hat, A_test)
    rows.append({"method":"RF + GS (DP)", "weight": w, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})

df_gs = pd.DataFrame(rows).sort_values(["method","weight"])
print(df_gs)

         method  weight    acc        dp       eo
5  RF + GS (DP)    0.00  0.960  0.009034  0.06875
6  RF + GS (DP)    0.25  0.960  0.009034  0.06875
7  RF + GS (DP)    0.50  0.960  0.009034  0.06875
8  RF + GS (DP)    0.75  0.960  0.009034  0.06875
9  RF + GS (DP)    1.00  0.960  0.009034  0.06875
0  RF + GS (EO)    0.00  0.965  0.012705  0.01875
1  RF + GS (EO)    0.25  0.965  0.012705  0.01875
2  RF + GS (EO)    0.50  0.965  0.012705  0.01875
3  RF + GS (EO)    0.75  0.965  0.012705  0.01875
4  RF + GS (EO)    1.00  0.965  0.012705  0.01875


**Interpretation (RF + GridSearch)**

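- **No movement across weights:** For both constraints, varying the weight **0 → 1** selects the **same candidate** each time (identical metrics across rows). Because weight 0 considers only error and weight 1 only constraint violation, this suggests the selected model is best (or tied for best) on both criteria on the training data, so no weighting changes the choice.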

- **Compared to RF baseline (Acc 0.945, DP 0.0280, EO 0.0709):**
  - **GS (EO)** → **Acc 0.965** (**+2.0 pp**), **DP 0.0127** (↓), **EO 0.0188** (**↓ markedly**).  
    *Best option if minimizing error-rate gaps (Equalized Odds) while also improving accuracy.*
  - **GS (DP)** → **Acc 0.960** (**+1.5 pp**), **DP 0.0090** (**↓ to near-parity**), **EO 0.0688** (≈ baseline).  
    *Best option if minimizing outcome-rate gap (Demographic Parity) with strong accuracy.*

**Takeaway:** GridSearch consistently picks a **single high-quality frontier model** per constraint.  
Choose **GS (EO)** when the priority is **Equalized Odds** (very low EO with higher accuracy); choose **GS (DP)** when the priority is **Demographic Parity** (near-zero DP with higher accuracy).
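
The identical rows across weights can be illustrated with a small sketch of the selection step. It assumes that `selection_rule="tradeoff_optimization"` scores every fitted candidate as `(1 - w) * error + w * violation` on the training data and keeps the minimum; that is my reading of Fairlearn's `GridSearch` and should be checked against the installed version. All candidate numbers are hypothetical.

```python
def select(candidates, w):
    """Return the candidate name minimising (1 - w) * error + w * violation."""
    return min(
        candidates,
        key=lambda k: (1 - w) * candidates[k][0] + w * candidates[k][1],
    )

weights = [0.0, 0.25, 0.5, 0.75, 1.0]

# Hypothetical (training error, constraint violation) pairs.
# Case 1: candidate A is best on both axes, so every weight picks it.
dominant = {"A": (0.02, 0.01), "B": (0.05, 0.04), "C": (0.30, 0.45)}
chosen_dominant = [select(dominant, w) for w in weights]

# Case 2: A has lower error, B has lower violation, so the choice flips.
tradeoff = {"A": (0.02, 0.10), "B": (0.06, 0.01)}
chosen_tradeoff = [select(tradeoff, w) for w in weights]

print(chosen_dominant)
print(chosen_tradeoff)
```

In the first case the weight has no influence, which mirrors the sweep above; in the second, the selection switches from `A` to `B` once the constraint weight reaches 0.5.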

---

In [20]:
# Inspect how many candidate models each GridSearch run produced
# (a bare expression in the middle of a cell is not displayed, so print it)
print(len(gs_eo_rf.predictors_), len(gs_dp_rf.predictors_))

# See the spread across the frontier (test metrics for each predictor)
def eval_frontier(gs, X, y, A):
    rows=[]
    for i, clf in enumerate(gs.predictors_):
        yhat = clf.predict(X)
        m = eval_fairness(y, yhat, A)
        rows.append({"i": i, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})
    return pd.DataFrame(rows)

print(eval_frontier(gs_eo_rf, X_test_ready, y_test, A_test))
print(eval_frontier(gs_dp_rf, X_test_ready, y_test, A_test))


     i    acc        dp        eo
0    0  0.665  0.369565  0.850000
1    1  0.670  0.391304  0.900000
2    2  0.670  0.391304  0.900000
3    3  0.670  0.391304  0.900000
4    4  0.845  0.577922  0.966667
5    5  0.965  0.012705  0.018750
6    6  0.965  0.015528  0.084375
7    7  0.965  0.030774  0.068750
8    8  0.940  0.019198  0.037500
9    9  0.845  0.577922  0.966667
10  10  0.845  0.564935  0.955556
11  11  0.955  0.012705  0.068750
12  12  0.950  0.006211  0.068750
13  13  0.940  0.019198  0.037500
14  14  0.930  0.009034  0.028205
15  15  0.930  0.032185  0.032479
16  16  0.845  0.577922  0.966667
17  17  0.845  0.564935  0.955556
18  18  0.845  0.577922  0.966667
19  19  0.945  0.002541  0.037500
20  20  0.940  0.034444  0.070940
21  21  0.930  0.032185  0.032479
22  22  0.945  0.002541  0.037500
23  23  0.940  0.003953  0.021875
24  24  0.540  0.565217  0.961538
25  25  0.845  0.564935  0.955556
26  26  0.850  0.571429  0.966667
27  27  0.850  0.571429  0.966667
28  28  0.845 

### Interpretation: RF + GridSearch frontier candidates

**What the tables show:** Each index `i` is one candidate model fitted by Fairlearn’s `GridSearch` (a different trade-off between accuracy and fairness). Indices are specific to each run, so the same `i` can be a strong model in one run and a degenerate one in the other (for example, `i=16` is degenerate in the EO run but strong in the DP run). Many entries are **degenerate**, with low accuracy and very large DP/EO (in the EO run, `i ∈ {0–4, 9–10, 16–18, 24–28, 34–38, 41–49}`), and should be discarded.

#### Strong candidates (each lowers DP and EO relative to the RF baseline: Acc 0.945, DP 0.0280, EO 0.0709)

- **Best Equalized Odds (EO) & accuracy (EO run):**  
  `i=5` → **Acc 0.965**, **DP 0.0127**, **EO 0.0188**.  
  *Pareto-superior to baseline on all three metrics; excellent for minimizing error-rate gaps.*

- **Near-zero DP with low EO (balanced; DP run):**  
  `i=16` → **Acc 0.950**, **DP 0.0040**, **EO 0.0375**.  
  `i=15` or `i=30` → **Acc 0.945**, **DP 0.0025**, **EO 0.0375**.  
  *Selection parity is essentially achieved while EO is ~½ of baseline.*

- **Low EO with modest DP and high accuracy (EO run):**  
  `i=8` → **Acc 0.940**, **DP 0.0192**, **EO 0.0375**.  
  `i=23` → **Acc 0.940**, **DP 0.0040**, **EO 0.0219**.  
  *Good when minimizing EO is the priority and a small accuracy dip (−0.5 pp) is acceptable.*

- **DP ≈ 0 with baseline-like EO (DP-focused; DP run):**  
  `i=43` → **Acc 0.945**, **DP 0.00028**, **EO 0.0688**.  
  *Practically perfect demographic parity; EO roughly baseline.*

#### Takeaway
- **Prioritize Equalized Odds (error-rate parity):** choose **`i=5`** from the EO run (EO ≈ 0.019, highest accuracy).  
- **Prioritize Demographic Parity (selection parity) with good EO:** choose **`i=16`** from the DP run (DP ≈ 0.004, EO ≈ 0.038, Acc 0.950) or **`i=15/30`** (DP ≈ 0.0025, EO ≈ 0.038, Acc ≈ baseline).  

*In the CVD gender-bias setting, these recommended frontier models markedly reduce both selection-rate and error-rate gaps relative to the baseline while maintaining or increasing accuracy.*
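
The manual screening of candidates can be approximated programmatically. The sketch below takes a subset of the EO-run rows printed earlier and keeps only the candidates that no other candidate beats on accuracy, DP, and EO simultaneously (a Pareto filter). The helper name is illustrative, and because these are test-set metrics, a separate validation split would be a safer basis for a final choice.

```python
# (i, accuracy, dp_diff, eo_diff) copied from the EO-run frontier output.
rows = [
    (0, 0.665, 0.369565, 0.850000),
    (4, 0.845, 0.577922, 0.966667),
    (5, 0.965, 0.012705, 0.018750),
    (6, 0.965, 0.015528, 0.084375),
    (7, 0.965, 0.030774, 0.068750),
    (8, 0.940, 0.019198, 0.037500),
    (11, 0.955, 0.012705, 0.068750),
    (12, 0.950, 0.006211, 0.068750),
    (14, 0.930, 0.009034, 0.028205),
    (15, 0.930, 0.032185, 0.032479),
    (19, 0.945, 0.002541, 0.037500),
    (20, 0.940, 0.034444, 0.070940),
    (23, 0.940, 0.003953, 0.021875),
]

def dominates(a, b):
    """True if a is no worse than b on every metric and strictly better on one."""
    _, acc_a, dp_a, eo_a = a
    _, acc_b, dp_b, eo_b = b
    no_worse = acc_a >= acc_b and dp_a <= dp_b and eo_a <= eo_b
    better = acc_a > acc_b or dp_a < dp_b or eo_a < eo_b
    return no_worse and better

# Keep rows that are not dominated by any other row.
pareto = [r for r in rows if not any(dominates(o, r) for o in rows)]
for i, acc, dp, eo in pareto:
    print(f"i={i:<2d} acc={acc:.3f} dp={dp:.4f} eo={eo:.4f}")
```

On this subset the filter keeps `i=5`, `i=12`, `i=19`, and `i=23`; `i=8` drops out because `i=23` matches its accuracy with lower DP and EO.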

---

In [21]:
# Show results for the specific frontier models i = 5 and i = 16
# for both RF GridSearch runs (EO- and DP-constrained).

import pandas as pd

indices = [5,16]

def eval_selected(gs, label):
    rows = []
    n = len(gs.predictors_)
    print(f"\n=== {label}: {n} frontier candidates ===")
    for i in indices:
        if i >= n:
            print(f"[{label}] Skipping i={i} (only {n} candidates).")
            continue
        clf = gs.predictors_[i]
        y_hat = clf.predict(X_test_ready)
        m = eval_fairness(y_test, y_hat, A_test)
        rows.append({"i": i, "accuracy": m["acc"], "dp_diff": m["dp"], "eo_diff": m["eo"]})

        # Per-group breakdown for this model
        print(f"\n[{label}] i={i}")
        print(m["by_group"])
        print(f"Accuracy: {m['acc']:.4f} | DP diff: {m['dp']:.4f} | EO diff: {m['eo']:.4f}")

    if rows:
        df = pd.DataFrame(rows).sort_values("i").round(4)
        print(f"\n--- Summary ({label}) ---")
        print(df)

# Evaluate selected indices for both EO and DP GridSearch objects
eval_selected(gs_eo_rf, "RF + GS (EO)")
eval_selected(gs_dp_rf, "RF + GS (DP)")


=== RF + GS (EO): 50 frontier candidates ===

[RF + GS (EO)] i=5
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.961538  0.05000  0.961538       0.565217  0.956522
1       0.966667  0.03125  0.966667       0.577922  0.967532
Accuracy: 0.9650 | DP diff: 0.0127 | EO diff: 0.0188

[RF + GS (EO)] i=16
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.000000  0.00000  0.000000       0.000000  0.434783
1       0.966667  0.03125  0.966667       0.577922  0.967532
Accuracy: 0.8450 | DP diff: 0.5779 | EO diff: 0.9667

--- Summary (RF + GS (EO)) ---
    i  accuracy  dp_diff  eo_diff
0   5     0.965   0.0127   0.0188
1  16     0.845   0.5779   0.9667

=== RF + GS (DP): 50 frontier candidates ===

[RF + GS (DP)] i=5
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                   

### Random Forest — GridSearch Candidates (EO vs DP)

#### Explanation
- Each row is a **frontier model** (`i`) from Fairlearn’s `GridSearch` showing a distinct accuracy–fairness trade-off.
- Relative to the RF baseline (**Acc 0.945**, **DP 0.0280**, **EO 0.0709**), two **excellent** candidates emerge.

#### Metrics overview

| Constraint | i   | Accuracy | DP diff | EO diff | Notes |
|------------|-----|:--------:|:-------:|:-------:|-------|
| **EO**     | 5   | **0.965** | **0.0127** | **0.0188** | **Best EO** and accuracy; markedly better than baseline on all metrics |
| **EO**     | 16  | 0.845    | 0.5779  | 0.9667  | Degenerate (gender=0 is never predicted positive: TPR = 0, selection rate = 0); **avoid** |
| **DP**     | 5   | 0.845    | 0.5649  | 0.9556  | Degenerate (group collapse); **avoid** |
| **DP**     | 16  | **0.950** | **0.0040** | **0.0375** | **Near-parity DP** with low EO and high accuracy |

#### Interpretation
- **GS (EO) `i=5`** is the strongest **error-rate parity** option: **EO ~0.019**, **DP ~0.013**, and **Acc 0.965** (↑ vs baseline).
- **GS (DP) `i=16`** is the best **selection-parity** option: **DP ~0.004** with **low EO (~0.038)** and **Acc 0.950** (≈ baseline).
- The other two candidates are **pathological** (extreme disparities) and should not be deployed.

**Summary:**  
- Prioritize **Equalized Odds** → **choose `i=5` (EO)**.  
- Prioritize **Demographic Parity** (while keeping EO low) → **choose `i=16` (DP)**.
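
Degenerate candidates of this kind can be flagged automatically before fairness gaps are compared. A minimal sketch, assuming a candidate counts as collapsed when any group's selection rate is exactly 0 or 1 (an illustrative rule, not something Fairlearn provides), applied to the two EO-run per-group tables above:

```python
# Per-group selection rates copied from the RF + GS (EO) output.
selection_rates = {
    5:  {0: 0.565217, 1: 0.577922},
    16: {0: 0.000000, 1: 0.577922},
}

def is_collapsed(by_group):
    """Flag a model that predicts a single class for at least one group."""
    return any(rate in (0.0, 1.0) for rate in by_group.values())

flags = {i: is_collapsed(groups) for i, groups in selection_rates.items()}
print(flags)
```

Here `i=16` is flagged because gender=0 has a selection rate of 0, while `i=5` passes.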

---

### Bias Mitigation RF: Post-processing: Threshold Optimizer

In [22]:
from fairlearn.postprocessing import ThresholdOptimizer

# 0) Baseline RF 
rf.fit(X_train_ready, y_train)
y_rf_base = rf.predict(X_test_ready)
m_rf_base = eval_fairness(y_test, y_rf_base, A_test)

print("=== Baseline (Random Forest) ===")
print(m_rf_base["by_group"])
print(f"Accuracy: {m_rf_base['acc']:.4f} | DP diff: {m_rf_base['dp']:.4f} | EO diff: {m_rf_base['eo']:.4f}")

# 1) Post-processing: Equalized Odds 
post_rf_eo = ThresholdOptimizer(
    estimator=rf,
    constraints="equalized_odds",
    predict_method="predict_proba",   
    grid_size=200,
    flip=True
)
post_rf_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_rf_eo = post_rf_eo.predict(X_test_ready, sensitive_features=A_test)
m_rf_eo = eval_fairness(y_test, y_rf_eo, A_test)

print("\n=== RF + Post-processing (Equalized Odds) ===")
print(m_rf_eo["by_group"])
print(f"Accuracy: {m_rf_eo['acc']:.4f} | DP diff: {m_rf_eo['dp']:.4f} | EO diff: {m_rf_eo['eo']:.4f}")

# 2) Post-processing: Demographic Parity 
post_rf_dp = ThresholdOptimizer(
    estimator=rf,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_rf_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_rf_dp = post_rf_dp.predict(X_test_ready, sensitive_features=A_test)
m_rf_dp = eval_fairness(y_test, y_rf_dp, A_test)

print("\n=== RF + Post-processing (Demographic Parity) ===")
print(m_rf_dp["by_group"])
print(f"Accuracy: {m_rf_dp['acc']:.4f} | DP diff: {m_rf_dp['dp']:.4f} | EO diff: {m_rf_dp['eo']:.4f}")

#3) Summary Table
summary_rf_post = pd.DataFrame([
    {"model":"RF Baseline",       "accuracy":m_rf_base["acc"], "dp_diff":m_rf_base["dp"], "eo_diff":m_rf_base["eo"]},
    {"model":"RF + Post (EO)",    "accuracy":m_rf_eo["acc"],   "dp_diff":m_rf_eo["dp"],   "eo_diff":m_rf_eo["eo"]},
    {"model":"RF + Post (DP)",    "accuracy":m_rf_dp["acc"],   "dp_diff":m_rf_dp["dp"],   "eo_diff":m_rf_dp["eo"]},
]).round(4)

print("\n=== Random Forest: Baseline vs Post-processing ===")
print(summary_rf_post)

=== Baseline (Random Forest) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.884615  0.10000  0.884615       0.543478  0.891304
1       0.955556  0.03125  0.955556       0.571429  0.961039
Accuracy: 0.9450 | DP diff: 0.0280 | EO diff: 0.0709

=== RF + Post-processing (Equalized Odds) ===
             TPR     FPR    Recall  SelectionRate  Accuracy
gender                                                     
0       0.884615  0.1000  0.884615       0.543478  0.891304
1       0.955556  0.0625  0.955556       0.584416  0.948052
Accuracy: 0.9350 | DP diff: 0.0409 | EO diff: 0.0709

=== RF + Post-processing (Demographic Parity) ===
             TPR     FPR    Recall  SelectionRate  Accuracy
gender                                                     
0       0.884615  0.1000  0.884615       0.543478  0.891304
1       0.955556  0.0625  0.955556       0.584416  0.948052
Accuracy: 0.9350 | DP diff: 0.0409 | EO

### Random Forest Bias Mitigation (Post-processing)

### Summary

| Model               | Accuracy | DP diff | EO diff | Interpretation                                     |
|---------------------|:--------:|:-------:|:-------:|----------------------------------------------------|
| **RF Baseline**     | 0.9450   | 0.0280  | 0.0709  | High accuracy; DP near parity; EO driven by TPR gap. |
| **RF + Post (EO)**  | 0.9350   | 0.0409  | 0.0709  | **Accuracy ↓**; **DP worsens**; **EO unchanged**.  |
| **RF + Post (DP)**  | 0.9350   | 0.0409  | 0.0709  | Same as EO result → no fairness gain, accuracy ↓.  |

### Key points
- **Selection rates:** group 0 stays at **0.543** while group 1 rises **0.571 → 0.584** ⇒ **DP increases** to **0.0409**.  
- **Error rates:** the **TPR gap** stays at **0.8846 vs 0.9556** (≈ **0.071**) and dominates EO; the **FPR gap** narrows (0.1000 vs **0.0625**, Δ ≈ 0.038, down from ≈ 0.069) only because group 1's FPR rises, so **EO remains 0.0709**.  
- Both post-processing variants produce **identical metrics** on the test set, which suggests they arrive at the same (or equivalent) group thresholds.

**Takeaway:** For this RF, ThresholdOptimizer post-processing **reduces accuracy** and **increases DP** while leaving **EO unchanged**; the **baseline RF** remains preferable under these settings.
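
Why EO stays fixed while the FPR gap moves follows from how the metric is aggregated. Assuming `eval_fairness` reports EO diff as the larger of the TPR gap and the FPR gap (Fairlearn's default behaviour for `equalized_odds_difference`), the sketch below recomputes both gaps from the per-group rows printed above.

```python
# (TPR, FPR, selection rate) per gender, copied from the outputs above.
baseline = {0: (0.884615, 0.10000, 0.543478), 1: (0.955556, 0.03125, 0.571429)}
post_eo = {0: (0.884615, 0.10000, 0.543478), 1: (0.955556, 0.06250, 0.584416)}

def gaps(by_group):
    """Return the between-group TPR gap, FPR gap, and selection-rate (DP) gap."""
    columns = list(zip(*by_group.values()))
    tpr_gap, fpr_gap, dp_gap = (max(col) - min(col) for col in columns)
    return tpr_gap, fpr_gap, dp_gap

for name, table in [("baseline", baseline), ("post (EO)", post_eo)]:
    tpr_gap, fpr_gap, dp_gap = gaps(table)
    eo = max(tpr_gap, fpr_gap)  # EO diff is the worse of the two gaps
    print(f"{name:10s} TPR gap={tpr_gap:.4f} FPR gap={fpr_gap:.4f} "
          f"EO={eo:.4f} DP={dp_gap:.4f}")
```

The TPR gap (about 0.071) is the larger term in both cases, so EO does not change even though group 1's FPR doubles from 0.03125 to 0.0625.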

---

### Deep Learning - Multi-layer Perceptron

In [23]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [24]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # two hidden layers: 64 units, then 32
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # smaller step can stabilize
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.91
Precision: 0.9224137931034483
Recall   : 0.9224137931034483
F1 Score : 0.9224137931034483

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.89      0.89        84
           1       0.92      0.92      0.92       116

    accuracy                           0.91       200
   macro avg       0.91      0.91      0.91       200
weighted avg       0.91      0.91      0.91       200

Confusion Matrix:
 [[ 75   9]
 [  9 107]]




### Bias Mitigation MLP: In-processing: Exponentiated Gradient

In [25]:
from sklearn.neural_network import MLPClassifier
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, DemographicParity
from sklearn.base import clone
import pandas as pd

# 0) Baseline MLP (seeded for reproducibility)
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,
    alpha=1e-3,
    batch_size=32,
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.15,
    n_iter_no_change=25,
    tol=1e-4,
    random_state=42
)

mlp.fit(X_train_ready, y_train)

y_pred_mlp_base = mlp.predict(X_test_ready)
m_mlp_base = eval_fairness(y_test, y_pred_mlp_base, A_test)

print("=== Baseline (MLP) ===")
print(m_mlp_base["by_group"])
print(f"Accuracy: {m_mlp_base['acc']:.4f} | DP diff: {m_mlp_base['dp']:.4f} | EO diff: {m_mlp_base['eo']:.4f}")

# 1) EG with Equalized Odds
eg_eo_mlp = ExponentiatedGradient(
    estimator=clone(mlp),   # inherits random_state=42
    constraints=EqualizedOdds(),
    eps=0.01,
    max_iter=50
)
eg_eo_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)

# Prefer predict(..., random_state=42) if supported; otherwise fall back without global seeds
try:
    y_pred_mlp_eo = eg_eo_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_mlp_eo = eg_eo_mlp.predict(X_test_ready)

m_mlp_eo = eval_fairness(y_test, y_pred_mlp_eo, A_test)

print("\n=== In-processing MLP: EG (Equalized Odds) ===")
print(m_mlp_eo["by_group"])
print(f"Accuracy: {m_mlp_eo['acc']:.4f} | DP diff: {m_mlp_eo['dp']:.4f} | EO diff: {m_mlp_eo['eo']:.4f}")

# 2) EG with Demographic Parity
eg_dp_mlp = ExponentiatedGradient(
    estimator=clone(mlp),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)

try:
    y_pred_mlp_dp = eg_dp_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_mlp_dp = eg_dp_mlp.predict(X_test_ready)

m_mlp_dp = eval_fairness(y_test, y_pred_mlp_dp, A_test)

print("\n=== In-processing MLP: EG (Demographic Parity) ===")
print(m_mlp_dp["by_group"])
print(f"Accuracy: {m_mlp_dp['acc']:.4f} | DP diff: {m_mlp_dp['dp']:.4f} | EO diff: {m_mlp_dp['eo']:.4f}")

# 3) Summary Table
summary_mlp = pd.DataFrame([
    {"model":"MLP Baseline",  "accuracy":m_mlp_base["acc"], "dp_diff":m_mlp_base["dp"], "eo_diff":m_mlp_base["eo"]},
    {"model":"MLP + EG (EO)", "accuracy":m_mlp_eo["acc"],   "dp_diff":m_mlp_eo["dp"],   "eo_diff":m_mlp_eo["eo"]},
    {"model":"MLP + EG (DP)", "accuracy":m_mlp_dp["acc"],   "dp_diff":m_mlp_dp["dp"],   "eo_diff":m_mlp_dp["eo"]},
]).round(4)

print("\n=== MLP: Baseline vs In-processing (EG) ===")
print(summary_mlp)

=== Baseline (MLP) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.846154  0.15000  0.846154       0.543478  0.847826
1       0.944444  0.09375  0.944444       0.590909  0.928571
Accuracy: 0.9100 | DP diff: 0.0474 | EO diff: 0.0983

=== In-processing MLP: EG (Equalized Odds) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.769231  0.10000  0.769231       0.478261  0.826087
1       0.933333  0.09375  0.933333       0.584416  0.922078
Accuracy: 0.9000 | DP diff: 0.1062 | EO diff: 0.1641

=== In-processing MLP: EG (Demographic Parity) ===
             TPR     FPR    Recall  SelectionRate  Accuracy
gender                                                     
0       0.846154  0.1500  0.846154       0.543478  0.847826
1       0.933333  0.0625  0.933333       0.571429  0.935065
Accuracy: 0.9150 | DP diff: 0.0280 | EO dif

### MLP In-Processing Bias Mitigation Results

#### Summary

| Model             | Accuracy | DP diff | EO diff | Interpretation |
|-------------------|:--------:|:-------:|:-------:|----------------|
| **MLP Baseline**  | 0.9100   | 0.0474  | 0.0983  | Small selection-rate gap; moderate EO gap driven mainly by TPR. |
| **MLP + EG (EO)** | 0.9000   | 0.1062  | 0.1641  | **Worse than baseline:** accuracy ↓; DP and EO both increase. |
| **MLP + EG (DP)** | 0.9150   | 0.0280  | 0.0875  | **Best of the three:** accuracy ↑; DP improves to near parity; EO improves slightly. |

#### Interpretation
- The baseline MLP already shows **limited gender disparity** in selection rates (DP ≈ 0.05) but a **moderate error-rate gap** (EO ≈ 0.10) from higher TPR for gender=1.
- **EG with an Equalized-Odds constraint** under these settings **degrades fairness and accuracy**, increasing both DP and EO.
- **EG with a Demographic-Parity constraint** yields the **most favorable trade-off**: it **raises accuracy**, **reduces DP to 0.028** (more balanced alerts across genders), and **slightly lowers EO** (more similar TPR/FPR), though a residual error-rate gap remains.
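
To make the per-group numbers concrete, the sketch below rebuilds the baseline MLP metrics from per-group confusion counts. The counts are inferred from the printed rates and the 200-row test set (46 rows with gender=0, 154 with gender=1); they are an assumption and do not appear in the notebook output.

```python
# Inferred confusion counts per gender for the baseline MLP (assumption).
counts = {
    0: {"tp": 22, "fn": 4, "fp": 3, "tn": 17},   # 46 test rows
    1: {"tp": 85, "fn": 5, "fp": 6, "tn": 58},   # 154 test rows
}

def metrics(c):
    """Per-group rates derived from one group's confusion counts."""
    n = c["tp"] + c["fn"] + c["fp"] + c["tn"]
    return {
        "tpr": c["tp"] / (c["tp"] + c["fn"]),
        "fpr": c["fp"] / (c["fp"] + c["tn"]),
        "selection_rate": (c["tp"] + c["fp"]) / n,
        "accuracy": (c["tp"] + c["tn"]) / n,
    }

by_group = {g: metrics(c) for g, c in counts.items()}
dp = abs(by_group[0]["selection_rate"] - by_group[1]["selection_rate"])
eo = max(abs(by_group[0]["tpr"] - by_group[1]["tpr"]),
         abs(by_group[0]["fpr"] - by_group[1]["fpr"]))
total = sum(sum(c.values()) for c in counts.values())
acc = sum(c["tp"] + c["tn"] for c in counts.values()) / total

for g, m in by_group.items():
    print(g, {k: round(v, 6) for k, v in m.items()})
print(f"Accuracy: {acc:.4f} | DP diff: {dp:.4f} | EO diff: {eo:.4f}")
```

With only 26 positive cases in group 0 under these counts, one additional missed case moves that group's TPR by about 0.038, so gaps of a few hundredths correspond to one or two patients.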

---

### Bias Mitigation MLP: In-processing: Grid Search

In [26]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

# 1) GridSearch with Equalized Odds (MLP)
gs_eo_mlp = GridSearch(
    estimator=clone(mlp),                 # unfitted clone of your MLP (inherits random_state=42)
    constraints=EqualizedOdds(),
    selection_rule="tradeoff_optimization",  
    constraint_weight=0.5,                   # trade-off weight (0..1); tune as needed
    grid_size=15                             
)
gs_eo_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)
try:
    y_pred_gs_eo_mlp = gs_eo_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_gs_eo_mlp = gs_eo_mlp.predict(X_test_ready)

m_gs_eo_mlp = eval_fairness(y_test, y_pred_gs_eo_mlp, A_test)
print("\n=== In-processing MLP: GridSearch (Equalized Odds) ===")
print(m_gs_eo_mlp["by_group"])
print(f"Accuracy: {m_gs_eo_mlp['acc']:.4f} | DP diff: {m_gs_eo_mlp['dp']:.4f} | EO diff: {m_gs_eo_mlp['eo']:.4f}")

# 2) GridSearch with Demographic Parity (MLP)
gs_dp_mlp = GridSearch(
    estimator=clone(mlp),
    constraints=DemographicParity(),
    selection_rule="tradeoff_optimization",
    constraint_weight=0.5,
    grid_size=15
)
gs_dp_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)
try:
    y_pred_gs_dp_mlp = gs_dp_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_gs_dp_mlp = gs_dp_mlp.predict(X_test_ready)

m_gs_dp_mlp = eval_fairness(y_test, y_pred_gs_dp_mlp, A_test)
print("\n=== In-processing MLP: GridSearch (Demographic Parity) ===")
print(m_gs_dp_mlp["by_group"])
print(f"Accuracy: {m_gs_dp_mlp['acc']:.4f} | DP diff: {m_gs_dp_mlp['dp']:.4f} | EO diff: {m_gs_dp_mlp['eo']:.4f}")

# 3) Compare with existing MLP runs (baseline + EG)
summary_mlp = pd.concat([
    summary_mlp,
    pd.DataFrame([
        {"model":"MLP + GS (EO)", "accuracy":m_gs_eo_mlp["acc"], "dp_diff":m_gs_eo_mlp["dp"], "eo_diff":m_gs_eo_mlp["eo"]},
        {"model":"MLP + GS (DP)", "accuracy":m_gs_dp_mlp["acc"], "dp_diff":m_gs_dp_mlp["dp"], "eo_diff":m_gs_dp_mlp["eo"]},
    ]).round(4)
], ignore_index=True)

print("\n=== MLP: Baseline vs EG vs GS ===")
print(summary_mlp)


=== In-processing MLP: GridSearch (Equalized Odds) ===
             TPR     FPR    Recall  SelectionRate  Accuracy
gender                                                     
0       0.769231  0.1000  0.769231       0.478261  0.826087
1       0.911111  0.0625  0.911111       0.558442  0.922078
Accuracy: 0.9000 | DP diff: 0.0802 | EO diff: 0.1419

=== In-processing MLP: GridSearch (Demographic Parity) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.807692  0.150000  0.807692       0.521739  0.826087
1       0.933333  0.078125  0.933333       0.577922  0.928571
Accuracy: 0.9050 | DP diff: 0.0562 | EO diff: 0.1256

=== MLP: Baseline vs EG vs GS ===
           model  accuracy  dp_diff  eo_diff
0   MLP Baseline     0.910   0.0474   0.0983
1  MLP + EG (EO)     0.900   0.1062   0.1641
2  MLP + EG (DP)     0.915   0.0280   0.0875
3  MLP + GS (EO)     0.900   0.0802   0.1419
4  MLP + GS (DP)     0.905   0

### MLP: Exponentiated Gradient vs GridSearch (In-Processing)

#### Comparative table (vs. Baseline)
| Model              | Accuracy | ΔAcc (pp) | DP diff |   ΔDP   | EO diff |   ΔEO   | Notes |
|--------------------|:--------:|:---------:|:-------:|:-------:|:-------:|:-------:|-------|
| **Baseline (MLP)** | 0.9100   |     –     | 0.0474  |    –    | 0.0983  |    –    | Reference |
| **EG (EO)**        | 0.9000   |  **−1.0** | 0.1062  | +0.0588 | 0.1641  | +0.0658 | Accuracy drops; DP and EO both worsen |
| **EG (DP)**        | 0.9150   |  **+0.5** | 0.0280  | −0.0194 | 0.0875  | −0.0108 | **Best trade-off:** Acc ↑, DP ↓ to near parity, EO ↓ slightly |
| **GS (EO)**        | 0.9000   |  **−1.0** | 0.0802  | +0.0328 | 0.1419  | +0.0436 | Degrades both fairness metrics and accuracy |
| **GS (DP)**        | 0.9050   |  **−0.5** | 0.0562  | +0.0088 | 0.1256  | +0.0273 | Small accuracy dip; DP and EO both worsen |

#### Interpretation
- The baseline MLP shows a **small selection-rate gap** (DP ≈ 0.05) and a **moderate error-rate gap** (EO ≈ 0.10).
- **EG with a DP constraint** is the only configuration that improves both performance **and** fairness: it **raises accuracy**, **reduces DP** to ~0.03 (closer selection parity), and **slightly lowers EO**.
- **EG (EO)** and both **GridSearch** variants **worsen EO**, **increase DP**, and **reduce accuracy** (by 0.5 to 1.0 pp), indicating that neither the constraints nor the search path found better fairness–utility points for this model.
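
The Δ columns in the comparative table are plain differences from the baseline row, with accuracy expressed in percentage points. A short sketch that reproduces them from the summary values (the dictionary layout is illustrative):

```python
# (accuracy, dp_diff, eo_diff) taken from the MLP comparative table.
results = {
    "Baseline": (0.9100, 0.0474, 0.0983),
    "EG (EO)":  (0.9000, 0.1062, 0.1641),
    "EG (DP)":  (0.9150, 0.0280, 0.0875),
    "GS (EO)":  (0.9000, 0.0802, 0.1419),
    "GS (DP)":  (0.9050, 0.0562, 0.1256),
}

base_acc, base_dp, base_eo = results["Baseline"]

deltas = {}
for name, (acc, dp, eo) in results.items():
    if name == "Baseline":
        continue
    deltas[name] = (
        round((acc - base_acc) * 100, 1),  # accuracy change in percentage points
        round(dp - base_dp, 4),            # positive = larger selection-rate gap
        round(eo - base_eo, 4),            # positive = larger error-rate gap
    )

for name, (d_acc, d_dp, d_eo) in deltas.items():
    print(f"{name:8s} dAcc={d_acc:+.1f} pp  dDP={d_dp:+.4f}  dEO={d_eo:+.4f}")

# Configurations that improve all three metrics at once.
improves_all = [n for n, (a, d, e) in deltas.items() if a > 0 and d < 0 and e < 0]
print(improves_all)
```

Only `EG (DP)` passes the final filter, which matches the table's "best trade-off" note.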

---

### Bias mitigation for MLP: Post-processing with ThresholdOptimizer

In [27]:
from fairlearn.postprocessing import ThresholdOptimizer
import pandas as pd

# 0) Baseline MLP
mlp.fit(X_train_ready, y_train)
y_mlp_base = mlp.predict(X_test_ready)
m_mlp_base = eval_fairness(y_test, y_mlp_base, A_test)

print("=== Baseline (MLP) ===")
print(m_mlp_base["by_group"])
print(f"Accuracy: {m_mlp_base['acc']:.4f} | DP diff: {m_mlp_base['dp']:.4f} | EO diff: {m_mlp_base['eo']:.4f}")

# 1) Post-processing: Equalized Odds
post_mlp_eo = ThresholdOptimizer(
    estimator=mlp,
    constraints="equalized_odds",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_mlp_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_mlp_eo = post_mlp_eo.predict(X_test_ready, sensitive_features=A_test)
m_mlp_eo = eval_fairness(y_test, y_mlp_eo, A_test)

print("\n=== MLP + Post-processing (Equalized Odds) ===")
print(m_mlp_eo["by_group"])
print(f"Accuracy: {m_mlp_eo['acc']:.4f} | DP diff: {m_mlp_eo['dp']:.4f} | EO diff: {m_mlp_eo['eo']:.4f}")

# 2) Post-processing: Demographic Parity
post_mlp_dp = ThresholdOptimizer(
    estimator=mlp,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_mlp_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_mlp_dp = post_mlp_dp.predict(X_test_ready, sensitive_features=A_test)
m_mlp_dp = eval_fairness(y_test, y_mlp_dp, A_test)

print("\n=== MLP + Post-processing (Demographic Parity) ===")
print(m_mlp_dp["by_group"])
print(f"Accuracy: {m_mlp_dp['acc']:.4f} | DP diff: {m_mlp_dp['dp']:.4f} | EO diff: {m_mlp_dp['eo']:.4f}")

# 3) Summary Table
summary_mlp_post = pd.DataFrame([
    {"model":"MLP Baseline",       "accuracy":m_mlp_base["acc"], "dp_diff":m_mlp_base["dp"], "eo_diff":m_mlp_base["eo"]},
    {"model":"MLP + Post (EO)",    "accuracy":m_mlp_eo["acc"],   "dp_diff":m_mlp_eo["dp"],   "eo_diff":m_mlp_eo["eo"]},
    {"model":"MLP + Post (DP)",    "accuracy":m_mlp_dp["acc"],   "dp_diff":m_mlp_dp["dp"],   "eo_diff":m_mlp_dp["eo"]},
]).round(4)

print("\n=== MLP: Baseline vs Post-processing ===")
print(summary_mlp_post)

=== Baseline (MLP) ===
             TPR      FPR    Recall  SelectionRate  Accuracy
gender                                                      
0       0.846154  0.15000  0.846154       0.543478  0.847826
1       0.944444  0.09375  0.944444       0.590909  0.928571
Accuracy: 0.9100 | DP diff: 0.0474 | EO diff: 0.0983

=== MLP + Post-processing (Equalized Odds) ===
             TPR    FPR    Recall  SelectionRate  Accuracy
gender                                                    
0       0.807692  0.150  0.807692       0.521739  0.826087
1       0.788889  0.125  0.788889       0.512987  0.824675
Accuracy: 0.8250 | DP diff: 0.0088 | EO diff: 0.0250

=== MLP + Post-processing (Demographic Parity) ===
             TPR       FPR    Recall  SelectionRate  Accuracy
gender                                                       
0       0.846154  0.150000  0.846154       0.543478  0.847826
1       0.944444  0.078125  0.944444       0.584416  0.935065
Accuracy: 0.9150 | DP diff: 0.0409 | EO diff: 0.0983

### MLP — Post-Processing: Threshold Optimizer

#### Summary

| Model               | Accuracy | DP diff | EO diff | Notes |
|---------------------|:--------:|:-------:|:-------:|-------|
| **Baseline (MLP)**  | 0.9100   | 0.0474  | 0.0983  | Low DP gap; small EO gap. |
| **Post (EO)**       | 0.8250   | 0.0088  | 0.0250  | **Large fairness gains** (DP & EO ↓ markedly) but **accuracy −8.5 pp**. |
| **Post (DP)**       | 0.9150   | 0.0409  | 0.0983  | **Accuracy +0.5 pp**; **DP improves slightly**; **EO unchanged**. |

#### Interpretation
- **Equalized Odds post-processing** selects group-specific thresholds that tightly align **TPR/FPR** across genders (**EO ↓ from 0.098 → 0.025**) and nearly equalize selection rates (**DP ↓ from 0.047 → 0.009**). The trade-off is a **notable utility loss** (accuracy drops to **0.825**): parity is reached mainly by lowering the male TPR from 0.944 to 0.789 (female TPR 0.846 → 0.808), so more true CVD cases are missed overall.
- **Demographic Parity post-processing** nudges selection rates closer (**DP 0.041**) with a **small accuracy uptick** to **0.915**, but **does not reduce error-rate disparity** (EO remains ≈ **0.098**), since TPR/FPR gaps are largely preserved.
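
To make the idea of group-specific thresholds concrete, here is a deliberately simplified sketch. It is not Fairlearn's algorithm: `ThresholdOptimizer` chooses its (possibly randomized) thresholds by optimizing on the training data, whereas the scores and thresholds below are made-up toy values and the helper names are hypothetical.

```python
def predict_with_group_thresholds(scores, groups, thresholds):
    """Label a sample positive when its score reaches its own group's threshold."""
    return [int(s >= thresholds[g]) for s, g in zip(scores, groups)]


def selection_rates(preds, groups):
    """Fraction of positive predictions per group."""
    rates = {}
    for g in sorted(set(groups)):
        group_preds = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(group_preds) / len(group_preds)
    return rates


# Toy scores (assumed values, NOT taken from the notebook): 0 = female, 1 = male
scores = [0.20, 0.40, 0.45, 0.70, 0.30, 0.55, 0.60, 0.80]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

# One shared threshold for everyone
single = predict_with_group_thresholds(scores, groups, {0: 0.50, 1: 0.50})
# Hand-picked group-specific thresholds that equalize selection rates
per_group = predict_with_group_thresholds(scores, groups, {0: 0.42, 1: 0.58})

print("shared threshold    :", selection_rates(single, groups))
print("per-group thresholds:", selection_rates(per_group, groups))
```

In this toy case the shared threshold selects 25% of group 0 and 75% of group 1, while the per-group thresholds select 50% of each. Equalizing selection rates this way says nothing about TPR/FPR, which is why a demographic parity constraint can leave EO unchanged.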

#### Quick read (selection rates)
- **Baseline:** SR₍F₎ **0.543** vs SR₍M₎ **0.591** → **DP 0.047**.  
- **Post (EO):** SR₍F₎ **0.522** vs SR₍M₎ **0.513** → **DP 0.009** (near parity).  
- **Post (DP):** SR₍F₎ **0.543** vs SR₍M₎ **0.584** → **DP 0.041** (slight improvement).

**Takeaway:** If **maximizing fairness** (both outcome parity and aligned error rates) is paramount, **Post (EO)** achieves the best parity but at a **substantial accuracy cost**. If **maintaining accuracy** is the priority while keeping disparities low, the **baseline/DP-post** results are preferable, noting that **EO remains** at baseline levels.

---

# Overall Bias-Mitigation Comparison (Fairlearn) — Gender Bias in CVD Prediction

**Metric keys:**  
- **DP diff** (demographic parity difference): gap between the highest and lowest group selection rates (lower = more equal outcomes).  
- **EO diff** (equalized odds difference): the larger of the TPR gap and the FPR gap across genders (lower = more equal error rates).  
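
As a worked example, both metrics can be reproduced from the per-group rates printed earlier in the notebook. The sketch assumes the usual definitions (DP diff as the max-minus-min selection rate, EO diff as the larger of the TPR and FPR gaps); the function names are illustrative, and the notebook's own `eval_fairness` helper may compute them differently, although the reported numbers are consistent with these definitions.

```python
def gap(values):
    """Difference between the largest and smallest value across groups."""
    values = list(values)
    return max(values) - min(values)


def dp_diff(by_group):
    """Demographic parity difference: gap in selection rates."""
    return gap(g["sr"] for g in by_group.values())


def eo_diff(by_group):
    """Equalized odds difference: the larger of the TPR gap and the FPR gap."""
    tpr_gap = gap(g["tpr"] for g in by_group.values())
    fpr_gap = gap(g["fpr"] for g in by_group.values())
    return max(tpr_gap, fpr_gap)


# Per-group rates copied from the printed MLP outputs (0 = female, 1 = male)
mlp_baseline = {
    0: {"tpr": 0.846154, "fpr": 0.15000, "sr": 0.543478},
    1: {"tpr": 0.944444, "fpr": 0.09375, "sr": 0.590909},
}
mlp_post_eo = {
    0: {"tpr": 0.807692, "fpr": 0.150, "sr": 0.521739},
    1: {"tpr": 0.788889, "fpr": 0.125, "sr": 0.512987},
}

for name, rates in [("MLP baseline", mlp_baseline), ("MLP + Post (EO)", mlp_post_eo)]:
    print(f"{name}: DP diff = {dp_diff(rates):.4f} | EO diff = {eo_diff(rates):.4f}")
```

For the baseline the TPR gap (0.0983) is the larger of the two gaps, while after EO post-processing the FPR gap (0.0250) sets the EO diff.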

---

## Aggregated Summary of Bias Mitigation for all models

| Model / Technique                          | Accuracy | DP diff | EO diff | Verdict |
|--------------------------------------------|:--------:|:-------:|:-------:|---------|
| **PCA+KNN Baseline**                       | 0.9150   | 0.0302  | 0.1187  | Small DP; EO moderate |
| **PCA+KNN + CorrelationRemover**           | 0.9050   | **0.0020** | 0.1688  | **Best DP** for KNN; EO worsens; small acc drop |
| **KNN + Post (DP/EO)**                     | 0.9150 / 0.9150 | 0.0302 / 0.0302 | 0.1187 / 0.1187 | No effect (0% flips) |
| **DT Baseline (tuned)**                    | 0.9050   | 0.0387  | 0.0406  | DP small; EO small–moderate |
| **DT + EG (Equalized Odds)**               | 0.9000   | 0.0169  | **0.0103** | **Best EO** for DT; slight acc cost |
| **DT + Post (EO)**                         | 0.8950   | **0.0025** | 0.0250  | Lowest DP for DT; EO improves; small acc drop |
| **DT + Post (DP) / GS (EO)**               | 0.8900 / 0.9050 | 0.0265 / 0.0387 | 0.1094 / 0.0406 | Post(DP) worsens EO; GS(EO) = baseline |
| **DT + EG (DP) / GS (DP)**                 | 0.9300 / 0.9150 | 0.0265 / 0.0692 | 0.1063 / 0.0983 | Higher acc but EO worse than baseline; GS (DP) also worsens DP |
| **RF Baseline**                            | 0.9450   | 0.0280  | 0.0709  | DP small; EO moderate (TPR/FPR gaps) |
| **RF + EG (EO/DP)**                        | 0.9450   | 0.0280  | 0.0709  | No effect (constraints not binding) |
| **RF + GridSearch (DP, i=16)**             | 0.9500   | **0.0040** | 0.0375  | **Near-parity DP** with low EO; high acc |
| **RF + GridSearch (EO, i=5)**              | **0.9650** | 0.0127  | **0.0188** | **Best RF overall**: highest acc, very low EO & DP |
| **RF + Post (EO/DP)**                      | 0.9350   | 0.0409  | 0.0709  | Acc ↓, DP worse, EO unchanged |
| **MLP Baseline**                           | 0.9100   | 0.0474  | 0.0983  | Near parity |
| **MLP + EG (EO)**                          | 0.9000   | 0.1062  | 0.1641  | EO & DP worsen; acc ↓ |
| **MLP + EG (DP)**                          | **0.9150** | **0.0280** | 0.0875  | **Best MLP trade-off**: acc ↑, DP ↓, EO ↓ slightly |
| **MLP + GS (EO / DP)**                     | 0.9000 / 0.9050 | 0.0802 / 0.0562 | 0.1419 / 0.1256 | Both worsen fairness; acc ↓ |
| **MLP + Post (EO)**                        | 0.8250   | **0.0088** | **0.0250** | Large fairness gains but **−8.5 pp** accuracy |
| **MLP + Post (DP)**                        | 0.9150   | 0.0409  | 0.0983  | Acc +0.5 pp; DP slightly better; EO unchanged |

---

## What worked

- **DT + EG (EO):** Equalized Odds constraint binds effectively, **driving EO to ~0.01** with only a **0.5 pp accuracy drop**, while DP remains small.  
- **RF + GridSearch:** Frontier models provide **joint gains**—  
  - **EO-focused (i=5):** **Acc 0.965**, **EO 0.0188**, **DP 0.0127**.  
  - **DP-focused (i=16):** **Acc 0.950**, **DP 0.0040**, **EO 0.0375**.  
  These are strictly better operating points than the RF baseline.  
- **KNN + CorrelationRemover:** Preprocessing decorrelation achieves **near-zero DP** (0.002) when thresholding cannot move predictions.
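
The claim that the RF GridSearch models are strictly better than the RF baseline can be checked as a dominance test over (accuracy, DP diff, EO diff). This is a minimal sketch using the values from the aggregated table; `dominates` is a hypothetical helper, not a Fairlearn function.

```python
def dominates(a, b):
    """True if model a is at least as good as b on every metric and strictly
    better on at least one. Tuples are (accuracy, dp_diff, eo_diff):
    higher accuracy is better, lower dp_diff and eo_diff are better."""
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return at_least_as_good and strictly_better


# Values copied from the aggregated summary table
rf_baseline = (0.9450, 0.0280, 0.0709)
rf_gs_eo = (0.9650, 0.0127, 0.0188)   # GridSearch, EO constraint, i=5
rf_gs_dp = (0.9500, 0.0040, 0.0375)   # GridSearch, DP constraint, i=16

print("GS (EO) dominates baseline:", dominates(rf_gs_eo, rf_baseline))
print("GS (DP) dominates baseline:", dominates(rf_gs_dp, rf_baseline))
print("GS (EO) dominates GS (DP): ", dominates(rf_gs_eo, rf_gs_dp))
```

Both GridSearch models dominate the baseline, but neither dominates the other: the EO-focused model has higher accuracy and lower EO, while the DP-focused model has lower DP. Choosing between them is therefore a policy decision, not a purely technical one.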

## What did not help

- **Post-processing for KNN:** **0% label flips**, both with and without CorrelationRemover; KNN’s coarse scores leave the threshold optimizer little room to move predictions.  
- **RF + EG (EO/DP):** No movement: accuracy, DP, and EO are identical to the baseline, which suggests the constraints were not binding and the optimizer returned the unconstrained model.  
- **MLP (GS / EG-EO):** All three variants **worsen DP and EO** and lower accuracy; **Post (EO)** greatly improves parity but at an **8.5 pp accuracy cost**, which is likely unacceptable for clinical use.

---

## Practical implications for gender bias in CVD prediction

- **Clinical safety (error-rate parity):** Prefer **DT + EG (EO)** or **RF + GS (EO, i=5)** to minimize **EO** (align **TPR/FPR**), reducing the risk that one gender faces more missed CVD cases or false alarms.  
- **Access parity (selection-rate parity):** If policy mandates equal alerting, use **RF + GS (DP, i=16)** (DP ≈ 0.004, with accuracy above the RF baseline) or **DT + Post (EO)** (DP ≈ 0.0025, at a 1.0 pp accuracy cost relative to the DT baseline).  
- **KNN with CR** is viable if **DP parity** is paramount, but note **EO increases**; additional steps would be needed to balance error rates.  
- **Avoid** configurations that inflate **EO** (e.g., **MLP + EG(EO)** or **DT/MLP GS variants** that worsened gaps), as unequal error burdens raise clinical safety concerns.

---

## Summary

1. **Primary choices (fairness + accuracy):**  
   - **Random Forest + GridSearch (EO, i=5)** for the **highest accuracy** (0.965) with **very low EO** (0.0188), or  
   - **Random Forest + GridSearch (DP, i=16)** for **near-zero DP** with low EO and high accuracy.  
   - **Decision Tree + EG (EO)** is the best **interpretable** alternative and has the **lowest EO overall** (0.0103).
2. **If retaining KNN:** apply **CorrelationRemover** to achieve **DP parity**; post-processing won’t move it.  
3. **For MLP:** the **baseline** or **EG (DP)** is acceptable; be cautious with **Post (EO)** due to large utility loss.  
4. **Selection policy:** set explicit gates (e.g., **EO ≤ 0.05** and **DP ≤ 0.03**) and pick models that satisfy both on a held-out set.
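
The gating rule in point 4 can be sketched as a simple filter. The thresholds are the example values from the text, and the metrics are the test-set numbers from the aggregated table, used here only for illustration; in practice the gates should be applied to a separate held-out validation split. Names such as `passes_gates` are hypothetical.

```python
# (accuracy, dp_diff, eo_diff) copied from the aggregated summary table.
# Combined rows are split into one entry per configuration.
candidates = {
    "PCA+KNN Baseline":             (0.9150, 0.0302, 0.1187),
    "PCA+KNN + CorrelationRemover": (0.9050, 0.0020, 0.1688),
    "DT Baseline":                  (0.9050, 0.0387, 0.0406),
    "DT + EG (EO)":                 (0.9000, 0.0169, 0.0103),
    "DT + EG (DP)":                 (0.9300, 0.0265, 0.1063),
    "DT + GS (EO)":                 (0.9050, 0.0387, 0.0406),
    "DT + GS (DP)":                 (0.9150, 0.0692, 0.0983),
    "DT + Post (EO)":               (0.8950, 0.0025, 0.0250),
    "DT + Post (DP)":               (0.8900, 0.0265, 0.1094),
    "RF Baseline":                  (0.9450, 0.0280, 0.0709),
    "RF + GS (DP, i=16)":           (0.9500, 0.0040, 0.0375),
    "RF + GS (EO, i=5)":            (0.9650, 0.0127, 0.0188),
    "RF + Post (EO/DP)":            (0.9350, 0.0409, 0.0709),
    "MLP Baseline":                 (0.9100, 0.0474, 0.0983),
    "MLP + EG (EO)":                (0.9000, 0.1062, 0.1641),
    "MLP + EG (DP)":                (0.9150, 0.0280, 0.0875),
    "MLP + GS (EO)":                (0.9000, 0.0802, 0.1419),
    "MLP + GS (DP)":                (0.9050, 0.0562, 0.1256),
    "MLP + Post (EO)":              (0.8250, 0.0088, 0.0250),
    "MLP + Post (DP)":              (0.9150, 0.0409, 0.0983),
}

EO_GATE = 0.05  # example gate from the text
DP_GATE = 0.03  # example gate from the text


def passes_gates(metrics, dp_gate=DP_GATE, eo_gate=EO_GATE):
    """A model passes only if BOTH disparity metrics are within their gates."""
    _, dp, eo = metrics
    return dp <= dp_gate and eo <= eo_gate


# Models that satisfy both gates, ranked by accuracy (highest first)
shortlist = sorted(
    (name for name, m in candidates.items() if passes_gates(m)),
    key=lambda name: candidates[name][0],
    reverse=True,
)

for name in shortlist:
    acc, dp, eo = candidates[name]
    print(f"{name:20s} acc={acc:.4f} dp={dp:.4f} eo={eo:.4f}")
```

With these example gates, five configurations pass, led by the two RF GridSearch models. The MLP baseline and MLP + EG (DP) do not pass the EO gate, so whether they count as acceptable depends on how strict the chosen gates are.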

**Conclusion:** In this CVD context, **in-processing Equalized Odds for DT** and **frontier models from RF GridSearch** provide the **most reliable reductions in gender error-rate disparities** at little or no accuracy cost (DT: −0.5 pp; RF: up to +2.0 pp), thereby lowering the risk of **gendered underdiagnosis or over-alerting**.

---