## Bias Mitigation using Fairlearn - Heart Failure Prediction Dataset (Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)

In [1]:
#load preprocessed data 
import pandas as pd
train_df = pd.read_csv("./data_subsets/train_75M_25F.csv")

X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

#check out the data
train_df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,61,1,3,146.0,241.0,0,0,148.0,1,3.0,0,1
1,52,1,1,120.0,284.0,0,0,118.0,0,0.0,2,0
2,48,0,3,150.0,227.0,0,0,130.0,1,1.0,1,0
3,49,1,3,128.0,212.0,0,0,96.0,1,0.0,1,1
4,56,1,3,120.0,236.0,0,1,148.0,0,0.0,1,1


In [2]:
# Ensure y_test is a Series (not a DataFrame with 1 column)
y_test = y_test.squeeze("columns")

# Define target and sensitive column names
TARGET = "HeartDisease"
SENSITIVE = "Sex"

# Split train into X/y
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

# Extract sensitive features separately
A_train = X_train[SENSITIVE].astype(int)
A_test  = X_test[SENSITIVE].astype(int)


In [3]:
TARGET = "HeartDisease"
SENSITIVE = "Sex"   # 1 = Male, 0 = Female

categorical_cols = ['Sex','ChestPainType','FastingBS','RestingECG','ExerciseAngina','ST_Slope']
continuous_cols  = ['Age','RestingBP','Cholesterol','MaxHR','Oldpeak']

In [4]:
# Split train into X / y and keep sensitive feature for fairness evaluation
X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

In [5]:
# scale numeric features only, fit on train, transform test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_num_scaled = pd.DataFrame(
    scaler.fit_transform(X_train[continuous_cols]),
    columns=continuous_cols, index=X_train.index
)
X_test_num_scaled = pd.DataFrame(
    scaler.transform(X_test[continuous_cols]),
    columns=continuous_cols, index=X_test.index
)

In [6]:
#one-hot encode categoricals; numeric are kept as is 
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", drop="if_binary", sparse_output=False)
ohe.fit(X_train[categorical_cols])

X_train_cat = pd.DataFrame(
    ohe.transform(X_train[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_train.index
)
X_test_cat = pd.DataFrame(
    ohe.transform(X_test[categorical_cols]),
    columns=ohe.get_feature_names_out(categorical_cols),
    index=X_test.index
)

In [7]:
# Assemble final matrices
X_train_ready = pd.concat([X_train_cat, X_train_num_scaled], axis=1)
X_test_ready  = pd.concat([X_test_cat,  X_test_num_scaled],  axis=1)

print("Final feature shapes:", X_train_ready.shape, X_test_ready.shape)

Final feature shapes: (600, 18) (184, 18)


### Traditional ML Models - Baseline: K-Nearest Neighbors (KNN) & Decision Tree (DT)

In [8]:
#import required libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

#define a function 
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== {model_name} Evaluation ===")
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred, average='binary'))
    print("Recall   :", recall_score(y_true, y_pred, average='binary'))
    print("F1 Score :", f1_score(y_true, y_pred, average='binary'))
    print("\nClassification Report:\n", classification_report(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("\n" + "="*40 + "\n")

### PCA-KNN

In [9]:
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)
import numpy as np

#1) PCA + KNN pipeline (on one-hot encoded + scaled features)
pca_knn = Pipeline([
    ('pca', PCA(n_components=0.95, random_state=42)),  # keep 95% variance
    ('knn', KNeighborsClassifier(
        n_neighbors=15, metric='manhattan', weights='distance'
    ))
])

pca_knn.fit(X_train_ready, y_train)

# Inspect PCA details
n_comp = pca_knn.named_steps['pca'].n_components_
expl_var = pca_knn.named_steps['pca'].explained_variance_ratio_.sum()

print("=== Baseline (tuned PCA+KNN, no mitigation) ===")
# 2) Evaluate 
y_pred_pca_knn = pca_knn.predict(X_test_ready)
probs_pca_knn = pca_knn.predict_proba(X_test_ready)[:, 1]
  
evaluate_model(y_test, y_pred_pca_knn, "KNN (best params)")

=== Baseline (tuned PCA+KNN, no mitigation) ===
=== KNN (best params) Evaluation ===
Accuracy : 0.8858695652173914
Precision: 0.9090909090909091
Recall   : 0.8823529411764706
F1 Score : 0.8955223880597015

Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.87        82
           1       0.91      0.88      0.90       102

    accuracy                           0.89       184
   macro avg       0.88      0.89      0.88       184
weighted avg       0.89      0.89      0.89       184

Confusion Matrix:
 [[73  9]
 [12 90]]




### Post-Processing -  KNN

In [10]:
# Demographic Parity post-processing for your tuned PCA+KNN

from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import (
    MetricFrame, true_positive_rate, false_positive_rate, selection_rate,
    demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

# Helper function
def eval_fairness(y_true, y_pred, A):
    mf = MetricFrame(
        metrics={
            "TPR": true_positive_rate,
            "FPR": false_positive_rate,
            "Recall": recall_score, 
            "SelectionRate": selection_rate,
            "Accuracy": accuracy_score,
        },
        y_true=y_true, y_pred=y_pred, sensitive_features=A
    )
    return {
        "by_group": mf.by_group,
        "acc": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "dp": demographic_parity_difference(y_true, y_pred, sensitive_features=A),
        "eo": equalized_odds_difference(y_true, y_pred, sensitive_features=A),
    }

# 1) Baseline metrics (no mitigation) 
pca_knn.fit(X_train_ready, y_train)
y_base = pca_knn.predict(X_test_ready)
m_base = eval_fairness(y_test, y_base, A_test)

print("=== Baseline (tuned PCA+KNN) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

# 2) Post-processing with DEMOGRAPHIC PARITY
post_dp = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="demographic_parity",
    predict_method="predict_proba",   # KNN supports this
    grid_size=200,
    flip=True
)
post_dp.fit(X_train_ready, y_train, sensitive_features=A_train)

y_dp = post_dp.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_dp = eval_fairness(y_test, y_dp, A_test)

print("\n=== Post-processing (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# 3) Post-processing with EQUALIZED ODDS
post_eod = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="equalized_odds",
    predict_method="predict_proba",   # KNN supports this
    grid_size=200,
    flip=True,                                # makes randomized post-processing reproducible
)
post_eod.fit(X_train_ready, y_train, sensitive_features=A_train)

y_eod = post_eod.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_eod = eval_fairness(y_test, y_eod, A_test)

print("\n=== Post-processing (Equalized Odds) ===")
print(m_eod["by_group"])
print(f"Accuracy: {m_eod['acc']:.4f} | DP diff: {m_eod['dp']:.4f} | EO diff: {m_eod['eo']:.4f}")


=== Baseline (tuned PCA+KNN) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    0.833333  0.125  0.833333       0.236842  0.868421
1    0.885417  0.100  0.885417       0.616438  0.890411
Accuracy: 0.8859 | DP diff: 0.3796 | EO diff: 0.0521

=== Post-processing (Demographic Parity) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    0.833333  0.125  0.833333       0.236842  0.868421
1    0.885417  0.100  0.885417       0.616438  0.890411
Accuracy: 0.8859 | DP diff: 0.3796 | EO diff: 0.0521

=== Post-processing (Equalized Odds) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    0.833333  0.125  0.833333       0.236842  0.868421
1    0.885417  0.100  0.885417       0.616438  0.890411
Accuracy: 0.8859 | DP diff: 0.3796 | EO diff: 0.0521


---

### Post-Processing (Demographic Parity & Equalized Odds) — PCA+KNN

#### Combined Results

| Model                                            | Accuracy | DP diff | EO diff | Notes                                                |
|--------------------------------------------------|----------|---------|---------|------------------------------------------------------|
| Baseline (tuned PCA+KNN)                         | 0.8859   | 0.3796  | 0.0521  | High DP disparity, low EO gap                        |
| Post-processing (Demographic Parity constraint)  | 0.8859   | 0.3796  | 0.0521  | Identical to baseline — no fairness change           |
| Post-processing (Equalized Odds constraint)      | 0.8859   | 0.3796  | 0.0521  | Identical to baseline — no fairness change           |

#### Interpretation
- Metrics are unchanged across baseline, DP, and EO post-processing: accuracy ≈ **88.6%**, **DP diff = 0.38 (large)**, **EO diff = 0.05 (small)**.
- With KNN, `predict_proba` is **coarse (steps of 1/k)**, limiting randomized thresholding; `ThresholdOptimizer` defaulted to the **baseline mapping**.
- **In-processing mitigation via Fairlearn reductions is not available with KNN** (no `sample_weight`), so if DP parity is required, a reductions-compatible estimator or (within KNN) try larger **k** and a higher `grid_size` in `ThresholdOptimizer` can be considered.

---


**CorrelationRemover** will be implemented to improve fairness after DP/EOD post-processing failed to change any predictions (0% flips), leaving metrics unchanged. By removing linear correlation between features and the sensitive attribute, we reduce leakage and make group score distributions more comparable, giving PCA+KNN and also any subsequent post-processing room to adjust selection rates and error rates—all while staying.

In [12]:
from fairlearn.preprocessing import CorrelationRemover
from sklearn.metrics import recall_score  

# X_* are DataFrames; A_* are Series
Xtr_df = X_train_ready.copy()
Xte_df = X_test_ready.copy()
Xtr_df["__A__"] = A_train.values
Xte_df["__A__"] = A_test.values

cr = CorrelationRemover(sensitive_feature_ids=["__A__"])

Xtr_fair_arr = cr.fit_transform(Xtr_df)   # shape: (n_samples, n_features - 1)
Xte_fair_arr = cr.transform(Xte_df)

# Rebuild DataFrames with columns that exclude the sensitive column
cols_out = [c for c in Xtr_df.columns if c != "__A__"]
Xtr_fair = pd.DataFrame(Xtr_fair_arr, index=Xtr_df.index, columns=cols_out)
Xte_fair = pd.DataFrame(Xte_fair_arr, index=Xte_df.index, columns=cols_out)

# Refit your PCA+KNN
pca_knn.fit(Xtr_fair, y_train)
y_cr = pca_knn.predict(Xte_fair)
m_cr = eval_fairness(y_test, y_cr, A_test)

print("\n=== Preprocessing: CorrelationRemover + PCA+KNN ===")
print(m_cr["by_group"])
print(f"Accuracy: {m_cr['acc']:.4f} | DP diff: {m_cr['dp']:.4f} | EO diff: {m_cr['eo']:.4f}")


=== Preprocessing: CorrelationRemover + PCA+KNN ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    1.000000  0.09375  1.000000       0.236842  0.921053
1    0.854167  0.10000  0.854167       0.595890  0.869863
Accuracy: 0.8804 | DP diff: 0.3590 | EO diff: 0.1458


**Interpretation**:
- Accuracy is **0.8804**, a slight decrease from the baseline (0.8859).
- **DP diff = 0.3590** shows a modest improvement vs. baseline (0.3796), but selection rates remain far apart (Sex=0: 0.237 vs. Sex=1: 0.596 ≈ 2.5× higher).
- **EO diff = 0.1458** worsened vs. baseline (0.0521), driven by a larger **TPR gap** (Sex=0: 1.000 vs. Sex=1: 0.854); **FPRs** are similar (0.094 vs. 0.100).
- Net effect: CorrelationRemover slightly improved **DP** but **hurt EO**, indicating reduced proxy correlation did not equalize error rates and in fact amplified the TPR disparity.


In [13]:
# Demographic Parity on top of the CorrelationRemover
post_dp_cr = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=1000,
    flip=True
)
post_dp_cr.fit(Xtr_fair, y_train, sensitive_features=A_train)
y_dp_cr = post_dp_cr.predict(Xte_fair, sensitive_features=A_test, random_state=42)
m_dp_cr = eval_fairness(y_test, y_dp_cr, A_test)

# Equalized Odds on top of  CorrelationRemover
post_eod_cr = ThresholdOptimizer(
    estimator=pca_knn,
    constraints="equalized_odds",
    predict_method="predict_proba",
    grid_size=1000,
    flip=True
)
post_eod_cr.fit(Xtr_fair, y_train, sensitive_features=A_train)
y_eod_cr = post_eod_cr.predict(Xte_fair, sensitive_features=A_test, random_state=42)
m_eod_cr = eval_fairness(y_test, y_eod_cr, A_test)

# check out the results
import numpy as np
print(f"Changed vs CR baseline (DP):  {np.mean(y_dp_cr  != y_cr):.3%}")
print(f"Changed vs CR baseline (eOD): {np.mean(y_eod_cr != y_cr):.3%}")

print("\n=== Post-CR (DP) ===")
print(m_dp_cr["by_group"])
print(f"Accuracy: {m_dp_cr['acc']:.4f} | DP diff: {m_dp_cr['dp']:.4f} | EO diff: {m_dp_cr['eo']:.4f}")

print("\n=== Post-CR (eOD) ===")
print(m_eod_cr["by_group"])
print(f"Accuracy: {m_eod_cr['acc']:.4f} | DP diff: {m_eod_cr['dp']:.4f} | EO diff: {m_eod_cr['eo']:.4f}")

Changed vs CR baseline (DP):  0.000%
Changed vs CR baseline (eOD): 0.000%

=== Post-CR (DP) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    1.000000  0.09375  1.000000       0.236842  0.921053
1    0.854167  0.10000  0.854167       0.595890  0.869863
Accuracy: 0.8804 | DP diff: 0.3590 | EO diff: 0.1458

=== Post-CR (eOD) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    1.000000  0.09375  1.000000       0.236842  0.921053
1    0.854167  0.10000  0.854167       0.595890  0.869863
Accuracy: 0.8804 | DP diff: 0.3590 | EO diff: 0.1458


Interpretation:
- **0% flips for both DP and eOD** means post-processing after CorrelationRemover made **no label changes**, so metrics are **identical** to the CR+KNN baseline.
- **Metrics (unchanged):** Accuracy = **0.8804**; **DP diff = 0.3590** (large gap: selection rates 0.237 vs 0.596); **EO diff = 0.1458**, driven by a **TPR gap** (1.000 vs 0.854) while FPRs are similar (0.094 vs 0.100).
- **Interpretation:** Even with `grid_size=1000` and `flip=True`, the optimizer found no feasible re-thresholding better than CR+KNN—consistent with **KNN’s coarse probability grid** and the accuracy–fairness trade-off.

### Bias mitigation comparison (PCA+KNN)

| Model variant                      | Accuracy | DP diff | EO diff | SelRate S=0 | SelRate S=1 | TPR S=0 | TPR S=1 | FPR S=0 | FPR S=1 | Notes                          |
|-----------------------------------|:--------:|:-------:|:-------:|:-----------:|:-----------:|:-------:|:-------:|:-------:|:-------:|--------------------------------|
| Baseline (tuned PCA+KNN)          | 0.8859   | 0.3796  | 0.0521  | 0.2368      | 0.6164      | 0.8333  | 0.8854  | 0.1250  | 0.1000  | Reference                      |
| Post-processing (DP constraint)   | 0.8859   | 0.3796  | 0.0521  | 0.2368      | 0.6164      | 0.8333  | 0.8854  | 0.1250  | 0.1000  | **Flips vs baseline: 0%**      |
| Post-processing (EO constraint)   | 0.8859   | 0.3796  | 0.0521  | 0.2368      | 0.6164      | 0.8333  | 0.8854  | 0.1250  | 0.1000  | **Flips vs baseline: 0%**      |
| CorrelationRemover + PCA+KNN      | 0.8804   | 0.3590  | 0.1458  | 0.2368      | 0.5959      | 1.0000  | 0.8542  | 0.0938  | 0.1000  | New baseline after CR          |
| Post-CR (DP constraint)           | 0.8804   | 0.3590  | 0.1458  | 0.2368      | 0.5959      | 1.0000  | 0.8542  | 0.0938  | 0.1000  | **Flips vs CR baseline: 0%**   |
| Post-CR (EO constraint)           | 0.8804   | 0.3590  | 0.1458  | 0.2368      | 0.5959      | 1.0000  | 0.8542  | 0.0938  | 0.1000  | **Flips vs CR baseline: 0%**   |

**Takeaway:** Post-processing produced no label changes (0% flips) and thus no metric changes; CorrelationRemover slightly reduced DP disparity but increased EO disparity and reduced accuracy. In the CVD prediction setting, the persistent DP gap (≈0.36–0.38) indicates substantially higher positive rates for Males than females risking unequal alerting while decorrelation increased the TPR gap (EO) so that males experiences more missed positives than females, meaning gender bias remains and may be exacerbated by the CR step.

--- 

### Tuned Decision Tree (DT)

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_score, recall_score, f1_score
)

# 1) Base model
dt = DecisionTreeClassifier(random_state=42)

# 2) Hyperparameter grid 
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7, 9, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 6, 10],
}

# 3) Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 4) Grid search 
grid_dt = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=cv,
    scoring="f1",      
    n_jobs=-1,
    verbose=0
)

grid_dt.fit(X_train_ready, y_train)

print("Best Decision Tree params:", grid_dt.best_params_)
print("Best CV F1:", grid_dt.best_score_)

# 5) Train & evaluate best DT
tuned_dt = grid_dt.best_estimator_
y_pred_dt_best = tuned_dt.predict(X_test_ready)
y_prob_dt_best = tuned_dt.predict_proba(X_test_ready)[:, 1] 

evaluate_model(y_test, y_pred_dt_best, "Tuned Decision Tree (best params)")

Best Decision Tree params: {'criterion': 'entropy', 'max_depth': 9, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best CV F1: 0.8593494246061409
=== Tuned Decision Tree (best params) Evaluation ===
Accuracy : 0.8097826086956522
Precision: 0.819047619047619
Recall   : 0.8431372549019608
F1 Score : 0.8309178743961353

Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.77      0.78        82
           1       0.82      0.84      0.83       102

    accuracy                           0.81       184
   macro avg       0.81      0.81      0.81       184
weighted avg       0.81      0.81      0.81       184

Confusion Matrix:
 [[63 19]
 [16 86]]




### Bias Mitigation DT: Inprocessing - Exponentiated Gradient Reduction

In [17]:
# In-processing mitigation for tuned Decision Tree
from sklearn.base import clone
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, DemographicParity
from fairlearn.metrics import (
    MetricFrame, true_positive_rate, false_positive_rate, selection_rate,
    demographic_parity_difference, equalized_odds_difference
)
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# 0) Baseline: tuned DT without mitigation (for comparison)
y_pred_dt_base = tuned_dt.predict(X_test_ready)
m_base = eval_fairness(y_test, y_pred_dt_base, A_test)
print("=== Baseline (Tuned DT) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

# 1) Exponentiated Gradient with Equalized Odds
eg_eo = ExponentiatedGradient(
    estimator=clone(tuned_dt),
    constraints=EqualizedOdds(),
    eps=0.01,
    max_iter=50
)
eg_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_eo = eg_eo.predict(X_test_ready, random_state=42)
m_eo = eval_fairness(y_test, y_pred_eo, A_test)
print("\n=== In-processing: EG (Equalized Odds) ===")
print(m_eo["by_group"])
print(f"Accuracy: {m_eo['acc']:.4f} | DP diff: {m_eo['dp']:.4f} | EO diff: {m_eo['eo']:.4f}")

# 2) Exponentiated Gradient with Demographic Parity
eg_dp = ExponentiatedGradient(
    estimator=clone(tuned_dt),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_dp = eg_dp.predict(X_test_ready, random_state=42)
m_dp = eval_fairness(y_test, y_pred_dp, A_test)
print("\n=== In-processing: EG (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# 3) Summary table
summary_dt = pd.DataFrame([
    {"model": "DT Baseline (tuned)", "accuracy": m_base["acc"], "dp_diff": m_base["dp"], "eo_diff": m_base["eo"]},
    {"model": "DT + EG (EO)",        "accuracy": m_eo["acc"],   "dp_diff": m_eo["dp"],   "eo_diff": m_eo["eo"]},
    {"model": "DT + EG (DP)",        "accuracy": m_dp["acc"],   "dp_diff": m_dp["dp"],   "eo_diff": m_dp["eo"]},
]).round(4)
print("\n=== Decision Tree: Baseline vs In-processing (EG) ===")
summary_dt

=== Baseline (Tuned DT) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    0.833333  0.28125  0.833333       0.368421  0.736842
1    0.843750  0.20000  0.843750       0.623288  0.828767
Accuracy: 0.8098 | DP diff: 0.2549 | EO diff: 0.0812

=== In-processing: EG (Equalized Odds) ===
          TPR   FPR    Recall  SelectionRate  Accuracy
Sex                                                   
0    0.666667  0.25  0.666667       0.315789  0.736842
1    0.854167  0.22  0.854167       0.636986  0.828767
Accuracy: 0.8098 | DP diff: 0.3212 | EO diff: 0.1875

=== In-processing: EG (Demographic Parity) ===
          TPR   FPR    Recall  SelectionRate  Accuracy
Sex                                                   
0    0.833333  0.25  0.833333       0.342105  0.763158
1    0.843750  0.22  0.843750       0.630137  0.821918
Accuracy: 0.8098 | DP diff: 0.2880 | EO diff: 0.0300

=== Decision Tree: Baseline vs In-processing (EG

Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.8098,0.2549,0.0812
1,DT + EG (EO),0.8098,0.3212,0.1875
2,DT + EG (DP),0.8098,0.288,0.03


### Bias Mitigation Results: Decision Tree – In-Processing

#### Metrics Overview

| Model                 | Accuracy | DP diff | EO diff | Notes                                                                 |
|-----------------------|:--------:|:-------:|:-------:|------------------------------------------------------------------------|
| **DT Baseline (tuned)** | 0.8098   | 0.2549  | 0.0812  | Moderate DP gap; small-to-moderate EO gap                              |
| **DT + EG (EO)**      | 0.8098   | 0.3212  | 0.1875  | Accuracy unchanged; **DP worsens** (↑), **EO worsens** sharply         |
| **DT + EG (DP)**      | 0.8098   | 0.2880  | 0.0300  | Accuracy unchanged; **EO improves strongly**, DP still elevated        |

---

#### Interpretation
- The **baseline DT** already has a **substantial DP disparity** (≈0.25) and a **smaller EO gap** (≈0.08).  
- **EG (Equalized Odds)** fails here: it does **not improve accuracy** and makes fairness worse — both **DP** and **EO** increase.  
- **EG (Demographic Parity)** is more promising: it drives **EO down to 0.03**, greatly aligning error rates, but **DP remains high** (≈0.29), meaning outcome disparities persist.  

**Conclusion:** For this DT, **EG with a DP constraint** provides the most useful improvement (minimizing error-rate disparity without hurting accuracy). However, **DP remains unresolved**, suggesting that further tuning (e.g., smaller `eps`) or alternative mitigation strategies may be needed if demographic parity is a strict requirement.  

---

#### Bias Mitigation DT: In-processing: GridSearch Reduction

In [19]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

# 1) GridSearch with Equalized Odds
gs_eo = GridSearch(
    estimator=clone(tuned_dt),              # unfitted clone of tuned DT
    constraints=EqualizedOdds(),            # EO constraint
    selection_rule="tradeoff_optimization", # or: "best_classifier", "minimum_violation"
    constraint_weight=0.5,                  # trade-off weight (0..1); tune this
    grid_size=15                          # more points -> finer Pareto front                      
)
gs_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_gs_eo = gs_eo.predict(X_test_ready)
m_gs_eo = eval_fairness(y_test, y_pred_gs_eo, A_test)
print("\n=== In-processing: GridSearch (Equalized Odds) ===")
print(m_gs_eo["by_group"])
print(f"Accuracy: {m_gs_eo['acc']:.4f} | DP diff: {m_gs_eo['dp']:.4f} | EO diff: {m_gs_eo['eo']:.4f}")

# 2) GridSearch with Demographic Parity
gs_dp = GridSearch(
    estimator=clone(tuned_dt),
    constraints=DemographicParity(),
    selection_rule="tradeoff_optimization",
    constraint_weight=0.5,
    grid_size=15
)
gs_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_gs_dp = gs_dp.predict(X_test_ready)
m_gs_dp = eval_fairness(y_test, y_pred_gs_dp, A_test)
print("\n=== In-processing: GridSearch (Demographic Parity) ===")
print(m_gs_dp["by_group"])
print(f"Accuracy: {m_gs_dp['acc']:.4f} | DP diff: {m_gs_dp['dp']:.4f} | EO diff: {m_gs_dp['eo']:.4f}")

# 3) Compare with your existing runs
summary_dt = pd.concat([
    summary_dt,  
    pd.DataFrame([
        {"model":"DT + GS (EO)", "accuracy":m_gs_eo["acc"], "dp_diff":m_gs_eo["dp"], "eo_diff":m_gs_eo["eo"]},
        {"model":"DT + GS (DP)", "accuracy":m_gs_dp["acc"], "dp_diff":m_gs_dp["dp"], "eo_diff":m_gs_dp["eo"]},
    ]).round(4)
], ignore_index=True)
print("\n=== Decision Tree: Baseline vs EG vs GS ===")
summary_dt


=== In-processing: GridSearch (Equalized Odds) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    0.833333  0.28125  0.833333       0.368421  0.736842
1    0.843750  0.20000  0.843750       0.623288  0.828767
Accuracy: 0.8098 | DP diff: 0.2549 | EO diff: 0.0812

=== In-processing: GridSearch (Demographic Parity) ===
          TPR     FPR    Recall  SelectionRate  Accuracy
Sex                                                     
0    0.500000  0.1875  0.500000       0.236842  0.763158
1    0.802083  0.2400  0.802083       0.609589  0.787671
Accuracy: 0.7826 | DP diff: 0.3727 | EO diff: 0.3021

=== Decision Tree: Baseline vs EG vs GS ===


Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.8098,0.2549,0.0812
1,DT + EG (EO),0.8098,0.3212,0.1875
2,DT + EG (DP),0.8098,0.288,0.03
3,DT + GS (EO),0.8098,0.2549,0.0812
4,DT + GS (DP),0.7826,0.3727,0.3021


### Decision Tree — In-Processing: EG vs. GridSearch (EO & DP)

#### Summary of results
| Model                | Accuracy | DP diff | EO diff | Interpretation |
|----------------------|:--------:|:-------:|:-------:|----------------|
| **DT Baseline (tuned)** | 0.8098   | 0.2549  | 0.0812  | Moderate disparities in DP (≈0.25) and EO (≈0.08). |
| **DT + EG (EO)**       | 0.8098   | 0.3212  | 0.1875  | Accuracy unchanged; **DP worsens**; **EO worsens** markedly. |
| **DT + EG (DP)**       | 0.8098   | 0.2880  | 0.0300  | Accuracy unchanged; **EO improves strongly**; **DP slightly worse**. |
| **DT + GS (EO)**       | 0.8098   | 0.2549  | 0.0812  | **Identical to baseline** — no improvement from GridSearch. |
| **DT + GS (DP)**       | 0.7826   | 0.3727  | 0.3021  | Accuracy ↓ (−2.7 pp); **DP and EO both worsen** significantly. |

---

#### Interpretation
- The **baseline DT** exhibits a **substantial DP disparity** (≈0.25) and a smaller EO gap (≈0.08).  
- **EG (EO)** is counterproductive: fairness worsens (DP ↑, EO ↑) with **no accuracy gain**.  
- **EG (DP)** is the **most effective option**: EO improves dramatically (0.08 → 0.03) while accuracy is preserved, though DP disparity remains high.  
- **GS (EO)** provides **no change**, converging to the baseline point.  
- **GS (DP)** performs poorly, reducing accuracy and worsening both fairness metrics — an undesirable outcome.  

---

**Takeaway:**  
For this Decision Tree, the best fairness–utility trade-off comes from **Exponentiated Gradient with a Demographic Parity constraint**: it keeps accuracy stable while substantially reducing EO disparity. However, **DP remains unresolved**, and GridSearch methods under current settings provide no benefit or even worsen outcomes.  

---


#### Bias Mitigation DT: Post-processing: Threshold Optimizer 

In [20]:
from fairlearn.postprocessing import ThresholdOptimizer

# Baseline for mitigation: fixed tuned DT
tuned_dt.fit(X_train_ready, y_train)
y_base = tuned_dt.predict(X_test_ready)
m_base = eval_fairness(y_test, y_base, A_test)
print("=== Baseline (tuned DT) ===")
print(m_base["by_group"])
print(f"Accuracy: {m_base['acc']:.4f} | DP diff: {m_base['dp']:.4f} | EO diff: {m_base['eo']:.4f}")

# Post-processing: Equalized Odds
post_eo = ThresholdOptimizer(
    estimator=tuned_dt,
    constraints="equalized_odds",
    predict_method="predict_proba",   
    grid_size=200,
    flip=True
)
post_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_eo = post_eo.predict(X_test_ready, sensitive_features=A_test)
m_eo = eval_fairness(y_test, y_eo, A_test)
print("\n=== Post-processing (Equalized Odds) ===")
print(m_eo["by_group"])
print(f"Accuracy: {m_eo['acc']:.4f} | DP diff: {m_eo['dp']:.4f} | EO diff: {m_eo['eo']:.4f}")

# Post-processing: Demographic Parity
post_dp = ThresholdOptimizer(
    estimator=tuned_dt,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True,
)
post_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_dp = post_dp.predict(X_test_ready, sensitive_features=A_test)
m_dp = eval_fairness(y_test, y_dp, A_test)
print("\n=== Post-processing (Demographic Parity) ===")
print(m_dp["by_group"])
print(f"Accuracy: {m_dp['acc']:.4f} | DP diff: {m_dp['dp']:.4f} | EO diff: {m_dp['eo']:.4f}")

# Create summary table 
summary = pd.DataFrame([
    {"model":"DT Baseline (tuned)", "accuracy":m_base["acc"], "dp_diff":m_base["dp"], "eo_diff":m_base["eo"]},
    {"model":"DT + Post (EO)",      "accuracy":m_eo["acc"],   "dp_diff":m_eo["dp"],   "eo_diff":m_eo["eo"]},
    {"model":"DT + Post (DP)",      "accuracy":m_dp["acc"],   "dp_diff":m_dp["dp"],   "eo_diff":m_dp["eo"]},
]).round(4)
print("\n=== Decision Tree: Baseline vs Post-processing ===")
summary


=== Baseline (tuned DT) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    0.833333  0.28125  0.833333       0.368421  0.736842
1    0.843750  0.20000  0.843750       0.623288  0.828767
Accuracy: 0.8098 | DP diff: 0.2549 | EO diff: 0.0812

=== Post-processing (Equalized Odds) ===
          TPR   FPR    Recall  SelectionRate  Accuracy
Sex                                                   
0    0.666667  0.25  0.666667       0.315789  0.736842
1    0.843750  0.20  0.843750       0.623288  0.828767
Accuracy: 0.8098 | DP diff: 0.3075 | EO diff: 0.1771

=== Post-processing (Demographic Parity) ===
          TPR     FPR    Recall  SelectionRate  Accuracy
Sex                                                     
0    0.833333  0.3125  0.833333       0.394737  0.710526
1    0.864583  0.2000  0.864583       0.636986  0.842466
Accuracy: 0.8152 | DP diff: 0.2422 | EO diff: 0.1125

=== Decision Tree: Baseline vs Post-processi

Unnamed: 0,model,accuracy,dp_diff,eo_diff
0,DT Baseline (tuned),0.8098,0.2549,0.0812
1,DT + Post (EO),0.8098,0.3075,0.1771
2,DT + Post (DP),0.8152,0.2422,0.1125


### Decision Tree: Post- vs In-Processing

### Combined Results

| Model / Method          | Accuracy | DP diff | EO diff | Notes |
|-------------------------|:--------:|:-------:|:-------:|-------|
| **Baseline (Tuned DT)** | 0.8098   | 0.2549  | 0.0812  | Reference point |
| **Post (EO)**           | 0.8098   | 0.3075  | 0.1771  | Accuracy = baseline; **EO worsens sharply**; DP ↑ |
| **Post (DP)**           | 0.8152   | **0.2422** | 0.1125 | Slight accuracy gain; **best DP** among post-proc; EO ↑ |
| **EG (EO)**             | 0.8098   | 0.3212  | 0.1875  | Accuracy = baseline; both **DP and EO worsen** |
| **EG (DP)**             | 0.8098   | 0.2880  | **0.0300** | Accuracy = baseline; **EO improves strongly**; DP ↑ |
| **GS (EO)**             | 0.8098   | 0.2549  | 0.0812  | Identical to baseline — no change |
| **GS (DP)**             | 0.7826   | 0.3727  | 0.3021  | Accuracy ↓ (−2.7 pp); **worst DP & EO** |

---

### Interpretation
- The **baseline DT** has a **large DP gap (~0.25)** and a **moderate EO gap (~0.08)**.  
- **Post (EO)** worsens fairness overall — EO nearly doubles, DP rises, with no accuracy benefit.  
- **Post (DP)** slightly improves DP (best among post-proc, 0.2422) and increases accuracy a bit, but EO worsens notably.  
- **EG (EO)** is counterproductive: DP and EO both worsen.  
- **EG (DP)** is the only method that **substantially improves EO** (0.03) while keeping accuracy steady, though DP disparity remains high.  
- **GS (EO)** returns the baseline point (no improvement).  
- **GS (DP)** harms both fairness metrics and lowers accuracy — clearly undesirable.  

---

**Takeaway:**  
- If the priority is **error-rate parity (Equalized Odds)** → **EG (DP)** is the best option (EO ↓ to 0.03).  
- If the focus is **outcome-rate parity (Demographic Parity)** → **Post (DP)** gives the lowest DP (0.2422) with a slight accuracy boost, but worsens EO.  
- **Post (EO), EG (EO), and GS (DP)** should be avoided as they worsen disparities.  
- **GridSearch (EO)** adds no value under current settings.

---

### Ensemble Model - Random Forest (RF)

In [21]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_ready, y_train)

# Predict on test set
y_pred_rf = rf.predict(X_test_ready)
y_prob_rf = rf.predict_proba(X_test_ready)[:, 1]  
evaluate_model(y_test, y_pred_rf, "Random Forest")

=== Random Forest Evaluation ===
Accuracy : 0.8804347826086957
Precision: 0.8703703703703703
Recall   : 0.9215686274509803
F1 Score : 0.8952380952380953

Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.83      0.86        82
           1       0.87      0.92      0.90       102

    accuracy                           0.88       184
   macro avg       0.88      0.88      0.88       184
weighted avg       0.88      0.88      0.88       184

Confusion Matrix:
 [[68 14]
 [ 8 94]]




### Bias Mitgation RF: In-processing: Exponentiated Gradient 

In [23]:
# 0) Baseline Random Forest
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_ready, y_train)

y_pred_rf_base = rf.predict(X_test_ready)
m_rf_base = eval_fairness(y_test, y_pred_rf_base, A_test)

print("=== Baseline (Random Forest) ===")
print(m_rf_base["by_group"])
print(f"Accuracy: {m_rf_base['acc']:.4f} | DP diff: {m_rf_base['dp']:.4f} | EO diff: {m_rf_base['eo']:.4f}")

# 1) EG with Equalized Odds
eg_eo_rf = ExponentiatedGradient(
    estimator=clone(rf),
    constraints=EqualizedOdds(),
    eps=0.01,
    max_iter=50,
)
eg_eo_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_rf_eo = eg_eo_rf.predict(X_test_ready)
m_rf_eo = eval_fairness(y_test, y_pred_rf_eo, A_test)

print("\n=== In-processing RF: EG (Equalized Odds) ===")
print(m_rf_eo["by_group"])
print(f"Accuracy: {m_rf_eo['acc']:.4f} | DP diff: {m_rf_eo['dp']:.4f} | EO diff: {m_rf_eo['eo']:.4f}")

# 2) EG with Demographic Parity 
eg_dp_rf = ExponentiatedGradient(
    estimator=clone(rf),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
y_pred_rf_dp = eg_dp_rf.predict(X_test_ready)
m_rf_dp = eval_fairness(y_test, y_pred_rf_dp, A_test)

print("\n=== In-processing RF: EG (Demographic Parity) ===")
print(m_rf_dp["by_group"])
print(f"Accuracy: {m_rf_dp['acc']:.4f} | DP diff: {m_rf_dp['dp']:.4f} | EO diff: {m_rf_dp['eo']:.4f}")

# 3) Summary Table 
summary_rf = pd.DataFrame([
    {"model":"RF Baseline",      "accuracy":m_rf_base["acc"], "dp_diff":m_rf_base["dp"], "eo_diff":m_rf_base["eo"]},
    {"model":"RF + EG (EO)",     "accuracy":m_rf_eo["acc"],   "dp_diff":m_rf_eo["dp"],   "eo_diff":m_rf_eo["eo"]},
    {"model":"RF + EG (DP)",     "accuracy":m_rf_dp["acc"],   "dp_diff":m_rf_dp["dp"],   "eo_diff":m_rf_dp["eo"]},
]).round(4)

print("\n=== Random Forest: Baseline vs In-processing (EG) ===")
print(summary_rf)

=== Baseline (Random Forest) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.916667  0.200  0.916667       0.671233  0.876712
Accuracy: 0.8804 | DP diff: 0.4081 | EO diff: 0.0833

=== In-processing RF: EG (Equalized Odds) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.916667  0.200  0.916667       0.671233  0.876712
Accuracy: 0.8804 | DP diff: 0.4081 | EO diff: 0.0833

=== In-processing RF: EG (Demographic Parity) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.916667  0.200  0.916667       0.671233  0.876712
Accuracy: 0.8804 | DP diff: 0.4081 | EO diff: 0.0833

=== Random Forest: Baseline vs In-pro

## Random Forest Bias Mitigation Results

### Summary

| Model            | Accuracy | DP Diff | EO Diff | Interpretation                                |
|------------------|:--------:|:-------:|:-------:|-----------------------------------------------|
| **RF Baseline**  | 0.8804   | 0.4081  | 0.0833  | Strong accuracy; **large DP gap** remains (≈0.41); EO moderate. |
| **RF + EG (EO)** | 0.8804   | 0.4081  | 0.0833  | **Identical to baseline** — EO constraint had no impact. |
| **RF + EG (DP)** | 0.8804   | 0.4081  | 0.0833  | **Identical to baseline** — DP constraint had no impact. |

---

### Key Points
- **Random Forest baseline** achieves high accuracy but suffers from a **large disparity in selection rates (DP ≈ 0.41)**, with EO at a moderate level (~0.08).  
- **ExponentiatedGradient** with either **Equalized Odds** or **Demographic Parity** produced **no change**.  
- This typically occurs when:
  - The model is **insensitive to sample reweighting** (as tree ensembles often are).  
  - The fairness tolerance (`eps=0.01`) is too strict, so EG returns the original classifier.  
- In contrast to Decision Trees, RF here is **locked at its baseline frontier point**, offering no fairness–utility trade-off under EG.  

---

**Takeaway:**  
For this RF setup, **EG provides no benefit**. Addressing the strong DP disparity likely requires alternative mitigation strategies.

---

### Bias Mitigation: RF: In-processing: Grid Search

In [24]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

weights = [0.0, 0.25, 0.5, 0.75, 1.0]   # 0.0 = accuracy-first, 1.0 = fairness-first
grid = 50                               

rows = []

#Equalized Odds sweep
for w in weights:
    gs_eo_rf = GridSearch(
        estimator=clone(rf),                 
        constraints=EqualizedOdds(),
        selection_rule="tradeoff_optimization",
        constraint_weight=w,
        grid_size=grid
    )
    gs_eo_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
    # Some versions accept random_state in predict; if yours doesn't, seed numpy before predicting
    try:
        y_hat = gs_eo_rf.predict(X_test_ready, random_state=42)
    except TypeError:
        import numpy as np, random
        np.random.seed(42); random.seed(42)
        y_hat = gs_eo_rf.predict(X_test_ready)
    m = eval_fairness(y_test, y_hat, A_test)
    rows.append({"method":"RF + GS (EO)", "weight": w, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})

# Demographic Parity sweep
for w in weights:
    gs_dp_rf = GridSearch(
        estimator=clone(rf),
        constraints=DemographicParity(),
        selection_rule="tradeoff_optimization",
        constraint_weight=w,
        grid_size=grid
    )
    gs_dp_rf.fit(X_train_ready, y_train, sensitive_features=A_train)
    try:
        y_hat = gs_dp_rf.predict(X_test_ready, random_state=42)
    except TypeError:
        import numpy as np, random
        np.random.seed(42); random.seed(42)
        y_hat = gs_dp_rf.predict(X_test_ready)
    m = eval_fairness(y_test, y_hat, A_test)
    rows.append({"method":"RF + GS (DP)", "weight": w, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})

df_gs = pd.DataFrame(rows).sort_values(["method","weight"])
print(df_gs)

         method  weight       acc        dp        eo
5  RF + GS (DP)    0.00  0.880435  0.474405  0.126250
6  RF + GS (DP)    0.25  0.880435  0.474405  0.126250
7  RF + GS (DP)    0.50  0.880435  0.474405  0.126250
8  RF + GS (DP)    0.75  0.880435  0.474405  0.126250
9  RF + GS (DP)    1.00  0.880435  0.474405  0.126250
0  RF + GS (EO)    0.00  0.880435  0.493872  0.260417
1  RF + GS (EO)    0.25  0.880435  0.493872  0.260417
2  RF + GS (EO)    0.50  0.880435  0.493872  0.260417
3  RF + GS (EO)    0.75  0.880435  0.493872  0.260417
4  RF + GS (EO)    1.00  0.880435  0.493872  0.260417


### Interpretation — RF + GridSearch (weight sweep)

- **DP-constrained GridSearch:** All weights (0.00 → 1.00) converge to the **same solution**:  
  - Accuracy = **0.8804**  
  - DP diff = **0.4744** (very high)  
  - EO diff = **0.1263**  
- **EO-constrained GridSearch:** Similarly, all weights (0.00 → 1.00) produce **identical results**:  
  - Accuracy = **0.8804**  
  - DP diff = **0.4939** (very high)  
  - EO diff = **0.2604** (extremely high)  

- The invariance across weights shows that **GridSearch is stuck on a single frontier point** for both constraints.  
- These frontier points are **worse than the baseline RF** on fairness:  
  - DP disparities are nearly **0.47–0.49** (vs. ~0.41 baseline).  
  - EO worsens under the EO constraint (0.26 vs. ~0.08 baseline).  
  - Accuracy remains unchanged.  

---

**Takeaway:**  
Under current settings, **RF + GridSearch does not provide any useful fairness–utility trade-off**. Both DP- and EO-constrained runs simply reproduce **fairness-worse frontier points** with unchanged accuracy, suggesting the method is ineffective here.

---

In [25]:
# Inspect how many distinct models GridSearch actually produced
len(gs_eo_rf.predictors_), len(gs_dp_rf.predictors_)

# See the spread across the frontier (test metrics for each predictor)
def eval_frontier(gs, X, y, A):
    rows=[]
    for i, clf in enumerate(gs.predictors_):
        yhat = clf.predict(X)
        m = eval_fairness(y, yhat, A)
        rows.append({"i": i, "acc": m["acc"], "dp": m["dp"], "eo": m["eo"]})
    return pd.DataFrame(rows)

print(eval_frontier(gs_eo_rf, X_test_ready, y_test, A_test))
print(eval_frontier(gs_dp_rf, X_test_ready, y_test, A_test))

     i       acc        dp        eo
0    0  0.869565  0.684932  0.927083
1    1  0.864130  0.678082  0.916667
2    2  0.875000  0.664384  0.916667
3    3  0.869565  0.671233  0.916667
4    4  0.875000  0.650685  0.906250
5    5  0.864130  0.664384  0.906250
6    6  0.869565  0.671233  0.916667
7    7  0.875000  0.664384  0.916667
8    8  0.706522  0.196107  0.708750
9    9  0.880435  0.657534  0.916667
10  10  0.880435  0.657534  0.916667
11  11  0.875000  0.650685  0.906250
12  12  0.875000  0.664384  0.916667
13  13  0.875000  0.664384  0.916667
14  14  0.722826  0.183490  0.717500
15  15  0.728261  0.190339  0.737500
16  16  0.701087  0.763158  0.875000
17  17  0.880435  0.493872  0.260417
18  18  0.885870  0.453857  0.086250
19  19  0.891304  0.493872  0.137500
20  20  0.885870  0.467556  0.106250
21  21  0.869565  0.394376  0.072917
22  22  0.717391  0.301370  0.740000
23  23  0.717391  0.301370  0.740000
24  24  0.706522  0.287671  0.700000
25  25  0.701087  0.763158  0.875000
2

### RF GridSearch frontiers — concise read (CVD gender bias)

**What the tables show:** Each index `i` corresponds to one Random Forest model on the fairness–accuracy frontier. The results span a wide range, with some **degenerate points** and a few **strong candidates**.

---

**Avoid (dominated / degenerate points):**  
- Many candidates (`i ∈ {0–7, 9–15, 16, 18}` in the first table and `i ∈ {0–12, 38–49}` in the second) show **very large DP (≈0.65–0.76)** and **very high EO (≈0.90+)** despite only **moderate accuracy (≤0.88)**.  
- These are **clinically unacceptable** and not worth considering.

---

**Good candidates (vs. baseline RF: Acc 0.8804, DP 0.4081, EO 0.0833):**

- **Best overall balanced improvement:**  
  - `i=30` → **Acc 0.8804**, **DP 0.3886**, **EO 0.0729**.  
  - *Both fairness metrics improve vs. baseline at identical accuracy.*  

- **Best EO & highest accuracy:**  
  - `i=20` → **Acc 0.8967**, **DP 0.4149**, **EO 0.0625**.  
  - *Stronger error-rate parity and higher accuracy, with only a small DP increase.*  

- **Lowest DP with small accuracy cost:**  
  - `i=26` or `i=29` → **Acc ≈0.8750**, **DP 0.3818**, **EO ≈0.0833**.  
  - *Lower outcome disparity (DP) than baseline; EO ≈ baseline; accuracy slightly reduced.*  

---

**CVD context takeaway:**  
- To **minimize sex-based error-rate disparity** → pick **`i=20`** (lowest EO, highest accuracy).  
- To **improve both fairness metrics simultaneously without losing accuracy** → pick **`i=30`**.  
- To **reduce outcome disparity (DP)** specifically → **`i=26/29`** are best, trading off a small accuracy drop.

---

In [26]:
# Show results for the specific frontier models 
# for both RF GridSearch runs (EO- and DP-constrained).

import pandas as pd

indices = [30,20,26]

def eval_selected(gs, label):
    rows = []
    n = len(gs.predictors_)
    print(f"\n=== {label}: {n} frontier candidates ===")
    for i in indices:
        if i >= n:
            print(f"[{label}] Skipping i={i} (only {n} candidates).")
            continue
        clf = gs.predictors_[i]
        y_hat = clf.predict(X_test_ready)
        m = eval_fairness(y_test, y_hat, A_test)
        rows.append({"i": i, "accuracy": m["acc"], "dp_diff": m["dp"], "eo_diff": m["eo"]})

        # Per-group breakdown for this model
        print(f"\n[{label}] i={i}")
        print(m["by_group"])
        print(f"Accuracy: {m['acc']:.4f} | DP diff: {m['dp']:.4f} | EO diff: {m['eo']:.4f}")

    if rows:
        df = pd.DataFrame(rows).sort_values("i").round(4)
        print(f"\n--- Summary ({label}) ---")
        print(df)

# Evaluate selected indices for both EO and DP GridSearch objects
eval_selected(gs_eo_rf, "RF + GS (EO)")
eval_selected(gs_dp_rf, "RF + GS (DP)")


=== RF + GS (EO): 50 frontier candidates ===

[RF + GS (EO)] i=30
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.916667  0.200  0.916667       0.671233  0.876712
Accuracy: 0.8804 | DP diff: 0.4081 | EO diff: 0.0833

[RF + GS (EO)] i=20
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    0.833333  0.09375  0.833333       0.210526  0.894737
1    0.927083  0.20000  0.927083       0.678082  0.883562
Accuracy: 0.8859 | DP diff: 0.4676 | EO diff: 0.1063

[RF + GS (EO)] i=26
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    0.833333  0.125  0.833333       0.236842  0.868421
1    1.000000  1.000  1.000000       1.000000  0.657534
Accuracy: 0.7011 | DP diff: 0.7632 | EO diff: 0.8750

--- Summary (RF + GS (EO)) ---
    i  accuracy  dp_

## Random Forest — In-Processing: EG vs. GridSearch

### Summary of results

| Method        | i   | Accuracy | DP diff | EO diff | Interpretation |
|---------------|-----|:--------:|:-------:|:-------:|----------------|
| **Baseline**  | –   | 0.8804   | 0.4081  | 0.0833  | High accuracy; **large DP gap (~0.41)**; EO moderate (~0.08). |
| **EG (EO)**   | –   | 0.8804   | 0.4081  | 0.0833  | **No change** vs. baseline — EO constraint ineffective. |
| **EG (DP)**   | –   | 0.8804   | 0.4081  | 0.0833  | **No change** vs. baseline — DP constraint ineffective. |
| **GS (EO)**   | 30  | 0.8804   | 0.4081  | 0.0833  | Baseline-equivalent; no gain. |
| **GS (EO)**   | 20  | 0.8859   | 0.4676  | 0.1063  | Slight accuracy ↑; fairness worsens (DP & EO ↑). |
| **GS (EO)**   | 26  | 0.7011   | 0.7632  | 0.8750  | **Degenerate**: poor accuracy & fairness. |
| **GS (DP)**   | 30  | 0.8804   | 0.3886  | 0.0729  | **Strict improvement**: better DP & EO at baseline accuracy. |
| **GS (DP)**   | 20  | 0.8967   | 0.4149  | 0.0625  | **Best accuracy** with EO improvement; DP slightly worse. |
| **GS (DP)**   | 26  | 0.8750   | 0.3818  | 0.0833  | **Lowest DP**, EO ≈ baseline; small accuracy cost. |

---

### Interpretation
- **Exponentiated Gradient (EO & DP):** Ineffective — RF predictions are unchanged.  
- **GridSearch (EO):** Mostly unhelpful; either baseline-equivalent, worse fairness, or degenerate.  
- **GridSearch (DP):** Provides **real trade-offs**:  
  - **i=30:** best balanced improvement (Acc same, DP ↓, EO ↓).  
  - **i=20:** highest accuracy + EO gain, but DP worsens.  
  - **i=26:** lowest DP, small accuracy loss, EO ≈ baseline.  

**Takeaway:** For RF, **EG fails**, but **GridSearch with DP constraint** yields useful frontier candidates depending on whether the goal is **overall balance (i=30)**, **accuracy + EO parity (i=20)**, or **minimizing DP (i=26)**.

--

### Bias Mitigation RF: Post-processing: Threshold Optimizer

In [27]:
from fairlearn.postprocessing import ThresholdOptimizer

# 0) Baseline RF 
rf.fit(X_train_ready, y_train)
y_rf_base = rf.predict(X_test_ready)
m_rf_base = eval_fairness(y_test, y_rf_base, A_test)

print("=== Baseline (Random Forest) ===")
print(m_rf_base["by_group"])
print(f"Accuracy: {m_rf_base['acc']:.4f} | DP diff: {m_rf_base['dp']:.4f} | EO diff: {m_rf_base['eo']:.4f}")

# 1) Post-processing: Equalized Odds 
post_rf_eo = ThresholdOptimizer(
    estimator=rf,
    constraints="equalized_odds",
    predict_method="predict_proba",   
    grid_size=200,
    flip=True
)
post_rf_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_rf_eo = post_rf_eo.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_rf_eo = eval_fairness(y_test, y_rf_eo, A_test)

print("\n=== RF + Post-processing (Equalized Odds) ===")
print(m_rf_eo["by_group"])
print(f"Accuracy: {m_rf_eo['acc']:.4f} | DP diff: {m_rf_eo['dp']:.4f} | EO diff: {m_rf_eo['eo']:.4f}")

# 2) Post-processing: Demographic Parity 
post_rf_dp = ThresholdOptimizer(
    estimator=rf,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_rf_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_rf_dp = post_rf_dp.predict(X_test_ready, sensitive_features=A_test, random_state=42)
m_rf_dp = eval_fairness(y_test, y_rf_dp, A_test)

print("\n=== RF + Post-processing (Demographic Parity) ===")
print(m_rf_dp["by_group"])
print(f"Accuracy: {m_rf_dp['acc']:.4f} | DP diff: {m_rf_dp['dp']:.4f} | EO diff: {m_rf_dp['eo']:.4f}")

# 3) Summary Table
summary_rf_post = pd.DataFrame([
    {"model":"RF Baseline",       "accuracy":m_rf_base["acc"], "dp_diff":m_rf_base["dp"], "eo_diff":m_rf_base["eo"]},
    {"model":"RF + Post (EO)",    "accuracy":m_rf_eo["acc"],   "dp_diff":m_rf_eo["dp"],   "eo_diff":m_rf_eo["eo"]},
    {"model":"RF + Post (DP)",    "accuracy":m_rf_dp["acc"],   "dp_diff":m_rf_dp["dp"],   "eo_diff":m_rf_dp["eo"]},
]).round(4)

print("\n=== Random Forest: Baseline vs Post-processing ===")
print(summary_rf_post)

=== Baseline (Random Forest) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.916667  0.200  0.916667       0.671233  0.876712
Accuracy: 0.8804 | DP diff: 0.4081 | EO diff: 0.0833

=== RF + Post-processing (Equalized Odds) ===
         TPR    FPR   Recall  SelectionRate  Accuracy
Sex                                                  
0    1.00000  0.125  1.00000       0.263158  0.894737
1    0.90625  0.200  0.90625       0.664384  0.869863
Accuracy: 0.8750 | DP diff: 0.4012 | EO diff: 0.0938

=== RF + Post-processing (Demographic Parity) ===
         TPR    FPR   Recall  SelectionRate  Accuracy
Sex                                                  
0    1.00000  0.125  1.00000       0.263158  0.894737
1    0.90625  0.200  0.90625       0.664384  0.869863
Accuracy: 0.8750 | DP diff: 0.4012 | EO diff: 0.0938

=== Random Forest: Baseline vs Post-processing ===
  

# Random Forest Bias Mitigation (Post-processing)

## Summary

| Model              | Accuracy | DP Diff | EO Diff | Interpretation                                   |
|--------------------|:--------:|:-------:|:-------:|--------------------------------------------------|
| **RF Baseline**    | 0.8804   | 0.4081  | 0.0833  | High accuracy; **large DP disparity (~0.41)**; EO moderate. |
| **RF + Post (EO)** | 0.8750   | 0.4012  | 0.0938  | Accuracy ↓ slightly; DP gap ≈ baseline; **EO worsens**. |
| **RF + Post (DP)** | 0.8750   | 0.4012  | 0.0938  | Same outcome as Post (EO) → **no meaningful fairness gain**. |

## Key Facts
- The **RF baseline** already delivers high accuracy but exhibits a **substantial DP disparity** (≈0.41).  
- **ThresholdOptimizer** (both EO and DP) converged to **nearly identical solutions**, leaving DP almost unchanged and EO slightly worse.  
- Accuracy dropped marginally (−0.5 pp), making these post-processing methods ineffective in this setup.  
- Compared to DT, the **RF model is less responsive to post-processing interventions**, confirming that ensemble stability can limit fairness adjustments.  

---

### Deep Learning - Multi-layer Perceptron

In [28]:
#import required library 
from sklearn.neural_network import MLPClassifier

In [29]:
#Adam + Early Stopping 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

adammlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),   # slightly smaller/deeper can help
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,       # smaller step can stabilize
    alpha=1e-3,                    # L2 regularization to reduce overfitting
    batch_size=32,
    max_iter=1000,                 # increased max_iter
    early_stopping=True,           # use a validation split internally
    validation_fraction=0.15,
    n_iter_no_change=25,          
    tol=1e-4,
    random_state=42
)

adammlp.fit(X_train_ready, y_train)  
y_pred_mlp = adammlp.predict(X_test_ready)                     
y_prob_mlp = adammlp.predict_proba(X_test_ready)[:, 1]         

evaluate_model(y_test, y_pred_mlp, "(Adam + EarlyStopping)")

=== (Adam + EarlyStopping) Evaluation ===
Accuracy : 0.8586956521739131
Precision: 0.8877551020408163
Recall   : 0.8529411764705882
F1 Score : 0.87

Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.87      0.85        82
           1       0.89      0.85      0.87       102

    accuracy                           0.86       184
   macro avg       0.86      0.86      0.86       184
weighted avg       0.86      0.86      0.86       184

Confusion Matrix:
 [[71 11]
 [15 87]]




### Bias mitigation MLP: Inprocessing: Exponentiated Gradient 

In [30]:
from sklearn.neural_network import MLPClassifier
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, DemographicParity
from sklearn.base import clone
import pandas as pd

# 0) Baseline MLP (seeded for reproducibility)
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    learning_rate_init=1e-3,
    alpha=1e-3,
    batch_size=32,
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.15,
    n_iter_no_change=25,
    tol=1e-4,
    random_state=42
)

mlp.fit(X_train_ready, y_train)

y_pred_mlp_base = mlp.predict(X_test_ready)
m_mlp_base = eval_fairness(y_test, y_pred_mlp_base, A_test)

print("=== Baseline (MLP) ===")
print(m_mlp_base["by_group"])
print(f"Accuracy: {m_mlp_base['acc']:.4f} | DP diff: {m_mlp_base['dp']:.4f} | EO diff: {m_mlp_base['eo']:.4f}")

# 1) EG with Equalized Odds
eg_eo_mlp = ExponentiatedGradient(
    estimator=clone(mlp),   # inherits random_state=42
    constraints=EqualizedOdds(),
    eps=0.01,
    max_iter=50
)
eg_eo_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)

# Prefer predict(..., random_state=42) if supported; otherwise fall back without global seeds
try:
    y_pred_mlp_eo = eg_eo_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_mlp_eo = eg_eo_mlp.predict(X_test_ready)

m_mlp_eo = eval_fairness(y_test, y_pred_mlp_eo, A_test)

print("\n=== In-processing MLP: EG (Equalized Odds) ===")
print(m_mlp_eo["by_group"])
print(f"Accuracy: {m_mlp_eo['acc']:.4f} | DP diff: {m_mlp_eo['dp']:.4f} | EO diff: {m_mlp_eo['eo']:.4f}")

# 2) EG with Demographic Parity
eg_dp_mlp = ExponentiatedGradient(
    estimator=clone(mlp),
    constraints=DemographicParity(),
    eps=0.01,
    max_iter=50
)
eg_dp_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)

try:
    y_pred_mlp_dp = eg_dp_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_mlp_dp = eg_dp_mlp.predict(X_test_ready)

m_mlp_dp = eval_fairness(y_test, y_pred_mlp_dp, A_test)

print("\n=== In-processing MLP: EG (Demographic Parity) ===")
print(m_mlp_dp["by_group"])
print(f"Accuracy: {m_mlp_dp['acc']:.4f} | DP diff: {m_mlp_dp['dp']:.4f} | EO diff: {m_mlp_dp['eo']:.4f}")

# 3) Summary Table
summary_mlp = pd.DataFrame([
    {"model":"MLP Baseline",  "accuracy":m_mlp_base["acc"], "dp_diff":m_mlp_base["dp"], "eo_diff":m_mlp_base["eo"]},
    {"model":"MLP + EG (EO)", "accuracy":m_mlp_eo["acc"],   "dp_diff":m_mlp_eo["dp"],   "eo_diff":m_mlp_eo["eo"]},
    {"model":"MLP + EG (DP)", "accuracy":m_mlp_dp["acc"],   "dp_diff":m_mlp_dp["dp"],   "eo_diff":m_mlp_dp["eo"]},
]).round(4)

print("\n=== MLP: Baseline vs In-processing (EG) ===")
print(summary_mlp)

=== Baseline (MLP) ===
         TPR    FPR   Recall  SelectionRate  Accuracy
Sex                                                  
0    1.00000  0.125  1.00000       0.263158  0.894737
1    0.84375  0.140  0.84375       0.602740  0.849315
Accuracy: 0.8587 | DP diff: 0.3396 | EO diff: 0.1562

=== In-processing MLP: EG (Equalized Odds) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.833333  0.160  0.833333       0.602740  0.835616
Accuracy: 0.8478 | DP diff: 0.3396 | EO diff: 0.1667

=== In-processing MLP: EG (Demographic Parity) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.833333  0.160  0.833333       0.602740  0.835616
Accuracy: 0.8478 | DP diff: 0.3396 | EO diff: 0.1667

=== MLP: Baseline vs In-processing (EG) ===
         

# MLP In-Processing Bias Mitigation Results

## Summary

| Model             | Accuracy | DP Diff | EO Diff | Interpretation                                                        |
|-------------------|:--------:|:-------:|:-------:|-----------------------------------------------------------------------|
| **MLP Baseline**  | 0.8587   | 0.3396  | 0.1562  | Decent accuracy but **substantial DP disparity** (~0.34) and **EO gap** (~0.16). |
| **MLP + EG (EO)** | 0.8478   | 0.3396  | 0.1667  | Accuracy ↓; **EO worsens**; DP unchanged.                             |
| **MLP + EG (DP)** | 0.8478   | 0.3396  | 0.1667  | Same as EG(EO): accuracy ↓; **EO worsens**; DP unchanged.             |

## Key Points
- **Exponentiated Gradient (EG)** failed to improve fairness:  
  - **DP disparity** stayed high (~0.34).  
  - **EO disparity** worsened (0.156 → 0.167).  
  - **Accuracy dropped** slightly (0.859 → 0.848).  
- Identical outcomes for EG(EO) and EG(DP) suggest the algorithm converged to the **same frontier solution**, showing **low sensitivity** of MLP to these fairness constraints.  
- In the **CVD gender bias** context, this means disparities remain unresolved and mitigation attempts were ineffective.  

---

### Bias mitigation MLP: Inprocessing: Grid Search

In [32]:
from sklearn.base import clone
from fairlearn.reductions import GridSearch, EqualizedOdds, DemographicParity

# 1) GridSearch with Equalized Odds (MLP)
gs_eo_mlp = GridSearch(
    estimator=clone(adammlp),                 
    constraints=EqualizedOdds(),
    selection_rule="tradeoff_optimization",
    constraint_weight=0.5,                   
    grid_size=15                             
)
gs_eo_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)
# Some Fairlearn versions support random_state in predict; fall back if not
try:
    y_pred_gs_eo_mlp = gs_eo_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_gs_eo_mlp = gs_eo_mlp.predict(X_test_ready)

m_gs_eo_mlp = eval_fairness(y_test, y_pred_gs_eo_mlp, A_test)
print("\n=== In-processing MLP: GridSearch (Equalized Odds) ===")
print(m_gs_eo_mlp["by_group"])
print(f"Accuracy: {m_gs_eo_mlp['acc']:.4f} | DP diff: {m_gs_eo_mlp['dp']:.4f} | EO diff: {m_gs_eo_mlp['eo']:.4f}")

# 2) GridSearch with Demographic Parity (MLP)
gs_dp_mlp = GridSearch(
    estimator=clone(adammlp),
    constraints=DemographicParity(),
    selection_rule="tradeoff_optimization",
    constraint_weight=0.5,
    grid_size=15
)
gs_dp_mlp.fit(X_train_ready, y_train, sensitive_features=A_train)
try:
    y_pred_gs_dp_mlp = gs_dp_mlp.predict(X_test_ready, random_state=42)
except TypeError:
    y_pred_gs_dp_mlp = gs_dp_mlp.predict(X_test_ready)

m_gs_dp_mlp = eval_fairness(y_test, y_pred_gs_dp_mlp, A_test)
print("\n=== In-processing MLP: GridSearch (Demographic Parity) ===")
print(m_gs_dp_mlp["by_group"])
print(f"Accuracy: {m_gs_dp_mlp['acc']:.4f} | DP diff: {m_gs_dp_mlp['dp']:.4f} | EO diff: {m_gs_dp_mlp['eo']:.4f}")

# 3) Compare with existing MLP runs (baseline + EG)
summary_mlp = pd.concat([
    summary_mlp,
    pd.DataFrame([
        {"model":"MLP + GS (EO)", "accuracy":m_gs_eo_mlp["acc"], "dp_diff":m_gs_eo_mlp["dp"], "eo_diff":m_gs_eo_mlp["eo"]},
        {"model":"MLP + GS (DP)", "accuracy":m_gs_dp_mlp["acc"], "dp_diff":m_gs_dp_mlp["dp"], "eo_diff":m_gs_dp_mlp["eo"]},
    ]).round(4)
], ignore_index=True)

print("\n=== MLP: Baseline vs EG vs GS ===")
print(summary_mlp)


=== In-processing MLP: GridSearch (Equalized Odds) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    1.000000  0.09375  1.000000       0.236842  0.921053
1    0.833333  0.18000  0.833333       0.609589  0.828767
Accuracy: 0.8478 | DP diff: 0.3727 | EO diff: 0.1667

=== In-processing MLP: GridSearch (Demographic Parity) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    1.000000  0.125  1.000000       0.263158  0.894737
1    0.822917  0.160  0.822917       0.595890  0.828767
Accuracy: 0.8424 | DP diff: 0.3327 | EO diff: 0.1771

=== MLP: Baseline vs EG vs GS ===
           model  accuracy  dp_diff  eo_diff
0   MLP Baseline    0.8587   0.3396   0.1562
1  MLP + EG (EO)    0.8478   0.3396   0.1667
2  MLP + EG (DP)    0.8478   0.3396   0.1667
3  MLP + GS (EO)    0.8478   0.3727   0.1667
4  MLP + GS (DP)    0.8424   0.3327   0.1771
5  MLP + GS (EO) 

### MLP — In-Processing (GridSearch vs EG)

### Comparative Results

| Model              | Accuracy | DP Diff | EO Diff | Notes |
|--------------------|:--------:|:-------:|:-------:|-------|
| **MLP Baseline**   | 0.8587   | 0.3396  | 0.1562  | Reference: good accuracy but clear DP & EO disparities |
| **EG (EO)**        | 0.8478   | 0.3396  | 0.1667  | Accuracy ↓; DP unchanged; EO worsens |
| **EG (DP)**        | 0.8478   | 0.3396  | 0.1667  | Same as EG(EO): no fairness gain |
| **GS (EO)**        | 0.8478   | 0.3727  | 0.1667  | Accuracy ↓; DP ↑; EO worsens (driven by TPR/FPR gaps) |
| **GS (DP)**        | 0.8424   | 0.3327  | 0.1771  | Accuracy ↓; tiny DP improvement; EO worsens |

### Interpretation
- **Baseline (MLP):** Accuracy **0.8587** with **moderate DP (0.34)** and **EO (0.16)** disparities.  
- **EG methods:** Both EO and DP constraints produced **identical models** — no DP improvement, EO worsened slightly, and accuracy dropped.  
- **GS methods:**  
  - **GS (EO)** worsened both disparities (DP ↑, EO ↑) while lowering accuracy.  
  - **GS (DP)** gave a **tiny DP reduction**, but EO worsened and accuracy dropped more.  

### Takeaway
Neither **Exponentiated Gradient** nor **GridSearch** improved fairness for MLP.  
- **EO consistently worsened** in all constrained models.  
- **DP improved only trivially** under GS(DP).  
- All fairness constraints **reduced accuracy**, making the **baseline MLP the best option** in this setup.  

---

### Bias mitigation MLP: Postprocessing: Threshold Optimizer

In [33]:
from fairlearn.postprocessing import ThresholdOptimizer
import pandas as pd

# 0) Baseline MLP
adammlp.fit(X_train_ready, y_train)
y_mlp_base = adammlp.predict(X_test_ready)
m_mlp_base = eval_fairness(y_test, y_mlp_base, A_test)

print("=== Baseline (MLP) ===")
print(m_mlp_base["by_group"])
print(f"Accuracy: {m_mlp_base['acc']:.4f} | DP diff: {m_mlp_base['dp']:.4f} | EO diff: {m_mlp_base['eo']:.4f}")

# 1) Post-processing: Equalized Odds
post_mlp_eo = ThresholdOptimizer(
    estimator=adammlp,
    constraints="equalized_odds",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_mlp_eo.fit(X_train_ready, y_train, sensitive_features=A_train)
y_mlp_eo = post_mlp_eo.predict(X_test_ready, sensitive_features=A_test)
m_mlp_eo = eval_fairness(y_test, y_mlp_eo, A_test)

print("\n=== MLP + Post-processing (Equalized Odds) ===")
print(m_mlp_eo["by_group"])
print(f"Accuracy: {m_mlp_eo['acc']:.4f} | DP diff: {m_mlp_eo['dp']:.4f} | EO diff: {m_mlp_eo['eo']:.4f}")

# 2) Post-processing: Demographic Parity
post_mlp_dp = ThresholdOptimizer(
    estimator=adammlp,
    constraints="demographic_parity",
    predict_method="predict_proba",
    grid_size=200,
    flip=True
)
post_mlp_dp.fit(X_train_ready, y_train, sensitive_features=A_train)
y_mlp_dp = post_mlp_dp.predict(X_test_ready, sensitive_features=A_test)
m_mlp_dp = eval_fairness(y_test, y_mlp_dp, A_test)

print("\n=== MLP + Post-processing (Demographic Parity) ===")
print(m_mlp_dp["by_group"])
print(f"Accuracy: {m_mlp_dp['acc']:.4f} | DP diff: {m_mlp_dp['dp']:.4f} | EO diff: {m_mlp_dp['eo']:.4f}")

# 3) Summary Table
summary_mlp_post = pd.DataFrame([
    {"model":"MLP Baseline",       "accuracy":m_mlp_base["acc"], "dp_diff":m_mlp_base["dp"], "eo_diff":m_mlp_base["eo"]},
    {"model":"MLP + Post (EO)",    "accuracy":m_mlp_eo["acc"],   "dp_diff":m_mlp_eo["dp"],   "eo_diff":m_mlp_eo["eo"]},
    {"model":"MLP + Post (DP)",    "accuracy":m_mlp_dp["acc"],   "dp_diff":m_mlp_dp["dp"],   "eo_diff":m_mlp_dp["eo"]},
]).round(4)

print("\n=== MLP: Baseline vs Post-processing ===")
print(summary_mlp_post)

=== Baseline (MLP) ===
         TPR    FPR   Recall  SelectionRate  Accuracy
Sex                                                  
0    1.00000  0.125  1.00000       0.263158  0.894737
1    0.84375  0.140  0.84375       0.602740  0.849315
Accuracy: 0.8587 | DP diff: 0.3396 | EO diff: 0.1562

=== MLP + Post-processing (Equalized Odds) ===
          TPR      FPR    Recall  SelectionRate  Accuracy
Sex                                                      
0    1.000000  0.15625  1.000000       0.289474  0.868421
1    0.947917  0.30000  0.947917       0.726027  0.863014
Accuracy: 0.8641 | DP diff: 0.4366 | EO diff: 0.1437

=== MLP + Post-processing (Demographic Parity) ===
          TPR    FPR    Recall  SelectionRate  Accuracy
Sex                                                    
0    0.833333  0.125  0.833333       0.236842  0.868421
1    0.843750  0.140  0.843750       0.602740  0.849315
Accuracy: 0.8533 | DP diff: 0.3659 | EO diff: 0.0150

=== MLP: Baseline vs Post-processing ===
    

### MLP — Post-Processing: Threshold Optimizer 

#### Summary

| Model             | Accuracy | DP diff | EO diff | Notes                                                                 |
|-------------------|:--------:|:-------:|:-------:|------------------------------------------------------------------------|
| **Baseline**      | 0.8587   | 0.3396  | 0.1562  | Solid accuracy; large outcome disparity (DP ≈ 0.34), moderate EO gap.  |
| **Post (EO)**     | 0.8641   | 0.4366  | 0.1437  | Accuracy ↑ (+0.0054); **EO improves slightly** (−0.0125); **DP worsens sharply** (+0.097). |
| **Post (DP)**     | 0.8533   | 0.3659  | 0.0150  | Accuracy ↓ (−0.0054); **EO improves strongly** (−0.1412); **DP worsens slightly** (+0.0263). |

#### Interpretation
- **Equalized Odds (Post EO):**  
  - Brings a **tiny EO gain** (0.156 → 0.144), improves accuracy,  
  - But **DP disparity rises substantially** (0.340 → 0.437), meaning sex groups get flagged at very different rates.  

- **Demographic Parity (Post DP):**  
  - Produces a **dramatic EO improvement** (0.156 → 0.015), nearly equalizing error rates,  
  - But comes with a small accuracy drop and a **slight DP increase**.  

**Takeaway:**  
- If the priority is **balanced error rates** (minimizing sex-based gaps in TPR/FPR), **Post (DP)** is preferable despite modest DP worsening.  
- If **accuracy** is prioritized and a slight EO gain is enough, **Post (EO)** is the better choice, though it increases outcome disparities.  

---

## Overall Comparison:

# Gender Bias Mitigation in CVD Prediction — Overall Interpretation

**Fairness metrics:**  
- **DP diff** (Demographic Parity): selection-rate gap across sexes (lower = more equal triage/alerts).  
- **EO diff** (Equalized Odds): error-rate gap (combined TPR/FPR gap) across sexes (lower = more equal misses/false alarms).

---

## One-glance summary (best observed per family)

| Model family | Best config in your runs | Accuracy | DP diff | EO diff | Why it’s “best” |
|---|---:|---:|---:|---:|---|
| **PCA+KNN** | *(none effective)* | — | — | — | Post-processing made **0% flips**; CR slightly ↓DP but ↑EO and ↓acc. |
| **Decision Tree** | **EG (DP)** | 0.8098 | 0.2880 | **0.0300** | **Strongest EO reduction** with **no accuracy loss**; DP still high. |
|  | **Post (DP)** | 0.8152 | **0.2422** | 0.1125 | **Best DP** for DT with slight **acc ↑**; EO worsens. |
| **Random Forest** | **GS (DP), i=30** | 0.8804 | **0.3886** | **0.0729** | **Balanced improvement** vs baseline: same acc, **DP↓**, **EO↓**. |
|  | GS (DP), i=20 | **0.8967** | 0.4149 | **0.0625** | **Highest acc** + **EO↓**; DP slightly worse than baseline. |
|  | GS (DP), i=26 | 0.8750 | **0.3818** | 0.0833 | **Lowest DP** among RF with small acc cost; EO ≈ baseline. |
| **MLP** | **Post (DP)** | 0.8533 | 0.3659 | **0.0150** | **Best EO overall** (near-equal error rates); DP ↑ and acc ↓ slightly. |
|  | *(EG / GS)* | — | — | — | EG and GS **did not help**; some runs worsened fairness and/or accuracy. |

> Baselines to remember: **DT** (acc 0.8098, DP 0.2549, EO 0.0812) • **RF** (acc 0.8804, DP 0.4081, EO 0.0833) • **MLP** (acc 0.8587, DP 0.3396, EO 0.1562) • **PCA+KNN** (acc 0.8859, DP 0.3796, EO 0.0521)

---

## What worked 

### When the clinical priority is **error-rate parity (EO)**  

- **Decision Tree + EG (DP)**: EO **0.0812 → 0.0300** with **no acc loss**; DP rises to 0.2880.  
- **MLP + Post (DP)**: EO **0.1562 → 0.0150** (best EO overall) with **small acc drop** and **DP ↑**.  
- **Random Forest + GS (DP, i=20)**: EO **0.0833 → 0.0625** with **highest acc** (0.8967), DP slightly worse.

**Pick:**  
- Want **strongest EO** and can tolerate some DP/accuracy trade-off → **MLP + Post(DP)**.  
- Want **EO ↓ with no acc loss** in a simple model → **DT + EG(DP)**.  
- Want **EO ↓ with best acc** in an ensemble → **RF + GS(DP, i=20)**.

---

### When the priority is **outcome parity (DP)**  

- **Decision Tree + Post (DP)**: DP **0.2549 → 0.2422**, **acc ↑** slightly; EO worsens.  
- **Random Forest + GS (DP, i=26)**: DP **0.4081 → 0.3818** (lowest DP for RF) with small acc cost; EO ≈ baseline.  
- **Random Forest + GS (DP, i=30)**: **Joint DP↓ & EO↓** at **same acc** — best *balanced* RF improvement.

**Pick:**  
- Need **best DP** on DT with minimal disruption → **DT + Post(DP)**.  
- Need **RF with DP focus** → **GS(DP, i=26)** (lowest DP) or **i=30** for **joint DP↓ & EO↓** without acc loss.

---

### Approaches that were **ineffective or harmful** in this study

- **PCA+KNN Post-processing (DP/EO):** **0% label flips** → no effect. **CR**: ↓acc, ↓DP, but **↑EO**.  
- **Random Forest EG (EO/DP):** **No change** from baseline.  
- **Random Forest Post (EO/DP):** **Acc ↓** and **EO worsened**; DP ≈ baseline.  
- **MLP EG / GS:** Generally **↓acc** and **EO ↑**; only **Post(DP)** helped EO (with DP/acc trade-offs).  
- **GridSearch (EO)** for DT and RF: repeatedly **baseline-equivalent** or **worse**; some EO runs **degenerate**.

---

## Practical guidance for CVD deployment

1. **First decide the fairness target**  
   - **Clinical safety (minimize sex-based error-rate gaps)** → **EO target** (e.g., EO ≤ 0.05).  
   - **Access equity (equalize intervention rates)** → **DP target** (e.g., DP ≤ 0.10–0.15 given your baselines).

2. **Pick the model/mitigation to match that target**  
   - **EO-driven:**  
     - **DT + EG(DP)** (stable acc, EO ~0.03), or  
     - **RF + GS(DP, i=20)** (best acc, EO ~0.063), or  
     - **MLP + Post(DP)** (EO ~0.015, accept DP↑ and small acc↓).  
   - **DP-driven:**  
     - **DT + Post(DP)** (lowest DP for DT; EO trade-off),  
     - **RF + GS(DP, i=26)** (lowest DP in RF), or **i=30** for **joint DP↓ & EO↓** without acc loss.

3. **Lock selection rules before finalizing**  
   - Use a **held-out selection set** with pre-declared thresholds (e.g., **EO ≤ 0.05 & DP ≤ 0.30 & Acc ≥ baseline − 0.5 pp**).  
   - Prefer **frontier points** that meet all thresholds rather than optimizing a single metric in isolation.


---

## Summary:

- **There is no one-size-fits-all fix**: each model category offers a different **fairness–utility** profile.  
- For **EO parity** (clinical risk symmetry), your strongest choices are **DT + EG(DP)** (no acc loss) and **RF + GS(DP, i=20)** (best acc), with **MLP + Post(DP)** delivering the **lowest EO** if you accept a DP/accuracy trade-off.  
- For **DP parity** (equal access), **DT + Post(DP)** and **RF + GS(DP, i=26/30)** are the most appropriate.  
- **Avoid** PCA+KNN post-processing (no effect), **RF EG** (no effect), and any configurations that **inflate EO** without clear benefit.  


---