### Bias Detection and Fairness Evaluation on CVD Prediction (Mendeley Dataset) using FairMLhealth
Source: https://data.mendeley.com/datasets/dzz48mvjht/1

In [1]:
import pandas as pd

# Load the held-out test features and labels
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
#have a look at the modules that are within fairmlhealth

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
# Load the fairmlhealth modules used below

# measure.summary produces the bias-detection report
from fairmlhealth import measure

# iterate_cohorts supports analysis of individual cohorts
from fairmlhealth.__utils import iterate_cohorts

# FairRanges defines the bounds used to flag out-of-range values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)



### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [6]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("MendeleyData_75M25F_KNN_best_predictions.csv")

print(knn_df.head())

   gender  y_true    y_prob  y_pred
0       0       0  0.331268       0
1       1       0  0.000000       0
2       1       1  1.000000       1
3       1       1  1.000000       1
4       1       0  0.000000       0


In [7]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["gender"].values

# Use gender_knn as the protected attribute (0 = female, 1 = male, as encoded in the CSV)
protected_attr_knn = gender_knn

In [8]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)


                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0556
               Balanced Accuracy Difference           -0.1298
               Balanced Accuracy Ratio                 0.8654
               Disparate Impact Ratio                  0.8564
               Equal Odds Difference                  -0.1752
               Equal Odds Ratio                        6.4000
               Positive Predictive Parity Difference  -0.0793
               Positive Predictive Parity Ratio        0.9198
               Statistical Parity Difference          -0.0802
Data Metrics   Prevalence of Privileged Class (%)     77.0000



In [9]:
# Custom, scenario-oriented fairness bounds (stricter than the common 0.8-1.25 rule)

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)

In [10]:
# Restore Styler.set_precision (removed in pandas 2.0) so the styled, flagged table can still be rendered
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [11]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | -0.0556 |
| Group Fairness | Balanced Accuracy Difference          | -0.1298 |
| Group Fairness | Balanced Accuracy Ratio               | 0.8654  |
| Group Fairness | Disparate Impact Ratio                | 0.8564  |
| Group Fairness | Equal Odds Difference                 | -0.1752 |
| Group Fairness | Equal Odds Ratio                      | 6.4     |
| Group Fairness | Positive Predictive Parity Difference | -0.0793 |
| Group Fairness | Positive Predictive Parity Ratio      | 0.9198  |
| Group Fairness | Statistical Parity Difference         | -0.0802 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 77.0    |


## Gender Bias Detection Results for KNN Model 

---

### 1. Group Fairness Metrics

- **AUC Difference (-0.0556)**: The ROC AUC is lower for females, indicating weaker ranking performance compared to males.  
- **Balanced Accuracy Difference (-0.1298)** and **Ratio (0.8654)**: Substantial disparity, with females experiencing significantly worse balanced accuracy than males.  
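- **Disparate Impact Ratio (0.8564)**: Within the commonly cited four-fifths range (0.80–1.25) but below the stricter custom bound of 0.9–1.1 set in `custom_ranges`; females are selected at a lower rate than males.  
- **Equal Odds Difference (-0.1752)** and **Equal Odds Ratio (6.4000)**: Large disparities in error rates across genders, favoring males. The difference matches the TPR gap and the ratio matches the FPR ratio (female FPR 0.1000 vs. male 0.0156) in the stratified tables.  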
- **Positive Predictive Parity Difference (-0.0793)** and **Ratio (0.9198)**: Predictions are less reliable for females, with lower precision compared to males.  
- **Statistical Parity Difference (-0.0802)**: Females are selected at a lower rate than males, reinforcing evidence of imbalance.  

---

### 2. Interpretation

- The KNN model shows **marked fairness concerns**:  
  - Females face **lower AUC and balanced accuracy**, indicating poorer overall predictive performance.  
  - Error rates (equal odds) are highly skewed, with a **large disparity** suggesting males are treated much more favorably.  
  - Females also experience **lower precision and reduced selection rates**, confirming consistent disadvantages across multiple fairness measures.  

---

### **Summary**
The KNN model demonstrates **systematic gender bias**, strongly favoring males (privileged group) at the expense of females (unprivileged group).  
Measured against the custom bounds in `custom_ranges`, the disparities in **balanced accuracy, equal odds, and selection rates** are substantial: KNN is **clearly unfair with respect to gender** and would require mitigation before deployment.
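
The comparison against the custom bounds can be reproduced without the styled table. The following is a minimal sketch in plain Python, not fairmlhealth's own flagging logic; the helper name `out_of_range` is hypothetical, and the values are copied from the KNN summary above.

```python
# Custom bounds from the notebook (only the measures present in the KNN summary).
custom_ranges = {
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

# Values copied from the KNN fairness summary.
knn_summary = {
    "auc difference": -0.0556,
    "balanced accuracy difference": -0.1298,
    "disparate impact ratio": 0.8564,
    "equal odds difference": -0.1752,
    "statistical parity difference": -0.0802,
}


def out_of_range(values, ranges):
    """Return the measures whose value lies outside its (low, high) bound."""
    flagged = {}
    for name, value in values.items():
        low, high = ranges[name]
        if not (low <= value <= high):
            flagged[name] = value
    return flagged


flagged_knn = out_of_range(knn_summary, custom_ranges)
for name, value in flagged_knn.items():
    print(f"{name}: {value} is outside {custom_ranges[name]}")
```

Under these bounds all five KNN measures are flagged. With the looser four-fifths range (0.8, 1.25), the disparate impact ratio of 0.8564 would not be.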

---

In [12]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN



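|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | 0.1298                       | 1.1555                  | -0.0844  | 0.1562    | 0.0793   | 1.0872    | 0.0802         | 1.1677          | 0.1752   | 1.2278    |
| 1 | gender       | 1             | -0.1298                      | 0.8654                  | 0.0844   | 6.4       | -0.0793  | 0.9198    | -0.0802        | 0.8564          | -0.1752  | 0.8145    |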


## Stratified Bias Analysis – KNN by Gender

This table presents **group-specific fairness metrics** for the KNN model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

Each row uses its own feature value as the reference group. The `gender = 1` row compares females against males and reproduces the `measure.summary` values (computed with `priv_grp=1`); the `gender = 0` row is its mirror image and compares males against females. Differences are therefore "other group minus reference group" and ratios are "other group divided by reference group". This reading is inferred from the agreement with the summary table and with the per-group rates in the stratified performance table.

---

### 1. Balanced Accuracy
- **Row `gender = 0` (males relative to females):** Balanced Accuracy Difference = **+0.1298**, Ratio = **1.1555**  
- **Row `gender = 1` (females relative to males):** Balanced Accuracy Difference = **−0.1298**, Ratio = **0.8654**  
- ➝ Balanced accuracy is about 13 percentage points **lower for females** than for males.

---

### 2. False Positive Rate (FPR)
- **Row `gender = 0` (males relative to females):** FPR Difference = **−0.0844**, Ratio = **0.1562**  
- **Row `gender = 1` (females relative to males):** FPR Difference = **+0.0844**, Ratio = **6.4000**  
- ➝ Females face a **false positive rate about 6.4 times that of males** (0.1000 vs. 0.0156).

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row `gender = 0` (males relative to females):** PPV Difference = **+0.0793**, Ratio = **1.0872**  
- **Row `gender = 1` (females relative to males):** PPV Difference = **−0.0793**, Ratio = **0.9198**  
- ➝ Positive predictions are **less precise for females** than for males.

---

### 4. Selection Rate
- **Row `gender = 0` (males relative to females):** Selection Difference = **+0.0802**, Ratio = **1.1677**  
- **Row `gender = 1` (females relative to males):** Selection Difference = **−0.0802**, Ratio = **0.8564**  
- ➝ Females are **predicted positive less often** than males; the ratio of 0.8564 is the Disparate Impact Ratio from the summary.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row `gender = 0` (males relative to females):** TPR Difference = **+0.1752**, Ratio = **1.2278**  
- **Row `gender = 1` (females relative to males):** TPR Difference = **−0.1752**, Ratio = **0.8145**  
- ➝ Sensitivity is about 17.5 percentage points **lower for females**, meaning more missed CVD cases among women.

---

### **Summary**
- Read with the reference-group convention above, the stratified table shows that KNN **disadvantages females (unprivileged group)** on every metric: lower balanced accuracy, higher false positive rate, lower precision, lower selection rate, and markedly lower sensitivity.  
- **Males (privileged group)** are consistently favored.  
- The positive values in the `gender = 0` row do not indicate a reverse bias; they describe the male advantage as measured from the female reference group.  

The **large disparities in FPR and TPR** are a fairness concern that should be addressed before deployment.
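
One way to check how each row of the stratified bias table should be read is to recompute the TPR and FPR entries from the per-group rates in the KNN stratified performance table (TPR 0.7692 and FPR 0.1000 for gender 0; TPR 0.9444 and FPR 0.0156 for gender 1). The sketch below is plain Python with hypothetical names. It assumes that each row treats its own feature value as the reference group, which is inferred from the numbers rather than taken from the fairmlhealth documentation.

```python
# Per-group rates copied from the KNN stratified performance table.
rates = {
    0: {"TPR": 0.7692, "FPR": 0.1000},  # female
    1: {"TPR": 0.9444, "FPR": 0.0156},  # male
}


def row_for(reference, rates):
    """Differences and ratios of the other group relative to `reference`."""
    other = 1 - reference
    row = {}
    for metric in ("TPR", "FPR"):
        row[f"{metric} Diff"] = round(rates[other][metric] - rates[reference][metric], 4)
        row[f"{metric} Ratio"] = round(rates[other][metric] / rates[reference][metric], 4)
    return row


row_gender_1 = row_for(1, rates)  # females relative to males
row_gender_0 = row_for(0, rates)  # males relative to females
print("gender = 1:", row_gender_1)
print("gender = 0:", row_gender_0)
```

Up to rounding of the published rates, the result for `gender = 1` matches that row of the bias table, so the negative TPR difference there corresponds to lower sensitivity for females.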


In [13]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)

# Replace NaN with a dash
perf_table_knn = perf_table_knn.fillna("—")

# display pretty table
display(perf_table_knn)


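|   | Feature Name | Feature Value | Obs.  | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|-------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 200.0 | 0.58        | 0.54            | 0.93     | 0.9375   | 0.0357 | —      | 0.9722    | 0.9829  | 0.9052 |
| 1 | gender       | 0             | 46.0  | 0.5652      | 0.4783          | 0.8261   | 0.8333   | 0.1    | —      | 0.9091    | 0.9365  | 0.7692 |
| 2 | gender       | 1             | 154.0 | 0.5844      | 0.5584          | 0.961    | 0.9659   | 0.0156 | —      | 0.9884    | 0.9922  | 0.9444 |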


## Stratified Performance Analysis – KNN by Gender

This table shows the **stratified performance metrics** of the KNN model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9300)** and **F1-score (0.9375)** indicate good overall classification performance.  
- **ROC AUC (0.9829)** shows excellent discriminatory power.  
- **Precision (0.9722)** is very high, suggesting predictions are generally reliable.  
- **TPR (0.9052)** reflects strong sensitivity overall, though subgroup breakdowns reveal disparities.  
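- **Note**: PR AUC is reported as missing (shown as a dash) for every row, including the full test set of 200 observations, so subgroup size alone does not explain it; the cause is not established here.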

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.8261     | 0.9610   | Accuracy is substantially higher for males. |
| **F1-Score**  | 0.8333     | 0.9659   | Model is more effective for males. |
| **FPR**       | 0.1000     | 0.0156   | Females experience far more false positives. |
| **Precision** | 0.9091     | 0.9884   | Predictions for males are more reliable. |
| **ROC AUC**   | 0.9365     | 0.9922   | Strong disparity; males benefit from much better ranking performance. |
| **TPR**       | 0.7692     | 0.9444   | Females are more likely to be missed (lower sensitivity). |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Face substantially worse performance: lower accuracy, F1, and ROC AUC.  
  - Have a **much higher false positive rate (10%)** and a significantly lower **true positive rate (76.9%)**, meaning more missed CVD cases.  
  - Precision (0.9091) is lower, so positive predictions for females are less trustworthy.  

- **Males (privileged)**:  
  - Benefit from consistently better metrics across the board.  
  - High accuracy (96.1%), excellent sensitivity (94.4%), and very low false positive rate (1.56%).  
  - Predictions are extremely reliable, with near-perfect precision (0.9884) and ROC AUC (0.9922).  

---

### **Summary**
The KNN model demonstrates a **systematic disadvantage for females**.  
- Females are more likely to be misclassified, both in terms of **missed true cases (low TPR)** and **false alarms (high FPR)**.  
- Males receive far more favorable outcomes across all key metrics, including accuracy, F1, precision, and AUC.  

This performance disparity aligns with the fairness metrics, confirming that KNN introduces **strong gender bias in favor of males**.
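
Balanced accuracy is the mean of sensitivity and specificity, so the Balanced Accuracy Difference and Ratio in the fairness summary can be recovered from the TPR and FPR values in the table above. A minimal sketch in plain Python follows; the function name is illustrative, and the inputs are the rounded rates from the table.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# Rounded per-group rates from the KNN stratified performance table.
ba_female = balanced_accuracy(tpr=0.7692, fpr=0.1000)
ba_male = balanced_accuracy(tpr=0.9444, fpr=0.0156)

ba_diff = ba_female - ba_male   # unprivileged minus privileged
ba_ratio = ba_female / ba_male  # unprivileged divided by privileged

print(f"Balanced accuracy (female): {ba_female:.4f}")
print(f"Balanced accuracy (male):   {ba_male:.4f}")
print(f"Difference: {ba_diff:.4f}, Ratio: {ba_ratio:.4f}")
```

The result agrees with the summary values of −0.1298 and 0.8654 to within rounding.
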

In [14]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.7692
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9444
  False Positive Rate (FPR): 0.0156
----------------------------------------


### Group-Specific Error Analysis – KNN Model

To further examine fairness at the subgroup level, we compared the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.

#### Results by Gender Group

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.7692  | 0.1000  |
| Privileged (Male)      | 0.9444  | 0.0156  |

#### Interpretation

- The **privileged group (male)** has a considerably higher **TPR (94.44%)** than the unprivileged group (76.92%), showing that males are far more likely to be correctly identified when they have CVD.  
- At the same time, the **FPR is much higher for females (10.00%)** compared to males (1.56%), meaning women are also more likely to be incorrectly flagged as having CVD.  
- This double disparity indicates that the model is **both more sensitive and more specific for males**, while females face a greater risk of missed diagnoses and false alarms.  
- These subgroup-level imbalances correspond to the fairness metrics, where the **Equal Odds Difference and Ratio** confirm significant disparities in error distributions between genders.  

#### Summary

The results reveal a **systematic disadvantage for the unprivileged group (females)**: they suffer from both lower sensitivity (missed true cases) and higher false positive rates. This highlights a critical fairness concern and underscores the need to apply **bias mitigation strategies** to ensure more equitable performance across genders.
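
For readers who want to verify group-wise TPR and FPR independently of `performance_metrics`, the rates can be computed directly from confusion counts. The sketch below is dependency-free Python; `group_rates` is a hypothetical helper, and the toy arrays are invented for illustration and are not the study's predictions.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions."""
    counts = {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if yt == 1:
            c["tp" if yp == 1 else "fn"] += 1
        else:
            c["fp" if yp == 1 else "tn"] += 1

    rates = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[g] = (tpr, fpr)
    return rates


# Toy data for illustration only (0 = female, 1 = male).
groups = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

toy_rates = group_rates(y_true, y_pred, groups)
for g, (tpr, fpr) in sorted(toy_rates.items()):
    print(f"group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Applied to the notebook's data, the inputs would be the `y_true`, `y_pred`, and `gender` columns of the predictions CSV.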

---

### Decision Tree - DT

In [15]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("MendeleyData_75M25F_DT_tuned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred_dt    y_prob
0       0       0          0  0.125000
1       1       0          0  0.125000
2       1       1          1  0.975845
3       1       1          1  0.849057
4       1       0          0  0.005236


In [16]:
import re

# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred_dt"].values
gender_dt = dt_df["gender"].values


# Use gender_dt as the protected attribute (0/1 as in the CSV)
protected_attr_dt = gender_dt


In [17]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---



                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0294
               Balanced Accuracy Difference            0.0152
               Balanced Accuracy Ratio                 1.0169
               Disparate Impact Ratio                  0.9360
               Equal Odds Difference                  -0.0406
               Equal Odds Ratio                        0.7111
               Positive Predictive Parity Difference   0.0199
               Positive Predictive Parity Ratio        1.0220
               Statistical Parity Difference          -0.0387
Data Metrics   Prevalence of Privileged Class (%)     77.0000



In [18]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | -0.0294 |
| Group Fairness | Balanced Accuracy Difference          | 0.0152  |
| Group Fairness | Balanced Accuracy Ratio               | 1.0169  |
| Group Fairness | Disparate Impact Ratio                | 0.936   |
| Group Fairness | Equal Odds Difference                 | -0.0406 |
| Group Fairness | Equal Odds Ratio                      | 0.7111  |
| Group Fairness | Positive Predictive Parity Difference | 0.0199  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.022   |
| Group Fairness | Statistical Parity Difference         | -0.0387 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 77.0    |


## Fairness Evaluation – Decision Tree by Gender
---

### 1. Group Fairness Metrics

- **AUC Difference (-0.0294):** Females have slightly lower AUC compared to males, but the difference is modest.  
- **Balanced Accuracy Difference (0.0152)** and **Ratio (1.0169):** Balanced accuracy is very similar across groups, with a minor advantage for females.  
- **Disparate Impact Ratio (0.9360):** Within both the commonly cited range (0.8–1.25) and the custom bound (0.9–1.1); females are selected at a somewhat lower rate than males.  
- **Equal Odds Difference (-0.0406)** and **Equal Odds Ratio (0.7111):** Both values match the FPR entries in the stratified table: the female FPR (0.1000) is lower than the male FPR (0.1406), so this error-rate gap works against males.  
- **Positive Predictive Parity Difference (0.0199)** and **Ratio (1.0220):** Predictions for females are slightly more reliable (better precision).  
- **Statistical Parity Difference (-0.0387):** Indicates a small disadvantage for females in overall selection rates.  

---

### 2. Interpretation
- The Decision Tree model shows **mixed fairness outcomes**:  
  - **Advantages for females**: Slightly better balanced accuracy and predictive parity, and a lower false positive rate.  
  - **Disadvantages for females**: Lower AUC and a slightly lower selection rate.  
- The **Equal Odds Ratio (0.7111)** equals the FPR ratio in the stratified table: the female FPR is about 71% of the male FPR, so males receive proportionally more false alarms.  

---

### **Summary**
Overall, the Decision Tree shows **small gender disparities**.  
Females have marginally better precision, balanced accuracy, and false positive rate, while males have a slightly higher AUC and selection rate. Only the AUC Difference and, marginally, the Equal Odds Difference fall outside the custom bounds, so the imbalance is far less pronounced than for KNN.
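
The Equal Odds values in the summaries coincide with either the TPR or the FPR entry of the stratified tables. A sketch of one definition consistent with that pattern follows: take whichever of the two gaps is furthest from parity. This definition is an assumption inferred from the reported numbers, the function name is hypothetical, and the inputs are the rounded per-group rates from the stratified performance tables.

```python
def equal_odds(tpr_unpriv, fpr_unpriv, tpr_priv, fpr_priv):
    """Return (difference, ratio), each taken from the TPR or FPR gap
    that is furthest from parity (0 for differences, 1 for ratios)."""
    diffs = [tpr_unpriv - tpr_priv, fpr_unpriv - fpr_priv]
    ratios = [tpr_unpriv / tpr_priv, fpr_unpriv / fpr_priv]
    eo_diff = max(diffs, key=abs)
    eo_ratio = max(ratios, key=lambda r: abs(r - 1.0))
    return eo_diff, eo_ratio


# Female = unprivileged, male = privileged.
dt_diff, dt_ratio = equal_odds(0.9231, 0.1000, 0.9333, 0.1406)
knn_diff, knn_ratio = equal_odds(0.7692, 0.1000, 0.9444, 0.0156)

print(f"Decision Tree: difference={dt_diff:.4f}, ratio={dt_ratio:.4f}")
print(f"KNN:           difference={knn_diff:.4f}, ratio={knn_ratio:.4f}")
```

For the Decision Tree both values come from the FPR gap. For KNN the difference comes from the TPR gap and the ratio from the FPR gap, which is why a single Equal Odds figure should be read alongside the per-group rates.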

---


In [19]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)


FairMLHealth Stratified Bias Table - DT


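|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | -0.0152                      | 0.9833                  | 0.0406   | 1.4062    | -0.0199  | 0.9785    | 0.0387         | 1.0684          | 0.0103   | 1.0111    |
| 1 | gender       | 1             | 0.0152                       | 1.0169                  | -0.0406  | 0.7111    | 0.0199   | 1.022     | -0.0387        | 0.936           | -0.0103  | 0.989     |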


## Stratified Bias Analysis – Decision Tree by Gender

This table presents **group-specific fairness metrics** for the Decision Tree model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

Each row uses its own feature value as the reference group. The `gender = 1` row compares females against males and reproduces the `measure.summary` values (computed with `priv_grp=1`); the `gender = 0` row compares males against females. This reading is inferred from the agreement with the summary table and with the per-group rates in the stratified performance table.

---

### 1. Balanced Accuracy
- **Row `gender = 0` (males relative to females):** Balanced Accuracy Difference = −0.0152, Ratio = 0.9833  
- **Row `gender = 1` (females relative to males):** Balanced Accuracy Difference = +0.0152, Ratio = 1.0169  
- ➝ Balanced accuracy is very similar across genders, with a **slight advantage for females**.

---

### 2. False Positive Rate (FPR)
- **Row `gender = 0` (males relative to females):** FPR Diff = +0.0406, Ratio = 1.4062  
- **Row `gender = 1` (females relative to males):** FPR Diff = −0.0406, Ratio = 0.7111  
- ➝ Males experience a **higher false positive rate** (0.1406 vs. 0.1000), meaning they are more likely to be incorrectly flagged with CVD.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row `gender = 0` (males relative to females):** PPV Diff = −0.0199, Ratio = 0.9785  
- **Row `gender = 1` (females relative to males):** PPV Diff = +0.0199, Ratio = 1.0220  
- ➝ Females have **slightly higher precision**, with more reliable positive predictions.

---

### 4. Selection Rate
- **Row `gender = 0` (males relative to females):** Selection Diff = +0.0387, Ratio = 1.0684  
- **Row `gender = 1` (females relative to males):** Selection Diff = −0.0387, Ratio = 0.9360  
- ➝ Males are **predicted positive slightly more often** than females.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row `gender = 0` (males relative to females):** TPR Diff = +0.0103, Ratio = 1.0111  
- **Row `gender = 1` (females relative to males):** TPR Diff = −0.0103, Ratio = 0.9890  
- ➝ Sensitivity is nearly equal, with **males having a marginal advantage**.

---

### **Summary**
- **Females (unprivileged)**: Slightly higher balanced accuracy and precision and a **lower false positive rate**, but a marginally lower selection rate and sensitivity.  
- **Males (privileged)**: Marginally higher sensitivity and selection rate, but **more false positives and lower precision**.  
- The disparities are **small**: the Decision Tree shows a **mild imbalance** in which males face more false alarms, while sensitivity is almost identical across groups.  

Overall, fairness concerns exist but are **less severe** than those observed in KNN, suggesting the Decision Tree is comparatively more balanced across gender groups.
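
The selection-based measures can be cross-checked from the share of positive predictions per group. The sketch below assumes that the `Mean Prediction` column of the stratified performance tables is that share (the mean of the binary predictions); the function name and the four-fifths check are illustrative and not part of fairmlhealth.

```python
def selection_metrics(sel_unpriv, sel_priv):
    """Statistical parity difference and disparate impact ratio
    from per-group selection rates (unprivileged vs. privileged)."""
    return sel_unpriv - sel_priv, sel_unpriv / sel_priv


# Mean Prediction per group: female (unprivileged), male (privileged).
dt_spd, dt_di = selection_metrics(0.5652, 0.6039)
knn_spd, knn_di = selection_metrics(0.4783, 0.5584)

for name, spd, di in [("Decision Tree", dt_spd, dt_di), ("KNN", knn_spd, knn_di)]:
    within_four_fifths = 0.8 <= di <= 1.25
    within_custom = 0.9 <= di <= 1.1
    print(f"{name}: SPD={spd:.4f}, DI={di:.4f}, "
          f"four-fifths ok={within_four_fifths}, custom ok={within_custom}")
```

Both models pass the four-fifths range, but only the Decision Tree stays within the custom 0.9 to 1.1 bound.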

---

In [20]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


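|   | Feature Name | Feature Value | Obs.  | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|-------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 200.0 | 0.58        | 0.595           | 0.905    | 0.9191   | 0.131  | —      | 0.9076    | 0.9258  | 0.931  |
| 1 | gender       | 0             | 46.0  | 0.5652      | 0.5652          | 0.913    | 0.9231   | 0.1    | —      | 0.9231    | 0.9058  | 0.9231 |
| 2 | gender       | 1             | 154.0 | 0.5844      | 0.6039          | 0.9026   | 0.918    | 0.1406 | —      | 0.9032    | 0.9352  | 0.9333 |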


## Stratified Performance Analysis – Decision Tree by Gender

This table shows the **stratified performance metrics** of the Decision Tree model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9050)** and **F1-score (0.9191)** indicate strong overall classification performance.  
- **ROC AUC (0.9258)** demonstrates good discriminatory ability.  
- **Precision (0.9076)** and **TPR (0.9310)** show that the model balances predictive reliability and sensitivity well.  
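- **Note**: PR AUC is reported as missing (shown as a dash) for every row, including the full test set of 200 observations, so subgroup size alone does not explain it; the cause is not established here.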

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9130     | 0.9026   | Accuracy is slightly higher for females. |
| **F1-Score**  | 0.9231     | 0.9180   | Performance is nearly balanced, with a small edge for females. |
| **FPR**       | 0.1000     | 0.1406   | Females face fewer false positives compared to males. |
| **Precision** | 0.9231     | 0.9032   | Predictions are more reliable for females. |
| **ROC AUC**   | 0.9058     | 0.9352   | Males benefit from stronger ranking ability. |
| **TPR**       | 0.9231     | 0.9333   | Sensitivity is slightly higher for males. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Show marginally higher accuracy, F1, and precision compared to males.  
  - Experience a **lower false positive rate (10% vs. 14.06%)**, which reduces unnecessary false alarms.  
  - Slightly weaker ROC AUC, suggesting less effective ranking compared to males.  

- **Males (privileged)**:  
  - Benefit from higher ROC AUC and slightly higher sensitivity (TPR).  
  - However, they experience a **higher false positive rate** and somewhat weaker precision.  
  - Overall, their predictions are still reliable, but less balanced compared to females.  

---

### **Summary**
The Decision Tree model shows **relatively balanced performance across genders**, with **females enjoying advantages in precision, accuracy, and lower false positives**, while **males benefit from stronger ROC AUC and slightly higher sensitivity**.  
The disparities are modest and indicate that the Decision Tree is more equitable than models such as KNN, though small trade-offs remain between sensitivity and specificity across groups.
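
Because the female subgroup has only 46 observations, it can help to translate the rates back into approximate counts. The sketch below reconstructs confusion-matrix counts from `Obs.`, `Mean Target`, TPR, and FPR by rounding. The helper `confusion_counts` is hypothetical, and the counts are estimates derived from the rounded table values, not figures reported by the notebook.

```python
def confusion_counts(n_obs, prevalence, tpr, fpr):
    """Estimate TP/FN/FP/TN from group size, prevalence, TPR, and FPR."""
    positives = round(n_obs * prevalence)
    negatives = n_obs - positives
    tp = round(positives * tpr)
    fp = round(negatives * fpr)
    return {"TP": tp, "FN": positives - tp, "FP": fp, "TN": negatives - fp}


# Decision Tree rows of the stratified performance table.
dt_female = confusion_counts(n_obs=46, prevalence=0.5652, tpr=0.9231, fpr=0.1000)
dt_male = confusion_counts(n_obs=154, prevalence=0.5844, tpr=0.9333, fpr=0.1406)

for name, c in [("female", dt_female), ("male", dt_male)]:
    accuracy = (c["TP"] + c["TN"]) / sum(c.values())
    print(f"{name}: {c}, accuracy={accuracy:.4f}")
```

Under this reconstruction the female FPR of 0.1000 corresponds to 2 false positives among 20 negatives, so a single additional false positive would raise it to 0.15. Gaps of a few percentage points between groups should therefore be read with the subgroup size in mind.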

---

In [21]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.9231
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9333
  False Positive Rate (FPR): 0.1406
----------------------------------------


### Group-Specific Error Analysis – Decision Tree

This section analyzes the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 0.9231  | 0.1000  |
| Privileged (male = 1)        | 0.9333  | 0.1406  |

#### Interpretation

- **True Positive Rate (TPR)** is **high and very similar** across genders:  
  - Females (unprivileged): **92.31%**  
  - Males (privileged): **93.33%**  
  - This shows the model is nearly equally effective in detecting true CVD cases for both groups.  

- **False Positive Rate (FPR)** differs more noticeably:  
  - Females: **10.00%**  
  - Males: **14.06%**  
  - Males are therefore **more frequently misclassified as having CVD**, indicating a disadvantage for the privileged group in terms of specificity.  

#### Implications

- The **Decision Tree achieves balanced sensitivity (TPR)** across genders.  
- However, the **higher FPR for males** suggests a trade-off: while both groups benefit from strong detection, males face more false alarms.  
- These subgroup disparities align with fairness metrics such as the **Equal Odds Difference** and **Equal Odds Ratio**, which capture uneven error distributions.  

#### Summary

Overall, the Decision Tree model delivers **balanced sensitivity across genders**, but the **elevated false positive rate for males** introduces a fairness concern. In practice, this means men may face a greater burden of unnecessary follow-ups, while women benefit from slightly better specificity.

---

### Ensemble Model - Random Forest - RF

In [22]:
rf_df = pd.read_csv("MendeleyData_75M25F_RF_tuned_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred_rf_tuned    y_prob
0       0       0                0  0.344558
1       1       0                0  0.010588
2       1       1                1  0.902502
3       1       1                1  0.937066
4       1       0                0  0.000000


In [23]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred_rf_tuned"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender_rf as the protected attribute (0/1 as in the CSV)
protected_attr_rf = gender_rf

In [24]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---



                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0104
               Balanced Accuracy Difference           -0.0066
               Balanced Accuracy Ratio                 0.9931
               Disparate Impact Ratio                  1.0775
               Equal Odds Difference                   0.0688
               Equal Odds Ratio                        3.2000
               Positive Predictive Parity Difference  -0.0484
               Positive Predictive Parity Ratio        0.9504
               Statistical Parity Difference           0.0438
Data Metrics   Prevalence of Privileged Class (%)     77.0000




In [25]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric | Measure | Value |
|----------------|---------------------------------------|--------|
| Group Fairness | AUC Difference | 0.0104 |
| Group Fairness | Balanced Accuracy Difference | -0.0066 |
| Group Fairness | Balanced Accuracy Ratio | 0.9931 |
| Group Fairness | Disparate Impact Ratio | 1.0775 |
| Group Fairness | Equal Odds Difference | 0.0688 |
| Group Fairness | Equal Odds Ratio | 3.2 |
| Group Fairness | Positive Predictive Parity Difference | -0.0484 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9504 |
| Group Fairness | Statistical Parity Difference | 0.0438 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Evaluation – Random Forest by Gender

The table summarizes fairness metrics for the Random Forest model, with gender as the protected attribute  
(**0 = Female / unprivileged, 1 = Male / privileged**).

---

### 1. Group Fairness Metrics

- **AUC Difference (0.0104):** Very small gap in overall ranking performance (slightly favoring females).  
- **Balanced Accuracy Difference (−0.0066)** and **Ratio (0.9931):** Balanced accuracy is nearly equal across genders, showing minimal disparity.  
- **Disparate Impact Ratio (1.0775):** Within the generally accepted range (0.8–1.25), indicating relatively fair selection rates between groups.  
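- **Equal Odds Difference (0.0688)** and **Equal Odds Ratio (3.2000):** The largest disparity in the table. Both values coincide with the FPR difference and FPR ratio in the stratified tables below (female FPR = 0.1000 vs. male FPR = 0.0312), so the imbalance lies mainly in false positives.  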
- **Positive Predictive Parity Difference (−0.0484)** and **Ratio (0.9504):** Males have slightly higher precision (positive predictions are more reliable for them).  
- **Statistical Parity Difference (0.0438):** Suggests a modest advantage for females in selection rates.  
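
The two selection-rate metrics in this list can be reproduced from the per-group mean predictions (the share of positive predictions) reported in the stratified performance table further below: 0.6087 for females and 0.5649 for males. The sketch uses the standard definitions (unprivileged divided by, or minus, privileged). The function names are illustrative and are not fairmlhealth API.

```python
def disparate_impact_ratio(rate_unpriv, rate_priv):
    """Selection rate of the unprivileged group divided by that of the privileged group."""
    return rate_unpriv / rate_priv


def statistical_parity_difference(rate_unpriv, rate_priv):
    """Selection rate of the unprivileged group minus that of the privileged group."""
    return rate_unpriv - rate_priv


# Mean Prediction values for the Random Forest, as reported (rounded to 4 digits)
female_rate, male_rate = 0.6087, 0.5649

di = disparate_impact_ratio(female_rate, male_rate)
spd = statistical_parity_difference(female_rate, male_rate)
print(round(di, 4), round(spd, 4))  # 1.0775 0.0438
```

Both results agree with the Disparate Impact Ratio and Statistical Parity Difference printed by `measure.summary`.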

---

### 2. Interpretation

- The Random Forest model is **largely fair across most metrics**: AUC, balanced accuracy, and disparate impact show only minor differences.  
- However, **equal odds metrics reveal a stronger disparity**, meaning error distributions (sensitivity and false positive rates) differ more noticeably between genders.  
- Precision is slightly higher for males, while females appear more frequently selected.  

---

### **Summary**
The Random Forest model achieves **mostly balanced performance across genders**, but fairness concerns emerge in the **equal odds metrics**: the Equal Odds Ratio of 3.2000 is far from the ideal value of 1. The stratified performance table further below shows where this comes from: the false positive rate is 0.1000 for females versus 0.0312 for males, while sensitivity is 1.0000 versus 0.9444. Overall, disparities are **moderate**, highlighting the need to monitor and potentially mitigate bias in the error distribution.
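
To see which error rate drives the equal odds figures, the TPR and FPR gaps can be computed separately. The sketch below assumes that the reported Equal Odds Difference is whichever of the two gaps has the larger magnitude, and that the Equal Odds Ratio is whichever of the two ratios lies further from 1. That convention is an assumption made for illustration and should be checked against the fairmlhealth documentation. The rates are the rounded per-group values reported later in this notebook.

```python
def equal_odds_components(tpr_unpriv, fpr_unpriv, tpr_priv, fpr_priv):
    """Return TPR/FPR differences and ratios (unprivileged relative to privileged)."""
    return {
        "tpr_diff": tpr_unpriv - tpr_priv,
        "fpr_diff": fpr_unpriv - fpr_priv,
        "tpr_ratio": tpr_unpriv / tpr_priv,
        "fpr_ratio": fpr_unpriv / fpr_priv,
    }


# Random Forest rates as reported: female TPR/FPR, then male TPR/FPR
parts = equal_odds_components(1.0000, 0.1000, 0.9444, 0.0312)

# Assumed convention: report the component with the largest deviation
eo_diff = max(parts["tpr_diff"], parts["fpr_diff"], key=abs)
eo_ratio = max(parts["tpr_ratio"], parts["fpr_ratio"], key=lambda r: abs(r - 1))

print({name: round(value, 4) for name, value in parts.items()})
print(round(eo_diff, 4), round(eo_ratio, 2))
```

Under this assumption the FPR component is selected in both cases. Because the inputs are rounded to four digits, the ratio comes out close to, but not exactly at, the reported 3.2000.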

---

In [26]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)



FairMLHealth Stratified Bias Table - RF


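|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender | 0 | 0.0066 | 1.0069 | -0.0688 | 0.3125 | 0.0484 | 1.0522 | -0.0438 | 0.9281 | -0.0556 | 0.9444 |
| 1 | gender | 1 | -0.0066 | 0.9931 | 0.0688 | 3.2 | -0.0484 | 0.9504 | 0.0438 | 1.0775 | 0.0556 | 1.0588 |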


## Stratified Bias Analysis – Random Forest by Gender

This table presents **group-specific fairness metrics** for the Random Forest model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

**Reading direction:** cross-checking against the stratified performance table further below (female FPR = 0.1000, male FPR = 0.0312; female TPR = 1.0000, male TPR = 0.9444) shows that each row reports the *other* group relative to the listed group, i.e. difference = other − listed and ratio = other / listed. The male row (1) therefore reads as female − male and matches the signs of `measure.summary`; a positive value in the male row means the metric is higher for females.

---

### 1. Balanced Accuracy
- **Females (0):** Difference = +0.0066, Ratio = 1.0069  
- **Males (1):** Difference = −0.0066, Ratio = 0.9931  
- ➝ Balanced accuracy is nearly equal, with a very slight advantage for males.

---

### 2. False Positive Rate (FPR)
- **Females (0):** Diff = −0.0688, Ratio = 0.3125  
- **Males (1):** Diff = +0.0688, Ratio = 3.2000  
- ➝ Females have a **markedly higher false positive rate** (0.1000 vs. 0.0312, about 3.2 times the male rate), so they are more likely to be incorrectly flagged as positive.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Females (0):** Diff = +0.0484, Ratio = 1.0522  
- **Males (1):** Diff = −0.0484, Ratio = 0.9504  
- ➝ Predictions are **more reliable for males** (precision 0.9770 vs. 0.9286 for females).

---

### 4. Selection Rate
- **Females (0):** Diff = −0.0438, Ratio = 0.9281  
- **Males (1):** Diff = +0.0438, Ratio = 1.0775  
- ➝ Females are **selected slightly more often** (mean prediction 0.6087 vs. 0.5649 for males).

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Females (0):** Diff = −0.0556, Ratio = 0.9444  
- **Males (1):** Diff = +0.0556, Ratio = 1.0588  
- ➝ Females have **higher sensitivity** (1.0000 vs. 0.9444): no female CVD case in the test set is missed, whereas some male cases are.

---

### **Summary**
- **Females (unprivileged)**: Benefit from **higher sensitivity** and somewhat more frequent selection, but face **more false positives** and lower precision.  
- **Males (privileged)**: Benefit from **fewer false positives and higher precision**, but have slightly lower sensitivity.  
- The Random Forest model shows a **trade-off in fairness**: females are flagged more often, both correctly and incorrectly, while males receive fewer false alarms but risk missed detections.  

Overall, disparities are moderate but noticeable, consistent with the **Equal Odds metrics** reported earlier, confirming uneven error distributions across gender.
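
Because the sign of each row depends on which group is treated as the reference, it can help to recompute the per-group rates directly and form the differences in both directions. The following is a minimal sketch on toy arrays (not the study data); `group_rates` is a hypothetical helper.

```python
import numpy as np


def group_rates(y_true, y_pred, group, value):
    """TPR, FPR, PPV and selection rate for the rows where group == value.

    Assumes the subgroup contains positives, negatives and at least one
    positive prediction; otherwise a division by zero occurs.
    """
    mask = np.asarray(group) == value
    yt = np.asarray(y_true)[mask]
    yp = np.asarray(y_pred)[mask]
    tp = int(np.sum((yt == 1) & (yp == 1)))
    fp = int(np.sum((yt == 0) & (yp == 1)))
    fn = int(np.sum((yt == 1) & (yp == 0)))
    tn = int(np.sum((yt == 0) & (yp == 0)))
    return {
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),
        "Selection": (tp + fp) / len(yt),
    }


# Toy data: group 0 has one false positive, group 1 has one missed case
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

r0 = group_rates(y_true, y_pred, group, 0)
r1 = group_rates(y_true, y_pred, group, 1)

for metric in r0:
    # A positive "0 minus 1" value means the metric is higher for group 0
    print(metric, "0 minus 1:", round(r0[metric] - r1[metric], 4),
          "| 1 minus 0:", round(r1[metric] - r0[metric], 4))
```

Comparing these signs with a row of the stratified bias table shows which direction that row uses.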

In [27]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


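|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.575 | 0.955 | 0.961 | 0.0476 | — | 0.9652 | 0.9881 | 0.9569 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.6087 | 0.9565 | 0.963 | 0.1 | — | 0.9286 | 0.9962 | 1.0 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.5649 | 0.9545 | 0.9605 | 0.0312 | — | 0.977 | 0.9858 | 0.9444 |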


## Stratified Performance Analysis – Random Forest by Gender

This table presents the **stratified performance metrics** of the Random Forest model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9550)** and **F1-score (0.9610)** indicate excellent overall performance.  
- **ROC AUC (0.9881)** demonstrates very strong discriminatory ability.  
- **Precision (0.9652)** and **TPR (0.9569)** show that the model achieves a good balance between predictive reliability and sensitivity.  
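- **Note**: **PR AUC is shown as a dash in every row**, including the overall row (n = 200), because the value was `NaN` and the cell replaces `NaN` with a dash. Small subgroup size alone therefore does not explain the missing value; its cause is not established here (the cell also emitted a "Possible error in column(s)" warning). The other metrics in the table do not depend on PR AUC.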
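
Since PR AUC is missing from the table, it can be computed separately from `y_true` and `y_prob`. The sketch below implements average precision, a common step-wise summary of the precision-recall curve, in plain NumPy. It assumes binary 0/1 labels and no tied scores, and `average_precision` is a hypothetical helper (scikit-learn provides `sklearn.metrics.average_precision_score` for the same quantity). The arrays are toy values, not the study data.

```python
import numpy as np


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(y_score, dtype=float))  # highest score first
    hits = y_true[order] == 1
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


toy_true = [1, 0, 1, 1, 0]
toy_score = [0.9, 0.8, 0.7, 0.4, 0.2]
print(round(average_precision(toy_true, toy_score), 4))  # 0.8056
```

In the notebook, the female subgroup could be evaluated with `average_precision(y_true_rf[gender_rf == 0], y_prob_rf[gender_rf == 0])`, keeping in mind that tied probabilities would need the library implementation.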

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9565     | 0.9545   | Accuracy is almost identical across genders, with a slight edge for females. |
| **F1-Score**  | 0.9630     | 0.9605   | Females perform marginally better. |
| **FPR**       | 0.1000     | 0.0312   | Males face fewer false positives, while females are more often incorrectly flagged. |
| **Precision** | 0.9286     | 0.9770   | Predictions are more reliable for males. |
| **ROC AUC**   | 0.9962     | 0.9858   | Both groups achieve excellent discrimination, with females slightly ahead. |
| **TPR**       | 1.0000     | 0.9444   | Females are perfectly identified when they have CVD, while males have slightly lower sensitivity. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Achieve **perfect sensitivity (TPR = 1.0000)**, meaning no missed CVD cases.  
  - Benefit from slightly higher accuracy, F1, and ROC AUC.  
  - However, they experience a **higher false positive rate (10% vs. 3.1%)**, resulting in more false alarms.  
  - Precision (0.9286) is lower, so positive predictions for females are less reliable than for males.  

- **Males (privileged)**:  
  - Benefit from **lower false positive rates** and **higher precision**, making their predictions more trustworthy.  
  - Sensitivity is slightly weaker than for females, meaning some true cases are missed.  
  - Despite minor trade-offs, performance remains very strong across all metrics.  

---

### **Summary**
The Random Forest model performs **very well for both genders**, with only modest disparities.  
- **Females** enjoy stronger sensitivity and slightly better overall accuracy and AUC.  
- **Males** benefit from more reliable predictions and fewer false alarms.  
These results suggest a **trade-off rather than a clear systematic bias**: females are more likely to be over-diagnosed (higher FPR, lower precision), while males are more likely to be under-diagnosed (lower TPR).
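
The balanced accuracy figures in the fairness summary follow directly from the TPR and FPR columns of this table, since balanced accuracy is the mean of sensitivity and specificity. A short check using the rounded values reported above (the helper name is illustrative):

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# Random Forest rates as reported in the stratified performance table
ba_female = balanced_accuracy(tpr=1.0000, fpr=0.1000)
ba_male = balanced_accuracy(tpr=0.9444, fpr=0.0312)

print(round(ba_female, 4), round(ba_male, 4))  # 0.95 0.9566
print(round(ba_female - ba_male, 4))           # -0.0066
print(round(ba_female / ba_male, 4))           # 0.9931
```

The last two values match the Balanced Accuracy Difference (−0.0066) and Ratio (0.9931) reported by `measure.summary`.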

---

In [28]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 1.0000
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9444
  False Positive Rate (FPR): 0.0312
----------------------------------------


### Group-Specific Error Analysis – Random Forest

This section presents the performance of the Random Forest model across gender groups, focusing on **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 1.0000  | 0.1000  |
| Privileged (male = 1)        | 0.9444  | 0.0312  |

#### Interpretation

- **True Positive Rate (TPR)** is **perfect for females (100%)**, compared to **94.44% for males**.  
  - This means the model correctly identifies all positive CVD cases among females, but misses a small proportion of cases among males.  

- **False Positive Rate (FPR)** is **higher for females (10.00%)** than for males (3.12%).  
  - This indicates that females are more likely to receive **false alarms**, being incorrectly flagged as having CVD.  

#### Implications

- The model shows a **gender-based trade-off**:  
  - **Females (unprivileged)**: enjoy higher sensitivity (no missed cases), but at the cost of more false positives.  
  - **Males (privileged)**: benefit from better specificity (fewer false positives), but experience slightly reduced sensitivity.  

- This asymmetry highlights that the Random Forest does not consistently favor one group, but rather distributes errors differently:  
  - **Females are over-diagnosed** (more false positives).  
  - **Males are under-diagnosed** (slightly more missed true cases).  

- Depending on the clinical use case, these imbalances could have different consequences: females may face unnecessary follow-ups, while males risk missed diagnoses. Both raise fairness considerations.

---

### Deep Learning Model - Feed Forward Network (MLP)

In [29]:
mlp_df = pd.read_csv("MendeleyData_75M25F_MLP_lbfgs_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred        y_prob
0       0       0       0  2.048477e-13
1       1       0       0  1.560384e-15
2       1       1       1  1.000000e+00
3       1       1       1  1.000000e+00
4       1       0       0  3.068694e-17


In [30]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 

In [31]:
#Run fairmlhealth bias detection for MLP 

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(mlp_bias)



                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0560
               Balanced Accuracy Difference           -0.0308
               Balanced Accuracy Ratio                 0.9667
               Disparate Impact Ratio                  0.9000
               Equal Odds Difference                  -0.0709
               Equal Odds Ratio                        0.9143
               Positive Predictive Parity Difference  -0.0047
               Positive Predictive Parity Ratio        0.9949
               Statistical Parity Difference          -0.0604
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [32]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

| Metric | Measure | Value |
|----------------|---------------------------------------|--------|
| Group Fairness | AUC Difference | -0.056 |
| Group Fairness | Balanced Accuracy Difference | -0.0308 |
| Group Fairness | Balanced Accuracy Ratio | 0.9667 |
| Group Fairness | Disparate Impact Ratio | 0.9 |
| Group Fairness | Equal Odds Difference | -0.0709 |
| Group Fairness | Equal Odds Ratio | 0.9143 |
| Group Fairness | Positive Predictive Parity Difference | -0.0047 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9949 |
| Group Fairness | Statistical Parity Difference | -0.0604 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Evaluation – MLP by Gender

---

### 1. Group Fairness Metrics

- **AUC Difference (−0.0560):** Females have lower AUC than males, reflecting weaker ranking performance in distinguishing positive from negative cases.  
- **Balanced Accuracy Difference (−0.0308)** and **Ratio (0.9667):** Balanced accuracy is slightly lower for females, suggesting modest disparities in classification accuracy across groups.  
- **Disparate Impact Ratio (0.9000):** Within the generally accepted range (0.8–1.25), though below 1, indicating that females are selected at somewhat lower rates than males.  
- **Equal Odds Difference (−0.0709)** and **Ratio (0.9143):** Reveal imbalances in error rates (TPR/FPR), with females experiencing less favorable outcomes compared to males.  
- **Positive Predictive Parity Difference (−0.0047)** and **Ratio (0.9949):** Precision is nearly identical across genders, meaning predictive reliability is balanced.  
- **Statistical Parity Difference (−0.0604):** Indicates a small disadvantage for females in overall selection rates.  
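
A quick way to see which ratio metrics fall outside the 0.8–1.25 band mentioned above is to test them programmatically. The sketch applies that band to all four ratio metrics of the MLP summary, which is an assumption made for illustration; fairmlhealth's own flagging may use different bounds, and `within_band` is a hypothetical helper.

```python
def within_band(value, lower=0.8, upper=1.25):
    """True if a ratio metric lies inside the closed interval [lower, upper]."""
    return lower <= value <= upper


# Ratio metrics from the MLP summary table above
mlp_ratios = {
    "Balanced Accuracy Ratio": 0.9667,
    "Disparate Impact Ratio": 0.9000,
    "Equal Odds Ratio": 0.9143,
    "Positive Predictive Parity Ratio": 0.9949,
}

flags = {name: within_band(value) for name, value in mlp_ratios.items()}
print(flags)             # every entry is True
print(within_band(3.2))  # False: the Random Forest Equal Odds Ratio
```

By this check, all MLP ratio metrics stay inside the band, whereas the Random Forest Equal Odds Ratio of 3.2000 does not.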

---

### 2. Interpretation

- The MLP model shows **modest but consistent disparities**, with males (privileged group) benefiting from slightly better AUC, balanced accuracy, and error distributions.  
- Females (unprivileged group) face disadvantages in ranking ability, balanced accuracy, and selection rates, though differences remain relatively small.  
- On the positive side, **predictive precision is nearly equal** across genders, reducing concerns about reliability of positive predictions.  

---

### **Summary**
The MLP model demonstrates **generally balanced fairness**, but with small systematic disadvantages for females in terms of accuracy, AUC, and error rate distributions.  
While the disparities are not extreme, they highlight areas where bias mitigation could further improve gender equity in predictions.

---

In [33]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP




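|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender | 0 | 0.0308 | 1.0345 | 0.0094 | 1.0938 | 0.0047 | 1.0051 | 0.0604 | 1.1112 | 0.0709 | 1.0802 |
| 1 | gender | 1 | -0.0308 | 0.9667 | -0.0094 | 0.9143 | -0.0047 | 0.9949 | -0.0604 | 0.9 | -0.0709 | 0.9258 |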


## Stratified Bias Analysis – MLP by Gender

This table presents **group-specific fairness metrics** for the MLP model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

**Reading direction:** cross-checking against the stratified performance table further below (female TPR = 0.8846, male TPR = 0.9556; female FPR = 0.1000, male FPR = 0.1094) shows that each row reports the *other* group relative to the listed group, i.e. difference = other − listed and ratio = other / listed. The male row (1) therefore reads as female − male and matches the signs of `measure.summary`; a negative value in the male row means the metric is lower for females.

---

### 1. Balanced Accuracy
- **Females (0):** Difference = +0.0308, Ratio = 1.0345  
- **Males (1):** Difference = −0.0308, Ratio = 0.9667  
- ➝ Males have slightly higher balanced accuracy than females.

---

### 2. False Positive Rate (FPR)
- **Females (0):** Diff = +0.0094, Ratio = 1.0938  
- **Males (1):** Diff = −0.0094, Ratio = 0.9143  
- ➝ Females have a marginally lower false positive rate (0.1000 vs. 0.1094), so males receive slightly more false alarms.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Females (0):** Diff = +0.0047, Ratio = 1.0051  
- **Males (1):** Diff = −0.0047, Ratio = 0.9949  
- ➝ Predictions are slightly more reliable for males (precision 0.9247 vs. 0.9200), though the difference is negligible.

---

### 4. Selection Rate
- **Females (0):** Diff = +0.0604, Ratio = 1.1112  
- **Males (1):** Diff = −0.0604, Ratio = 0.9000  
- ➝ Males are selected more often (mean prediction 0.6039 vs. 0.5435), whereas females are selected less often.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Females (0):** Diff = +0.0709, Ratio = 1.0802  
- **Males (1):** Diff = −0.0709, Ratio = 0.9258  
- ➝ Males have higher sensitivity (0.9556 vs. 0.8846), meaning more of their true cases are correctly detected compared to females.

---

### **Summary**
- **Females (unprivileged):** Benefit from a marginally lower false positive rate, but are disadvantaged by lower balanced accuracy, reduced sensitivity, lower selection rates, and slightly lower precision.  
- **Males (privileged):** Benefit from slightly higher balanced accuracy, precision, selection rates, and sensitivity, but face a marginally higher false positive rate.  

Overall, the disparities are **small**, but the MLP model shows a mild tendency to **favor males** in terms of sensitivity and overall detection, while females benefit slightly from fewer false positives. This trade-off indicates relatively balanced performance with only modest fairness concerns.

---

In [34]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


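|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.59 | 0.92 | 0.9316 | 0.1071 | — | 0.9237 | 0.9663 | 0.9397 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.5435 | 0.8913 | 0.902 | 0.1 | — | 0.92 | 0.926 | 0.8846 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.6039 | 0.9286 | 0.9399 | 0.1094 | — | 0.9247 | 0.9819 | 0.9556 |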


## Stratified Performance Analysis – MLP by Gender

This table shows the **stratified performance metrics** of the MLP model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9200)** and **F1-score (0.9316)** indicate strong overall classification performance.  
- **ROC AUC (0.9663)** shows excellent discriminatory ability.  
- **Precision (0.9237)** and **TPR (0.9397)** suggest the model achieves a good balance between predictive reliability and sensitivity.  
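- **Note**: PR AUC is shown as a dash in every row, including the overall row (n = 200), because the value was `NaN` and the cell replaces `NaN` with a dash. Small subgroup size alone therefore does not explain the missing value; its cause is not established here.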

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.8913     | 0.9286   | Accuracy is higher for males. |
| **F1-Score**  | 0.9020     | 0.9399   | The model performs better for males. |
| **FPR**       | 0.1000     | 0.1094   | Females experience slightly fewer false positives. |
| **Precision** | 0.9200     | 0.9247   | Predictions are marginally more reliable for males. |
| **ROC AUC**   | 0.9260     | 0.9819   | Males benefit from stronger ranking performance. |
| **TPR**       | 0.8846     | 0.9556   | Males are more likely to be correctly identified when they have CVD. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Show lower accuracy, F1, and ROC AUC compared to males.  
  - Slightly fewer false positives (FPR = 10.00%), but this comes with reduced sensitivity (TPR = 88.46%), meaning more missed cases.  
  - Predictions are reliable, but less favorable overall than for males.  

- **Males (privileged)**:  
  - Enjoy higher accuracy, F1, and precision, and a noticeably higher ROC AUC (0.9819 vs. 0.9260).  
  - Higher sensitivity (TPR = 95.56%) means fewer missed cases, though they face slightly more false positives (10.94%).  
  - Overall, outcomes for males are more favorable, reflecting stronger model performance.  

---

### **Summary**
The MLP model demonstrates **gender disparities in performance**.  
- **Males** benefit from higher accuracy, stronger recall, and much better ROC AUC, making predictions more favorable for this group.  
- **Females** have slightly fewer false positives but are disadvantaged by lower sensitivity and weaker overall performance.  

These findings suggest the MLP model may be **biased in favor of males**, who receive better results on every reported metric except the false positive rate.

---

In [35]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8846
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9556
  False Positive Rate (FPR): 0.1094
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section breaks down the classification performance of the MLP model across gender groups, using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 0.8846  | 0.1000  |
| Privileged (male = 1)        | 0.9556  | 0.1094  |

#### Interpretation

- **True Positive Rate (TPR)** is higher for males (95.56%) compared to females (88.46%).  
  - This means the model is **better at correctly identifying true positive cases for males**, while females face more missed detections.  

- **False Positive Rate (FPR)** is slightly lower for females (10.00%) than for males (10.94%).  
  - This suggests that females are **less likely to receive false alarms** compared to males.  

#### Implications

- The MLP model shows a **gender-based trade-off**:  
  - **Females (unprivileged)**: face reduced sensitivity (lower TPR), meaning more true cases are missed, but benefit from fewer false positives.  
  - **Males (privileged)**: enjoy stronger sensitivity (higher TPR), but at the cost of a slightly higher false positive rate.  

- These asymmetries are consistent with the fairness metrics (e.g., **Equal Odds Difference = −0.0709** and **Equal Odds Ratio = 0.9143**), which reflect uneven error distributions between groups.  

#### Recommendation

- While the disparities are not extreme, the model tends to **favor males in terms of sensitivity**, while **females benefit from fewer false positives**.  
- Depending on the clinical context, these trade-offs could matter:  
  - **For early detection**, higher sensitivity for males is advantageous.  
  - **For reducing unnecessary interventions**, lower false positives for females are beneficial.  
- This highlights the importance of considering **fairness mitigation strategies** to balance sensitivity and specificity across genders.
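
One way to explore the sensitivity/specificity balance described above is to recompute TPR and FPR per group at several decision thresholds on `y_prob`. The following is a minimal sketch on toy values, not a mitigation method used in this notebook; `rates_at_threshold` is a hypothetical helper and the thresholds are arbitrary.

```python
import numpy as np


def rates_at_threshold(y_true, y_prob, threshold):
    """TPR and FPR when predicting positive for y_prob >= threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tpr = np.sum((y_true == 1) & (y_hat == 1)) / np.sum(y_true == 1)
    fpr = np.sum((y_true == 0) & (y_hat == 1)) / np.sum(y_true == 0)
    return float(tpr), float(fpr)


# Toy labels, probabilities and group codes (0 = female, 1 = male)
toy_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
toy_prob = np.array([0.9, 0.4, 0.6, 0.1, 0.8, 0.7, 0.3, 0.2])
toy_group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

for g in (0, 1):
    mask = toy_group == g
    for thr in (0.35, 0.5, 0.65):
        tpr, fpr = rates_at_threshold(toy_true[mask], toy_prob[mask], thr)
        print(f"group={g} threshold={thr:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")
```

The first rows of `mlp_df` shown earlier contain probabilities very close to 0 or 1; if that holds for the whole test set, moving the threshold would change few predictions for this model.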

---