### Bias Detection and Fairness Evaluation on CVD Prediction (Mendeley Dataset) using FairMLhealth
Source: https://data.mendeley.com/datasets/dzz48mvjht/1

In [1]:
import pandas as pd

# Load X_test set
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
#have a look at the modules that are within fairmlhealth

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
#load necessary modules 

#import module measure to use measure.summary for bias detection
from fairmlhealth import measure

#import module for investigation of individual cohorts 
from fairmlhealth.__utils import iterate_cohorts

#import FairRanges to flag high values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)

pip install 'aif360[AdversarialDebiasing]'
  vect_normalized_discounted_cumulative_gain = vmap(
  monte_carlo_vect_ndcg = vmap(vect_normalized_discounted_cumulative_gain, in_dims=(0,))


In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")

During the execution of FairMLHealth and AIF360, several runtime warnings were raised (e.g., “AdversarialDebiasing will be unavailable” due to the absence of TensorFlow, and deprecation warnings from the inFairness package regarding PyTorch’s functorch.vmap). These warnings do not affect the fairness metrics or results presented in this study, as the unavailable components were not used. To maintain clarity of output, the warnings were silenced programmatically, and the analysis was conducted without issue.
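
As an alternative to the global filters above, the suppression can be scoped to a single call. The following is a minimal sketch using only the standard library; `noisy_call` is a hypothetical stand-in for a library call that emits a `FutureWarning`, not a function from FairMLHealth or AIF360.

```python
import warnings

def noisy_call():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("functorch.vmap is deprecated", FutureWarning)
    return "ok"

def call_quietly(func):
    """Run func with FutureWarning ignored only for the duration of the call."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return func()

# The warning filters are restored as soon as the with-block exits.
print(call_quietly(noisy_call))
```

Scoping the filter this way keeps unrelated warnings visible in the rest of the notebook.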

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [7]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("MendeleyData_50_50_KNN_best_predictions.csv")

print(knn_df.head())

   gender  y_true  y_prob  y_pred
0       0       0     0.0       0
1       1       0     0.0       0
2       1       1     1.0       1
3       1       1     1.0       1
4       1       0     0.0       0


In [8]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["gender"].values

# Use gender_knn as the protected attribute (0/1 as in your CSV)
protected_attr_knn = gender_knn

In [9]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0587
               Balanced Accuracy Difference           -0.0587
               Balanced Accuracy Ratio                 0.9382
               Disparate Impact Ratio                  0.9732
               Equal Odds Difference                   0.0688
               Equal Odds Ratio                        3.2000
               Positive Predictive Parity Difference  -0.0567
               Positive Predictive Parity Ratio        0.9419
               Statistical Parity Difference          -0.0150
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [10]:
# 2) Custom scenario oriented bounds

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)
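
To make the effect of these bounds concrete, the sketch below checks the KNN summary values against a subset of the custom ranges in plain Python. It is an independent illustration, not the logic used by `FairRanges` or `Flagger`; the helper `out_of_range` is hypothetical, and treating the bounds as inclusive is an assumption.

```python
# Metric values copied from the KNN summary table above.
knn_values = {
    "auc difference": -0.0587,
    "balanced accuracy difference": -0.0587,
    "disparate impact ratio": 0.9732,
    "equal odds difference": 0.0688,
    "statistical parity difference": -0.0150,
}

# Subset of the custom ranges defined in the cell above.
custom_ranges = {
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
    "disparate impact ratio": (0.9, 1.1),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
}

def out_of_range(values, ranges):
    """Return the metric names whose value lies outside its (low, high) bounds."""
    flagged = []
    for name, value in values.items():
        low, high = ranges[name]
        if not (low <= value <= high):  # bounds assumed inclusive
            flagged.append(name)
    return flagged

print(out_of_range(knn_values, custom_ranges))
```

Under these bounds, the AUC, balanced accuracy, and equal odds differences fall outside their ranges, while the selection-rate metrics stay inside.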

In [11]:
# Restore Styler.set_precision (removed in pandas 2.0) so the styled, flagged table below can still be rendered
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [12]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric | Measure | Value |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference | -0.0587 |
| Group Fairness | Balanced Accuracy Difference | -0.0587 |
| Group Fairness | Balanced Accuracy Ratio | 0.9382 |
| Group Fairness | Disparate Impact Ratio | 0.9732 |
| Group Fairness | Equal Odds Difference | 0.0688 |
| Group Fairness | Equal Odds Ratio | 3.2 |
| Group Fairness | Positive Predictive Parity Difference | -0.0567 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9419 |
| Group Fairness | Statistical Parity Difference | -0.015 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Evaluation – KNN by Gender

The table summarizes fairness metrics for the KNN model, with gender as the protected attribute  
(**0 = Female / unprivileged, 1 = Male / privileged**).

---

### 1. Group Fairness Metrics

- **AUC Difference (−0.0587):** Females have lower AUC than males, suggesting weaker ability to distinguish between positive and negative cases.  
- **Balanced Accuracy Difference (−0.0587)** and **Ratio (0.9382):** Balanced accuracy is lower for females, indicating reduced classification quality for the unprivileged group.  
- **Disparate Impact Ratio (0.9732):** Close to 1, suggesting selection rates between genders are fairly similar, with only a small disparity.  
- **Equal Odds Difference (0.0688)** and **Ratio (3.2000):** A noticeable imbalance in error rates (TPR/FPR), showing that outcomes differ more substantially between genders.  
- **Positive Predictive Parity Difference (−0.0567)** and **Ratio (0.9419):** Precision is slightly lower for females, meaning their positive predictions are less reliable.  
- **Statistical Parity Difference (−0.0150):** Indicates a very small under-selection of females compared to males.  

---

### 2. Interpretation

- The KNN model shows **moderate gender disparities**:  
  - Females are disadvantaged in terms of AUC, balanced accuracy, and precision, reflecting weaker performance quality.  
  - Equal Odds metrics highlight **notable disparities in error rates**, suggesting uneven treatment of males and females.  
  - Disparate Impact and Statistical Parity remain close to ideal, meaning selection rates are relatively fair despite differences in prediction quality.  

---

### **Summary**
The KNN model appears to be **less fair across genders**, particularly disadvantaging females through lower accuracy, weaker ranking ability, and less reliable positive predictions. While selection rates are relatively balanced, the **error distribution (Equal Odds)** shows a more substantial fairness concern, indicating that this model may reinforce gender bias in predictions.

---
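
The sign convention behind these values can be checked by hand. The sketch below assumes that differences are computed as unprivileged minus privileged (female minus male) and ratios as unprivileged divided by privileged, and it uses the per-group KNN rates from the stratified performance table shown further down in this notebook. The helper names are hypothetical; small deviations from the reported values come from the inputs being rounded to four decimals.

```python
# Per-group KNN rates from the stratified performance table (0 = female, 1 = male).
female = {"selection": 0.5435, "precision": 0.9200, "roc_auc": 0.8923}
male = {"selection": 0.5584, "precision": 0.9767, "roc_auc": 0.9510}

def difference(unpriv, priv, key):
    """Unprivileged minus privileged (assumed convention)."""
    return unpriv[key] - priv[key]

def ratio(unpriv, priv, key):
    """Unprivileged divided by privileged (assumed convention)."""
    return unpriv[key] / priv[key]

summary = {
    "Statistical Parity Difference": difference(female, male, "selection"),
    "Disparate Impact Ratio": ratio(female, male, "selection"),
    "Positive Predictive Parity Difference": difference(female, male, "precision"),
    "Positive Predictive Parity Ratio": ratio(female, male, "precision"),
    "AUC Difference": difference(female, male, "roc_auc"),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Negative differences and ratios below 1 therefore indicate a lower value for females than for males.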

In [13]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


| Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gender | 0 | 0.0587 | 1.0658 | -0.0688 | 0.3125 | 0.0567 | 1.0617 | 0.015 | 1.0275 | 0.0487 | 1.0551 |
| gender | 1 | -0.0587 | 0.9382 | 0.0688 | 3.2 | -0.0567 | 0.9419 | -0.015 | 0.9732 | -0.0487 | 0.9478 |


## Stratified Bias Analysis – KNN by Gender

Each row of this table uses its feature value as the reference group and reports the other group relative to it. The `gender = 1` row reproduces the values from `measure.summary` with `priv_grp=1` (female relative to male), and the `gender = 0` row is its mirror image (male relative to female). The per-group rates in the stratified performance table later in this notebook confirm this reading.

---

### 1. Balanced Accuracy
- **Row gender = 0 (male relative to female):** Difference = +0.0587, Ratio = 1.0658  
- **Row gender = 1 (female relative to male):** Difference = −0.0587, Ratio = 0.9382  
- ➝ Balanced accuracy is lower for females than for males.

---

### 2. False Positive Rate (FPR)
- **Row gender = 0 (male relative to female):** Diff = −0.0688, Ratio = 0.3125  
- **Row gender = 1 (female relative to male):** Diff = +0.0688, Ratio = 3.2000  
- ➝ Females have a **much higher false positive rate** (about 3.2 times that of males) and are more frequently misclassified as having CVD.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row gender = 0 (male relative to female):** Diff = +0.0567, Ratio = 1.0617  
- **Row gender = 1 (female relative to male):** Diff = −0.0567, Ratio = 0.9419  
- ➝ Positive predictions are **less reliable for females**, while males experience higher precision.

---

### 4. Selection Rate
- **Row gender = 0 (male relative to female):** Diff = +0.0150, Ratio = 1.0275  
- **Row gender = 1 (female relative to male):** Diff = −0.0150, Ratio = 0.9732  
- ➝ Females are selected slightly less often than males.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row gender = 0 (male relative to female):** Diff = +0.0487, Ratio = 1.0551  
- **Row gender = 1 (female relative to male):** Diff = −0.0487, Ratio = 0.9478  
- ➝ Females have lower sensitivity, meaning they are less likely to be correctly identified when they have CVD.

---

### **Summary**
- **Females (unprivileged):** Are disadvantaged on every metric in this table, with lower balanced accuracy, a higher false positive rate, lower precision, lower sensitivity, and a slightly lower selection rate.  
- **Males (privileged):** Benefit from fewer false positives, higher precision, and higher sensitivity.  

Overall, the stratified bias table is consistent with the fairness summary: the KNN model is **biased in favor of males**. The largest gap is in the false positive rate, which indicates that the error distribution is uneven and requires consideration for fairness mitigation.

---
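
The orientation of the two rows in the stratified bias table can be verified with a few lines of arithmetic. The sketch below assumes each row reports the other group relative to the row's own feature value, and uses the per-group KNN rates from the stratified performance table in this notebook (with "Selection" taken from the Mean Prediction column). The helper `relative_to` is hypothetical and is not part of fairmlhealth.

```python
# Per-group KNN rates (0 = female, 1 = male), copied from the performance table.
rates = {
    0: {"FPR": 0.1000, "TPR": 0.8846, "Precision": 0.9200, "Selection": 0.5435},
    1: {"FPR": 0.0312, "TPR": 0.9333, "Precision": 0.9767, "Selection": 0.5584},
}

def relative_to(reference, metric):
    """Diff and ratio of the other group's rate relative to the reference group."""
    other = 1 - reference  # binary attribute: the remaining gender value
    diff = rates[other][metric] - rates[reference][metric]
    ratio = rates[other][metric] / rates[reference][metric]
    return round(diff, 4), round(ratio, 4)

for value in (0, 1):
    row = {m: relative_to(value, m) for m in ("FPR", "TPR", "Precision", "Selection")}
    print(f"gender = {value}: {row}")
```

With `reference = 1`, the FPR difference comes out at about +0.0688 and the ratio at about 3.2, matching the `gender = 1` row. The small deviations in the last digits arise because the input rates are rounded.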

In [14]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)


# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.555 | 0.935 | 0.9427 | 0.0476 | 0.3857 | 0.964 | 0.9374 | 0.9224 |
| gender | 0 | 46.0 | 0.5652 | 0.5435 | 0.8913 | 0.902 | 0.1 | 0.3697 | 0.92 | 0.8923 | 0.8846 |
| gender | 1 | 154.0 | 0.5844 | 0.5584 | 0.9481 | 0.9545 | 0.0312 | 0.3901 | 0.9767 | 0.951 | 0.9333 |


## Stratified Performance Analysis – KNN by Gender

This table shows the **stratified performance metrics** of the KNN model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9350)** and **F1-score (0.9427)** indicate strong overall model performance.  
- **ROC AUC (0.9374)** demonstrates good discriminatory ability, though not as high as some other models.  
- **Precision (0.9640)** and **TPR (0.9224)** show a solid balance between predictive reliability and sensitivity.  
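- **PR AUC (0.3857)** is implausibly low: it falls below the positive-class prevalence (0.58) even though precision and TPR both exceed 0.92, and the cell also raised a "Possible error in column(s)" warning. This value should be treated with caution rather than attributed to the dataset's class distribution.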

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.8913     | 0.9481   | Accuracy is notably higher for males. |
| **F1-Score**  | 0.9020     | 0.9545   | Model performs better for males. |
| **FPR**       | 0.1000     | 0.0312   | Females experience many more false positives than males. |
| **Precision** | 0.9200     | 0.9767   | Predictions are more reliable for males. |
| **ROC AUC**   | 0.8923     | 0.9510   | Males benefit from stronger ranking performance. |
| **TPR**       | 0.8846     | 0.9333   | Males are more likely to be correctly identified when they have CVD. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Disadvantaged across almost all metrics: lower accuracy, F1, ROC AUC, and precision.  
  - They also suffer from a **much higher false positive rate (10% vs. 3.12%)**, meaning more healthy females are incorrectly flagged as having CVD.  
  - Sensitivity (TPR = 88.46%) is weaker, increasing the risk of missed diagnoses.  

- **Males (privileged)**:  
  - Benefit from **higher accuracy, F1, precision, and ROC AUC**.  
  - Enjoy lower false positives and stronger sensitivity (93.33%), indicating more consistent and favorable performance across all metrics.  

---

### **Summary**
The KNN model exhibits a **systematic bias in favor of males**.  
- **Males** consistently receive more favorable outcomes, with higher accuracy, reliability, and sensitivity, as well as fewer false positives.  
- **Females** face both **higher false alarms** and **more missed cases**, highlighting significant fairness concerns.  

This performance disparity aligns with the fairness metrics, confirming that KNN is the **least balanced model** across gender and requires mitigation.

---
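
Because the female group contains only 46 observations, it helps to translate the rates above into counts. The sketch below back-calculates the confusion counts from the group size, mean target, TPR, and FPR reported in the table, then re-derives precision and accuracy as a consistency check. The counts are a reconstruction from rounded rates, not values read from the prediction file, and `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(n_obs, mean_target, tpr, fpr):
    """Back-calculate (TP, FN, FP, TN) from group size, prevalence, TPR and FPR."""
    positives = round(n_obs * mean_target)
    negatives = n_obs - positives
    tp = round(positives * tpr)
    fp = round(negatives * fpr)
    return tp, positives - tp, fp, negatives - fp

groups = {
    "Female (0)": confusion_counts(46, 0.5652, 0.8846, 0.1000),
    "Male (1)": confusion_counts(154, 0.5844, 0.9333, 0.0312),
}

for name, (tp, fn, fp, tn) in groups.items():
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    print(f"{name}: TP={tp} FN={fn} FP={fp} TN={tn} "
          f"precision={precision:.4f} accuracy={accuracy:.4f}")
```

Under this reconstruction both groups have two false positives, so the gap in FPR is driven by the much smaller number of negative cases among females (20 versus 64).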

In [15]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8846
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9333
  False Positive Rate (FPR): 0.0312
----------------------------------------


### Group-Specific Error Analysis – KNN Model

To further examine fairness at the subgroup level, we compared the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.

#### Results by Gender Group

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.8846  | 0.1000  |
| Privileged (Male)      | 0.9333  | 0.0312  |

#### Interpretation

- The **privileged group (male)** achieves a higher **TPR (93.33%)** than the unprivileged group (88.46%), meaning males are more often correctly identified when they truly have CVD.  
- At the same time, the **FPR for females (10.00%)** is more than three times higher than for males (3.12%), indicating that women are more likely to be incorrectly flagged as having CVD.  
- This imbalance highlights that the model is both **less sensitive** for females (missing more true cases) and **less specific** (producing more false alarms).  
- These subgroup disparities correspond with fairness metrics such as the **Equal Odds Difference and Ratio**, which capture unequal error distributions between genders.  

#### Summary

The results reveal a **systematic disadvantage for the unprivileged group (females)**:  
- They face more missed detections (lower TPR) and more false alarms (higher FPR).  
- In contrast, males benefit from stronger sensitivity and better specificity.  

This outcome underscores the need for **fairness mitigation strategies** to reduce error rate disparities and improve equity in model predictions.

---
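
The link between these group rates and the Equal Odds metrics can be sketched as follows. The code assumes that the Equal Odds Difference is whichever of the TPR and FPR differences has the larger magnitude, and that the Equal Odds Ratio is whichever of the two ratios lies furthest from 1. This convention reproduces the values reported for KNN, but it is an assumption rather than a statement of fairmlhealth's implementation, and `equal_odds` is a hypothetical helper.

```python
def equal_odds(tpr_unpriv, fpr_unpriv, tpr_priv, fpr_priv):
    """Return (difference, ratio) under the assumed 'largest discrepancy' convention.

    Differences are unprivileged minus privileged; ratios are unprivileged
    divided by privileged. The privileged rates must be non-zero.
    """
    diffs = [tpr_unpriv - tpr_priv, fpr_unpriv - fpr_priv]
    ratios = [tpr_unpriv / tpr_priv, fpr_unpriv / fpr_priv]
    eo_diff = max(diffs, key=abs)                      # larger magnitude
    eo_ratio = max(ratios, key=lambda r: abs(r - 1))   # furthest from parity
    return eo_diff, eo_ratio

# KNN rates from the table above: female (unprivileged) vs. male (privileged).
eo_diff, eo_ratio = equal_odds(0.8846, 0.1000, 0.9333, 0.0312)
print(f"Equal Odds Difference: {eo_diff:.4f}")
print(f"Equal Odds Ratio:      {eo_ratio:.4f}")
```

For KNN both values are driven by the FPR gap rather than the TPR gap. The ratio comes out slightly above the reported 3.2 because the male FPR is rounded to 0.0312 here.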

### Decision Tree - DT

In [16]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("MendeleyData_50_50_DT_pruned_tuned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred    y_prob
0       0       0       0  0.000000
1       1       0       0  0.000000
2       1       1       1  0.993939
3       1       1       1  1.000000
4       1       0       0  0.000000


In [17]:
import re

# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred"].values
gender_dt = dt_df["gender"].values


# Use gender_dt as the protected attribute (0/1 as in your CSV)
protected_attr_dt = gender_dt


In [18]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0482
               Balanced Accuracy Difference            0.0191
               Balanced Accuracy Ratio                 1.0205
               Disparate Impact Ratio                  1.0189
               Equal Odds Difference                   0.0444
               Equal Odds Ratio                        1.0667
               Positive Predictive Parity Difference  -0.0062
               Positive Predictive Parity Ratio        0.9934
               Statistical Parity Difference           0.0113
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [19]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric | Measure | Value |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference | 0.0482 |
| Group Fairness | Balanced Accuracy Difference | 0.0191 |
| Group Fairness | Balanced Accuracy Ratio | 1.0205 |
| Group Fairness | Disparate Impact Ratio | 1.0189 |
| Group Fairness | Equal Odds Difference | 0.0444 |
| Group Fairness | Equal Odds Ratio | 1.0667 |
| Group Fairness | Positive Predictive Parity Difference | -0.0062 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9934 |
| Group Fairness | Statistical Parity Difference | 0.0113 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Evaluation – Decision Tree by Gender

Differences are reported as unprivileged minus privileged (female minus male) and ratios as unprivileged divided by privileged, so positive differences and ratios above 1 indicate higher values for females.

---

### 1. Group Fairness Metrics

- **AUC Difference (0.0482):** Females show slightly higher AUC, meaning the model ranks cases somewhat better for females than for males.  
- **Balanced Accuracy Difference (0.0191)** and **Ratio (1.0205):** Balanced accuracy is marginally higher for females, indicating a small performance gap.  
- **Disparate Impact Ratio (1.0189):** Very close to 1, suggesting nearly equal selection rates across genders.  
- **Equal Odds Difference (0.0444)** and **Ratio (1.0667):** Some disparity exists in error rates (TPR/FPR): females have higher sensitivity but also a marginally higher false positive rate.  
- **Positive Predictive Parity Difference (−0.0062)** and **Ratio (0.9934):** Precision is nearly identical across genders, with a very small disadvantage for females.  
- **Statistical Parity Difference (0.0113):** Selection rates are slightly higher for females, though the gap is negligible.  

---

### 2. Interpretation

- Overall, the Decision Tree model demonstrates **relatively balanced performance across genders**.  
- Most fairness measures are very close to parity, but **females (unprivileged group)** hold a slight advantage in terms of AUC, balanced accuracy, and sensitivity.  
- Males show marginally higher precision and a marginally lower false positive rate, so disparities are not severe in either direction.  

---

### **Summary**
The Decision Tree model exhibits **modest fairness disparities**, with a slight tendency to favor females in ranking ability, balanced accuracy, and sensitivity.  
However, differences are small, and the model remains more balanced compared to models like KNN, which show stronger systematic bias.

---

In [20]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


| Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gender | 0 | -0.0191 | 0.9799 | -0.0063 | 0.9375 | 0.0062 | 1.0067 | -0.0113 | 0.9814 | -0.0444 | 0.9556 |
| gender | 1 | 0.0191 | 1.0205 | 0.0063 | 1.0667 | -0.0062 | 0.9934 | 0.0113 | 1.0189 | 0.0444 | 1.0465 |


## Stratified Bias Analysis – Decision Tree by Gender

This table presents **group-specific fairness metrics** for the Decision Tree model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

Each row uses its feature value as the reference group and reports the other group relative to it. The `gender = 1` row reproduces the values from `measure.summary` with `priv_grp=1` (female relative to male), and the `gender = 0` row is its mirror image (male relative to female).

---

### 1. Balanced Accuracy
- **Row gender = 0 (male relative to female):** Difference = −0.0191, Ratio = 0.9799  
- **Row gender = 1 (female relative to male):** Difference = +0.0191, Ratio = 1.0205  
- ➝ Females have slightly higher balanced accuracy, but the disparity is very small.

---

### 2. False Positive Rate (FPR)
- **Row gender = 0 (male relative to female):** Diff = −0.0063, Ratio = 0.9375  
- **Row gender = 1 (female relative to male):** Diff = +0.0063, Ratio = 1.0667  
- ➝ Males have a slightly lower false positive rate, while females face marginally more false alarms.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row gender = 0 (male relative to female):** Diff = +0.0062, Ratio = 1.0067  
- **Row gender = 1 (female relative to male):** Diff = −0.0062, Ratio = 0.9934  
- ➝ Precision is marginally higher for males, though the difference is negligible.

---

### 4. Selection Rate
- **Row gender = 0 (male relative to female):** Diff = −0.0113, Ratio = 0.9814  
- **Row gender = 1 (female relative to male):** Diff = +0.0113, Ratio = 1.0189  
- ➝ Females are selected slightly more often than males.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row gender = 0 (male relative to female):** Diff = −0.0444, Ratio = 0.9556  
- **Row gender = 1 (female relative to male):** Diff = +0.0444, Ratio = 1.0465  
- ➝ Females have higher sensitivity, meaning they are more likely to be correctly identified when they truly have CVD.

---

### **Summary**
- **Females (unprivileged):** Benefit from higher sensitivity (TPR) and balanced accuracy, but face a marginally higher false positive rate and slightly lower precision.  
- **Males (privileged):** Benefit from a slightly lower false positive rate and slightly higher precision, but are disadvantaged by lower sensitivity and balanced accuracy.  

Overall, disparities are **modest**. The Decision Tree model is **relatively balanced**, with only minor tendencies:  
- **Females** are favored in sensitivity and balanced accuracy.  
- **Males** benefit slightly in precision and specificity.  

This suggests the Decision Tree is comparatively fair across gender, with no severe systematic bias.

---

In [21]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.6 | 0.94 | 0.9492 | 0.0952 | — | 0.9333 | 0.9357 | 0.9655 |
| gender | 0 | 46.0 | 0.5652 | 0.6087 | 0.9565 | 0.963 | 0.1 | 0.4101 | 0.9286 | 0.9712 | 1.0 |
| gender | 1 | 154.0 | 0.5844 | 0.5974 | 0.9351 | 0.9451 | 0.0938 | — | 0.9348 | 0.923 | 0.9556 |


## Stratified Performance Analysis – Decision Tree by Gender

This table shows the **stratified performance metrics** of the Decision Tree model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9400)** and **F1-score (0.9492)** indicate excellent overall classification performance.  
- **ROC AUC (0.9357)** confirms strong discriminatory ability between positive and negative cases.  
- **Precision (0.9333)** and **TPR (0.9655)** show the model balances predictive reliability and sensitivity effectively.  
- **Note**: PR AUC is reported as “—” for the overall set, though subgroup values are shown where possible.

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9565     | 0.9351   | Accuracy is higher for females. |
| **F1-Score**  | 0.9630     | 0.9451   | Females achieve stronger F1 performance. |
| **FPR**       | 0.1000     | 0.0938   | Females experience slightly more false positives. |
| **Precision** | 0.9286     | 0.9348   | Predictions are marginally more reliable for males. |
| **ROC AUC**   | 0.9712     | 0.9230   | Females benefit from a higher ranking performance. |
| **TPR**       | 1.0000     | 0.9556   | Females are perfectly identified when they have CVD, while males have slightly lower sensitivity. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Achieve **perfect sensitivity (TPR = 1.0000)**, ensuring no missed positive cases.  
  - Benefit from higher accuracy, F1, and ROC AUC compared to males.  
  - However, they face a slightly higher false positive rate (10% vs. 9.38%) and slightly weaker precision.  

- **Males (privileged)**:  
  - Enjoy slightly higher precision and fewer false positives, making their predictions more reliable.  
  - Their sensitivity (95.56%) is strong but lower than females, resulting in a few more missed cases.  
  - Overall, performance remains excellent but marginally less favorable than for females.  

---

### **Summary**
The Decision Tree model performs **well for both genders**, but females enjoy more favorable outcomes in several key metrics (accuracy, F1, ROC AUC, and sensitivity).  
- **Females**: Higher recall and discrimination but at the cost of slightly more false alarms.  
- **Males**: Strong precision and fewer false positives, but lower sensitivity.  

Overall, the model shows a **mild bias in favor of females**, giving them more comprehensive detection coverage, though both groups achieve high-quality results.

---

In [22]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 1.0000
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9556
  False Positive Rate (FPR): 0.0938
----------------------------------------


### Group-Specific Error Analysis – Decision Tree

This section analyzes the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 1.0000  | 0.1000  |
| Privileged (male = 1)        | 0.9556  | 0.0938  |

#### Interpretation

- **True Positive Rate (TPR)** shows that females achieve **perfect sensitivity (100%)**, meaning all true CVD cases are detected.  
  - Males, while still high, have a slightly lower TPR (**95.56%**), indicating a small proportion of missed cases.  

- **False Positive Rate (FPR)** is fairly similar across groups:  
  - Females: **10.00%**  
  - Males: **9.38%**  
  - This suggests that both groups experience false alarms, with females only marginally more affected.  

#### Implications

- The **Decision Tree achieves excellent sensitivity for females**, ensuring no missed detections in this group.  
- Males maintain strong but slightly weaker sensitivity, while benefiting from a marginally lower false positive rate.  
- These differences reflect a **mild trade-off**: females are fully protected from missed diagnoses but face a touch more false alarms, whereas males enjoy slightly better specificity but risk a few missed cases.  

#### Summary

Overall, the Decision Tree model demonstrates **strong performance across genders** with only small disparities.  
- **Females (unprivileged):** favored in sensitivity (perfect recall) but incur a slightly higher false positive rate.  
- **Males (privileged):** experience fewer false positives but slightly lower sensitivity.  

This indicates a **balanced but not identical error distribution**, where the model leans slightly in favor of females in terms of detection coverage.

---

### Ensemble Model - Random Forest - RF

In [23]:
rf_df = pd.read_csv("MendeleyData_50_50_baselineRF_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred  y_prob
0       0       0       0    0.11
1       1       0       0    0.01
2       1       1       1    0.96
3       1       1       1    0.88
4       1       0       0    0.01


In [24]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender_rf as the protected attribute (0/1 as in your CSV)
protected_attr_rf = gender_rf

In [25]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0003
               Balanced Accuracy Difference            0.0146
               Balanced Accuracy Ratio                 1.0156
               Disparate Impact Ratio                  1.0652
               Equal Odds Difference                   0.0667
               Equal Odds Ratio                        1.6000
               Positive Predictive Parity Difference  -0.0260
               Positive Predictive Parity Ratio        0.9728
               Statistical Parity Difference           0.0373
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [26]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

Metric,Measure,Value
Group Fairness,AUC Difference,0.0003
Group Fairness,Balanced Accuracy Difference,0.0146
Group Fairness,Balanced Accuracy Ratio,1.0156
Group Fairness,Disparate Impact Ratio,1.0652
Group Fairness,Equal Odds Difference,0.0667
Group Fairness,Equal Odds Ratio,1.6
Group Fairness,Positive Predictive Parity Difference,-0.026
Group Fairness,Positive Predictive Parity Ratio,0.9728
Group Fairness,Statistical Parity Difference,0.0373
Data Metrics,Prevalence of Privileged Class (%),77.0


## Fairness Evaluation – Random Forest by Gender

---

### 1. Group Fairness Metrics

- **AUC Difference (0.0003):** Negligible difference in ranking performance across genders, suggesting both groups are treated similarly in terms of separability.  
- **Balanced Accuracy Difference (0.0146)** and **Ratio (1.0156):** Values are computed as female (unprivileged) relative to male (privileged), so females have a slight advantage, but the difference remains very small.  
- **Disparate Impact Ratio (1.0652):** Within the fairness guideline range (0.8–1.25), showing reasonably balanced selection rates, with females selected slightly more often.  
- **Equal Odds Difference (0.0667)** and **Ratio (1.6000):** A more pronounced disparity in error rates. The difference matches the TPR gap (1.0000 for females vs. 0.9333 for males) and the ratio matches the FPR ratio (0.1000 / 0.0625), so females have both higher sensitivity and a higher false positive rate.  
- **Positive Predictive Parity Difference (−0.0260)** and **Ratio (0.9728):** Precision is slightly lower for females, meaning positive predictions are a bit less reliable for the unprivileged group.  
- **Statistical Parity Difference (0.0373):** Females are selected somewhat more often than males (0.6087 vs. 0.5714), though the difference is modest.  
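
The summary values can be reproduced by hand from per-group confusion counts. The sketch below is a minimal illustration, not FairMLHealth's implementation: the counts are back-calculated from the rates and group sizes reported in the stratified performance table later in this notebook (an assumption), the helper names `rates` and `group_fairness` are hypothetical, and treating Equal Odds as the larger of the TPR and FPR gaps is an assumption that happens to match the reported numbers.

```python
# Confusion counts per group, back-calculated from the reported rates
# (assumption: 46 females and 154 males in the test set).
rf_counts = {
    "female": {"tp": 26, "fn": 0, "fp": 2, "tn": 18},
    "male": {"tp": 84, "fn": 6, "fp": 4, "tn": 60},
}

def rates(c):
    """Per-group rates from one confusion-count dictionary."""
    n = c["tp"] + c["fn"] + c["fp"] + c["tn"]
    tpr = c["tp"] / (c["tp"] + c["fn"])
    fpr = c["fp"] / (c["fp"] + c["tn"])
    return {
        "tpr": tpr,
        "fpr": fpr,
        "ppv": c["tp"] / (c["tp"] + c["fp"]),
        "selection": (c["tp"] + c["fp"]) / n,
        "balanced_accuracy": (tpr + (1 - fpr)) / 2,
    }

def group_fairness(unpriv, priv):
    """Differences are unprivileged minus privileged; ratios are unprivileged over privileged."""
    u, p = rates(unpriv), rates(priv)
    diff = {k: u[k] - p[k] for k in u}
    ratio = {k: u[k] / p[k] for k in u}
    return {
        "Balanced Accuracy Difference": diff["balanced_accuracy"],
        "Balanced Accuracy Ratio": ratio["balanced_accuracy"],
        "Disparate Impact Ratio": ratio["selection"],
        # Assumption: report whichever of the TPR/FPR gaps is larger in magnitude.
        "Equal Odds Difference": max(diff["tpr"], diff["fpr"], key=abs),
        # Assumption: report whichever of the two ratios is further from 1.
        "Equal Odds Ratio": max(ratio["tpr"], ratio["fpr"], key=lambda r: abs(r - 1)),
        "Positive Predictive Parity Difference": diff["ppv"],
        "Positive Predictive Parity Ratio": ratio["ppv"],
        "Statistical Parity Difference": diff["selection"],
    }

rf_manual = {
    name: round(value, 4)
    for name, value in group_fairness(rf_counts["female"], rf_counts["male"]).items()
}
for name, value in rf_manual.items():
    print(f"{name:40s}{value:8.4f}")
```

Under these assumptions the printed values agree with the summary table to four decimals, which confirms the sign convention: a positive difference means the rate is higher for females.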


### 2. Interpretation

- The Random Forest model shows **overall balanced fairness across most metrics**, with negligible differences in AUC and balanced accuracy.  
- However, the **Equal Odds metrics highlight a notable disparity**, meaning that males and females experience different error rates (false positives and false negatives).  
- Precision is slightly higher for males, while females are selected slightly more often and have higher sensitivity.  

---

### **Summary**
The Random Forest model demonstrates **generally fair performance across genders**, but with **uneven error distributions**: males have slightly higher precision and fewer false positives, while females have higher sensitivity.  
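While disparities are smaller than in KNN, the Equal Odds Ratio of 1.6000 is further from parity than the MLP model's 0.8000, although the MLP shows larger gaps on the difference metrics (for example, Equal Odds Difference −0.1368 vs. 0.0667). Fairness monitoring and potential mitigation therefore remain important if this model were to be used in practice.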

---

In [27]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


Unnamed: 0,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,gender,0,-0.0146,0.9846,-0.0375,0.625,0.026,1.028,-0.0373,0.9388,-0.0667,0.9333
1,gender,1,0.0146,1.0156,0.0375,1.6,-0.026,0.9728,0.0373,1.0652,0.0667,1.0714


## Stratified Bias Analysis – Random Forest by Gender

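This table presents **group-specific fairness metrics** for the Random Forest model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  
Each row treats the listed value as the reference group and reports the other group's rate minus (or divided by) the reference group's rate. For example, the TPR Diff of −0.0667 in the female row is 0.9333 (male) − 1.0000 (female).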

---

### 1. Balanced Accuracy
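- **Row 0 (female as reference):** Difference = −0.0146, Ratio = 0.9846  
- **Row 1 (male as reference):** Difference = +0.0146, Ratio = 1.0156  
- ➝ Each row compares the other group against the listed reference, so females have slightly higher balanced accuracy (0.9500 vs. 0.9354); the gap is very small.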

---

### 2. False Positive Rate (FPR)
- **Row 0 (female as reference):** Diff = −0.0375, Ratio = 0.6250  
- **Row 1 (male as reference):** Diff = +0.0375, Ratio = 1.6000  
- ➝ Females face a higher false positive rate (0.1000 vs. 0.0625), so healthy females are more likely than healthy males to be incorrectly classified as having CVD.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row 0 (female as reference):** Diff = +0.0260, Ratio = 1.0280  
- **Row 1 (male as reference):** Diff = −0.0260, Ratio = 0.9728  
- ➝ Precision is slightly higher for males (0.9545 vs. 0.9286), meaning their positive predictions are marginally more reliable.

---

### 4. Selection Rate
- **Row 0 (female as reference):** Diff = −0.0373, Ratio = 0.9388  
- **Row 1 (male as reference):** Diff = +0.0373, Ratio = 1.0652  
- ➝ Females are selected slightly more often than males (0.6087 vs. 0.5714).

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row 0 (female as reference):** Diff = −0.0667, Ratio = 0.9333  
- **Row 1 (male as reference):** Diff = +0.0667, Ratio = 1.0714  
- ➝ Females have higher sensitivity (1.0000 vs. 0.9333): every female CVD case in the test set is detected, while a small share of male cases is missed.

---

### **Summary**
- **Females (unprivileged):** Benefit from **higher sensitivity (TPR)** and a slightly higher selection rate, but face **more false positives** and slightly **lower precision**.  
- **Males (privileged):** Benefit from **fewer false positives** and slightly **higher precision**, but have lower sensitivity.  

Overall, the Random Forest model shows a **trade-off**:  
- **Females** are detected more completely (higher TPR), at the cost of **more false positives** and slightly weaker prediction reliability.  
- **Males** miss more true cases, but when predicted positive, their results are more precise and come with fewer false alarms.  

This indicates a **mixed fairness profile**, with each gender advantaged in different aspects of performance.
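
The reading direction of the stratified table is easy to get wrong, so it helps to rebuild two of its columns by hand. The sketch below assumes that each row uses the listed value as the reference group and reports the other group relative to it; the rates are fractions back-calculated from the reported TPR and FPR values (an assumption), and `reference_rows` is a hypothetical helper that only handles the two-group case.

```python
# Per-group rates as fractions back-calculated from the reported values
# (assumption): 26 female and 90 male positives, 20 female and 64 male negatives.
rf_rates = {
    0: {"TPR": 26 / 26, "FPR": 2 / 20},   # female
    1: {"TPR": 84 / 90, "FPR": 4 / 64},   # male
}

def reference_rows(group_rates):
    """For each reference group, report (other - reference) and (other / reference)."""
    rows = {}
    for ref in group_rates:
        other = next(g for g in group_rates if g != ref)
        row = {}
        for metric, ref_value in group_rates[ref].items():
            other_value = group_rates[other][metric]
            row[f"{metric} Diff"] = round(other_value - ref_value, 4)
            row[f"{metric} Ratio"] = round(other_value / ref_value, 4)
        rows[ref] = row
    return rows

rf_rows = reference_rows(rf_rates)
for ref, row in rf_rows.items():
    print(ref, row)
```

With these inputs the female row (0) shows a negative TPR Diff even though female sensitivity is the higher of the two, which is why each row has to be read against its reference group.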

---

In [28]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


Unnamed: 0,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,200.0,0.58,0.58,0.94,0.9483,0.0714,—,0.9483,0.9859,0.9483
1,gender,0,46.0,0.5652,0.6087,0.9565,0.963,0.1,—,0.9286,0.9856,1.0
2,gender,1,154.0,0.5844,0.5714,0.9351,0.9438,0.0625,—,0.9545,0.9852,0.9333


## Stratified Performance Analysis – Random Forest by Gender

This table shows the **stratified performance metrics** of the Random Forest model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9400)** and **F1-score (0.9483)** indicate excellent overall classification performance.  
- **ROC AUC (0.9859)** confirms very strong discriminatory ability.  
- **Precision (0.9483)** and **TPR (0.9483)** show the model balances predictive reliability and sensitivity well.  
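- **Note**: PR AUC is missing (shown as a dash) in every row, including the full test set of 200 observations, so subgroup size does not explain the gap. The call also emitted a `Possible error in column(s)` warning; the cause is not established here.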
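
As a cross-check, the overall row can be recomputed from pooled confusion counts. The counts below are back-calculated from the reported rates (an assumption, not values printed by the notebook), so this is a consistency sketch rather than the library's computation.

```python
# Pooled confusion counts for the 200 test rows, back-calculated from the
# reported rates (assumption): 116 positives and 84 negatives.
tp, fn, fp, tn = 110, 6, 6, 78

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
tpr = tp / (tp + fn)            # sensitivity / recall
fpr = fp / (fp + tn)
f1 = 2 * precision * tpr / (precision + tpr)

overall = {
    "Accuracy": round(accuracy, 4),
    "Precision": round(precision, 4),
    "TPR": round(tpr, 4),
    "FPR": round(fpr, 4),
    "F1-Score": round(f1, 4),
}
print(overall)
```

Precision, TPR, and F1 coincide at 0.9483 because, under these counts, the number of false positives equals the number of false negatives.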

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9565     | 0.9351   | Females achieve slightly higher accuracy. |
| **F1-Score**  | 0.9630     | 0.9438   | F1 performance is stronger for females. |
| **FPR**       | 0.1000     | 0.0625   | Females experience more false positives than males. |
| **Precision** | 0.9286     | 0.9545   | Predictions are more reliable for males. |
| **ROC AUC**   | 0.9856     | 0.9852   | Both groups have excellent ranking performance, with virtually no difference. |
| **TPR**       | 1.0000     | 0.9333   | Females are perfectly identified when they have CVD, while males are slightly less likely to be detected. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Benefit from higher accuracy, F1, and perfect sensitivity (TPR = 1.0000), meaning no missed positive cases.  
  - However, they face more false alarms (FPR = 10%) and slightly lower precision than males.  

- **Males (privileged)**:  
  - Enjoy stronger precision (0.9545) and fewer false positives (6.25%).  
  - Their sensitivity (TPR = 93.33%) is slightly weaker, leading to a small number of missed cases.  
  - Overall, predictions for males are more reliable but less comprehensive in terms of detection.  

---

### **Summary**
The Random Forest model performs **very well for both genders**, but distributes errors differently:  
- **Females** benefit from stronger sensitivity and higher overall accuracy but are penalized with more false positives and weaker precision.  
- **Males** gain more reliable predictions and fewer false positives but face slightly lower sensitivity.  

This reflects a **trade-off rather than systematic bias**: females are more comprehensively detected, while males are more accurately confirmed.

---

In [29]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 1.0000
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9333
  False Positive Rate (FPR): 0.0625
----------------------------------------
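
For readers without the library at hand, the two rates computed in this cell can be written out directly. This is a plain-Python sketch on made-up toy labels, not the `performance_metrics` implementation; `tpr_fpr` is a hypothetical helper.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Toy data (illustrative only, not the notebook's test set).
gender = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_true = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

by_group = {}
for g in sorted(set(gender)):
    # Select the rows belonging to this group, mirroring the boolean masks above.
    idx = [i for i, value in enumerate(gender) if value == g]
    by_group[g] = tpr_fpr([y_true[i] for i in idx], [y_pred[i] for i in idx])
    print(f"group {g}: TPR={by_group[g][0]:.4f}, FPR={by_group[g][1]:.4f}")
```

The helper divides by the number of actual positives and actual negatives in each group, so it assumes every group contains at least one case of each class.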


### Group-Specific Error Analysis – Random Forest

This section presents the performance of the Random Forest model across gender groups, focusing on **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 1.0000  | 0.1000  |
| Privileged (male = 1)        | 0.9333  | 0.0625  |

#### Interpretation

- **True Positive Rate (TPR)** is **perfect for females (100%)**, compared to **93.33% for males**.  
  - This shows the model successfully identifies all true positive cases among females, while a small proportion of male cases are missed.  

- **False Positive Rate (FPR)** is **higher for females (10.00%)** than for males (6.25%).  
  - This indicates that females are more likely to be incorrectly flagged as having CVD.  

#### Implications

- The model exhibits a **gender-based trade-off**:  
  - **Females (unprivileged):** benefit from perfect sensitivity (no missed cases), but face more false alarms.  
  - **Males (privileged):** enjoy fewer false positives, but at the cost of slightly lower sensitivity.  

- This asymmetry suggests that the Random Forest model distributes errors differently across groups rather than consistently favoring one gender.  
  - **Females are over-diagnosed** (higher FPR) but fully detected (perfect TPR).  
  - **Males are under-diagnosed** (lower TPR) but less likely to be misclassified when healthy.  

- Depending on the clinical context, these imbalances may have different consequences:  
  - For females, more unnecessary follow-ups due to false alarms.  
  - For males, greater risk of missed diagnoses.  
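
To make these consequences concrete, the rates can be converted back into patient counts. The group sizes below are inferred from the stratified performance table (observations times mean target, rounded), which is an assumption rather than a value the notebook prints.

```python
# Group sizes implied by the stratified performance table
# (Obs. x Mean Target, rounded; treated here as an assumption).
groups = {
    "female": {"positives": 26, "negatives": 20, "tpr": 1.0000, "fpr": 0.1000},
    "male": {"positives": 90, "negatives": 64, "tpr": 0.9333, "fpr": 0.0625},
}

error_counts = {}
for name, g in groups.items():
    missed = round(g["positives"] * (1 - g["tpr"]))   # false negatives
    false_alarms = round(g["negatives"] * g["fpr"])   # false positives
    error_counts[name] = {"missed_cases": missed, "false_alarms": false_alarms}
    print(f"{name}: {missed} missed of {g['positives']} CVD cases, "
          f"{false_alarms} false alarms among {g['negatives']} healthy patients")
```

Under these assumptions the female FPR of 10% rests on 2 false alarms among 20 healthy females, so a single additional case would move it by 5 percentage points. The small subgroup size is worth keeping in mind when comparing the two groups.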

#### Summary

The Random Forest model provides **excellent performance for both genders**, but error trade-offs differ:  
- **Females** gain in sensitivity but lose in specificity.  
- **Males** gain in specificity but lose in sensitivity.  

This pattern highlights the importance of evaluating not only overall accuracy but also **fairness in error distributions** when applying the model in healthcare settings.

---

### Deep Learning Model - Feed Forward Network (MLP)

In [30]:
mlp_df = pd.read_csv("MendeleyData_50_50_MLP_recallfirst_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred        y_prob
0       0       0       0  3.821277e-04
1       1       0       0  2.510854e-09
2       1       1       1  9.999984e-01
3       1       1       1  9.999645e-01
4       1       0       0  1.178902e-05


In [31]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 

In [32]:
#Run fairmlhealth bias detection for MLP 

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0075
               Balanced Accuracy Difference           -0.0621
               Balanced Accuracy Ratio                 0.9340
               Disparate Impact Ratio                  0.8276
               Equal Odds Difference                  -0.1368
               Equal Odds Ratio                        0.8000
               Positive Predictive Parity Difference  -0.0005
               Positive Predictive Parity Ratio        0.9995
               Statistical Parity Difference          -0.0997
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [33]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

Metric,Measure,Value
Group Fairness,AUC Difference,-0.0075
Group Fairness,Balanced Accuracy Difference,-0.0621
Group Fairness,Balanced Accuracy Ratio,0.934
Group Fairness,Disparate Impact Ratio,0.8276
Group Fairness,Equal Odds Difference,-0.1368
Group Fairness,Equal Odds Ratio,0.8
Group Fairness,Positive Predictive Parity Difference,-0.0005
Group Fairness,Positive Predictive Parity Ratio,0.9995
Group Fairness,Statistical Parity Difference,-0.0997
Data Metrics,Prevalence of Privileged Class (%),77.0


## Fairness Evaluation – MLP by Gender

---

### 1. Group Fairness Metrics

- **AUC Difference (−0.0075):** Very small difference in ranking performance across genders, suggesting that overall discrimination between positive and negative cases is balanced.  
- **Balanced Accuracy Difference (−0.0621)** and **Ratio (0.9340):** Females have lower balanced accuracy, indicating weaker classification performance compared to males.  
- **Disparate Impact Ratio (0.8276):** Just inside the lower bound of the 0.80–1.25 guideline range, showing that females are selected at noticeably lower rates.  
- **Equal Odds Difference (−0.1368)** and **Ratio (0.8000):** Clear disparities in error rates. The difference matches the TPR gap (0.8077 for females vs. 0.9444 for males) and the ratio matches the FPR ratio (0.0500 / 0.0625), so males are favored in sensitivity while females have slightly fewer false positives.  
- **Positive Predictive Parity Difference (−0.0005)** and **Ratio (0.9995):** Precision is nearly identical across genders, meaning predictive reliability of positive cases is consistent.  
- **Statistical Parity Difference (−0.0997):** Females are selected less often than males (0.4783 vs. 0.5779), a gap that the similar base rates (0.5652 vs. 0.5844) do not explain.  
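
A simple range check shows how close several of these values sit to a boundary. This is an illustrative sketch, not the notebook's `MyFlagger` or the `bounds` object: the ratio range is the 0.80–1.25 guideline quoted above, while the ±0.10 bound for differences and the helper `out_of_range` are assumptions made for the example.

```python
# MLP summary values copied from the table above.
mlp_summary = {
    "AUC Difference": -0.0075,
    "Balanced Accuracy Difference": -0.0621,
    "Balanced Accuracy Ratio": 0.9340,
    "Disparate Impact Ratio": 0.8276,
    "Equal Odds Difference": -0.1368,
    "Equal Odds Ratio": 0.8000,
    "Positive Predictive Parity Difference": -0.0005,
    "Positive Predictive Parity Ratio": 0.9995,
    "Statistical Parity Difference": -0.0997,
}

RATIO_RANGE = (0.80, 1.25)   # guideline range quoted in this notebook
DIFF_RANGE = (-0.10, 0.10)   # assumption: illustrative bound for differences

def out_of_range(measure, value):
    """True if the value falls outside the range for its measure type."""
    low, high = RATIO_RANGE if measure.endswith("Ratio") else DIFF_RANGE
    return not (low <= value <= high)

flagged = [m for m, v in mlp_summary.items() if out_of_range(m, v)]
print(flagged)
```

With these illustrative bounds only the Equal Odds Difference is flagged, but the Equal Odds Ratio (0.8000) sits exactly on the lower bound and the Statistical Parity Difference (−0.0997) is just inside it, so small changes in the test split could alter the picture.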

---

### 2. Interpretation

- The MLP model shows **mixed fairness results**:  
  - On the positive side, **AUC difference is negligible** and **precision is nearly equal** across genders.  
  - However, females face disadvantages in **balanced accuracy, selection rates (Statistical Parity), and sensitivity (Equal Odds)**.  
- The **Equal Odds metrics** in particular reveal that males benefit from markedly higher sensitivity, while females have only a marginally lower false positive rate.  

---

### **Summary**
The MLP model demonstrates **unequal treatment across genders**, with a tendency to favor males (privileged group).  
While precision is balanced and AUC differences are minimal, females face **lower balanced accuracy, reduced selection rates, and less favorable error distributions**.  
This suggests that without mitigation, the MLP model risks reinforcing bias against the unprivileged group (females).

---

In [34]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


Unnamed: 0,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,gender,0,0.0621,1.0707,0.0125,1.25,0.0005,1.0005,0.0997,1.2084,0.1368,1.1693
1,gender,1,-0.0621,0.934,-0.0125,0.8,-0.0005,0.9995,-0.0997,0.8276,-0.1368,0.8552


## Stratified Bias Analysis – MLP by Gender

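This table presents **group-specific fairness metrics** for the MLP model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  
Each row treats the listed value as the reference group and reports the other group's rate minus (or divided by) the reference group's rate. For example, the TPR Diff of +0.1368 in the female row is 0.9444 (male) − 0.8077 (female).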

---

### 1. Balanced Accuracy
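- **Row 0 (female as reference):** Difference = +0.0621, Ratio = 1.0707  
- **Row 1 (male as reference):** Difference = −0.0621, Ratio = 0.9340  
- ➝ Each row compares the other group against the listed reference, so males have higher balanced accuracy (about 0.941 vs. 0.879), while females are disadvantaged.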

---

### 2. False Positive Rate (FPR)
- **Row 0 (female as reference):** Diff = +0.0125, Ratio = 1.25  
- **Row 1 (male as reference):** Diff = −0.0125, Ratio = 0.80  
- ➝ Males experience a slightly higher false positive rate (0.0625 vs. 0.0500), whereas females have slightly fewer false alarms.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row 0 (female as reference):** Diff = +0.0005, Ratio = 1.0005  
- **Row 1 (male as reference):** Diff = −0.0005, Ratio = 0.9995  
- ➝ Precision is essentially equal across genders (0.9545 for females vs. 0.9551 for males), with no meaningful disparity.

---

### 4. Selection Rate
- **Row 0 (female as reference):** Diff = +0.0997, Ratio = 1.2084  
- **Row 1 (male as reference):** Diff = −0.0997, Ratio = 0.8276  
- ➝ Males are selected substantially more often (0.5779 vs. 0.4783), while females are under-selected.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row 0 (female as reference):** Diff = +0.1368, Ratio = 1.1693  
- **Row 1 (male as reference):** Diff = −0.1368, Ratio = 0.8552  
- ➝ Males have much higher sensitivity (0.9444 vs. 0.8077), meaning their positive cases are detected more reliably. Females face a greater risk of missed detections.

---

### **Summary**
- **Females (unprivileged):** Are disadvantaged in balanced accuracy, selection rate, and sensitivity (TPR), though they benefit from a slightly lower false positive rate.  
- **Males (privileged):** Gain advantages in sensitivity and balanced accuracy, at the cost of a slightly higher false positive rate.  
- Precision (PPV) is virtually identical across genders, reducing concerns about prediction reliability.  

Overall, the MLP model shows a **clear tilt in favor of males**, who are detected and selected more often, though they are also slightly more prone to false alarms. Females, in contrast, face more missed diagnoses, which is a critical fairness concern in healthcare contexts.

---

In [35]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


Unnamed: 0,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,200.0,0.58,0.555,0.925,0.9339,0.0595,—,0.955,0.9778,0.9138
1,gender,0,46.0,0.5652,0.4783,0.8696,0.875,0.05,—,0.9545,0.9731,0.8077
2,gender,1,154.0,0.5844,0.5779,0.9416,0.9497,0.0625,—,0.9551,0.9806,0.9444


## Stratified Performance Analysis – MLP by Gender

This table shows the **stratified performance metrics** of the MLP model across gender groups.  

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9250)** and **F1-score (0.9339)** indicate strong overall performance.  
- **ROC AUC (0.9778)** demonstrates excellent discriminatory ability.  
- **Precision (0.9550)** is high, meaning positive predictions are reliable.  
- **TPR (0.9138)** suggests that the model correctly detects most positive cases, though subgroup analysis reveals imbalances.

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.8696     | 0.9416   | Accuracy is much higher for males. |
| **F1-Score**  | 0.8750     | 0.9497   | Model performs better for males. |
| **FPR**       | 0.0500     | 0.0625   | Females experience slightly fewer false positives. |
| **Precision** | 0.9545     | 0.9551   | Precision is nearly identical across genders. |
| **ROC AUC**   | 0.9731     | 0.9806   | Males benefit from stronger ranking performance. |
| **TPR**       | 0.8077     | 0.9444   | Females are more likely to be missed (lower sensitivity). |
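
The subgroup figures can be tied back to concrete cases. The confusion counts below are back-calculated from the table above (an assumption), and `performance` is a hypothetical helper, so treat this as a consistency sketch rather than the library's calculation.

```python
# Per-group confusion counts back-calculated from the stratified
# performance table (assumption).
mlp_counts = {
    "female": {"tp": 21, "fn": 5, "fp": 1, "tn": 19},
    "male": {"tp": 85, "fn": 5, "fp": 4, "tn": 60},
}

def performance(c):
    """Standard classification metrics from one group's confusion counts."""
    total = sum(c.values())
    return {
        "Accuracy": (c["tp"] + c["tn"]) / total,
        "F1-Score": 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"]),
        "Precision": c["tp"] / (c["tp"] + c["fp"]),
        "TPR": c["tp"] / (c["tp"] + c["fn"]),
        "FPR": c["fp"] / (c["fp"] + c["tn"]),
    }

mlp_perf = {
    group: {name: round(value, 4) for name, value in performance(c).items()}
    for group, c in mlp_counts.items()
}
for group, metrics in mlp_perf.items():
    print(group, metrics)
```

Under these counts both groups have five missed cases, but five of 26 female positives is a far larger share than five of 90 male positives, which is what drives the sensitivity gap.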

---

### 3. Interpretation
- **Females (unprivileged):**  
  - Disadvantaged in accuracy, F1, ROC AUC, and especially sensitivity (TPR = 80.77%).  
  - Benefit slightly from fewer false positives (5% vs. 6.25%).  
  - Precision is nearly equal to males, showing prediction reliability is consistent.  

- **Males (privileged):**  
  - Enjoy higher accuracy, stronger F1, and better ranking ability.  
  - Higher sensitivity (94.44%) means more male cases are correctly identified.  
  - They face a slightly higher false positive rate, but the trade-off is favorable overall.  

---

### **Summary**
The MLP model demonstrates a **systematic disadvantage for females**, who suffer from lower sensitivity and weaker overall performance despite having slightly fewer false positives.  
- **Males** benefit from higher recall, accuracy, and AUC, making predictions more favorable for them.  
- **Females** risk more missed diagnoses, which is a critical fairness concern in clinical applications.  

This indicates that the MLP model may be **biased in favor of males**, requiring fairness-aware adjustments to reduce disparities.

---

In [36]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8077
  False Positive Rate (FPR): 0.0500
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9444
  False Positive Rate (FPR): 0.0625
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section breaks down the classification performance of the MLP model across gender groups, using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 0.8077  | 0.0500  |
| Privileged (male = 1)        | 0.9444  | 0.0625  |

#### Interpretation

- **True Positive Rate (TPR)** is higher for males (94.44%) compared to females (80.77%).  
  - This means the model is **better at correctly identifying true positive cases for males**, while females face more missed detections.  

- **False Positive Rate (FPR)** is lower for females (5.00%) compared to males (6.25%).  
  - This indicates that females are **less likely to receive false alarms** than males.  

#### Implications

- The MLP model shows a **gender-based trade-off**:  
  - **Females (unprivileged):** benefit from fewer false positives (greater specificity) but suffer from lower sensitivity, meaning more true cases are missed.  
  - **Males (privileged):** benefit from stronger sensitivity (higher TPR), but this comes with slightly more false alarms.  

- These asymmetries align with the fairness metrics (e.g., **Equal Odds Difference = −0.1368** and **Equal Odds Ratio = 0.8000**), which reflect uneven error distributions across genders.  

#### Recommendation

- While both groups perform relatively well overall, the model tends to **favor males in sensitivity** (recall), while **females gain in specificity** (fewer false alarms).  
- In a clinical setting:  
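  - **If early detection is the priority**, the lower female TPR (80.77% vs. 94.44%) is the more serious gap, because it translates into missed diagnoses.  
  - **If reducing unnecessary interventions is the priority**, the difference in FPR (5.00% vs. 6.25%) matters more, and it is comparatively small.  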
- These trade-offs highlight the need to consider **fairness-aware approaches** to balance error rates across genders.
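
One family of fairness-aware approaches adjusts the decision threshold per group after training so that sensitivity is matched. The sketch below is a toy illustration of that idea only: the notebook does not apply it, the scores are made up rather than taken from `y_prob`, and both helper functions are hypothetical.

```python
import math

def threshold_for_target_tpr(scores, labels, target_tpr):
    """Highest cut-off t such that predicting positive when score >= t reaches target_tpr."""
    positive_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positive_scores:
        raise ValueError("no positive cases in this group")
    # Number of positive cases that must score at or above the cut-off.
    needed = max(1, math.ceil(target_tpr * len(positive_scores)))
    return positive_scores[needed - 1]

def tpr_at(scores, labels, threshold):
    """Share of positive cases scoring at or above the threshold."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= threshold for s in positives) / len(positives)

# Toy scores (illustrative only, not the notebook's y_prob values).
toy = {
    "female": {"scores": [0.95, 0.80, 0.45, 0.30, 0.60, 0.10], "labels": [1, 1, 1, 1, 0, 0]},
    "male": {"scores": [0.90, 0.85, 0.70, 0.55, 0.40, 0.20], "labels": [1, 1, 1, 1, 0, 0]},
}

# A shared cut-off of 0.5 gives the two groups different sensitivity.
shared = {g: tpr_at(d["scores"], d["labels"], 0.5) for g, d in toy.items()}

# Group-specific cut-offs chosen to reach the same target sensitivity.
target = 0.75
thresholds = {g: threshold_for_target_tpr(d["scores"], d["labels"], target) for g, d in toy.items()}
matched = {g: tpr_at(d["scores"], d["labels"], thresholds[g]) for g, d in toy.items()}

print("shared cut-off TPR:", shared)
print("group cut-offs:", thresholds)
print("matched TPR:", matched)
```

Lowering a group's cut-off to raise its sensitivity generally raises its false positive rate as well, so any such adjustment would need to be validated on held-out data and judged against the clinical cost of each error type.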

---