### Bias Detection and Fairness Evaluation on CVD Prediction (Mendeley Dataset) using FairMLhealth
Source: https://data.mendeley.com/datasets/dzz48mvjht/1

In [37]:
import pandas as pd

# Load X_test set
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [38]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [39]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [40]:
#have a look at the modules that are within fairmlhealth

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__fairness_metrics', '__file__', '__loader__', '__name__', '__package__', '__path__', '__preprocessing', '__spec__', '__utils', '__validation', 'measure', 'performance_metrics']
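The cells that follow import from double-underscore modules such as `fairmlhealth.__utils`. These are conventionally internal and may change between releases, so it can help to record the installed versions next to the results. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of FairMLHealth:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this notebook depends on (names taken from the cells above).
for name in ("fairmlhealth", "aif360", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```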


In [41]:
#load necessary modules 

#import module measure to use measure.summary for bias detection
from fairmlhealth import measure

#import module for investigation of individual cohorts 
from fairmlhealth.__utils import iterate_cohorts

#import FairRanges to flag high values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)

Running FairMLHealth and AIF360 raises several runtime warnings, for example "AdversarialDebiasing will be unavailable" (TensorFlow is not installed) and deprecation warnings from the inFairness package about PyTorch's functorch.vmap. The affected components are not used in this analysis, so the warnings do not change the fairness metrics reported below. They are silenced in the next cell to keep the output readable. Note that the first filter suppresses every FutureWarning, not only those raised by these packages.

In [42]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")
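The filters above apply to the whole session. Where only one noisy call needs to be quieted, a scoped alternative is `warnings.catch_warnings()`, which restores the previous filters on exit. A minimal sketch; `noisy_step` is a hypothetical stand-in for a library call:

```python
import warnings


def noisy_step():
    """Stand-in for a library call that emits a FutureWarning (hypothetical)."""
    warnings.warn("this API will change", FutureWarning)
    return 42


# Filters set inside the context manager are undone when the block exits.
with warnings.catch_warnings(record=True) as caught_inside:
    warnings.simplefilter("ignore", FutureWarning)
    result = noisy_step()

# A second recording context with the filter set to "always" shows that the
# warning is still emitted when it is not explicitly ignored.
with warnings.catch_warnings(record=True) as caught_outside:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught_inside), len(caught_outside))
```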

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [43]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("MendeleyData_75F25M_KNN_predictions.csv")

print(knn_df.head())

   gender  y_true  y_pred_knn  y_prob_knn
0       0       0           1         0.6
1       1       0           0         0.0
2       1       1           1         0.6
3       1       1           1         0.6
4       1       0           0         0.0


In [44]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob_knn"].values
y_pred_knn = knn_df["y_pred_knn"].values
gender_knn = knn_df["gender"].values

# Use gender as the protected attribute (0/1, as encoded in the predictions CSV)
protected_attr_knn = gender_knn

In [45]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0075
               Balanced Accuracy Difference           -0.0608
               Balanced Accuracy Ratio                 0.9327
               Disparate Impact Ratio                  1.0390
               Equal Odds Difference                   0.1062
               Equal Odds Ratio                        2.1333
               Positive Predictive Parity Difference  -0.0792
               Positive Predictive Parity Ratio        0.9150
               Statistical Parity Difference           0.0220
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [46]:
# Custom, scenario-oriented fairness bounds (acceptable range per metric)

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)
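As a library-independent cross-check of these bounds, the sketch below tests the KNN summary values from the earlier output against the custom ranges in plain Python. Matching measures to bounds by lower-cased name is an assumption of this sketch, not a statement about how `FairRanges` works internally. Measures without a custom entry are skipped here, whereas the library may apply its own defaults to them. The names `example_ranges`, `knn_summary`, and `out_of_range` are hypothetical.

```python
# Custom bounds copied from the cell above.
example_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

# KNN values copied from the printed measure.summary output.
knn_summary = {
    "AUC Difference": -0.0075,
    "Balanced Accuracy Difference": -0.0608,
    "Balanced Accuracy Ratio": 0.9327,
    "Disparate Impact Ratio": 1.0390,
    "Equal Odds Difference": 0.1062,
    "Equal Odds Ratio": 2.1333,
    "Positive Predictive Parity Difference": -0.0792,
    "Positive Predictive Parity Ratio": 0.9150,
    "Statistical Parity Difference": 0.0220,
}


def out_of_range(summary, ranges):
    """List the measures whose value lies outside its (low, high) bound."""
    flagged = []
    for measure_name, value in summary.items():
        bound = ranges.get(measure_name.lower())
        if bound is None:
            continue  # no custom bound defined for this measure
        low, high = bound
        if not (low <= value <= high):
            flagged.append(measure_name)
    return flagged


flagged = out_of_range(knn_summary, example_ranges)
print(flagged)
```

Under these bounds only the balanced accuracy difference and the equal odds difference are flagged for KNN.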

In [47]:
# Restore Styler.set_precision (deprecated in pandas 1.3, removed in 2.0),
# which the styled fairness table below appears to rely on
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [48]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric | Measure | Value |
|--------|---------|-------|
| Group Fairness | AUC Difference | -0.0075 |
| Group Fairness | Balanced Accuracy Difference | -0.0608 |
| Group Fairness | Balanced Accuracy Ratio | 0.9327 |
| Group Fairness | Disparate Impact Ratio | 1.039 |
| Group Fairness | Equal Odds Difference | 0.1062 |
| Group Fairness | Equal Odds Ratio | 2.1333 |
| Group Fairness | Positive Predictive Parity Difference | -0.0792 |
| Group Fairness | Positive Predictive Parity Ratio | 0.915 |
| Group Fairness | Statistical Parity Difference | 0.022 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Gender Bias Detection Results for KNN Model 

The fairness evaluation of the **KNN model** with respect to **gender** reveals the following insights. With `priv_grp=1`, each difference is the unprivileged (female) value minus the privileged (male) value, and each ratio is female divided by male; this reading is consistent with the per-group rates reported later in this notebook.

### 1. Group Fairness  
- **AUC Difference (-0.0075)**: Very small, indicating that the model's ranking ability is nearly identical across genders.
- **Balanced Accuracy Difference (-0.0608)**: Balanced accuracy is about 6 percentage points lower for females than for males, outside the custom bound of ±0.02.
- **Balanced Accuracy Ratio (0.9327)**: Females reach about 93% of the balanced accuracy achieved for males.
- **Equal Odds Difference (0.1062)**: Error rates differ between genders. The value matches the gap in false positive rate (0.2000 for females vs. 0.0938 for males) and exceeds the custom bound of ±0.04.
- **Equal Odds Ratio (2.1333)**: The false positive rate for females is roughly 2.1 times that for males.
- **Disparate Impact Ratio (1.0390)**: Close to 1 and within the custom range (0.9 to 1.1), suggesting that the likelihood of receiving a positive prediction is fairly balanced across genders.
- **Positive Predictive Parity Difference (-0.0792)**: Precision is about 8 percentage points lower for females.
- **Positive Predictive Parity Ratio (0.9150)**: Below 1, confirming the lower precision for females.
- **Statistical Parity Difference (0.0220)**: Very small and within the custom bound of ±0.05, meaning overall prediction rates are nearly balanced across groups.

---

### **Summary**
- The **AUC**, **statistical parity**, and **disparate impact** metrics lie within the custom bounds and suggest near-fair performance.
- However, **balanced accuracy difference** (-0.0608) and **equal odds difference** (0.1062) fall outside the custom bounds (±0.02 and ±0.04), meaning the model's predictive errors are distributed unequally across genders.
- The imbalance in the test set (77% privileged class) may contribute to these disparities, and it also means the female-group estimates rest on far fewer observations.

---

In [49]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


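|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender | 0 | 0.0608 | 1.0722 | -0.1062 | 0.4688 | 0.0792 | 1.093 | -0.022 | 0.9625 | 0.0154 | 1.0174 |
| 1 | gender | 1 | -0.0608 | 0.9327 | 0.1062 | 2.1333 | -0.0792 | 0.915 | 0.022 | 1.039 | -0.0154 | 0.9829 |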


## Stratified Fairness Metrics by Gender

The table provides group-specific fairness metrics for the **KNN model**, separated by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

Each row treats its own feature value as the reference group: the row for `gender = 1` reports female minus male (or female divided by male), and the row for `gender = 0` reports the reverse. Two checks support this reading. The `gender = 1` row reproduces the `measure.summary` values obtained with `priv_grp=1`, and it matches the per-group rates reported later in this notebook (for example, an FPR of 0.2000 for females and 0.0938 for males gives 0.2000 - 0.0938 = 0.1062).

---

### 1. Balanced Accuracy
- **Row `gender = 0`: Difference = 0.0608, Ratio = 1.0722** (male relative to female)  
- **Row `gender = 1`: Difference = -0.0608, Ratio = 0.9327** (female relative to male)  
➡️ Balanced accuracy is **about 6 percentage points lower for females** than for males.

---

### 2. False Positive Rate (FPR)
- **Row `gender = 0`: FPR Diff = -0.1062, FPR Ratio = 0.4688** (male relative to female)  
- **Row `gender = 1`: FPR Diff = 0.1062, FPR Ratio = 2.1333** (female relative to male)  
➡️ **Females have a much higher false positive rate** (roughly 2.1 times the male rate), meaning they are more often incorrectly flagged as positive cases.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row `gender = 0`: PPV Diff = 0.0792, PPV Ratio = 1.093** (male relative to female)  
- **Row `gender = 1`: PPV Diff = -0.0792, PPV Ratio = 0.915** (female relative to male)  
➡️ Precision is **lower for females** by about 8 percentage points: a positive prediction is less likely to be correct for a female patient than for a male patient.

---

### 4. Selection Rate
- **Row `gender = 0`: Selection Diff = -0.022, Selection Ratio = 0.9625** (male relative to female)  
- **Row `gender = 1`: Selection Diff = 0.022, Selection Ratio = 1.0390** (female relative to male)  
➡️ Females receive positive predictions slightly more often than males.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row `gender = 0`: TPR Diff = 0.0154, TPR Ratio = 1.0174** (male relative to female)  
- **Row `gender = 1`: TPR Diff = -0.0154, TPR Ratio = 0.9829** (female relative to male)  
➡️ Sensitivity is very similar, with males ahead by about 1.5 percentage points.

---

### **Summary**
- **Females (unprivileged)**: have a **higher false positive rate**, **lower precision**, lower balanced accuracy, and a slightly lower TPR.  
- **Males (privileged)**: have **fewer false positives** and **higher precision**.  
- Overall, the disparity is concentrated in **specificity (FPR)** and precision, while sensitivity is nearly equal across genders.  

These gaps explain why **Equal Odds Difference (0.1062)** and **Equal Odds Ratio (2.1333)** stand out in the earlier evaluation: both coincide with the FPR difference and FPR ratio in the `gender = 1` row.
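To make the direction of each row in the stratified table concrete, the sketch below recomputes the FPR and TPR entries from the per-group KNN rates reported in this notebook's group-specific analysis. It assumes that each row uses its own feature value as the reference group, which is a reading inferred from the numbers rather than taken from the library documentation. The helper `row_for` is hypothetical.

```python
# Per-group KNN rates as reported in this notebook (rounded to four decimals).
knn_rates = {
    0: {"TPR": 0.8846, "FPR": 0.2000},  # female
    1: {"TPR": 0.9000, "FPR": 0.0938},  # male
}


def row_for(reference, other, metric):
    """Difference and ratio of `other` relative to `reference` for one metric."""
    diff = knn_rates[other][metric] - knn_rates[reference][metric]
    ratio = knn_rates[other][metric] / knn_rates[reference][metric]
    return round(diff, 4), round(ratio, 4)


# Row "gender = 1": male is the reference, so values are female relative to male.
print("gender=1 FPR:", row_for(reference=1, other=0, metric="FPR"))
print("gender=1 TPR:", row_for(reference=1, other=0, metric="TPR"))
# Row "gender = 0": female is the reference, so values are male relative to female.
print("gender=0 FPR:", row_for(reference=0, other=1, metric="FPR"))
```

The differences reproduce the table entries. The ratios agree to about two decimals only, because the inputs here are already rounded.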

---


In [50]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)


# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


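|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.57 | 0.89 | 0.9043 | 0.119 | 0.3706 | 0.9123 | 0.9342 | 0.8966 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.587 | 0.8478 | 0.8679 | 0.2 | 0.3725 | 0.8519 | 0.9279 | 0.8846 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.5649 | 0.9026 | 0.9153 | 0.0938 | 0.3695 | 0.931 | 0.9354 | 0.9 |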


## Stratified Performance Analysis (KNN by Gender)

This table reports the **stratified performance of the KNN model** across gender subgroups, with  
- **0 = Female**  
- **1 = Male**  

---

### 1. Overall Performance
- **Accuracy (0.890)** and **F1-score (0.9043)** indicate strong overall model performance.  
- **ROC AUC (0.9342)** confirms the model’s ability to distinguish between CVD and non-CVD cases across genders.  

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| Accuracy      | **0.8478** | **0.9026** | Accuracy is lower for females |
| F1-Score      | 0.8679     | 0.9153   | Worse for females |
| False Positive Rate (FPR) | **0.2000** | **0.0938** | FPR is about 2.1 times higher for females |
| Precision     | 0.8519     | 0.9310   | Predictions are less reliable for females |
| ROC AUC       | 0.9279     | 0.9354   | Similar across genders |
| True Positive Rate (TPR) | 0.8846     | 0.9000   | Slightly lower for females |
| Sample size (Obs.) | 46 (23%)   | 154 (77%) | Male-dominant test sample |

---

### 3. Interpretation
- **Female patients (minority in the test set)** are at a **systematic disadvantage**:  
  - Lower accuracy and F1-score.  
  - **The false positive rate is about twice that of males** (20% vs. 9.4%).  
  - Precision is noticeably lower (0.8519 vs. 0.9310): when the model predicts CVD for a female patient, it is less often correct.  
  - With only 46 female patients in the test set, these subgroup estimates are less precise than those for the 154 male patients.  

- **Male patients** benefit from higher accuracy, F1, and precision.  

- Despite these disparities, **ROC AUC values are similar**, showing that the model distinguishes positive and negative cases fairly consistently, but **error distribution is biased against females**.  

---

### **Summary**
The KNN model, although strong overall, is **less fair to female patients**. It tends to over-predict CVD for women, leading to more false positives, while men receive more accurate and reliable predictions. 
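Because the female subgroup is small, it helps to see how many cases stand behind each rate. The sketch below reconstructs approximate confusion counts from the rounded values in the stratified performance table (`Obs.`, `Mean Target`, `TPR`, `FPR`) and checks them against the reported precision. The counts are a reconstruction under rounding, not values exported by the notebook, and `confusion_counts` is a hypothetical helper.

```python
# Values copied from the stratified KNN performance table.
knn_groups = {
    "female": {"n": 46, "mean_target": 0.5652, "tpr": 0.8846, "fpr": 0.2000},
    "male": {"n": 154, "mean_target": 0.5844, "tpr": 0.9000, "fpr": 0.0938},
}


def confusion_counts(n, mean_target, tpr, fpr):
    """Recover approximate TP/FN/FP/TN counts from rounded rates."""
    positives = round(n * mean_target)
    negatives = n - positives
    tp = round(tpr * positives)
    fp = round(fpr * negatives)
    return {"TP": tp, "FN": positives - tp, "FP": fp, "TN": negatives - fp}


reconstructed = {}
for name, g in knn_groups.items():
    c = confusion_counts(**g)
    reconstructed[name] = c
    precision = c["TP"] / (c["TP"] + c["FP"])
    print(name, c, f"precision={precision:.4f}")
```

Under this reconstruction the female FPR of 0.2000 corresponds to 4 false positives among 20 negatives, so a single additional false positive would move it by 5 percentage points. The recomputed precision values agree with the table, which supports the reconstruction.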

---

In [51]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8846
  False Positive Rate (FPR): 0.2000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9000
  False Positive Rate (FPR): 0.0938
----------------------------------------


### Group-Specific Error Analysis

To further examine fairness at the subgroup level, we compared the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.

#### Results by Gender Group

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.8846  | 0.2000  |
| Privileged (Male)      | 0.9000  | 0.0938  |

#### Interpretation

- The **privileged group (male)** shows a slightly higher **TPR (90.00%)** than the unprivileged group (88.46%), meaning the model is marginally better at correctly identifying positives for males.  
- However, the **FPR is more than twice as high for females (20.00%) compared to males (9.38%)**, indicating that women are more likely to be incorrectly flagged as having CVD.  
- This imbalance suggests that, while overall sensitivity is fairly similar, the **specificity of the model disproportionately disadvantages females**, as they suffer from a higher rate of false positives.  
- These subgroup disparities align with the fairness metrics, where the **Equal Odds Difference and Ratio highlight unequal error distributions** between genders.  
- In summary, the results reveal a **systematic disadvantage for the unprivileged group (females)**, underscoring the importance of applying **fairness mitigation techniques** to reduce bias in error rates.
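The helper in the cell above reads `y_test` and `y_pred_knn` from the enclosing scope, so it has to be redefined for every model. A variant that takes its inputs explicitly can be reused as is. The sketch below is plain Python on toy data; `group_rates` is a hypothetical name and the arrays are illustrative, not taken from the study.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: {"TPR": ..., "FPR": ...}} for binary labels and predictions."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"TP": 0, "FN": 0, "FP": 0, "TN": 0})
        if truth == 1:
            c["TP" if pred == 1 else "FN"] += 1
        else:
            c["FP" if pred == 1 else "TN"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["TP"] + c["FN"]
        negatives = c["FP"] + c["TN"]
        rates[group] = {
            # A rate is undefined when its denominator class is absent.
            "TPR": c["TP"] / positives if positives else float("nan"),
            "FPR": c["FP"] / negatives if negatives else float("nan"),
        }
    return rates


# Toy data for illustration only.
toy_true = [1, 1, 0, 0, 1, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 1]
toy_gender = [0, 0, 0, 0, 1, 1, 1, 1]
toy_rates = group_rates(toy_true, toy_pred, toy_gender)
print(toy_rates)
```

In the notebook, the same call could be made with `y_true_knn`, `y_pred_knn`, and `gender_knn` after converting them to lists or iterating the arrays directly.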

---


### Decision Tree - DT

In [52]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("MendeleyData_75F25M_DT_classweightedtuned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred_dt  y_prob_dt
0       0       0          0   0.000000
1       1       0          0   0.000000
2       1       1          1   0.960486
3       1       1          1   0.960486
4       1       0          0   0.000000


In [53]:
# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob_dt"].values
y_pred_dt = dt_df["y_pred_dt"].values
gender_dt = dt_df["gender"].values


# Use gender_dt as the protected attribute (0/1, as encoded in the predictions CSV)
protected_attr_dt = gender_dt


In [None]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance=True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.1194
               Balanced Accuracy Difference           -0.0404
               Balanced Accuracy Ratio                 0.9552
               Disparate Impact Ratio                  0.9972
               Equal Odds Difference                   0.0594
               Equal Odds Ratio                        1.4222
               Positive Predictive Parity Difference  -0.0471
               Positive Predictive Parity Ratio        0.9479
               Statistical Parity Difference          -0.0017
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [55]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender) — Custom Clinical Bounds",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric | Measure | Value |
|--------|---------|-------|
| Group Fairness | AUC Difference | -0.1194 |
| Group Fairness | Balanced Accuracy Difference | -0.0404 |
| Group Fairness | Balanced Accuracy Ratio | 0.9552 |
| Group Fairness | Disparate Impact Ratio | 0.9972 |
| Group Fairness | Equal Odds Difference | 0.0594 |
| Group Fairness | Equal Odds Ratio | 1.4222 |
| Group Fairness | Positive Predictive Parity Difference | -0.0471 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9479 |
| Group Fairness | Statistical Parity Difference | -0.0017 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Metrics Interpretation for Decision Tree 

The fairness evaluation of the **Decision Tree model** with respect to gender provides the following insights. With `priv_grp=1`, differences are read as female minus male and ratios as female divided by male.

### 1. Group Fairness
- **AUC Difference (-0.1194)**: A marked disparity in ranking quality: ROC AUC is about 12 percentage points lower for females (0.8365 vs. 0.9559 in the stratified performance table), far outside the custom bound of ±0.02.
- **Balanced Accuracy Difference (-0.0404)**: Balanced accuracy is about 4 percentage points lower for females.
- **Balanced Accuracy Ratio (0.9552)**: Below the ideal value of 1, confirming the lower balanced accuracy for females.
- **Equal Odds Difference (0.0594)**: Error rates differ between genders. The value matches the gap in false positive rate (0.2000 vs. 0.1406); it is smaller than for KNN but still outside the custom bound of ±0.04.
- **Equal Odds Ratio (1.4222)**: The false positive rate for females is about 1.4 times that for males.
- **Disparate Impact Ratio (0.9972)**: Very close to 1, meaning the overall likelihood of receiving a positive prediction is almost equal between males and females.
- **Positive Predictive Parity Difference (-0.0471)**: Precision is about 5 percentage points lower for females.
- **Positive Predictive Parity Ratio (0.9479)**: Less than 1, confirming the lower precision for females.
- **Statistical Parity Difference (-0.0017)**: Close to zero, meaning overall prediction rates are nearly identical across genders.

---

### **Summary**
The Decision Tree model shows a **mixed fairness profile** across genders:  
- The largest gap is in **AUC** (-0.1194), with a smaller gap in **balanced accuracy** (-0.0404); both disadvantage females and fall outside the custom bounds.  
- **Equal Odds metrics** point to an imbalance in error rates, though less pronounced than in KNN (0.0594 vs. 0.1062).  
- At the same time, **statistical parity and disparate impact are near ideal**, indicating that the overall distribution of predictions is balanced across genders.  

**Overall**, the Decision Tree is *closer to statistical parity than KNN* and has a smaller equal odds gap, but its AUC gap is far larger (-0.1194 vs. -0.0075). Ranking ability, predictive precision, and the false positive rate are all worse for the unprivileged group (females).


In [56]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


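|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender | 0 | 0.0404 | 1.0469 | -0.0594 | 0.7031 | 0.0471 | 1.055 | 0.0017 | 1.0028 | 0.0214 | 1.0231 |
| 1 | gender | 1 | -0.0404 | 0.9552 | 0.0594 | 1.4222 | -0.0471 | 0.9479 | -0.0017 | 0.9972 | -0.0214 | 0.9774 |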


## Stratified Fairness Metrics by Gender – Decision Tree

The table provides subgroup-specific fairness metrics for the **Decision Tree model**, separated by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

Each row treats its own feature value as the reference group: the row for `gender = 1` reports female minus male (or female divided by male), and the row for `gender = 0` reports the reverse. The `gender = 1` row therefore reproduces the `measure.summary` values obtained with `priv_grp=1`, and it agrees with the per-group rates reported later (for example, an FPR of 0.2000 for females and 0.1406 for males gives 0.2000 - 0.1406 = 0.0594).

---

### 1. Balanced Accuracy
- **Row `gender = 0`: +0.0404 (Ratio = 1.0469)** → male relative to female.  
- **Row `gender = 1`: -0.0404 (Ratio = 0.9552)** → female relative to male.  
➡️ Balanced accuracy is **about 4 percentage points lower for females**.

---

### 2. False Positive Rate (FPR)
- **Row `gender = 0`: FPR Diff = -0.0594, FPR Ratio = 0.7031** (male relative to female)  
- **Row `gender = 1`: FPR Diff = +0.0594, FPR Ratio = 1.4222** (female relative to male)  
➡️ Females have a **higher false positive rate** than males, though the disparity is smaller than observed in the KNN model.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row `gender = 0`: PPV Diff = +0.0471, PPV Ratio = 1.0550** (male relative to female)  
- **Row `gender = 1`: PPV Diff = -0.0471, PPV Ratio = 0.9479** (female relative to male)  
➡️ Precision is **slightly lower for females**, meaning positive predictions for them are somewhat less reliable.

---

### 4. Selection Rate
- **Row `gender = 0`: Selection Ratio = 1.0028** (male relative to female)  
- **Row `gender = 1`: Selection Ratio = 0.9972** (female relative to male)  
➡️ The model selects females and males for positive predictions at almost identical rates.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row `gender = 0`: TPR Diff = +0.0214, TPR Ratio = 1.0231** (male relative to female)  
- **Row `gender = 1`: TPR Diff = -0.0214, TPR Ratio = 0.9774** (female relative to male)  
➡️ Sensitivity is slightly lower for females, meaning they are marginally less likely to be correctly identified as positive cases.

---

### **Summary**
- **Females (unprivileged)**: show slightly **lower balanced accuracy, precision, and sensitivity**, and a **higher false positive rate** than males.  
- **Males (privileged)**: benefit from fewer false positives and somewhat higher precision and sensitivity.  
- Overall, the gaps in FPR, precision, and balanced accuracy are **smaller than in KNN**, but every one of them still points against females.

---

In [None]:
# Get the stratified performance table for DT
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


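|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.61 | 0.9 | 0.916 | 0.1548 | — | 0.8934 | 0.9276 | 0.9397 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.6087 | 0.8696 | 0.8889 | 0.2 | — | 0.8571 | 0.8365 | 0.9231 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.6104 | 0.9091 | 0.9239 | 0.1406 | 0.3924 | 0.9043 | 0.9559 | 0.9444 |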


## Stratified Performance Analysis – Decision Tree by Gender

This table presents the **stratified performance metrics** of the Decision Tree model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9000)** and **F1-score (0.9160)** indicate strong general performance.  
- **ROC AUC (0.9276)** shows good discriminatory power.  
- **Precision (0.8934)** and **TPR (0.9397)** confirm a good balance between sensitivity and predictive reliability.  
- **Note**: **PR AUC is reported as NaN** (shown as a dash) for the full test set and for the female subgroup. Because the full test set of 200 observations is affected as well, subgroup sample size alone cannot explain the missing values; the cause is not established in this notebook. The other metrics (Accuracy, F1, ROC AUC, etc.) are reported for all groups and remain comparable.

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.8696     | 0.9091   | Accuracy is lower for females. |
| **F1-Score**  | 0.8889     | 0.9239   | Model performs better for males. |
| **FPR**       | 0.2000     | 0.1406   | Females experience more false positives. |
| **Precision** | 0.8571     | 0.9043   | Predictions are more reliable for males. |
| **ROC AUC**   | 0.8365     | 0.9559   | Substantial gap; males benefit from much stronger ranking performance. |
| **TPR**       | 0.9231     | 0.9444   | Slightly higher sensitivity for males. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Lower accuracy, F1-score, and precision compared to males.  
  - **Much lower ROC AUC (0.8365 vs. 0.9559)** → the model discriminates less effectively between positive and negative cases for females.  
  - Higher false positive rate (20% vs. 14%), meaning more healthy females are incorrectly predicted as having CVD.  

- **Males (privileged)**:  
  - Benefit from higher accuracy, F1, and precision.  
  - Stronger ROC AUC and lower FPR → predictions are both more accurate and more reliable.  
  - Slightly higher sensitivity (TPR), showing males are marginally more likely to be correctly identified when they have CVD.  

---

### **Summary**
The Decision Tree model demonstrates **systematic disadvantages for females**. They face **more false positives**, **weaker precision**, and **much poorer discriminatory power (ROC AUC)** compared to males.  
Although sensitivity (TPR) is relatively balanced, the disparities in precision and AUC highlight a fairness issue where **males consistently receive more favorable predictive performance**.
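The ROC AUC gap discussed above can be made tangible with the pairwise definition of AUC: the share of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. Ties are common for decision trees, which output only a few distinct leaf probabilities. The sketch below uses plain Python and toy data; `pairwise_auc` and `auc_by_group` are hypothetical helpers, not FairMLHealth functions, and the quadratic loop is meant for small samples only.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair 0.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(y_true, scores, groups):
    """Compute pairwise_auc separately for each distinct group value."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, value in enumerate(groups) if value == g]
        result[g] = pairwise_auc([y_true[i] for i in idx],
                                 [scores[i] for i in idx])
    return result


# Toy data for illustration only (not taken from the study).
toy_true = [1, 0, 1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.9, 0.8, 0.1, 0.7, 0.2, 0.6, 0.3]
toy_gender = [0, 0, 0, 0, 1, 1, 1, 1]
toy_auc = auc_by_group(toy_true, toy_scores, toy_gender)
print(toy_auc)
```

In the toy data, group 0 contains one tied pair and one reversed pair, which is what pulls its AUC below that of group 1.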

In [58]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.9231
  False Positive Rate (FPR): 0.2000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9444
  False Positive Rate (FPR): 0.1406
----------------------------------------


### Group-Specific Error Analysis – Decision Tree

This section analyzes the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 0.9231  | 0.2000  |
| Privileged (male = 1)        | 0.9444  | 0.1406  |

#### Interpretation

- **True Positive Rate (TPR)** is high and comparable across genders:  
  - Females (unprivileged): **92.31%**  
  - Males (privileged): **94.44%**  
  - This indicates that the model identifies positive CVD cases with similar effectiveness for both groups.  

- **False Positive Rate (FPR)** reveals more variation:  
  - Females: **20.00%**  
  - Males: **14.06%**  
  - Females are therefore **more frequently misclassified as having CVD** when they do not, which reflects a disadvantage for the unprivileged group.  

#### Implications

- The **Decision Tree achieves balanced sensitivity (TPR)** across genders, but the **higher FPR for females** suggests a bias in specificity.  
- The FPR gap (0.2000 - 0.1406 = 0.0594) and the FPR ratio (0.2000 / 0.1406 ≈ 1.42) correspond to the **Equal Odds Difference (0.0594)** and **Equal Odds Ratio (1.4222)** observed earlier.  
- In summary, while the Decision Tree performs consistently in detecting true positives across genders, it still places a **systematic burden on females** by producing more false positives in this group.
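For both KNN and the Decision Tree, the Equal Odds Difference and Ratio in this notebook coincide with whichever of the TPR or FPR gaps deviates most from parity. The sketch below encodes that reading as an explicit assumption; it is inferred from the reported numbers, not taken from the FairMLHealth source. The function `equal_odds_summary` is hypothetical.

```python
def equal_odds_summary(unpriv, priv):
    """Pick the TPR or FPR gap that deviates most (assumed rule, see text).

    Assumes the privileged-group rates are nonzero so the ratios are defined.
    """
    diffs = {m: unpriv[m] - priv[m] for m in ("TPR", "FPR")}
    ratios = {m: unpriv[m] / priv[m] for m in ("TPR", "FPR")}
    worst_diff = max(diffs.values(), key=abs)
    worst_ratio = max(ratios.values(), key=lambda r: abs(r - 1))
    return round(worst_diff, 4), round(worst_ratio, 4)


# Decision Tree rates from the group-specific output above (rounded values).
dt_female = {"TPR": 0.9231, "FPR": 0.2000}
dt_male = {"TPR": 0.9444, "FPR": 0.1406}
dt_diff, dt_ratio = equal_odds_summary(unpriv=dt_female, priv=dt_male)
print(dt_diff, dt_ratio)
```

The difference reproduces the reported 0.0594. The ratio lands within rounding distance of the reported 1.4222, because the input rates are already rounded to four decimals.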

---

### Ensemble Model - Random Forest - RF

In [59]:
rf_df = pd.read_csv("MendeleyData_75F25M_RF_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred_rf  y_prob
0       0       0          0    0.23
1       1       0          0    0.14
2       1       1          1    0.96
3       1       1          1    0.81
4       1       0          0    0.02


In [60]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred_rf"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender_rf as the protected attribute (0/1, as encoded in the predictions CSV)
protected_attr_rf = gender_rf

In [73]:
# Random Forest Gender Bias Report
print("\n Random Forest Gender Bias Report")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance=True
)

print(rf_bias)


 Random Forest Gender Bias Report
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0201
               Balanced Accuracy Difference            0.0641
               Balanced Accuracy Ratio                 1.0703
               Disparate Impact Ratio                  1.0511
               Equal Odds Difference                   0.1000
               Equal Odds Ratio                        0.6400
               Positive Predictive Parity Difference   0.0211
               Positive Predictive Parity Ratio        1.0224
               Statistical Parity Difference           0.0285
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [62]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric | Measure | Value |
|--------|---------|-------|
| Group Fairness | AUC Difference | 0.0201 |
| Group Fairness | Balanced Accuracy Difference | 0.0641 |
| Group Fairness | Balanced Accuracy Ratio | 1.0703 |
| Group Fairness | Disparate Impact Ratio | 1.0511 |
| Group Fairness | Equal Odds Difference | 0.1 |
| Group Fairness | Equal Odds Ratio | 0.64 |
| Group Fairness | Positive Predictive Parity Difference | 0.0211 |
| Group Fairness | Positive Predictive Parity Ratio | 1.0224 |
| Group Fairness | Statistical Parity Difference | 0.0285 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Metrics Interpretation for Random Forest (Gender)

The fairness evaluation of the **Random Forest model** with respect to gender provides the following insights:

---

### 1. Group Fairness
- **AUC Difference (0.0201)**: Very small; the model’s ranking ability is almost equal across genders.  
- **Balanced Accuracy Difference (0.0641)**: A moderate gap. With `priv_grp=1`, the positive sign is consistent with a difference computed as unprivileged minus privileged, i.e. balanced accuracy is higher for females.  
- **Balanced Accuracy Ratio (1.0703)**: Slightly above 1; the same modest imbalance expressed as a ratio.  
- **Disparate Impact Ratio (1.0511)**: Close to 1, meaning that the overall rate of positive predictions is fairly balanced across groups.  
- **Equal Odds Difference (0.1000)**: The largest gap in the table. It equals the TPR gap in the stratified results (1.0000 for females vs. 0.9000 for males).  
- **Equal Odds Ratio (0.6400)**: Well below 1. It equals the FPR ratio in the stratified results (0.0500 / 0.0781), so females receive proportionally fewer false positives than males.  
- **Positive Predictive Parity Difference (0.0211)**: Very small, indicating precision is nearly equal across genders.  
- **Positive Predictive Parity Ratio (1.0224)**: Close to 1, confirming similar reliability of positive predictions.  
- **Statistical Parity Difference (0.0285)**: Small but positive; females receive slightly more positive predictions than males (selection rate 0.5870 vs. 0.5584).  

---

### **Summary**
The Random Forest model demonstrates **good overall fairness** compared to the KNN and Decision Tree models:  
- **Ranking ability (AUC)** and **precision (PPV)** are nearly identical across genders.  
- **Positive prediction distribution** (statistical parity and disparate impact) is also well balanced.  
- However, **Equal Odds metrics (difference = 0.1000, ratio = 0.64)** reveal disparities in error rates. According to the stratified results, the group more affected is males, who have lower sensitivity (TPR 0.9000 vs. 1.0000) and a higher false positive rate (0.0781 vs. 0.0500).  

**Overall**, the Random Forest performs consistently across genders in most fairness metrics but still shows **residual bias in error distribution**, indicating room for improvement in ensuring equal treatment.
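
The direction of each gap can be checked by recomputing the summary values from the per-group rates in the stratified performance table. The sketch below assumes (this is not confirmed from the library source) that differences are unprivileged minus privileged and ratios are unprivileged over privileged. The dictionary `rates` and the helper `gap` are illustrative names, and the inputs are the rounded table values, so the results agree with the summary only up to rounding.

```python
# Per-group rates copied from the stratified performance table (rounded values).
rates = {
    "female": {"tpr": 1.0000, "fpr": 0.0500, "ppv": 0.9630, "selection": 0.5870},
    "male":   {"tpr": 0.9000, "fpr": 0.0781, "ppv": 0.9419, "selection": 0.5584},
}

def gap(metric, unpriv="female", priv="male"):
    """Return (difference, ratio) of a rate, unprivileged relative to privileged."""
    u, p = rates[unpriv][metric], rates[priv][metric]
    return round(u - p, 4), round(u / p, 4)

tpr_diff, tpr_ratio = gap("tpr")        # compare with Equal Odds Difference
fpr_diff, fpr_ratio = gap("fpr")        # compare with Equal Odds Ratio
ppv_diff, ppv_ratio = gap("ppv")        # compare with Positive Predictive Parity
sel_diff, sel_ratio = gap("selection")  # compare with Statistical Parity / Disparate Impact

for name, (d, r) in {"TPR": (tpr_diff, tpr_ratio), "FPR": (fpr_diff, fpr_ratio),
                     "PPV": (ppv_diff, ppv_ratio), "Selection": (sel_diff, sel_ratio)}.items():
    print(f"{name:<10} diff={d:+.4f}  ratio={r:.4f}")
```

If the assumed convention holds, the TPR difference should line up with the Equal Odds Difference and the FPR ratio with the Equal Odds Ratio reported above.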

---


In [63]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gender | 0 | -0.0641 | 0.9343 | 0.0281 | 1.5625 | -0.0211 | 0.9781 | -0.0285 | 0.9514 | -0.1 | 0.9 |
| 1 | gender | 1 | 0.0641 | 1.0703 | -0.0281 | 0.64 | 0.0211 | 1.0224 | 0.0285 | 1.0511 | 0.1 | 1.1111 |


## Stratified Fairness Metrics by Gender – Random Forest

The table presents subgroup-specific fairness metrics for the **Random Forest model**, separated by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

**Reading the signs**: the values are consistent with each row using the listed feature value as the reference group and reporting the other group relative to it (difference = other - reference, ratio = other / reference). For example, the stratified performance table gives TPR = 1.0000 for females and 0.9000 for males, and the gender = 0 row shows TPR Diff = -0.1000 and TPR Ratio = 0.9000. The gender = 1 row reproduces the `measure.summary` values obtained with `priv_grp=1`.

---

### 1. Balanced Accuracy
- **Row gender = 0 (female reference): Diff = -0.0641, Ratio = 0.9343** → Balanced accuracy for males is lower than for females.  
- **Row gender = 1 (male reference): Diff = +0.0641, Ratio = 1.0703** → The same gap seen from the male side: females score higher.  
➡️ The model performs more reliably for females.

---

### 2. False Positive Rate (FPR)
- **Row gender = 0: FPR Diff = 0.0281, Ratio = 1.5625** → The male FPR exceeds the female FPR (0.0781 vs. 0.0500).  
- **Row gender = 1: FPR Diff = -0.0281, Ratio = 0.6400** → Females experience fewer false positives than males.  
➡️ This indicates a **specificity disadvantage for males**.

---

### 3. Positive Predictive Value (PPV / Precision)
- **Row gender = 0: PPV Diff = -0.0211, Ratio = 0.9781** → Precision for males is slightly lower than for females.  
- **Row gender = 1: PPV Diff = +0.0211, Ratio = 1.0224** → Precision for females is slightly higher (0.9630 vs. 0.9419).  
➡️ Positive predictions are **slightly more reliable for females**.

---

### 4. Selection Rate
- **Row gender = 0: Selection Ratio = 0.9514**  
- **Row gender = 1: Selection Ratio = 1.0511**  
➡️ Females are selected for positive predictions slightly more often (0.5870 vs. 0.5584).

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Row gender = 0: TPR Diff = -0.1000, Ratio = 0.9000** → Sensitivity for males is lower than for females.  
- **Row gender = 1: TPR Diff = +0.1000, Ratio = 1.1111** → Sensitivity for females is higher (1.0000 vs. 0.9000).  
➡️ The model is **better at correctly identifying positive cases for females**.

---

### **Summary**
- **Females (unprivileged)**: Favored on every metric in this table, with **higher balanced accuracy, fewer false positives, slightly higher precision, and higher sensitivity**.  
- **Males (privileged)**: Show **lower balanced accuracy, more false positives, slightly lower precision, and lower sensitivity**.  

Overall, the Random Forest shows a **consistent gap favoring females**, the group labeled unprivileged. The differences are not extreme, and the female subgroup contains only 46 of the 200 test observations, so the gaps rest on few cases. The **consistent direction across key metrics** still indicates a fairness concern worth examining before deployment.
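
One way to sanity-check how the signs in a stratified bias table should be read is to rebuild its rows from per-group rates. The sketch below assumes a "reference vs. other" convention (difference = other - reference, ratio = other / reference); that convention is inferred from the numbers in this notebook, not taken from the FairMLHealth source. `perf` and `reference_vs_other` are illustrative names, and the inputs are the rounded rates from the stratified performance table.

```python
import pandas as pd

# Per-group rates copied from the stratified performance table (rounded values).
perf = pd.DataFrame(
    {"TPR": [1.0000, 0.9000], "FPR": [0.0500, 0.0781], "Precision": [0.9630, 0.9419]},
    index=pd.Index([0, 1], name="gender"),
)

def reference_vs_other(perf, reference):
    """Compare the other group against `reference`: other - ref and other / ref."""
    other = perf.index.difference([reference])[0]
    ref_row, other_row = perf.loc[reference], perf.loc[other]
    return pd.DataFrame(
        {"Diff": other_row - ref_row, "Ratio": other_row / ref_row}
    ).round(4)

row_female_ref = reference_vs_other(perf, reference=0)  # compare with the gender = 0 row
row_male_ref = reference_vs_other(perf, reference=1)    # compare with the gender = 1 row
print(row_female_ref)
print(row_male_ref)
```

If the recomputed values match a row of the bias table up to rounding, a negative difference in that row means the reference group has the higher rate.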

---

In [64]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.565 | 0.925 | 0.9345 | 0.0714 | — | 0.9469 | 0.9852 | 0.9224 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.587 | 0.9783 | 0.9811 | 0.05 | 0.4348 | 0.963 | 1.0 | 1.0 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.5584 | 0.9091 | 0.9205 | 0.0781 | — | 0.9419 | 0.9799 | 0.9 |


## Stratified Performance Analysis – Random Forest by Gender

This table presents the **stratified performance metrics** of the Random Forest model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9250)** and **F1-score (0.9345)** demonstrate strong overall performance.  
- **ROC AUC (0.9852)** indicates excellent discriminatory power.  
- **Precision (0.9469)** and **TPR (0.9224)** confirm the model balances sensitivity with predictive reliability.  
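- **Note**: PR AUC is shown as “—” (NaN replaced by the `fillna` call) for the full test set and for males, and the library emitted a “Possible error in column(s)” warning while building the table. Because the value is missing even for all 200 observations, subgroup size alone does not explain it; the cause is not established here. The one reported value (0.4348 for females) is hard to reconcile with a precision of 0.9630 and a ROC AUC of 1.0000, so the PR AUC column is not interpreted in this analysis. The remaining metrics are computed independently of it.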
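
PR AUC can be cross-checked outside FairMLHealth. The sketch below is an illustration under stated assumptions: it uses synthetic arrays rather than the notebook's data, and scikit-learn's `average_precision_score`, which is a step-wise estimate of the area under the precision-recall curve and may differ from whatever FairMLHealth computes internally. `subgroup_pr_auc` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the notebook's arrays (not the real data).
gender = rng.integers(0, 2, size=200)
y_true = rng.integers(0, 2, size=200)
# Scores that overlap between classes, so the metric is informative but not perfect.
y_prob = 0.3 * y_true + 0.7 * rng.random(200)

def subgroup_pr_auc(y_true, y_prob, mask):
    """Average precision for one subgroup, or NaN if it has no positive cases."""
    if y_true[mask].sum() == 0:
        return float("nan")
    return average_precision_score(y_true[mask], y_prob[mask])

pr_auc = {g: subgroup_pr_auc(y_true, y_prob, gender == g) for g in (0, 1)}
for g, value in pr_auc.items():
    print(f"gender={g}: average precision = {value:.4f}")
```

Applied to the real `y_test`, `y_prob_rf`, and gender arrays, a check like this would show whether a finite PR AUC can be obtained for each group.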

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9783     | 0.9091   | Accuracy is higher for females. |
| **F1-Score**  | 0.9811     | 0.9205   | Stronger F1 performance for females. |
| **FPR**       | 0.0500     | 0.0781   | Females experience fewer false positives. |
| **Precision** | 0.9630     | 0.9419   | Predictions are more reliable for females. |
| **ROC AUC**   | 1.0000     | 0.9799   | Females achieve near-perfect discrimination. |
| **TPR**       | 1.0000     | 0.9000   | Females are perfectly identified when they have CVD. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Benefit from **higher accuracy, F1, precision, and sensitivity (TPR)** compared to males.  
  - **ROC AUC = 1.0000**, suggesting near-perfect discrimination between positive and negative cases.  
  - Lower false positive rate (5% vs. 7.8%), meaning fewer healthy females are misclassified.  

- **Males (privileged)**:  
  - Performance remains strong, but consistently **lower than for females** across most metrics.  
  - Higher false positive rate and slightly weaker sensitivity, indicating less reliable predictions compared to females.  

---

### **Summary**
The Random Forest model appears to perform **better for females** across nearly all metrics.  
While overall performance is excellent for both groups, females receive systematically **more favorable outcomes** (higher F1, higher precision, lower FPR, and perfect TPR/ROC AUC), suggesting that this model may be biased **in favor of females** rather than against them. The female subgroup has only 46 observations, so the perfect TPR and ROC AUC should be read with caution.

---

In [65]:
from fairmlhealth import performance_metrics as pm

# Boolean masks selecting each gender group
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask = (protected_attr_rf == 1)    # male = privileged group

def evaluate_group_performance(group_name, mask, y_true=y_test, y_pred=y_pred_rf):
    """Print TPR and FPR for the rows selected by the boolean mask."""
    tpr = pm.true_positive_rate(y_true[mask], y_pred[mask])
    fpr = pm.false_positive_rate(y_true[mask], y_pred[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 1.0000
  False Positive Rate (FPR): 0.0500
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9000
  False Positive Rate (FPR): 0.0781
----------------------------------------


### Group-Specific Error Analysis – Random Forest

This section presents the performance of the Random Forest model across gender groups, focusing on **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 1.0000  | 0.0500  |
| Privileged (male = 1)        | 0.9000  | 0.0781  |

#### Interpretation

- **True Positive Rate (TPR)** is **perfect for females (100%)**, compared to **90.00% for males**.  
  - This means the model successfully identifies all true positive cases among females, but misses a small proportion of cases among males.

- **False Positive Rate (FPR)** is **lower for females (5.00%)** than for males (7.81%).  
  - This indicates that males are more likely to receive **false alarms** compared to females.

#### Implications

- The model exhibits a **gender-based disparity**:  
  - It is both **more sensitive** (higher TPR) and **more specific** (lower FPR) for females,  
  - while males face slightly worse outcomes on both measures.

- This asymmetry suggests that the Random Forest model may be **biased in favor of females**. In practice, this means females are almost always correctly detected when they have CVD and face fewer false positives, whereas males are at a **double disadvantage** (slightly higher missed cases and more false alarms).

- Depending on the clinical use case, these imbalances could have **important consequences**: males may experience both a higher risk of underdiagnosis and more unnecessary follow-ups, raising fairness concerns.
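
As an independent cross-check of `pm.true_positive_rate` and `pm.false_positive_rate`, group-wise TPR and FPR can be computed directly from confusion counts. This is a minimal sketch with NumPy on toy arrays (hypothetical data, not the notebook's predictions); `tpr_fpr` is an illustrative helper.

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels coded as 0/1."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Toy arrays (hypothetical, not the notebook's data).
y_true = np.array([1, 1, 1, 0, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])
gender = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

for value, label in [(0, "Female"), (1, "Male")]:
    mask = gender == value
    tpr, fpr = tpr_fpr(y_true[mask], y_pred[mask])
    print(f"{label}: TPR={tpr:.4f}, FPR={fpr:.4f}")
```

The helper flattens its inputs, so it should also accept a single-column DataFrame converted with `.values`.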

---

### Deep Learning Model - Feed Forward Network (MLP)

In [66]:
mlp_df = pd.read_csv("MendeleyData_75F25M_MLP_lbfgs_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred_lbfgs  y_prob_lbfgs
0       0       0             0      0.012395
1       1       0             0      0.000004
2       1       1             1      1.000000
3       1       1             1      0.999969
4       1       0             0      0.000014


In [67]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob_lbfgs"].values
y_pred_mlp = mlp_df["y_pred_lbfgs"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 

In [68]:
#Run fairmlhealth bias detection for MLP 

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,           # skip individual fairness measures
    skip_performance=True   # skip model performance measures
)

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0123
               Balanced Accuracy Difference            0.0084
               Balanced Accuracy Ratio                 1.0093
               Disparate Impact Ratio                  1.0005
               Equal Odds Difference                   0.0231
               Equal Odds Ratio                        1.0667
               Positive Predictive Parity Difference  -0.0080
               Positive Predictive Parity Ratio        0.9915
               Statistical Parity Difference           0.0003
Data Metrics   Prevalence of Privileged Class (%)     77.0000


In [69]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

| Metric | Measure | Value |
|---|---|---|
| Group Fairness | AUC Difference | -0.0123 |
| Group Fairness | Balanced Accuracy Difference | 0.0084 |
| Group Fairness | Balanced Accuracy Ratio | 1.0093 |
| Group Fairness | Disparate Impact Ratio | 1.0005 |
| Group Fairness | Equal Odds Difference | 0.0231 |
| Group Fairness | Equal Odds Ratio | 1.0667 |
| Group Fairness | Positive Predictive Parity Difference | -0.008 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9915 |
| Group Fairness | Statistical Parity Difference | 0.0003 |
| Data Metrics | Prevalence of Privileged Class (%) | 77.0 |


## Fairness Evaluation – MLP by Gender

---

### 1. Group Fairness Metrics

- **AUC Difference (-0.0123)**: ROC AUC is slightly higher for males (0.9700) than for females (0.9577), but the difference is minimal (< 0.02).  
- **Balanced Accuracy Difference (0.0084)** and **Ratio (1.0093)**: Balanced accuracy is almost identical across genders. With `priv_grp=1`, the positive difference is consistent with a marginally higher value for females.  
- **Disparate Impact Ratio (1.0005)**: Essentially perfect parity in selection rates between males and females.  
- **Equal Odds Difference (0.0231)** and **Ratio (1.0667)**: Small disparities in error rates. The difference matches the TPR gap in the stratified results (0.9231 for females vs. 0.9000 for males) and the ratio matches the FPR ratio (0.1000 / 0.0938), so females have slightly higher sensitivity but also slightly more false positives.  
- **Positive Predictive Parity Difference (-0.0080)** and **Ratio (0.9915)**: Precision is very similar across genders and marginally higher for males (0.9310 vs. 0.9231).  
- **Statistical Parity Difference (0.0003)**: Selection rates are almost exactly equal for males and females.  

---

### 2. Interpretation

- The MLP shows **very small fairness disparities** across all metrics.  
- Differences in **AUC, balanced accuracy, and predictive parity** are negligible, suggesting the model performs similarly for both genders.  
- **Equal odds difference (0.0231)** is the largest of the difference metrics, meaning error rates differ slightly but remain within modest bounds.  
- The **disparate impact ratio (1.0005)** and **statistical parity difference (0.0003)** confirm that males and females are selected at nearly identical rates.  

---

### **Summary**

The MLP model demonstrates **high fairness with respect to gender**.  
Males (privileged group) have a **slight advantage** in AUC, precision, and false positive rate, while females have slightly higher sensitivity. The overall differences are **minimal and unlikely to represent significant bias**. This model is more balanced than the others, which exhibited stronger gender disparities.
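
The sign of the Balanced Accuracy Difference can be verified by hand, since balanced accuracy is the mean of sensitivity (TPR) and specificity (1 - FPR). The sketch below uses the rounded per-group TPR and FPR from the MLP stratified performance table and assumes the difference is taken as female minus male (unprivileged minus privileged); `balanced_accuracy` is an illustrative helper.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0

# Rounded MLP rates from the stratified performance table.
ba_female = balanced_accuracy(tpr=0.9231, fpr=0.1000)
ba_male = balanced_accuracy(tpr=0.9000, fpr=0.0938)

ba_diff = ba_female - ba_male   # assumed convention: unprivileged - privileged
ba_ratio = ba_female / ba_male  # assumed convention: unprivileged / privileged

print(f"female={ba_female:.4f}  male={ba_male:.4f}")
print(f"difference={ba_diff:+.4f}  ratio={ba_ratio:.4f}")
```

Because the inputs are rounded to four decimals, the recomputed difference and ratio should agree with the summary values only approximately.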

---

In [70]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gender | 0 | -0.0084 | 0.9908 | -0.0063 | 0.9375 | 0.008 | 1.0086 | -0.0003 | 0.9995 | -0.0231 | 0.975 |
| 1 | gender | 1 | 0.0084 | 1.0093 | 0.0063 | 1.0667 | -0.008 | 0.9915 | 0.0003 | 1.0005 | 0.0231 | 1.0256 |


## Stratified Bias Analysis – MLP by Gender

This table presents **group-specific fairness metrics** for the MLP model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

**Reading the signs**: the values are consistent with each row using the listed feature value as the reference group and reporting the other group relative to it (difference = other - reference, ratio = other / reference). For example, the stratified performance table gives TPR = 0.9231 for females and 0.9000 for males, and the gender = 0 row shows TPR Diff = -0.0231 and TPR Ratio = 0.9750.

---

### 1. Balanced Accuracy
- **Difference**: row gender = 0: -0.0084, row gender = 1: 0.0084.  
- **Ratio**: row gender = 0: 0.9908, row gender = 1: 1.0093.  
- ➝ Balanced accuracy is **nearly identical** across genders, with a negligible advantage for females.

---

### 2. False Positive Rate (FPR)
- **Difference**: row gender = 0: -0.0063, row gender = 1: 0.0063.  
- **Ratio**: row gender = 0: 0.9375, row gender = 1: 1.0667.  
- ➝ Males have a **slightly lower false positive rate** (0.0938 vs. 0.1000), while females face marginally more false alarms.  

---

### 3. Positive Predictive Value (PPV / Precision)
- **Difference**: row gender = 0: 0.008, row gender = 1: -0.008.  
- **Ratio**: row gender = 0: 1.0086, row gender = 1: 0.9915.  
- ➝ Precision is **slightly higher for males** (0.9310 vs. 0.9231), meaning predictions of CVD for males are marginally more reliable.

---

### 4. Selection Rate
- **Difference**: row gender = 0: -0.0003, row gender = 1: 0.0003.  
- **Ratio**: row gender = 0: 0.9995, row gender = 1: 1.0005.  
- ➝ Selection rates are essentially **equal across genders**, showing no meaningful disparity.

---

### 5. True Positive Rate (TPR / Sensitivity)
- **Difference**: row gender = 0: -0.0231, row gender = 1: 0.0231.  
- **Ratio**: row gender = 0: 0.9750, row gender = 1: 1.0256.  
- ➝ Females have a **slight advantage in sensitivity** (0.9231 vs. 0.9000), being more likely to be correctly identified when they truly have CVD.

---

### **Summary**
- The MLP model demonstrates **high parity across gender groups**.  
- Most fairness differences are **very small (<0.02)** and ratios remain close to **1.0**, well within common fairness tolerance thresholds.  
- Small tendencies:
  - **Females** benefit from slightly higher **sensitivity (TPR)** and balanced accuracy.  
  - **Males** benefit from slightly higher **precision** and lower **false positive rates**.  
- Overall, the disparities are **minor** and unlikely to indicate systematic gender bias in this model.
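
To make "within tolerance" concrete, the summary values of both models can be compared against explicit bounds. The bounds below (0.8 to 1.2 for ratios, -0.1 to 0.1 for differences) are assumptions chosen for illustration and are not necessarily the `bounds` object used by `MyFlagger` earlier in this notebook. The metric values are copied from the two `measure.summary` outputs.

```python
# Assumed tolerance ranges (illustrative, inclusive on both ends).
RATIO_BOUNDS = (0.8, 1.2)
DIFF_BOUNDS = (-0.1, 0.1)

# Values copied from the measure.summary outputs for each model.
summaries = {
    "Random Forest": {
        "AUC Difference": 0.0201,
        "Balanced Accuracy Difference": 0.0641,
        "Balanced Accuracy Ratio": 1.0703,
        "Disparate Impact Ratio": 1.0511,
        "Equal Odds Difference": 0.1000,
        "Equal Odds Ratio": 0.6400,
        "Positive Predictive Parity Difference": 0.0211,
        "Positive Predictive Parity Ratio": 1.0224,
        "Statistical Parity Difference": 0.0285,
    },
    "MLP": {
        "AUC Difference": -0.0123,
        "Balanced Accuracy Difference": 0.0084,
        "Balanced Accuracy Ratio": 1.0093,
        "Disparate Impact Ratio": 1.0005,
        "Equal Odds Difference": 0.0231,
        "Equal Odds Ratio": 1.0667,
        "Positive Predictive Parity Difference": -0.0080,
        "Positive Predictive Parity Ratio": 0.9915,
        "Statistical Parity Difference": 0.0003,
    },
}

def out_of_range(metrics):
    """Return the names of metrics that fall outside the assumed bounds."""
    flagged = []
    for name, value in metrics.items():
        low, high = RATIO_BOUNDS if name.endswith("Ratio") else DIFF_BOUNDS
        if not (low <= value <= high):
            flagged.append(name)
    return flagged

flags = {model: out_of_range(metrics) for model, metrics in summaries.items()}
print(flags)
```

Under these assumed bounds only the Random Forest's Equal Odds Ratio is flagged; its Equal Odds Difference sits exactly on the boundary, so a stricter (exclusive) bound would flag it as well.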

---

In [71]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 200.0 | 0.58 | 0.565 | 0.905 | 0.917 | 0.0952 | — | 0.9292 | 0.9698 | 0.9052 |
| 1 | gender | 0 | 46.0 | 0.5652 | 0.5652 | 0.913 | 0.9231 | 0.1 | — | 0.9231 | 0.9577 | 0.9231 |
| 2 | gender | 1 | 154.0 | 0.5844 | 0.5649 | 0.9026 | 0.9153 | 0.0938 | — | 0.931 | 0.97 | 0.9 |


## Stratified Performance Analysis – MLP by Gender

This table shows the **stratified performance metrics** of the MLP model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.

---

### 1. Overall Performance (All Features)
- **Accuracy (0.9050)** and **F1-score (0.9170)** indicate strong general performance.  
- **ROC AUC (0.9698)** shows excellent discriminatory ability between positive and negative cases.  
- **Precision (0.9292)** and **TPR (0.9052)** suggest a good balance between predictive reliability and sensitivity.  
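- **Note**: PR AUC is shown as “—” (NaN replaced by the `fillna` call) in every row, including the full test set of 200 observations, so subgroup size alone does not explain the missing values. The library emitted a “Possible error in column(s)” warning while building this table; the cause is not established here, and PR AUC is not interpreted.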

---

### 2. Subgroup Comparison

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9130     | 0.9026   | Accuracy is slightly higher for females. |
| **F1-Score**  | 0.9231     | 0.9153   | Females perform marginally better. |
| **FPR**       | 0.1000     | 0.0938   | Males experience slightly fewer false positives. |
| **Precision** | 0.9231     | 0.9310   | Predictions are slightly more reliable for males. |
| **ROC AUC**   | 0.9577     | 0.9700   | Males benefit from a small advantage in ranking performance. |
| **TPR**       | 0.9231     | 0.9000   | Females are somewhat more likely to be correctly identified when they have CVD. |

---

### 3. Interpretation
- **Females (unprivileged)**:  
  - Achieve higher accuracy and F1-score compared to males.  
  - Benefit from higher sensitivity (TPR = 0.9231), meaning fewer missed positive cases.  
  - However, they face a slightly higher false positive rate (10% vs. 9.38%), leading to more false alarms.  

- **Males (privileged)**:  
  - Have slightly stronger precision and ROC AUC, indicating more reliable predictions and better ranking of outcomes.  
  - Lower false positive rate suggests healthy males are less likely to be misclassified.  
  - Slightly weaker sensitivity compared to females, meaning a few more missed positive cases.  

---

### **Summary**
The MLP model achieves **balanced performance across genders**.  
- **Females** benefit from stronger recall (TPR) and overall accuracy.  
- **Males** benefit from slightly higher precision and ROC AUC, as well as fewer false positives.  
The disparities are **minor**, suggesting that the MLP model is **relatively fair across gender groups** without strong systematic bias.

---

In [72]:
from fairmlhealth import performance_metrics as pm

# Boolean masks selecting each gender group
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask = (protected_attr_mlp == 1)    # male = privileged group

def evaluate_group_performance(group_name, mask, y_true=y_test, y_pred=y_pred_mlp):
    """Print TPR and FPR for the rows selected by the boolean mask."""
    tpr = pm.true_positive_rate(y_true[mask], y_pred[mask])
    fpr = pm.false_positive_rate(y_true[mask], y_pred[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.9231
  False Positive Rate (FPR): 0.1000
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9000
  False Positive Rate (FPR): 0.0938
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section breaks down the classification performance of the MLP model across gender groups, using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Unprivileged (female = 0)    | 0.9231  | 0.1000  |
| Privileged (male = 1)        | 0.9000  | 0.0938  |

#### Interpretation

- **True Positive Rate (TPR)** is slightly higher for females (92.31%) compared to males (90.00%).  
  - This means the model is **better at correctly identifying true positive cases for females**.  

- **False Positive Rate (FPR)** is marginally higher for females (10.00%) than for males (9.38%).  
  - This suggests females are **slightly more likely to receive false alarms** compared to males.  

#### Implications

- The MLP model shows a **mild trade-off across gender groups**:  
  - **Females (unprivileged)**: benefit from better sensitivity (higher TPR) but face a small increase in false positives.  
  - **Males (privileged)**: experience fewer false positives but at the cost of lower sensitivity.  

- These asymmetries contribute to the fairness metrics (e.g., **Equal Odds Difference = 0.0231** and **Equal Odds Ratio = 1.0667**), which capture small but noticeable disparities in error rates.  

#### Recommendation

- While the differences are **minor overall**, the model tends to favor females in terms of sensitivity, while males have slightly better specificity (lower FPR).  
- Depending on clinical priorities (avoiding missed cases vs. minimizing false alarms), these imbalances could be relevant in evaluating fairness.
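
Whether a TPR gap of this size is distinguishable from sampling noise can be gauged with a confidence interval per group. The sketch below computes Wilson score intervals in plain Python. The counts are an assumption: they are reconstructed from the rounded table values (Obs. × Mean Target gives about 26 positive females and 90 positive males; TPRs of 0.9231 and 0.9000 then correspond to 24/26 and 81/90) and should be replaced by the actual confusion counts. `wilson_interval` is an illustrative helper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Reconstructed counts (assumption, see lead-in): true positives / actual positives.
tpr_counts = {"Female": (24, 26), "Male": (81, 90)}

intervals = {group: wilson_interval(tp, pos) for group, (tp, pos) in tpr_counts.items()}
for group, (low, high) in intervals.items():
    tp, pos = tpr_counts[group]
    print(f"{group}: TPR={tp / pos:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

With these reconstructed counts the two intervals overlap, which is in line with reading the MLP's sensitivity gap as minor.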
