### Bias Detection and Fairness Evaluation on Heart Failure Prediction Dataset (Kaggle) using FairMLhealth
(Source: https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)

In [1]:
import pandas as pd

# Load the held-out test features (X_test) and labels (y_test)
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
#have a look at the modules that are within fairmlhealth

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
#load necessary modules 

#import module measure to use measure.summary for bias detection
from fairmlhealth import measure

#import module for investigation of individual cohorts 
from fairmlhealth.__utils import iterate_cohorts

#import FairRanges to flag high values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)

pip install 'aif360[AdversarialDebiasing]'
  vect_normalized_discounted_cumulative_gain = vmap(
  monte_carlo_vect_ndcg = vmap(vect_normalized_discounted_cumulative_gain, in_dims=(0,))


During the execution of FairMLHealth and AIF360, several runtime warnings were raised (e.g., “AdversarialDebiasing will be unavailable” due to the absence of TensorFlow, and deprecation warnings from the inFairness package regarding PyTorch’s functorch.vmap). These warnings do not affect the fairness metrics or results presented in this study, as the unavailable components were not used. To maintain clarity of output, the warnings were silenced programmatically, and the analysis was conducted without issue.

In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [7]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("HeartFailureData_75F25M_PCA_KNN_predictions.csv")

print(knn_df.head())

   Sex  y_true    y_prob  y_pred
0    1       1  0.930457       1
1    1       1  0.580671       1
2    1       1  0.872885       1
3    1       1  0.062216       0
4    0       0  0.389552       0


In [8]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["Sex"].values

# Use gender_knn as the protected attribute (0 = female, 1 = male, as encoded in the CSV)
protected_attr_knn = gender_knn

In [9]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0632
               Balanced Accuracy Difference           -0.0029
               Balanced Accuracy Ratio                 0.9964
               Disparate Impact Ratio                  0.2343
               Equal Odds Difference                  -0.1146
               Equal Odds Ratio                        0.2232
               Positive Predictive Parity Difference  -0.1146
               Positive Predictive Parity Ratio        0.8747
               Statistical Parity Difference          -0.4301
Data Metrics   Prevalence of Privileged Class (%)     79.0000


In [10]:
# Custom, scenario-oriented fairness bounds (stricter than the conventional 0.80–1.25 rule)

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)
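
To make the effect of these bounds concrete, the following minimal sketch checks the KNN summary values against the same ranges in plain Python. The helper `out_of_range` is hypothetical and is not part of fairmlhealth; it only illustrates the kind of comparison the flagging step is assumed to perform.

```python
# Illustrative helper (NOT part of fairmlhealth): flag values outside (low, high).
custom_ranges = {
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

# Values copied from the KNN summary table printed earlier in this notebook.
knn_values = {
    "AUC Difference": 0.0632,
    "Balanced Accuracy Difference": -0.0029,
    "Disparate Impact Ratio": 0.2343,
    "Equal Odds Difference": -0.1146,
    "Statistical Parity Difference": -0.4301,
}


def out_of_range(values, ranges):
    """Return {measure: value} for every value that falls outside its bounds."""
    flagged = {}
    for name, value in values.items():
        low, high = ranges[name.lower()]
        if not low <= value <= high:
            flagged[name] = value
    return flagged


flagged = out_of_range(knn_values, custom_ranges)
for name, value in flagged.items():
    low, high = custom_ranges[name.lower()]
    print(f"{name}: {value:+.4f} is outside [{low}, {high}]")
```

With these bounds, only the balanced accuracy difference stays within range; the other four measures are flagged.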

In [11]:
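# Needed for highlighting metrics outside of the thresholds:
# recent pandas versions no longer provide Styler.set_precision, which the
# fairmlhealth flagging code appears to rely on, so restore it as a thin wrapper
# around Styler.format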
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [12]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

Metric,Measure,Value
Group Fairness,AUC Difference,0.0632
Group Fairness,Balanced Accuracy Difference,-0.0029
Group Fairness,Balanced Accuracy Ratio,0.9964
Group Fairness,Disparate Impact Ratio,0.2343
Group Fairness,Equal Odds Difference,-0.1146
Group Fairness,Equal Odds Ratio,0.2232
Group Fairness,Positive Predictive Parity Difference,-0.1146
Group Fairness,Positive Predictive Parity Ratio,0.8747
Group Fairness,Statistical Parity Difference,-0.4301
Data Metrics,Prevalence of Privileged Class (%),79.0


## Gender Bias Detection Results – KNN Model  

This table presents fairness metrics for the **KNN model**, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Group Fairness Metrics  

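- **AUC Difference (0.0632):**  
  The difference is computed as unprivileged minus privileged (female minus male). The positive value therefore means the model ranks female cases somewhat better than male cases (ROC AUC 0.9531 vs. 0.8899 in the stratified performance table for this model).  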

- **Balanced Accuracy Difference (−0.0029) and Ratio (0.9964):**  
  Balanced accuracy is nearly identical across genders, suggesting overall parity in classification performance.  

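- **Disparate Impact Ratio (0.2343):**  
  Far below the conventional acceptable range (0.80–1.25) and the stricter 0.9–1.1 bound set in `custom_ranges`. Females receive positive predictions at **roughly a quarter of the male rate**, a large gap in selection rates. Part of this gap reflects the different prevalence of the condition in the test set (mean target 0.1579 for females vs. 0.6575 for males).  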

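- **Equal Odds Difference (−0.1146) and Ratio (0.2232):**  
  The two values reflect different error rates. The difference matches the TPR gap: truly positive females are detected less often (TPR 0.6667 vs. 0.7812). The ratio matches the FPR ratio: females are falsely flagged far less often (FPR 0.0312 vs. 0.1400). Females are therefore **disadvantaged through missed cases**, not through false alarms.  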

- **Positive Predictive Parity Difference (−0.1146) and Ratio (0.8747):**  
  Precision is lower for females, meaning positive predictions for this group are less reliable compared to males.  

- **Statistical Parity Difference (−0.4301):**  
  Large negative value shows that **females are selected at a much lower rate** than males, reinforcing evidence of unequal treatment.  

---

### 2. Interpretation  

- **Disparities are most evident in outcome distribution** (statistical parity and disparate impact), showing that females receive far fewer positive predictions.  
- **Error distribution (equal odds)** is also imbalanced: females have a lower TPR (more missed cases) but also a lower FPR (fewer false alarms).  
- Although **balanced accuracy is nearly equal**, suggesting similar base classification performance, the **much lower selection rate and sensitivity for females** signal significant fairness issues.  

---

### **Summary**  

The KNN model exhibits **systematic gender bias**, primarily disadvantaging **females (unprivileged group)**.  
- They receive far fewer positive outcomes.  
- Their positive predictions are less precise (PPV 0.80 vs. 0.91) and their true cases are missed more often (lower TPR), although they receive fewer false alarms.  

This combination of **outcome imbalance and prediction unreliability** indicates that fairness mitigation is necessary before deploying KNN in practice.  
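
The two outcome-based measures can be recomputed by hand, which also makes the sign convention explicit. The sketch below is not the fairmlhealth implementation. The arrays are an assumption: they are synthetic and built only to match the group sizes (38 females, 146 males) and mean predictions (0.1316 and 0.5616) reported in the stratified performance table for this model.

```python
import numpy as np


def parity_metrics(y_pred, group, priv_grp=1):
    """Statistical parity difference and disparate impact ratio.

    Both are computed as unprivileged relative to privileged, i.e.
    difference = rate(unpriv) - rate(priv) and ratio = rate(unpriv) / rate(priv).
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_priv = y_pred[group == priv_grp].mean()
    rate_unpriv = y_pred[group != priv_grp].mean()
    return rate_unpriv - rate_priv, rate_unpriv / rate_priv


# Synthetic data (assumption): 5 of 38 females and 82 of 146 males predicted positive.
group = np.array([0] * 38 + [1] * 146)
y_pred = np.array([1] * 5 + [0] * 33 + [1] * 82 + [0] * 64)

spd, di = parity_metrics(y_pred, group, priv_grp=1)
print(f"Statistical Parity Difference: {spd:.4f}")
print(f"Disparate Impact Ratio:        {di:.4f}")
```

Under these assumptions the sketch reproduces the reported −0.4301 and 0.2343, confirming that negative differences and ratios below 1 mean a lower rate for females.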

---

In [13]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['Sex'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,Sex,0,0.0029,1.0036,0.1088,4.48,0.1146,1.1433,0.4301,4.2685,0.1146,1.1719
1,Sex,1,-0.0029,0.9964,-0.1088,0.2232,-0.1146,0.8747,-0.4301,0.2343,-0.1146,0.8533


## Stratified Bias Analysis – KNN by Gender  

This table presents **group-specific fairness metrics** for the KNN model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

**Reading the rows:** each row treats its own feature value as the reference group and reports the other group relative to it. The `Sex = 1` row therefore repeats the summary metrics (female relative to male), and the `Sex = 0` row is its mirror image (male relative to female). The stratified performance table for this model confirms this reading: FPR is 0.0312 for females and 0.1400 for males, and 0.1400 − 0.0312 = 0.1088, the FPR Diff shown in the `Sex = 0` row. Positive values in the `Sex = 0` row thus mean the metric is **higher for males**.  

---

### 1. Balanced Accuracy  
- **Row Sex = 0 (male relative to female):** Balanced Accuracy Difference = **+0.0029**, Ratio = **1.0036**  
- **Row Sex = 1 (female relative to male):** Balanced Accuracy Difference = **−0.0029**, Ratio = **0.9964**  
- ➝ Balanced accuracy is nearly identical across genders, suggesting no strong disparity in base classification performance.  

---

### 2. False Positive Rate (FPR)  
- **Row Sex = 0 (male relative to female):** FPR Difference = **+0.1088**, Ratio = **4.4800**  
- **Row Sex = 1 (female relative to male):** FPR Difference = **−0.1088**, Ratio = **0.2232**  
- ➝ Males experience a **substantially higher false positive rate** (0.1400 vs. 0.0312), meaning they are more often incorrectly flagged as positive than females.  

---

### 3. Positive Predictive Value (PPV / Precision)  
- **Row Sex = 0 (male relative to female):** PPV Difference = **+0.1146**, Ratio = **1.1433**  
- **Row Sex = 1 (female relative to male):** PPV Difference = **−0.1146**, Ratio = **0.8747**  
- ➝ Precision is **higher for males** (0.9146 vs. 0.8000), meaning positive predictions are somewhat less reliable for females.  

---

### 4. Selection Rate  
- **Row Sex = 0 (male relative to female):** Selection Difference = **+0.4301**, Ratio = **4.2685**  
- **Row Sex = 1 (female relative to male):** Selection Difference = **−0.4301**, Ratio = **0.2343**  
- ➝ Males are **selected far more often** (0.5616 vs. 0.1316, about 4.3 times the female rate), while females receive few positive predictions.  

---

### 5. True Positive Rate (TPR / Sensitivity)  
- **Row Sex = 0 (male relative to female):** TPR Difference = **+0.1146**, Ratio = **1.1719**  
- **Row Sex = 1 (female relative to male):** TPR Difference = **−0.1146**, Ratio = **0.8533**  
- ➝ Males have a **higher sensitivity** (0.7812 vs. 0.6667), meaning truly positive females are more likely to be missed.  

---

### **Summary**  
- The stratified table is **consistent with the summary metrics** and does not indicate a reverse bias:  
  - **Females (unprivileged group)** have **lower precision, lower TPR, and a much lower selection rate**, and in return a **much lower false positive rate**.  
  - **Males (privileged group)** receive more positive predictions and higher sensitivity, at the cost of more false alarms.  

In essence, the error distribution is not consistent across genders: females are more often missed when truly positive, while males are more often falsely flagged. Both patterns represent a fairness concern.  

---

In [14]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"Sex": X_test["Sex"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)

# Replace NaN with a dash
perf_table_knn = perf_table_knn.fillna("—")

# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,184.0,0.5543,0.4728,0.8315,0.836,0.0976,—,0.908,0.9103,0.7745
1,Sex,0,38.0,0.1579,0.1316,0.9211,0.7273,0.0312,—,0.8,0.9531,0.6667
2,Sex,1,146.0,0.6575,0.5616,0.8082,0.8427,0.14,—,0.9146,0.8899,0.7812


## Stratified Performance Analysis – KNN by Gender  

This table shows the **stratified performance metrics** of the KNN model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Overall Performance (All Features)  
- **Accuracy (0.8315)** and **F1-score (0.8360)** indicate solid overall performance.  
- **ROC AUC (0.9103)** demonstrates strong discriminatory ability.  
- **Precision (0.9080)** is high, suggesting reliable predictions.  
- **TPR (0.7745)** shows reasonable sensitivity, though subgroup analysis reveals imbalances.  

---

### 2. Subgroup Comparison  

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9211     | 0.8082   | Accuracy is higher for females. |
| **F1-Score**  | 0.7273     | 0.8427   | Males benefit from stronger F1 performance. |
| **FPR**       | 0.0312     | 0.1400   | Females have a much lower false positive rate. |
| **Precision** | 0.8000     | 0.9146   | Predictions are more reliable for males. |
| **ROC AUC**   | 0.9531     | 0.8899   | The model ranks female cases better. |
| **TPR**       | 0.6667     | 0.7812   | Males are detected more often when truly positive. |

---

### 3. Interpretation  

- **Females (unprivileged):**  
  - Benefit from **higher accuracy** and **lower FPR (3.1%)**, meaning fewer false alarms.  
  - However, they suffer from **lower recall/TPR (66.7%)** and **weaker F1-score (0.7273)**, suggesting more missed true cases.  
  - ROC AUC (0.9531) is stronger, showing good ranking ability.  

- **Males (privileged):**  
  - Achieve **better recall (78.1%)** and **higher F1-score (0.8427)**, showing overall stronger predictive effectiveness.  
  - However, they face a **much higher FPR (14%)**, meaning more incorrect positive classifications.  
  - ROC AUC is lower (0.8899), indicating weaker ranking ability compared to females.  

---

### **Summary**  
The KNN model reveals a **mixed fairness pattern**:  
- **Females** benefit from **higher accuracy and fewer false positives**, but face **more missed true cases (low TPR)**.  
- **Males** gain from **higher recall and overall predictive balance**, but are more often **incorrectly flagged as positive**.  

This indicates that the model’s bias is not unidirectional: instead, it creates **different trade-offs** for each gender group.  
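
As a cross-check, the subgroup metrics can be recomputed from confusion-matrix counts. The counts below are an assumption: they are not printed by the notebook, but were inferred as the integer counts consistent with the group sizes, mean targets, and rounded rates in the table above. The sketch uses only the standard definitions of each metric.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int
    fn: int
    fp: int
    tn: int


def metrics(c: Confusion) -> dict:
    """Standard classification metrics from confusion-matrix counts."""
    precision = c.tp / (c.tp + c.fp)
    tpr = c.tp / (c.tp + c.fn)
    fpr = c.fp / (c.fp + c.tn)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"Accuracy": accuracy, "F1": f1, "FPR": fpr,
            "Precision": precision, "TPR": tpr}


# Inferred counts (assumption), chosen to match the reported KNN table.
groups = {
    "Female (0)": Confusion(tp=4, fn=2, fp=1, tn=31),    # 38 observations
    "Male (1)": Confusion(tp=75, fn=21, fp=7, tn=43),    # 146 observations
}

results = {name: metrics(c) for name, c in groups.items()}
for name, m in results.items():
    summary = ", ".join(f"{k}={v:.4f}" for k, v in m.items())
    print(f"{name}: {summary}")
```

The small female subgroup is worth keeping in mind when reading these numbers: under the inferred counts, the female TPR rests on only six truly positive cases, so a single additional hit or miss changes it by about 17 percentage points.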

---

In [15]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6667
  False Positive Rate (FPR): 0.0312
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7812
  False Positive Rate (FPR): 0.1400
----------------------------------------


### Group-Specific Error Analysis – KNN Model  

To further examine fairness at the subgroup level, we compared the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.  

#### Results by Gender Group  

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.6667  | 0.0312  |
| Privileged (Male)      | 0.7812  | 0.1400  |  

#### Interpretation  

- **Females (unprivileged group):**  
  - Show a **lower TPR (66.67%)**, meaning they are more likely to be missed when truly positive.  
  - However, they also benefit from a **very low FPR (3.12%)**, implying fewer false alarms.  

- **Males (privileged group):**  
  - Achieve a **higher TPR (78.12%)**, so true cases are more often correctly detected.  
  - On the downside, they experience a **much higher FPR (14%)**, meaning more incorrect positive predictions.  

#### Summary  

The model exhibits **different trade-offs across genders**:  
- Females are **less often correctly identified (lower sensitivity)** but **less frequently misclassified as positive (lower FPR)**.  
- Males benefit from **higher sensitivity**, but at the cost of **substantially more false positives**.  

This pattern highlights that fairness concerns are **bidirectional**, as each group faces a different type of disadvantage.  
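
These per-group rates also explain the equal odds values in the summary table. The sketch below assumes that the equal odds difference is the TPR or FPR difference with the larger magnitude, and that the equal odds ratio is the TPR or FPR ratio furthest from 1. This definition is an assumption that happens to be consistent with the values reported in this notebook; it is not taken from the fairmlhealth source.

```python
import math


def equal_odds(tpr_unpriv, fpr_unpriv, tpr_priv, fpr_priv):
    """Equal odds difference and ratio (unprivileged relative to privileged).

    Assumed definition: report the larger-magnitude difference and the ratio
    furthest from 1 (measured on a log scale; rates must be non-zero).
    """
    diffs = [tpr_unpriv - tpr_priv, fpr_unpriv - fpr_priv]
    ratios = [tpr_unpriv / tpr_priv, fpr_unpriv / fpr_priv]
    diff = max(diffs, key=abs)
    ratio = max(ratios, key=lambda r: abs(math.log(r)))
    return diff, ratio


# Rounded KNN rates from the table above (female = unprivileged, male = privileged).
eo_diff, eo_ratio = equal_odds(
    tpr_unpriv=0.6667, fpr_unpriv=0.0312,
    tpr_priv=0.7812, fpr_priv=0.1400,
)
print(f"Equal odds difference: {eo_diff:.4f}  (driven by the TPR gap)")
print(f"Equal odds ratio:      {eo_ratio:.4f}  (driven by the FPR ratio)")
```

Up to rounding of the inputs, this agrees with the reported −0.1146 and 0.2232, and shows that the two summary values describe different error rates.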

---

#### Decision Tree - DT

In [16]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("HeartFailureData_75F25M_AltTunedDT_predictions.csv")

print(dt_df.head())

   Sex  y_true    y_prob  y_pred
0    1       1  0.989399       1
1    1       1  0.000000       0
2    1       1  0.989399       1
3    1       1  0.774194       1
4    0       0  0.972973       1


In [17]:
import re

# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred"].values
gender_dt = dt_df["Sex"].values


# Use gender_dt as the protected attribute (0 = female, 1 = male, as encoded in the CSV)
protected_attr_dt = gender_dt


In [18]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0570
               Balanced Accuracy Difference           -0.0375
               Balanced Accuracy Ratio                 0.9539
               Disparate Impact Ratio                  0.5172
               Equal Odds Difference                  -0.0938
               Equal Odds Ratio                        0.8989
               Positive Predictive Parity Difference  -0.4986
               Positive Predictive Parity Ratio        0.4173
               Statistical Parity Difference          -0.3439
Data Metrics   Prevalence of Privileged Class (%)     79.0000


In [19]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

Metric,Measure,Value
Group Fairness,AUC Difference,-0.057
Group Fairness,Balanced Accuracy Difference,-0.0375
Group Fairness,Balanced Accuracy Ratio,0.9539
Group Fairness,Disparate Impact Ratio,0.5172
Group Fairness,Equal Odds Difference,-0.0938
Group Fairness,Equal Odds Ratio,0.8989
Group Fairness,Positive Predictive Parity Difference,-0.4986
Group Fairness,Positive Predictive Parity Ratio,0.4173
Group Fairness,Statistical Parity Difference,-0.3439
Data Metrics,Prevalence of Privileged Class (%),79.0


## Gender Bias Detection Results – Decision Tree Model  

---

### 1. Group Fairness Metrics  

- **AUC Difference (−0.0570):** The difference is computed as female minus male, so the negative value means ranking performance (ROC AUC) is slightly worse for females (0.7969 vs. 0.8539 in the stratified performance table for this model).  
- **Balanced Accuracy Difference (−0.0375), Ratio (0.9539):** Balanced accuracy is slightly lower for females than for males.  
- **Disparate Impact Ratio (0.5172):** Well below the acceptable range of **0.80–1.25**, indicating a strong imbalance in selection rates that disadvantages females.  
- **Equal Odds Difference (−0.0938), Ratio (0.8989):** Error rates (TPR and FPR) differ, but the gap is moderate, with males slightly favored.  
- **Positive Predictive Parity Difference (−0.4986), Ratio (0.4173):** Precision is substantially lower for females, meaning predictions are much less reliable for them.  
- **Statistical Parity Difference (−0.3439):** Females are selected at a much lower rate than males, reinforcing evidence of underrepresentation.  

---

### 2. Interpretation  

- The Decision Tree shows **clear fairness concerns**:  
  - **Selection disparities** are large (low disparate impact and statistical parity values).  
  - **Precision (PPV)** is much worse for females, showing their positive predictions are far less trustworthy.  
  - **Equal odds** metrics suggest some imbalance in error distribution, with a small tendency to favor males.  

---

### **Summary**  

The Decision Tree model demonstrates **systematic disadvantages for females**:  
- They face **lower precision** and **lower selection rates**, meaning they are both underrepresented and less reliably classified when selected.  
- While AUC and balanced accuracy differences are relatively small, the **impact on fairness is significant** due to disparities in parity metrics.  

This indicates that the Decision Tree introduces a **notable gender bias** that requires mitigation to achieve equitable performance.  


---

In [20]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['Sex'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,Sex,0,0.0375,1.0483,0.0187,1.0667,0.4986,2.3962,0.3439,1.9335,0.0938,1.1125
1,Sex,1,-0.0375,0.9539,-0.0187,0.9375,-0.4986,0.4173,-0.3439,0.5172,-0.0938,0.8989


## Stratified Bias Analysis – Decision Tree by Gender  

This table shows the **group-specific fairness metrics** of the Decision Tree model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

**Reading the rows:** each row treats its own feature value as the reference group and reports the other group relative to it. The `Sex = 1` row therefore repeats the summary metrics (female relative to male), and the `Sex = 0` row is its mirror image (male relative to female). The stratified performance table for this model confirms this reading: the mean prediction is 0.3684 for females and 0.7123 for males, and 0.7123 − 0.3684 = 0.3439, the Selection Diff shown in the `Sex = 0` row. Positive values in the `Sex = 0` row thus mean the metric is **higher for males**.  

---

### 1. Balanced Accuracy  
- **Row Sex = 0 (male relative to female):** Difference = **+0.0375**, Ratio = **1.0483**  
- **Row Sex = 1 (female relative to male):** Difference = **−0.0375**, Ratio = **0.9539**  
- ➝ Balanced accuracy is slightly higher for males than for females.  

---

### 2. False Positive Rate (FPR)  
- **Row Sex = 0 (male relative to female):** FPR Difference = **+0.0187**, Ratio = **1.0667**  
- **Row Sex = 1 (female relative to male):** FPR Difference = **−0.0187**, Ratio = **0.9375**  
- ➝ Males experience a marginally higher false positive rate than females (0.3000 vs. 0.2812).  

---

### 3. Positive Predictive Value (PPV / Precision)  
- **Row Sex = 0 (male relative to female):** PPV Difference = **+0.4986**, Ratio = **2.3962**  
- **Row Sex = 1 (female relative to male):** PPV Difference = **−0.4986**, Ratio = **0.4173**  
- ➝ Predictions are **far more reliable for males** (precision 0.8558 vs. 0.3571), while females face a strong disadvantage in precision.  

---

### 4. Selection Rate  
- **Row Sex = 0 (male relative to female):** Selection Difference = **+0.3439**, Ratio = **1.9335**  
- **Row Sex = 1 (female relative to male):** Selection Difference = **−0.3439**, Ratio = **0.5172**  
- ➝ Males are selected at nearly **double the rate of females** (0.7123 vs. 0.3684).  

---

### 5. True Positive Rate (TPR / Sensitivity)  
- **Row Sex = 0 (male relative to female):** TPR Difference = **+0.0938**, Ratio = **1.1125**  
- **Row Sex = 1 (female relative to male):** TPR Difference = **−0.0938**, Ratio = **0.8989**  
- ➝ Males have a higher sensitivity (0.9271 vs. 0.8333), meaning truly positive females are more likely to be missed.  

---

### **Summary**  
- The Decision Tree **favors males across most metrics**:  
  - Higher balanced accuracy, PPV, selection rate, and TPR.  
  - The only exception is a **slightly higher false positive rate** for males.  
- Females are disadvantaged in terms of **precision and selection**, with notably worse predictive reliability.  

This is consistent with the summary metrics: the Decision Tree systematically favors males (privileged group) while disadvantaging females. The table does not indicate a reverse bias.  
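
One way to check how the rows of the stratified bias table should be read is to recompute the selection measures from the per-group selection rates. The sketch below uses the mean predictions reported for the Decision Tree in the stratified performance table (0.3684 for `Sex = 0`, 0.7123 for `Sex = 1`). The helper `relative_to_reference` is hypothetical and only illustrates the comparison, assuming two groups.

```python
def relative_to_reference(rates, reference):
    """Compare the single non-reference group with the reference group.

    Returns (other - reference, other / reference), rounded to 4 decimals.
    Assumes exactly two groups.
    """
    (other,) = [g for g in rates if g != reference]
    diff = rates[other] - rates[reference]
    ratio = rates[other] / rates[reference]
    return round(diff, 4), round(ratio, 4)


# Reported Decision Tree selection rates (mean prediction per group).
selection_rate = {0: 0.3684, 1: 0.7123}  # 0 = female, 1 = male

row_sex_0 = relative_to_reference(selection_rate, reference=0)
row_sex_1 = relative_to_reference(selection_rate, reference=1)
print("Row Sex = 0 (male relative to female):", row_sex_0)
print("Row Sex = 1 (female relative to male):", row_sex_1)
```

The recomputed pairs match the Selection Diff and Selection Ratio columns of the two rows, which means a positive value in the `Sex = 0` row reflects a higher rate for males, not for females.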

---

In [21]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,184.0,0.5543,0.6413,0.8261,0.8545,0.2927,—,0.7966,0.8578,0.9216
1,Sex,0,38.0,0.1579,0.3684,0.7368,0.5,0.2812,—,0.3571,0.7969,0.8333
2,Sex,1,146.0,0.6575,0.7123,0.8493,0.89,0.3,—,0.8558,0.8539,0.9271


## Stratified Performance Analysis – Decision Tree by Gender  

This table shows the **stratified performance metrics** of the Decision Tree model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Overall Performance (All Features)  
- **Accuracy (0.8261)** and **F1-score (0.8545)** show moderate overall performance.  
- **ROC AUC (0.8578)** indicates fair discriminatory ability.  
- **TPR (0.9216)** highlights strong sensitivity overall, though subgroup breakdowns reveal disparities.  
- **Precision (0.7966)** suggests predictions are fairly reliable.  

---

### 2. Subgroup Comparison  

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.7368     | 0.8493   | Model accuracy is considerably lower for females. |
| **F1-Score**  | 0.5000     | 0.8900   | Model is much less effective for females. |
| **FPR**       | 0.2812     | 0.3000   | Both groups face high false positive rates, but males slightly more so. |
| **Precision** | 0.3571     | 0.8558   | Predictions are far less reliable for females. |
| **ROC AUC**   | 0.7969     | 0.8539   | Females experience weaker ranking performance. |
| **TPR**       | 0.8333     | 0.9271   | Females are more likely to be missed compared to males. |

---

### 3. Interpretation  

- **Females (unprivileged):**  
  - Experience **substantially worse performance** across accuracy, F1, and precision.  
  - Their **precision (0.3571)** is especially poor, meaning many of their positive predictions are false alarms.  
  - Although sensitivity (TPR = 0.8333) is acceptable, it lags behind males, leading to more missed cases.  

- **Males (privileged):**  
  - Benefit from higher performance across nearly all metrics.  
  - Stronger accuracy (0.8493), much higher F1 (0.8900), and reliable precision (0.8558).  
  - Their ROC AUC (0.8539) also reflects better ranking ability than for females.  

---

### **Summary**  
The Decision Tree model introduces a **systematic disadvantage for females**:  
- Their predictions are much less reliable, with **very low precision and F1-score**.  
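- While males achieve strong predictive performance, females face weaker sensitivity and a much larger share of false alarms among their positive predictions, even though their FPR (0.2812) is slightly lower than that of males (0.3000).  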

This aligns with the fairness metrics, showing that the Decision Tree disproportionately **favors males (privileged group)** while disadvantaging females in predictive reliability.  
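
The combination of a similar FPR and a much lower precision for females follows from the different prevalence in the two groups. By Bayes' rule, PPV = TPR·p / (TPR·p + FPR·(1 − p)), where p is the share of true positives in the group (the Mean Target column). The sketch below applies this identity to the rounded values in the table above. The counterfactual at the end is a hypothetical illustration, not a result of the study.

```python
def ppv(tpr, fpr, prevalence):
    """Positive predictive value from TPR, FPR and prevalence (Bayes' rule)."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Rounded Decision Tree values from the stratified performance table.
ppv_female = ppv(tpr=0.8333, fpr=0.2812, prevalence=0.1579)
ppv_male = ppv(tpr=0.9271, fpr=0.3000, prevalence=0.6575)

# Hypothetical counterfactual: female error rates at the male prevalence.
ppv_female_at_male_prev = ppv(tpr=0.8333, fpr=0.2812, prevalence=0.6575)

print(f"PPV female:                      {ppv_female:.4f}")
print(f"PPV male:                        {ppv_male:.4f}")
print(f"PPV female at male prevalence:   {ppv_female_at_male_prev:.4f}")
```

Under these inputs, most of the precision gap disappears once prevalence is equalized, which suggests that the low female precision is driven largely by the low share of positive cases among females in the test set. This does not make the disparity harmless: at the observed prevalence, most positive predictions for females are still false alarms.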

---

In [22]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8333
  False Positive Rate (FPR): 0.2812
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.9271
  False Positive Rate (FPR): 0.3000
----------------------------------------


### Group-Specific Error Analysis – Decision Tree Model  

To further assess fairness at the subgroup level, we examine the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.  

---

#### Results by Gender Group  

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.8333  | 0.2812  |
| Privileged (Male)      | 0.9271  | 0.3000  |

---

#### Interpretation  

- The **privileged group (male)** achieves a **higher TPR (92.71%)** compared to females (83.33%).  
  - This means males are **more likely to be correctly identified** when they truly have the condition.  

- However, the **FPR is high for both groups**:  
  - Females: **28.12%**  
  - Males: **30.00%**  
  - Both groups face a considerable rate of false alarms, but males slightly more so.  

- The disparity is **more pronounced in sensitivity (TPR)**, where females risk being under-identified compared to males.  

---

#### Summary  

The Decision Tree model shows a **gender imbalance in detection**:  
- **Males benefit from stronger sensitivity**, meaning they are more reliably detected when positive.  
- **Both groups suffer from high false positive rates**, which reduces overall trust in predictions.  
- The results highlight that while detection is slightly skewed in favor of males, **false alarms remain a fairness concern across both genders**.  

---

### Ensemble Model - Random Forest - RF

In [23]:
rf_df = pd.read_csv("HeartFailureData_75F25M_RF_predictions.csv")
print(rf_df.head())

   Sex  y_true  y_prob  y_pred
0    1       1  0.8975       1
1    1       1  0.1550       0
2    1       1  0.8625       1
3    1       1  0.2225       0
4    0       0  0.2400       0


In [24]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["Sex"].values


# Use gender_rf as the protected attribute (0 = female, 1 = male, as encoded in the CSV)
protected_attr_rf = gender_rf

In [25]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0494
               Balanced Accuracy Difference            0.0800
               Balanced Accuracy Ratio                 1.0974
               Disparate Impact Ratio                  0.2712
               Equal Odds Difference                  -0.1288
               Equal Odds Ratio                        0.1953
               Positive Predictive Parity Difference  -0.0725
               Positive Predictive Parity Ratio        0.9199
               Statistical Parity Difference          -0.4243
Data Metrics   Prevalence of Privileged Class (%)     79.0000


In [26]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric | Measure | Value |
|--------|---------|-------|
| Group Fairness | AUC Difference | 0.0494 |
| Group Fairness | Balanced Accuracy Difference | 0.08 |
| Group Fairness | Balanced Accuracy Ratio | 1.0974 |
| Group Fairness | Disparate Impact Ratio | 0.2712 |
| Group Fairness | Equal Odds Difference | -0.1288 |
| Group Fairness | Equal Odds Ratio | 0.1953 |
| Group Fairness | Positive Predictive Parity Difference | -0.0725 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9199 |
| Group Fairness | Statistical Parity Difference | -0.4243 |
| Data Metrics | Prevalence of Privileged Class (%) | 79.0 |


## Random Forest Fairness Analysis by Gender  

This table summarizes **fairness metrics** for the Random Forest model, focusing on gender bias.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Group Fairness Metrics  

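Differences are computed as unprivileged minus privileged (female − male) and ratios as female / male. The per-group values quoted below come from the stratified performance table (In [28]).  

- **AUC Difference (0.0494)**: ROC AUC is slightly higher for females (0.9531) than for males (0.9038), so ranking performance is marginally better for the unprivileged group.  
- **Balanced Accuracy Difference (0.0800)** and **Ratio (1.0974)**: Balanced accuracy is higher for females (about 0.90) than for males (about 0.82), so the sensitivity-specificity balance favors females.  
- **Disparate Impact Ratio (0.2712)**: Far below the commonly used fairness range (0.80–1.25). Females are predicted positive at a rate of 15.79% versus 58.22% for males. Observed prevalence differs similarly (15.79% vs 65.75%), so the ratio partly reflects base rates rather than model behavior alone.  
- **Equal Odds Difference (−0.1288)** and **Equal Odds Ratio (0.1953)**: Both values correspond to the FPR gap (female 0.0312 vs male 0.1600), meaning females receive far fewer false positives. The TPR gap is small (+0.0312 in favor of females).  
- **Positive Predictive Parity Difference (−0.0725)** and **Ratio (0.9199)**: Precision is lower for females (0.8333) than for males (0.9059), so a positive prediction is somewhat less reliable for females.  
- **Statistical Parity Difference (−0.4243)**: Females are **predicted positive much less frequently** than males (0.1579 − 0.5822), the same gap that the Disparate Impact Ratio expresses as a ratio.  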
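
The two selection-rate metrics can be reproduced from the per-group share of positive predictions. The sketch below is a plain NumPy illustration, not the FairMLHealth implementation: the helper name `selection_rate_gap` is hypothetical, and the "unprivileged relative to privileged" orientation is an assumption inferred from the reported values.

```python
import numpy as np

def selection_rate_gap(y_pred, group, priv_grp=1):
    """Return (statistical parity difference, disparate impact ratio).

    Both compare the unprivileged group against the privileged group
    (assumed orientation; hypothetical helper, not a library function).
    """
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    priv_rate = y_pred[group == priv_grp].mean()
    unpriv_rate = y_pred[group != priv_grp].mean()
    return unpriv_rate - priv_rate, unpriv_rate / priv_rate

# Toy data (not the study data): 4 privileged rows followed by 4 unprivileged rows
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_group = [1, 1, 1, 1, 0, 0, 0, 0]
spd, di_ratio = selection_rate_gap(toy_pred, toy_group)
print(f"SPD = {spd:.2f}, DIR = {di_ratio:.4f}")
```

With the notebook's arrays the call would be `selection_rate_gap(y_pred_rf, protected_attr_rf)`. Using the reported mean predictions directly, 0.1579 − 0.5822 = −0.4243 and 0.1579 / 0.5822 ≈ 0.2712, which matches the table.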

---

### 2. Interpretation  

- The Random Forest model treats the two genders **unevenly**, but the direction of the disparity depends on the metric.  
- Females have **lower selection rates and lower precision**, while their **balanced accuracy and ROC AUC are higher** than those of males.  
- The Equal Odds gap is driven by false positives: males are incorrectly flagged about five times as often as females (FPR 0.1600 vs 0.0312), while TPR differs only slightly.  
- Statistical parity and disparate impact show a **large selection-rate gap**. Because observed prevalence also differs sharply between the groups, these two metrics alone do not establish that females are under-selected.  

---

### **Summary**  

The Random Forest model demonstrates **strong gender disparities**, but they do not all point in the same direction.  
Despite overall good model performance, selection-rate metrics and precision are lower for females (unprivileged group), while the error-rate metrics (Equal Odds, Balanced Accuracy) are less favorable for males.   

---

In [27]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['Sex'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


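|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | Sex | 0 | -0.08 | 0.9112 | 0.1288 | 5.12 | 0.0725 | 1.0871 | 0.4243 | 3.6872 | -0.0312 | 0.9625 |
| 1 | Sex | 1 | 0.08 | 1.0974 | -0.1288 | 0.1953 | -0.0725 | 0.9199 | -0.4243 | 0.2712 | 0.0312 | 1.039 |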


## Stratified Bias Analysis – Random Forest by Gender  

This table presents **group-specific fairness metrics** for the Random Forest model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

**How to read the rows:** each row compares the *other* group against the listed feature value. Checked against the per-group rates in the stratified performance table (In [28]), the row for Sex = 0 equals male − female (and male / female), while the row for Sex = 1 equals female − male (and female / male) and therefore matches the `measure.summary` output.  

---

### 1. Balanced Accuracy  
- **Row Sex = 0 (male relative to female):** Balanced Accuracy Difference = **−0.0800**, Ratio = **0.9112**  
- **Row Sex = 1 (female relative to male):** Balanced Accuracy Difference = **+0.0800**, Ratio = **1.0974**  
- ➝ Balanced accuracy is higher for **females** (about 0.90) than for males (about 0.82).  

---

### 2. False Positive Rate (FPR)  
- **Row Sex = 0 (male relative to female):** FPR Difference = **+0.1288**, Ratio = **5.1200**  
- **Row Sex = 1 (female relative to male):** FPR Difference = **−0.1288**, Ratio = **0.1953**  
- ➝ **Males** face a **much higher false positive rate** (0.1600 vs 0.0312), meaning they are more often incorrectly classified as positive.  

---

### 3. Positive Predictive Value (PPV / Precision)  
- **Row Sex = 0 (male relative to female):** PPV Difference = **+0.0725**, Ratio = **1.0871**  
- **Row Sex = 1 (female relative to male):** PPV Difference = **−0.0725**, Ratio = **0.9199**  
- ➝ Predictions are **more precise for males** (0.9059 vs 0.8333), so positive predictions for females are somewhat less reliable.  

---

### 4. Selection Rate  
- **Row Sex = 0 (male relative to female):** Selection Difference = **+0.4243**, Ratio = **3.6872**  
- **Row Sex = 1 (female relative to male):** Selection Difference = **−0.4243**, Ratio = **0.2712**  
- ➝ **Males are selected much more often** than females (58.22% vs 15.79%), which largely tracks the difference in observed prevalence (65.75% vs 15.79%).  

---

### 5. True Positive Rate (TPR / Sensitivity)  
- **Row Sex = 0 (male relative to female):** TPR Difference = **−0.0312**, Ratio = **0.9625**  
- **Row Sex = 1 (female relative to male):** TPR Difference = **+0.0312**, Ratio = **1.0390**  
- ➝ The model is **slightly more sensitive for females** (0.8333 vs 0.8021).  

---

### **Summary**  
- The Random Forest model shows **mixed bias patterns**:  
  - **Males** have a **much higher false positive rate** and are selected far more often.  
  - **Females** have **slightly higher sensitivity and better balanced accuracy**, but lower precision.  

Overall, the Random Forest introduces **imbalances in error distribution and selection**: **males** are disadvantaged in terms of false positives, while **females** are disadvantaged in precision and are predicted positive far less often.  
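
The orientation of the two rows can be checked against the per-group FPR values reported in the stratified performance table (In [28]). This is a small arithmetic sketch; `stratified_row` is a hypothetical helper, and the "other group relative to listed group" convention is inferred from the numbers rather than taken from the library documentation.

```python
def stratified_row(rate_listed, rate_other):
    """Difference and ratio of the other group relative to the listed group."""
    return rate_other - rate_listed, rate_other / rate_listed

# Per-group FPR values as reported (rounded) in the stratified performance table
fpr = {"female": 0.0312, "male": 0.1600}

diff_f, ratio_f = stratified_row(fpr["female"], fpr["male"])  # row Sex = 0
diff_m, ratio_m = stratified_row(fpr["male"], fpr["female"])  # row Sex = 1
print(f"Sex = 0: FPR Diff {diff_f:+.4f}, FPR Ratio {ratio_f:.2f}")
print(f"Sex = 1: FPR Diff {diff_m:+.4f}, FPR Ratio {ratio_m:.4f}")
```

The results agree with the bias table up to rounding of the inputs, which supports reading a positive FPR Diff in the Sex = 0 row as a higher false positive rate for males.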

---

In [28]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


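|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 184.0 | 0.5543 | 0.4946 | 0.8424 | 0.8497 | 0.1098 | — | 0.9011 | 0.9231 | 0.8039 |
| 1 | Sex | 0 | 38.0 | 0.1579 | 0.1579 | 0.9474 | 0.8333 | 0.0312 | — | 0.8333 | 0.9531 | 0.8333 |
| 2 | Sex | 1 | 146.0 | 0.6575 | 0.5822 | 0.8151 | 0.8508 | 0.16 | — | 0.9059 | 0.9038 | 0.8021 |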


## Stratified Performance Analysis – Random Forest by Gender  

This table shows the **stratified performance metrics** of the Random Forest model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Overall Performance (All Features)  
- **Accuracy (0.8424)** and **F1-score (0.8497)** indicate solid overall classification performance.  
- **ROC AUC (0.9231)** shows strong discriminatory ability.  
- **Precision (0.9011)** is high, suggesting reliable positive predictions.  
- **TPR (0.8039)** indicates reasonable sensitivity, though subgroup breakdowns reveal imbalances.  

---

### 2. Subgroup Comparison  

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9474     | 0.8151   | Accuracy is substantially higher for females. |
| **F1-Score**  | 0.8333     | 0.8508   | F1 is slightly better for males, though close. |
| **FPR**       | 0.0312     | 0.1600   | Females experience far fewer false positives. |
| **Precision** | 0.8333     | 0.9059   | Males benefit from more reliable predictions. |
| **ROC AUC**   | 0.9531     | 0.9038   | ROC AUC is higher for females, suggesting stronger ranking performance. |
| **TPR**       | 0.8333     | 0.8021   | Females have a slightly higher sensitivity than males. |

---

### 3. Interpretation  
- **Females (unprivileged)**:  
  - Benefit from **higher accuracy (94.7%)** and **better ROC AUC (0.9531)**.  
  - Experience **much lower false positive rates (3.1%)**, which reduces unnecessary misclassifications.  
  - Precision (0.8333) is lower than for males, meaning their positive predictions are less reliable.  

- **Males (privileged)**:  
  - Have **lower accuracy (81.5%)** and weaker ROC AUC (0.9038).  
  - Precision (0.9059) is higher, making their positive predictions more trustworthy.  
  - However, they face **more false positives (16%)** and slightly worse sensitivity (TPR = 0.8021).  

---

### **Summary**  
The Random Forest model shows a **mixed fairness pattern**:  
- **Females** perform better in terms of accuracy, sensitivity, ROC AUC, and especially false positive rate.  
- **Males** benefit from higher precision and slightly better F1.  

This indicates that while females are overall **better protected against misclassification errors**, males enjoy **more reliable positive predictions**.  
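
The subgroup rates rest on very different sample sizes, which becomes visible when the confusion-matrix counts are reconstructed from the table. The following sketch assumes the reported rates are rounded versions of exact count ratios; `counts_from_rates` is a hypothetical helper introduced only for this illustration.

```python
def counts_from_rates(n_obs, prevalence, tpr, fpr):
    """Recover approximate confusion-matrix counts from rounded rates."""
    positives = round(n_obs * prevalence)
    negatives = n_obs - positives
    tp = round(positives * tpr)
    fp = round(negatives * fpr)
    return {"TP": tp, "FN": positives - tp, "FP": fp, "TN": negatives - fp}

# Obs., Mean Target, TPR and FPR taken from the stratified performance table
female = counts_from_rates(38, 0.1579, 0.8333, 0.0312)
male = counts_from_rates(146, 0.6575, 0.8021, 0.1600)

for name, c in (("Female", female), ("Male", male)):
    precision = c["TP"] / (c["TP"] + c["FP"])
    accuracy = (c["TP"] + c["TN"]) / sum(c.values())
    print(name, c, f"precision={precision:.4f}", f"accuracy={accuracy:.4f}")
```

Under this assumption the female subgroup contains only 6 positive cases (5 detected, 1 missed) and a single false positive, so one additional missed case would move the female TPR from 5/6 to 4/6. The female metrics should therefore be interpreted with more caution than the male ones.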

---

In [29]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.8333
  False Positive Rate (FPR): 0.0312
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.8021
  False Positive Rate (FPR): 0.1600
----------------------------------------


### Group-Specific Error Analysis – Random Forest Model  

To assess fairness at the subgroup level, we compared the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** for the unprivileged (female) and privileged (male) groups.  

---

#### Results by Gender Group  

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.8333  | 0.0312  |
| Privileged (Male)      | 0.8021  | 0.1600  |

---

#### Interpretation  

- **Females (unprivileged group)** achieve a slightly higher **TPR (83.33%)** than males (80.21%), meaning women are somewhat more likely to be correctly identified when they have CVD.  
- At the same time, females have a **much lower FPR (3.12%)** compared to males (16.00%), indicating that they are far less likely to be incorrectly flagged as having CVD.  
- This combination suggests that the model is **both more sensitive and more specific for females**, providing them with more favorable outcomes.  
- In contrast, males are disadvantaged, as they face both a higher risk of **missed true cases (lower TPR)** and **false alarms (higher FPR)**.  
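
As a cross-check that does not depend on the library, per-group TPR and FPR can also be computed directly with NumPy. This is a minimal sketch on toy labels; `group_rates` is a hypothetical helper, not part of FairMLHealth.

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Return {group value: (TPR, FPR)} computed from binary labels."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    rates = {}
    for g in np.unique(group):
        t, p = y_true[group == g], y_pred[group == g]
        tpr = ((t == 1) & (p == 1)).sum() / (t == 1).sum()
        fpr = ((t == 0) & (p == 1)).sum() / (t == 0).sum()
        rates[int(g)] = (float(tpr), float(fpr))
    return rates

# Toy labels (not the study data)
toy_true = [1, 1, 0, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 0, 1, 1, 1, 0]
toy_group = [0, 0, 0, 0, 1, 1, 1, 1]
print(group_rates(toy_true, toy_pred, toy_group))
```

In the notebook, `group_rates(y_true_rf, y_pred_rf, protected_attr_rf)` would take labels, predictions and group membership from the same predictions CSV, which avoids relying on the row order of a separately loaded `y_test`.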

---

#### Summary  

The Random Forest model shows a **reverse bias pattern**, favoring the unprivileged group (females).  
- Females experience **better diagnostic accuracy**, with stronger sensitivity and far fewer false positives.  
- Males, despite being the privileged group, face **higher misclassification risks**, which highlights an imbalance in model behavior that may require bias mitigation.  

---

### Deep Learning Model - Feed Forward Network (MLP)

In [30]:
mlp_df = pd.read_csv("HeartFailureData_75F25M_RecallFirstTunedMLP_predictions.csv")
print(mlp_df.head())

   Sex  y_true    y_prob  y_pred
0    1       1  0.997220       1
1    1       1  0.000009       0
2    1       1  0.791310       1
3    1       1  0.001596       0
4    0       0  0.000887       0


In [31]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["Sex"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 

In [32]:
#Run fairmlhealth bias detection for MLP 

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                         -0.0065
               Balanced Accuracy Difference            0.0267
               Balanced Accuracy Ratio                 1.0337
               Disparate Impact Ratio                  0.2183
               Equal Odds Difference                  -0.1888
               Equal Odds Ratio                        0.1420
               Positive Predictive Parity Difference  -0.0750
               Positive Predictive Parity Ratio        0.9143
               Statistical Parity Difference          -0.4712
Data Metrics   Prevalence of Privileged Class (%)     79.0000


In [33]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

| Metric | Measure | Value |
|--------|---------|-------|
| Group Fairness | AUC Difference | -0.0065 |
| Group Fairness | Balanced Accuracy Difference | 0.0267 |
| Group Fairness | Balanced Accuracy Ratio | 1.0337 |
| Group Fairness | Disparate Impact Ratio | 0.2183 |
| Group Fairness | Equal Odds Difference | -0.1888 |
| Group Fairness | Equal Odds Ratio | 0.142 |
| Group Fairness | Positive Predictive Parity Difference | -0.075 |
| Group Fairness | Positive Predictive Parity Ratio | 0.9143 |
| Group Fairness | Statistical Parity Difference | -0.4712 |
| Data Metrics | Prevalence of Privileged Class (%) | 79.0 |


## Gender Bias Detection Results for MLP Model  

---

### 1. Group Fairness Metrics  

- **AUC Difference (−0.0065)**: Minimal disparity, showing that ranking quality between genders is nearly equal.  
- **Balanced Accuracy Difference (+0.0267)** and **Ratio (1.0337)**: Slightly favors females, who achieve marginally higher balanced accuracy.  
- **Disparate Impact Ratio (0.2183)**: Far below the fairness guideline of 0.80–1.25, revealing a **severe gap in selection rates**: females are predicted positive at 13.16% versus 60.27% for males. Observed prevalence differs similarly (15.79% vs 65.75%), so the ratio partly reflects base rates.  
- **Equal Odds Difference (−0.1888)** and **Equal Odds Ratio (0.1420)**: Both values correspond to the FPR gap (female 0.0312 vs male 0.2200), so females receive far fewer false positives. The TPR gap (−0.1354) points the other way: females are detected less often (0.6667 vs 0.8021).  
- **Positive Predictive Parity Difference (−0.0750)** and **Ratio (0.9143)**: Predictions for females are **less precise** (0.8000 vs 0.8750), meaning their positive classifications are less trustworthy.  
- **Statistical Parity Difference (−0.4712)**: Females are predicted positive far less often than males (0.1316 − 0.6027).  
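
The reported Equal Odds values are consistent with selecting whichever of the TPR or FPR gap deviates most from parity. That selection rule is an assumption inferred from the numbers in this notebook, not a confirmed description of the library, and `equal_odds` below is a hypothetical helper. The per-group rates come from the stratified performance table (In [35]).

```python
def equal_odds(tpr_unpriv, tpr_priv, fpr_unpriv, fpr_priv):
    """Pick the TPR or FPR gap that deviates most from parity (assumed rule)."""
    tpr_diff, fpr_diff = tpr_unpriv - tpr_priv, fpr_unpriv - fpr_priv
    tpr_ratio, fpr_ratio = tpr_unpriv / tpr_priv, fpr_unpriv / fpr_priv
    diff = fpr_diff if abs(fpr_diff) > abs(tpr_diff) else tpr_diff
    ratio = fpr_ratio if abs(fpr_ratio - 1) > abs(tpr_ratio - 1) else tpr_ratio
    return diff, ratio

# MLP rates as reported (rounded); female = unprivileged, male = privileged
diff, ratio = equal_odds(0.6667, 0.8021, 0.0312, 0.2200)
print(f"Equal Odds Difference ~ {diff:.4f}, Equal Odds Ratio ~ {ratio:.3f}")
```

Both results come from the FPR pair, which is why the Equal Odds metrics here describe the false positive gap rather than the sensitivity gap.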

---

### 2. Interpretation  

- The MLP model exhibits **systematic fairness concerns**, despite strong overall predictive capability.  
- Females are disadvantaged in:  
  - **Sensitivity**: TPR is 0.6667 versus 0.8021 for males, so more true cases are missed.  
  - **Precision**: positive predictions are less reliable (0.8000 vs 0.8750).  
  - **Selection rates**: very low compared to males, although this partly reflects lower prevalence.  
- Males are disadvantaged in **false positives**: their FPR is 0.2200 versus 0.0312 for females.  

---

### **Summary**  

The MLP model shows **marked gender disparities**, most of which disadvantage females (unprivileged group).  
- While AUC and balanced accuracy differences are relatively small, the **large gaps in selection rates, equal odds, and statistical parity** reveal **severe inequity**.  
- Females are **detected less often and receive less precise positive predictions**, while males carry a **much higher false positive rate**.

---

In [34]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['Sex'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


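|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | Sex | 0 | -0.0267 | 0.9674 | 0.1888 | 7.04 | 0.075 | 1.0938 | 0.4712 | 4.5808 | 0.1354 | 1.2031 |
| 1 | Sex | 1 | 0.0267 | 1.0337 | -0.1888 | 0.142 | -0.075 | 0.9143 | -0.4712 | 0.2183 | -0.1354 | 0.8312 |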


## Stratified Bias Analysis – MLP by Gender  

This table presents **group-specific fairness metrics** for the MLP model, stratified by gender.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

**How to read the rows:** each row compares the *other* group against the listed feature value. Checked against the per-group rates in the stratified performance table (In [35]), the row for Sex = 0 equals male − female (and male / female), while the row for Sex = 1 equals female − male (and female / male) and therefore matches the `measure.summary` output.  

---

### 1. Balanced Accuracy  
- **Row Sex = 0 (male relative to female):** Balanced Accuracy Difference = **−0.0267**, Ratio = **0.9674**  
- **Row Sex = 1 (female relative to male):** Balanced Accuracy Difference = **+0.0267**, Ratio = **1.0337**  
- ➝ Females have a slight advantage in balanced accuracy (about 0.82 vs 0.79 for males).  

---

### 2. False Positive Rate (FPR)  
- **Row Sex = 0 (male relative to female):** FPR Difference = **+0.1888**, Ratio = **7.0400**  
- **Row Sex = 1 (female relative to male):** FPR Difference = **−0.1888**, Ratio = **0.1420**  
- ➝ **Males** are **far more likely to receive false positives** (0.2200 vs 0.0312).  

---

### 3. Positive Predictive Value (PPV / Precision)  
- **Row Sex = 0 (male relative to female):** PPV Difference = **+0.0750**, Ratio = **1.0938**  
- **Row Sex = 1 (female relative to male):** PPV Difference = **−0.0750**, Ratio = **0.9143**  
- ➝ Precision is better for males (0.8750 vs 0.8000), meaning a positive prediction is more often correct for them.  

---

### 4. Selection Rate  
- **Row Sex = 0 (male relative to female):** Selection Difference = **+0.4712**, Ratio = **4.5808**  
- **Row Sex = 1 (female relative to male):** Selection Difference = **−0.4712**, Ratio = **0.2183**  
- ➝ The model **selects males at a much higher rate** (60.27% vs 13.16%), which largely tracks the difference in observed prevalence (65.75% vs 15.79%).  

---

### 5. True Positive Rate (TPR / Sensitivity)  
- **Row Sex = 0 (male relative to female):** TPR Difference = **+0.1354**, Ratio = **1.2031**  
- **Row Sex = 1 (female relative to male):** TPR Difference = **−0.1354**, Ratio = **0.8312**  
- ➝ Males have higher sensitivity (0.8021 vs 0.6667), meaning true cases among females are missed more often.  

---

### **Summary**  
- The MLP model **favors males in precision and sensitivity (TPR)** and predicts them positive far more often.  
- This comes with a **much higher false positive rate for males**, which skews the error distribution.  
- Females experience **fewer false positives** and slightly higher balanced accuracy, but more of their true cases are missed.  

Overall, the MLP does **not favor one gender across the board**: females are **under-detected**, while males carry the **burden of false positives**.  

---

In [35]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


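|   | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|---------------|------|-------------|-----------------|----------|----------|-----|--------|-----------|---------|-----|
| 0 | ALL FEATURES | ALL VALUES | 184.0 | 0.5543 | 0.5054 | 0.8207 | 0.8308 | 0.1463 | — | 0.871 | 0.8968 | 0.7941 |
| 1 | Sex | 0 | 38.0 | 0.1579 | 0.1316 | 0.9211 | 0.7273 | 0.0312 | — | 0.8 | 0.8646 | 0.6667 |
| 2 | Sex | 1 | 146.0 | 0.6575 | 0.6027 | 0.7945 | 0.837 | 0.22 | — | 0.875 | 0.871 | 0.8021 |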


## Stratified Performance Analysis – MLP by Gender  

This table shows the **stratified performance metrics** of the MLP model across gender groups.  
Here, **0 = Female (unprivileged)** and **1 = Male (privileged)**.  

---

### 1. Overall Performance (All Features)  
- **Accuracy (0.8207)** and **F1-score (0.8308)** indicate good but not perfect classification ability.  
- **ROC AUC (0.8968)** reflects strong discriminatory power.  
- **Precision (0.8710)** is high, showing predictions are generally reliable.  
- **TPR (0.7941)** suggests reasonable sensitivity, though subgroup analysis reveals disparities.  
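- **Note**: PR AUC is shown as “—” in every row, including the overall row, because the value was returned as NaN and replaced by the dash in the cell above. It is therefore not interpreted here.  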

---

### 2. Subgroup Comparison  

| Metric        | Female (0) | Male (1) | Interpretation |
|---------------|------------|----------|----------------|
| **Accuracy**  | 0.9211     | 0.7945   | Accuracy is higher for females, suggesting better classification for this group. |
| **F1-Score**  | 0.7273     | 0.8370   | Males benefit from a stronger balance between precision and recall. |
| **FPR**       | 0.0312     | 0.2200   | Females experience far fewer false positives compared to males. |
| **Precision** | 0.8000     | 0.8750   | Predictions are more reliable for males than for females. |
| **ROC AUC**   | 0.8646     | 0.8710   | Both groups show strong ranking performance, with males slightly ahead. |
| **TPR**       | 0.6667     | 0.8021   | Females are more likely to be missed (lower sensitivity). |
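
The F1 gap in the table follows directly from the precision and TPR (recall) columns, since F1 is their harmonic mean. A short check using the rounded table values (`f1_from` is a hypothetical helper name):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and TPR per group, as reported in the table above
f1_female = f1_from(0.8000, 0.6667)
f1_male = f1_from(0.8750, 0.8021)
print(f"Female F1 = {f1_female:.4f}, Male F1 = {f1_male:.4f}")
```

The lower female F1 is driven mainly by recall (0.6667), since the precision gap between the groups is comparatively small.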

---

### 3. Interpretation  
- **Females (unprivileged)**:  
  - Benefit from **higher accuracy** and **much lower false positive rates** (3.12%).  
  - However, they suffer from a **lower recall/TPR (66.7%)**, meaning more missed true cases.  
  - Precision (0.8000) is weaker, making positive predictions less trustworthy.  

- **Males (privileged)**:  
  - Achieve **better F1 and TPR**, showing stronger overall detection ability.  
  - However, they experience **substantially higher false positive rates (22%)**, which may lead to over-diagnosis.  
  - Predictions are more precise (0.8750), reducing incorrect positives.  

---

### **Summary**  
The MLP model exhibits **mixed gender disparities**:  
- Females gain **higher accuracy and fewer false alarms**, but at the cost of **lower sensitivity** (missed true cases).  
- Males benefit from **higher recall and F1-score**, but face a **greater false positive burden**.  

This indicates that the MLP is **not uniformly fair**: it balances error differently across genders, with each group facing different trade-offs.  

---

In [36]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6667
  False Positive Rate (FPR): 0.0312
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.8021
  False Positive Rate (FPR): 0.2200
----------------------------------------


### Group-Specific Error Analysis – MLP Model  

To further assess fairness, the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** are compared between the unprivileged (female) and privileged (male) groups.  

---

#### Results by Gender Group  

| Group                  | TPR     | FPR     |
|------------------------|---------|---------|
| Unprivileged (Female)  | 0.6667  | 0.0312  |
| Privileged (Male)      | 0.8021  | 0.2200  |

---

#### Interpretation  

- The **True Positive Rate (TPR)** is higher for males (80.21%) than for females (66.67%).  
  - This means males are **more likely to be correctly identified** when they have CVD, while females face a greater risk of missed diagnoses.  

- The **False Positive Rate (FPR)** is much lower for females (3.12%) compared to males (22.00%).  
  - This indicates that females are **less likely to be incorrectly flagged** as having CVD, while males face a substantial risk of false alarms.  

- These results show a **trade-off in error distribution**:  
  - Females experience **higher under-detection (low TPR)** but fewer false positives.  
  - Males benefit from **higher sensitivity (high TPR)** but at the cost of a much larger false positive rate.  
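
The trade-off can be condensed into balanced accuracy, the mean of sensitivity (TPR) and specificity (1 − FPR). The sketch below uses the rounded rates printed above; `balanced_accuracy` is a hypothetical helper, not a library call.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1 - fpr)) / 2

# TPR and FPR per group, as printed by the cell above
ba_female = balanced_accuracy(0.6667, 0.0312)
ba_male = balanced_accuracy(0.8021, 0.2200)
print(f"Female BA = {ba_female:.4f}, Male BA = {ba_male:.4f}, "
      f"difference = {ba_female - ba_male:+.4f}")
```

The difference reproduces the Balanced Accuracy Difference of 0.0267 from the bias report: the lower female TPR is slightly more than offset by the much lower female FPR, which is why this single number hides two opposing error patterns.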

---

#### Summary  

The MLP model creates **different fairness concerns** across genders:  
- **Females (unprivileged group)** are more often missed (lower TPR).  
- **Males (privileged group)** are more frequently misclassified as positive (higher FPR).  

This highlights that the model’s **error balance is uneven**, with each gender disadvantaged in different ways.  

---