### Bias Detection and Fairness Evaluation on Cardiovascular Disease (Kaggle) using FairMLhealth
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset)

In [1]:
import pandas as pd

# Load the test features and the test labels
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
#have a look at the modules that are within fairmlhealth

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
#load necessary modules 

#import module measure to use measure.summary for bias detection
from fairmlhealth import measure

#import module for investigation of individual cohorts 
from fairmlhealth.__utils import iterate_cohorts

#import FairRanges to flag high values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)

pip install 'aif360[AdversarialDebiasing]'
  vect_normalized_discounted_cumulative_gain = vmap(
  monte_carlo_vect_ndcg = vmap(vect_normalized_discounted_cumulative_gain, in_dims=(0,))


In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [7]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("CVDKaggleData_75F25M__tunedKNN_predictions.csv")

print(knn_df.head())

   gender  y_true    y_prob  y_pred
0       0       0  0.344828       0
1       0       0  0.862069       1
2       1       0  0.379310       0
3       0       0  0.379310       0
4       0       0  0.275862       0


In [8]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["gender"].values

# Use gender_knn as the protected attribute (coded 0/1 in the CSV)
protected_attr_knn = gender_knn

In [9]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0146
               Balanced Accuracy Difference            0.0138
               Balanced Accuracy Ratio                 1.0198
               Disparate Impact Ratio                  1.0999
               Equal Odds Difference                   0.0564
               Equal Odds Ratio                        1.1090
               Positive Predictive Parity Difference   0.0023
               Positive Predictive Parity Ratio        1.0033
               Statistical Parity Difference           0.0458
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [10]:
# Custom, scenario-oriented bounds for the fairness metrics

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)

In [11]:
# Restore Styler.set_precision, which newer pandas releases dropped in favour of
# Styler.format(precision=...); the styled flag table below appears to rely on it
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [12]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric | Measure | Value |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference | 0.0146 |
| Group Fairness | Balanced Accuracy Difference | 0.0138 |
| Group Fairness | Balanced Accuracy Ratio | 1.0198 |
| Group Fairness | Disparate Impact Ratio | 1.0999 |
| Group Fairness | Equal Odds Difference | 0.0564 |
| Group Fairness | Equal Odds Ratio | 1.109 |
| Group Fairness | Positive Predictive Parity Difference | 0.0023 |
| Group Fairness | Positive Predictive Parity Ratio | 1.0033 |
| Group Fairness | Statistical Parity Difference | 0.0458 |
| Data Metrics | Prevalence of Privileged Class (%) | 35.0 |


### Interpretation of KNN Fairness Metrics by Gender

The table reports fairness metrics for the **K-Nearest Neighbors (KNN) model**, using gender as the sensitive attribute. With `priv_grp=1`, males are the privileged group. Judging by the per-group rates in the stratified performance table (cell 14), differences are computed as unprivileged (female) minus privileged (male), and ratios as female divided by male.

---

#### Key Observations

- **AUC Difference (0.0146):**  
  → Very small; the model’s ranking performance is almost identical across genders, and the value lies inside the custom bound of ±0.02.

- **Balanced Accuracy Difference (0.0138)** and **Ratio (1.0198):**  
  → Shows a small advantage for females, but the ratio remains close to 1, indicating that balanced accuracy is fairly consistent across groups.

- **Disparate Impact Ratio (1.0999):**  
  → Females receive a positive prediction about 10% more often than males in relative terms.  
  → Well within the commonly used four-fifths range (0.8–1.25), but only just inside the stricter custom range (0.9–1.1).

- **Equal Odds Difference (0.0564)** and **Equal Odds Ratio (1.1090):**  
  → This is the **largest disparity observed**. The true positive rate and false positive rate are both higher for females than for males. The gap of about 5.6 percentage points is moderate, but it exceeds the custom bound of ±0.04.

- **Positive Predictive Parity Difference (0.0023)** and **Ratio (1.0033):**  
  → Precision is nearly equal across genders, with an almost negligible difference.

- **Statistical Parity Difference (0.0458):**  
  → Females are predicted positive about 4.6 percentage points more often than males, which is still inside the custom bound of ±0.05.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group (males) makes up 35% of the dataset. While the imbalance could affect outcomes, disparities remain relatively small.

---

#### Overall Conclusion
The KNN model demonstrates **mostly balanced fairness performance across genders**, with the following nuances:
- **Most metrics (AUC, balanced accuracy, predictive parity)** show only minor disparities.  
- **Equal Odds Difference (0.0564)** stands out as the largest gap and is the only difference metric outside the custom bounds, indicating that error rates (TPR/FPR) differ more noticeably between genders.  
- **Statistical parity and disparate impact** remain within acceptable ranges, although the disparate impact ratio sits at the edge of the custom range.

In summary, the KNN model is **reasonably fair**, though it introduces a **slight imbalance in error distribution** across gender groups compared to other fairness aspects.
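
As a rough cross-check of which KNN metrics fall outside the custom bounds from cell 10, the comparison can be sketched in plain Python. The helper `out_of_range` is a hypothetical name introduced here for illustration; it does not reproduce how `FairRanges` or `Flagger` match metric names to bounds internally, and the values are copied by hand from the summary table above.

```python
# Subset of the custom bounds from cell 10 (keys are lower-cased metric names).
custom_ranges = {
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

# KNN values copied from the summary table above.
knn_values = {
    "AUC Difference": 0.0146,
    "Balanced Accuracy Difference": 0.0138,
    "Disparate Impact Ratio": 1.0999,
    "Equal Odds Difference": 0.0564,
    "Statistical Parity Difference": 0.0458,
}


def out_of_range(values, ranges):
    """Hypothetical helper: return metrics whose value lies outside (low, high)."""
    flagged = {}
    for name, value in values.items():
        bound = ranges.get(name.lower())
        if bound is not None and not (bound[0] <= value <= bound[1]):
            flagged[name] = value
    return flagged


flagged_knn = out_of_range(knn_values, custom_ranges)
print(flagged_knn)
```

Under these assumptions only the Equal Odds Difference is flagged; the Disparate Impact Ratio of 1.0999 stays just inside the upper bound of 1.1.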

---

In [13]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


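| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------|---|---------|--------|---------|--------|---------|--------|---------|--------|---------|--------|
| 0 | gender | 0 | -0.0138 | 0.9806 | -0.0288 | 0.9017 | -0.0023 | 0.9967 | -0.0458 | 0.9092 | -0.0564 | 0.9211 |
| 1 | gender | 1 | 0.0138 | 1.0198 | 0.0288 | 1.109 | 0.0023 | 1.0033 | 0.0458 | 1.0999 | 0.0564 | 1.0856 |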


### Interpretation of Stratified Bias Analysis – KNN (by Gender)

The table shows group-specific fairness metrics for the **K-Nearest Neighbors (KNN) model**, stratified by gender  
(0 = Female, 1 = Male).

Each row treats the listed value as the reference group. Judging by the per-group rates in the stratified performance table (cell 14), a difference is the rate of the other gender minus the rate of the row's gender. A positive value in the male row therefore means that the rate is higher for females.

---

#### 1. Balanced Accuracy
- **Row gender = 0 (Female):** Difference = **-0.0138**, Ratio = **0.9806**  
- **Row gender = 1 (Male):** Difference = **0.0138**, Ratio = **1.0198**  
➡️ Females have slightly higher balanced accuracy than males. The gap is minor (about 1.4 percentage points).

---

#### 2. False Positive Rate (FPR)
- **Row gender = 0 (Female):** FPR Diff = **-0.0288**, Ratio = **0.9017**  
- **Row gender = 1 (Male):** FPR Diff = **0.0288**, Ratio = **1.1090**  
➡️ Females experience a **higher false positive rate** (0.2934 vs. 0.2645), meaning more incorrect positive classifications.  
➡️ Males face a somewhat lower false positive rate.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row gender = 0 (Female):** PPV Diff = **-0.0023**, Ratio = **0.9967**  
- **Row gender = 1 (Male):** PPV Diff = **0.0023**, Ratio = **1.0033**  
➡️ Precision is **almost identical** across genders, with negligible differences.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row gender = 0 (Female):** Selection Diff = **-0.0458**, Ratio = **0.9092**  
- **Row gender = 1 (Male):** Selection Diff = **0.0458**, Ratio = **1.0999**  
➡️ Females are **more likely to be predicted positive** than males (0.5045 vs. 0.4587).  
➡️ Males have a slightly lower chance of being classified as positive cases.

---

#### 5. True Positive Rate (TPR, Recall)
- **Row gender = 0 (Female):** TPR Diff = **-0.0564**, Ratio = **0.9211**  
- **Row gender = 1 (Male):** TPR Diff = **0.0564**, Ratio = **1.0856**  
➡️ Females benefit from a **higher recall** (0.7153 vs. 0.6588), meaning fewer false negatives.  
➡️ Males have a somewhat lower recall, so more of their true positives are missed.

---

### Overall Conclusion
- **Females (0):**  
  - Advantage: Higher recall (fewer missed true positives).  
  - Disadvantage: Higher false positive rate.  

- **Males (1):**  
  - Advantage: Lower false positive rate.  
  - Disadvantage: Lower recall and lower chance of being predicted positive.  

The KNN model shows a **trade-off**:  
- Females are detected more often (higher recall, more positives predicted) but also face more false positives.  
- Males are more protected from false alarms but risk having more true cases missed.  

Overall, disparities are moderate, with the **largest gap in recall (TPR difference ~0.056)**, suggesting a **slight imbalance in error distribution** across genders.
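
One way to check the direction of the differences in the stratified bias table is to recompute them from the per-group rates reported in the stratified performance table (cell 14). The sketch below assumes that "Mean Prediction" equals the selection rate and that each row is computed as "other group minus row group". Both are assumptions inferred from the numbers, not taken from the library documentation, and `row_differences` is a hypothetical helper.

```python
# Per-group rates copied from the stratified performance table (cell 14).
rates = {
    0: {"TPR": 0.7153, "FPR": 0.2934, "PPV": 0.7095, "Selection": 0.5045},  # female
    1: {"TPR": 0.6588, "FPR": 0.2645, "PPV": 0.7072, "Selection": 0.4587},  # male
}


def row_differences(row_group, other_group):
    """Hypothetical helper: 'other group minus row group' for every rate."""
    return {
        name: round(rates[other_group][name] - rates[row_group][name], 4)
        for name in rates[row_group]
    }


male_row = row_differences(row_group=1, other_group=0)
female_row = row_differences(row_group=0, other_group=1)
print("gender = 1 row:", male_row)
print("gender = 0 row:", female_row)
```

Up to rounding of the inputs, the recomputed male row matches the TPR, FPR, PPV and Selection differences reported for `gender = 1`, which supports reading a positive value in that row as "higher for females".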

---

In [14]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)

# Replace NaN with a dash
perf_table_knn = perf_table_knn.fillna("—")

# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


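| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|------------|---------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | ALL FEATURES | ALL VALUES | 11256.0 | 0.4976 | 0.4886 | 0.7064 | 0.7023 | 0.2833 | — | 0.7087 | 0.7607 | 0.6959 |
| 1 | gender | 0 | 7362.0 | 0.5004 | 0.5045 | 0.7109 | 0.7124 | 0.2934 | 0.2533 | 0.7095 | 0.7661 | 0.7153 |
| 2 | gender | 1 | 3894.0 | 0.4923 | 0.4587 | 0.6977 | 0.6821 | 0.2645 | — | 0.7072 | 0.7515 | 0.6588 |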


### Interpretation of Stratified Performance Metrics – KNN (by Gender)

The table shows performance results for the **K-Nearest Neighbors (KNN) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7064  
- **F1-Score:** 0.7023  
- **Precision:** 0.7087  
- **ROC AUC:** 0.7607  
- **TPR (Recall):** 0.6959  
→ The model performs at a **moderate level**, with a good trade-off between precision and recall.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7109 (slightly higher than males)  
- **F1-Score:** 0.7124 (better balance of precision and recall)  
- **FPR:** 0.2934 (higher than males → more false positives)  
- **Precision:** 0.7095 (almost identical to males)  
- **ROC AUC:** 0.7661 (better discrimination ability than males)  
- **TPR (Recall):** 0.7153 (higher recall than males)  

➡️ Females benefit from **higher accuracy, F1-score, recall, and ROC AUC**, but at the cost of a **higher false positive rate**.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.6977 (lower than females)  
- **F1-Score:** 0.6821 (lower than females)  
- **FPR:** 0.2645 (lower than females → fewer false positives)  
- **Precision:** 0.7072 (very similar to females)  
- **ROC AUC:** 0.7515 (slightly worse than females)  
- **TPR (Recall):** 0.6588 (lower recall → more missed positives)  

➡️ Males experience **lower recall and overall accuracy**, but benefit from a **lower false positive rate**.

---

### Overall Conclusion
- **Females (0):** Better overall performance (accuracy, F1, recall, ROC AUC), but face more false positives.  
- **Males (1):** Slightly worse overall performance, with fewer false positives but more missed positives.  

The KNN model shows a **trade-off in fairness**:  
- It favors **females in terms of detection ability (higher recall, better accuracy)**.  
- It favors **males in terms of error protection (lower false positives)**.  

Overall, disparities are modest, suggesting the model is **reasonably fair but with small gender-specific trade-offs**.
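
The balanced accuracy figures in the fairness summary can be traced back to the per-group rates above. As a worked example, balanced accuracy is the mean of sensitivity (TPR) and specificity (1 − FPR). The sketch below plugs in the KNN values from the table; the function name is introduced here for illustration only.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# KNN rates copied from the stratified performance table above.
female_ba = balanced_accuracy(tpr=0.7153, fpr=0.2934)
male_ba = balanced_accuracy(tpr=0.6588, fpr=0.2645)

ba_difference = female_ba - male_ba  # unprivileged minus privileged
ba_ratio = female_ba / male_ba       # unprivileged divided by privileged

print(f"female: {female_ba:.4f}, male: {male_ba:.4f}")
print(f"difference: {ba_difference:.4f}, ratio: {ba_ratio:.4f}")
```

The resulting difference and ratio agree with the reported Balanced Accuracy Difference (0.0138) and Ratio (1.0198) to within rounding.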

---

In [15]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.7153
  False Positive Rate (FPR): 0.2934
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6588
  False Positive Rate (FPR): 0.2645
----------------------------------------


### Group-Specific Error Analysis – KNN Model

This section evaluates the classification performance of the KNN model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.7153  | 0.2934  |
| Male (Privileged = 1)        | 0.6588  | 0.2645  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **higher TPR (71.53%)**, meaning the model detects more true positives for this group.  
  - However, they also experience a **higher FPR (29.34%)**, meaning they receive more false positives.

- **Males (Privileged):**  
  - Achieve a **lower TPR (65.88%)**, so the model misses more true positives compared to females.  
  - At the same time, they have a **lower FPR (26.45%)**, meaning they are less likely to be incorrectly classified as positive.

#### Conclusion
The KNN model reveals a **sensitivity–specificity trade-off across genders**:  
- **Females** benefit from higher sensitivity (better recall of true positives) but face more false alarms.  
- **Males** are protected against false positives but suffer from lower sensitivity, missing more true cases.  

Overall, the differences are moderate, suggesting the model is **reasonably fair but slightly imbalanced** in how it distributes errors between genders.
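
The group-wise TPR and FPR can also be computed without any fairness library, which makes the definitions explicit. The following is a minimal sketch with a hypothetical helper `group_rates` and toy data invented for illustration; it is not the implementation used by `fairmlhealth.performance_metrics`.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} from binary labels, predictions and group codes."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        # TPR = TP / (TP + FN), FPR = FP / (FP + TN); NaN if a class is absent.
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[group] = (tpr, fpr)
    return rates


# Toy data, invented for illustration only (not taken from the dataset).
toy_true = [1, 1, 0, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 1, 0, 1, 1, 0, 0]
toy_gender = [0, 0, 0, 0, 1, 1, 1, 1]

toy_rates = group_rates(toy_true, toy_pred, toy_gender)
print(toy_rates)
```

Passing the label, prediction and gender columns of a predictions file to such a function would avoid redefining a model-specific evaluation function for every model.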

---

### Decision Tree - DT

In [16]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("CVDKaggleData_75F25M_DT_tuned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred_dt    y_prob
0       0       0          0  0.256610
1       0       0          1  0.842593
2       1       0          0  0.417021
3       0       0          0  0.335329
4       0       0          0  0.354173


In [17]:
import re

# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred_dt"].values
gender_dt = dt_df["gender"].values


# Use gender_dt as the protected attribute (coded 0/1 in the CSV)
protected_attr_dt = gender_dt


In [18]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0153
               Balanced Accuracy Difference           -0.0006
               Balanced Accuracy Ratio                 0.9992
               Disparate Impact Ratio                  0.9925
               Equal Odds Difference                  -0.0078
               Equal Odds Ratio                        0.9774
               Positive Predictive Parity Difference   0.0092
               Positive Predictive Parity Ratio        1.0130
               Statistical Parity Difference          -0.0038
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [19]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric | Measure | Value |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference | 0.0153 |
| Group Fairness | Balanced Accuracy Difference | -0.0006 |
| Group Fairness | Balanced Accuracy Ratio | 0.9992 |
| Group Fairness | Disparate Impact Ratio | 0.9925 |
| Group Fairness | Equal Odds Difference | -0.0078 |
| Group Fairness | Equal Odds Ratio | 0.9774 |
| Group Fairness | Positive Predictive Parity Difference | 0.0092 |
| Group Fairness | Positive Predictive Parity Ratio | 1.013 |
| Group Fairness | Statistical Parity Difference | -0.0038 |
| Data Metrics | Prevalence of Privileged Class (%) | 35.0 |


### Interpretation of Decision Tree Fairness Metrics by Gender

The table reports fairness metrics for the **Decision Tree model**, using gender as the sensitive attribute. As in the KNN summary, differences correspond to unprivileged (female) minus privileged (male); this reading is consistent with the per-group rates in the stratified performance table (cell 21).

---

#### Key Observations

- **AUC Difference (0.0153):**  
  → Very small; the model’s ranking ability is nearly identical across genders.

- **Balanced Accuracy Difference (-0.0006)** and **Ratio (0.9992):**  
  → Essentially no disparity in balanced accuracy; both genders are treated equally well.

- **Disparate Impact Ratio (0.9925):**  
  → Very close to 1, meaning positive prediction rates are almost equal for both genders. This value lies comfortably within the fairness range (0.8–1.25).

- **Equal Odds Difference (-0.0078)** and **Ratio (0.9774):**  
  → Very small difference in error rates (TPR and FPR). The negative sign means that both rates are **marginally lower for females** than for males (TPR 0.7079 vs. 0.7157, FPR 0.2858 vs. 0.2924), but the magnitude is negligible.

- **Positive Predictive Parity Difference (0.0092)** and **Ratio (1.0130):**  
  → Precision is nearly identical between genders, with a marginal advantage for females (0.7128 vs. 0.7036).

- **Statistical Parity Difference (-0.0038):**  
  → Virtually no difference in the probability of receiving a positive prediction. Females are predicted positive marginally less often, but the difference is negligible.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group (males) makes up 35% of the dataset, but this imbalance does not translate into fairness violations.

---

#### Overall Conclusion
The Decision Tree model demonstrates **excellent fairness across gender groups**:  
- All differences are extremely small and lie inside the custom bounds defined in cell 10.  
- Positive outcomes, precision, recall, and error rates are distributed almost equally between genders.  
- The remaining gaps point in different directions (slightly higher precision and fewer false positives for females, slightly higher recall for males), so they do not indicate systematic bias.

In summary, the Decision Tree is **highly fair across genders**, making it one of the most balanced models in terms of fairness.

---

In [20]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


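| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------|---|---------|--------|---------|--------|---------|--------|---------|--------|---------|--------|
| 0 | gender | 0 | 0.0006 | 1.0008 | 0.0066 | 1.0231 | -0.0092 | 0.9871 | 0.0038 | 1.0076 | 0.0078 | 1.011 |
| 1 | gender | 1 | -0.0006 | 0.9992 | -0.0066 | 0.9774 | 0.0092 | 1.013 | -0.0038 | 0.9925 | -0.0078 | 0.9891 |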


### Interpretation of Stratified Bias Analysis – Decision Tree (by Gender)

The table shows group-specific fairness metrics for the **Decision Tree model**, stratified by gender  
(0 = Female, 1 = Male).

Each row treats the listed value as the reference group. Judging by the per-group rates in the stratified performance table (cell 21), a difference is the rate of the other gender minus the rate of the row's gender. A positive value in the female row therefore means that the rate is higher for males.

---

#### 1. Balanced Accuracy
- **Row gender = 0 (Female):** Difference = **0.0006**, Ratio = **1.0008**  
- **Row gender = 1 (Male):** Difference = **-0.0006**, Ratio = **0.9992**  
➡️ Balanced accuracy is nearly identical across genders, with differences close to zero.

---

#### 2. False Positive Rate (FPR)
- **Row gender = 0 (Female):** FPR Diff = **0.0066**, Ratio = **1.0231**  
- **Row gender = 1 (Male):** FPR Diff = **-0.0066**, Ratio = **0.9774**  
➡️ Males experience a **slightly higher false positive rate** (0.2924 vs. 0.2858), while females benefit from slightly fewer false positives.  
The difference is very small (about 0.7 percentage points).

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row gender = 0 (Female):** PPV Diff = **-0.0092**, Ratio = **0.9871**  
- **Row gender = 1 (Male):** PPV Diff = **0.0092**, Ratio = **1.0130**  
➡️ Precision is slightly better for females (0.7128 vs. 0.7036), but the difference is negligible (<1 percentage point).

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row gender = 0 (Female):** Selection Diff = **0.0038**, Ratio = **1.0076**  
- **Row gender = 1 (Male):** Selection Diff = **-0.0038**, Ratio = **0.9925**  
➡️ Males are slightly **more likely to be predicted positive** than females (0.5008 vs. 0.4970), but the difference is almost negligible.

---

#### 5. True Positive Rate (TPR, Recall)
- **Row gender = 0 (Female):** TPR Diff = **0.0078**, Ratio = **1.0110**  
- **Row gender = 1 (Male):** TPR Diff = **-0.0078**, Ratio = **0.9891**  
➡️ Males achieve a **slightly higher recall** (0.7157 vs. 0.7079), while females experience slightly more false negatives.  
The difference is marginal (about 0.8 percentage points).

---

### Overall Conclusion
- **Females (0):** Benefit from fewer false positives and slightly higher precision, but with a minor disadvantage in recall and prediction rate.  
- **Males (1):** Small advantages in recall (TPR) and likelihood of being predicted positive, but slightly more false positives and slightly lower precision.  

All disparities are **extremely small (≤ 1 percentage point)**, meaning the Decision Tree model demonstrates **very strong fairness across gender groups**, with no meaningful bias.
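
Because there are only two gender groups, the two rows of the stratified bias table mirror each other: the differences have opposite signs, and each ratio in one row should be roughly the reciprocal of the ratio in the other row. The following sketch checks this for the Decision Tree ratios, with values copied by hand from the table above and a tolerance chosen loosely to absorb rounding to four decimals.

```python
# Ratio columns copied from the Decision Tree stratified bias table above.
female_row = {"Balanced Accuracy": 1.0008, "FPR": 1.0231, "PPV": 0.9871,
              "Selection": 1.0076, "TPR": 1.0110}
male_row = {"Balanced Accuracy": 0.9992, "FPR": 0.9774, "PPV": 1.0130,
            "Selection": 0.9925, "TPR": 0.9891}

# Gap between each female-row ratio and the reciprocal of the male-row ratio.
mirror_gap = {
    name: abs(female_row[name] - 1.0 / male_row[name]) for name in female_row
}

for name, gap in mirror_gap.items():
    print(f"{name:<18} gap = {gap:.6f}")
```

Since both rows carry the same information, interpreting one of them (together with the sign convention) is sufficient in the two-group case.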

---

In [21]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


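| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|--------------|------------|---------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0 | ALL FEATURES | ALL VALUES | 11256.0 | 0.4976 | 0.4983 | 0.7113 | 0.7101 | 0.2881 | — | 0.7096 | 0.7657 | 0.7106 |
| 1 | gender | 0 | 7362.0 | 0.5004 | 0.497 | 0.7111 | 0.7103 | 0.2858 | — | 0.7128 | 0.7709 | 0.7079 |
| 2 | gender | 1 | 3894.0 | 0.4923 | 0.5008 | 0.7116 | 0.7096 | 0.2924 | — | 0.7036 | 0.7556 | 0.7157 |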


### Interpretation of Stratified Performance Metrics – Decision Tree (by Gender)

The table shows performance results for the **Decision Tree model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7113  
- **F1-Score:** 0.7101  
- **Precision:** 0.7096  
- **ROC AUC:** 0.7657  
- **TPR (Recall):** 0.7106  
→ The model achieves **solid and balanced predictive performance**, with a good balance of precision and recall.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7111 (almost the same as males)  
- **F1-Score:** 0.7103 (nearly identical)  
- **FPR:** 0.2858 (slightly lower than males → fewer false positives)  
- **Precision:** 0.7128 (slightly higher than males)  
- **ROC AUC:** 0.7709 (slightly better discrimination ability)  
- **TPR (Recall):** 0.7079 (very close to males)  

➡️ Females show **slightly better precision, ROC AUC, and fewer false positives**, though recall is marginally lower.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7116 (almost the same as females)  
- **F1-Score:** 0.7096 (virtually identical)  
- **FPR:** 0.2924 (slightly higher than females → more false positives)  
- **Precision:** 0.7036 (slightly lower than females)  
- **ROC AUC:** 0.7556 (slightly lower discrimination ability)  
- **TPR (Recall):** 0.7157 (slightly higher than females)  

➡️ Males achieve **slightly higher recall**, meaning fewer missed positives, but at the cost of **lower precision and higher false positive rate**.

---

### Overall Conclusion
- **Females (0):** Small advantage in precision, ROC AUC, and fewer false positives.  
- **Males (1):** Small advantage in recall, detecting slightly more true positives, but with more false alarms.  

The Decision Tree model demonstrates **very balanced performance across genders**, with differences so small that they indicate **no meaningful gender bias**, only minor trade-offs between recall and precision.
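
The near-identical F1 scores follow directly from the precision and recall values, since F1 is their harmonic mean. As a worked example with the Decision Tree numbers from the table above (the function name is chosen here for illustration):

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Decision Tree precision and recall (TPR) copied from the table above.
female_f1 = f1_from_precision_recall(precision=0.7128, recall=0.7079)
male_f1 = f1_from_precision_recall(precision=0.7036, recall=0.7157)

print(f"female F1: {female_f1:.4f}")
print(f"male F1:   {male_f1:.4f}")
```

The higher precision of females and the higher recall of males largely offset each other, which is why the two F1 scores differ by less than 0.001.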

---

In [22]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.7079
  False Positive Rate (FPR): 0.2858
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7157
  False Positive Rate (FPR): 0.2924
----------------------------------------


### Group-Specific Error Analysis – Decision Tree Model

This section evaluates the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.7079  | 0.2858  |
| Male (Privileged = 1)        | 0.7157  | 0.2924  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 70.79%**, which is slightly lower than males.  
  - Benefit from a **lower FPR (28.58%)**, meaning they receive fewer false positives.

- **Males (Privileged):**  
  - Achieve a **TPR of 71.57%**, slightly higher than females, meaning more true positives are detected.  
  - However, they also have a **higher FPR (29.24%)**, meaning they are more often misclassified as positive.

#### Conclusion
The Decision Tree model demonstrates a **small trade-off in error distribution**:  
- **Males** gain slightly higher sensitivity (recall) but also face more false alarms.  
- **Females** avoid more false positives but miss slightly more true cases.  

The differences are **minimal**, indicating the model maintains **strong gender fairness** with only negligible performance trade-offs.
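
The per-group rates above also make it possible to relate the Equal Odds Difference from the fairness summaries to TPR and FPR. The sketch below assumes that the reported value is the larger-magnitude of the TPR gap and the FPR gap (unprivileged minus privileged). This definition is an assumption that happens to match the reported numbers for both models, not a confirmed description of the library's implementation.

```python
def equal_odds_difference(unpriv, priv):
    """Assumed definition: larger-magnitude of the TPR gap and the FPR gap."""
    tpr_gap = unpriv["TPR"] - priv["TPR"]
    fpr_gap = unpriv["FPR"] - priv["FPR"]
    return max(tpr_gap, fpr_gap, key=abs)


# Rates copied from the group-specific error analyses (female = unprivileged).
dt_eod = equal_odds_difference(
    unpriv={"TPR": 0.7079, "FPR": 0.2858}, priv={"TPR": 0.7157, "FPR": 0.2924}
)
knn_eod = equal_odds_difference(
    unpriv={"TPR": 0.7153, "FPR": 0.2934}, priv={"TPR": 0.6588, "FPR": 0.2645}
)

print(f"Decision Tree: {dt_eod:.4f}")
print(f"KNN:           {knn_eod:.4f}")
```

Up to rounding of the inputs, these values agree with the Equal Odds Difference reported in the summaries (-0.0078 for the Decision Tree, 0.0564 for KNN); in both cases the TPR gap is the larger of the two.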

---

### Ensemble Model - Random Forest - RF

In [23]:
rf_df = pd.read_csv("CVDKaggleData_75F25M_RF_tuned_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred_rf    y_prob
0       0       0          0  0.284528
1       0       0          1  0.804516
2       1       0          0  0.347408
3       0       0          0  0.272444
4       0       0          1  0.537651


In [24]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred_rf"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender_rf as the protected attribute (coded 0/1 in the CSV)
protected_attr_rf = gender_rf

In [25]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0131
               Balanced Accuracy Difference            0.0099
               Balanced Accuracy Ratio                 1.0142
               Disparate Impact Ratio                  0.9795
               Equal Odds Difference                  -0.0228
               Equal Odds Ratio                        0.9159
               Positive Predictive Parity Difference   0.0235
               Positive Predictive Parity Ratio        1.0332
               Statistical Parity Difference          -0.0096
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [26]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric | Measure | Value |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference | 0.0131 |
| Group Fairness | Balanced Accuracy Difference | 0.0099 |
| Group Fairness | Balanced Accuracy Ratio | 1.0142 |
| Group Fairness | Disparate Impact Ratio | 0.9795 |
| Group Fairness | Equal Odds Difference | -0.0228 |
| Group Fairness | Equal Odds Ratio | 0.9159 |
| Group Fairness | Positive Predictive Parity Difference | 0.0235 |
| Group Fairness | Positive Predictive Parity Ratio | 1.0332 |
| Group Fairness | Statistical Parity Difference | -0.0096 |
| Data Metrics | Prevalence of Privileged Class (%) | 35.0 |


### Interpretation of Random Forest Fairness Metrics by Gender

The table reports fairness metrics for the **Random Forest (RF) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0131):**  
  → Very small; the model’s ranking ability is nearly the same across genders.

- **Balanced Accuracy Difference (0.0099)** and **Ratio (1.0142):**  
  → Balanced accuracy is about one percentage point higher for females (roughly 0.711 vs. 0.701, computed from the per-group TPR and FPR reported later in this notebook); the difference is minimal.

- **Disparate Impact Ratio (0.9795):**  
  → Close to 1, suggesting that both genders have nearly equal chances of receiving a positive prediction. This value lies well within the acceptable fairness range (0.8–1.25).

- **Equal Odds Difference (-0.0228)** and **Equal Odds Ratio (0.9159):**  
  → These values match the FPR gap between the groups (female 0.2488 vs. male 0.2716 in the per-group results later in this notebook); the TPR gap is much smaller (about 0.003). Females receive slightly fewer false positives, a gap of roughly 2 percentage points.

- **Positive Predictive Parity Difference (0.0235)** and **Ratio (1.0332):**  
  → Precision is slightly higher for females (0.7297 vs. 0.7062 in the per-group results later in this notebook), but the difference remains minor.

- **Statistical Parity Difference (-0.0096):**  
  → Almost negligible difference in the rate of positive predictions; females are predicted positive slightly less often than males.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group (males) represents 35% of the dataset. This imbalance does not translate into major fairness concerns in the metrics.

---

#### Overall Conclusion
The Random Forest model demonstrates **fairness performance that is strong overall but slightly less balanced than the Decision Tree**; compared with the MLP results reported later in this notebook, every RF gap is equal or smaller in magnitude:  
- Most differences are small and within acceptable thresholds.  
- **Females** have a slight advantage in false positive rate (Equal Odds) and in precision, while **males** are predicted positive marginally more often.  
- The **largest disparities** are **Positive Predictive Parity Difference (0.0235)** and **Equal Odds Difference (-0.0228)**, both still relatively minor.

Overall, the Random Forest model is **reasonably fair**, with no strong systematic gender bias, but with slightly more noticeable differences than the most balanced model.
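
To make the sign convention concrete, the sketch below recomputes three of the summary values from the rounded per-group rates that appear in the stratified performance table later in this notebook. It assumes, based on the matching numbers rather than on the library documentation, that each difference is the unprivileged (female) rate minus the privileged (male) rate. The helper `gap` is hypothetical and not part of fairmlhealth.

```python
# Rounded per-group rates for the Random Forest, copied from the
# stratified performance table (0 = female, 1 = male).
rates = {
    "FPR":       {"female": 0.2488, "male": 0.2716},
    "Precision": {"female": 0.7297, "male": 0.7062},
    "Selection": {"female": 0.4598, "male": 0.4694},  # "Mean Prediction" column
}

def gap(metric, unpriv="female", priv="male"):
    """Return (difference, ratio) as unprivileged vs. privileged.

    Hypothetical helper: it assumes the summary table reports
    unprivileged minus privileged, which the numbers here support.
    """
    u, p = rates[metric][unpriv], rates[metric][priv]
    return round(u - p, 4), round(u / p, 4)

for name in rates:
    diff, ratio = gap(name)
    print(f"{name:<10} diff={diff:+.4f} ratio={ratio:.4f}")
```

The three differences line up with Equal Odds Difference, Positive Predictive Parity Difference and Statistical Parity Difference in the summary table; the ratios can differ in the last digit because the inputs are already rounded.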

---

In [27]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
gender,0,-0.0099,0.986,0.0228,1.0918,-0.0235,0.9679,0.0096,1.021,0.003,1.0044
gender,1,0.0099,1.0142,-0.0228,0.9159,0.0235,1.0332,-0.0096,0.9795,-0.003,0.9956


### Interpretation of Stratified Bias Analysis – Random Forest (by Gender)

The table shows group-specific fairness metrics for the **Random Forest model**, stratified by gender  
(0 = Female, 1 = Male).

Each row uses the listed feature value as the reference group and reports the other group's metric minus (or divided by) the reference group's. This reading follows from the numbers: the `gender = 1` row reproduces the `measure.summary` values obtained with `priv_grp=1`, and its signs match the per-group rates in the stratified performance table below (for example, female FPR 0.2488 minus male FPR 0.2716 gives -0.0228). The two rows are therefore mirror images: differences flip sign and ratios are reciprocals.

---

#### 1. Balanced Accuracy
- **Row `gender = 0` (female as reference):** Difference = **-0.0099**, Ratio = **0.9860**  
- **Row `gender = 1` (male as reference):** Difference = **0.0099**, Ratio = **1.0142**  
➡️ Females have a slight advantage in balanced accuracy (~1 percentage point).

---

#### 2. False Positive Rate (FPR)
- **Row `gender = 0` (female as reference):** FPR Diff = **0.0228**, Ratio = **1.0918**  
- **Row `gender = 1` (male as reference):** FPR Diff = **-0.0228**, Ratio = **0.9159**  
➡️ Males face a **slightly higher false positive rate** (0.2716 vs. 0.2488), while females benefit from fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row `gender = 0` (female as reference):** PPV Diff = **-0.0235**, Ratio = **0.9679**  
- **Row `gender = 1` (male as reference):** PPV Diff = **0.0235**, Ratio = **1.0332**  
➡️ Females have a **small precision advantage** (0.7297 vs. 0.7062), meaning their positive predictions are slightly more reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row `gender = 0` (female as reference):** Selection Diff = **0.0096**, Ratio = **1.0210**  
- **Row `gender = 1` (male as reference):** Selection Diff = **-0.0096**, Ratio = **0.9795**  
➡️ Males are slightly **more likely to be predicted positive** than females (0.4694 vs. 0.4598).

---

#### 5. True Positive Rate (TPR, Recall)
- **Row `gender = 0` (female as reference):** TPR Diff = **0.003**, Ratio = **1.0044**  
- **Row `gender = 1` (male as reference):** TPR Diff = **-0.003**, Ratio = **0.9956**  
➡️ Recall is nearly identical across genders, with males having a marginally higher TPR (0.6734 vs. 0.6705).

---

### Overall Conclusion
- **Females (0):** Slightly **lower false positive rate**, **higher precision**, and higher balanced accuracy, with marginally lower recall and a slightly lower selection rate.  
- **Males (1):** Marginally **higher recall** and slightly more likely to be predicted positive, at the cost of **more false positives** and **lower precision**.  

The disparities are **very small (at most about 2 percentage points)**, meaning the Random Forest model shows **reasonably balanced fairness across gender groups**, with no major bias but some **minor trade-offs** in error distribution.
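
The two rows of the stratified bias table carry the same information from opposite reference points. As a quick consistency check, the sketch below tests whether the printed differences are sign-flipped and the printed ratios are reciprocal within rounding. The values are copied from the table; the helper `is_mirror` and its tolerance are illustrative assumptions, not part of fairmlhealth.

```python
import math

# Rows of the RF stratified bias table, keyed by feature value.
# Each entry is (difference, ratio) as printed by measure.bias.
rows = {
    0: {"FPR": (0.0228, 1.0918), "PPV": (-0.0235, 0.9679),
        "Selection": (0.0096, 1.0210), "TPR": (0.0030, 1.0044)},
    1: {"FPR": (-0.0228, 0.9159), "PPV": (0.0235, 1.0332),
        "Selection": (-0.0096, 0.9795), "TPR": (-0.0030, 0.9956)},
}

def is_mirror(metric, tol=1e-3):
    """True if differences are sign-flipped and ratios are reciprocal."""
    d0, r0 = rows[0][metric]
    d1, r1 = rows[1][metric]
    same_size = math.isclose(d0, -d1, abs_tol=tol)
    reciprocal = math.isclose(r0 * r1, 1.0, abs_tol=tol)
    return same_size and reciprocal

for metric in rows[0]:
    print(metric, is_mirror(metric))
```

Because the rows mirror each other, reading a single row (conventionally the one for the privileged group) is enough to interpret the table.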

---

In [28]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)




Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
ALL FEATURES,ALL VALUES,11256.0,0.4976,0.4631,0.7075,0.6956,0.2568,—,0.7215,0.7651,0.6715
gender,0,7362.0,0.5004,0.4598,0.7108,0.6988,0.2488,—,0.7297,0.7697,0.6705
gender,1,3894.0,0.4923,0.4694,0.7013,0.6895,0.2716,—,0.7062,0.7566,0.6734


### Interpretation of Stratified Performance Metrics – Random Forest (by Gender)

The table shows performance results for the **Random Forest model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7075  
- **F1-Score:** 0.6956  
- **Precision:** 0.7215  
- **ROC AUC:** 0.7651  
- **TPR (Recall):** 0.6715  
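→ The Random Forest achieves **solid predictive performance**, with precision (0.7215) somewhat higher than recall (0.6715); the mean prediction (0.4631) sits below the mean target (0.4976), so the model predicts the positive class slightly less often than it occurs.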

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7108 (slightly higher than males)  
- **F1-Score:** 0.6988 (slightly higher than males)  
- **FPR:** 0.2488 (lower false positive rate than males)  
- **Precision:** 0.7297 (higher than males)  
- **ROC AUC:** 0.7697 (slightly better discrimination ability)  
- **TPR (Recall):** 0.6705 (very close to males)  

➡️ Females achieve **better overall performance** in accuracy, F1, precision, and ROC AUC, while also benefiting from a **lower false positive rate**. Recall is nearly identical to males.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7013 (lower than females)  
- **F1-Score:** 0.6895 (lower than females)  
- **FPR:** 0.2716 (higher false positive rate)  
- **Precision:** 0.7062 (lower than females)  
- **ROC AUC:** 0.7566 (slightly lower discrimination ability)  
- **TPR (Recall):** 0.6734 (almost identical to females)  

➡️ Males perform **slightly worse across most metrics**, with more false positives and lower precision. Recall remains essentially the same as for females.

---

### Overall Conclusion
- **Females (0):** Stronger performance across accuracy, F1, precision, and ROC AUC, plus fewer false positives.  
- **Males (1):** Weaker across most metrics, with higher false positives and lower precision, but nearly equal recall.  

The Random Forest model shows a **small but consistent performance advantage for females**, while maintaining **fairly balanced recall across genders**. This indicates **no severe bias**, though the model leans slightly in favor of females in terms of predictive reliability.
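
The per-group values in this table also explain the Balanced Accuracy Difference reported by the fairness summary. The sketch below derives balanced accuracy and F1 from the rounded TPR, FPR and precision columns using the standard definitions; results can differ from the library output in the last digit because the inputs are rounded.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

# Rounded values from the RF performance table (0 = female, 1 = male).
groups = {
    "female": {"tpr": 0.6705, "fpr": 0.2488, "precision": 0.7297},
    "male":   {"tpr": 0.6734, "fpr": 0.2716, "precision": 0.7062},
}

ba = {g: balanced_accuracy(v["tpr"], v["fpr"]) for g, v in groups.items()}
f1 = {g: f1_score(v["precision"], v["tpr"]) for g, v in groups.items()}
ba_gap = ba["female"] - ba["male"]  # unprivileged minus privileged

for g in groups:
    print(f"{g:<6} balanced accuracy ~ {ba[g]:.4f}, F1 ~ {f1[g]:.4f}")
print(f"balanced accuracy gap (female - male) ~ {ba_gap:.4f}")
```

The gap of roughly 0.01 in favor of females agrees with the Balanced Accuracy Difference of 0.0099 in the summary table.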

---

In [29]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6705
  False Positive Rate (FPR): 0.2488
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6734
  False Positive Rate (FPR): 0.2716
----------------------------------------


### Group-Specific Error Analysis – Random Forest Model

This section evaluates the classification performance of the Random Forest model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6705  | 0.2488  |
| Male (Privileged = 1)        | 0.6734  | 0.2716  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 67.05%**, almost identical to males.  
  - Benefit from a **lower FPR (24.88%)**, meaning fewer false positives compared to males.

- **Males (Privileged):**  
  - Achieve a **TPR of 67.34%**, slightly higher than females, meaning marginally better recall.  
  - However, they also experience a **higher FPR (27.16%)**, meaning they are more often incorrectly classified as positive.

#### Conclusion
The Random Forest model achieves **very balanced recall across genders**:  
- **Males** gain a very small advantage in sensitivity (recall).  
- **Females** benefit from fewer false alarms (lower FPR).  

The differences are **minimal** (≈0.3 percentage points in TPR, ≈2.3 percentage points in FPR), suggesting the model is **highly fair across gender groups** with only negligible trade-offs.
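
For readers who want to see what the two rates measure, here is a minimal sketch that computes TPR and FPR per group directly from labels and predictions. It is a plain-Python illustration under the usual definitions, not the implementation behind `pm.true_positive_rate` or `pm.false_positive_rate`, and the toy data are invented.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions.

    Assumes every group contains at least one positive and one
    negative case; otherwise a division by zero would occur.
    """
    counts = {}
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if t == 1:
            c["tp" if p == 1 else "fn"] += 1
        else:
            c["fp" if p == 1 else "tn"] += 1
    return {
        g: (c["tp"] / (c["tp"] + c["fn"]), c["fp"] / (c["fp"] + c["tn"]))
        for g, c in counts.items()
    }

# Toy data, invented for illustration only (not from the notebook).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]
gender = [0, 0, 0, 0, 1, 1, 1, 1]

print(group_rates(y_true, y_pred, gender))
```

Passing the predictions as an argument, rather than reading them from a global variable, lets the same function serve every model in the notebook.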

---

### Deep Learning Model - Feed Forward Network (MLP)

In [30]:
mlp_df = pd.read_csv("CVDKaggleData_75F25M_MLP_adamtuned_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred    y_prob
0       0       0       0  0.332016
1       0       0       1  0.861841
2       1       0       0  0.411947
3       0       0       0  0.338920
4       0       0       0  0.319818


In [31]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 
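
The MLP predictions come from a separate CSV, while the following cells pass `y_test` as ground truth. It may therefore be worth confirming that both sources are in the same row order before comparing them. The sketch below is one way to do that; `rows_aligned` is a hypothetical helper and the lists are toy stand-ins for `mlp_df["y_true"]` and the `y_test` labels.

```python
def rows_aligned(labels_a, labels_b):
    """True if both label sequences have equal length and identical entries."""
    a, b = list(labels_a), list(labels_b)
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

# Toy stand-ins, invented for illustration only.
csv_labels  = [0, 0, 0, 1, 1]
test_labels = [0, 0, 0, 1, 1]
shuffled    = [0, 1, 0, 0, 1]

print(rows_aligned(csv_labels, test_labels))  # same order
print(rows_aligned(csv_labels, shuffled))     # same counts, different order
```

In the notebook, the call might look like `rows_aligned(y_true_mlp, y_test.iloc[:, 0])`, assuming `y_test` holds the labels in its first column.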

In [32]:
#Run fairmlhealth bias detection for MLP 

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,
    skip_performance=True
)

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0133
               Balanced Accuracy Difference            0.0099
               Balanced Accuracy Ratio                 1.0141
               Disparate Impact Ratio                  0.9429
               Equal Odds Difference                  -0.0421
               Equal Odds Ratio                        0.8616
               Positive Predictive Parity Difference   0.0308
               Positive Predictive Parity Ratio        1.0444
               Statistical Parity Difference          -0.0289
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [33]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

Metric,Measure,Value
Group Fairness,AUC Difference,0.0133
Group Fairness,Balanced Accuracy Difference,0.0099
Group Fairness,Balanced Accuracy Ratio,1.0141
Group Fairness,Disparate Impact Ratio,0.9429
Group Fairness,Equal Odds Difference,-0.0421
Group Fairness,Equal Odds Ratio,0.8616
Group Fairness,Positive Predictive Parity Difference,0.0308
Group Fairness,Positive Predictive Parity Ratio,1.0444
Group Fairness,Statistical Parity Difference,-0.0289
Data Metrics,Prevalence of Privileged Class (%),35.0


### Interpretation of MLP Fairness Metrics by Gender

The table reports fairness metrics for the **Multilayer Perceptron (MLP) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0133):**  
  → Very small; ranking performance is nearly identical across genders.

- **Balanced Accuracy Difference (0.0099)** and **Ratio (1.0141):**  
  → Balanced accuracy is almost equal across genders, with only a ~1% difference.

- **Disparate Impact Ratio (0.9429):**  
  → Slightly below 1, suggesting that females are somewhat less likely to receive positive predictions than males.  
  → Still within the generally acceptable fairness range (0.8–1.25).

- **Equal Odds Difference (-0.0421)** and **Equal Odds Ratio (0.8616):**  
  → This is the **largest disparity** observed. It matches the FPR gap between the groups (female 0.2624 vs. male 0.3045 in the per-group results later in this notebook), so females receive fewer false positives.  
  → The ratio of 0.8616 is still inside the 0.8–1.25 range, but it is the value closest to the lower bound.

- **Positive Predictive Parity Difference (0.0308)** and **Ratio (1.0444):**  
  → Precision is marginally higher for females (0.7251 vs. 0.6943 in the per-group results later in this notebook), meaning their positive predictions are a bit more reliable.

- **Statistical Parity Difference (-0.0289):**  
  → Indicates females are predicted positive at a slightly lower rate than males, but the difference is modest.

- **Prevalence of Privileged Class (35%):**  
  → Males make up about 35% of the test set (3,894 of 11,256 observations). Whether this imbalance contributes to the fairness gaps cannot be determined from these metrics alone.

---

#### Overall Conclusion
The MLP model achieves **reasonably balanced fairness**, but with **some small disparities**:  
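- **Females**: Fewer false positives (negative Equal Odds Difference) and slightly higher precision, but **less likely to be predicted positive** overall.  
- **Males**: Higher positive prediction rate and, per the per-group results later in this notebook, slightly higher recall, at the cost of more false positives.  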

The most notable metric is **Equal Odds Difference (-0.0421)**, suggesting some imbalance in error distribution. Still, all disparities remain **relatively minor**, keeping the model within **acceptable fairness bounds**, though less balanced than the Decision Tree model.
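
The 0.8–1.25 range quoted above can be applied to every ratio measure in the table. The sketch below does this for the MLP values and reports how far each ratio sits from the nearer bound. The bounds are taken from the text; the `bounds` object passed to `MyFlagger` in this notebook may be configured differently, and `within_range` is a hypothetical helper.

```python
LOW, HIGH = 0.8, 1.25  # range quoted in the interpretation above

def within_range(value, low=LOW, high=HIGH):
    """True if a fairness ratio lies inside [low, high]."""
    return low <= value <= high

# Ratio measures from the MLP summary table.
mlp_ratios = {
    "Balanced Accuracy Ratio": 1.0141,
    "Disparate Impact Ratio": 0.9429,
    "Equal Odds Ratio": 0.8616,
    "Positive Predictive Parity Ratio": 1.0444,
}

# Distance of each ratio to the nearer bound; smaller means closer to a flag.
margins = {name: min(v - LOW, HIGH - v) for name, v in mlp_ratios.items()}

for name, value in mlp_ratios.items():
    print(f"{name:<34} {value:.4f} ok={within_range(value)} margin={margins[name]:.4f}")
```

All four ratios pass, with the Equal Odds Ratio leaving the smallest margin.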

---

In [34]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
gender,0,-0.0099,0.9861,0.0421,1.1606,-0.0308,0.9575,0.0289,1.0606,0.0223,1.0322
gender,1,0.0099,1.0141,-0.0421,0.8616,0.0308,1.0444,-0.0289,0.9429,-0.0223,0.9688


### Interpretation of Stratified Bias Analysis – MLP (by Gender)

The table shows group-specific fairness metrics for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

Each row uses the listed feature value as the reference group and reports the other group's metric minus (or divided by) the reference group's. This reading follows from the numbers: the `gender = 1` row reproduces the `measure.summary` values obtained with `priv_grp=1`, and its signs match the per-group rates in the stratified performance table below (for example, female FPR 0.2624 minus male FPR 0.3045 gives -0.0421).

---

#### 1. Balanced Accuracy
- **Row `gender = 0` (female as reference):** Difference = **-0.0099**, Ratio = **0.9861**  
- **Row `gender = 1` (male as reference):** Difference = **0.0099**, Ratio = **1.0141**  
➡️ Females show a slight advantage in balanced accuracy (~1 percentage point).

---

#### 2. False Positive Rate (FPR)
- **Row `gender = 0` (female as reference):** FPR Diff = **0.0421**, Ratio = **1.1606**  
- **Row `gender = 1` (male as reference):** FPR Diff = **-0.0421**, Ratio = **0.8616**  
➡️ Males experience a **higher false positive rate** (0.3045 vs. 0.2624), meaning they are more often incorrectly classified as positive.  
➡️ Females benefit from fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row `gender = 0` (female as reference):** PPV Diff = **-0.0308**, Ratio = **0.9575**  
- **Row `gender = 1` (male as reference):** PPV Diff = **0.0308**, Ratio = **1.0444**  
➡️ Females achieve **higher precision** (0.7251 vs. 0.6943), meaning their positive predictions are more reliable.  
➡️ Males show a small disadvantage in precision.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row `gender = 0` (female as reference):** Selection Diff = **0.0289**, Ratio = **1.0606**  
- **Row `gender = 1` (male as reference):** Selection Diff = **-0.0289**, Ratio = **0.9429**  
➡️ Males are **slightly more likely to be predicted positive** than females (0.5056 vs. 0.4768).  
➡️ Females are less frequently predicted positive.

---

#### 5. True Positive Rate (TPR, Recall)
- **Row `gender = 0` (female as reference):** TPR Diff = **0.0223**, Ratio = **1.0322**  
- **Row `gender = 1` (male as reference):** TPR Diff = **-0.0223**, Ratio = **0.9688**  
➡️ Males achieve **slightly higher recall** (0.7131 vs. 0.6908), detecting more true positives.  
➡️ Females experience slightly more false negatives.

---

### Overall Conclusion
- **Females (0):**  
  - Advantages: **Fewer false positives** and **higher precision**.  
  - Disadvantages: Slightly **lower recall** and **less likely to be predicted positive**.  

- **Males (1):**  
  - Advantages: Slightly **higher recall** and **higher selection rate**.  
  - Disadvantages: **Higher false positive rate** and **lower precision**.  

The disparities are **modest (mostly 2–4 percentage points)**, indicating the MLP model is **fair overall** but introduces a **small trade-off**:  
- **Males** are detected more often but at the cost of more false alarms.  
- **Females** receive more reliable positive predictions but risk having more true positives missed.
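
To put the MLP gaps in context, the sketch below lines up the `gender = 1` rows of the Random Forest and MLP stratified bias tables and reports which model shows the larger absolute gap for each measure. The values are copied from the two tables; the helper `larger_gap` is a hypothetical name used only for this comparison.

```python
# (RF, MLP) values from the gender = 1 rows of the two stratified bias tables.
gaps = {
    "Balanced Accuracy Difference": (0.0099, 0.0099),
    "FPR Diff":       (-0.0228, -0.0421),
    "PPV Diff":       (0.0235, 0.0308),
    "Selection Diff": (-0.0096, -0.0289),
    "TPR Diff":       (-0.0030, -0.0223),
}

def larger_gap(rf, mlp):
    """Name the model with the larger absolute gap, or 'tie'."""
    if abs(rf) == abs(mlp):
        return "tie"
    return "RF" if abs(rf) > abs(mlp) else "MLP"

for name, (rf, mlp) in gaps.items():
    print(f"{name:<30} RF={rf:+.4f} MLP={mlp:+.4f} larger: {larger_gap(rf, mlp)}")
```

On these numbers the MLP has the larger gap for every measure except balanced accuracy, where the two models tie.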

---

In [35]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)




Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
ALL FEATURES,ALL VALUES,11256.0,0.4976,0.4868,0.7107,0.7061,0.2771,—,0.714,0.7679,0.6984
gender,0,7362.0,0.5004,0.4768,0.7142,0.7075,0.2624,—,0.7251,0.7725,0.6908
gender,1,3894.0,0.4923,0.5056,0.7042,0.7036,0.3045,—,0.6943,0.7592,0.7131


### Interpretation of Stratified Performance Metrics – MLP (by Gender)

The table shows performance results for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7107  
- **F1-Score:** 0.7061  
- **Precision:** 0.7140  
- **ROC AUC:** 0.7679  
- **TPR (Recall):** 0.6984  
→ The MLP achieves **solid performance**, balancing recall and precision with good discrimination ability.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7142 (slightly higher than males)  
- **F1-Score:** 0.7075 (slightly higher)  
- **FPR:** 0.2624 (lower false positive rate)  
- **Precision:** 0.7251 (higher precision)  
- **ROC AUC:** 0.7725 (better discrimination ability)  
- **TPR (Recall):** 0.6908 (slightly lower than males)  

➡️ Females benefit from **better accuracy, precision, and ROC AUC**, and fewer false positives. Their recall is slightly lower compared to males.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7042 (lower than females)  
- **F1-Score:** 0.7036 (slightly lower)  
- **FPR:** 0.3045 (higher false positive rate)  
- **Precision:** 0.6943 (lower precision)  
- **ROC AUC:** 0.7592 (weaker discrimination ability)  
- **TPR (Recall):** 0.7131 (slightly higher than females)  

➡️ Males achieve **better recall (TPR)**, meaning fewer missed positives, but at the cost of **higher false positives and lower precision**.

---

### Overall Conclusion
- **Females (0):** Stronger performance in precision, accuracy, ROC AUC, and fewer false alarms.  
- **Males (1):** Advantage in recall, detecting more positives, but suffer from more false positives and less reliable predictions.  

The MLP model demonstrates a **small performance trade-off**:  
- **Females** → more reliable predictions (higher precision, fewer false positives).  
- **Males** → better sensitivity (higher recall), but with more false alarms.  

The differences are **moderate but not extreme**, suggesting the MLP remains **reasonably fair across genders**, with minor imbalances in error distribution.
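
The overall row of the performance table relates to the group rows through the observation counts. As a worked check, the sketch below recovers the overall accuracy as the observation-weighted mean of the two group accuracies and computes the share of the privileged group. Inputs are the rounded table values, so the last digit may differ.

```python
# Observation counts and per-group accuracy from the MLP performance table.
groups = {
    "female": {"n": 7362, "accuracy": 0.7142},
    "male":   {"n": 3894, "accuracy": 0.7042},
}

total = sum(g["n"] for g in groups.values())

# Overall accuracy is the observation-weighted mean of the group accuracies.
overall_accuracy = sum(g["n"] * g["accuracy"] for g in groups.values()) / total

# Share of the privileged group (male) in the test set, in percent.
male_share = 100.0 * groups["male"]["n"] / total

print(f"total observations = {total}")
print(f"overall accuracy   ~ {overall_accuracy:.4f}")
print(f"male share         ~ {male_share:.1f}%")
```

Because females account for roughly two thirds of the observations, the overall figures sit closer to the female values than to the male ones.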

---

In [36]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6908
  False Positive Rate (FPR): 0.2624
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7131
  False Positive Rate (FPR): 0.3045
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section evaluates the classification performance of the MLP model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6908  | 0.2624  |
| Male (Privileged = 1)        | 0.7131  | 0.3045  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 69.08%**, slightly lower than males.  
  - Benefit from a **lower FPR (26.24%)**, meaning fewer false positives.

- **Males (Privileged):**  
  - Achieve a **TPR of 71.31%**, slightly higher than females, meaning more true positives are detected.  
  - However, they also suffer from a **higher FPR (30.45%)**, meaning more false positives occur.

#### Conclusion
The MLP model shows a **trade-off between the groups**:  
- **Males** → higher sensitivity (better recall), but at the cost of more false alarms.  
- **Females** → fewer false positives, but at the cost of slightly reduced recall.  

The differences (≈2.2 percentage points in TPR, ≈4.2 percentage points in FPR) are **small**, suggesting the model maintains **fair performance across genders** with only minor imbalances in error distribution.
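
A rate gap can be easier to judge when expressed in cases. The sketch below gives a rough estimate of how many additional false positives the male group receives compared with a scenario in which it had the female FPR. It assumes that "Mean Target" in the performance table is the share of positive cases in each group, and it uses rounded table values, so the result is approximate.

```python
# Rounded values from the MLP performance table (1 = male).
male = {"n": 3894, "mean_target": 0.4923, "fpr": 0.3045}
female_fpr = 0.2624

# Approximate number of male cases whose true label is negative.
male_negatives = male["n"] * (1.0 - male["mean_target"])

# False positives at the observed male FPR and at the female FPR.
fp_observed = male_negatives * male["fpr"]
fp_at_female_rate = male_negatives * female_fpr
extra_false_positives = fp_observed - fp_at_female_rate

print(f"male negatives                ~ {male_negatives:.0f}")
print(f"false positives (observed)    ~ {fp_observed:.0f}")
print(f"false positives (female FPR)  ~ {fp_at_female_rate:.0f}")
print(f"difference                    ~ {extra_false_positives:.0f}")
```

On these rounded inputs the gap corresponds to somewhat more than 80 cases out of nearly 2,000 male negatives.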

---