### Bias Detection and Fairness Evaluation on Cardiovascular Disease (Kaggle) using FairMLhealth
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset)

In [1]:
import pandas as pd

# Load the held-out test features (X_test) and labels (y_test)
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
# Inspect the installed fairmlhealth package, especially its version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
# List the names exposed at the top level of the fairmlhealth package

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
# Load the necessary modules

# measure.summary is used for bias detection
from fairmlhealth import measure

# Decorator for investigating individual cohorts
from fairmlhealth.__utils import iterate_cohorts

# FairRanges supplies the bounds used to flag out-of-range values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)



In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [7]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("CVDKaggleData_75M25F_PCAKNN_predictions.csv")

print(knn_df.head())

   gender  y_true  y_prob  y_pred
0       0       0    0.40       0
1       0       0    0.75       1
2       1       0    0.20       0
3       0       0    0.40       0
4       0       0    1.00       1


In [8]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["gender"].values

# Use gender_knn as the protected attribute (0/1 as in your CSV)
protected_attr_knn = gender_knn

In [9]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0064
               Balanced Accuracy Difference            0.0095
               Balanced Accuracy Ratio                 1.0141
               Disparate Impact Ratio                  1.0780
               Equal Odds Difference                   0.0425
               Equal Odds Ratio                        1.0817
               Positive Predictive Parity Difference   0.0040
               Positive Predictive Parity Ratio        1.0059
               Statistical Parity Difference           0.0358
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [10]:
# Custom, scenario-oriented fairness bounds (lower, upper) per measure

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)

In [11]:
# Restore Styler.set_precision if this pandas version lacks it (the styled, flagged table below needs it)
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [12]:
# Flag metrics that fall outside the acceptable fairness bounds in the current table

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric         | Measure                               | Value  |
|----------------|---------------------------------------|--------|
| Group Fairness | AUC Difference                        | 0.0064 |
| Group Fairness | Balanced Accuracy Difference          | 0.0095 |
| Group Fairness | Balanced Accuracy Ratio               | 1.0141 |
| Group Fairness | Disparate Impact Ratio                | 1.078  |
| Group Fairness | Equal Odds Difference                 | 0.0425 |
| Group Fairness | Equal Odds Ratio                      | 1.0817 |
| Group Fairness | Positive Predictive Parity Difference | 0.004  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0059 |
| Group Fairness | Statistical Parity Difference         | 0.0358 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0   |
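
The comparison against the custom bounds can be reproduced by hand as a cross-check on the flagged table. This is a minimal sketch in plain Python, not the logic of fairmlhealth's `Flagger`: the helper `out_of_range` is a hypothetical name, the bounds are copied from `custom_ranges`, and the values are copied from the KNN summary.

```python
# Bounds copied from custom_ranges (lower, upper), keyed by lower-case measure name.
custom_ranges = {
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

# Values copied from the KNN fairness summary.
knn_values = {
    "AUC Difference": 0.0064,
    "Balanced Accuracy Difference": 0.0095,
    "Disparate Impact Ratio": 1.0780,
    "Equal Odds Difference": 0.0425,
    "Statistical Parity Difference": 0.0358,
}


def out_of_range(values, ranges):
    """Hypothetical helper: return the measures whose value lies outside its bound."""
    flagged = {}
    for name, value in values.items():
        bound = ranges.get(name.lower())
        if bound is not None and not (bound[0] <= value <= bound[1]):
            flagged[name] = (value, bound)
    return flagged


flagged_knn = out_of_range(knn_values, custom_ranges)
for name, (value, (lower, upper)) in flagged_knn.items():
    print(f"{name}: {value} is outside [{lower}, {upper}]")
```

With these inputs, only Equal Odds Difference (0.0425 against a bound of ±0.04) falls outside its range.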


### Interpretation of KNN Fairness Metrics by Gender

The table summarizes multiple fairness metrics for the KNN model with **gender** as the sensitive attribute.

#### Key Observations:

- **AUC Difference (0.0064)** and **Balanced Accuracy Difference (0.0095)** are very small.  
  → The model performs almost equally well across gender groups in terms of ranking ability and balanced accuracy.

- **Balanced Accuracy Ratio (1.0141)** is close to 1, further confirming parity in performance across groups.

- **Disparate Impact Ratio (1.0780)** is slightly above 1.  
  → This means the positive outcome rate is a bit higher for the unprivileged group, but still within both the commonly accepted fairness range (0.8–1.25) and the stricter custom range (0.9–1.1).

- **Equal Odds Difference (0.0425)** is modest.  
  → Suggests a small difference in error rates (TPR and FPR) between genders. Although not extreme, it lies just outside the custom ±0.04 bound set in `custom_ranges`.

- **Equal Odds Ratio (1.0817)** also indicates near parity, with only minor deviation between groups.

- **Positive Predictive Parity Difference (0.0040)** and **Ratio (1.0059)** are almost negligible.  
  → The precision (likelihood that a predicted positive is correct) is essentially the same for both genders.

- **Statistical Parity Difference (0.0358)** is small.  
  → The overall probability of receiving a positive prediction differs only slightly between genders.

- **Prevalence of Privileged Class (35%)**: The privileged group constitutes 35% of the test set.  
  → This group imbalance may influence fairness metrics but does not appear to create strong disparities.

#### Overall Conclusion:
The KNN model shows **minor fairness disparities across gender**. Most metrics fall within both the common 0.8–1.25 ratio range and the stricter custom bounds in `custom_ranges`; the exception is **Equal Odds Difference (0.0425)**, which slightly exceeds the custom ±0.04 bound. The **Disparate Impact Ratio (1.0780)** shows some variation but stays in range. Overall, the model does not exhibit strong gender bias, although the equal-odds gap is worth monitoring.

---
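
To make the definitions behind these group fairness measures concrete, the sketch below computes them from raw labels and predictions. It uses toy data rather than the CVD predictions, the helper names are hypothetical, and the sign convention (unprivileged minus privileged, unprivileged over privileged) is inferred from the tables in this notebook rather than taken from the fairmlhealth source.

```python
def rate(flags):
    """Mean of a list of 0/1 values."""
    return sum(flags) / len(flags)


def group_rates(y_true, y_pred, group, value):
    """Selection rate, TPR and FPR for the rows where group == value."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, group) if g == value]
    selection = rate([p for _, p in rows])
    tpr = rate([p for t, p in rows if t == 1])  # actual positives predicted positive
    fpr = rate([p for t, p in rows if t == 0])  # actual negatives predicted positive
    return selection, tpr, fpr


# Toy data (illustrative only, not taken from the CVD dataset).
y_true = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
gender = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, gender, 0)  # unprivileged
sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, gender, 1)  # privileged

stat_parity_diff = sel_u - sel_p   # unprivileged minus privileged
disparate_impact = sel_u / sel_p   # unprivileged over privileged
tpr_diff = tpr_u - tpr_p
fpr_diff = fpr_u - fpr_p

print(f"statistical parity difference: {stat_parity_diff:+.4f}")
print(f"disparate impact ratio:        {disparate_impact:.4f}")
print(f"TPR diff: {tpr_diff:+.4f}, FPR diff: {fpr_diff:+.4f}")
```

In this toy case both groups are selected at the same rate, so statistical parity holds, yet the TPR and FPR differ between groups. This is why the selection-based and error-rate-based measures are read side by side.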

In [13]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


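|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | -0.0095                      | 0.9861                  | -0.0234  | 0.9245    | -0.004   | 0.9942    | -0.0358        | 0.9277          | -0.0425  | 0.9375    |
| 1 | gender       | 1             | 0.0095                       | 1.0141                  | 0.0234   | 1.0817    | 0.004    | 1.0059    | 0.0358         | 1.078           | 0.0425   | 1.0667    |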


### Interpretation of Stratified Bias Analysis – KNN (by Gender)

The table reports **bias metrics stratified by gender (0 = unprivileged, 1 = privileged)** for the KNN model. Each row treats its feature value as the reference group: the gender = 1 row reproduces the `measure.summary` values above (computed with `priv_grp=1`), so a positive Diff in that row means the rate is higher for gender 0. The per-group rates in the performance table below confirm this direction (for example, TPR 0.6794 for gender 0 versus 0.6369 for gender 1).

#### 1. Balanced Accuracy
- **Gender 0 row:** Balanced Accuracy Difference = **-0.0095**, Ratio = **0.9861**  
- **Gender 1 row:** Balanced Accuracy Difference = **0.0095**, Ratio = **1.0141**  
➡️ Performance is very similar across genders, differing by about **one percentage point** in favor of the unprivileged group.

#### 2. False Positive Rate (FPR)
- **Gender 0 row:** FPR Diff = **-0.0234**, Ratio = **0.9245**  
- **Gender 1 row:** FPR Diff = **0.0234**, Ratio = **1.0817**  
➡️ The unprivileged group has a **slightly higher false positive rate** than the privileged group.

#### 3. Positive Predictive Value (PPV, i.e., precision)
- **Gender 0 row:** PPV Diff = **-0.0040**, Ratio = **0.9942**  
- **Gender 1 row:** PPV Diff = **0.0040**, Ratio = **1.0059**  
➡️ Precision is nearly identical across groups; the difference of 0.4 percentage points is negligible.

#### 4. Selection Rate (likelihood of being predicted positive)
- **Gender 0 row:** Selection Diff = **-0.0358**, Ratio = **0.9277**  
- **Gender 1 row:** Selection Diff = **0.0358**, Ratio = **1.0780**  
➡️ Unprivileged individuals are **slightly more likely** to receive positive predictions.

#### 5. True Positive Rate (TPR, recall/sensitivity)
- **Gender 0 row:** TPR Diff = **-0.0425**, Ratio = **0.9375**  
- **Gender 1 row:** TPR Diff = **0.0425**, Ratio = **1.0667**  
➡️ The unprivileged group has a **slightly higher recall**, meaning fewer false negatives. At 0.0425, this gap lies outside the custom `tpr diff` bound of ±0.03.

---

### Overall Conclusion
The stratified analysis shows that:
- **Performance differences are small** (balanced accuracy and precision are nearly equal).  
- **Unprivileged group (0)** has **slightly higher recall and selection rate**, meaning its members are more often predicted positive and less often missed.  
- **Privileged group (1)** faces slightly fewer false positives but also slightly lower recall.  

All differences remain relatively minor, suggesting that the KNN model is **fairly balanced across gender groups**, though it leans slightly toward the unprivileged group in terms of recall and positive prediction likelihood.

---

In [14]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)

# Replace NaN with a dash
perf_table_knn = perf_table_knn.fillna("—")

# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


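|   | Feature Name | Feature Value | Obs.    | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|---------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 11256.0 | 0.4976      | 0.4826          | 0.6815   | 0.6751   | 0.302  | —      | 0.6856    | 0.7218  | 0.6649 |
| 1 | gender       | 0             | 7362.0  | 0.5004      | 0.495           | 0.6846   | 0.6831   | 0.3102 | —      | 0.6869    | 0.7248  | 0.6794 |
| 2 | gender       | 1             | 3894.0  | 0.4923      | 0.4592          | 0.6757   | 0.6591   | 0.2868 | —      | 0.6829    | 0.7184  | 0.6369 |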


### Interpretation of Performance Metrics – KNN by Gender

The table shows performance metrics of the KNN model, stratified by gender groups (0 = unprivileged, 1 = privileged).

#### Overall Performance (All Features)
- **Accuracy:** 0.6815  
- **F1-Score:** 0.6751  
- **Precision:** 0.6856  
- **ROC AUC:** 0.7218  
- **TPR (Recall):** 0.6649  
→ The KNN model shows **moderate predictive performance**, with balanced precision and recall.

---

#### Group-Specific Performance

**1. Gender = 0 (Unprivileged Group)**
- **Accuracy:** 0.6846 (slightly higher than privileged group)  
- **F1-Score:** 0.6831 (better balance of precision and recall)  
- **FPR:** 0.3102 (higher false positive rate than privileged group)  
- **Precision:** 0.6869 (slightly higher than privileged group)  
- **ROC AUC:** 0.7248 (slightly better discrimination ability)  
- **TPR (Recall):** 0.6794 (higher sensitivity than privileged group)  

➡️ The unprivileged group benefits from **better recall and F1-score**, but also experiences a **higher false positive rate**.

---

**2. Gender = 1 (Privileged Group)**
- **Accuracy:** 0.6757 (slightly lower than unprivileged group)  
- **F1-Score:** 0.6591 (lower balance of precision and recall)  
- **FPR:** 0.2868 (lower false positive rate than unprivileged group)  
- **Precision:** 0.6829 (comparable to unprivileged group)  
- **ROC AUC:** 0.7184 (slightly lower than unprivileged group)  
- **TPR (Recall):** 0.6369 (lower sensitivity, more false negatives)  

➡️ The privileged group experiences **fewer false positives**, but at the cost of **lower recall and F1-score**.

---

### Overall Conclusion
- **Unprivileged group (0):** Higher recall and F1-score → better at identifying true positives, but with more false positives.  
- **Privileged group (1):** Lower false positives but worse recall and F1-score, meaning more missed positives.  
- **Fairness perspective:** The performance differences are small (within a few percentage points), suggesting **no severe bias**, but the model leans slightly toward favoring the unprivileged group in terms of predictive sensitivity.

---
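
The per-group rates in the performance table can be tied back to the Balanced Accuracy Difference and Ratio in the fairness summary. The sketch below assumes the standard definition, balanced accuracy = (TPR + TNR) / 2 with TNR = 1 - FPR, and uses the rounded KNN rates from the table, so the results agree with the reported 0.0095 and 1.0141 only up to rounding. The function name is illustrative.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# Rounded KNN rates copied from the performance table.
ba_gender0 = balanced_accuracy(tpr=0.6794, fpr=0.3102)  # unprivileged
ba_gender1 = balanced_accuracy(tpr=0.6369, fpr=0.2868)  # privileged

ba_diff = ba_gender0 - ba_gender1    # unprivileged minus privileged
ba_ratio = ba_gender0 / ba_gender1   # unprivileged over privileged

print(f"balanced accuracy: gender 0 = {ba_gender0:.4f}, gender 1 = {ba_gender1:.4f}")
print(f"difference = {ba_diff:.4f}, ratio = {ba_ratio:.4f}")
```

Because gender 0 has both the higher TPR and the higher FPR, the two effects partly cancel in balanced accuracy, which is why this gap is smaller than the TPR gap alone.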

In [15]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6794
  False Positive Rate (FPR): 0.3102
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6369
  False Positive Rate (FPR): 0.2868
----------------------------------------


### Group-Specific Error Analysis – KNN Model

This section evaluates the classification performance of the KNN model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6794  | 0.3102  |
| Male (Privileged = 1)        | 0.6369  | 0.2868  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **higher TPR (67.94%)**, meaning the model correctly identifies more true positives compared to males.  
  - However, they also have a **higher FPR (31.02%)**, meaning more false positives occur.

- **Males (Privileged):**  
  - Have a **lower TPR (63.69%)**, so the model misses more true positives in this group.  
  - At the same time, they experience a **lower FPR (28.68%)**, resulting in fewer false positives.

#### Conclusion
The KNN model shows a **trade-off between sensitivity and false positives** across genders:  
- It is **more sensitive for females** (higher recall), but at the cost of more false alarms.  
- For **males**, the model makes **fewer false alarms**, but also fails to detect more true cases.  
Overall, the differences are moderate and suggest a slight imbalance rather than a strong systematic bias.

---

### Decision Tree - DT

In [16]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("CVDKaggleData_75M25F_DT_tuned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred_dt    y_prob
0       0       0          0  0.229598
1       0       0          1  0.872696
2       1       0          0  0.386614
3       0       0          0  0.386614
4       0       0          0  0.326141


In [17]:
# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred_dt"].values
gender_dt = dt_df["gender"].values

# Use gender_dt as the protected attribute (0/1 as in the CSV)
protected_attr_dt = gender_dt


In [18]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0123
               Balanced Accuracy Difference            0.0008
               Balanced Accuracy Ratio                 1.0011
               Disparate Impact Ratio                  0.9782
               Equal Odds Difference                  -0.0155
               Equal Odds Ratio                        0.9508
               Positive Predictive Parity Difference   0.0134
               Positive Predictive Parity Ratio        1.0193
               Statistical Parity Difference          -0.0114
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [19]:
# Flag metrics that fall outside the acceptable fairness bounds in the current table

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | 0.0123  |
| Group Fairness | Balanced Accuracy Difference          | 0.0008  |
| Group Fairness | Balanced Accuracy Ratio               | 1.0011  |
| Group Fairness | Disparate Impact Ratio                | 0.9782  |
| Group Fairness | Equal Odds Difference                 | -0.0155 |
| Group Fairness | Equal Odds Ratio                      | 0.9508  |
| Group Fairness | Positive Predictive Parity Difference | 0.0134  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0193  |
| Group Fairness | Statistical Parity Difference         | -0.0114 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0    |


### Interpretation of Decision Tree Fairness Metrics by Gender

The table summarizes fairness metrics for the **Decision Tree model**, using gender as the sensitive attribute.

#### Key Observations

- **AUC Difference (0.0123):** Very small.  
  → The ability to rank positives and negatives is nearly identical between gender groups.

- **Balanced Accuracy Difference (0.0008)** and **Ratio (1.0011):**  
  → Practically no disparity; the model achieves nearly equal accuracy across genders.

- **Disparate Impact Ratio (0.9782):**  
  → Close to 1, showing that the probability of receiving a positive prediction is similar across genders (well within the acceptable fairness range 0.8–1.25).

- **Equal Odds Difference (-0.0155)** and **Ratio (0.9508):**  
  → Small deviation in terms of error rates (TPR/FPR). The negative sign means the rate is slightly lower for the unprivileged group than for the privileged group; the value matches the FPR difference in the stratified table below.

- **Positive Predictive Parity Difference (0.0134)** and **Ratio (1.0193):**  
  → Precision is nearly the same for both genders; differences are negligible.

- **Statistical Parity Difference (-0.0114):**  
  → Indicates a very small disparity in positive prediction rates: the unprivileged group is predicted positive slightly less often.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group makes up 35% of the dataset, but the metrics show that this imbalance does not lead to major fairness issues.

---

#### Overall Conclusion
The Decision Tree model exhibits **very balanced fairness performance across gender groups**.  
- Disparities in AUC, balanced accuracy, predictive parity, and statistical parity are minimal.  
- All fairness metrics remain well within accepted thresholds, including the custom bounds in `custom_ranges`, suggesting that the model does **not introduce meaningful gender bias**.  
If anything, the unprivileged group receives **slightly fewer positive predictions and false positives**, but the differences are too small to be concerning.

---

In [20]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


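|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | -0.0008                      | 0.9989                  | 0.0155   | 1.0517    | -0.0134  | 0.981     | 0.0114         | 1.0223          | 0.0139   | 1.0195    |
| 1 | gender       | 1             | 0.0008                       | 1.0011                  | -0.0155  | 0.9508    | 0.0134   | 1.0193    | -0.0114        | 0.9782          | -0.0139  | 0.9809    |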


### Interpretation of Stratified Bias Analysis – Decision Tree (by Gender)

The table shows group-specific fairness metrics for the **Decision Tree model**, stratified by gender  
(0 = Female, 1 = Male). Each row treats its feature value as the reference group: the gender = 1 row matches the `measure.summary` output above (`priv_grp=1`), so a negative Diff in that row means the rate is lower for females than for males. The per-group rates in the performance table below confirm this direction (for example, FPR 0.2996 for females versus 0.3151 for males).

---

#### 1. Balanced Accuracy
- **Gender 0 row:** Difference = **-0.0008**, Ratio = **0.9989**  
- **Gender 1 row:** Difference = **0.0008**, Ratio = **1.0011**  
➡️ Practically no disparity in balanced accuracy; both genders are treated equally well.

---

#### 2. False Positive Rate (FPR)
- **Gender 0 row:** FPR Diff = **0.0155**, Ratio = **1.0517**  
- **Gender 1 row:** FPR Diff = **-0.0155**, Ratio = **0.9508**  
➡️ Males experience a **slightly higher false positive rate**, while females receive fewer false positives. The difference is small.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Gender 0 row:** PPV Diff = **-0.0134**, Ratio = **0.9810**  
- **Gender 1 row:** PPV Diff = **0.0134**, Ratio = **1.0193**  
➡️ Precision is nearly identical across genders, with only a slight advantage for females.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Gender 0 row:** Selection Diff = **0.0114**, Ratio = **1.0223**  
- **Gender 1 row:** Selection Diff = **-0.0114**, Ratio = **0.9782**  
➡️ Males are slightly **more likely to be predicted positive** than females.

---

#### 5. True Positive Rate (TPR, Recall)
- **Gender 0 row:** TPR Diff = **0.0139**, Ratio = **1.0195**  
- **Gender 1 row:** TPR Diff = **-0.0139**, Ratio = **0.9809**  
➡️ Males have a **slightly higher recall**, while females experience slightly more false negatives.

---

### Overall Conclusion
- **Differences across all fairness metrics are minimal** (all absolute differences ≤ 0.0155, ratios very close to 1).  
- **Males (1)** have a slight advantage in **recall and selection rate**, but at the cost of a somewhat **higher false positive rate**.  
- **Females (0)** benefit from fewer false positives and slightly higher precision, but at the cost of slightly lower recall and selection rate.  

Overall, the Decision Tree model demonstrates **very balanced fairness across gender groups**, with only negligible disparities that do not indicate systematic bias.

---

In [21]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


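|   | Feature Name | Feature Value | Obs.    | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|---------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 11256.0 | 0.4976      | 0.5123          | 0.7082   | 0.7111   | 0.305  | —      | 0.7009    | 0.7593  | 0.7217 |
| 1 | gender       | 0             | 7362.0  | 0.5004      | 0.5084          | 0.7086   | 0.7112   | 0.2996 | —      | 0.7056    | 0.7637  | 0.7169 |
| 2 | gender       | 1             | 3894.0  | 0.4923      | 0.5198          | 0.7075   | 0.711    | 0.3151 | —      | 0.6922    | 0.7514  | 0.7308 |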


### Interpretation of Performance Metrics – Decision Tree by Gender

The table shows performance metrics for the **Decision Tree model**, stratified by gender  
(0 = Female / Unprivileged, 1 = Male / Privileged).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7082  
- **F1-Score:** 0.7111  
- **Precision:** 0.7009  
- **ROC AUC:** 0.7593  
- **TPR (Recall):** 0.7217  
→ The Decision Tree achieves **moderate predictive performance**, with a balanced trade-off between precision and recall.

---

#### Group-Specific Performance

**1. Female (Unprivileged = 0)**  
- **Accuracy:** 0.7086 (slightly higher than males)  
- **F1-Score:** 0.7112 (virtually identical to males)  
- **FPR:** 0.2996 (lower false positive rate than males)  
- **Precision:** 0.7056 (slightly higher precision than males)  
- **ROC AUC:** 0.7637 (better discrimination ability than males)  
- **TPR (Recall):** 0.7169 (slightly lower recall than males)  

➡️ Female group shows **better precision and ROC AUC**, meaning predictions are more reliable and the model separates classes slightly better for females. However, recall is marginally lower.

---

**2. Male (Privileged = 1)**  
- **Accuracy:** 0.7075 (slightly lower than females)  
- **F1-Score:** 0.7110 (nearly identical to females)  
- **FPR:** 0.3151 (higher false positive rate than females)  
- **Precision:** 0.6922 (lower than females)  
- **ROC AUC:** 0.7514 (slightly worse than females)  
- **TPR (Recall):** 0.7308 (higher recall than females)  

➡️ Male group benefits from a **higher recall (73.08%)**, meaning fewer missed positives, but this comes at the cost of **lower precision and a higher false positive rate**.

---

### Overall Conclusion
- **Females (Unprivileged):** Stronger in **precision and ROC AUC**, meaning more reliable positive predictions and slightly better class separation, with fewer false positives.  
- **Males (Privileged):** Stronger in **recall**, meaning more positives are detected, but at the cost of more false alarms and slightly lower precision.  

The model demonstrates **balanced performance across genders**, with only **minor trade-offs**: females get fewer false positives, while males get more true positives. This indicates **no substantial gender bias** but rather a small sensitivity–specificity trade-off between groups.

---
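
The near-identical F1-scores for the two groups follow from the precision and recall values in the table. A small sketch, assuming the standard harmonic-mean definition of F1 and using the rounded Decision Tree values, so agreement with the reported F1-scores is only up to rounding:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Rounded Decision Tree values copied from the performance table.
f1_female = f1_score(precision=0.7056, recall=0.7169)  # gender = 0
f1_male = f1_score(precision=0.6922, recall=0.7308)    # gender = 1

print(f"F1 female: {f1_female:.4f}")
print(f"F1 male:   {f1_male:.4f}")
```

Females have the higher precision and males the higher recall, and the two effects almost cancel in the harmonic mean, which is why F1 hides the trade-off that the TPR and FPR columns reveal.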

In [22]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.7169
  False Positive Rate (FPR): 0.2996
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7308
  False Positive Rate (FPR): 0.3151
----------------------------------------


### Group-Specific Error Analysis – Decision Tree Model

This section reports the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.7169  | 0.2996  |
| Male (Privileged = 1)        | 0.7308  | 0.3151  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **slightly lower TPR (71.69%)**, meaning the model detects fewer true positives compared to males.  
  - Benefit from a **lower FPR (29.96%)**, meaning they receive fewer false positives than males.

- **Males (Privileged):**  
  - Achieve a **slightly higher TPR (73.08%)**, meaning the model detects more true positives in this group.  
  - However, they also experience a **higher FPR (31.51%)**, meaning more false positives occur.

#### Conclusion
The Decision Tree model shows a **small trade-off between sensitivity and specificity** across genders:  
- **Males** benefit from higher sensitivity (recall) but at the cost of more false alarms.  
- **Females** have fewer false alarms but slightly lower sensitivity.  

The differences are minor, suggesting that the model maintains **relatively balanced fairness across gender groups**.

---
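
The per-group rates above can be related back to the stratified bias table. The sketch below subtracts the male rates from the female rates; the sign convention (unprivileged minus privileged) is inferred from the tables in this notebook, not from the fairmlhealth source.

```python
# Per-group rates copied from the Decision Tree error analysis.
rates = {
    "female": {"tpr": 0.7169, "fpr": 0.2996},  # gender = 0, unprivileged
    "male": {"tpr": 0.7308, "fpr": 0.3151},    # gender = 1, privileged
}

# Unprivileged minus privileged.
tpr_diff = rates["female"]["tpr"] - rates["male"]["tpr"]
fpr_diff = rates["female"]["fpr"] - rates["male"]["fpr"]

print(f"TPR diff: {tpr_diff:+.4f}")
print(f"FPR diff: {fpr_diff:+.4f}")
```

Both differences are negative and reproduce the TPR Diff and FPR Diff in the gender = 1 row of the stratified table. In this notebook the reported Equal Odds Difference (-0.0155) coincides with the FPR difference, the larger of the two in magnitude.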

### Ensemble Model - Random Forest - RF

In [23]:
rf_df = pd.read_csv("CVDKaggleData_75M25F_RF_tuned_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred_rf_tuned    y_prob
0       0       0                0  0.235104
1       0       0                1  0.780096
2       1       0                0  0.222839
3       0       0                0  0.282291
4       0       0                1  0.574989


In [24]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred_rf_tuned"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender_rf as the protected attribute (0/1 as in the CSV)
protected_attr_rf = gender_rf

In [25]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0188
               Balanced Accuracy Difference            0.0198
               Balanced Accuracy Ratio                 1.0287
               Disparate Impact Ratio                  1.0216
               Equal Odds Difference                   0.0267
               Equal Odds Ratio                        0.9543
               Positive Predictive Parity Difference   0.0246
               Positive Predictive Parity Ratio        1.0354
               Statistical Parity Difference           0.0101
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [26]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric         | Measure                               | Value  |
|----------------|---------------------------------------|--------|
| Group Fairness | AUC Difference                        | 0.0188 |
| Group Fairness | Balanced Accuracy Difference          | 0.0198 |
| Group Fairness | Balanced Accuracy Ratio               | 1.0287 |
| Group Fairness | Disparate Impact Ratio                | 1.0216 |
| Group Fairness | Equal Odds Difference                 | 0.0267 |
| Group Fairness | Equal Odds Ratio                      | 0.9543 |
| Group Fairness | Positive Predictive Parity Difference | 0.0246 |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0354 |
| Group Fairness | Statistical Parity Difference         | 0.0101 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0   |


### Interpretation of Random Forest Fairness Metrics by Gender

The table reports fairness metrics for the **Random Forest model**, with gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0188):**  
  → Small difference in ranking performance across genders, just inside the custom ±0.02 bound; the model distinguishes positives and negatives nearly equally well for both groups.

- **Balanced Accuracy Difference (0.0198)** and **Balanced Accuracy Ratio (1.0287):**  
  → A slight disparity in balanced accuracy, with the unprivileged group (gender 0) performing marginally better. The difference sits just inside the custom ±0.02 bound, and the ratio is still close to 1, indicating only a small imbalance.

- **Disparate Impact Ratio (1.0216):**  
  → Positive prediction rates between genders are very similar. This value lies comfortably within the commonly accepted fairness range (0.8–1.25).

- **Equal Odds Difference (0.0267)** and **Equal Odds Ratio (0.9543):**  
  → Small differences in error rates (true positive and false positive rates) between genders. Predictions are fairly balanced, though not perfectly aligned.

- **Positive Predictive Parity Difference (0.0246)** and **Ratio (1.0354):**  
  → Precision (likelihood that a predicted positive is correct) is very similar between genders, with only a minor advantage for females (0.7198 vs. 0.6952 in the stratified performance table further below).

- **Statistical Parity Difference (0.0101):**  
  → Practically negligible difference in overall positive prediction rates across genders.

- **Prevalence of Privileged Class (35%):**  
  → Males (privileged group) account for about 35% of the test set (3,894 of 11,256 observations), which could contribute to minor metric fluctuations but does not result in strong disparities.

---

#### Overall Conclusion
The Random Forest model demonstrates **balanced fairness performance across gender groups**:  
- Differences in **AUC, accuracy, and error rates** are very small.  
- Ratios remain well within accepted fairness thresholds (0.8–1.25).  
- Slight variations exist (e.g., recall and precision are slightly higher for females), but they are **not substantial enough to indicate systematic bias**.  

Overall, Random Forest shows **slightly larger disparities than Decision Tree**, but the fairness performance remains strong and indicates **no major gender bias**.
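
The threshold reasoning above can be checked mechanically. The sketch below hard-codes the Random Forest values from the summary table and tests them against the bounds quoted in this notebook (0.8–1.25 for ratios, and a ±0.1 band for differences as quoted in the MLP section). The helper `within_bounds` is hypothetical and not part of fairmlhealth.

```python
# Random Forest summary values copied from the table above.
rf_summary = {
    "AUC Difference": 0.0188,
    "Balanced Accuracy Difference": 0.0198,
    "Balanced Accuracy Ratio": 1.0287,
    "Disparate Impact Ratio": 1.0216,
    "Equal Odds Difference": 0.0267,
    "Equal Odds Ratio": 0.9543,
    "Positive Predictive Parity Difference": 0.0246,
    "Positive Predictive Parity Ratio": 1.0354,
    "Statistical Parity Difference": 0.0101,
}

RATIO_BOUNDS = (0.8, 1.25)  # four-fifths rule and its reciprocal
DIFF_BOUNDS = (-0.1, 0.1)   # assumed tolerance for difference metrics


def within_bounds(name, value):
    """Return True if a metric lies inside the band for its type (hypothetical helper)."""
    low, high = RATIO_BOUNDS if name.endswith("Ratio") else DIFF_BOUNDS
    return low <= value <= high


flags = {name: within_bounds(name, value) for name, value in rf_summary.items()}
out_of_range = [name for name, ok in flags.items() if not ok]
print("Out-of-range metrics:", out_of_range)
```

With these inputs no metric falls outside its band, which is what the conclusion above relies on.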

---

In [27]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


Unnamed: 0,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,gender,0,-0.0198,0.9721,0.0128,1.0479,-0.0246,0.9658,-0.0101,0.9788,-0.0267,0.9609
1,gender,1,0.0198,1.0287,-0.0128,0.9543,0.0246,1.0354,0.0101,1.0216,0.0267,1.0407


### Interpretation of Stratified Bias Analysis – Random Forest (by Gender)

The table shows group-specific fairness metrics for the **Random Forest model**, stratified by gender  
(0 = Female, 1 = Male).

Each row compares the other group against the listed one. The signs are consistent with *Diff = other group - listed group* and *Ratio = other group / listed group*: the stratified performance table reports TPR = 0.6835 for females and 0.6568 for males, and 0.6835 - 0.6568 = 0.0267 is the TPR Diff in the male row. A negative Diff (or a Ratio below 1) therefore means the listed group has the **higher** value.

---

#### 1. Balanced Accuracy
- **Female (0):** Difference = **-0.0198**, Ratio = **0.9721**  
- **Male (1):** Difference = **0.0198**, Ratio = **1.0287**  
➡️ Balanced accuracy is slightly higher for females, by about 2 percentage points. The difference is minor.

---

#### 2. False Positive Rate (FPR)
- **Female (0):** FPR Diff = **0.0128**, Ratio = **1.0479**  
- **Male (1):** FPR Diff = **-0.0128**, Ratio = **0.9543**  
➡️ Males experience a **slightly higher false positive rate** (0.2792 vs. 0.2664), while females see fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Female (0):** PPV Diff = **-0.0246**, Ratio = **0.9658**  
- **Male (1):** PPV Diff = **0.0246**, Ratio = **1.0354**  
➡️ Precision is slightly higher for females, meaning female positive predictions are a bit more reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Female (0):** Selection Diff = **-0.0101**, Ratio = **0.9788**  
- **Male (1):** Selection Diff = **0.0101**, Ratio = **1.0216**  
➡️ Females are slightly **more likely to be predicted positive** than males.

---

#### 5. True Positive Rate (TPR, Recall)
- **Female (0):** TPR Diff = **-0.0267**, Ratio = **0.9609**  
- **Male (1):** TPR Diff = **0.0267**, Ratio = **1.0407**  
➡️ Females achieve a **slightly higher recall**, meaning fewer false negatives compared to males.

---

### Overall Conclusion
- **Females (0):** Slight advantage in **balanced accuracy, recall, and precision**, together with a **slightly lower false positive rate** and a marginally higher selection rate.  
- **Males (1):** Slight disadvantage in **precision and recall**, with slightly more false positives.  

The disparities are small (all differences ≤ 0.027), but they suggest that the **Random Forest model leans marginally in favor of females** in terms of predictive performance and error balance. Nonetheless, the results remain **within acceptable fairness thresholds**, indicating no severe gender bias.
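
The Diff and Ratio columns of the stratified bias table can be reproduced from per-group rates. The sketch below assumes that each row compares the other group to the listed group (Diff = other - listed, Ratio = other / listed). This is an assumption inferred from the numbers in this notebook, not a statement about fairmlhealth internals, and `versus_rest` is a hypothetical helper.

```python
# Per-group rates for the Random Forest, copied from the stratified
# performance output of this notebook (0 = female, 1 = male).
rf_rates = {
    "TPR": {0: 0.6835, 1: 0.6568},
    "FPR": {0: 0.2664, 1: 0.2792},
    "PPV": {0: 0.7198, 1: 0.6952},
    "Selection": {0: 0.4751, 1: 0.4651},
}


def versus_rest(metric, listed_group):
    """Diff and ratio of the other group relative to the listed group."""
    other_group = 1 - listed_group
    listed = rf_rates[metric][listed_group]
    other = rf_rates[metric][other_group]
    return round(other - listed, 4), round(other / listed, 4)


for metric in rf_rates:
    for group in (0, 1):
        diff, ratio = versus_rest(metric, group)
        print(f"{metric:<9} gender={group}  diff={diff:+.4f}  ratio={ratio:.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the table in the last digit. The signs, however, show which group has the higher rate in each row.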

---

In [28]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


Unnamed: 0,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,11256.0,0.4976,0.4717,0.7018,0.6924,0.2709,—,0.7114,0.7572,0.6743
1,gender,0,7362.0,0.5004,0.4751,0.7085,0.7012,0.2664,—,0.7198,0.7638,0.6835
2,gender,1,3894.0,0.4923,0.4651,0.6893,0.6754,0.2792,—,0.6952,0.7449,0.6568


### Interpretation of Stratified Performance Metrics – Random Forest (by Gender)

The table shows performance results for the **Random Forest model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7018  
- **F1-Score:** 0.6924  
- **Precision:** 0.7114  
- **ROC AUC:** 0.7572  
- **TPR (Recall):** 0.6743  
→ The model performs at a **moderate level**, with a good balance of precision and recall.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7085 (slightly higher than males)  
- **F1-Score:** 0.7012 (higher than males)  
- **FPR:** 0.2664 (lower than males → fewer false positives)  
- **Precision:** 0.7198 (higher than males)  
- **ROC AUC:** 0.7638 (better discrimination ability)  
- **TPR (Recall):** 0.6835 (higher sensitivity than males)  

➡️ Females perform **better overall**: higher accuracy, F1-score, precision, recall, and ROC AUC, while also benefiting from fewer false positives.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.6893 (lower than females)  
- **F1-Score:** 0.6754 (lower than females)  
- **FPR:** 0.2792 (higher than females → more false positives)  
- **Precision:** 0.6952 (lower than females)  
- **ROC AUC:** 0.7449 (lower discrimination ability)  
- **TPR (Recall):** 0.6568 (lower sensitivity → more missed positives)  

➡️ Males perform **slightly worse overall**, showing weaker accuracy, recall, and precision, with higher false positive rates.

---

### Overall Conclusion
- **Females (0)** achieve stronger performance on every reported metric (accuracy, F1, precision, recall, ROC AUC) and have a lower false positive rate.  
- **Males (1)** lag behind on all of these measures and face both more false positives and more false negatives.  

The Random Forest model therefore shows a **slight performance advantage for females**, but the differences are not extreme. This suggests that while the model is reasonably fair overall, it **leans in favor of females** in predictive performance.
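
For reference, the per-group metrics discussed above all derive from the confusion-matrix counts of each group. The following is a minimal sketch on toy data (hypothetical labels, not taken from the CVD test set). `group_metrics` is an illustrative helper, not a fairmlhealth function, and it assumes each group contains positives, negatives, and at least one positive prediction.

```python
def group_metrics(labels, predictions):
    """Accuracy, precision, TPR, FPR and F1 from binary labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "tpr": tpr,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * tpr / (precision + tpr),
    }


# Toy data (hypothetical): five observations per gender group.
toy_gender = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
toy_true = [1, 1, 0, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]

toy_results = {}
for g in (0, 1):
    idx = [i for i, value in enumerate(toy_gender) if value == g]
    toy_results[g] = group_metrics(
        [toy_true[i] for i in idx], [toy_pred[i] for i in idx]
    )
    print(g, {k: round(v, 3) for k, v in toy_results[g].items()})
```

A lower FPR and a higher TPR for the same group, as seen for females in the table above, means that group has both fewer false alarms and fewer missed cases.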

---

In [29]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6835
  False Positive Rate (FPR): 0.2664
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6568
  False Positive Rate (FPR): 0.2792
----------------------------------------


---

### Deep Learning Model - Feed Forward Network (MLP)

In [30]:
mlp_df = pd.read_csv("CVDKaggleData_75M25F_MLP_adamtuned_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred    y_prob
0       0       0       0  0.258611
1       0       0       1  0.848005
2       1       0       0  0.452818
3       0       0       0  0.325508
4       0       0       0  0.255966


In [31]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 

In [32]:
# Run fairmlhealth bias detection for MLP

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,
    skip_performance=True
)

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0084
               Balanced Accuracy Difference            0.0063
               Balanced Accuracy Ratio                 1.0089
               Disparate Impact Ratio                  0.9955
               Equal Odds Difference                  -0.0119
               Equal Odds Ratio                        0.9553
               Positive Predictive Parity Difference   0.0159
               Positive Predictive Parity Ratio        1.0223
               Statistical Parity Difference          -0.0021
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [33]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

Metric,Measure,Value
Group Fairness,AUC Difference,0.0084
Group Fairness,Balanced Accuracy Difference,0.0063
Group Fairness,Balanced Accuracy Ratio,1.0089
Group Fairness,Disparate Impact Ratio,0.9955
Group Fairness,Equal Odds Difference,-0.0119
Group Fairness,Equal Odds Ratio,0.9553
Group Fairness,Positive Predictive Parity Difference,0.0159
Group Fairness,Positive Predictive Parity Ratio,1.0223
Group Fairness,Statistical Parity Difference,-0.0021
Data Metrics,Prevalence of Privileged Class (%),35.0


### Interpretation of MLP Fairness Metrics by Gender

The table reports fairness metrics for the **Multilayer Perceptron (MLP) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0084):**  
  → Very small; the model’s ability to rank positive vs. negative cases is nearly identical across genders.

- **Balanced Accuracy Difference (0.0063)** and **Ratio (1.0089):**  
  → Balanced accuracy is almost the same for both genders, showing minimal disparity.

- **Disparate Impact Ratio (0.9955):**  
  → Very close to 1, suggesting that the likelihood of receiving a positive prediction is nearly equal across genders (well within the fairness range 0.8–1.25).

- **Equal Odds Difference (-0.0119)** and **Equal Odds Ratio (0.9553):**  
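  → Error rates (TPR and FPR) differ only slightly. The reported difference matches the FPR gap in the stratified performance table further below (females 0.2537 vs. males 0.2656), while TPR is almost identical across groups, so the magnitude is negligible.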

- **Positive Predictive Parity Difference (0.0159)** and **Ratio (1.0223):**  
  → Precision is nearly equal across genders, with just a tiny advantage for females (0.7306 vs. 0.7147 in the stratified performance table further below).

- **Statistical Parity Difference (-0.0021):**  
  → Essentially zero, indicating almost no difference in positive prediction rates.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group (males) makes up about 35% of the test set, but this imbalance does not result in fairness violations.

---

#### Overall Conclusion
The MLP model demonstrates **very strong fairness across gender groups**:  
- All disparities are **extremely small** (well below common fairness thresholds, e.g., ≤0.1 for differences, 0.8–1.25 for ratios).  
- Both performance parity (AUC, balanced accuracy) and fairness parity (disparate impact, statistical parity, equal odds) are well maintained.  

In summary, the MLP model shows **no meaningful gender bias** and achieves one of the most balanced fairness performances among the tested models.
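
To make the comparison with the Random Forest concrete, the sketch below measures how far each summary metric lies from its ideal value (1 for ratios, 0 for differences) for the two models whose tables appear in this section. It covers only these two models. `distance_from_parity` is a hypothetical helper.

```python
# Summary values copied from the Random Forest and MLP tables in this notebook.
summaries = {
    "Random Forest": {
        "AUC Difference": 0.0188,
        "Balanced Accuracy Difference": 0.0198,
        "Balanced Accuracy Ratio": 1.0287,
        "Disparate Impact Ratio": 1.0216,
        "Equal Odds Difference": 0.0267,
        "Equal Odds Ratio": 0.9543,
        "Positive Predictive Parity Difference": 0.0246,
        "Positive Predictive Parity Ratio": 1.0354,
        "Statistical Parity Difference": 0.0101,
    },
    "MLP": {
        "AUC Difference": 0.0084,
        "Balanced Accuracy Difference": 0.0063,
        "Balanced Accuracy Ratio": 1.0089,
        "Disparate Impact Ratio": 0.9955,
        "Equal Odds Difference": -0.0119,
        "Equal Odds Ratio": 0.9553,
        "Positive Predictive Parity Difference": 0.0159,
        "Positive Predictive Parity Ratio": 1.0223,
        "Statistical Parity Difference": -0.0021,
    },
}


def distance_from_parity(name, value):
    """Absolute gap to the ideal value: 1 for ratios, 0 for differences."""
    ideal = 1.0 if name.endswith("Ratio") else 0.0
    return abs(value - ideal)


# Metrics on which the MLP is closer to parity than the Random Forest.
mlp_closer = [
    name
    for name, value in summaries["MLP"].items()
    if distance_from_parity(name, value)
    < distance_from_parity(name, summaries["Random Forest"][name])
]
print(f"MLP closer to parity on {len(mlp_closer)} of {len(summaries['MLP'])} metrics")
```

On these values the MLP is closer to parity on all nine group-fairness metrics, although the margin on the Equal Odds Ratio (0.9553 vs. 0.9543) is very thin.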

---

In [34]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


Unnamed: 0,Feature Name,Feature Value,Balanced Accuracy Difference,Balanced Accuracy Ratio,FPR Diff,FPR Ratio,PPV Diff,PPV Ratio,Selection Diff,Selection Ratio,TPR Diff,TPR Ratio
0,gender,0,-0.0063,0.9912,0.0119,1.0468,-0.0159,0.9782,0.0021,1.0045,-0.0008,0.9989
1,gender,1,0.0063,1.0089,-0.0119,0.9553,0.0159,1.0223,-0.0021,0.9955,0.0008,1.0011


### Interpretation of Stratified Bias Analysis – MLP (by Gender)

The table shows group-specific fairness metrics for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

Each row compares the other group against the listed one. The signs are consistent with *Diff = other group - listed group* and *Ratio = other group / listed group*: the stratified performance table reports FPR = 0.2537 for females and 0.2656 for males, and 0.2537 - 0.2656 = -0.0119 is the FPR Diff in the male row. A negative Diff (or a Ratio below 1) therefore means the listed group has the **higher** value.

---

#### 1. Balanced Accuracy
- **Female (0):** Difference = **-0.0063**, Ratio = **0.9912**  
- **Male (1):** Difference = **0.0063**, Ratio = **1.0089**  
➡️ Balanced accuracy is almost identical across genders, with females showing a very small advantage.  

---

#### 2. False Positive Rate (FPR)
- **Female (0):** FPR Diff = **0.0119**, Ratio = **1.0468**  
- **Male (1):** FPR Diff = **-0.0119**, Ratio = **0.9553**  
➡️ Males experience a **slightly higher false positive rate**, while females see slightly fewer false positives. The difference is minimal.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Female (0):** PPV Diff = **-0.0159**, Ratio = **0.9782**  
- **Male (1):** PPV Diff = **0.0159**, Ratio = **1.0223**  
➡️ Females have a **slight advantage in precision**, meaning their positive predictions are marginally more reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Female (0):** Selection Diff = **0.0021**, Ratio = **1.0045**  
- **Male (1):** Selection Diff = **-0.0021**, Ratio = **0.9955**  
➡️ Males are predicted positive at nearly the same rate as females (0.4725 vs. 0.4704), with only a negligible difference.

---

#### 5. True Positive Rate (TPR, Recall)
- **Female (0):** TPR Diff = **-0.0008**, Ratio = **0.9989**  
- **Male (1):** TPR Diff = **0.0008**, Ratio = **1.0011**  
➡️ Recall is **practically identical** across genders, with no meaningful difference.

---

### Overall Conclusion
- **Females (0):** Slight advantage in **false positive rate** and **precision**, and nearly equal to males in recall and selection rate.  
- **Males (1):** Slightly **lower precision** and **higher FPR**, but again the differences are very small.  

Overall, the MLP model demonstrates **exceptionally balanced fairness across gender groups**.  
All disparities are negligible (≤ 0.016 difference, ratios ~1), confirming that the model does **not exhibit systematic gender bias**.
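
The stratified TPR and FPR columns also explain the Equal Odds entries in the summary tables. The sketch below assumes that the Equal Odds Difference and Ratio report whichever of the TPR or FPR comparison lies farther from parity. That rule is an assumption that matches the numbers shown in this notebook, and `largest_gap` is a hypothetical helper.

```python
def largest_gap(tpr_value, fpr_value, ideal):
    """Pick whichever of the TPR or FPR comparison lies farther from the ideal value."""
    if abs(fpr_value - ideal) > abs(tpr_value - ideal):
        return fpr_value
    return tpr_value


# Values from the gender = 1 rows of the stratified bias tables in this notebook.
stratified = {
    "Random Forest": {
        "tpr_diff": 0.0267, "fpr_diff": -0.0128,
        "tpr_ratio": 1.0407, "fpr_ratio": 0.9543,
    },
    "MLP": {
        "tpr_diff": 0.0008, "fpr_diff": -0.0119,
        "tpr_ratio": 1.0011, "fpr_ratio": 0.9553,
    },
}

equal_odds = {}
for model, row in stratified.items():
    eo_diff = largest_gap(row["tpr_diff"], row["fpr_diff"], ideal=0.0)
    eo_ratio = largest_gap(row["tpr_ratio"], row["fpr_ratio"], ideal=1.0)
    equal_odds[model] = (eo_diff, eo_ratio)
    print(f"{model}: Equal Odds Difference={eo_diff}, Equal Odds Ratio={eo_ratio}")
```

Under this reading, the MLP's Equal Odds Difference of -0.0119 comes from the FPR gap, whereas the Random Forest's 0.0267 comes from the TPR gap.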

---

In [35]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


Unnamed: 0,Feature Name,Feature Value,Obs.,Mean Target,Mean Prediction,Accuracy,F1-Score,FPR,PR AUC,Precision,ROC AUC,TPR
0,ALL FEATURES,ALL VALUES,11256.0,0.4976,0.4711,0.7145,0.7052,0.2578,—,0.7251,0.7674,0.6865
1,gender,0,7362.0,0.5004,0.4704,0.7165,0.708,0.2537,—,0.7306,0.7706,0.6868
2,gender,1,3894.0,0.4923,0.4725,0.7106,0.7,0.2656,—,0.7147,0.7622,0.686


### Interpretation of Stratified Performance Metrics – MLP (by Gender)

The table shows performance results for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7145  
- **F1-Score:** 0.7052  
- **Precision:** 0.7251  
- **ROC AUC:** 0.7674  
- **TPR (Recall):** 0.6865  
→ The MLP achieves **solid predictive performance**, with a good balance between precision and recall.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7165 (slightly higher than males)  
- **F1-Score:** 0.7080 (slightly higher than males)  
- **FPR:** 0.2537 (lower than males → fewer false positives)  
- **Precision:** 0.7306 (higher than males → more reliable positive predictions)  
- **ROC AUC:** 0.7706 (slightly better discrimination ability)  
- **TPR (Recall):** 0.6868 (nearly identical to males)  

➡️ Females show **slightly better accuracy, F1-score, precision, and ROC AUC**, while maintaining comparable recall and benefiting from fewer false positives.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7106 (slightly lower than females)  
- **F1-Score:** 0.7000 (slightly lower than females)  
- **FPR:** 0.2656 (higher than females → more false positives)  
- **Precision:** 0.7147 (lower than females)  
- **ROC AUC:** 0.7622 (slightly lower than females)  
- **TPR (Recall):** 0.6860 (nearly identical to females)  

➡️ Males show **slightly weaker predictive performance overall**, with higher false positive rates and lower precision, though recall remains on par with females.

---

### Overall Conclusion
- **Females (0)** achieve **slightly better overall performance**, with higher accuracy, precision, and F1-score, plus fewer false positives.  
- **Males (1)** perform slightly worse across most metrics, though recall (TPR) is nearly identical.  

The differences are **minor and well within acceptable bounds**, indicating that the MLP model achieves **balanced fairness across genders** while showing a **small performance advantage for females**.

---

In [36]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6868
  False Positive Rate (FPR): 0.2537
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6860
  False Positive Rate (FPR): 0.2656
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section evaluates the classification performance of the MLP model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6868  | 0.2537  |
| Male (Privileged = 1)        | 0.6860  | 0.2656  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 68.68%**, which is almost identical to males.  
  - Benefit from a **slightly lower FPR (25.37%)**, meaning fewer false positives compared to males.

- **Males (Privileged):**  
  - Achieve a **TPR of 68.60%**, virtually the same as females.  
  - Experience a **slightly higher FPR (26.56%)**, meaning they are marginally more often misclassified as positive.

#### Conclusion
The MLP model achieves **highly balanced error rates across genders**:  
- **Recall (TPR)** is nearly identical for females and males.  
- **Females** have a small advantage with fewer false positives.  

Overall, the differences are **minimal**, indicating **no meaningful gender disparity** in the model's error distribution.
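
These two error rates are also enough to recover balanced accuracy per group, since balanced accuracy is the mean of sensitivity (TPR) and specificity (1 - FPR). A minimal sketch using the MLP values printed above (`balanced_accuracy` is a hypothetical helper):

```python
# TPR and FPR per group for the MLP, copied from the output above.
mlp_rates = {
    "female": {"tpr": 0.6868, "fpr": 0.2537},
    "male": {"tpr": 0.6860, "fpr": 0.2656},
}


def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


mlp_ba = {group: balanced_accuracy(**rates) for group, rates in mlp_rates.items()}
ba_gap = mlp_ba["female"] - mlp_ba["male"]
print({group: round(value, 4) for group, value in mlp_ba.items()})
print(f"female - male gap: {ba_gap:.4f}")
```

The resulting gap of roughly 0.006 is in line with the Balanced Accuracy Difference of 0.0063 reported in the MLP summary table.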
