### Bias Detection and Fairness Evaluation on Cardiovascular Disease (Kaggle) using FairMLhealth
(Source: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset)

In [1]:
import pandas as pd

# Load the held-out test features and labels
X_test = pd.read_csv("./data_splits/X_test.csv")
y_test = pd.read_csv("./data_splits/y_test.csv")

In [2]:
import fairmlhealth
import aif360
print("Environment setup successful")

Environment setup successful


In [3]:
#have a look at the details of fairmlhealth - especially the version
!pip show fairmlhealth

Name: fairmlhealth
Version: 1.0.2
Summary: Health-centered variation analysis
Home-page: https://github.com/KenSciResearch/fairMLHealth
Author: Christine Allen
Author-email: ca.magallen@gmail.com
License: 
Location: c:\users\patri\appdata\roaming\python\python310\site-packages
Requires: aif360, ipython, jupyter, numpy, pandas, requests, scikit-learn, scipy
Required-by: 


In [4]:
# List the names exposed at package level; submodules such as measure only appear after they are imported

print(dir(fairmlhealth))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']


In [5]:
#load necessary modules 

#import module measure to use measure.summary for bias detection
from fairmlhealth import measure

#import module for investigation of individual cohorts 
from fairmlhealth.__utils import iterate_cohorts

#import FairRanges to flag high values
from fairmlhealth.__utils import FairRanges

# Wrap the fairness summary function for cohort-wise analysis
@iterate_cohorts
def cohort_summary(**kwargs):
    return measure.summary(**kwargs)



In [6]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", module="inFairness")
warnings.filterwarnings("ignore", message="AdversarialDebiasing will be unavailable")

### Traditional Machine Learning Models - KNN & DT

#### K-nearest neighbors - KNN

In [7]:
import pandas as pd

# Load KNN results
knn_df = pd.read_csv("CVDKaggleData_50F50M__tunedKNN_predictions.csv")

print(knn_df.head())

   gender  y_true    y_prob  y_pred
0       0       0  0.517241       1
1       0       0  0.793103       1
2       1       0  0.413793       0
3       0       0  0.275862       0
4       0       0  0.172414       0


In [8]:
# Extract common columns
y_true_knn = knn_df["y_true"].values
y_prob_knn = knn_df["y_prob"].values
y_pred_knn = knn_df["y_pred"].values
gender_knn = knn_df["gender"].values

# Use gender as the protected attribute (0 = female, 1 = male, as coded in the CSV)
protected_attr_knn = gender_knn

In [9]:
knn_bias = measure.summary(
    X=X_test,
    y_true=y_true_knn,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn,
    prtc_attr=protected_attr_knn,
    pred_type="classification",
    priv_grp=1,
    sig_fig=4,
    skip_if=True,   # skip inconsistency metrics that cause NearestNeighbors error
    skip_performance=True
)

print(knn_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0070
               Balanced Accuracy Difference           -0.0022
               Balanced Accuracy Ratio                 0.9968
               Disparate Impact Ratio                  1.0142
               Equal Odds Difference                   0.0060
               Equal Odds Ratio                        1.0200
               Positive Predictive Parity Difference   0.0031
               Positive Predictive Parity Ratio        1.0045
               Statistical Parity Difference           0.0071
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [10]:
# Custom, scenario-oriented fairness bounds: (lower, upper) per metric

custom_ranges = {
    "tpr diff": (-0.03, 0.03),
    "fpr diff": (-0.03, 0.03),
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "selection ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

bounds = FairRanges().load_fair_ranges(custom_ranges=custom_ranges)
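
To make the role of these bounds concrete, the following dependency-free sketch checks the KNN summary values reported above against the same `(lower, upper)` ranges. The helper `out_of_range` is hypothetical and not part of fairmlhealth. It assumes that metric names are matched case-insensitively, which may differ from how `FairRanges` resolves names internally.

```python
# Hypothetical helper, not part of fairmlhealth: flag metrics outside custom bounds.
custom_ranges = {
    "equal odds difference": (-0.04, 0.04),
    "statistical parity difference": (-0.05, 0.05),
    "disparate impact ratio": (0.9, 1.1),
    "auc difference": (-0.02, 0.02),
    "balanced accuracy difference": (-0.02, 0.02),
}

def out_of_range(metrics, ranges):
    """Return {metric: value} for every value outside its (lower, upper) range."""
    flagged = {}
    for name, value in metrics.items():
        limits = ranges.get(name.lower())  # assumption: case-insensitive name match
        if limits is None:
            continue  # no custom bound defined for this metric
        lower, upper = limits
        if not lower <= value <= upper:
            flagged[name] = value
    return flagged

# Summary values copied from the KNN table printed above
knn_summary = {
    "AUC Difference": 0.0070,
    "Balanced Accuracy Difference": -0.0022,
    "Disparate Impact Ratio": 1.0142,
    "Equal Odds Difference": 0.0060,
    "Statistical Parity Difference": 0.0071,
}

print(out_of_range(knn_summary, custom_ranges))  # {}
# Illustrative value (not from the notebook) that would be flagged
print(out_of_range({"Equal Odds Difference": -0.05}, custom_ranges))
```

None of the KNN values is flagged, which matches the unhighlighted styled table further below.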

In [11]:
# Restore Styler.set_precision, which newer pandas versions no longer provide, so the styled (flagged) table below can be rendered
import pandas as pd, numpy as np

Styler = type(pd.DataFrame({"_":[0]}).style)  

if not hasattr(Styler, "set_precision"):
    def _set_precision(self, precision=4):
        try:
            return self.format(precision=precision)
        except TypeError:
            return self.format(formatter=lambda x:
                f"{x:.{precision}g}" if isinstance(x, (int, float, np.floating)) else x
            )
    setattr(Styler, "set_precision", _set_precision)

In [12]:
#Flag metrics outside acceptable fairness bounds in current table 

from fairmlhealth.__utils import Flagger

class MyFlagger(Flagger):
    def reset(self):
        super().reset()
        self.flag_color = "#491ee6"   
        self.flag_type = "background-color"

styled_knn = MyFlagger().apply_flag(
    df=knn_bias,
    caption="KNN Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_knn

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | 0.007   |
| Group Fairness | Balanced Accuracy Difference          | -0.0022 |
| Group Fairness | Balanced Accuracy Ratio               | 0.9968  |
| Group Fairness | Disparate Impact Ratio                | 1.0142  |
| Group Fairness | Equal Odds Difference                 | 0.006   |
| Group Fairness | Equal Odds Ratio                      | 1.02    |
| Group Fairness | Positive Predictive Parity Difference | 0.0031  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0045  |
| Group Fairness | Statistical Parity Difference         | 0.0071  |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0    |


### Interpretation of KNN Fairness Metrics by Gender

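The table reports fairness metrics for the **K-Nearest Neighbors (KNN) model**, using gender as the sensitive attribute. With `priv_grp=1`, males are the privileged group. Judging by the stratified performance table further below, each difference is the female value minus the male value, and each ratio is the female value divided by the male value.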

---

#### Key Observations

- **AUC Difference (0.0070):**  
  → Very small; ranking performance (ability to separate positives from negatives) is nearly identical across genders.

- **Balanced Accuracy Difference (-0.0022)** and **Ratio (0.9968):**  
  → Almost no difference in balanced accuracy; the ratio close to 1 confirms that both genders are treated equally well.

- **Disparate Impact Ratio (1.0142):**  
  → Very close to 1, indicating that the likelihood of receiving a positive prediction is nearly the same across genders.  
  → This is well within the fairness guideline range (0.8–1.25).

- **Equal Odds Difference (0.0060)** and **Equal Odds Ratio (1.0200):**  
  → Error rates (true positive rate and false positive rate) are very similar, with only a marginal difference between genders.

- **Positive Predictive Parity Difference (0.0031)** and **Ratio (1.0045):**  
  → Precision is nearly identical; the positive difference reflects a negligible advantage for females (0.6974 vs. 0.6942 in the stratified performance table).

- **Statistical Parity Difference (0.0071):**  
  → Suggests a very small difference in the overall rate of positive predictions across genders, but the value is close to zero.

- **Prevalence of Privileged Class (35%):**  
  → The privileged group (males) makes up 35% of the dataset. This imbalance is present in the data but does not cause strong disparities in fairness outcomes.

---

#### Overall Conclusion
The KNN model shows **very strong fairness across gender groups**:  
- All differences are **extremely small** (≤0.01), and ratios remain close to 1.  
- Both genders receive nearly equal treatment in terms of accuracy, prediction rates, and error distribution.  
- The results suggest that the model is **highly balanced and does not exhibit systematic gender bias**.

Among fairness evaluations, this KNN run demonstrates **excellent parity between groups**.
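
As a minimal sketch of how the two selection-based metrics arise, the snippet below computes the statistical parity difference and the disparate impact ratio from raw predictions. The function `parity_metrics` and the toy arrays are made up for illustration. The sign convention (unprivileged relative to privileged) is an assumption that is consistent with the values in this notebook, not a statement about fairmlhealth's internals.

```python
def parity_metrics(y_pred, group, priv_grp=1):
    """Statistical parity difference and disparate impact ratio.

    Assumed convention: unprivileged rate minus (or divided by) privileged rate.
    """
    priv = [p for p, g in zip(y_pred, group) if g == priv_grp]
    unpriv = [p for p, g in zip(y_pred, group) if g != priv_grp]
    priv_rate = sum(priv) / len(priv)        # share predicted positive, privileged
    unpriv_rate = sum(unpriv) / len(unpriv)  # share predicted positive, unprivileged
    return unpriv_rate - priv_rate, unpriv_rate / priv_rate

# Toy data (illustrative only): group 1 is the privileged group
group  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

spd, di = parity_metrics(y_pred, group)
print(spd, di)  # 0.25 1.5
```

Applied to the KNN selection rates reported in the stratified performance table (0.5076 for females, 0.5005 for males), the same arithmetic gives 0.0071 and roughly 1.0142.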

---

In [13]:
print("FairMLHealth Stratified Bias Table - KNN")
measure.bias(X_test, y_test, y_pred_knn, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - KNN


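|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | 0.0022                       | 1.0032                  | -0.006   | 0.9804    | -0.0031  | 0.9955    | -0.0071        | 0.986           | -0.0016  | 0.9977    |
| 1 | gender       | 1             | -0.0022                      | 0.9968                  | 0.006    | 1.02      | 0.0031   | 1.0045    | 0.0071         | 1.0142          | 0.0016   | 1.0023    |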


### Interpretation of Stratified Bias Analysis – KNN (by Gender)

The table shows group-specific fairness metrics for the **K-Nearest Neighbors (KNN) model**, stratified by gender  
(0 = Female, 1 = Male). Each row uses the listed value as the reference group: a difference is the other group's rate minus the listed group's rate, and a ratio is the other group's rate divided by the listed group's rate. The two rows therefore mirror each other, and the `gender = 1` row reproduces the summary table above (female relative to male). The group rates quoted below come from the stratified performance table that follows.

---

#### 1. Balanced Accuracy
- **Female (0) row:** Difference = **0.0022**, Ratio = **1.0032**  
- **Male (1) row:** Difference = **-0.0022**, Ratio = **0.9968**  
➡️ Balanced accuracy is **nearly identical** for both genders; males are ahead by only about 0.2 percentage points.

---

#### 2. False Positive Rate (FPR)
- **Female (0) row:** FPR Diff = **-0.006**, Ratio = **0.9804**  
- **Male (1) row:** FPR Diff = **0.006**, Ratio = **1.0200**  
➡️ Females experience a **slightly higher false positive rate** (0.3075 vs. 0.3015), while males face marginally fewer false positives.  
The difference is negligible (about 0.6 percentage points).

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Female (0) row:** PPV Diff = **-0.0031**, Ratio = **0.9955**  
- **Male (1) row:** PPV Diff = **0.0031**, Ratio = **1.0045**  
➡️ Females have a **slight precision advantage** (0.6974 vs. 0.6942), but the difference is minimal (about 0.3 percentage points).

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Female (0) row:** Selection Diff = **-0.0071**, Ratio = **0.9860**  
- **Male (1) row:** Selection Diff = **0.0071**, Ratio = **1.0142**  
➡️ Females are **slightly more likely** to be predicted positive than males (0.5076 vs. 0.5005), but the difference is very small (about 0.7 percentage points).

---

#### 5. True Positive Rate (TPR, Recall)
- **Female (0) row:** TPR Diff = **-0.0016**, Ratio = **0.9977**  
- **Male (1) row:** TPR Diff = **0.0016**, Ratio = **1.0023**  
➡️ Recall is **almost identical** across genders (0.7074 vs. 0.7058), with only a negligible advantage for females.

---

### Overall Conclusion
- **Females (0):** Have a **tiny advantage in precision and recall** and a slightly higher selection rate, but face a slightly higher false positive rate.  
- **Males (1):** Benefit from a **slightly lower false positive rate** and marginally higher balanced accuracy, but have a very small disadvantage in precision and recall.  

All differences are **extremely small (below 1 percentage point)**, meaning the KNN model shows **no meaningful systematic gender bias** on these metrics.
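
The mirrored signs of the two rows can be checked against the group rates themselves. The sketch below assumes that each row is computed as "other group relative to the listed group". This reading is inferred from the numbers in this notebook rather than taken from the fairmlhealth documentation, and `stratified_row` is a hypothetical helper.

```python
def stratified_row(rates, reference):
    """Difference and ratio of the other group's rate relative to `reference`.

    `rates` maps the two group labels to a rate such as FPR.
    """
    (other,) = [g for g in rates if g != reference]  # exactly one other group
    diff = rates[other] - rates[reference]
    ratio = rates[other] / rates[reference]
    return round(diff, 4), round(ratio, 4)

# KNN false positive rates by gender, as reported by measure.performance in this notebook
fpr = {0: 0.3075, 1: 0.3015}

print("row gender = 0:", stratified_row(fpr, reference=0))
print("row gender = 1:", stratified_row(fpr, reference=1))
```

Because the inputs are already rounded to four decimals, the ratios can deviate from the table in the last digit, but the differences of -0.006 and 0.006 are reproduced.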

---

In [14]:
from fairmlhealth import measure
import pandas as pd
from IPython.display import display  

# convert gender into DataFrame with a clear column name to get a nice table as output
gender_df = pd.DataFrame({"gender": X_test["gender"].astype(int)})


# Get the stratified table
perf_table_knn = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_knn,
    y_prob=y_prob_knn
)

# Replace NaN with a dash
perf_table_knn = perf_table_knn.fillna("—")

# display pretty table
display(perf_table_knn)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


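|   | Feature Name | Feature Value | Obs.    | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|---------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 11256.0 | 0.4976      | 0.5052          | 0.7007   | 0.7015   | 0.3054 | —      | 0.6963    | 0.7618  | 0.7068 |
| 1 | gender       | 0             | 7362.0  | 0.5004      | 0.5076          | 0.6999   | 0.7023   | 0.3075 | —      | 0.6974    | 0.7639  | 0.7074 |
| 2 | gender       | 1             | 3894.0  | 0.4923      | 0.5005          | 0.7021   | 0.6999   | 0.3015 | —      | 0.6942    | 0.7569  | 0.7058 |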


### Interpretation of Stratified Performance Metrics – KNN (by Gender)

The table shows performance results for the **K-Nearest Neighbors (KNN) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7007  
- **F1-Score:** 0.7015  
- **Precision:** 0.6963  
- **ROC AUC:** 0.7618  
- **TPR (Recall):** 0.7068  
→ The KNN model achieves **solid predictive performance**, with recall (0.7068) and precision (0.6963) closely balanced.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.6999 (very close to males)  
- **F1-Score:** 0.7023 (slightly higher than males)  
- **FPR:** 0.3075 (slightly higher than males → more false positives)  
- **Precision:** 0.6974 (slightly higher than males)  
- **ROC AUC:** 0.7639 (slightly better discrimination ability)  
- **TPR (Recall):** 0.7074 (almost identical to males)  

➡️ Females show a **tiny advantage in F1-score, precision, and ROC AUC**, but this comes with a **slightly higher false positive rate**.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7021 (slightly higher than females)  
- **F1-Score:** 0.6999 (slightly lower than females)  
- **FPR:** 0.3015 (lower than females → fewer false positives)  
- **Precision:** 0.6942 (slightly lower than females)  
- **ROC AUC:** 0.7569 (slightly lower discrimination ability)  
- **TPR (Recall):** 0.7058 (almost identical to females)  

➡️ Males perform **slightly better in accuracy and false positive rate**, but lag behind in F1-score, precision, and ROC AUC.

---

### Overall Conclusion
- **Females (0):** Slight advantage in **F1-score, precision, and ROC AUC**, but at the cost of a slightly higher false positive rate.  
- **Males (1):** Slight advantage in **accuracy and lower false positive rate**, but at the cost of slightly lower F1-score and precision.  

The differences are **tiny (≤ 1%)**, indicating that the KNN model is **very fair across gender groups**, with only negligible trade-offs in error distribution.
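
The F1 values in the table follow from the precision and recall columns. As a quick consistency check, the sketch below recomputes F1 as the harmonic mean of the two. The dictionary `groups` simply holds the rounded values copied from the KNN table above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per gender, copied from the KNN performance table
groups = {"female": (0.6974, 0.7074), "male": (0.6942, 0.7058)}

f1_by_group = {name: f1_score(p, r) for name, (p, r) in groups.items()}
for name, value in f1_by_group.items():
    print(f"{name}: F1 = {value:.4f}")
```

Since the inputs are rounded, the recomputed values agree with the reported 0.7023 and 0.6999 only up to about one unit in the fourth decimal.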

---

In [15]:
#group specific error analysis

from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_knn == 0)  # unprivileged group (female)
male_mask   = (protected_attr_knn == 1)  # privileged group (male)

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_knn[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_knn[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)


Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.7074
  False Positive Rate (FPR): 0.3075
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7058
  False Positive Rate (FPR): 0.3015
----------------------------------------


### Group-Specific Error Analysis – KNN Model

This section reports the classification performance of the KNN model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.7074  | 0.3075  |
| Male (Privileged = 1)        | 0.7058  | 0.3015  |

#### Interpretation
- **Females (Unprivileged):**  
  - Have a **TPR of 70.74%**, almost identical to males.  
  - Experience a **slightly higher FPR (30.75%)**, meaning they receive marginally more false positives.

- **Males (Privileged):**  
  - Have a **TPR of 70.58%**, virtually the same as females.  
  - Benefit from a **slightly lower FPR (30.15%)**, meaning fewer false alarms.

#### Conclusion
The KNN model shows **very balanced fairness** across genders:  
- Both groups achieve nearly identical recall (TPR).  
- Females incur a **tiny disadvantage in terms of false positives**, but the gap (about 0.6 percentage points) is negligible.  

Overall, the results indicate that the KNN model is **highly fair and unbiased with respect to gender**.
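
For readers who want to see what the two rates measure, here is a dependency-free sketch that derives TPR and FPR per group from confusion-matrix counts. The functions `tpr_fpr` and `by_group` are illustrative, and the toy data is invented. It is presumably equivalent to what `pm.true_positive_rate` and `pm.false_positive_rate` compute, but that is an assumption.

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def by_group(y_true, y_pred, group):
    """Apply tpr_fpr separately to the rows of each group label."""
    result = {}
    for value in sorted(set(group)):
        idx = [i for i, g in enumerate(group) if g == value]
        result[value] = tpr_fpr([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return result

# Toy data, made up for illustration only
group  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]

rates = by_group(y_true, y_pred, group)
for value, (tpr, fpr) in rates.items():
    print(f"group {value}: TPR = {tpr:.4f}, FPR = {fpr:.4f}")
```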

---

### Decision Tree - DT

In [16]:
import pandas as pd

# Load Decision Tree results
dt_df = pd.read_csv("CVDKaggleData_50F50M_DT_tunedpruned_predictions.csv")

print(dt_df.head())

   gender  y_true  y_pred_dt    y_prob
0       0       0          0  0.301981
1       0       0          1  0.862548
2       1       0          0  0.301981
3       0       0          0  0.301981
4       0       0          0  0.301981


In [17]:
import re

# Extract common columns
y_true_dt = dt_df["y_true"].values
y_prob_dt = dt_df["y_prob"].values
y_pred_dt = dt_df["y_pred_dt"].values
gender_dt = dt_df["gender"].values


# Use gender as the protected attribute (0 = female, 1 = male, as coded in the CSV)
protected_attr_dt = gender_dt


In [18]:
# Decision Tree Gender Bias Report
print("\n--- Decision Tree Gender Bias Report ---")

dt_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt,
    prtc_attr=protected_attr_dt,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,  
    skip_performance = True
)

print(dt_bias)


--- Decision Tree Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0047
               Balanced Accuracy Difference            0.0069
               Balanced Accuracy Ratio                 1.0098
               Disparate Impact Ratio                  0.9489
               Equal Odds Difference                  -0.0357
               Equal Odds Ratio                        0.8775
               Positive Predictive Parity Difference   0.0268
               Positive Predictive Parity Ratio        1.0382
               Statistical Parity Difference          -0.0254
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [19]:
#Flag metrics outside acceptable fairness bounds in current table 

styled_dt = MyFlagger().apply_flag(
    df=dt_bias,
    caption="Decision Tree Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_dt

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | 0.0047  |
| Group Fairness | Balanced Accuracy Difference          | 0.0069  |
| Group Fairness | Balanced Accuracy Ratio               | 1.0098  |
| Group Fairness | Disparate Impact Ratio                | 0.9489  |
| Group Fairness | Equal Odds Difference                 | -0.0357 |
| Group Fairness | Equal Odds Ratio                      | 0.8775  |
| Group Fairness | Positive Predictive Parity Difference | 0.0268  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0382  |
| Group Fairness | Statistical Parity Difference         | -0.0254 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0    |


### Interpretation of Decision Tree Fairness Metrics by Gender

The table reports fairness metrics for the **Decision Tree (DT) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0047):**  
  → Extremely small; ranking ability is nearly identical between genders.

- **Balanced Accuracy Difference (0.0069)** and **Ratio (1.0098):**  
  → Females have a very slight advantage in balanced accuracy, but the gap is minimal (about 0.7 percentage points).

- **Disparate Impact Ratio (0.9489):**  
  → Slightly below 1, suggesting that females (unprivileged group) are **less likely to receive positive predictions** compared to males.  
  → Still within the fairness guideline range (0.8–1.25).

- **Equal Odds Difference (-0.0357)** and **Equal Odds Ratio (0.8775):**  
  → This is the **largest disparity** observed.  
  - Both values match the FPR gap: the female FPR is about 3.6 percentage points lower than the male FPR (0.2561 vs. 0.2919 in the stratified performance table).  
  - The female TPR is also lower (0.6876 vs. 0.7094), so females receive fewer false positives but also fewer correct detections.  
  → This shows some imbalance, but the gap is moderate and stays inside the custom ±0.04 bound defined earlier.

- **Positive Predictive Parity Difference (0.0268)** and **Ratio (1.0382):**  
  → Females have slightly **higher precision** (0.7289 vs. 0.7021), meaning their positive predictions are somewhat more reliable.

- **Statistical Parity Difference (-0.0254):**  
  → Suggests females are **predicted positive at a slightly lower rate** than males, though the disparity is small.

- **Prevalence of Privileged Class (35%):**  
  → Males represent 35% of the dataset. Despite this imbalance, fairness outcomes remain reasonably balanced.

---

#### Overall Conclusion
The Decision Tree model shows **generally fair outcomes across gender groups**, but with a few **minor disparities**:
- **Females (unprivileged group):** Slightly lower rate of positive predictions, a lower false positive rate, and slightly higher precision, but lower recall.  
- **Males (privileged group):** Slightly higher recall, at the cost of a higher false positive rate and lower precision.  

Overall, disparities remain **moderate and within acceptable ranges**, though the **Equal Odds metric (-0.0357)** indicates the largest imbalance compared to other fairness measures.
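
The Equal Odds Difference in these tables coincides with whichever of the TPR and FPR gaps is larger in magnitude. The sketch below encodes that reading as an assumption; it is consistent with the KNN and Decision Tree values in this notebook but is not quoted from the library. `equal_odds_difference` is a hypothetical helper, and the rates are the per-gender values reported by `measure.performance` and the group-specific error analysis.

```python
def equal_odds_difference(tpr_unpriv, fpr_unpriv, tpr_priv, fpr_priv):
    """Signed TPR or FPR gap (unprivileged minus privileged), whichever is larger in magnitude.

    Assumption: this mirrors how the summary value behaves in this notebook.
    """
    tpr_diff = tpr_unpriv - tpr_priv
    fpr_diff = fpr_unpriv - fpr_priv
    return fpr_diff if abs(fpr_diff) > abs(tpr_diff) else tpr_diff

# Decision Tree rates by gender: female = unprivileged, male = privileged
dt_eod = equal_odds_difference(tpr_unpriv=0.6876, fpr_unpriv=0.2561,
                               tpr_priv=0.7094, fpr_priv=0.2919)

# KNN rates by gender, for comparison
knn_eod = equal_odds_difference(tpr_unpriv=0.7074, fpr_unpriv=0.3075,
                                tpr_priv=0.7058, fpr_priv=0.3015)

print(f"DT:  {dt_eod:.4f}")
print(f"KNN: {knn_eod:.4f}")
```

In both cases the FPR gap dominates, which is why the summary value tracks the FPR Diff column of the stratified bias table. The rounded inputs reproduce the reported -0.0357 and 0.0060 up to the last digit.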

---

In [20]:
print("FairMLHealth Stratified Bias Table - DT")
measure.bias(X_test, y_test, y_pred_dt, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - DT


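|   | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|--------------|---------------|------------------------------|-------------------------|----------|-----------|----------|-----------|----------------|-----------------|----------|-----------|
| 0 | gender       | 0             | -0.0069                      | 0.9903                  | 0.0357   | 1.1395    | -0.0268  | 0.9632    | 0.0254         | 1.0538          | 0.0219   | 1.0318    |
| 1 | gender       | 1             | 0.0069                       | 1.0098                  | -0.0357  | 0.8775    | 0.0268   | 1.0382    | -0.0254        | 0.9489          | -0.0219  | 0.9692    |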


### Interpretation of Stratified Bias Analysis – Decision Tree (by Gender)

The table shows group-specific fairness metrics for the **Decision Tree (DT) model**, stratified by gender  
(0 = Female, 1 = Male). Each row uses the listed value as the reference group: a difference is the other group's rate minus the listed group's rate, and a ratio is the other group's rate divided by the listed group's rate. The `gender = 1` row therefore reproduces the summary table above (female relative to male). The group rates quoted below come from the stratified performance table that follows.

---

#### 1. Balanced Accuracy
- **Female (0) row:** Difference = **-0.0069**, Ratio = **0.9903**  
- **Male (1) row:** Difference = **0.0069**, Ratio = **1.0098**  
➡️ Females have a **slight advantage in balanced accuracy** (about 0.7 percentage points), but the difference is very small.

---

#### 2. False Positive Rate (FPR)
- **Female (0) row:** FPR Diff = **0.0357**, Ratio = **1.1395**  
- **Male (1) row:** FPR Diff = **-0.0357**, Ratio = **0.8775**  
➡️ Males face a **higher false positive rate** (0.2919 vs. 0.2561), meaning they are more often incorrectly classified as positive.  
➡️ Females benefit from fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Female (0) row:** PPV Diff = **-0.0268**, Ratio = **0.9632**  
- **Male (1) row:** PPV Diff = **0.0268**, Ratio = **1.0382**  
➡️ Precision is **higher for females** (0.7289 vs. 0.7021), meaning their positive predictions are more reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Female (0) row:** Selection Diff = **0.0254**, Ratio = **1.0538**  
- **Male (1) row:** Selection Diff = **-0.0254**, Ratio = **0.9489**  
➡️ Males are **more likely to be predicted positive** (0.4974 vs. 0.4720), while females are predicted positive less often.

---

#### 5. True Positive Rate (TPR, Recall)
- **Female (0) row:** TPR Diff = **0.0219**, Ratio = **1.0318**  
- **Male (1) row:** TPR Diff = **-0.0219**, Ratio = **0.9692**  
➡️ Males achieve **higher recall** (0.7094 vs. 0.6876), meaning their true positives are more likely to be correctly identified.  
➡️ Females experience slightly more false negatives.

---

### Overall Conclusion
- **Females (0):**  
  - Advantages: **Lower false positive rate** and **higher precision**.  
  - Disadvantages: **Lower recall** and **less likely to be predicted positive**.  

- **Males (1):**  
  - Advantages: **Higher recall (TPR)** and **higher selection rate**.  
  - Disadvantages: **Higher false positive rate** and **lower precision**.  

The disparities are **moderate**: males trade precision for recall (more positives detected but with more false alarms), while females receive more reliable positive predictions but risk more missed true cases. The FPR difference (±0.0357) lies outside the custom ±0.03 bound defined earlier for `fpr diff`, so the Decision Tree model introduces **noticeable but not extreme fairness trade-offs across genders**.
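
Precision, selection rate, and the two error rates are tied together by the prevalence of the positive class. The sketch below is a worked check under that identity (Bayes' rule), using the per-gender TPR, FPR, and Mean Target values that `measure.performance` reports for the Decision Tree in this notebook. The function names are illustrative.

```python
def selection_rate(tpr, fpr, prevalence):
    """P(pred = 1) = TPR * P(y = 1) + FPR * P(y = 0)."""
    return tpr * prevalence + fpr * (1 - prevalence)

def ppv(tpr, fpr, prevalence):
    """Precision via Bayes' rule: true positives over all predicted positives."""
    return tpr * prevalence / selection_rate(tpr, fpr, prevalence)

# Decision Tree values by gender: (TPR, FPR, Mean Target)
dt = {"female": (0.6876, 0.2561, 0.5004), "male": (0.7094, 0.2919, 0.4923)}

derived = {
    name: (selection_rate(tpr, fpr, prev), ppv(tpr, fpr, prev))
    for name, (tpr, fpr, prev) in dt.items()
}
for name, (sel, prec) in derived.items():
    print(f"{name}: selection rate = {sel:.4f}, precision = {prec:.4f}")
```

Up to rounding, the derived values line up with the Mean Prediction and Precision columns of the performance table. This shows that the lower FPR of the female group is what lifts its precision despite the lower recall.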


---

In [21]:
# Get the stratified performance table
perf_table_dt = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_dt,
    y_prob=y_prob_dt
)

# Replace NaN with a dash
perf_table_dt = perf_table_dt.fillna("—")

# Display pretty table
display(perf_table_dt)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


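|   | Feature Name | Feature Value | Obs.    | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR    | PR AUC | Precision | ROC AUC | TPR    |
|---|--------------|---------------|---------|-------------|-----------------|----------|----------|--------|--------|-----------|---------|--------|
| 0 | ALL FEATURES | ALL VALUES    | 11256.0 | 0.4976      | 0.4808          | 0.7133   | 0.707    | 0.2686 | —      | 0.7193    | 0.7394  | 0.6951 |
| 1 | gender       | 0             | 7362.0  | 0.5004      | 0.472           | 0.7157   | 0.7076   | 0.2561 | 0.2609 | 0.7289    | 0.741   | 0.6876 |
| 2 | gender       | 1             | 3894.0  | 0.4923      | 0.4974          | 0.7088   | 0.7058   | 0.2919 | —      | 0.7021    | 0.7364  | 0.7094 |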


### Interpretation of Stratified Performance Metrics – Decision Tree (by Gender)

The table shows performance results for the **Decision Tree model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7133  
- **F1-Score:** 0.7070  
- **Precision:** 0.7193  
- **ROC AUC:** 0.7394  
- **TPR (Recall):** 0.6951  
→ The DT achieves **balanced predictive performance**, with moderate recall and relatively good precision.

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7157 (slightly higher than males)  
- **F1-Score:** 0.7076 (slightly higher)  
- **FPR:** 0.2561 (lower than males → fewer false positives)  
- **Precision:** 0.7289 (higher precision than males)  
- **ROC AUC:** 0.7410 (slightly higher discrimination ability)  
- **TPR (Recall):** 0.6876 (lower than males)  

➡️ Females show **stronger performance in precision, accuracy, and fewer false positives**, but **recall is slightly weaker** compared to males.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7088 (slightly lower than females)  
- **F1-Score:** 0.7058 (slightly lower)  
- **FPR:** 0.2919 (higher → more false positives)  
- **Precision:** 0.7021 (lower precision than females)  
- **ROC AUC:** 0.7364 (slightly weaker discrimination ability)  
- **TPR (Recall):** 0.7094 (higher than females)  

➡️ Males achieve **better recall (higher TPR)**, meaning more true positives are detected, but this comes with **more false positives** and lower precision.

---

### Overall Conclusion
- **Females (0):** Advantage in **precision, accuracy, ROC AUC, and lower false positive rate**, but recall is weaker.  
- **Males (1):** Advantage in **recall (TPR)**, but face more false positives and lower precision.  

The Decision Tree model shows a **clear trade-off**:  
- **Females** → more reliable predictions (fewer false alarms, higher precision).  
- **Males** → more sensitive predictions (higher recall), but with more false positives.  

The differences are noticeable but not extreme, reflecting **moderate gender trade-offs** rather than systematic bias.
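
The balanced accuracy gap reported in the fairness summary can be recovered from this table. As a small worked example, the sketch below computes balanced accuracy as the mean of sensitivity and specificity from the per-gender TPR and FPR values above.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1 - fpr)) / 2

# Decision Tree rates by gender, copied from the performance table above
ba_female = balanced_accuracy(tpr=0.6876, fpr=0.2561)
ba_male = balanced_accuracy(tpr=0.7094, fpr=0.2919)

print(f"female: {ba_female:.4f}")
print(f"male:   {ba_male:.4f}")
print(f"female - male: {ba_female - ba_male:.4f}")
```

The result of roughly 0.007 matches the reported Balanced Accuracy Difference of 0.0069 up to rounding of the inputs.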

---

In [22]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_dt == 0)  # female = unprivileged group
male_mask   = (protected_attr_dt == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_dt[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_dt[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6876
  False Positive Rate (FPR): 0.2561
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7094
  False Positive Rate (FPR): 0.2919
----------------------------------------


### Group-Specific Error Analysis – Decision Tree Model

This section reports the classification performance of the Decision Tree model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6876  | 0.2561  |
| Male (Privileged = 1)        | 0.7094  | 0.2919  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 68.76%**, slightly lower than males, meaning fewer true positives are detected.  
  - Benefit from a **lower FPR (25.61%)**, which means they are less frequently misclassified as positive.

- **Males (Privileged):**  
  - Achieve a **TPR of 70.94%**, indicating better sensitivity (more true positives detected).  
  - However, they also face a **higher FPR (29.19%)**, meaning they are more often falsely flagged as positive.

#### Conclusion
The Decision Tree model demonstrates a **trade-off in error distribution**:  
- **Males** → better recall (higher TPR), but at the cost of more false positives.  
- **Females** → fewer false positives, but with slightly weaker recall.  

The differences (about 2.2 percentage points in TPR and 3.6 in FPR) are **moderate but not extreme**, indicating that the model remains **reasonably fair**, though it slightly favors males in sensitivity and females in prediction reliability.

---

### Ensemble Model - Random Forest - RF

In [23]:
rf_df = pd.read_csv("CVDKaggleData_50M50F_RF_tuned_predictions.csv")
print(rf_df.head())

   gender  y_true  y_pred_rf    y_prob
0       0       0          0  0.363411
1       0       0          1  0.812063
2       1       0          0  0.313309
3       0       0          0  0.276039
4       0       0          0  0.319559


In [24]:
# Extract common columns
y_true_rf = rf_df["y_true"].values
y_pred_rf = rf_df["y_pred_rf"].values
y_prob_rf = rf_df["y_prob"].values
gender_rf = rf_df["gender"].values


# Use gender as the protected attribute (0 = female, 1 = male, as coded in the CSV)
protected_attr_rf = gender_rf

In [25]:
# Random Forest Gender Bias Report
print("\n--- Random Forest Gender Bias Report ---")

rf_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf,
    prtc_attr=protected_attr_rf,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance = True
)

print(rf_bias)


--- Random Forest Gender Bias Report ---
                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0137
               Balanced Accuracy Difference            0.0026
               Balanced Accuracy Ratio                 1.0037
               Disparate Impact Ratio                  0.9674
               Equal Odds Difference                  -0.0215
               Equal Odds Ratio                        0.9209
               Positive Predictive Parity Difference   0.0183
               Positive Predictive Parity Ratio        1.0258
               Statistical Parity Difference          -0.0155
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [26]:
# Flagged fairness table for Random Forest
styled_rf = MyFlagger().apply_flag(
    df=rf_bias,
    caption="Random Forest Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_rf

| Metric         | Measure                               | Value   |
|----------------|---------------------------------------|---------|
| Group Fairness | AUC Difference                        | 0.0137  |
| Group Fairness | Balanced Accuracy Difference          | 0.0026  |
| Group Fairness | Balanced Accuracy Ratio               | 1.0037  |
| Group Fairness | Disparate Impact Ratio                | 0.9674  |
| Group Fairness | Equal Odds Difference                 | -0.0215 |
| Group Fairness | Equal Odds Ratio                      | 0.9209  |
| Group Fairness | Positive Predictive Parity Difference | 0.0183  |
| Group Fairness | Positive Predictive Parity Ratio      | 1.0258  |
| Group Fairness | Statistical Parity Difference         | -0.0155 |
| Data Metrics   | Prevalence of Privileged Class (%)    | 35.0    |


### Interpretation of Random Forest Fairness Metrics by Gender

The table reports fairness metrics for the **Random Forest (RF) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0137):**  
  → Very small; the model’s ability to rank predictions is nearly identical across genders.

- **Balanced Accuracy Difference (0.0026)** and **Ratio (1.0037):**  
  → Almost no disparity in balanced accuracy. Both genders are treated similarly well.

- **Disparate Impact Ratio (0.9674):**  
  → Slightly below 1, suggesting that females are **somewhat less likely to receive positive predictions** compared to males.  
  → Still falls within the general fairness guideline (0.8–1.25).

- **Equal Odds Difference (-0.0215)** and **Equal Odds Ratio (0.9209):**  
  → The largest gap among the group fairness measures, but still small in absolute terms.  
  - Both values coincide with the FPR Diff and FPR Ratio reported for gender in the stratified bias table, so the gap is driven by false positives: females have an FPR of 0.2501 versus 0.2716 for males.  
  - Females also have a slightly lower TPR (0.6702 vs. 0.6865), so the lower false positive rate comes with slightly more missed positives.

- **Positive Predictive Parity Difference (0.0183)** and **Ratio (1.0258):**  
  → Precision is slightly higher for females (0.7285 vs. 0.7102 for males), meaning their positive predictions are somewhat more reliable.

- **Statistical Parity Difference (-0.0155):**  
  → Indicates females are **predicted positive at a slightly lower rate** than males, but the disparity is small.

- **Prevalence of Privileged Class (35%):**  
  → Males represent about 35% of the test set (3,894 of 11,256 records). Despite being the minority, the fairness outcomes remain relatively balanced.

---

#### Overall Conclusion
The Random Forest model shows **only small disparities** between genders:  
- **Females (unprivileged group):** Slightly less likely to receive positive predictions, with a lower false positive rate and slightly higher **precision** (positive predictions are more reliable).  
- **Males (privileged group):** Slightly more likely to receive positive predictions, with a higher true positive rate but also more false positives.  

The **largest observed gap** is in **Equal Odds Difference (-0.0215)**, which reflects the difference in false positive rates. However, all disparities are **minor and within commonly used ranges**, indicating the model is **reasonably fair across genders**.
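
As a rough cross-check, the two selection-based measures can be recomputed by hand. The sketch below assumes that the "Mean Prediction" values reported for each gender in the stratified performance table further down are the per-group selection rates, and that both measures are taken as unprivileged (female) relative to privileged (male). The helper names are illustrative and are not part of fairmlhealth.

```python
# Per-group selection rates (share of records predicted positive).
# Assumption: these equal the "Mean Prediction" values in the
# stratified performance table for the Random Forest.
selection_rate = {"female": 0.4603, "male": 0.4759}


def statistical_parity_difference(unprivileged, privileged):
    """Selection rate of the unprivileged group minus that of the privileged group."""
    return unprivileged - privileged


def disparate_impact_ratio(unprivileged, privileged):
    """Selection rate of the unprivileged group divided by that of the privileged group."""
    return unprivileged / privileged


spd = statistical_parity_difference(selection_rate["female"], selection_rate["male"])
di = disparate_impact_ratio(selection_rate["female"], selection_rate["male"])

# Commonly cited guideline band for the ratio (0.8 to 1.25).
within_guideline = 0.8 <= di <= 1.25

print(f"Statistical parity difference: {spd:.4f}")
print(f"Disparate impact ratio:        {di:.4f}")
print(f"Within 0.8-1.25 guideline:     {within_guideline}")
```

Because the inputs are already rounded to four decimals, the recomputed values agree with the reported -0.0155 and 0.9674 only to about three decimals.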

---

In [27]:
print("FairMLHealth Stratified Bias Table - RF")
measure.bias(X_test, y_test, y_pred_rf, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - RF


| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gender | 0 | -0.0026 | 0.9963 | 0.0215 | 1.0859 | -0.0183 | 0.9748 | 0.0155 | 1.0337 | 0.0163 | 1.0243 |
| 1 | gender | 1 | 0.0026 | 1.0037 | -0.0215 | 0.9209 | 0.0183 | 1.0258 | -0.0155 | 0.9674 | -0.0163 | 0.9763 |


### Interpretation of Stratified Bias Analysis – Random Forest (by Gender)

The table shows group-specific fairness metrics for the **Random Forest (RF) model**, stratified by gender  
(0 = Female, 1 = Male).

The signs are consistent with each row taking the listed feature value as the reference group and reporting the other group's metric relative to it. Row 1 (male as reference) therefore matches the summary table above, which was computed with `priv_grp=1`, and row 0 is its mirror image. For example, FPR Diff in row 0 is 0.2716 - 0.2501 = 0.0215, the male FPR minus the female FPR from the stratified performance table.

---

#### 1. Balanced Accuracy
- **Row 0 (female as reference):** Difference = **-0.0026**, Ratio = **0.9963**  
- **Row 1 (male as reference):** Difference = **0.0026**, Ratio = **1.0037**  
➡️ Balanced accuracy is almost identical, with females having a negligible advantage (~0.3 percentage points).

---

#### 2. False Positive Rate (FPR)
- **Row 0 (female as reference):** FPR Diff = **0.0215**, Ratio = **1.0859**  
- **Row 1 (male as reference):** FPR Diff = **-0.0215**, Ratio = **0.9209**  
➡️ Males experience a **slightly higher false positive rate** (0.2716 vs. 0.2501, about 2 percentage points), while females receive fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row 0 (female as reference):** PPV Diff = **-0.0183**, Ratio = **0.9748**  
- **Row 1 (male as reference):** PPV Diff = **0.0183**, Ratio = **1.0258**  
➡️ Precision is **higher for females** (0.7285 vs. 0.7102), meaning their positive predictions are somewhat more reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row 0 (female as reference):** Selection Diff = **0.0155**, Ratio = **1.0337**  
- **Row 1 (male as reference):** Selection Diff = **-0.0155**, Ratio = **0.9674**  
➡️ Males are **slightly more likely** to be predicted positive (0.4759 vs. 0.4603), while females are predicted positive at a slightly lower rate.

---

#### 5. True Positive Rate (TPR, Recall)
- **Row 0 (female as reference):** TPR Diff = **0.0163**, Ratio = **1.0243**  
- **Row 1 (male as reference):** TPR Diff = **-0.0163**, Ratio = **0.9763**  
➡️ Males achieve **higher recall** (0.6865 vs. 0.6702), meaning more of their true positives are detected.  
➡️ Females have slightly lower recall (more false negatives).

---

### Overall Conclusion
- **Females (0):**  
  - Advantages: **Lower false positive rate** and **higher precision** (positive predictions are more reliable).  
  - Disadvantages: **Lower recall** and **less likely to be predicted positive**.  

- **Males (1):**  
  - Advantages: **Higher recall and selection rate** (more positives detected).  
  - Disadvantages: **Higher false positive rate** and **slightly lower precision**.  

The disparities are **small (1–2 percentage points)** and reflect a **trade-off**:  
- **Females** → fewer false alarms, but at the cost of missing some positives.  
- **Males** → better sensitivity (more positives detected) but more false alarms.  

Overall, the Random Forest model remains **reasonably fair across genders** with only minor imbalances.
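
The sign convention of the stratified bias table can be checked against the per-group rates. The sketch below is a plain-Python illustration, not fairmlhealth code. It assumes that each row reports "other group minus listed group", and it copies the per-group FPR, TPR, and precision values from the stratified performance table. The function name `rest_minus_reference` is made up for this example.

```python
# Per-group Random Forest rates copied from the stratified performance table.
rates = {
    0: {"FPR": 0.2501, "TPR": 0.6702, "PPV": 0.7285},  # female
    1: {"FPR": 0.2716, "TPR": 0.6865, "PPV": 0.7102},  # male
}


def rest_minus_reference(rates, reference, metric):
    """Difference for one table row: other group's value minus the reference group's.

    Assumption: this is how the rows of the stratified bias table are signed.
    Only valid for a binary feature (exactly one "other" group).
    """
    (other,) = [group for group in rates if group != reference]
    return rates[other][metric] - rates[reference][metric]


rows = {
    reference: {
        metric: round(rest_minus_reference(rates, reference, metric), 4)
        for metric in ("FPR", "PPV", "TPR")
    }
    for reference in (0, 1)
}

for reference, row in rows.items():
    print(f"gender = {reference}: {row}")
```

Under this assumption the recomputed differences line up with the FPR Diff, PPV Diff, and TPR Diff columns of both rows, which is what motivates reading a positive FPR Diff in row 0 as a higher FPR for males.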

---

In [28]:
# Get the stratified performance table
perf_table_rf = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_rf,
    y_prob=y_prob_rf
)

# Replace NaN with a dash
perf_table_rf = perf_table_rf.fillna("—")

# display pretty table
display(perf_table_rf)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 11256.0 | 0.4976 | 0.4657 | 0.7092 | 0.6981 | 0.2576 | — | 0.7221 | 0.7667 | 0.6758 |
| 1 | gender | 0 | 7362.0 | 0.5004 | 0.4603 | 0.71 | 0.6981 | 0.2501 | — | 0.7285 | 0.7715 | 0.6702 |
| 2 | gender | 1 | 3894.0 | 0.4923 | 0.4759 | 0.7078 | 0.6981 | 0.2716 | — | 0.7102 | 0.7578 | 0.6865 |


### Interpretation of Stratified Performance Metrics – Random Forest (by Gender)

The table shows performance results for the **Random Forest (RF) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7092  
- **F1-Score:** 0.6981  
- **Precision:** 0.7221  
- **ROC AUC:** 0.7667  
- **TPR (Recall):** 0.6758  
→ The RF model achieves **moderate overall performance** (ROC AUC ≈ 0.77), with precision (0.72) somewhat higher than recall (0.68).

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7100 (slightly higher than males)  
- **F1-Score:** 0.6981 (same as males)  
- **FPR:** 0.2501 (lower than males → fewer false positives)  
- **Precision:** 0.7285 (higher than males → more reliable predictions)  
- **ROC AUC:** 0.7715 (higher than males → better discrimination ability)  
- **TPR (Recall):** 0.6702 (lower recall → more false negatives)  

➡️ Females benefit from **better precision, fewer false positives, and higher ROC AUC**, but they sacrifice recall.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7078 (slightly lower than females)  
- **F1-Score:** 0.6981 (same as females)  
- **FPR:** 0.2716 (higher → more false positives)  
- **Precision:** 0.7102 (lower precision)  
- **ROC AUC:** 0.7578 (lower than females)  
- **TPR (Recall):** 0.6865 (higher recall → fewer false negatives)  

➡️ Males achieve **better recall**, but this comes at the cost of **more false positives and less precise predictions**.

---

### Overall Conclusion
- **Females (0):** Stronger in **precision, ROC AUC, and lower FPR** → more reliable predictions, fewer false alarms, but slightly worse sensitivity.  
- **Males (1):** Stronger in **recall (TPR)** → more positives detected, but with higher false positive rates and lower precision.  

The Random Forest model therefore shows a **typical fairness trade-off**:  
- **Females** → more accurate and reliable predictions.  
- **Males** → more sensitive detection but at the cost of increased false alarms.  

The differences are **small (≈1–2 percentage points)**, suggesting the model is **reasonably balanced** but introduces a subtle gender trade-off.
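
Balanced accuracy is not listed in the performance table, but it follows from TPR and FPR as the mean of sensitivity and specificity. The short sketch below recomputes it per gender from the values above. The function name is illustrative, and the inputs are the rounded table values, so the last digit may differ slightly from the library output.

```python
def balanced_accuracy(tpr, fpr):
    """Mean of sensitivity (TPR) and specificity (1 - FPR)."""
    return (tpr + (1.0 - fpr)) / 2.0


# Random Forest rates per gender, copied from the stratified performance table.
ba_female = balanced_accuracy(tpr=0.6702, fpr=0.2501)
ba_male = balanced_accuracy(tpr=0.6865, fpr=0.2716)

ba_difference = ba_female - ba_male  # unprivileged minus privileged
ba_ratio = ba_female / ba_male       # unprivileged over privileged

print(f"Balanced accuracy (female): {ba_female:.4f}")
print(f"Balanced accuracy (male):   {ba_male:.4f}")
print(f"Difference: {ba_difference:.4f}, Ratio: {ba_ratio:.4f}")
```

The recomputed difference and ratio agree with the Balanced Accuracy Difference (0.0026) and Ratio (1.0037) in the fairness summary, with females marginally ahead.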

---

In [29]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_rf == 0)  # female = unprivileged group
male_mask   = (protected_attr_rf == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_rf[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_rf[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6702
  False Positive Rate (FPR): 0.2501
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.6865
  False Positive Rate (FPR): 0.2716
----------------------------------------


### Group-Specific Error Analysis – Random Forest Model

This section presents the classification performance of the Random Forest model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6702  | 0.2501  |
| Male (Privileged = 1)        | 0.6865  | 0.2716  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 67.02%**, slightly lower than males → fewer true positives are identified.  
  - Benefit from a **lower FPR (25.01%)**, meaning they are less often incorrectly classified as positive.

- **Males (Privileged):**  
  - Achieve a **TPR of 68.65%**, indicating stronger recall and more true positives captured.  
  - However, they also face a **higher FPR (27.16%)**, meaning more false positives occur.

#### Conclusion
The Random Forest model demonstrates a **performance trade-off**:  
- **Males** → better recall (higher TPR) but at the cost of more false alarms (higher FPR).  
- **Females** → fewer false positives, but slightly weaker sensitivity.  

The differences (≈1.6 percentage points in TPR and ≈2.1 percentage points in FPR) are **small**, suggesting the model is **reasonably fair across genders**, with only minor imbalances.
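
For readers who want to see what sits behind the per-group TPR and FPR, the following sketch computes both rates directly from confusion-matrix counts with NumPy. It is an independent illustration on made-up toy data, not the fairmlhealth implementation, and `group_rates` is a hypothetical helper name.

```python
import numpy as np


def group_rates(y_true, y_pred, group, value):
    """Return (TPR, FPR) for the records where `group == value`."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    mask = np.asarray(group).ravel() == value

    yt, yp = y_true[mask], y_pred[mask]
    tp = np.sum((yt == 1) & (yp == 1))  # true positives
    fn = np.sum((yt == 1) & (yp == 0))  # false negatives
    fp = np.sum((yt == 0) & (yp == 1))  # false positives
    tn = np.sum((yt == 0) & (yp == 0))  # true negatives

    tpr = float(tp / (tp + fn))
    fpr = float(fp / (fp + tn))
    return tpr, fpr


# Toy data, invented purely for illustration.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
gender = [0, 0, 0, 0, 1, 1, 1, 1]

for value, label in [(0, "Female"), (1, "Male")]:
    tpr, fpr = group_rates(y_true, y_pred, gender, value)
    print(f"{label}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

In the notebook itself the same call could presumably be made with `y_test`, `y_pred_rf`, and `protected_attr_rf`. The helper does not guard against groups that contain no positives or no negatives.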

---

### Deep Learning Model - Feed Forward Network (MLP)

In [30]:
mlp_df = pd.read_csv("CVDKaggleData_50M50F_MLP_adamtuned_predictions.csv")
print(mlp_df.head())

   gender  y_true  y_pred    y_prob
0       0       0       0  0.321685
1       0       0       1  0.868185
2       1       0       0  0.412735
3       0       0       0  0.285116
4       0       0       0  0.242904


In [31]:
# Extract common columns 
y_true_mlp = mlp_df["y_true"].values 
y_prob_mlp = mlp_df["y_prob"].values
y_pred_mlp = mlp_df["y_pred"].values
gender_mlp = mlp_df["gender"].values 

# Use gender_mlp as the protected attribute
protected_attr_mlp = gender_mlp 
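
The cells that follow pass `y_test` as ground truth together with the MLP predictions loaded from the CSV file, which only makes sense if both are in the same row order. A minimal sanity check could look like the sketch below. `assert_same_order` is a hypothetical helper, and the arrays are toy stand-ins for the notebook variables.

```python
import numpy as np


def assert_same_order(reference, candidate, name="values"):
    """Raise ValueError if two label/attribute vectors differ in length or content."""
    ref = np.asarray(reference).ravel()
    cand = np.asarray(candidate).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"{name}: length mismatch ({ref.size} vs {cand.size})")
    mismatches = int(np.sum(ref != cand))
    if mismatches:
        raise ValueError(f"{name}: {mismatches} rows differ; data may be misaligned")
    return True


# Toy stand-ins: labels from the test split and labels stored with the predictions.
labels_from_split = [0, 1, 1, 0]
labels_from_predictions_file = [0, 1, 1, 0]

print(assert_same_order(labels_from_split, labels_from_predictions_file, name="y_true"))
```

In this notebook the check would presumably be applied as `assert_same_order(y_test, y_true_mlp, "y_true")` and, assuming `X_test` has a `gender` column, `assert_same_order(X_test["gender"], gender_mlp, "gender")`.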

In [32]:
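# Run fairmlhealth bias detection for MLP
# Note: y_test is used as ground truth, which assumes the rows of mlp_df
# are in the same order as X_test / y_test.

mlp_bias = measure.summary(
    X=X_test,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp,
    prtc_attr=protected_attr_mlp,
    pred_type="classification",
    priv_grp=1,  # 1 = Male = Privileged
    sig_fig=4,
    skip_if=True,
    skip_performance=True
)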

print(mlp_bias)

                                                        Value
Metric         Measure                                       
Group Fairness AUC Difference                          0.0073
               Balanced Accuracy Difference            0.0034
               Balanced Accuracy Ratio                 1.0047
               Disparate Impact Ratio                  0.9596
               Equal Odds Difference                  -0.0267
               Equal Odds Ratio                        0.9052
               Positive Predictive Parity Difference   0.0209
               Positive Predictive Parity Ratio        1.0294
               Statistical Parity Difference          -0.0199
Data Metrics   Prevalence of Privileged Class (%)     35.0000


In [33]:
# Flagged fairness table for MLP
styled_mlp = MyFlagger().apply_flag(
    df=mlp_bias,
    caption="MLP Fairness (Gender)",
    boundaries=bounds,
    sig_fig=4,
    as_styler=True
)
styled_mlp

| Metric | Measure | Value |
|---|---|---|
| Group Fairness | AUC Difference | 0.0073 |
| Group Fairness | Balanced Accuracy Difference | 0.0034 |
| Group Fairness | Balanced Accuracy Ratio | 1.0047 |
| Group Fairness | Disparate Impact Ratio | 0.9596 |
| Group Fairness | Equal Odds Difference | -0.0267 |
| Group Fairness | Equal Odds Ratio | 0.9052 |
| Group Fairness | Positive Predictive Parity Difference | 0.0209 |
| Group Fairness | Positive Predictive Parity Ratio | 1.0294 |
| Group Fairness | Statistical Parity Difference | -0.0199 |
| Data Metrics | Prevalence of Privileged Class (%) | 35.0 |


### Interpretation of MLP Fairness Metrics by Gender

The table reports fairness metrics for the **Multilayer Perceptron (MLP) model**, using gender as the sensitive attribute.

---

#### Key Observations

- **AUC Difference (0.0073):**  
  → Very small; ranking ability is nearly identical across genders.

- **Balanced Accuracy Difference (0.0034)** and **Ratio (1.0047):**  
  → Balanced accuracy is almost the same, with only a tiny advantage for females.  

- **Disparate Impact Ratio (0.9596):**  
  → Slightly below 1, meaning females are **a bit less likely to receive positive predictions** compared to males.  
  → Still within the fairness guideline range (0.8–1.25).

- **Equal Odds Difference (-0.0267)** and **Equal Odds Ratio (0.9052):**  
  → Shows the largest disparity. Both values coincide with the FPR Diff and FPR Ratio for gender in the stratified bias table: females have a **lower false positive rate** (0.2550 vs. 0.2817 for males), while males have a slightly higher TPR (0.7094 vs. 0.6895).  
  → The imbalance is modest (~2.7 percentage points).

- **Positive Predictive Parity Difference (0.0209)** and **Ratio (1.0294):**  
  → Precision is somewhat higher for females (0.7303 vs. 0.7094 for males), meaning their positive predictions are more reliable.

- **Statistical Parity Difference (-0.0199):**  
  → Females are **predicted positive at a slightly lower rate** than males, but the gap remains small.

- **Prevalence of Privileged Class (35%):**  
  → Males account for about 35% of the test set (3,894 of 11,256 records). Despite this lower representation, the fairness metrics remain well balanced.

---

#### Overall Conclusion
The MLP model demonstrates **generally fair performance across genders**, with small disparities:
- **Females (unprivileged group):** Slightly less likely to be predicted positive, with a lower false positive rate and higher precision, but slightly lower recall.  
- **Males (privileged group):** Benefit from higher recall but face a higher false positive rate and lower precision.  

The most notable gap is in **Equal Odds (-0.0267)**, but all values remain **close to parity**. Overall, the model can be considered **reasonably fair**, with no strong systematic bias detected.
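
The Equal Odds Difference can be related back to the per-group error rates. The sketch below assumes that the measure reports whichever of the TPR gap and the FPR gap (unprivileged minus privileged) is larger in magnitude, which is consistent with the numbers in this notebook but should be confirmed against the fairmlhealth documentation. The helper names are made up for illustration.

```python
def signed_gap(unprivileged, privileged):
    """Unprivileged-group rate minus privileged-group rate."""
    return unprivileged - privileged


def largest_gap(tpr_gap, fpr_gap):
    """Return whichever gap is larger in magnitude, keeping its sign.

    Assumption: this mirrors how the Equal Odds Difference is summarised.
    """
    return fpr_gap if abs(fpr_gap) > abs(tpr_gap) else tpr_gap


# MLP rates per gender (female = unprivileged, male = privileged),
# copied from the stratified performance table for the MLP.
tpr_gap = signed_gap(unprivileged=0.6895, privileged=0.7094)
fpr_gap = signed_gap(unprivileged=0.2550, privileged=0.2817)
eq_odds = largest_gap(tpr_gap, fpr_gap)

print(f"TPR gap: {tpr_gap:.4f}")
print(f"FPR gap: {fpr_gap:.4f}")
print(f"Larger gap (by magnitude): {eq_odds:.4f}")
```

Under this reading, the reported value of -0.0267 is the FPR gap, and the slightly smaller TPR gap points in the same direction: both rates are lower for females.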

---

In [34]:
print("FairMLHealth Stratified Bias Table - MLP")
measure.bias(X_test, y_test, y_pred_mlp, features=['gender'], flag_oor=False)

FairMLHealth Stratified Bias Table - MLP


| | Feature Name | Feature Value | Balanced Accuracy Difference | Balanced Accuracy Ratio | FPR Diff | FPR Ratio | PPV Diff | PPV Ratio | Selection Diff | Selection Ratio | TPR Diff | TPR Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gender | 0 | -0.0034 | 0.9953 | 0.0267 | 1.1047 | -0.0209 | 0.9714 | 0.0199 | 1.0421 | 0.02 | 1.029 |
| 1 | gender | 1 | 0.0034 | 1.0047 | -0.0267 | 0.9052 | 0.0209 | 1.0294 | -0.0199 | 0.9596 | -0.02 | 0.9718 |


### Interpretation of Stratified Bias Analysis – MLP (by Gender)

The table shows group-specific fairness metrics for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

The signs are consistent with each row taking the listed feature value as the reference group and reporting the other group's metric relative to it. Row 1 (male as reference) therefore matches the MLP summary table above, which was computed with `priv_grp=1`, and row 0 is its mirror image. For example, FPR Diff in row 0 is 0.2817 - 0.2550 = 0.0267, the male FPR minus the female FPR from the stratified performance table.

---

#### 1. Balanced Accuracy
- **Row 0 (female as reference):** Difference = **-0.0034**, Ratio = **0.9953**  
- **Row 1 (male as reference):** Difference = **0.0034**, Ratio = **1.0047**  
➡️ Balanced accuracy is almost identical, with only a tiny advantage for females.

---

#### 2. False Positive Rate (FPR)
- **Row 0 (female as reference):** FPR Diff = **0.0267**, Ratio = **1.1047**  
- **Row 1 (male as reference):** FPR Diff = **-0.0267**, Ratio = **0.9052**  
➡️ Males have a **higher false positive rate** (0.2817 vs. 0.2550), meaning they are more often incorrectly classified as positive.  
➡️ Females benefit from fewer false positives.

---

#### 3. Positive Predictive Value (PPV, Precision)
- **Row 0 (female as reference):** PPV Diff = **-0.0209**, Ratio = **0.9714**  
- **Row 1 (male as reference):** PPV Diff = **0.0209**, Ratio = **1.0294**  
➡️ Females achieve **higher precision** (0.7303 vs. 0.7094), meaning their positive predictions are more reliable.  
➡️ Males' positive predictions are slightly less reliable.

---

#### 4. Selection Rate (likelihood of being predicted positive)
- **Row 0 (female as reference):** Selection Diff = **0.0199**, Ratio = **1.0421**  
- **Row 1 (male as reference):** Selection Diff = **-0.0199**, Ratio = **0.9596**  
➡️ Males are **more likely to be predicted positive** (0.4923 vs. 0.4724), while females are predicted positive at a slightly lower rate.

---

#### 5. True Positive Rate (TPR, Recall)
- **Row 0 (female as reference):** TPR Diff = **0.0200**, Ratio = **1.0290**  
- **Row 1 (male as reference):** TPR Diff = **-0.0200**, Ratio = **0.9718**  
➡️ Males achieve **higher recall** (0.7094 vs. 0.6895), meaning more of their true positives are detected.  
➡️ Females experience more false negatives.

---

### Overall Conclusion
- **Females (0):**  
  - Advantages: **Lower false positive rate** and **higher precision** (positive predictions are more reliable).  
  - Disadvantages: **Lower recall** and **less likely to be predicted positive**.  

- **Males (1):**  
  - Advantages: **Higher recall and higher selection rate** (more positives detected).  
  - Disadvantages: **Higher false positive rate** and **slightly lower precision**.  

The differences (≈2–2.7 percentage points) are **modest** and represent a **classic trade-off**:  
- **Females** → fewer false alarms and higher reliability, but more missed positives.  
- **Males** → better sensitivity but less reliability.  

Overall, the MLP model is **reasonably balanced**, though it leans slightly toward **favoring males in recall** and **females in precision**.

---

In [35]:
# Get the stratified performance table
perf_table_mlp = measure.performance(
    X=gender_df,
    y_true=y_test,
    y_pred=y_pred_mlp,
    y_prob=y_prob_mlp
)

# Replace NaN with a dash
perf_table_mlp = perf_table_mlp.fillna("—")

# display pretty table
display(perf_table_mlp)


  warn(f"Possible error in column(s) {cols}. {wr}\n")


| | Feature Name | Feature Value | Obs. | Mean Target | Mean Prediction | Accuracy | F1-Score | FPR | PR AUC | Precision | ROC AUC | TPR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL FEATURES | ALL VALUES | 11256.0 | 0.4976 | 0.4793 | 0.7161 | 0.7093 | 0.2644 | — | 0.7229 | 0.7728 | 0.6963 |
| 1 | gender | 0 | 7362.0 | 0.5004 | 0.4724 | 0.7172 | 0.7093 | 0.255 | — | 0.7303 | 0.7755 | 0.6895 |
| 2 | gender | 1 | 3894.0 | 0.4923 | 0.4923 | 0.7139 | 0.7094 | 0.2817 | — | 0.7094 | 0.7682 | 0.7094 |


### Interpretation of Stratified Performance Metrics – MLP (by Gender)

The table shows performance results for the **Multilayer Perceptron (MLP) model**, stratified by gender  
(0 = Female, 1 = Male).

---

#### Overall Performance (All Data)
- **Accuracy:** 0.7161  
- **F1-Score:** 0.7093  
- **Precision:** 0.7229  
- **ROC AUC:** 0.7728  
- **TPR (Recall):** 0.6963  
→ The MLP achieves **moderate overall performance** (ROC AUC ≈ 0.77), with precision (0.72) slightly higher than recall (0.70).

---

#### Group-Specific Performance

**1. Female (0 – Unprivileged)**  
- **Accuracy:** 0.7172 (slightly higher than males)  
- **F1-Score:** 0.7093 (essentially the same as males)  
- **FPR:** 0.2550 (lower than males → fewer false positives)  
- **Precision:** 0.7303 (higher than males → more reliable positive predictions)  
- **ROC AUC:** 0.7755 (slightly better discrimination ability)  
- **TPR (Recall):** 0.6895 (lower recall → more false negatives)  

➡️ Females benefit from **higher precision, accuracy, and fewer false positives**, but recall is weaker than for males.

---

**2. Male (1 – Privileged)**  
- **Accuracy:** 0.7139 (slightly lower)  
- **F1-Score:** 0.7094 (essentially the same as females)  
- **FPR:** 0.2817 (higher → more false positives)  
- **Precision:** 0.7094 (lower precision than females)  
- **ROC AUC:** 0.7682 (slightly lower discrimination ability)  
- **TPR (Recall):** 0.7094 (higher recall → fewer false negatives)  

➡️ Males achieve **better recall**, but this comes at the cost of **higher false positives and lower precision**.

---

### Overall Conclusion
- **Females (0):** More reliable predictions (higher precision, lower FPR, slightly higher ROC AUC), but weaker sensitivity.  
- **Males (1):** Better sensitivity (higher TPR/recall), but more false positives and less precise predictions.  

The disparities are **small (≈2.7 percentage points in FPR and ≈2 percentage points in recall)**, reflecting a **balanced model with minor trade-offs**:
- **Females** → fewer false positives, better reliability.  
- **Males** → more true positives detected, but noisier predictions.  

Overall, the MLP demonstrates **good fairness** with **only minor gender differences**.
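
The precision gap can be traced back to the error rates: precision depends on TPR, FPR, and the share of true positives in each group (the "Mean Target" column). The sketch below applies that standard identity to the rounded MLP values from the table. The function name is illustrative.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Return (precision, selection rate) implied by TPR, FPR and prevalence."""
    true_pos = tpr * prevalence           # share of all records that are true positives
    false_pos = fpr * (1.0 - prevalence)  # share of all records that are false positives
    selection_rate = true_pos + false_pos
    return true_pos / selection_rate, selection_rate


# MLP values per gender, copied from the stratified performance table.
female_ppv, female_sel = precision_from_rates(tpr=0.6895, fpr=0.2550, prevalence=0.5004)
male_ppv, male_sel = precision_from_rates(tpr=0.7094, fpr=0.2817, prevalence=0.4923)

print(f"Female: precision={female_ppv:.4f}, selection rate={female_sel:.4f}")
print(f"Male:   precision={male_ppv:.4f}, selection rate={male_sel:.4f}")
```

The recomputed values agree with the Precision and Mean Prediction columns to within rounding, which illustrates why the higher false positive rate for males goes along with lower precision despite their higher recall.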

---

In [36]:
from fairmlhealth import performance_metrics as pm

# Define group masks with clear names
female_mask = (protected_attr_mlp == 0)  # female = unprivileged group
male_mask   = (protected_attr_mlp == 1)  # male = privileged group 

# Function to evaluate group-specific metrics
def evaluate_group_performance(group_name, mask):
    tpr = pm.true_positive_rate(y_test[mask], y_pred_mlp[mask])
    fpr = pm.false_positive_rate(y_test[mask], y_pred_mlp[mask])
    print(f"{group_name} Results:")
    print(f"  True Positive Rate (TPR): {tpr:.4f}")
    print(f"  False Positive Rate (FPR): {fpr:.4f}")
    print("-" * 40)

# Evaluate for each group
evaluate_group_performance("Female (Unprivileged)", female_mask)
evaluate_group_performance("Male (Privileged)", male_mask)

Female (Unprivileged) Results:
  True Positive Rate (TPR): 0.6895
  False Positive Rate (FPR): 0.2550
----------------------------------------
Male (Privileged) Results:
  True Positive Rate (TPR): 0.7094
  False Positive Rate (FPR): 0.2817
----------------------------------------


### Group-Specific Error Analysis – MLP Model

This section reports the classification performance of the **Multilayer Perceptron (MLP)** model across gender groups using **True Positive Rate (TPR)** and **False Positive Rate (FPR)**.

#### Results by Gender Group

| Group                        | TPR     | FPR     |
|------------------------------|---------|---------|
| Female (Unprivileged = 0)    | 0.6895  | 0.2550  |
| Male (Privileged = 1)        | 0.7094  | 0.2817  |

#### Interpretation
- **Females (Unprivileged):**  
  - Achieve a **TPR of 68.95%**, which is slightly lower than for males, meaning more missed positives (false negatives).  
  - Benefit from a **lower FPR (25.50%)**, meaning they are less often incorrectly classified as positive.

- **Males (Privileged):**  
  - Achieve a **TPR of 70.94%**, slightly higher than females, meaning more true positives are correctly identified.  
  - However, they face a **higher FPR (28.17%)**, which means they are more often incorrectly flagged as positive.

#### Conclusion
The MLP model shows a **clear trade-off**:
- **Males** → stronger recall (higher TPR) but at the cost of more false positives.  
- **Females** → fewer false positives (lower FPR) but weaker recall.  

The differences (≈2 percentage points in TPR and ≈2.7 percentage points in FPR) are **modest**, suggesting the model is **reasonably fair across genders**, with only minor performance imbalances.

---
