<a href="https://colab.research.google.com/github/pitthexai/IEEE_ICHI_EBAICWorkshop/blob/main/EBAIC2024_Workshop/Track03_Fairlearn/EBAIC2024_Fairlearn_Diabetes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EBAIC 2024 Track III: Fairlearn: An open-source package to improve fairness of AI

The field of fairness in AI systems is an interdisciplinary area of research and practice focused on understanding and mitigating the negative impacts of AI on society. In this tutorial, we use Fairlearn, an open-source library designed to help improve the fairness of AI systems, to study an automated system that recommends patients for high-risk care management programs.

## Dataset and Task
We work with a clinical dataset included with the Fairlearn library. It contains hospital readmissions over a ten-year period (1998-2008) for diabetic patients across 130 different hospitals within the US. The dataset includes the following features:

- demographics,
- diagnoses,
- diabetic medications,
- number of visits in the year preceding the encounter,
- payer information,
- whether the patient was readmitted after release,
- whether the readmission occurred within 30 days of the release

Our goal is to develop a classification model that decides whether a patient should be suggested to their primary care physician for enrollment in a high-risk care management program.

## Package Setup

In [None]:
!pip install --upgrade fairlearn==0.10.0
!pip install --upgrade scikit-learn
!pip install --upgrade seaborn

In [None]:
import numpy as np
import pandas as pd

pd.set_option("display.float_format", "{:.3f}".format)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.utils import Bunch
from sklearn.metrics import (
    balanced_accuracy_score,
    roc_auc_score,
    accuracy_score,
    recall_score,
    confusion_matrix,
    roc_curve,
)
from sklearn import set_config

set_config(display="diagram")

In [None]:
from fairlearn.metrics import (
    MetricFrame,
    true_positive_rate,
    false_positive_rate,
    false_negative_rate,
    selection_rate,
    count,
    false_negative_rate_difference
)

from fairlearn.datasets import fetch_diabetes_hospital
from fairlearn.postprocessing import ThresholdOptimizer, plot_threshold_optimizer
from fairlearn.postprocessing._interpolated_thresholder import InterpolatedThresholder
from fairlearn.postprocessing._threshold_operation import ThresholdOperation
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds, TruePositiveRateParity

## Data Exploration

The first step is to explore the data provided for any fairness issues that may occur. Specifically, we look at:
1. sample sizes of different demographic groups, and in particular different racial groups
2. balance of the class labels

### Loading the Dataset
We first load the dataset using the Fairlearn library. We then construct our target column ```readmit_30_days```.

In [None]:
diabetes_data = fetch_diabetes_hospital()

In [None]:
diabetes_df = diabetes_data.data

In [None]:
diabetes_df["readmit_30_days"] = np.where(diabetes_df.readmitted == "<30", 1, 0)

### Group Sizes

When assessing fairness, a key characteristic of the data is the sample size of each group being assessed. Small sample sizes have two key implications:

- **assessment**: smaller groups are harder to assess because they have fewer data points, which leads to much larger uncertainty in our estimates

- **model training**: with fewer training data points, our model may fail to capture data patterns specific to smaller groups. This can lead to worse predictive performance on these groups.
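
To make the assessment point concrete, the sketch below shows how the uncertainty of an estimated rate grows as the group shrinks. It uses the normal-approximation standard error of a proportion. The helper name `rate_standard_error`, the group sizes, and the observed rate are made up for illustration and are not taken from the dataset.

```python
import math


def rate_standard_error(rate, n):
    """Standard error of an estimated proportion (normal approximation)."""
    return math.sqrt(rate * (1.0 - rate) / n)


# Hypothetical group sizes; the observed rate is assumed identical in every group.
group_sizes = {"large_group": 10000, "medium_group": 1000, "small_group": 100}
observed_rate = 0.10

for name, n in group_sizes.items():
    se = rate_standard_error(observed_rate, n)
    # Approximate 95% interval: estimate plus or minus 1.96 standard errors.
    low, high = observed_rate - 1.96 * se, observed_rate + 1.96 * se
    print(f"{name:>12}: n={n:>6}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With the same observed rate, the interval for the group of 100 is ten times wider than for the group of 10,000, so differences between small groups should be interpreted with care.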


#### Race Group Sizes

In [None]:
diabetes_df["race"].value_counts()

In [None]:
diabetes_df["race"].value_counts().plot(kind='bar', rot=45);

In [None]:
diabetes_df["race"].value_counts(normalize=True) # frequencies

In [None]:
# drop gender group Unknown/Invalid
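# take an explicit copy so that assigning new columns below does not raise SettingWithCopyWarning
diabetes_df = diabetes_df.query("gender != 'Unknown/Invalid'").copy()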

# retain the original race as race_all, and merge Asian+Hispanic+Other
diabetes_df["race_all"] = diabetes_df["race"]
diabetes_df["race"] = diabetes_df["race"].replace({"Asian": "Other", "Hispanic": "Other"})

#### Gender

In [None]:
diabetes_df["gender"].value_counts()

In [None]:
diabetes_df["gender"].value_counts().plot(kind='bar', rot=45);

### Label Imbalance

We next look at the frequency of our class labels. The frequency of the labels is important because:
- some classification algorithms and evaluation metrics won't work well with data sets that contain extreme class imbalances
- extreme class imbalance may make bias towards certain groups worse due to smaller group sizes in fairness assessment

In [None]:
diabetes_df["readmit_30_days"].value_counts()

Due to the large imbalance between the negative and positive class, we will use balanced accuracy to evaluate our predictive model.
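
As a small illustration of why plain accuracy is misleading here, the sketch below compares accuracy and balanced accuracy for a classifier that always predicts the negative class. The labels are made up (10 positives out of 100), and the two helper functions are hand-written stand-ins for the scikit-learn metrics, written out only to show the arithmetic in the binary case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Made-up, imbalanced labels: 10 positives and 90 negatives.
y_true = [1] * 10 + [0] * 90
# A trivial classifier that never recommends anyone.
y_always_negative = [0] * 100

print(accuracy(y_true, y_always_negative))           # 0.9
print(balanced_accuracy(y_true, y_always_negative))  # 0.5
```

The trivial classifier looks strong under accuracy but scores no better than chance under balanced accuracy, because it misses every positive example.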

## Training a Model
Next, we train a classification model. Here, we use logistic regression for both its interpretability and its expressiveness.

### Training/Test Splits
We split the data 50/50 into training and test sets. Because our evaluation metric is balanced accuracy, we then resample the training set so that it contains the same number of positive and negative examples.

In [None]:
target_variable = "readmit_30_days"
demographic = ["race", "gender"]
sensitive = ["race"]

In [None]:
Y, A = diabetes_df.loc[:, target_variable], diabetes_df.loc[:, sensitive]

X = pd.get_dummies(diabetes_df.drop(columns=[
    "race",
    "race_all",
    "discharge_disposition_id",
    "readmitted",
    "readmit_binary",
    "readmit_30_days"
]))

In [None]:
random_seed = 45
np.random.seed(random_seed)

X_train, X_test, Y_train, Y_test, A_train, A_test, df_train, df_test = train_test_split(
    X,
    Y,
    A,
    diabetes_df,
    test_size=0.50,
    stratify=Y,
    random_state=random_seed
)

In [None]:
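def resample_dataset(X_train, Y_train, A_train):
  # Undersample the negative class so that both classes have the same size.
  negative_ids = Y_train[Y_train == 0].index
  positive_ids = Y_train[Y_train == 1].index
  # Sample without replacement: union() drops duplicate ids, so sampling with
  # replacement would leave fewer negatives than positives.
  sampled_negative_ids = np.random.choice(a=negative_ids, size=len(positive_ids), replace=False)
  balanced_ids = positive_ids.union(sampled_negative_ids)

  X_train = X_train.loc[balanced_ids, :]
  Y_train = Y_train.loc[balanced_ids]
  A_train = A_train.loc[balanced_ids, :]
  return X_train, Y_train, A_train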

In [None]:
X_train_bal, Y_train_bal, A_train_bal = resample_dataset(X_train, Y_train, A_train)

### Logistic Regression Model

In [None]:
unmitigated_pipeline = Pipeline(steps=[
    ("preprocessing", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000))
])

In [None]:
unmitigated_pipeline.fit(X_train_bal, Y_train_bal)

In [None]:
Y_pred_proba = unmitigated_pipeline.predict_proba(X_test)[:,1]
Y_pred = unmitigated_pipeline.predict(X_test)

In [None]:
balanced_accuracy_score(Y_test, Y_pred)

In [None]:
coef_series = pd.Series(data=unmitigated_pipeline.named_steps["logistic_regression"].coef_[0], index=X.columns)
coef_series.sort_values().plot.barh(figsize=(4, 12), legend=False);

## Fairness Assessment

In this healthcare scenario, patients who could benefit from a care management program but are not recommended for it experience an allocation harm. In classification terms, these patients are false negatives. Here, we focus on groups defined by race.

To evaluate fairness, we use two metrics that quantify harms and benefits:
- **false negative rates (quantifying harm)**: the fraction of patients that are readmitted within 30 days, but that are not recommended for the care management program
- **selection rate (quantifying benefits)**: the overall fraction of patients that are recommended for the care management program
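
The two definitions above can be written out directly. The sketch below uses made-up labels and predictions; `fnr` and `sel_rate` are hypothetical helper names chosen so they are not confused with the Fairlearn functions `false_negative_rate` and `selection_rate` that the notebook actually uses.

```python
def fnr(y_true, y_pred):
    """False negative rate: actual positives predicted negative, FN / (FN + TP)."""
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    return fn / n_pos


def sel_rate(y_pred):
    """Selection rate: fraction of all samples predicted positive."""
    return sum(p == 1 for p in y_pred) / len(y_pred)


# Made-up example: 4 readmitted patients (label 1) and 6 who were not (label 0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 0, 0, 0, 0]

print(fnr(y_true, y_pred))  # 0.5, two of the four readmitted patients are missed
print(sel_rate(y_pred))     # 0.3, three of the ten patients are recommended
```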

To easily compare false negative rates across groups defined by race, we report group-specific false negative rates as well as the largest difference, the smallest ratio, and the minimum and maximum values across groups.
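
For intuition, the sketch below computes a metric per group and then the between-group summaries by hand, on made-up data with two hypothetical groups "A" and "B". It is meant to mirror what `MetricFrame` reports through `by_group`, `difference()`, `ratio()`, `group_min()` and `group_max()` with their default settings, but it is only a simplified stand-in, not Fairlearn's implementation.

```python
from collections import defaultdict


def fnr(y_true, y_pred):
    """False negative rate: FN / (FN + TP)."""
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    return fn / n_pos


def metric_by_group(metric, y_true, y_pred, groups):
    """Evaluate `metric` separately on the samples of each group."""
    true_by_group, pred_by_group = defaultdict(list), defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        true_by_group[g].append(t)
        pred_by_group[g].append(p)
    return {g: metric(true_by_group[g], pred_by_group[g]) for g in true_by_group}


# Made-up data: six patients in each of two hypothetical groups.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0]
groups = ["A"] * 6 + ["B"] * 6

rates = metric_by_group(fnr, y_true, y_pred, groups)
summary = {
    "difference": max(rates.values()) - min(rates.values()),  # largest gap between groups
    "ratio": min(rates.values()) / max(rates.values()),       # smallest between-group ratio
    "group_min": min(rates.values()),
    "group_max": max(rates.values()),                          # worst-case false negative rate
}

print(rates)    # {'A': 0.25, 'B': 0.75}
print(summary)
```

A difference of 0 and a ratio of 1 would mean that both groups have the same false negative rate.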

In [None]:
# You can also evaluate multiple metrics by providing a dictionary

metrics_dict = {
    "selection_rate": selection_rate,
    "false_negative_rate": false_negative_rate,
    "balanced_accuracy": balanced_accuracy_score,
}

metricframe_unmitigated = MetricFrame(metrics=metrics_dict,
                  y_true=Y_test,
                  y_pred=Y_pred,
                  sensitive_features=df_test['race'])

# The disaggregated metrics are then stored in a pandas DataFrame:

metricframe_unmitigated.by_group

In [None]:
pd.DataFrame({'difference': metricframe_unmitigated.difference(),
              'ratio': metricframe_unmitigated.ratio(),
              'group_min': metricframe_unmitigated.group_min(),
              'group_max': metricframe_unmitigated.group_max()}).T

In [None]:
metricframe_unmitigated.by_group.plot.bar(subplots=True, layout= [1,3], figsize=(12, 4),
                      legend=False, rot=-45, position=1.5);

From the plots, the Unknown group is selected for the care management program less often than the other groups, and a larger fraction of its members who would likely benefit from the program are not selected for it.

## Mitigating Fairness-related Harms through Postprocessing

Postprocessing techniques are a class of unfairness-mitigation algorithms that take a trained model and a dataset as input and fit a transformation function to the model's outputs so that they satisfy some (group) fairness constraint(s), in our case a constraint on the false negative rate. Here, we use the ```ThresholdOptimizer```, which uses a model's predictions as a scoring function and identifies a separate threshold for each sensitive group in order to optimize a specified metric subject to specified fairness constraints. We use **false negative rate parity**, which requires that all groups have equal false negative rates.
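
To build intuition for group-specific thresholds, the sketch below picks one deterministic threshold per group so that each group reaches the same target false negative rate. This is a deliberately simplified illustration and not Fairlearn's algorithm: as far as we understand, ```ThresholdOptimizer``` also optimizes the chosen objective and may randomize between thresholds. The scores, labels, group names, and helper functions are all made up.

```python
def fnr_at_threshold(scores, labels, threshold):
    """FNR when we predict positive for every score >= threshold."""
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    missed = sum(s < threshold for s in positive_scores)
    return missed / len(positive_scores)


def threshold_for_target_fnr(scores, labels, target_fnr):
    """Highest observed score usable as a threshold with FNR <= target_fnr."""
    candidates = sorted(set(scores), reverse=True)
    for threshold in candidates:
        if fnr_at_threshold(scores, labels, threshold) <= target_fnr:
            return threshold
    return candidates[-1]  # the lowest score always gives an FNR of 0


# Made-up model scores: group B receives systematically lower scores than group A.
data = {
    "A": {"scores": [0.9, 0.8, 0.6, 0.4, 0.3, 0.2], "labels": [1, 1, 1, 1, 0, 0]},
    "B": {"scores": [0.7, 0.5, 0.35, 0.25, 0.2, 0.1], "labels": [1, 1, 1, 1, 0, 0]},
}

# One shared threshold of 0.5 gives unequal false negative rates.
shared = {g: fnr_at_threshold(d["scores"], d["labels"], 0.5) for g, d in data.items()}
print(shared)  # {'A': 0.25, 'B': 0.5}

# Group-specific thresholds chosen to reach the same target rate in both groups.
thresholds = {g: threshold_for_target_fnr(d["scores"], d["labels"], 0.25) for g, d in data.items()}
equalized = {g: fnr_at_threshold(d["scores"], d["labels"], thresholds[g]) for g, d in data.items()}
print(thresholds)  # {'A': 0.6, 'B': 0.35}
print(equalized)   # {'A': 0.25, 'B': 0.25}
```

Lowering the threshold for the group with lower scores equalizes the false negative rates, at the cost of selecting more members of that group overall.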

In [None]:
# Now we instantiate ThresholdOptimizer with the logistic regression estimator
postprocess_est = ThresholdOptimizer(
    estimator=unmitigated_pipeline,
    constraints="false_negative_rate_parity",
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method='predict_proba'
)

postprocess_est.fit(X_train_bal, Y_train_bal, sensitive_features=A_train_bal)

In [None]:
Y_pred_postprocess = postprocess_est.predict(X_test, sensitive_features=A_test)
metricframe_postprocess = MetricFrame(
    metrics=metrics_dict,
    y_true=Y_test,
    y_pred=Y_pred_postprocess,
    sensitive_features=A_test
)

In [None]:
pd.concat([metricframe_unmitigated.by_group,
           metricframe_postprocess.by_group],
           keys=['Unmitigated', 'ThresholdOptimizer'],
           axis=1)

In [None]:
pd.concat([metricframe_unmitigated.difference(),
           metricframe_postprocess.difference()],
          keys=['Unmitigated: difference', 'ThresholdOptimizer: difference'],
          axis=1).T

In [None]:
metricframe_postprocess.by_group.plot.bar(subplots=True, layout=[1,3], figsize=(12, 4), legend=False, rot=-45, position=1.5)


## Credit
This tutorial is based on the following notebook:
[_Fairness in AI systems: From social context to practice using Fairlearn_](https://colab.research.google.com/github/fairlearn/talks/blob/main/2021_scipy_tutorial/fairness-in-AI-systems-student.ipynb#scrollTo=Sch9KDWg7SL8)