# RAI Exercise 1: Algorithmic Fairness ⚖️

# Important information
This exercise is part of the RAI course (02517 - **Responsible AI: Algorithmic Fairness and Explainability**) at the Technical University of Denmark (DTU). You can find more details about the course [here](https://kurser.dtu.dk/course/02517). This specific version is for the Fall 2024 semester.

If you have any questions related to this notebook, feel free to reach out to Nina Weng at *ninwe@dtu.dk*.

**Credits**:  
We thank:
* The NIH dataset team for collecting the dataset ([link to the paper](https://openaccess.thecvf.com/content_cvpr_2017/papers/Wang_ChestX-ray8_Hospital-Scale_Chest_CVPR_2017_paper.pdf));
* The authors of [this paper](https://link.springer.com/chapter/10.1007/978-3-031-45249-9_14) for providing the splits;
* The [meme generator website imgflip](https://imgflip.com/) for all the excellent meme templates.


# PART 1: Fairness assessment and Bias mitigation using Fairlearn

The goal of this exercise is to learn how to use [Fairlearn](https://fairlearn.org/) to approach basic fairness assessments and apply post-processing bias mitigation methods. 

Fairlearn is an open-source Python package originally developed by Microsoft Research. Since 2021, it has been completely community-driven. For more information about Fairlearn, you can visit [this page](https://fairlearn.org/v0.10/about/index.html). 

Although Fairlearn is likely the most well-developed package targeting fairness issues, it has its limitations. The most notable one, worth mentioning at the very beginning of this exercise, is that Fairlearn is primarily designed for tabular data (see [this page](https://fairlearn.org/main/faq.html) under the question *Does Fairlearn work for image and text data?*). Therefore, when working with other types of data, such as image data, unexpected issues may arise. Fortunately, there are workarounds for most of these issues, which will be discussed later in this exercise.

While Fairlearn is a good resource and offers an easy approach for learning fairness concepts and handling lighter tasks, it may not be the best solution for researchers working extensively in this area. Keep that in mind :-)

## 🧠 Objective of this Exercise (PART 1)
By the end of this exercise, you should be able to:

* Assess fairness using Fairlearn with provided predictions/probabilities and target labels. This includes calculating metrics, generating ROC curves, and interpreting their meaning.
* Apply post-processing bias mitigation techniques using Fairlearn, and clearly understand and explain the outcomes.

![](./support4notebook/getstarted.jpg)

## 1. Dataset: Chest X-ray and lung/heart related disease


In this exercise, we will use a chest X-ray dataset and a basic deep learning model as the setup. It requires the following:

* Download the dataset/metadata/pretrained ResNet model. Note that in this exercise, we only use part of the data, and details are listed below. The full dataset can be found [here](https://nihcc.app.box.com/v/ChestXray-NIHCC). (For students in the class, you can find a download link on DTU Learn. For those not in the class, you can find the pre-processing scripts in [this repository](https://github.com/nina-weng/detecting_causes_of_gender_bias_chest_xrays).)
* After downloading the materials, put the `NIH_train_val_test_rs0_f50.csv` under `./datafiles/`; and `nih_pneumothorax.pth` under `./pretrained_model/`.
* Prepare your virtual environment: 
`conda env create -f env.yml`


### Chest xray samples
We use the [NIH chest X-ray dataset](https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community) in this exercise. Let’s take a closer look at the dataset.<br>

This dataset contains 108,948 images from 32,717 patients, each labeled with one of 14 types of lung or heart-related diseases/symptoms. For detailed information on each disease, you can find explanations [here](https://nihcc.app.box.com/v/ChestXray-NIHCC/file/220660789610).

For simplicity, we will use only one sample per patient and preprocess the images to a size of 224x224. Both the dataset and the metadata (in CSV format) are available. The dataset split is also specified in the metadata under the column 'split'.

Note: The split was designed for a different task, which required a larger test set than usual. As a result (as you’ll notice), the test set is relatively large (around 8k for training, 2k for validation, and 8k for testing). If you find that the mitigation or prediction process takes too long, feel free to downsample the test set. Just ensure you validate that the proportions of samples across different sensitive groups and disease labels remain roughly consistent with the original test set.  
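
If you do downsample, one way to keep the proportions consistent is to sample separately within each (sex, disease) stratum. The sketch below illustrates the idea in plain Python on made-up records; `stratified_downsample` and `proportions` are hypothetical helpers and not part of the course code, and the toy counts are not taken from the real test set.

```python
import random
from collections import Counter, defaultdict


def stratified_downsample(records, key, fraction, seed=0):
    """Keep roughly `fraction` of the records within every stratum given by `key`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    kept = []
    for members in strata.values():
        n_keep = max(1, round(len(members) * fraction))
        kept.extend(rng.sample(members, n_keep))
    return kept


def proportions(records, key):
    """Share of each stratum, used to compare the subset with the full set."""
    counts = Counter(key(rec) for rec in records)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


# Hypothetical toy records: (sex label, disease label)
toy_records = [(0, 0)] * 400 + [(0, 1)] * 100 + [(1, 0)] * 420 + [(1, 1)] * 80

subset = stratified_downsample(toy_records, key=lambda r: r, fraction=0.25)
print(len(subset))
print(proportions(toy_records, key=lambda r: r))
print(proportions(subset, key=lambda r: r))
```

With the real metadata, the same check would compare the share of each sex and disease-label combination before and after downsampling.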

In [None]:
#TODO: change this to your data directory
datadir = "/Users/joyceabisaleh/Desktop/Responsible AI/detecting_causes_of_gender_bias_chest_xrays-master"

In [None]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import pandas as pd
import seaborn as sns
sns.set_theme(style="whitegrid")
import torch
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, selection_rate,false_positive_rate,true_positive_rate,false_negative_rate,true_negative_rate

import sys
sys.path.append(os.path.abspath("."))

from train.model import ResNet
from train.prediction import validate
from analysis.plot import plot_roc_simple
from analysis.tools import from_loader_to_tensor

Let us take a look at some samples, selected with a filter query. 

In [None]:
dataset_pth = datadir + '/NIH_part/'
metadata_csv ='./datafiles/NIH_train_val_test_rs0_f50.csv'
metadata = pd.read_csv(metadata_csv)

# display(metadata.head(5))

# randomly choose some samples from the NIH chest X-ray dataset
def show_random_images(datadir,metadata,seed=None,filter_str=None,num_sample=5):
    fig = plt.figure(figsize=(num_sample*3, 3),dpi=200)
    if filter_str:
        metadata = metadata.query(filter_str)
        # display(metadata.head(5))
    # a fixed seed makes the draw reproducible; seed=None draws fresh samples each time
    random_sample = metadata.sample(n=num_sample, random_state=seed)
    
    for i in range(len(random_sample)):
        row = random_sample.iloc[i]
        disease = row['Pneumothorax']
        sex = row['Patient Gender']

        img = mpimg.imread(datadir + row['Image Index'])
        ax = fig.add_subplot(1, len(random_sample), i + 1)
        # add diagnosis as subtitle
        ax.set_title(f'{disease=},{sex=}')
        ax.imshow(img, cmap='gray')
        ax.axis('off')
    plt.suptitle(f'{filter_str}')
    plt.show()

show_random_images(dataset_pth,metadata=metadata,seed=42,filter_str='Pneumothorax==1 and `Patient Gender`=="F"',num_sample=5)
show_random_images(dataset_pth,metadata=metadata,seed=42,filter_str='Pneumothorax==1 and `Patient Gender`=="M"',num_sample=5)

### Basic statistics

We can also take a look at the distribution of some sensitive attributes:

In [None]:
def plot_distribution_by_value(metadata, column_name):
    if isinstance(column_name, str):
        nan_count = metadata[column_name].isna().sum()
        nan_series = pd.Series([nan_count], index=['NaN'])
        counts_ = metadata[column_name].value_counts().sort_index()
        counts_with_nan = pd.concat([counts_, nan_series])

        counts_with_nan.plot(kind='bar',title='Distribution of {}'.format(column_name))
        plt.ylabel('count')
    elif isinstance(column_name, list):
        fig, axes = plt.subplots( 1, len(column_name), figsize=( len(column_name)*4,3),dpi=200)
        for i,col in enumerate(column_name):
            nan_count = metadata[col].isna().sum()
            nan_series = pd.Series([nan_count], index=['NaN'])
            counts_ = metadata[col].value_counts().sort_index()
            counts_with_nan = pd.concat([counts_, nan_series])

            counts_with_nan.plot(kind='bar',title='Distribution of {}'.format(col),ax=axes[i])
            axes[i].set_ylabel('count')
    plt.tight_layout()
    plt.show()

metadata['age_range'] = pd.cut(metadata['Patient Age'], bins=[0,10,20,30,40,50,60,70,80,90,100], right=False)
plot_distribution_by_value(metadata, ['age_range','Patient Gender'])

You may have noticed that, for this specific split, we maintain an equal number of male and female samples. This balance is consistent across all three splits: training, validation, and testing.

### 📃 Further Reading:
If you're interested, here are some studies that explore potential biases and confounders in chest xray datasets:
* [Lauren Oakden-Rayner: Exploring the ChestXray14 dataset: problems](https://laurenoakdenrayner.com/2017/12/18/the-chestxray14-dataset-problems/)
* [Amelia Jiménez-Sánchez et al.: Detecting Shortcuts in Medical Images -- A Case Study in Chest X-rays](https://arxiv.org/abs/2211.04279)
* [Judy Wawira Gichoya et al.: AI recognition of patient race in medical imaging: a modelling study](https://www.thelancet.com/journals/landig/article/PIIS2589-7500(22)00063-2/fulltext)

## 2. Fairness assessment
**Recap of Key Concepts**:

In class, we have learned:
* The three key criteria for fairness assessment. What are they?
* Evaluation metrics corresponding to each criterion.
* ROC curves.

For this exercise, we’ve provided a pre-trained ResNet classifier for Part 1, where the disease label is `Pneumothorax` and the sensitive attribute is `sex` (in metadata, you can get the binarized sex label from column `sex label`, where 0 represents female and 1 represents male). However, feel free to train your own model if you'd like.



### Load pretrained model

In [None]:
# load the pretrained model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(' device:', device)

ds_name = 'NIH'

# load the model
lr=1e-6
pretrained = True
model_scale = '18'
num_epochs =20
img_size = (1, 224, 224)

#def load(f, map_location='cpu', pickle_module=pickle, **pickle_load_args):
classifier = ResNet(num_classes=1, lr=lr, pretrained=pretrained, model_scale=model_scale, in_channel=img_size[0])
classifier.load_state_dict(torch.load('./pretrained_model/nih_pneumothorax.pth', map_location=torch.device('cpu')))
classifier.to(device)

classifier.eval()


### Load test and validation data

In [None]:
save_model_at = './pretrained_models/'

img_size = (1,224,224)
batch_size = 16

csv_pth = './datafiles/NIH_train_val_test_rs0_f50.csv' if ds_name == 'NIH' else None

disease_label = 'Pneumothorax' 
sensitive_label = 'sex'
augmentation = False

from train.train_chestxray import create_datasets

train_dataset, val_dataset, test_dataset = create_datasets(dataset_pth, 
                                                               ds_name,
                                                               csv_pth, 
                                                               image_size=img_size, 
                                                               device=device,
                                                               disease_label = disease_label,
                                                               sensitive_label = sensitive_label,
                                                               augmentation=augmentation)
# train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True) # we dont need it here
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [None]:
print(test_dataset)

### Predict the results for test set

In [None]:
test_lab, test_pred, test_prob, test_a= validate(classifier, test_loader, device=device)

### Assess fairness using Fairlearn
#### A simple example first

Fairlearn provides the `fairlearn.metrics.MetricFrame` class to help quantify how a metric differs across groups. 

Given: 
<pre>
y_true = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
sf_data = ['b', 'b', 'a', 'b', 'b', 'c', 'c', 'c', 'a',
           'a', 'c', 'a', 'b', 'c', 'c', 'b', 'c', 'c']
</pre>
           


![](./support4notebook/exercise_time.gif)

Try: 
* measure recall, selection rate and false positive rate for *each group*;
* plot the above results; ([Hint](https://fairlearn.org/main/user_guide/assessment/plotting.html))
* measure the difference in equalized odds between the groups;


Hint: The documentation page of [MetricFrame](https://fairlearn.org/main/api_reference/generated/fairlearn.metrics.MetricFrame.html#fairlearn.metrics.MetricFrame)
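
As a reference point for checking your Fairlearn output, the per-group quantities can also be computed by hand. The sketch below uses only the standard library on the toy data given above. `group_rates` is an illustrative helper, not a Fairlearn function, and the last step assumes that the equalized odds difference is taken as the larger of the between-group TPR gap and FPR gap, which is our understanding of how Fairlearn defines it.

```python
y_true = [0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
sf_data = ['b', 'b', 'a', 'b', 'b', 'c', 'c', 'c', 'a',
           'a', 'c', 'a', 'b', 'c', 'c', 'b', 'c', 'c']


def group_rates(y_true, y_pred, groups):
    """Recall, selection rate and FPR per group (assumes both classes occur in each group)."""
    out = {}
    for g in sorted(set(groups)):
        pairs = [(t, p) for t, p, s in zip(y_true, y_pred, groups) if s == g]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        out[g] = {
            "recall": tp / (tp + fn),
            "selection_rate": (tp + fp) / len(pairs),
            "fpr": fp / (fp + tn),
        }
    return out


rates = group_rates(y_true, y_pred, sf_data)
for g, r in rates.items():
    print(g, {name: round(value, 3) for name, value in r.items()})

# Gap between the best and worst group for TPR (recall) and FPR
tpr_gap = max(r["recall"] for r in rates.values()) - min(r["recall"] for r in rates.values())
fpr_gap = max(r["fpr"] for r in rates.values()) - min(r["fpr"] for r in rates.values())
eo_difference = max(tpr_gap, fpr_gap)
print("equalized odds difference:", round(eo_difference, 3))
```

If your `MetricFrame.by_group` table disagrees with these numbers, the usual suspects are swapped `y_true`/`y_pred` arguments or a metric function with a different signature.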

In [None]:
import numpy as np
from sklearn.metrics import recall_score, confusion_matrix

# Create a function to calculate recall, selection rate, and FPR by group
def calculate_group_metrics(true_labels, predictions, group_labels):
    # Convert lists to NumPy arrays for easier indexing
    true_labels = np.array(true_labels)
    predictions = np.array(predictions)
    group_labels = np.array(group_labels)
    
    group_metrics = {}
    groups = np.unique(group_labels)  # Get unique group labels (e.g., male, female)

    for group in groups:
        # Get the indices of the current group
        group_idx = np.where(group_labels == group)
        
        # Extract true labels and predictions for this group
        true_group = true_labels[group_idx]
        pred_group = predictions[group_idx]

        # Recall (True Positive Rate)
        recall = recall_score(true_group, pred_group)
        
        # Selection Rate
        selection_rate = np.mean(pred_group)

        # False Positive Rate (FPR)
        tn, fp, fn, tp = confusion_matrix(true_group, pred_group).ravel()
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0

        # Store the metrics for the group
        group_metrics[group] = {
            "Recall": recall,
            "Selection Rate": selection_rate,
            "False Positive Rate": fpr
        }
    
    return group_metrics

# Example usage
#test_lab, test_pred, test_prob, test_a = validate(classifier, test_loader, device=device)

# Calculate metrics by group (test_a is the sensitive attribute like gender)
group_metrics = calculate_group_metrics(test_lab, test_pred, test_a)

# Print metrics for each group
for group, metrics in group_metrics.items():
    print(f"Group: {group}")
    print(f"  Recall: {metrics['Recall']:.2f}")
    print(f"  Selection Rate: {metrics['Selection Rate']:.2f}")
    print(f"  False Positive Rate: {metrics['False Positive Rate']:.2f}")
    print()


### Now measure the fairness metrics for our data

![](./support4notebook/exercise.jpg)

In [None]:
# Define a custom function to calculate False Positive Rate (FPR)
def false_positive_rate(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    return fpr

# Define a custom function to calculate Selection Rate
def selection_rate(y_true, y_pred):
    return np.mean(y_pred)

# Define a custom function to calculate Recall
def recall_metric(y_true, y_pred):
    return recall_score(y_true, y_pred)
    

mf = MetricFrame(metrics={
                    'accuracy': accuracy_score,
                    'Recall': recall_metric,
                    'Selection Rate': selection_rate,
                    'False Positive Rate': false_positive_rate
                },
                 y_true=test_lab,
                 y_pred=test_pred,
                 sensitive_features=test_a)

print("Test set fairness metrics (before mitigation):")
print(mf.by_group)

### Draw the ROC curve

Here we provide the function `plot_roc_simple` to draw the ROC curve for each group.

In [None]:
plot_roc_simple(test_lab, test_prob, test_a, test_pred,
        sensitive_attribute_name = 'sex',
        )

### 💥 Exercise and discussion:
* What do you see from the metrics and the ROC curve?

* Performance Differences:

There is a noticeable difference in the ROC curves between the two groups (0.0 for female, 1.0 for male). The AUC for females (0.74) is lower than the AUC for males (0.80), suggesting that the model is better at distinguishing positive from negative cases for males than for females.

* Fairness Concerns:

The difference in AUC indicates an imbalance in model performance, so there is potential unfairness towards the female group. The true positive rate (TPR) for males appears to be consistently higher across most false positive rates, which means that, at a matched false positive rate, the model is more likely to correctly identify positives for males than for females.

* Accuracy:

The model has an accuracy of 0.769 for the female group and a higher accuracy of 0.849 for the male group. This means that the model is generally more accurate in predicting outcomes for males.

* Recall:

The recall for the female group is 0.619, so the model correctly identifies about 61.9% of all actual positives in the female group.
The recall for the male group is 0.556, meaning that only 55.6% of actual positives are correctly identified for males.

This suggests a weaker sensitivity towards positives in the male group despite the higher observed accuracy.

* Selection Rate:

The selection rate is 0.244 for the female group and lower, at 0.156, for the male group. This indicates that the model is less likely to predict positive outcomes for individuals in the male group.

* False Positive Rate (FPR):

The FPR for the female group is 0.222. It is lower for the male group at 0.138. This means the model is less likely to incorrectly label negative cases as positive in the male group, aligning with the higher accuracy and AUC noted.

The differences observed in the ROC curves, together with those in recall, selection rate, and FPR, raise questions about the fairness of the model. 

While the model performs better overall for the male group in terms of accuracy and discriminatory power (AUC), the lower recall and selection rate for the male group indicate that the model applies a higher bar for classifying positives in this group and might classify only the most obvious cases as positive. Adjusting the decision threshold might help balance the performance across the two groups, improving fairness.
This would be done by increasing recall, that is, classifying more true positives correctly. It is important to make sure this does not increase the false positive rate too much. 


* Try to measure and describe the fairness with respect to the three criteria we learned in class

In [None]:
# Fairness analysis with respect to the three criteria
def fairness_analysis(metrics_frame):
    print("\nFairness Analysis:")
    
    # Demographic Parity: Check if selection rates are similar across groups
    selection_rates = metrics_frame.by_group['Selection Rate']
    print(f"Demographic Parity - Selection Rates by Group:\n{selection_rates}")
    
    # Equalized Odds: Check if recall (TPR) and FPR are similar across groups
    recalls = metrics_frame.by_group['Recall']
    fprs = metrics_frame.by_group['False Positive Rate']
    print(f"\nEqualized Odds - Recall and False Positive Rate by Group:\nRecall:\n{recalls}\nFalse Positive Rate:\n{fprs}")
    
    # Equal Opportunity: Check if recall (TPR) is similar across groups
    print(f"\nEqual Opportunity - Recall by Group:\n{recalls}")

fairness_analysis(mf)



The three criteria considered here are Demographic Parity, Equalized Odds, and Equal Opportunity.

* Demographic Parity is not satisfied, because the selection rate differs between the two groups. In practice, this might mean that females are receiving more positive diagnoses than males, which could lead to unfair treatment depending on the context.

* Equalized Odds is satisfied if both the recall (true positive rate) and the false positive rate (FPR) are similar across groups. Here, recall is higher for females than for males, meaning that the model is better at identifying true positives for females. FPR is also higher for females, indicating that the model is more likely to produce false positives for females than for males. The differences in both recall and FPR mean that the model fails to satisfy Equalized Odds, as its behavior varies between the two groups. This could mean that the model is less reliable for males, possibly missing more true positive cases while making fewer incorrect positive predictions.

* Equal Opportunity requires that the recall be similar across groups. The recall for females is higher than for males, indicating that the model is better at correctly identifying positive cases for females. This suggests that the model provides unequal opportunities for males, as they are less likely to receive a true positive prediction compared to females.
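
The size of each violation can be made explicit with a little arithmetic. The sketch below plugs in the per-group values quoted in the discussion above (selection rate, recall, FPR) and takes the gap between the two groups for each criterion. The `gap` helper and the choice to summarize Equalized Odds by the larger of the recall and FPR gaps are illustrative assumptions.

```python
# Per-group values quoted in the discussion above (sex label 0 = female, 1 = male)
reported = {
    "female": {"selection_rate": 0.244, "recall": 0.619, "fpr": 0.222},
    "male": {"selection_rate": 0.156, "recall": 0.556, "fpr": 0.138},
}


def gap(metric):
    """Absolute difference between the best and worst group for one metric."""
    values = [group[metric] for group in reported.values()]
    return max(values) - min(values)


demographic_parity_gap = gap("selection_rate")
equal_opportunity_gap = gap("recall")
equalized_odds_gap = max(gap("recall"), gap("fpr"))

print(f"Demographic parity gap: {demographic_parity_gap:.3f}")
print(f"Equal opportunity gap:  {equal_opportunity_gap:.3f}")
print(f"Equalized odds gap:     {equalized_odds_gap:.3f}")
```

A gap of zero would mean the criterion holds exactly; how large a gap is acceptable depends on the application.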

## 3. Bias mitigation using Fairlearn

### Recall from the class
* what kinds of mitigation methods have we learned?

In this exercise, we will use one of the post-processing bias mitigation methods that Fairlearn provides, **Threshold Optimization**, to implement the bias mitigation steps. 

### The theory of Threshold Optimization


The idea could be simply visualized as below (figure from the [original paper](https://arxiv.org/pdf/1610.02413)):  
![](./support4notebook/threshold_op.png)

Here, blue and green represent the two sensitive groups; any point in the overlapping region satisfies the requirement of equalized odds:

$$
\gamma_0(\hat{Y}) = \gamma_1(\hat{Y}),
$$

where 

$$
\gamma_a (\hat{Y}) = \left(Pr(\hat{Y} = 1 | A = a, Y = 0), Pr(\hat{Y} = 1 | A = a, Y=1)\right).
$$

The goal of the threshold optimizer is to find the point in the overlapping region that optimizes the objective function, such as balanced accuracy.

To achieve this, **randomization** is introduced. The idea is straightforward: any point under the ROC curve can be obtained as a weighted combination of two points on the ROC curve (each of which is reached by simple thresholding); in other words, a new decision threshold $T_a$ can be a randomized mixture of two decision thresholds $\underline{t}_a$ and $\overline{t}_a$.

(See the figure below, which is from [this paper](https://arxiv.org/abs/2202.08536)).

![Randomization Figure](./support4notebook/randomization.png)
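
To make the randomization concrete, here is a small sketch with made-up scores for a single group. The thresholds `t_low` and `t_high` and the mixing probability `p` are hypothetical. The code checks that the expected (FPR, TPR) of the randomized rule equals the weighted average of the two deterministic operating points.

```python
def roc_point(scores, labels, threshold):
    """(FPR, TPR) of the deterministic rule 'predict 1 if score > threshold'."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    return fp / neg, tp / pos


def expected_roc_point(scores, labels, t_low, t_high, p):
    """Expected (FPR, TPR) when each sample uses t_low with probability p, else t_high."""
    pos = sum(labels)
    neg = len(labels) - pos
    exp_tp = 0.0
    exp_fp = 0.0
    for s, y in zip(scores, labels):
        prob_positive = p * (s > t_low) + (1 - p) * (s > t_high)
        if y == 1:
            exp_tp += prob_positive
        else:
            exp_fp += prob_positive
    return exp_fp / neg, exp_tp / pos


# Made-up scores and labels for one sensitive group
scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
t_low, t_high, p = 0.3, 0.7, 0.25

low = roc_point(scores, labels, t_low)
high = roc_point(scores, labels, t_high)
mixed = expected_roc_point(scores, labels, t_low, t_high, p)
interpolated = tuple(p * a + (1 - p) * b for a, b in zip(low, high))

print("low threshold: ", low)
print("high threshold:", high)
print("randomized:    ", mixed)
print("interpolated:  ", interpolated)
```

Varying `p` between 0 and 1 moves the randomized point along the line segment between the two operating points, which is how points below the ROC curve become reachable.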


📃 Further Reading:
* [Fairlearn *ThresholdOptimizer* page](https://fairlearn.org/v0.5.0/api_reference/fairlearn.postprocessing.html).
* The original paper (See section 3): [Equality of opportunity in supervised learning](https://arxiv.org/pdf/1610.02413).




### 🪄 Trick: A fake classifier class

Fairlearn has some limitations when implementing the `ThresholdOptimizer` method. To work around these issues, a fake classifier is provided to bypass minor problems. If you use the provided classifier, this fake class should work just as well.

However, if you're curious about what went wrong or want to use your own classifier, please read below:

* **Problem 1: `estimator` in `ThresholdOptimizer` only accepts 2D input (tabular data).** This doesn’t make sense for post-processing mitigation methods, as the only relevant aspect here is the prediction scores from the test set. The classifier itself and the input data are irrelevant when optimizing the threshold.
  
* **Problem 2: The `prefit` parameter checks whether the model has been fitted in a simplistic way, leading to errors.** You can read more about this fit check function [here](https://scikit-learn.org/stable/modules/generated/sklearn.utils.validation.check_is_fitted.html).
  
* **Problem 3: It requires the prediction function to return scores for both classes in binary classification.** This might be an issue if your classifier only provides the probability for class 1.

**How we solve this**: We create a fake classifier that accepts 2D input and reshapes it back to the original image size before feeding it into the prediction function. We trick the fit check by defining a fake variable and manually modify the output of the prediction function to include both classes if it only returns the probability for class 1.

Note: If you trained your own classifier, you will need to implement a custom fake classifier yourself.
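
For Problem 3, the fix amounts to placing the class-0 probability next to the class-1 probability. Below is a minimal sketch in plain Python, assuming the model returns one class-1 probability per sample. `to_two_class_proba` is a hypothetical helper; in a custom fake classifier the same idea would be applied to the tensor or array returned inside `predict_proba`.

```python
def to_two_class_proba(p_class1):
    """Turn [p1, p2, ...] into [[1 - p1, p1], [1 - p2, p2], ...]."""
    return [[1.0 - p, p] for p in p_class1]


probs = to_two_class_proba([0.9, 0.25, 0.5])
for row in probs:
    print(row)
```

Column 1 then holds the score for the positive class, and each row sums to one up to floating-point rounding.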

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin

class FakeClassifierInput2D(BaseEstimator,ClassifierMixin):
    '''
    Fake classifier that takes 2D input, with pre-trained model that does not take 2D input data
    '''
    def __init__(self,model, img_size):
        self.model = model
        self.img_size = img_size
        self.input_from_2D_func = lambda x: torch.reshape(x,(-1,)+self.img_size)
        # self.input_from_2D_func = input_from_2D_func
        self.fit_ = True # fake the check_fit function inside the fairlearn library

    def fit(self, X, y):
        # No fitting needed: the wrapped model is already trained
        return self

    def predict(self, X_2D):
        # reshape the flattened input back to image shape before predicting
        X = self.input_from_2D_func(X_2D)
        return self.model.predict(X)
    
    def predict_proba(self, X_2D):
        X = self.input_from_2D_func(X_2D)
        return self.model.predict_proba(X)
    


### Optimize on validation set

In [None]:
X_test, y_test, a_test = from_loader_to_tensor(test_loader,device)
X_val, y_val, a_val = from_loader_to_tensor(val_loader,device)

X_test_2D = torch.reshape(X_test,(X_test.shape[0],-1))
X_val_2D = torch.reshape(X_val,(X_val.shape[0],-1))

In [None]:
classifier_fake = FakeClassifierInput2D(model=classifier.to('cpu'),
                                       img_size = img_size,)


Your code here:

In [None]:
from fairlearn.postprocessing import ThresholdOptimizer
#fake classifier already fitted



postprocessor = ThresholdOptimizer(
    estimator=classifier_fake,
    constraints="equalized_odds",
    objective = "balanced_accuracy_score",
    #constraints="false_negative_rate_parity",
    prefit=True,
    predict_method="predict_proba"
)

postprocessor.fit(X_val_2D, y_val, sensitive_features=a_val)

In [None]:
import matplotlib.pyplot as plt
from fairlearn.postprocessing import plot_threshold_optimizer

fig, ax = plt.subplots(figsize=(10, 6))
plot_threshold_optimizer(postprocessor, ax=ax)
plt.show()

### Use the new threshold for the test set

Your code here:

In [None]:
# the fake classifier expects the flattened (2D) input, as used during fit
y_pred_fair_test = postprocessor.predict(X_test_2D, sensitive_features=a_test)

In [None]:
# predictions after mitigation, using the flattened (2D) input as during fit
y_pred_fair_test = postprocessor.predict(X_test_2D, sensitive_features=a_test)

mf = MetricFrame(metrics={
                    'accuracy': accuracy_score,
                    'Recall': recall_metric,
                    'Selection Rate': selection_rate,
                    'False Positive Rate': false_positive_rate
                },
                 y_true=test_lab,  # ground-truth labels, not the input images
                 y_pred=y_pred_fair_test,
                 sensitive_features=test_a)

print("Test set fairness metrics (after mitigation):")
print(mf.by_group)

In [None]:
plot_roc_simple(test_lab, test_prob, test_a, y_pred_fair_test,
        sensitive_attribute_name = 'sex',
        )

### Find out where the predictions come from (the new threshold $T_a$)

In [None]:
import json
# `postprocessor` is the fitted ThresholdOptimizer from above
threshold_rules_by_group = postprocessor.interpolated_thresholder_.interpolation_dict
print(json.dumps(threshold_rules_by_group, default=str, indent=4))
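
To read the printed rules, it helps to see how such a rule turns a score into a randomized prediction. The sketch below is based on our reading of the Fairlearn user guide: each group mixes two thresholds with weights `p0` and `p1`, and under equalized odds there may be an extra `p_ignore` share that ignores the score and predicts positive with probability `prediction_constant`. The rule values are made up, the keys `threshold0` and `threshold1` are hypothetical stand-ins for Fairlearn's threshold operation objects, and only `>` comparisons are modelled.

```python
import math
import random


def positive_probability(score, rule):
    """Probability of predicting 1 for one score under a randomized threshold rule."""
    mix = (rule["p0"] * (score > rule["threshold0"])
           + rule["p1"] * (score > rule["threshold1"]))
    p_ignore = rule.get("p_ignore", 0.0)
    constant = rule.get("prediction_constant", 0.0)
    return p_ignore * constant + (1 - p_ignore) * mix


# Made-up rules for illustration only; these are not the values fitted in this notebook
rules = {
    0: {"p0": 0.6, "threshold0": 0.5, "p1": 0.4, "threshold1": 0.3},
    1: {"p0": 0.7, "threshold0": 0.4, "p1": 0.3, "threshold1": -math.inf,
        "p_ignore": 0.1, "prediction_constant": 0.2},
}

rng = random.Random(0)  # fixed seed so the sampled predictions are reproducible
for group, score in [(0, 0.45), (0, 0.55), (1, 0.35), (1, 0.45)]:
    prob = positive_probability(score, rules[group])
    prediction = int(rng.random() < prob)
    print(group, score, round(prob, 3), prediction)
```

A threshold of `-inf` means "always predict positive" for that share of the mixture, which is one way a rule can raise a group's selection rate.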

### 💥 Exercise and Discussion:
* Can you write down the new threshold function? ([Hint](https://fairlearn.org/v0.10/user_guide/mitigation/postprocessing.html#postprocessing))
* Compare the results. What do you observe, and does this model seem fair to you?
* Hint: After optimization, you may notice that accuracy (or other metrics) is more balanced between groups. However, the overall accuracy (or other metrics of interest) may decrease for both groups. Do you think this is still a good or acceptable solution?


![](./support4notebook/dilemma.jpg)

# PART 2: Potential pitfall: Algorithmic fairness in the presence of label noise
As mentioned in class, it is easy to diagnose algorithmic bias. This is, however, only true if we have access to correct target labels for the test set. In this part of the project, we will simulate a situation where our test set ground truth target labels are incorrect in a biased way: You will simulate overdiagnosis among male individuals, by manually distorting some of their labels. Next, you will analyze how this affects the diagnosis and mitigation of algorithmic bias.

## Write a script to distort the labels for the male individuals according to the following recipe:
* Please create a new set of distorted target labels
* Initialize these as identical to the supplied target labels
* Manually distort them by flipping 30% of the healthy labels for male individuals to diseased. These should be selected at random.

## Now repeat your analysis from Part 1 for your classifier from Part 1 using the distorted labels. 
You don’t need to retrain the classifier – you will only repeat the diagnosis and mitigation parts.

* Diagnose algorithmic bias with respect to your distorted labels. Do your conclusions change?
* Mitigate algorithmic bias with respect to your distorted labels. Following this, repeat your diagnostic pipeline both with respect to your distorted and original labels. What do you see? Did mitigation ensure improved fairness with respect to the distorted labels? What happened with respect to the actual (original) labels? Is the mitigated algorithm actually fair?
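
Before running the experiment, it can help to reason about what random flipping does to the measured rates. The sketch below works in expectation and assumes the flipped samples are chosen independently of the model's predictions. The confusion counts are made up, and `rates_after_flipping` is an illustrative helper.

```python
def rates_after_flipping(tp, fn, fp, tn, flip_fraction):
    """Expected measured recall and FPR after relabelling a random fraction of healthy cases as diseased."""
    moved_fp = flip_fraction * fp  # healthy, predicted positive: now counted as TP
    moved_tn = flip_fraction * tn  # healthy, predicted negative: now counted as FN
    new_tp, new_fn = tp + moved_fp, fn + moved_tn
    new_fp, new_tn = fp - moved_fp, tn - moved_tn
    recall = new_tp / (new_tp + new_fn)
    fpr = new_fp / (new_fp + new_tn)
    return recall, fpr


# Hypothetical confusion counts for the male group under the original labels
tp, fn, fp, tn = 60, 40, 100, 800

clean = rates_after_flipping(tp, fn, fp, tn, flip_fraction=0.0)
noisy = rates_after_flipping(tp, fn, fp, tn, flip_fraction=0.3)
print("original labels  (recall, FPR):", clean)
print("distorted labels (recall, FPR):", noisy)
```

Under these assumptions the measured FPR is unchanged in expectation, while the measured recall drops because most of the flipped cases are ones the model predicts as negative. The model itself has not changed; only the yardstick has.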





In [None]:
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the distortion is reproducible

# Work on numpy arrays for easier indexing
test_lab = np.asarray(test_lab)
test_a = np.asarray(test_a)

# Start from a copy so the original labels stay untouched
distorted_test_lab = test_lab.copy()

# Healthy (label 0) male samples; in this dataset the sex label 1 represents male
healthy_male_idx = np.where((test_a == 1) & (test_lab == 0))[0]

# Flip 30% of the healthy male labels to diseased, selected at random
n_flip = int(round(0.3 * len(healthy_male_idx)))
flip_idx = rng.choice(healthy_male_idx, size=n_flip, replace=False)
distorted_test_lab[flip_idx] = 1

print(f"Flipped {n_flip} of {len(healthy_male_idx)} healthy male labels to diseased")
print("Modified test_lab with changes to male group:")
print(distorted_test_lab)


In [None]:

# Calculate metrics by group (test_a is the sensitive attribute like gender)
group_metrics = calculate_group_metrics(distorted_test_lab, test_pred, test_a)

# Print metrics for each group
for group, metrics in group_metrics.items():
    print(f"Group: {group}")
    print(f"  Recall: {metrics['Recall']:.2f}")
    print(f"  Selection Rate: {metrics['Selection Rate']:.2f}")
    print(f"  False Positive Rate: {metrics['False Positive Rate']:.2f}")
    print()


In [None]:
# Define a custom function to calculate False Positive Rate (FPR)
def false_positive_rate(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
    return fpr

# Define a custom function to calculate Selection Rate
def selection_rate(y_true, y_pred):
    return np.mean(y_pred)

# Define a custom function to calculate Recall
def recall_metric(y_true, y_pred):
    return recall_score(y_true, y_pred)
    

mf = MetricFrame(metrics={
                    'accuracy': accuracy_score,
                    'Recall': recall_metric,
                    'Selection Rate': selection_rate,
                    'False Positive Rate': false_positive_rate
                },
                 y_true=distorted_test_lab,
                 y_pred=test_pred,
                 sensitive_features=test_a)

print("Test set fairness metrics (before mitigation):")
print(mf.by_group)

In [None]:
plot_roc_simple(distorted_test_lab, test_prob, test_a, test_pred,
        sensitive_attribute_name = 'sex',
        )

In [None]:
# Fairness analysis with respect to the three criteria
def fairness_analysis(metrics_frame):
    print("\nFairness Analysis:")
    
    # Demographic Parity: Check if selection rates are similar across groups
    selection_rates = metrics_frame.by_group['Selection Rate']
    print(f"Demographic Parity - Selection Rates by Group:\n{selection_rates}")
    
    # Equalized Odds: Check if recall (TPR) and FPR are similar across groups
    recalls = metrics_frame.by_group['Recall']
    fprs = metrics_frame.by_group['False Positive Rate']
    print(f"\nEqualized Odds - Recall and False Positive Rate by Group:\nRecall:\n{recalls}\nFalse Positive Rate:\n{fprs}")
    
    # Equal Opportunity: Check if recall (TPR) is similar across groups
    print(f"\nEqual Opportunity - Recall by Group:\n{recalls}")

fairness_analysis(mf)

