# Tutorial: Measuring Group Fairness in Pre-Trial Risk Assessment
**Instructor:** Hilde Weerts

In this tutorial we will explore how we can measure notions of group fairness in Python using [Fairlearn](https://fairlearn.org). As a running example, we consider pre-trial risk assessment scores produced by the COMPAS recidivism risk assessment tool.

---
**Learning Objectives**. After completing this tutorial, you will be able to:
* apply group fairness metrics in Python;
* explain several trade-offs between different group fairness criteria;
* explain how threats to construct validity may impact downstream fairness-related harms.

---
### COMPAS: A Pre-Trial Risk Assessment Tool
COMPAS is a decision support tool used by courts in the United States to assess the likelihood that a defendant becomes a recidivist, i.e., relapses into criminal behavior. In particular, COMPAS risk scores are used in **pre-trial risk assessment**.

> #### What is pre-trial risk assessment in the US judicial system?
After somebody has been arrested, it will take some time before they go to trial. The primary goal of pre-trial risk assessment is to determine the likelihood that the defendant will re-appear in court at their trial. Based on the assessment, a judge decides whether a defendant will be detained or released while awaiting trial. In case of release, the judge also decides whether bail is set and for which amount. Bail usually takes the form of either a cash payment or a bond. If the defendant cannot afford to pay the bail amount in cash (which can be as high as \$50,000), they can contract a bondsman. For a fee, typically around 10\% of the bail, the bondsman will post the defendant's bail.

> If the defendant can afford neither bail nor a bail bond, they have to prepare for their trial while in jail, which is very difficult*. The time between getting arrested and a bail hearing can take days, weeks, months, or even years. In some cases, the decision is between pleading guilty and going home. Consequently, people who cannot afford bail are much more likely to plead guilty to a crime they did not commit. If the judge's decision is a **false positive**, this has a big impact on the defendant's prospects. At the other extreme, **false negatives** could mean that dangerous individuals are released into society.

> \**I highly recommend checking out this [tutorial](https://facctconference.org/2018/livestream_vh210.html) to get a better understanding of the implications of pre-trial risk assessment.*

Proponents of risk assessment tools argue that they can lead to more efficient, less biased, and more consistent decisions compared to human decision makers. However, concerns have been raised that the scores can replicate historical inequalities. Moreover, critics have argued that even if it is possible to produce a "fair" risk assessment, the mere existence of money bail may still disproportionately affect those who cannot afford to pay bail.


#### Propublica's Analysis of COMPAS
In May 2016, investigative journalists of Propublica released a critical analysis of COMPAS. **Propublica's assessment: COMPAS wrongly labeled black defendants as future criminals at almost twice the rate as white defendants**, while white defendants were mislabeled as low risk more often than black defendants ([Propublica, 2016](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing)). 

The analysis of COMPAS is likely one of the most well-known examples of algorithmic fairness assessments. Within the machine learning research community, the publication sparked a renewed interest in fairness of machine learning models. 

## Group Fairness
So how fair are COMPAS scores? There are several commonly used metrics that can be used to measure (un)fairness. In this tutorial, we will focus on a class of metrics that measure **group fairness**.

> **Group fairness** is the extent to which particular groups of individuals are at risk of fairness-related harms.

We usually refer to these groups as *sensitive groups*. Sensitive groups could be defined based on legally protected sensitive characteristics, such as race and gender, but may also be specific to the context of your machine learning model. For example, when we analyze the fairness of an automated essay grading tool, we may want to consider whether the model is fair for non-native speakers.

In the remainder of this tutorial, we will explain and apply three group fairness metrics: *demographic parity*, *equalized odds*, and *equal calibration*.

In [None]:
# data wrangling
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

# measuring fairness
from fairlearn.metrics import (
    count,
    demographic_parity_difference,
    equalized_odds_difference,
    false_positive_rate,
    false_negative_rate,
    MetricFrame,
    make_derived_metric,
    selection_rate,  
)
from sklearn.metrics import precision_score
from sklearn.calibration import calibration_curve

## Load Dataset
We will use the [data](https://github.com/propublica/compas-analysis/blob/master/compas-scores-two-years.csv) that was collected by ProPublica through public records requests in Broward County, Florida. We use pre-processing steps similar to the ones Propublica used in their analysis.

In [None]:
# load data
data = pd.read_csv("compas-scores-two-years.csv")
# filter similar to propublica
data = data[
    (data["days_b_screening_arrest"] <= 30)
    & (data["days_b_screening_arrest"] >= -30)
    & (data["is_recid"] != -1)
    & (data["c_charge_degree"] != "O")
    & (data["score_text"] != "N/A")
]
# select columns (copy, so that adding a column below does not modify a view of the original frame)
data = data[["sex", "age", "race", "priors_count", "decile_score", "two_year_recid"]].copy()
# cut-off score 5
data["decile_score_cutoff"] = data["decile_score"] >= 5
# inspect
data.head()

In [None]:
data.race.unique()

The data now contains the following features:

* *sex*. The defendant's sex, measured as US census sex categories (either *Male* or *Female*).
* *race*. The defendant's race, measured as an adapted version of US census race categories.
* *age*. The defendant's age on the COMPAS screening date. 
* *decile_score*. The COMPAS score expressed in deciles of the raw risk score. The deciles are obtained by ranking scale scores of a normative group and dividing these scores into ten equal-sized groups. Normative groups are gender-specific. For example, females are scored against a female normative group. According to [the COMPAS documentation](http://www.northpointeinc.com/files/technical_documents/FieldGuide2_081412.pdf), a decile score of 1-4 is low, 5-7 medium, and 8-10 high.
* *priors_count*. The number of prior charges up to but not including the current offense.
* *two_year_recid*. Recidivism, which is set to `1` if a *registered* charge occurred within two years of the COMPAS screening date and `0` otherwise.
* *decile_score_cutoff*. The binarized COMPAS score based on a cut-off score of 5.

--- 

Let's have a look at the distribution of the demographics.

In [None]:
display(data["sex"].value_counts())
display(data["race"].value_counts())

Clearly, *male* individuals are overrepresented in the dataset compared to *female* individuals, which is consistent with known differences in arrest rates. Additionally, the majority of individuals are categorized as *African-American* or *Caucasian*. Similar to Propublica, we will limit our analysis to these two largest groups.

In [None]:
# select two largest groups
data = data[(data["race"] == "African-American") | (data["race"] == "Caucasian")]

## Demographic Parity
In a classification scenario, the **selection rate** is the proportion of positive predictions. If selection rates differ across groups, there is a risk of **allocation harm**.

> **Allocation Harm**: the system disproportionately extends or withholds opportunities, resources or information to some groups.

For example, in a hiring scenario, the selection rate of applicants who identify as men may be higher compared to other applicants. The risk of allocation harm is particularly prevalent in cases where historical discrimination has resulted in disparities in the observed data, which are subsequently replicated by the machine learning model.

> **Demographic Parity** holds if, for all values of $y$ and all groups $a$ and $a'$, $$P(\hat{Y} = y | A = a) = P(\hat{Y} = y | A = a')$$ where $\hat{Y}$ is the output of our model and $A$ the set of sensitive characteristics.

In other words, the output of the model should be **independent** of sensitive group membership. We can quantify the extent to which demographic parity is violated through a fairness metric.
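
To make the definition concrete, the following minimal sketch computes group-wise selection rates and their largest gap by hand with pandas. The data frame `toy` and its column names are hypothetical and only serve as an illustration; they are not part of the COMPAS data.

```python
import pandas as pd

# Hypothetical toy data: binary predictions and a sensitive feature
toy = pd.DataFrame(
    {
        "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
        "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
    }
)

# selection rate = proportion of positive predictions within each group
selection_rates = toy.groupby("group")["y_pred"].mean()

# demographic parity is violated to the extent that selection rates differ;
# here we summarize the violation as the largest gap between any two groups
dp_difference = selection_rates.max() - selection_rates.min()

print(selection_rates)
print("demographic parity difference: %.2f" % dp_difference)
```

A difference of 0 would mean that demographic parity holds exactly on this sample; the larger the gap, the stronger the violation.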

#### When should we use demographic parity as a fairness metric? 
The [underlying assumption about fairness of demographic parity](https://arxiv.org/abs/1609.07236) is  that, **regardless of what the measured target variable says**, either:
1. *Everybody **is** equal*. For example, we may believe that traits relevant for a job are independent of somebody's gender. However, due to social biases in historical hiring decisions, this may not be represented as such in the data.
2. *Everybody **should be** equal*. For example, we may believe that different genders are not equally suitable for the job, but this is due to factors outside of the individual's control, such as lacking opportunities due to social gender norms.

Enforcing demographic parity might lead to differences in treatment across sensitive groups, causing otherwise similar people to be treated differently. For example, two people with the exact same features, apart from race, would get a different score prediction. This can be seen as a form of *procedural harm*. Consequently, demographic parity is only a suitable metric if one of the two underlying assumptions (everybody *is* or *should be* equal) holds. A limitation of demographic parity is that it does not put any constraints on the scores. For example, to fulfill demographic parity, you do not have to select the most risky people from different racial groups as long as you pick the same proportion for each group.


--- 

### Measuring Demographic Parity using Fairlearn

We can use Fairlearn's `MetricFrame` class to investigate the selection rate across groups. 

The class has the following parameters: 
* `metrics`: a callable metric (e.g., `selection_rate`, or `false_positive_rate`) or a dictionary of callables
* `y_true` : the ground-truth labels
* `y_pred` : the predicted labels
* `sensitive_features`: the sensitive features. Note that there can be multiple sensitive features.
* `control_features`: the control features. Control features are features for which you'd like to investigate disparities separately (i.e., "control for"), for example because you expect the feature to explain some of the observed disparities between sensitive groups.

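At initialization, the `MetricFrame` object computes the input metric(s) for each group defined by the sensitive features. The results are available through the following attributes:
* `MetricFrame.by_group`: a pandas Series (for a single metric) or DataFrame (for a dictionary of metrics) with the metric value for each group
* `MetricFrame.overall`: the metric value as computed over the entire dataset; a float for a single metric, or a pandas Series or DataFrame if a dictionary of metrics or `control_features` are used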

We can also summarize the results of the `MetricFrame` using one of the following methods:
* `MetricFrame.difference()` : return the maximum absolute difference between groups for each metric
* `MetricFrame.ratio()` : return the minimum ratio between groups for each metric.
* `MetricFrame.group_max()` : return the maximum value of the metric over the sensitive features.
* `MetricFrame.group_min()` : return the minimum value of the metric over the sensitive features.

The `MetricFrame` object is useful to do a thorough investigation of disparities. When we have (already) identified a definition of fairness that is relevant in our scenario, we may want to optimize for it during model selection. For this, it can be useful to have a single value that summarizes the disparity in a fairness metric.

We can directly summarize the extent to which demographic parity is violated using `demographic_parity_difference()` metric. This metric can also be used in, for example, a grid search. All fairness metrics in Fairlearn have the following arguments:
* `y_true`
* `y_pred`
* `sensitive_features`
* `method`: the method that is used to summarize the difference or ratio across groups. 
    * `'between_groups'`: aggregate the difference as the max difference between any two groups
    * `'to_overall'`: aggregate the difference as the max difference between any group and the metric value as computed over the entire dataset.
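
The two aggregation methods can give different answers when groups differ in size. The sketch below reproduces both summaries by hand with pandas on hypothetical toy data, as a plain illustration of the descriptions above rather than of Fairlearn's internal code.

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes (6 vs. 2 instances)
toy = pd.DataFrame(
    {
        "group": ["a"] * 6 + ["b"] * 2,
        "y_pred": [1, 1, 1, 1, 0, 0, 0, 0],
    }
)

by_group = toy.groupby("group")["y_pred"].mean()  # selection rate per group
overall = toy["y_pred"].mean()  # selection rate over all instances

# 'between_groups': largest difference between any two groups
between_groups = by_group.max() - by_group.min()
# 'to_overall': largest difference between any group and the overall value
to_overall = (by_group - overall).abs().max()

print("between_groups: %.2f" % between_groups)
print("to_overall: %.2f" % to_overall)
```

The overall value is dominated by the larger group, so the `'to_overall'` summary is never larger than the `'between_groups'` summary.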
    
There are several predefined metrics, such as `fairlearn.metrics.demographic_parity_difference()` and `fairlearn.metrics.equalized_odds_ratio()`. It is also possible to define your own fairness metric, based on e.g., a scikit-learn performance metric, using `fairlearn.metrics.make_derived_metric()`.

---
In the pre-trial risk assessment scenario, unequal selection rates mean that we predict, on average, recidivism more often for one group than the other. Let's investigate the selection rate of COMPAS.

### *Exercise*: use `MetricFrame` to compute difference in `selection_rate`

In [None]:
# TODO: compute and display selection_rate (i.e., proportion of positives in 'decile_score_cutoff') for each category
mf = MetricFrame(
    metrics={'selection rate' : }, # dictionary of metrics
    y_true=, # 'ground truth'
    y_pred=, # model predictions
    sensitive_features= , # column with sensitive feature
)

# TODO: display results
display() # by group
print("Overall selection rate: %.2f" % ) #overall

In [None]:
# compute demographic parity as the max difference between groups
print("demographic parity difference: %.2f" % mf.difference(method="between_groups")["selection rate"])

### *Exercise*: use `demographic_parity_difference` to compute difference in `selection_rate`

In [None]:
# TODO: summarize demographic parity using the metric (this should give the exact same result as mf.difference())
dpd = demographic_parity_difference(
    y_true=,
    y_pred=,
    sensitive_features=,
    method=,
)  # summarize as the max difference between any of the groups
print("demographic parity difference: %.2f" % dpd)

Clearly, **COMPAS' selection rate is higher for African-Americans**.

At this point we may wonder whether this disparity is introduced by COMPAS, or whether we can see a similar pattern in the original data. The selection rate observed in the data is also referred to as the **base rate**.

### *Exercise*: use `MetricFrame` to compute difference in *base rate* (i.e., `selection_rate` in ground-truth)

In [None]:
# TODO: compute and display selection_rate of ground truth ('two_year_recid') for each category
mf = MetricFrame(
    metrics={"base rate": selection_rate},
    y_true=, 
    y_pred=,
    sensitive_features=,
)
display(mf.by_group)

In [None]:
# summarize demographic parity as the max difference between groups
print("base rate diff: %.2f" % mf.difference(method="between_groups")["base rate"])

Although the difference is substantially smaller compared to COMPAS' selection rates, the base rates do differ across groups. There are several possible explanations of why these disparities arise in the data, for example:

* **The observed recidivism rates may not represent the actual recidivism rates.** Our target variable considers *re-arrests*, which is only a subset of the true cases of recidivism. It could be the case that the observed disparities reflect racist policing practices, rather than the true crime rate.
* **Social deprivations may have caused the true underlying recidivism rate to be different across groups.** In other words, African-American defendants may truly be more likely to fall back into criminal behavior, due to personal circumstances.

**Note that we cannot know which explanation holds from the data alone!** For this, we need a deeper understanding of the social context and data collection practices.

> #### Intermezzo: Construct Validity of Target Variables
A useful concept for thinking more deeply about the suitability of data for a particular purpose is **construct validity**. Construct validity is a concept from the social sciences that refers to *the extent to which a measurement actually measures the phenomenon we are trying to measure*.

> Construct validity is important to consider when you define the target variable of your machine learning model. In the context of fairness, [a lack of construct validity in the target variable can be a source of downstream model unfairness](https://arxiv.org/abs/1912.05511). For example:
> * [Healthcare costs can be a biased measurement of healthcare needs](https://science.sciencemag.org/content/366/6464/447.abstract), as costs may reflect patients' economic circumstances rather than their health.
> * Historical hiring decisions are not necessarily equivalent to historical employee quality, due to systemic and/or (unconscious) social biases in the hiring process.
> * Observed fraud is only a subsample of actual fraud. If potential cases of fraud are not selected randomly, there is a risk of selection bias. If the selection bias is associated with sensitive group membership, some groups may be overscrutinized, causing the observed fraud rate to be inflated.

> Each of these cases of 'bias' is much easier to spot when we look at our data through the lens of construct validity. For more information, check out [this section](https://fairlearn.org/v0.7.0/user_guide/fairness_in_machine_learning.html#construct-validity) in Fairlearn's user guide.

## Equalized Odds

If error rates differ across groups, there is a risk of **quality-of-service harm**.

> **Quality-of-service Harm**: the algorithm makes more mistakes for some groups than for others. 

For example, in a hiring scenario, we may mistakenly reject strong female candidates more often than strong male candidates. The risk of quality-of-service harm is particularly prevalent if the relationship between the features and target variable is different across groups. The risk is further amplified if less data is available for some groups. For example, strong candidates for a data science position may have either a quantitative social science background or a computer science background. Now imagine that in the past, hiring managers have mostly hired people with a computer science degree but hardly any social scientists. As a result, a machine learning model could mistakenly penalize people who do not have a computer science degree. If particular groups are overrepresented in the candidate pool of social scientists, the error rates may be higher for those groups, resulting in a quality-of-service harm.

One way to measure quality-of-service harm is through the [equalized odds](https://arxiv.org/abs/1610.02413) constraint.

> **Equalized Odds** holds if, for all values of $\hat{y}$ and $y$ and all groups $a$ and $a'$, $$P(\hat{Y} = \hat{y} | A = a, Y = y) = P(\hat{Y} = \hat{y} | A = a', Y = y)$$ where $\hat{Y}$ is the output of our model, $Y$ the observed outcome, and $A$ the set of sensitive characteristics.

In other words, the **false positive rate** and **true positive rate** (or, equivalently, **false negative rate**) should be equal across groups. 
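
As a minimal sketch of what this means in practice, the group-wise error rates can be computed by hand with pandas. The data frame `toy` and the helper `error_rates` are hypothetical and only illustrate the definitions; summarizing equalized odds as the larger of the two gaps is one possible convention.

```python
import pandas as pd

# Hypothetical toy data: observed outcomes, predictions and a sensitive feature
toy = pd.DataFrame(
    {
        "group": ["a"] * 4 + ["b"] * 4,
        "y_true": [0, 0, 1, 1, 0, 0, 1, 1],
        "y_pred": [1, 0, 1, 1, 0, 0, 1, 0],
    }
)


def error_rates(df):
    """Return the false positive rate and false negative rate of one group."""
    negatives = df[df["y_true"] == 0]
    positives = df[df["y_true"] == 1]
    return pd.Series(
        {
            "fpr": (negatives["y_pred"] == 1).mean(),  # negatives predicted positive
            "fnr": (positives["y_pred"] == 0).mean(),  # positives predicted negative
        }
    )


# one row per group, one column per error rate
rates = pd.DataFrame({g: error_rates(df) for g, df in toy.groupby("group")}).T
gaps = rates.max() - rates.min()  # gap between groups, per error rate
eo_difference = gaps.max()  # summarize as the larger of the two gaps

print(rates)
print("equalized odds difference: %.2f" % eo_difference)
```

In this toy example the two groups have the same number of errors, yet group `a` only receives false positives and group `b` only false negatives, so equalized odds is violated.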

#### When should we use equalized odds as a fairness metric?
Equalized odds quantifies the understanding of fairness that we should not make more mistakes for some groups than for other groups. Similar to demographic parity, the equalized odds criterion acknowledges that the relationship between the features and the target may differ across groups and that this should be accounted for. However, as opposed to the *everybody is or should be equal* assumptions of demographic parity, **equalized odds implicitly assumes that the target variable is a good representation of what we are actually interested in**.

--- 
As we have seen in the introduction, a false positive prediction in pre-trial risk assessment can have large consequences for the involved defendant. It may even result in the defendant pleading guilty to a crime they did not commit. Let's compute the false positive rates and false negative rates.

### *Exercise*: use `MetricFrame` and `equalized_odds_difference` to compute difference in *false positive rate* and *false negative rate*

In [None]:
# TODO: compute and display false_positive_rate and false_negative_rate
mf = MetricFrame(
    metrics={
        
    },
    y_true=data["two_year_recid"],
    y_pred=data["decile_score_cutoff"],
    sensitive_features=data["race"],
)
display(mf.by_group)

In [None]:
# print differences (Series.iteritems() was removed in pandas 2.0; use items())
for i in mf.difference(method="between_groups").items():
    print("%s diff: %.2f" % i)

In [None]:
# Alternatively: summarize equalized odds in one metric using equalized_odds_difference
# (the larger of the fpr diff and the tpr diff; the tpr diff equals the fnr diff)
eod = equalized_odds_difference(
    y_true=data["two_year_recid"],
    y_pred=data["decile_score_cutoff"],
    sensitive_features=data["race"],
    method="between_groups",
)
print("equalized odds diff: %.2f" % eod)

Similar to Propublica's assessment, we find that **the false positive rate is almost twice as high for African Americans compared to Caucasians**. In other words, African Americans are more often falsely predicted to be re-arrested. At the same time, the false negative rate is much higher for Caucasians, indicating that Caucasians are more often released even though they will re-offend. 

> #### Intermezzo: The Problem of Small Sample Sizes
As we have seen, group fairness metrics heavily rely on the estimation of group statistics such as the selection rate or the false positive rate. In many cases, the number of individuals in the data that belong to a particular subgroup can be very small. For example, the number of Asian and Native American defendants in the COMPAS data set is extremely small, comprising only 31 and 11 instances respectively. **With small sample sizes, statistical estimates can become very uncertain.** In those cases, it is impossible to even accurately *assess* the risk of fairness-related harms, let alone mitigate them. The problem of small sample sizes is further exacerbated when we consider intersectional subgroups (e.g., Black women). This is problematic, as harms often accumulate at the intersection of marginalized groups.
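
To get a feeling for how quickly the uncertainty grows, the sketch below computes the standard error of an estimated proportion (such as a selection rate) for different group sizes. It uses the simple normal approximation, which is itself rough for very small samples, so treat the numbers as indicative only. The helper name and the group size of 3000 are hypothetical; 11 and 31 are the group sizes mentioned above.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion p estimated from n instances (normal approximation)."""
    return math.sqrt(p * (1 - p) / n)


# 11 and 31 are the small group sizes mentioned above; 3000 is a hypothetical large group
for n in [11, 31, 3000]:
    se = proportion_standard_error(0.5, n)
    # 1.96 standard errors gives an approximate 95% interval around the estimate
    print("n=%4d  standard error=%.3f  approx. 95%% interval: +/- %.3f" % (n, se, 1.96 * se))
```

With only a handful of instances, the interval around a group's rate can easily be wider than the disparity we are trying to detect.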

### *Exercise*: use `MetricFrame` with `count` to compute the number of instances in each group

In [None]:
# TODO: compute and display counts for each group
mf = MetricFrame(
    metrics={
        "count": ,
    },
    y_true=data["two_year_recid"],
    y_pred=data["decile_score_cutoff"],
    sensitive_features=data[["race"]],
)
display(mf.by_group)

## Equal Calibration

Northpointe, the developers of COMPAS, responded to Propublica's analysis that COMPAS scores are fair because the scores are **equally calibrated** across racial groups. In other words, for each possible risk score, the probability that you belong to a particular class is the same, regardless of the group to which you belong.

> **Equal Calibration** holds if, for all values of $y$ and $r$ and all groups $a$ and $a'$, $$P(Y = y | A = a, \hat{Y} = r) = P(Y = y | A = a', \hat{Y} = r)$$ where $\hat{Y}$ is the output of our model (with $r$ a possible risk score), $Y$ the observed outcome, and $A$ the set of sensitive characteristics.

For example, given that an instance is predicted to belong to the negative class, the probability of actually belonging to the negative class is independent of sensitive group membership.  In the binary classification scenario, equal calibration implies that the **positive predictive value** (which is equivalent to *precision*) and **negative predictive value** are equal across groups.
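
The definition can be checked by hand by comparing, for each score value, the observed proportion of positives per group. The following is a minimal sketch on hypothetical toy data with only two score values; the data frame `toy` is made up so that equal calibration holds exactly.

```python
import pandas as pd

# Hypothetical toy data: two score values (1 and 2) and two groups
toy = pd.DataFrame(
    {
        "group": ["a"] * 8 + ["b"] * 8,
        "score": [1, 1, 1, 1, 2, 2, 2, 2] * 2,
        "y_true": [0, 0, 1, 0, 1, 1, 0, 1] + [0, 1, 0, 0, 1, 0, 1, 1],
    }
)

# proportion of observed positives for each (score, group) combination
calibration_table = toy.groupby(["score", "group"])["y_true"].mean().unstack("group")

# equal calibration holds if, for every score, the proportion is the same in each group
max_gap = (calibration_table["a"] - calibration_table["b"]).abs().max()

print(calibration_table)
print("largest calibration gap: %.2f" % max_gap)
```

Each row of the table corresponds to one score value; equal calibration requires the entries within a row to coincide.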


#### When should we use equal calibration as a fairness metric?

Equal calibration quantifies an understanding of fairness that a score should have the same *meaning*, regardless of sensitive group membership. Similar to equalized odds, the underlying assumption is that the target variable is a reasonable representation of what reality looks or should look like. However, as opposed to equalized odds, equal calibration does not acknowledge that the relationship between features and target variable may be different across groups.

Unlike demographic parity and equalized odds, achieving equal calibration usually does not require an active intervention. That is, we usually get equal calibration "for free" when we use machine learning approaches. As such, learning without explicit fairness constraints often [implicitly optimizes for equal calibration](https://arxiv.org/abs/1808.10013).

--- 

Let's verify Northpointe's claim regarding the calibration of COMPAS scores. The positive predictive value is equivalent to precision, so we can simply use `sklearn.metrics.precision_score`. To compute the negative predictive value, we can define a new function.

In [None]:
# first, we define a function to compute the negative predictive value
def negative_predictive_value_score(y_true, y_pred, **kwargs):
    """
    NPV is not in scikit-learn, but is the same as PPV but with 0 and 1 swapped.
    """
    return precision_score(y_true, y_pred, pos_label=0, **kwargs)

### *Exercise*: use `MetricFrame` to compute difference in *positive predictive value* and *negative predictive value*

In [None]:
# TODO: compute and display precision_score and negative_predictive_value_score for each group
mf = MetricFrame(
    metrics={
        "ppv": ,
        "npv": ,
    },
    y_true=data["two_year_recid"],
    y_pred=data["decile_score_cutoff"],
    sensitive_features=data["race"],
)
display(mf.by_group)

In [None]:
# summarize differences (Series.iteritems() was removed in pandas 2.0; use items())
for i in mf.difference(method="between_groups").items():
    print("%s diff: %.2f" % i)

> #### Intermezzo: Construct Validity of Sensitive Characteristics

> Many sensitive characteristics, such as race and gender, are *social constructs*, which are multidimensional and dynamic. There are many different ways to measure *'race'* as a feature in your data set. Similar to construct validity of the target variable, some measurements may be more appropriate in a particular context than others. For example, dimensions of race include self-reported racial identity, observed race based on appearance, or observed race based on interactions. [How you measure sensitive group membership changes the conclusions you can draw](https://arxiv.org/abs/1912.03593). 

> The racial categories in the COMPAS dataset are based on those that are used by Broward County Sheriff’s Office. It is unclear whether the measurements are the result of self-identification, observed by police officers, or something else. Moreover, it is unclear whether it was possible to enter multi-categorical race, even though some people may identify with multiple races.

### Customized Fairness Metrics using `make_derived_metric`
We can also define a custom fairness metric for NPV using `fairlearn.metrics.make_derived_metric`. This function takes the following parameters:
* `metric`: a callable metric.
* `transform` : a string indicating the type of transformation, one of `['difference', 'group_min', 'group_max', 'ratio']`
* `sample_param_names` : a list of parameters names of the underlying metric which should be treated as sample parameters. This defaults to a list with a single entry of `sample_weight` (as used by many scikit-learn metrics). If `None` or an empty list is supplied, then no parameters will be treated as sample parameters.

The function returns a function with the same signature as the supplied metric, but with additional `sensitive_features=` and `method=` arguments. Under the hood, this function uses `MetricFrame` to compute the metric disaggregated per group.

### *Exercise*: use `make_derived_metric` to create a custom fairness metric for NPV and use it to compute the difference

In [None]:
# TODO: make custom metric for npv
npv_score_diff = make_derived_metric(
    metric=, 
    transform= # compute difference
)

In [None]:
# TODO: use the new metric to compute the difference between groups
npvd = npv_score_diff(
    data["two_year_recid"],
    data["decile_score_cutoff"],
    sensitive_features=data["race"],
    method=,
)
print("npv diff: %.2f" % npvd)

We can further investigate the calibration of the original COMPAS scores (i.e., before we binarized them using a cut-off value of 5) in more detail by plotting a **calibration curve** for each racial group.

### Plot group-specific calibration curves

In [None]:
plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
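for race in ["Caucasian", "African-American"]:
    group = data[data["race"] == race]
    # the `normalize` argument was removed from scikit-learn's calibration_curve,
    # so we min-max scale the decile scores (1-10) to [0, 1] ourselves
    scores = group["decile_score"]
    y_prob = (scores - scores.min()) / (scores.max() - scores.min())
    prob_true, prob_pred = calibration_curve(
        y_true=group["two_year_recid"],
        y_prob=y_prob,
        n_bins=10,
    )
    plt.plot(prob_pred, prob_true, label=race)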
plt.title("Calibration Curves COMPAS scores")
plt.xlabel("Mean Predicted Value")
plt.ylabel("Proportion of Positives")
plt.legend()
plt.show()

Indeed, we see that the calibration curves are similar for both groups, indicating that COMPAS scores are equally calibrated for African-Americans and Caucasians.

## Impossibilities 

In this tutorial, we have seen that one notion of fairness (equal calibration) holds for COMPAS scores, whereas others (equalized odds and demographic parity) do not. These findings are not specific to the COMPAS case.

It has been proven mathematically that in cases where *sensitive group membership is **not** independent of the target variable* and the classifier's output is well calibrated, [it is impossible for these three fairness criteria to hold at the same time](https://arxiv.org/pdf/1609.05807.pdf).

* *Demographic Parity and Equal Calibration*. If group membership is related to the target variable, one group has a higher base rate (i.e., proportion of positives) than the other. If we want to enforce demographic parity in this scenario, we need to select more positives in the disadvantaged group than suggested by the observed target outcome. Consequently, the positive predictive value of our classifier will be different for each group, because the proportion of true positives from all instances we predicted to be positive will be lower in the disadvantaged group.

* *Demographic parity and Equalized Odds*. As before, the only way to satisfy demographic parity with unequal base rates is to classify some instances of the disadvantaged group as positives, even if they should be negatives according to the observed target variable. Hence, provided that the scores are well-calibrated, we cannot satisfy both demographic parity and equalized odds at the same time. In a binary scenario, calibration corresponds to using the same cut-off score for each group.

* *Equal Calibration and Equalized Odds*. When a classifier is imperfect, it is impossible to satisfy both equal calibration and equalized odds at the same time. An intuitive explanation of this impossibility is to recall that equal calibration requires equal *positive predictive value* across groups (a.k.a., precision), whereas equalized odds requires equal false negative rate, which corresponds to equal true positive rate (a.k.a. recall). If we adjust our classifier such that the precision is equal across groups, this will decrease the recall, and vice versa.
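
The last trade-off can be made concrete with a small calculation. From the confusion-matrix definitions it follows that, for a group with base rate $p$, the false positive rate is tied to the positive predictive value and the true positive rate through $FPR = \frac{p}{1-p} \cdot \frac{1 - PPV}{PPV} \cdot TPR$. This identity is not stated in the tutorial itself; it is a standard consequence of the definitions. The sketch below plugs in hypothetical numbers for two groups that share the same PPV and TPR but have different base rates.

```python
def implied_false_positive_rate(base_rate, ppv, tpr):
    """False positive rate implied by a group's base rate, PPV and TPR."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr


# Hypothetical numbers: same PPV and TPR in both groups, different base rates
fpr_group_1 = implied_false_positive_rate(base_rate=0.5, ppv=0.6, tpr=0.7)
fpr_group_2 = implied_false_positive_rate(base_rate=0.4, ppv=0.6, tpr=0.7)

print("implied FPR group 1: %.2f" % fpr_group_1)
print("implied FPR group 2: %.2f" % fpr_group_2)
```

With unequal base rates, equal PPV and equal TPR force the false positive rates apart, so an imperfect classifier cannot satisfy both criteria at once.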

It is important to realize that **the impossibilities are not so much a mathematical dispute, but a dispute of the underlying theoretical understanding of what we consider fair**. Which notion of fairness is relevant depends on your assumptions about the context and your underlying moral values. In practice, I encourage you to make your assumptions explicit when discussing fairness with other stakeholders.


---
## Challenges and Limitations of Group Fairness Metrics

#### Practical Challenges
* **Identifying sensitive groups**: identifying which groups are at risk for fairness-related harms (and how to measure group membership!) is non-trivial and requires a deep understanding of the sociotechnical context.
* **Access to sensitive features**: due to privacy regulations or practical availability, sensitive features may not be available.
* **Imprecise estimations**: small sample sizes and the problem of multiple comparisons can lead to imprecise estimations of group statistics.

#### Limitations
* **Ignore within-group differences**: group statistics may disguise differences within groups.
* **Merely observational**: group fairness metrics are observational; they do not consider *how* the prediction was achieved (e.g., using which features).
* **Disregard individual experience**: group metrics disregard individual experiences. In practice, some outcomes may not be universally beneficial or harmful. For example, selection rate may not be an adequate measurement of benefit if getting selected is not equally beneficial for each individual.
* **Narrow scope**: group metrics only consider direct outcomes of the model rather than outcomes of the system. For example, the final outcome of the COMPAS model is determined by how judges interpret the provided risk scores.

--- 
## Concluding Remarks
The main takeaways of this tutorial are:
* Different **fairness metrics** represent different **theoretical understandings** of fairness, which is reflected in their incompatibility.
* **Construct validity** is central to assessing fairness and should be considered when you define a target variable, measure sensitive group membership, and choose a fairness metric.

### Discussion Points
* What notion of fairness is most appropriate in the pre-trial risk assessment scenario, in your opinion? Why? If you feel like you don't have enough information to answer this question, which information would you need to make an informed judgment?
* A way to account for unequal selection rates is to use a different cut-off score for each group. Note that this policy has the consequence that two defendants with the same risk score but a different race may be classified differently. Under what conditions would you consider such a policy fair, if any?
* How equal is equal enough? How much overall performance would you sacrifice for optimizing for a fairness metric, if any?

### Further Resources on Algorithmic Fairness
In addition to the clickable links in the tutorial, the following resources may be helpful:
* [*Fairness and Machine Learning - Limitations and Opportunities*](https://fairmlbook.org) by Solon Barocas, Moritz Hardt, and Arvind Narayanan. 
* [*An Introduction to Algorithmic Fairness*](https://arxiv.org/abs/2105.05595) by yours truly.