# Measuring Fairness

The first step to addressing algorithmic bias is to be able to measure the degree of fairness (or lack thereof) in a given dataset. This chapter covers a number of commonly used fairness measures. For concreteness, we will demonstrate these fairness measures by calculating them on the [Compas Recidivism Dataset](https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis).

```{admonition} Data: Compas Recidivism Dataset
:class: note
- Location: "data/recidivism/compas-scores-two-years.csv"
- Shape: (7214, 53)
- Source: ProPublica
```

The following script imports the dataset and prints the first few rows.

In [5]:
# import compas two-year dataset
import pandas as pd
compas = pd.read_csv('../data/recidivism/compas-scores-two-years.csv')
compas.head()

| | id | name | first | last | compas_screening_date | sex | dob | age | age_cat | race | ... | v_decile_score | v_score_text | v_screening_date | in_custody | out_custody | priors_count.1 | start | end | event | two_year_recid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | miguel hernandez | miguel | hernandez | 2013-08-14 | Male | 1947-04-18 | 69 | Greater than 45 | Other | ... | 1 | Low | 2013-08-14 | 2014-07-07 | 2014-07-14 | 0 | 0 | 327 | 0 | 0 |
| 1 | 3 | kevon dixon | kevon | dixon | 2013-01-27 | Male | 1982-01-22 | 34 | 25 - 45 | African-American | ... | 1 | Low | 2013-01-27 | 2013-01-26 | 2013-02-05 | 0 | 9 | 159 | 1 | 1 |
| 2 | 4 | ed philo | ed | philo | 2013-04-14 | Male | 1991-05-14 | 24 | Less than 25 | African-American | ... | 3 | Low | 2013-04-14 | 2013-06-16 | 2013-06-16 | 4 | 0 | 63 | 0 | 1 |
| 3 | 5 | marcu brown | marcu | brown | 2013-01-13 | Male | 1993-01-21 | 23 | Less than 25 | African-American | ... | 6 | Medium | 2013-01-13 | | | 1 | 0 | 1174 | 0 | 0 |
| 4 | 6 | bouthy pierrelouis | bouthy | pierrelouis | 2013-03-26 | Male | 1973-01-22 | 43 | 25 - 45 | Other | ... | 1 | Low | 2013-03-26 | | | 2 | 0 | 1102 | 0 | 0 |


In [7]:
# following the ProPublica analysis, we remove several rows with missing data
# see https://github.com/propublica/compas-analysis/blob/master/Compas%20Analysis.ipynb for more details
compas = compas[['age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']]
compas = compas[(compas['days_b_screening_arrest'] <= 30) & 
                (compas['days_b_screening_arrest'] >= -30) &  
                (compas['is_recid'] != -1) &
                (compas['c_charge_degree'] != 'O') & 
                (compas['score_text'] != 'N/A')]
compas.shape

(6172, 13)

In [16]:
# finally, we focus on "race" as the protected attribute and African American vs. Caucasian as the two groups
compas = compas[compas['race'].isin(['African-American', 'Caucasian'])]
compas.shape

(5278, 13)

## Fairness Measures for Binary Classifier

To begin, we consider measuring the fairness of a classification model. In the ```Compas``` dataset, the ```score_text``` column contains the model-predicted risk level (low, medium, high) and the ```two_year_recid``` column contains the actual two-year recidivism label (1, 0). Because the outcome label is binary, we focus on the "low" and "high" predicted classes here.

In [17]:
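# keep only the "Low" and "High" predictions by dropping "Medium" in score_text
# .copy() gives an independent DataFrame, which avoids pandas' SettingWithCopyWarning when a column is added later
compas_binary = compas[compas['score_text'] != 'Medium'].copy()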

### Statistical Parity

**Statistical parity** is one of the most straightforward fairness measures. It asserts that one group should not receive systematically more favorable predicted outcomes than the other group. Despite its simplicity, it has been extensively used and discussed in the prior literature, such as {cite:t}`calders2010three,calders2009building,kamiran2009classifying,kamishima2011fairness`, among many others.

In the context of recidivism prediction, with the protected attribute being race and the two groups to be compared being African American vs. White, statistical parity can be defined as:

```{admonition} Definition: Statistical Parity for Binary Classifier
:class: tip
$$
\Pr(\text{prediction} = \text{high} \mid \text{race} = \text{African American}) = \Pr(\text{prediction} = \text{high} \mid \text{race} = \text{White})
$$
```

In [18]:
# evaluate statistical parity
# boolean indicator of a "high" predicted score; its mean within a group is that group's share of "high" scores
is_high = compas_binary['score_text'] == 'High'
# percentage of "high" scores among African Americans
AA_pct = is_high[compas_binary['race'] == 'African-American'].mean()
# percentage of "high" scores among Whites
W_pct = is_high[compas_binary['race'] == 'Caucasian'].mean()
print("Percentage of 'high' predicted scores among African Americans: ", AA_pct)
print("Percentage of 'high' predicted scores among Whites: ", W_pct)

Percentage of 'high' predicted scores among African Americans:  0.38566864445458693
Percentage of 'high' predicted scores among Whites:  0.13680981595092023


As we can see, around 38.6\% of African American defendants received high-risk predictions, whereas only 13.7\% of White defendants did, a gap of roughly 25 percentage points. This is a violation of statistical parity.
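The comparison above can be wrapped in a small reusable helper. The following is a minimal sketch rather than part of the original analysis: the function name `statistical_parity_gap` and the toy DataFrame are hypothetical, and only the column names `race` and `score_text` follow the Compas data used in this chapter. It reports the rate of positive predictions in each group, together with their difference and ratio.

```python
import pandas as pd

def statistical_parity_gap(df, group_col, pred_col, positive, group_a, group_b):
    """Return (rate_a, rate_b, difference, ratio) of positive predictions.

    Hypothetical helper for illustration; not a library function.
    The ratio is undefined when group_b has no positive predictions.
    """
    is_positive = df[pred_col] == positive
    rate_a = is_positive[df[group_col] == group_a].mean()
    rate_b = is_positive[df[group_col] == group_b].mean()
    return rate_a, rate_b, rate_a - rate_b, rate_a / rate_b

# Toy data, made up for illustration (not drawn from Compas)
toy = pd.DataFrame({
    'race': ['African-American'] * 4 + ['Caucasian'] * 4,
    'score_text': ['High', 'High', 'Low', 'Low', 'High', 'Low', 'Low', 'Low'],
})

rate_a, rate_b, diff, ratio = statistical_parity_gap(
    toy, 'race', 'score_text', 'High', 'African-American', 'Caucasian')
print(rate_a, rate_b, diff, ratio)
```

Statistical parity holds exactly when the difference is 0 (equivalently, when the ratio is 1). Calling the same helper on `compas_binary` should reproduce the two rates reported above.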

### Conditional Statistical Parity

Statistical parity is a rather crude measure. In particular, it does not account for systematic differences between the two groups that may explain some of the disparity in predicted outcomes. **Conditional Statistical Parity** addresses this issue by conditioning on other observable characteristics. It has also been discussed in prior literature, for example in {cite:t}`vzliobaite2011handling,kamiran2013quantifying`.

```{admonition} Definition: Conditional Statistical Parity for Binary Classifier
:class: tip
Let $\boldsymbol{X}$ denote a vector of observable characteristics.

$$
\Pr(\text{prediction} = \text{high} \mid \text{race} = \text{African American}, \boldsymbol{X}) = \Pr(\text{prediction} = \text{high} \mid \text{race} = \text{White}, \boldsymbol{X})
$$
```
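When $\boldsymbol{X}$ takes only a few discrete values, one way to read this definition is to compare high-risk rates within each stratum of $\boldsymbol{X}$. The sketch below illustrates the idea on a made-up toy DataFrame. The data and the choice of `age_cat` as the single conditioning variable are assumptions for illustration; only the column names mirror the Compas data. With several covariates, or continuous ones, strata quickly become too sparse, and a regression-based check is more practical.

```python
import pandas as pd

# Toy data, made up for illustration (not drawn from Compas)
toy = pd.DataFrame({
    'race': ['African-American', 'African-American', 'Caucasian', 'Caucasian',
             'African-American', 'African-American', 'Caucasian', 'Caucasian'],
    'age_cat': ['Less than 25'] * 4 + ['Greater than 45'] * 4,
    'score_text': ['High', 'High', 'High', 'Low',
                   'High', 'Low', 'Low', 'Low'],
})

# Indicator for a high-risk prediction
toy['high'] = (toy['score_text'] == 'High').astype(int)

# Rate of high-risk predictions by race within each age category
rates = toy.groupby(['age_cat', 'race'])['high'].mean().unstack('race')

# Within-stratum gap; conditional statistical parity requires 0 in every stratum
rates['gap'] = rates['African-American'] - rates['Caucasian']
print(rates)
```

In this toy example the gap is 0.5 in both age categories, so the disparity is not explained by age category alone.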

In [25]:
# evaluate conditional statistical parity
# following ProPublica's analysis, X may contain: age, gender, number of prior offenses, and severity of charge
# they can be used as control variables in a logistic regression of predicted risk score on race
import statsmodels.formula.api as smf
# convert score_text to 1 and 0
compas_binary['Y'] = compas_binary['score_text'].apply(lambda x: 1 if x == 'High' else 0)
model = smf.logit(formula = "Y ~ race + age + sex + priors_count + c_charge_degree", data = compas_binary).fit()
model.summary()


Optimization terminated successfully.
         Current function value: 0.360191
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | Y | No. Observations: | 3821.0 |
| Model: | Logit | Df Residuals: | 3815.0 |
| Method: | MLE | Df Model: | 5.0 |
| Date: | Tue, 07 May 2024 | Pseudo R-squ.: | 0.3921 |
| Time: | 15:22:57 | Log-Likelihood: | -1376.3 |
| converged: | True | LL-Null: | -2263.9 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.5816 | 0.224 | 11.531 | 0.000 | 2.143 | 3.020 |
| race[T.Caucasian] | -0.6532 | 0.105 | -6.229 | 0.000 | -0.859 | -0.448 |
| sex[T.Male] | -0.0371 | 0.128 | -0.290 | 0.772 | -0.288 | 0.214 |
| c_charge_degree[T.M] | -0.4953 | 0.106 | -4.682 | 0.000 | -0.703 | -0.288 |
| age | -0.1396 | 0.007 | -19.506 | 0.000 | -0.154 | -0.126 |
| priors_count | 0.3782 | 0.016 | 23.279 | 0.000 | 0.346 | 0.410 |


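Even after controlling for age, sex, number of prior offenses, and severity of charge, the coefficient on `race[T.Caucasian]` is negative and statistically significant (-0.653, p < 0.001). In other words, White defendants have significantly lower odds of receiving a high-risk prediction than African American defendants who are comparable on these covariates. This is a violation of conditional statistical parity.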
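Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to interpret. The short calculation below uses the `race[T.Caucasian]` coefficient and its 95% confidence bounds as reported in the regression output above; the variable names are illustrative.

```python
import numpy as np

# Coefficient and 95% confidence bounds for race[T.Caucasian],
# copied from the regression output above
coef, ci_low, ci_high = -0.6532, -0.859, -0.448

# Odds ratio of a high-risk prediction for White relative to
# African American defendants, holding the other covariates fixed
odds_ratio = np.exp(coef)
or_low, or_high = np.exp(ci_low), np.exp(ci_high)

# Reciprocal: odds for African American relative to White defendants
inverse_odds_ratio = 1 / odds_ratio

print(f"Odds ratio (White vs. African American): {odds_ratio:.2f} "
      f"(95% CI {or_low:.2f} to {or_high:.2f})")
print(f"Odds ratio (African American vs. White): {inverse_odds_ratio:.2f}")
```

The odds ratio comes out at about 0.52, so the odds of a high-risk prediction for an African American defendant are roughly 1.9 times those of a White defendant with the same covariates. With the fitted model in hand, `np.exp(model.params)` and `np.exp(model.conf_int())` give the same quantities for every coefficient at once.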

### Error Balance / Equalized Odds