<h1 align='center'>Ethics in Artificial Intelligence</h1>

<h3 align='center'>Laura G. Funderburk</h3>

<h3 align='center'>Data Scientist, Cybera</h3>



<h2 align='center'>Workshop features Fairlearn</h2>

The literature and code in this notebook were inspired by Selbst et al., "Fairness and Abstraction in Sociotechnical Systems", Fairlearn's Python package documentation, and Fairlearn's SciPy 2021 tutorial:

SciPy 2021 Tutorial: Fairness in AI systems: From social context to practice using Fairlearn by Manojit Nandi, Miroslav Dudík, Triveni Gandhi, Lisa Ibañez, Adrin Jalali, Michael Madaio, Hanna Wallach, Hilde Weerts is licensed under
[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).


<h2 align='center'>About Fairlearn</h2>

Fairlearn is an open-source, community-driven project to help data scientists improve fairness of AI systems. It includes:

* A Python library for fairness assessment and improvement (fairness metrics, mitigation algorithms, plotting, etc.)

* Educational resources covering organizational and technical processes for unfairness mitigation (user guide, case studies, Jupyter notebooks, etc.)

The project was started in 2018 at Microsoft Research. In 2021 it adopted a neutral governance structure, and it has been completely community-driven since then.

Read more: https://fairlearn.org

<h2 align='center'>Why Ethics in AI Matters</h2>


 AI systems can behave unfairly for a variety of reasons: 

1. Societal biases are reflected in the training data. 

2. Societal biases are also reflected in the decisions made during the development and deployment of these systems. 

3. AI systems behave unfairly because of characteristics of the data or characteristics of the systems themselves. 

$$\Rightarrow \text{They are not mutually exclusive and often exacerbate one another} \Leftarrow$$ 



<h2 align='center'>Motivating example</h2>

Consider the use of "risk scores" to decide the outcomes individuals
will experience. 

- Risk of recidivism in crime.
- Risk of failure within a new role during recruitment.
- Risk of failure in a patient's chosen treatment.

$\Rightarrow$ How successful is a model at correctly assessing an individual's risk? That is, does the score accurately depict reality? 

$\Rightarrow$ What is the impact of a wrongful assignment (for example, an innocent individual being incarcerated versus an unqualified individual being hired)?

<h2 align='center'>How can we determine whether an AI is behaving unfairly?</h2>


1. By identifying underlying misconceptions within the AI system (causal: societal context or bias, or intent, such as prejudice)
2. By studying the impact of the AI system on people (outcome: harms vs gains)

**For this workshop, we define whether an AI system is behaving unfairly in terms of its impact on people.**

<h2 align='center'>A note on the word bias</h2>

In this presentation I refer to fairness in terms of **harms** rather than specific **causes** (such as societal context). I avoid the word *bias* and focus instead on the relevant social context that might lead to a harm.


<h2 align='center'>Types of harms</h2>

From keynote by K. Crawford at NeurIPS 2017 <https://www.youtube.com/watch?v=fMym_BKWQzk>

* *Allocation harms* can occur when AI systems extend or withhold
  opportunities, resources, or information. 
  
  **Sample key applications: hiring, school admissions, and lending.**
  

<h2 align='center'>Types of harms</h2>

From keynote by K. Crawford at NeurIPS 2017 <https://www.youtube.com/watch?v=fMym_BKWQzk>


* *Quality-of-service harms* can occur when a system does not work as well for
  one person as it does for another, even if no opportunities, resources, or
  information are extended or withheld. 
  
  **Sample key applications: accuracy in face recognition, document search, or product recommendation.**

<h2 align='center'>How we work with information: abstraction</h2>


**Abstracting in computer science:**  

This is the process of removing physical, spatial, or temporal details or attributes in the study of objects or systems with the goal of focusing attention on details deemed more important. 



<h2 align='center'>How we work with information: abstraction</h2>


**Abstracting in mathematics:** 

This is the process of 

1. Extracting the underlying structures, patterns or properties of a mathematical concept;

2. Removing any dependence on real world objects with which it might originally have been connected;

3. Generalizing it so that it has wider applications.


<h2 align='center'>How we work with information: abstraction</h2>


**Abstracting in machine learning:** 

Machine Learning encompasses all approaches (design and development of algorithms) that allow a computer to “learn”, based on a database of examples or sensor data (abstractions of real situations).

In Machine Learning abstraction manifests in algorithms learning from example (supervised/unsupervised learning), and learning from reinforcement (reinforcement learning).

<h2 align='center'>How we work with information: abstraction</h2>


**Abstracting in machine learning:** 

Machine Learning algorithms can be classified into three broad classes [1]

1. In supervised learning, an algorithm learns a function that maps inputs to a given set of labels (classes).

2. In unsupervised learning, an algorithm learns how to unravel hidden structures in unlabeled data (also called observations).

3. In reinforcement learning, an algorithm learns how an agent ought to take actions in an environment so as to maximize some reward.




<h2 align='center'>How do we fall into misrepresentations in machine learning?</h2>


$\Rightarrow$ When we abstract and remove attributes or properties that have dependence on a social context, how do we determine which properties are *worth* preserving and describing?

$\Rightarrow$ What is the trade off we make when discarding a property, and are we aware of the consequences of that trade off?

$\Rightarrow$ Can we guarantee that we are able to identify and quantify the consequences of removing a socially dependent attribute? Who is on the receiving end of these consequences?

<h2 align='center'>How we frame a problem is key to identifying problems in abstraction</h2>
<h3 align='center'>The algorithmic frame</h3>

Frame centered around choices made when abstracting a problem in the form of representations (input data) and labelling (outcome). 

Evaluated on accuracy and generalizability to data the model did not train on. 

**Fairness cannot be defined in this frame $\Rightarrow$ goal is to produce a model that best captures the relationship between representations and labels.** 

<h2 align='center'>How we frame a problem is key to identifying problems in abstraction</h2>
<h3 align='center'>The data frame</h3>

This frame is concerned with the quality of representations (input data) and the outcomes that result. 

**This additional frame allows us to question the inherent (un)fairness present in input and output data.**

For example, under this frame we can ask whether the training dataset for a recommendation algorithm includes demographic and socio-economic information, and we can assess the quality of its recommendations across demographic and socio-economic groups.

<h2 align='center'>How we frame a problem is key to identifying problems in abstraction</h2>
<h3 align='center'>The sociotechnical frame</h3>

$$\text{Technical Component} + \text{Social Component} = \text{Sociotechnical system}$$


This frame recognizes that a machine learning model is part of the interaction between people and technology, and thus any social components of this interaction need to be modelled and incorporated accordingly.

**Designers of machine learning systems who fail to consider the way in which social contexts and technologies
are interrelated are at risk of falling into "abstraction traps".**

<h2 align='center'>An example of the impact of using different frames in situations affecting human lives</h2>

Take the example of assessing the risk that an individual charged with an offense
will re-engage in criminal behaviour, and of choosing appropriate measures to prevent
relapse, while failing to consider factors such as race, socio-economic status, and
mental health, along with the socially dependent views held by judges, police officers,
or any other actors responsible for recommending a course of action.



In the algorithmic framework, for example, input variables may contain previous criminal history,
statements taken by the accused, witnesses and police officers. Labels (outcome)
include recommendations by the algorithm on an appropriate course of action based
on a computed risk score. The model is limited in its ability to assess the fairness of outcomes.


The data framework could attempt to reduce unfairness by studying socio-economic
information regarding the accused, their upbringing and how it relates to their
current status, along with a recommendation that incorporates these factors into their
recovery. This frame is limited if it does not take into account the societal context surrounding the individuals making decisions. 



Within the sociotechnical framework the model incorporates not only more nuanced
data on the history of the case, but also the social context in which judging and
charging people with offenses take place. This model incorporates the processes
associated with crime reporting, the offense-trial pipeline, and identifies areas
in which different people interact with one another as outcomes are recommended.

<h2 align='center'>What traps can we fall into when modeling a social problem?</h2>


Machine learning systems used in the real world are inherently sociotechnical
systems.

Designers of machine learning systems typically translate a real-world context into a machine learning
model through abstraction: 

* Focus on 'relevant' aspects of that context (input, output, relationship)

* Problems arise when we abstract away the social context. I refer to these problems as 'abstraction traps': failures to consider how social context and technology
are interrelated.

<h2 align='center'>Common traps we can fall into</h2>

Selbst *et al.* [1] identify five traps we can fall into when implementing a machine learning model:

* The Solutionism Trap

* The Ripple Effect Trap

* The Formalism Trap

* The Portability Trap

* The Framing Trap

<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Solutionism Trap</h3>


This trap occurs when we assume that the best solution to a problem
may involve technology, and fail to recognize other possible solutions
outside of this realm. Solutionist approaches may also not be appropriate
in situations where definitions of fairness may change over time.

<h3 align='center'>The Solutionism Trap: example</h3>


Example: consider the problem of internet connectivity in rural communities.

One way to fall into the solutionism trap is to assume that studying internet
speed in a given region with data science will yield the insights needed to
negotiate deals or identify good policies. However, if the community faces
additional socioeconomic challenges, for example with education, infrastructure,
information technology, or health services, then an algorithmic solution
focused purely on internet speed may fail to meaningfully address the
community's needs.

<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Ripple Effect Trap</h3>

This trap occurs when we do not consider the unintended consequences of
introducing technology into an existing social system. 

These consequences include changes in behaviors, outcomes, and individual
experiences, as well as changes in the underlying social values and incentives
of a given social system, for instance by increasing the perceived value of
quantifiable metrics over non-quantifiable ones.

<h3 align='center'>The Ripple Effect Trap: example</h3>


Example: consider the problem of banks deciding whether an individual should
be approved for a loan. Prior to using machine learning algorithms
to compute a "score", banks might rely on loan officers engaging in
conversations with clients, recommending a plan based on their unique
situation, and discussing with other team members to obtain feedback.
By introducing an algorithm, it is possible that loan officers may limit
their conversations with team members and clients, assuming the algorithm's
recommendations are good enough without those additional sources of information.

<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Formalism Trap</h3>

Many tasks of a data scientist involve some form of formalization: from
measuring real-world phenomena as data to translating business KPIs
and constraints into metrics, loss functions, or parameters. We fall into the
formalism trap when we fail to account for the full meaning of social
concepts like fairness.



<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Formalism Trap</h3>

Fairness is a complex construct that is contested: different people may
have different ideas of what is fair in a particular scenario. 

Mathematical fairness metrics may capture some aspects of fairness, but they
fail to capture all relevant aspects. 

For example, group fairness metrics
do not account for differences in individual experiences nor do they
account for procedural justice.

Mathematical abstraction is also limited in its ability to capture
procedurality, contextuality, and contestability.
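
To make this limitation concrete, here is a small hand-made sketch. The labels and predictions are invented for illustration and the two helper functions are hypothetical, not part of any library. Both groups receive positive predictions at the same rate, so a check based only on selection rates would pass, yet every person in group B who should have been selected is missed.

```python
# Toy data, invented for illustration: (true label, predicted label) per person.
group_a = [(1, 1), (1, 1), (0, 0), (0, 0)]
group_b = [(1, 0), (1, 0), (0, 1), (0, 1)]

def selection_rate(pairs):
    """Fraction of people who receive a positive prediction."""
    return sum(pred for _, pred in pairs) / len(pairs)

def false_negative_rate(pairs):
    """Fraction of truly positive people who are predicted negative."""
    positives = [pred for true, pred in pairs if true == 1]
    return sum(1 for pred in positives if pred == 0) / len(positives)

for name, pairs in [("A", group_a), ("B", group_b)]:
    print(name, selection_rate(pairs), false_negative_rate(pairs))
```

A single metric can therefore look fair while another reveals a harm, which is one reason no single formula should be treated as a complete definition of fairness.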

<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Portability Trap</h3>


This trap occurs when we fail to understand how reusing a model or
algorithm that is designed for one specific social context may not
necessarily apply to a different social context. Reusing an algorithmic
solution and failing to take into account differences in involved social
contexts can result in misleading results and potentially harmful consequences.

<h3 align='center'>The Portability Trap: example</h3>

For instance, reusing a machine learning algorithm used to screen
job applications in the nursing industry, for a system used to screen
job applications in the information technology sector could fall into the
portability trap. 

Factor I: difference in skills required to succeed in both industries.

Factor II: the demographic differences (in terms of gender) between employees
in each of these industries, which may result from wording in job postings,
social constructs on gender and societal roles, and the male-female ratio of
successful applicants in each field.

<h2 align='center'>Common traps we can fall into</h2>
<h3 align='center'>The Framing Trap</h3>

This trap occurs when we fail to consider the full picture surrounding
a particular social context when abstracting a social problem. 

* The social landscape that the chosen phenomenon exists in

* Characteristics of individuals or circumstances of the chosen situation

* Third parties involved along with their circumstances

* The task that is being set out to abstract (i.e. calculating a risk score, choosing between a pool of candidates, selecting an appropriate treatment, etc).


<h4 align='center'>Recommendation: taking a sociotechnical frame helps us reduce our risk of falling into this trap</h4>

<h2 align='center'>What does this look like in practice?</h2>



Consider reusing a machine learning algorithm built to screen job applications in the
nursing industry to screen job applications in the information technology sector. An intuitive
yet important difference between the two contexts is the skills required to
succeed in each industry. A slightly more subtle difference is the gender composition
of the workforce in each industry, which results from wording in job postings,
social constructs on gender and societal roles, and the male-female ratio of successful
applicants in each field.

<h2 align='center'>Introducing Fairlearn:<br> a Python library focused on decreasing unfairness in machine learning models </h2>

<h2 align='center'>Introduction to the health care scenario</h2>

Our scenario builds on previous research that highlighted racial disparities in how health care resources are allocated in the U.S. ([Obermeyer et al., 2019](https://science.sciencemag.org/content/366/6464/447.full)).
Motivated by that work, in this tutorial we consider an automated system for recommending patients for _high-risk care management_ programs, which are described by Obermeyer et al. 2019 as follows:


> These programs seek to improve the care of patients with complex health needs by providing additional resources such as greater attention from trained providers, to help ensure that care is well coordinated. 

> Because the programs are themselves expensive—with costs going toward teams of dedicated nurses, extra primary care appointment slots, and other scarce resources—**health systems rely extensively on algorithms to identify patients who will benefit the most.**


**Convenience restriction**

* In practice, the modeling of health needs would use large data sets covering a wide range of diagnoses. In this tutorial, we will work with a [publicly available clinical dataset](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008) that focuses on _diabetic patients only_ ([Strack et al., 2014](https://www.hindawi.com/journals/bmri/2014/781670/)).

**Dataset and task**

* Clinical dataset of hospital re-admissions over a ten-year period (1999-2008) for diabetic patients across 130 different hospitals in the US. 

* Each record represents the hospital admission records for a patient diagnosed with diabetes whose stay lasted one to fourteen days.

* The features include demographics, diagnoses, diabetic medications, the number of visits in the year preceding the encounter, and payer information, as well as whether the patient was readmitted after release and whether the readmission occurred within 30 days of the release.

**Goal:** develop a classification model that decides whether patients should be suggested to their primary care physicians for enrollment in the high-risk care management program. A positive prediction means a recommendation for the care program.

**Decision point: Task definition**

* A hospital **readmission within 30 days** can be viewed as a proxy for the patient having needed more assistance at the time of release, so it will be the label we wish to predict.

* Because of the class imbalance, we will measure our performance via **balanced accuracy**. Another key performance consideration is how many patients are recommended for care, a metric we refer to as **selection rate**.
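
As a rough illustration of these two metrics, the sketch below computes them by hand with NumPy on ten invented patients. The arrays are made up for this example and are not drawn from the dataset.

```python
import numpy as np

# Invented labels and predictions for ten patients
# (1 = readmitted within 30 days / recommended for the program).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# Selection rate: share of patients the model recommends for the program.
selection_rate = y_pred.mean()

# Balanced accuracy: average of the recall on positives and the recall on negatives.
tpr = y_pred[y_true == 1].mean()          # true positive rate
tnr = (1 - y_pred[y_true == 0]).mean()    # true negative rate
balanced_accuracy = (tpr + tnr) / 2

print(selection_rate, tpr, tnr, balanced_accuracy)
```

Because each class contributes half of the score regardless of its size, balanced accuracy is not dominated by the much larger group of patients who are not readmitted.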

Ideally, health care professionals would be involved in both designing and using the model, including formalizing the task definition. 

**Fairness considerations**

* _Which groups are most likely to be disproportionately negatively affected?_ Previous work suggests that groups with different race and ethnicity can be differently affected.

* _What are the harms?_ The key harms here are allocation harms, in particular false negatives, i.e., failing to recommend somebody who will be readmitted.

* _How should we measure those harms?_


**Hands on code section**

In [None]:
# !pip install --upgrade fairlearn==0.7.0
# !pip install --upgrade scikit-learn
# !pip install --upgrade seaborn
# !pip install model-card-toolkit

In [None]:
%run -i process_health_data.py

From the perspective of fairness assessment, a key data characteristic is the sample size of groups with respect to which we conduct fairness assessment.

Small sample sizes have two implications:

* **assessment**: the impacts of the AI system on smaller groups are harder to assess, because with fewer data points our estimates have a much larger uncertainty (error bars)

* **model training**: fewer training data points mean that our model may fail to appropriately capture data patterns specific to smaller groups, which means that its predictive performance on these groups could be worse
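
To get a feel for the assessment point, the sketch below uses the textbook normal approximation for the confidence interval of a proportion. The helper `ci_half_width`, the assumed rate of 0.4, and the group sizes are all hypothetical choices made for illustration.

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of an approximate 95% normal confidence interval
    for a rate p estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical group sizes; the true rate is assumed to be 0.4 in every group.
for n in (50, 500, 5000):
    print(n, round(ci_half_width(0.4, n), 3))
```

The error bar shrinks only with the square root of the sample size, so a group that is one hundred times smaller has error bars roughly ten times wider.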

Let's examine the sample sizes of the groups according to `race`:

In [None]:
df["race"].value_counts(normalize=True).plot(kind='bar', rot=45);

In [None]:
df["gender"].value_counts(normalize=True).plot(kind='bar', rot=45); # frequencies

In our dataset, our patients are predominantly *Caucasian* (75%). The next largest racial group is *AfricanAmerican*, making up 19% of the patients. The remaining race categories (including *Unknown*) compose only 6% of the data.

Gender is in our case effectively binary (and we have no further information on how it was operationalized), with *Female* represented at 54% and *Male* at 46%. There are only 3 samples annotated as *Unknown/Invalid*.

**Exercise**

Please examine the distribution of the `age` feature in the dataset.

In [None]:
## Answer here

**Predictive validity**

We would like to show that our measurement `readmit_30_days` is correlated with patient characteristics that are related to our construct "benefiting from care management". One such characteristic is the general patient health, where we expect that patients that are less healthy are more likely to benefit from care management.

While our data does not contain full health records that would enable us to holistically measure general patient health, the data does contain two relevant features: `had_emergency` and `had_inpatient_days`, which indicate whether the patient spent any days in the emergency room or in the hospital (but non-emergency) in the preceding year.

To establish predictive validity, we would like to show that our measurement `readmit_30_days` is predictive of these two observable characteristics.

In [None]:
plot_pointplot(df, "race")
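
The helper `plot_pointplot` comes from `process_health_data.py` and its definition is not shown in this notebook. As an assumption about the kind of summary behind such a plot, the sketch below computes the rate of prior emergency visits for each combination of group and readmission status on a tiny invented table.

```python
import pandas as pd

# Tiny invented table; column names mirror the ones used in this notebook.
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "readmit_30_days": [0, 0, 1, 1, 0, 0, 1, 1],
    "had_emergency": [0, 0, 1, 0, 0, 1, 1, 1],
})

# Rate of prior emergency visits, split by group and by readmission status.
rates = toy.groupby(["race", "readmit_30_days"])["had_emergency"].mean().unstack()
print(rates)
```

If the rate in the readmitted column is higher than in the non-readmitted column for every group, the label behaves as expected with respect to this health indicator.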

The patients in the group *Unknown* have a substantially lower rate of emergency visits in the prior year, regardless of whether they are readmitted in 30 days. The readmission is still positively correlated with `had_emergency`, but note the large error bars (due to small sample sizes).

We also see that the group with feature value *AfricanAmerican* has a higher rate of emergency visits compared with other groups. However, generally the groups *Caucasian*, *AfricanAmerican* and *Other* follow similar dependence patterns.

Again, for *Unknown* the rate of (non-emergency) hospital visits in the previous year is lower than for other groups. In all groups there is a strong positive correlation between `readmit_30_days` and `had_inpatient_days`.



In all cases, we see that readmission in 30 days is predictive of our two measurements of general patient health.

The analysis also surfaces the fact that patients with the race value *Unknown* have fewer hospital visits in the preceding year (both emergency and non-emergency) than other groups. In practice, this would be a good reason to reach out to health professionals to investigate this patient cohort and make sure that we understand why this systematic difference exists.

Note that we have only investigated _predictive validity_, but there are other important aspects of construct validity which we may want to establish (see [Jacobs and Wallach, 2021](https://arxiv.org/abs/1912.05511)).

<a name="exercise-predictive-validity"></a>
### Exercise

Check the predictive validity with respect to `gender` and `age`. Do you see any differences? Can you form a hypothesis why?

In [None]:
# Answer here

In [None]:
df["readmit_30_days"].value_counts(normalize=True) # frequencies

As we can see, the target label is heavily skewed towards the patients not being readmitted within 30 days. In our dataset, only 11% of patients were readmitted within 30 days.

Since there are fewer positive examples, we expect that we will have a much larger uncertainty (error bars) in our estimates of *false negative rates* (FNR), compared with *false positive rates* (FPR). This means that there will be larger differences between training FNR and test FNR, even if there is no overfitting, simply because of the smaller sample sizes. 

Our target metric is *balanced error rate*, which is the average of FPR and FNR (equivalently, one minus balanced accuracy). The value of this metric is robust to different frequencies of positives and negatives. However, since half of the metric is contributed by FNR, we expect the uncertainty in balanced error values to behave similarly to the uncertainty of FNR.
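
The following sketch illustrates why plain accuracy is a poor guide under this kind of imbalance. It uses an invented test set of 100 patients with 11% positives and a trivial model that never predicts readmission.

```python
import numpy as np

# Invented test set with an 11% positive rate.
y_true = np.array([1] * 11 + [0] * 89)
y_pred = np.zeros_like(y_true)  # trivial model: never predicts readmission

accuracy = (y_true == y_pred).mean()
fnr = (y_pred[y_true == 1] == 0).mean()   # share of positives that are missed
fpr = (y_pred[y_true == 0] == 1).mean()   # share of negatives flagged by mistake
balanced_error = (fpr + fnr) / 2

print(accuracy, fnr, fpr, balanced_error)
```

The trivial model reaches 89% accuracy while missing every readmitted patient. Its balanced error rate of 0.5 correctly signals that it is no better than a coin flip.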

## Label imbalance

* some classification algorithms and performance measures might not work well with data sets with extreme class imbalance
* in binary classification settings, our ability to evaluate error is often driven by the size of the smaller of the two classes (again, the smaller the sample the larger the uncertainty in estimates)
* label imbalance may exacerbate the problems due to smaller group sizes in fairness assessment


In [None]:
# Now, let's examine how much the label frequencies vary within each group defined by `race`:

sns.barplot(x="readmit_30_days", y="race", data=df, ci=95);

We see the rate of *30-day readmission* is similar for the *AfricanAmerican* and *Caucasian* groups, but appears smaller for *Other* and smallest for *Unknown* (this is consistent with an overall lower rate of hospital visits in the prior year). The smaller sample size of the *Other* and *Unknown* groups means that there is more uncertainty around the estimates for these two groups.

## Prepare training and test datasets

As we mentioned in the task definition, our target variable is **readmission within 30 days**, and our sensitive feature for the purposes of fairness assessment is **race**.


In [None]:
[X_train, X_test, Y_train, Y_test, A_train, A_test, df_train, df_test, X] = train_model(df)

Our performance metric is **balanced accuracy**, so for the purposes of training (but not evaluation!) we will resample the data set so that it has the same number of positive and negative examples. This means that we can use estimators that optimize standard accuracy (although some estimators allow the use of importance weights instead).


In [None]:
X_train_bal, Y_train_bal, A_train_bal = resample_dataset(X_train, Y_train, A_train)
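
The helper `resample_dataset` is defined in `process_health_data.py` and is not shown here, so its exact strategy is unknown from this notebook alone. One possible way to balance the classes, sketched below under the assumption that positives are the minority, is to randomly downsample the negatives. The function `downsample_majority` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def downsample_majority(X, y, seed=0):
    """Return X, y with the negative class randomly downsampled
    to the size of the positive class (assumes positives are the minority)."""
    rng = np.random.default_rng(seed)
    pos_idx = y.index[y == 1]
    neg_idx = y.index[y == 0]
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = pos_idx.union(pd.Index(keep_neg))
    return X.loc[keep], y.loc[keep]

# Invented data: 3 positives and 17 negatives.
y_toy = pd.Series([1] * 3 + [0] * 17)
X_toy = pd.DataFrame({"feature": range(20)})
X_bal, y_bal = downsample_majority(X_toy, y_toy)
print(y_bal.value_counts())
```

Any sensitive feature column would need to be indexed with the same rows so that it stays aligned with the resampled features and labels.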

## Save descriptive statistics of training and test data

We next evaluate and save descriptive statistics of the training dataset. These will be provided as part of _model cards_ to document our training.

In [None]:
plot_descriptive_stats(A_train_bal, Y_train_bal, A_test, Y_test)

## Train the model

We train a logistic regression model and save its predictions on test data for analysis.

In [None]:
# Set up pipeline
unmitigated_pipeline = Pipeline(steps=[
    ("preprocessing", StandardScaler()),
    ("logistic_regression", LogisticRegression(max_iter=1000))
])

# Fit data
unmitigated_pipeline.fit(X_train_bal, Y_train_bal)

In [None]:
# Predict

Y_pred_proba = unmitigated_pipeline.predict_proba(X_test)[:,1]
Y_pred = unmitigated_pipeline.predict(X_test)

In [None]:
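# Plot ROC curve of probabilistic predictions
# (plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it)
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(unmitigated_pipeline, X_test, Y_test);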

In [None]:
# Show balanced accuracy rate of the 0/1 predictions
balanced_accuracy_score(Y_test, Y_pred)

As we see, the performance of the model is well above that of a coin flip (which would score 0.5 on both the area under the ROC curve and balanced accuracy), although it is quite far from a perfect classifier (which would score 1.0 on both).


## Inspect the coefficients of trained model

We check the coefficients of the fitted model to make sure that they "make sense". While subjective, this step is important: it helps catch mistakes and might point to some fairness issues. However, we will systematically assess the fairness of the model in the next section.

*Note that coefficients are also a proxy for "feature importance", but this interpretation can be misleading when features are highly correlated.*

In [None]:
coef_series = pd.Series(data=unmitigated_pipeline.named_steps["logistic_regression"].coef_[0], index=X.columns)
coef_series.sort_values().plot.barh(figsize=(4, 12), legend=False);
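
To see why coefficients can mislead when features are highly correlated, consider the sketch below. To stay dependency-light it swaps in ordinary least squares for logistic regression, and all data is simulated; the point it illustrates (correlated features share credit in a way the data cannot pin down) is an assumption carried over to the logistic case, not a result from this notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)      # near-duplicate of x1
y = 2.0 * x1 + 0.1 * rng.normal(size=n)  # only x1 drives the outcome

features = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(features, y, rcond=None)

print("individual coefficients:", coef)
print("sum of coefficients:", coef.sum())
```

The sum of the two coefficients is estimated reliably and stays close to the true effect of 2, but how that total is split between `x1` and `x2` is poorly determined, so reading either coefficient alone as "importance" would be unreliable.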

## Fairness assessment with `MetricFrame`

Fairlearn provides the data structure called `MetricFrame` to enable evaluation of disaggregated metrics. We will show how to use a `MetricFrame` object to assess the trained `LogisticRegression` classifier for potential fairness-related harms.




In [None]:
# In its simplest form MetricFrame takes four arguments:
#    metric_function with signature metric_function(y_true, y_pred)
#    y_true: array of labels
#    y_pred: array of predictions
#    sensitive_features: array of sensitive feature values

mf1 = MetricFrame(metrics=false_negative_rate,
                  y_true=Y_test,
                  y_pred=Y_pred,
                  sensitive_features=df_test['race'])

# The disaggregated metrics are stored in a pandas Series mf1.by_group:

mf1.by_group

In [None]:
# The largest difference, smallest ratio and worst-case performance are accessed as
#   mf1.difference(), mf1.ratio(), mf1.group_max()

print(f"difference: {mf1.difference():.3}\n"
      f"ratio: {mf1.ratio():.3}\n"
      f"max across groups: {mf1.group_max():.3}")
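
For intuition about what these numbers mean, the sketch below reproduces the same kind of summary with plain pandas on invented data. It is an illustration of the idea, not Fairlearn's implementation, and the helper `fnr` is hypothetical.

```python
import pandas as pd

# Invented labels, predictions and group membership for illustration.
toy = pd.DataFrame({
    "y_true": [1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    "y_pred": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "group":  ["A"] * 5 + ["B"] * 5,
})

def fnr(frame):
    """False negative rate: share of true positives predicted as negative."""
    positives = frame[frame["y_true"] == 1]
    return (positives["y_pred"] == 0).mean()

# Disaggregated metric: one value per group.
by_group = pd.Series({name: fnr(part) for name, part in toy.groupby("group")})

print(by_group)
print("difference:", by_group.max() - by_group.min())
print("ratio:", by_group.min() / by_group.max())
print("max across groups:", by_group.max())
```

Here the difference is the largest group value minus the smallest, and the ratio is the smallest divided by the largest, so a difference of 0 and a ratio of 1 would both indicate equal false negative rates across groups.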

In [None]:
# You can also evaluate multiple metrics by providing a dictionary

metrics_dict = {
    "selection_rate": selection_rate,
    "false_negative_rate": false_negative_rate,
    "balanced_accuracy": balanced_accuracy_score,
}

metricframe_unmitigated = MetricFrame(metrics=metrics_dict,
                  y_true=Y_test,
                  y_pred=Y_pred,
                  sensitive_features=df_test['race'])

# The disaggregated metrics are then stored in a pandas DataFrame:

metricframe_unmitigated.by_group

In [None]:
# The largest difference, smallest ratio, and the maximum and minimum values
# across the groups are then all pandas Series, for example:

metricframe_unmitigated.difference()

In [None]:
# You'll probably want to view them transposed:

pd.DataFrame({'difference': metricframe_unmitigated.difference(),
              'ratio': metricframe_unmitigated.ratio(),
              'group_min': metricframe_unmitigated.group_min(),
              'group_max': metricframe_unmitigated.group_max()}).T

In [None]:
# You can also easily plot all of the metrics using DataFrame plotting capabilities

metricframe_unmitigated.by_group.plot.bar(subplots=True, layout=[1, 3], figsize=(12, 4),
                                          legend=False, rot=-45, position=1.5);

According to the above bar chart, it seems that the group *Unknown* is selected for the care management program less often than other groups as reflected by the selection rate. Also this group experiences the largest false negative rate, so a larger fraction of group members that are likely to benefit from the care management program are not selected. Finally, the balanced accuracy on this group is also the lowest.


We observe disparities even though we did not include race in our model. There are a variety of reasons why such disparities may occur. They could be due to representational issues (i.e., not enough instances per group), or because the feature distribution itself differs across groups (i.e., a different relationship between features and target variable). An obvious example of the latter is people with darker skin in facial recognition systems, but the differences can be much more subtle. Real-world applications often exhibit both kinds of issues at the same time.

## Postprocessing with `ThresholdOptimizer`

**Postprocessing** techniques are a class of unfairness-mitigation algorithms that take an already trained model and a dataset as input and seek to fit a transformation function to the model's outputs to satisfy some (group) fairness constraint(s). They might be the only feasible unfairness mitigation approach when developers cannot influence the training of the model, whether for practical reasons or because of security or privacy constraints.

Here we use the `ThresholdOptimizer` algorithm from Fairlearn, which follows the approach of [Hardt, Price, and Srebro (2016)](https://arxiv.org/abs/1610.02413).

`ThresholdOptimizer` takes in an existing (possibly pre-fit) machine learning model whose predictions act as a scoring function and identifies a separate threshold for each group in order to optimize some specified objective metric (such as **balanced accuracy**) subject to specified fairness constraints (such as **false negative rate parity**). Thus, the resulting classifier is just a suitably thresholded version of the underlying machine learning model.

The constraint **false negative rate parity** requires that all the groups have equal values of false negative rate.
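
The idea of group-specific thresholds can be sketched in a few lines. This is a simplified, deterministic illustration and not Fairlearn's algorithm: the real `ThresholdOptimizer` also optimizes the chosen objective and may randomize between thresholds. The scores, labels, helper functions, and the target false negative rate of 0.25 are all invented for this example.

```python
import numpy as np

def fnr_at(threshold, scores, labels):
    """False negative rate when predicting positive for scores >= threshold."""
    positives = scores[labels == 1]
    return (positives < threshold).mean()

def pick_threshold(scores, labels, target_fnr):
    """Highest candidate threshold whose FNR does not exceed target_fnr."""
    candidates = np.sort(np.unique(scores))[::-1]
    for t in candidates:
        if fnr_at(t, scores, labels) <= target_fnr:
            return t
    return candidates[-1]

# Invented scores: the model is systematically less confident about group B.
scores = {"A": np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2]),
          "B": np.array([0.6, 0.5, 0.4, 0.3, 0.2, 0.1])}
labels = {"A": np.array([1, 1, 1, 1, 0, 0]),
          "B": np.array([1, 1, 1, 1, 0, 0])}

thresholds = {g: pick_threshold(scores[g], labels[g], target_fnr=0.25) for g in scores}
print(thresholds)
```

With a single shared threshold of 0.7, group A would have a false negative rate of 0.25 while every positive case in group B would be missed. Lowering the threshold for group B to 0.4 brings both groups to the same false negative rate, which also shows why the group membership must be known at prediction time.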



To instantiate our `ThresholdOptimizer`, we pass in:

*   An existing `estimator` that we wish to threshold. 
*   The fairness `constraints` we want to satisfy.
*   The `objective` metric we want to maximize.



In [None]:
# Now we instantiate ThresholdOptimizer with the logistic regression estimator
postprocess_est = ThresholdOptimizer(
    estimator=unmitigated_pipeline,
    constraints="false_negative_rate_parity",
    objective="balanced_accuracy_score",
    prefit=True,
    predict_method='predict_proba'
)

In order to use the `ThresholdOptimizer`, we need access to the sensitive features **both during training time and once it's deployed**.

In [None]:
postprocess_est.fit(X_train_bal, Y_train_bal, sensitive_features=A_train_bal)

In [None]:
# Record and evaluate the output of the trained ThresholdOptimizer on test data

Y_pred_postprocess = postprocess_est.predict(X_test, sensitive_features=A_test)
metricframe_postprocess = MetricFrame(
    metrics=metrics_dict,
    y_true=Y_test,
    y_pred=Y_pred_postprocess,
    sensitive_features=A_test
)

In [None]:
pd.concat([metricframe_unmitigated.by_group,
           metricframe_postprocess.by_group],
           keys=['Unmitigated', 'ThresholdOptimizer'],
           axis=1)

In [None]:
pd.concat([metricframe_unmitigated.difference(),
           metricframe_postprocess.difference()],
          keys=['Unmitigated: difference', 'ThresholdOptimizer: difference'],
          axis=1).T

In [None]:
metricframe_postprocess.by_group.plot.bar(subplots=True, layout=[1,3], figsize=(12, 4), legend=False, rot=-45, position=1.5)
postprocess_performance = figure_to_base64str(plt)

## Further reading and exercises:

You can learn more about Fairlearn at https://fairlearn.org/v0.7.0/about/index.html and browse its open-source code at https://github.com/fairlearn/fairlearn

Fairlearn is a community-based effort to improve ethics in AI. 

Curious to contribute? Visit https://fairlearn.org/v0.7.0/contributor_guide/index.html 

[1] Selbst, Andrew D. and Boyd, Danah and Friedler, Sorelle and Venkatasubramanian,
      Suresh and Vertesi, Janet, "Fairness and Abstraction in Sociotechnical Systems"
      (August 23, 2018). 2019 ACM Conference on Fairness, Accountability, and Transparency
      (FAT*), 59-68. Available at SSRN: <https://ssrn.com/abstract=3265913>