# Using Fairlearn with heart disease data

This notebook shows how to use `Fairlearn` and its visualization dashboard to understand a binary classification model. The model is trained on autogenerated heart disease data based on the **UCI Heart Disease Dataset**, which describes 303 individuals, and it predicts whether each individual is likely to have heart disease. You can find the UCI dataset at https://archive.ics.uci.edu/ml/datasets/Heart+Disease

For the purposes of this notebook, we will treat this as a classification problem. We will pretend that the label indicates whether or not each individual has heart disease. We will use the data to train a predictor that predicts whether or not previously unseen individuals will have heart disease. We assume that the model predictions are used to decide whether or not to continue with a treatment.

We will first train a fairness-unaware predictor and show that it leads to unfair decisions under a specific notion of fairness called *demographic parity*. We then mitigate the unfairness by applying the `GridSearch` algorithm from the `Fairlearn` package.
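To make demographic parity concrete, here is a minimal sketch in plain Python. It is not Fairlearn's implementation: the helper names `selection_rates` and `demographic_parity_difference` and the toy data are assumptions made for illustration. The sketch computes the fraction of positive predictions in each group and reports the largest gap between groups; a gap of 0 would mean that demographic parity holds exactly.

```python
from collections import defaultdict


def selection_rates(y_pred, groups):
    """Return the fraction of positive predictions for each group."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(y_pred, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Return the largest gap in selection rate between any two groups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy predictions: 1 = predicted heart disease, 0 = not.
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
sex = ["male", "male", "male", "male",
       "female", "female", "female", "female"]

rates = selection_rates(y_pred, sex)
gap = demographic_parity_difference(y_pred, sex)
print(rates)
print(gap)
```

In this toy data three of four males but only one of four females receive a positive prediction, so the predictor is far from demographic parity.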

In this notebook, you will also learn to use the Fairlearn open-source Python package with Azure Machine Learning to perform the following tasks:

1. Assess the fairness of your model predictions. To learn more about fairness in machine learning, see the fairness in machine learning article.

2. Upload, list and download fairness assessment insights to/from Azure Machine Learning studio.

3. See a fairness assessment dashboard in Azure Machine Learning studio to interact with your model(s)' fairness insights.

## Install the AzureML Fairness module

To upload our Fairlearn dashboard, we need to import the AzureML fairness library. Make sure that you have the latest version of the Azure ML SDK installed, and then install the fairness module. Run the following cell to do that:

## Load and preprocess the data set

For simplicity, we load the data set from a CSV file that contains the data in a cleaned format, and we upload it to the workspace's default datastore. We start by importing the various modules we're going to use:

In [None]:
import os
import joblib
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.pipeline import Pipeline
from fairlearn.reductions import GridSearch
from sklearn.compose import ColumnTransformer
from azureml.core.model import Model
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import DemographicParity, ErrorRate
from azureml.core import Workspace, Dataset, Datastore, Experiment
from fairlearn.metrics._group_metric_set import _create_group_metric_set
from interpret.ext.blackbox import KernelExplainer

sys.path.append(os.path.abspath("../utils"))
from workspace import get_workspace
from dataset import upload_dataset

### Initialize Workspace

In [None]:
ws = Workspace.from_config("../notebooks-settings/config.json")
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

### Get the Default datastore (Azure Blob storage)

In [None]:
def_blob_store = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(def_blob_store.name))

In [None]:
fairlearn_dataset = upload_dataset(ws, def_blob_store, 'complete_patients_dataset',
                                  'heart-disease/complete_patients_dataset.csv', 
                                  '../../dataset/complete_patients_dataset.csv',
                                  use_datadrift=False, type_dataset="Standard")

## Upload fairness insights for a single model

In [None]:
fairlearn_df = fairlearn_dataset.to_pandas_dataframe()
fairlearn_df

We are going to treat the sex of each individual as a protected attribute (where 0 indicates female and 1 indicates male). In this particular case, we separate this attribute, together with the other sensitive attributes (`pregnant`, `diabetic`, `asthmatic` and `smoker`), from the main data and drop them from it. We then perform some standard data preprocessing steps to convert the data into a format suitable for the ML algorithms.

In [None]:
X_raw = fairlearn_df.drop(['target', 'address', 'city', 'state','postalCode',
                            'name', 'ssn', 'observation'], axis=1)
Y = fairlearn_df['target']

In [None]:
A = X_raw[['sex', 'pregnant', 'diabetic', 'asthmatic', 'smoker']]
X = X_raw.drop(labels=['sex', 'pregnant', 'diabetic', 'asthmatic', 'smoker'],axis = 1)

Finally, we split the data into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split
# Split X (sensitive attributes already dropped), not X_raw, so the model never sees them.
X_train, X_test, Y_train, Y_test, A_train, A_test = train_test_split(X,
                                                    Y, 
                                                    A,
                                                    test_size = 0.3,
                                                    random_state=0,
                                                    stratify=Y)

X_train = X_train.reset_index(drop=True)
A_train = A_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)
A_test = A_test.reset_index(drop=True)

# Replace the numeric codes with readable labels for the dashboard.
# Assigning whole columns avoids pandas chained-assignment pitfalls.
label_maps = {
    'sex': {0: 'female', 1: 'male'},
    'pregnant': {0: 'not pregnant', 1: 'pregnant'},
    'diabetic': {0: 'not diabetic', 1: 'diabetic'},
    'asthmatic': {0: 'not asthmatic', 1: 'asthmatic'},
    'smoker': {0: 'not smoker', 1: 'smoker'},
}
for column, mapping in label_maps.items():
    A_test[column] = A_test[column].replace(mapping)

## Training a fairness-unaware predictor

To show the effect of `Fairlearn`, we will first train a standard ML predictor that does not incorporate fairness. For speed of demonstration, we use a simple logistic regression estimator from `sklearn`:

In [None]:
clf = Pipeline(steps=[('classifier', LogisticRegression(solver='liblinear', fit_intercept=True))])

In [None]:
model = clf.fit(X_train, Y_train)

We can load this predictor into the Fairness dashboard, and examine how it is unfair (there is a warning about AzureML since we are not yet integrated with that product):

In [None]:
from fairlearn.widget import FairlearnDashboard

y_pred = model.predict(X_test)

FairlearnDashboard(sensitive_features=A_test,
                   sensitive_feature_names=['sex', 'pregnant', 'diabetic', 'asthmatic', 'smoker'],
                   y_true=Y_test.tolist(),
                   y_pred=[y_pred.tolist()])

Looking at the disparity in accuracy, we see that males have an error rate about three times greater than that of females. More interesting is the disparity in opportunity: males receive a positive prediction at about three times the rate of females.

Despite the fact that we removed the feature from the training data, our predictor still discriminates based on sex. This demonstrates that simply ignoring a protected attribute when fitting a predictor rarely eliminates unfairness. There will generally be enough other features correlated with the removed attribute to lead to disparate impact.
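The effect of correlated features can be illustrated with a small, self-contained sketch. The records and the `proxy_feature` column below are hypothetical and not taken from the heart disease data; the point is only that a rule which never sees the protected attribute can still select the two groups at different rates.

```python
# Hypothetical toy records: (sex, proxy_feature). The proxy is not the
# protected attribute, but it takes the value 1 far more often for males.
records = [
    ("male", 1), ("male", 1), ("male", 1), ("male", 0),
    ("female", 0), ("female", 0), ("female", 0), ("female", 1),
]


def predict(proxy_feature):
    """A 'sex-blind' rule: it only looks at the proxy feature."""
    return 1 if proxy_feature == 1 else 0


def selection_rate(sex):
    """Fraction of individuals of the given sex predicted positive."""
    preds = [predict(proxy) for s, proxy in records if s == sex]
    return sum(preds) / len(preds)


male_rate = selection_rate("male")
female_rate = selection_rate("female")
print(male_rate, female_rate)
```

Although `predict` never receives the `sex` value, the two selection rates differ because the proxy carries information about group membership.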

## Mitigation with GridSearch

The `GridSearch` class in `Fairlearn` implements a simplified version of the exponentiated gradient reduction of [Agarwal et al. 2018](https://arxiv.org/abs/1803.02453). The user supplies a standard ML estimator, which is treated as a blackbox. `GridSearch` works by generating a sequence of relabellings and reweightings, and trains a predictor for each.

For this example, we specify demographic parity (on the protected attribute of sex) as the fairness metric. Demographic parity requires that individuals receive the positive decision (here, a positive prediction that leads to treatment) independent of membership in the protected class (i.e., females and males should receive positive predictions at the same rate). We are using this metric for the sake of simplicity; in general, the appropriate fairness metric will not be obvious.

In [None]:
sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),
                   constraints=DemographicParity(),
                   grid_size=70)

Our algorithms provide `fit()` and `predict()` methods, so they behave in a similar manner to other ML packages in Python. We do, however, have to supply two extra pieces of information: the column of protected attribute labels, passed to `fit()` as `sensitive_features`, and the number of predictors to generate in our sweep, set through `grid_size` when constructing `GridSearch`.

After `fit()` completes, we extract the full set of predictors from the `GridSearch` object.

In [None]:
sweep.fit(X_train, Y_train,
          sensitive_features=A_train.sex)

predictors = sweep._predictors

We could load these predictors into the Fairness dashboard now. However, the plot would be somewhat confusing due to their number. In this case, we are going to remove the predictors which are dominated in the error-disparity space by others from the sweep (note that the disparity will only be calculated for the protected attribute; other potentially protected attributes will not be mitigated). In general, one might not want to do this, since there may be other considerations beyond the strict optimisation of error and disparity (of the given protected attribute).

In [None]:
errors, disparities = [], []
for m in predictors:
    classifier = lambda X: m.predict(X)
    
    error = ErrorRate()
    error.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train.sex)
    disparity = DemographicParity()
    disparity.load_data(X_train, pd.Series(Y_train), sensitive_features=A_train.sex)
    
    errors.append(error.gamma(classifier)[0])
    disparities.append(disparity.gamma(classifier).max())
    
all_results = pd.DataFrame( {"predictor": predictors, "error": errors, "disparity": disparities})

all_models_dict = {"heart_disease_unmitigated": model}
dominant_models_dict = {"heart_disease_unmitigated": model}
base_name_format = "heart_disease_grid_model_{0}"

row_id = 0
for row in all_results.itertuples():
    model_name = base_name_format.format(row_id)
    all_models_dict[model_name] = row.predictor
    errors_for_lower_or_eq_disparity = all_results["error"][all_results["disparity"]<=row.disparity]
    if row.error <= errors_for_lower_or_eq_disparity.min():
        dominant_models_dict[model_name] = row.predictor
    row_id = row_id + 1
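The dominance rule used in the cell above can be isolated in a few lines. The following sketch applies the same criterion (keep a model only if no model with lower or equal disparity has a lower error) to hypothetical `(name, error, disparity)` triples; the function name `non_dominated` and the numbers are assumptions made for illustration.

```python
def non_dominated(results):
    """Keep models whose error is minimal among all models with
    disparity less than or equal to their own."""
    kept = []
    for name, error, disparity in results:
        best_error = min(e for _, e, d in results if d <= disparity)
        if error <= best_error:
            kept.append(name)
    return kept


# Hypothetical (name, error, disparity) triples.
results = [
    ("model_a", 0.20, 0.05),
    ("model_b", 0.15, 0.10),
    ("model_c", 0.25, 0.12),  # dominated by model_b: worse on both axes
    ("model_d", 0.10, 0.30),
]

kept = non_dominated(results)
print(kept)
```

Here `model_c` is dropped because `model_b` has both lower error and lower disparity, while the other three each represent a different trade-off.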

We can construct predictions for all the models, and also for the dominant models:

In [None]:
dashboard_all = dict()
models_all = dict()
for name, predictor in all_models_dict.items():
    value = predictor.predict(X_test)
    dashboard_all[name] = value
    models_all[name] = predictor
    
dominant_all = dict()
for n, p in dominant_models_dict.items():
    dominant_all[n] = p.predict(X_test)

We can see that `GridSearch` generated around 70 models, of which 23 are dominant, that is, no other model in the sweep achieves both lower error and lower disparity:

In [None]:
len(list(models_all.keys()))

In [None]:
len(list(dominant_all.keys()))

In [None]:
dashboard = FairlearnDashboard(sensitive_features=A_test, 
                   sensitive_feature_names=['sex', 'pregnant', 'diabetic', 'asthmatic', 'smoker'],
                   y_true=Y_test.tolist(),
                   y_pred=dominant_all)

We see a Pareto front forming - the set of predictors which represent optimal tradeoffs between accuracy and disparity in predictions. In the ideal case, we would have a predictor at (1,0) - perfectly accurate and without any unfairness under demographic parity (with respect to the protected attribute "sex"). The Pareto front represents the closest we can come to this ideal based on our data and choice of estimator. Note the range of the axes - the disparity axis covers more values than the accuracy axis, so we can reduce disparity substantially for a small loss in accuracy.

By clicking on individual models on the plot, we can inspect their metrics for disparity and accuracy in greater detail. In a real example, we would then pick the model which represented the best trade-off between accuracy and disparity given the relevant business constraints.
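One simple way to encode such a business constraint is to cap the acceptable disparity and take the most accurate model under that cap. The sketch below is only an illustration: the function `pick_model`, the threshold and the candidate values are hypothetical, and a real selection would usually weigh more considerations than these two numbers.

```python
def pick_model(candidates, max_disparity):
    """Return the name of the most accurate candidate whose disparity
    does not exceed max_disparity, or None if no candidate qualifies."""
    eligible = [c for c in candidates if c["disparity"] <= max_disparity]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["accuracy"])["name"]


# Hypothetical points on a Pareto front.
candidates = [
    {"name": "model_a", "accuracy": 0.80, "disparity": 0.05},
    {"name": "model_b", "accuracy": 0.85, "disparity": 0.10},
    {"name": "model_d", "accuracy": 0.90, "disparity": 0.30},
]

chosen = pick_model(candidates, max_disparity=0.15)
print(chosen)
```

With a cap of 0.15, `model_d` is excluded despite being the most accurate, and `model_b` is chosen over `model_a` because it is more accurate within the allowed disparity.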

# AzureML Integration

We will now go through a brief example of the AzureML integration.

In [None]:
os.makedirs('models', exist_ok=True)
def register_model(name, model, disparity=""):
    model_path = "models/{0}.pkl".format(name)
    joblib.dump(value=model, filename=model_path)
    registered_model = Model.register(model_path=model_path,
                  model_name=name,
                  workspace=ws,
                  tags={"disparity": f'{disparity}%'})
    return registered_model.id

Now, produce new predictions dictionaries, with the updated names:

In [None]:
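model_name_id_mapping = dict()
# Register the fitted predictors themselves; dominant_all only holds their predictions.
for name, predictor in dominant_models_dict.items():
    m_id = register_model(name, predictor)
    model_name_id_mapping[name] = m_id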

In [None]:
dominant_all_ids = dict()
# Use a fresh loop variable so that y_pred keeps the unmitigated model's predictions.
for name, predictions in dominant_all.items():
    dominant_all_ids[model_name_id_mapping[name]] = predictions

## Create group of metrics and visualize

### Precompute fairness metrics.

Create a dashboard dictionary using Fairlearn's `metrics` package. The `_create_group_metric_set` method has arguments similar to the Dashboard constructor, except that the sensitive features are passed as a dictionary (to ensure that names are available). We must also specify the type of prediction (binary classification in this case) when calling this method.

In [None]:
sf = {'sex': A_test.sex, 'pregnant': A_test.pregnant,
      'diabetic': A_test.diabetic, 'asthmatic': A_test.asthmatic}
sensitive_features = ['asthmatic', 'diabetic', 'pregnant', 'sex']

dash_dict_all = _create_group_metric_set(y_true=Y_test,
                                         predictions=dominant_all_ids,
                                         sensitive_features=sf,
                                         prediction_type='binary_classification')
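The cells that follow read per-model metrics out of the dashboard dictionary. As a rough guide, the sketch below builds a small mock dictionary with the layout those cells assume (`precomputedMetrics` indexed first by sensitive feature and then by model, plus a parallel `modelNames` list) and walks it. The mock values and the helper `summarize` are hypothetical, and the real dictionary contains more keys than are shown here.

```python
# Mock dictionary with assumed layout; all values are made up.
mock_dash_dict = {
    "modelNames": ["model_a:1", "model_b:1"],
    "precomputedMetrics": [
        # One list per sensitive feature, one dict per model.
        [
            {"accuracy_score": {"global": 0.80},
             "selection_rate": {"bins": [0.40, 0.45]}},
            {"accuracy_score": {"global": 0.85},
             "selection_rate": {"bins": [0.30, 0.50]}},
        ],
    ],
}


def summarize(dash_dict, feature_index):
    """Return (model name, accuracy, selection-rate gap) per model."""
    rows = []
    per_model = dash_dict["precomputedMetrics"][feature_index]
    for name, metrics in zip(dash_dict["modelNames"], per_model):
        bins = metrics["selection_rate"]["bins"]
        gap = round(max(bins) - min(bins), 6)
        rows.append((name, metrics["accuracy_score"]["global"], gap))
    return rows


summary = summarize(mock_dash_dict, feature_index=0)
for row in summary:
    print(row)
```

Using `max(bins) - min(bins)` generalizes the two-group difference to sensitive features with more than two values.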

In [None]:
def difference_selection_rate(selection_rate):
    return abs(selection_rate[0]-selection_rate[1])

In [None]:
def scatterplot(disparities, accuracy_scores, legend):    
    plt.figure(figsize=(12, 7), dpi=80)
    colors = np.random.rand(len(accuracy_scores),4)
    for accuracy, disparity, model_name, color in zip(accuracy_scores, disparities, legend, colors):
        plt.scatter(accuracy, disparity, c=[color], s=170, label=model_name, alpha=0.3)
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
    plt.title('Multi model view - Models Comparison')
    plt.xlabel("Accuracy")
    plt.ylabel("Disparity in predictions")
    plt.grid()
    plt.show()

In [None]:
def get_models_metrics(feature_models, disparities, accuracy_scores):
    disparities.append(difference_selection_rate(feature_models['selection_rate']['bins']))
    accuracy_scores.append(feature_models['accuracy_score']['global'])

In [None]:
def plot_multimodel_view_by_feature(feature, sensitive_features, dash_dict_all):
    disparities = []
    accuracy_scores = []
    # Metrics for the chosen feature: one entry per model.
    feature_metrics = dash_dict_all['precomputedMetrics'][sensitive_features.index(feature)]
    for feature_models in feature_metrics:
        get_models_metrics(feature_models, disparities, accuracy_scores)
    scatterplot(disparities, accuracy_scores, dash_dict_all['modelNames'])

In [None]:
plot_multimodel_view_by_feature('sex', sensitive_features, dash_dict_all)

## Registering Models

The fairness dashboard is designed to integrate with registered models, so we need to do this for the models we want in the Studio portal. The assumption is that the names of the models specified in the dashboard dictionary correspond to the `id`s (i.e. `<name>:<version>` pairs) of registered models in the workspace.

Next, for each sensitive feature, we register in the workspace the model with the lowest disparity. To do this, we save each model to a file and then register that file:

In [None]:
def build_models_metrics(tags, feature_models, feature):
    tags[feature]['disparity'].append(difference_selection_rate(feature_models['selection_rate']['bins']))

In [None]:
def upload_best_disparity_model_by_feature(dash_dict_all, dominant_all, sensitive_features):  
    tags = {}
    for i, feature in enumerate(sensitive_features):
        tags[feature] = {}
        tags[feature]['disparity'] = []
        list(map(lambda feature_models: build_models_metrics(tags, feature_models, feature), dash_dict_all[i]))
        model_info = tuple(dominant_all.items())[tags[feature]['disparity'].index(min(tags[feature]['disparity']))]
        register_model(f'{feature}', model_info[1], min(tags[feature]['disparity']))

In [None]:
# Pass the fitted predictors (same order as dominant_all), not the prediction arrays.
upload_best_disparity_model_by_feature(dash_dict_all['precomputedMetrics'], dominant_models_dict, sensitive_features)

## Uploading a dashboard

To upload a _dashboard dictionary_ created with Fairlearn's `metrics` package, we import our `contrib` package, which contains the routine to perform the upload:

In [None]:
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id

### Precompute fairness metrics for the unaware model.

We now create the same kind of dashboard dictionary for the fairness-unaware model alone. As before, the sensitive features are passed to `_create_group_metric_set` as a dictionary, and the type of prediction is binary classification.

In [None]:
sf = {'sex': A_test.sex, 'pregnant': A_test.pregnant,
      'diabetic': A_test.diabetic, 'asthmatic': A_test.asthmatic}

dash_dict_unaware_model = _create_group_metric_set(y_true=Y_test,
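                                         # Look up the unmitigated predictions by name rather than relying on y_pred.
                                         predictions={Model(ws, 'heart_disease_unmitigated').id: dominant_all['heart_disease_unmitigated']},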
                                         sensitive_features=sf,
                                         prediction_type='binary_classification')

Now we can create an Experiment, then a Run, and upload our dashboard to it:

In [None]:
def build_fairlearn_dashboard(dash_dict_all, experiment_name, dashboard_title):
    exp = Experiment(ws, experiment_name)
    print(exp)

    run = exp.start_logging()
    try:
        upload_id = upload_dashboard_dictionary(run,
                                                dash_dict_all,
                                                dashboard_name=dashboard_title)
        print("\nUploaded to id: {0}\n".format(upload_id))

        downloaded_dict = download_dashboard_by_upload_id(run, upload_id)


    finally:
        run.complete()

## Check the fairness dashboard from Azure Machine Learning service

If you complete the previous steps (uploading generated fairness insights to Azure Machine Learning), you can view the fairness dashboard in Azure Machine Learning studio. This dashboard is the same visualization dashboard provided in Fairlearn, enabling you to analyze the disparities among your sensitive feature's subgroups (e.g., male vs. female). Follow one of these paths to access the visualization dashboard in Azure Machine Learning studio:

#### Experiments pane (Preview)
1. Select Experiments in the left pane to see a list of experiments that you've run on Azure Machine Learning.
2. Select a particular experiment to view all the runs in that experiment.
3. Select a run, and then the Fairness tab to view the explanation visualization dashboard.

#### Models pane
1. If you registered your original model by following the previous steps, you can select Models in the left pane to view it.
2. Select a model, and then the Fairness tab to view the explanation visualization dashboard.

To learn more about the visualization dashboard and what it contains, please check out Fairlearn's user guide.

In [None]:
build_fairlearn_dashboard(dash_dict_unaware_model, "Fairlearn_Heart_Disease_insights",
                          "Fairness insights of Logistic Regression Classifier with heart-disease data")

![FairLearn](../../docs/fairlearn_2.png)

![FairLearn](../../docs/fairlearn.png)

In [None]:
build_fairlearn_dashboard(dash_dict_all, "Fairlearn_Heart_Disease_multiasset_Grid_Search",
                          "Upload MultiAsset from Grid Search with heart-disease data")

![FairLearn](../../docs/fairlearn_3.png)