## Detecting and Mitigating Unfairness in the Model
Machine learning models can perform unevenly across different groups of people. For example, a model that predicts the likelihood of diabetes might work well for some age groups but not for others, subjecting a subset of patients to unnecessary tests or depriving them of tests that would confirm a diabetes diagnosis.

## Train a Model
In addition to splitting the data into training and test sets of features and labels, you'll extract sensitive features that are used to define subpopulations of the data for which you want to compare fairness. In this case, you'll use the Age column to define two categories of patient: those over 50 years old, and those 50 or younger.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('data/diabetes.csv')

# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
X, y = data[features].values, data['Diabetic'].values

# Get sensitive features
S = data[['Age']].astype(int)
# Change value to represent age groups
S['Age'] = np.where(S.Age > 50, 'Over 50', '50 or younger')

# Split data into training set and test set
X_train, X_test, y_train, y_test, S_train, S_test = train_test_split(X, y, S, test_size=0.20, random_state=0, stratify=y)

# Train a classification model
print("Training model...")
diabetes_model = DecisionTreeClassifier().fit(X_train, y_train)

print("Model trained.")

In [1]:
# Use the selection_rate_group_summary function to return the selection rate
# (the proportion of positive predictions) for the overall population and for
# each age group in the Age sensitive feature.
# Use the group_summary function to calculate prediction performance based on
# three commonly used classification metrics (accuracy, precision, and recall)
# for the overall population and for each age group in the Age sensitive feature.
# Note that scikit-learn metric scores are used to calculate the performance values.
from fairlearn.metrics import selection_rate_group_summary
from fairlearn.metrics import group_summary
from sklearn.metrics import accuracy_score, recall_score, precision_score

# Get predictions for the withheld test data
y_hat = diabetes_model.predict(X_test)

# Get selection rates
selection_rates = selection_rate_group_summary(y_test, y_hat, sensitive_features=S_test['Age'])
print("Selection Rates\n",selection_rates)

# Get accuracy scores
accuracy_scores = group_summary(accuracy_score, y_test, y_hat, sensitive_features=S_test['Age'])
print("\nAccuracy\n",accuracy_scores)

# Get precision scores
precision_scores = group_summary(precision_score, y_test, y_hat, sensitive_features=S_test['Age'])
print("\nPrecision\n",precision_scores)

# Get recall scores
recall_scores = group_summary(recall_score, y_test, y_hat, sensitive_features=S_test['Age'])
print("\nRecall\n",recall_scores)

From these metrics, you should be able to see that a larger proportion of the older patients are predicted to be diabetic. Accuracy should be more or less equal for the two groups, but a closer inspection of precision and recall indicates some disparity in how well the model predicts for each age group.
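
To make these numbers less opaque, here is a minimal sketch of what a per-group summary computes. It uses small hand-made lists rather than the diabetes data, and the `group_metrics` helper is a hypothetical illustration, not a Fairlearn function.

```python
from collections import defaultdict

def group_metrics(y_true, y_pred, groups):
    """Return selection rate, accuracy, precision and recall for each group."""
    buckets = defaultdict(list)
    for t, p, g in zip(y_true, y_pred, groups):
        buckets[g].append((t, p))

    summary = {}
    for g, pairs in buckets.items():
        # Confusion-matrix counts for this group only
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        n = len(pairs)
        summary[g] = {
            'selection_rate': (tp + fp) / n,   # share of positive predictions
            'accuracy': (tp + tn) / n,
            'precision': tp / (tp + fp) if tp + fp else 0.0,
            'recall': tp / (tp + fn) if tp + fn else 0.0,
        }
    return summary

# Toy labels and predictions (made up for illustration)
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ['Over 50'] * 4 + ['50 or younger'] * 4

summary = group_metrics(y_true, y_pred, groups)
for group, metrics in summary.items():
    print(group, metrics)
```

In this toy data both groups have the same accuracy, yet recall and selection rate differ, which is the kind of pattern the per-group summaries are meant to expose.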

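In this scenario, consider recall. This metric indicates the proportion of actual positive cases that the model correctly identified; in other words, of all the patients who are actually diabetic, how many did the model find? The model does a better job of this for patients in the older age group than for younger patients.

It's often easier to compare metrics visually. To do this, you'll use the Fairlearn dashboard: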

1. Run the cell below.
2. When the widget is displayed, use the Get started link to start configuring your visualization.
3. Select the sensitive features you want to compare (in this case, there's only one: Age).
4. Select the model performance metric you want to compare (in this case, it's a binary classification model so the options are Accuracy, Balanced accuracy, Precision, and Recall). Start with Recall.
5. View the dashboard visualization, which shows:
   - Disparity in performance: how the selected performance metric compares for the subpopulations, including underprediction (false negatives) and overprediction (false positives).
   - Disparity in predictions: a comparison of the number of positive cases per subpopulation.
6. Edit the configuration to compare the predictions based on different performance metrics.
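
The two disparity views can be approximated by hand. The sketch below assumes that disparity is reported as the difference between the largest and smallest per-group value, and that underprediction and overprediction are normalized by the number of actual positives and negatives in each group; the dashboard may normalize differently. The function names and toy data are hypothetical.

```python
def rate(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0

def per_group_rates(y_true, y_pred, groups):
    """Selection, underprediction and overprediction rates for each group."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        positives = sum(t for t, _ in rows)
        negatives = len(rows) - positives
        false_neg = sum(1 for t, p in rows if t == 1 and p == 0)
        false_pos = sum(1 for t, p in rows if t == 0 and p == 1)
        out[g] = {
            'selection_rate': rate(sum(p for _, p in rows), len(rows)),
            'underprediction': rate(false_neg, positives),  # missed positives
            'overprediction': rate(false_pos, negatives),   # false alarms
        }
    return out

def disparity(rates, metric):
    """Gap between the best and worst group for one metric (assumed definition)."""
    values = [r[metric] for r in rates.values()]
    return max(values) - min(values)

# Toy labels and predictions (made up for illustration)
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ['Over 50'] * 4 + ['50 or younger'] * 4

rates = per_group_rates(y_true, y_pred, groups)
print(rates)
print('Disparity in predictions:', disparity(rates, 'selection_rate'))
```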

In [None]:
from fairlearn.widget import FairlearnDashboard

# View this model in Fairlearn's fairness dashboard, and see the disparities which appear:
FairlearnDashboard(sensitive_features=S_test, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test,
                   y_pred={"diabetes_model": diabetes_model.predict(X_test)})

Let's see what happens if we exclude the Age feature when training the model.

In [2]:
# Separate features and labels
ageless = features.copy()
ageless.remove('Age')
X2, y2 = data[ageless].values, data['Diabetic'].values

# Split data into training set and test set
X_train2, X_test2, y_train2, y_test2, S_train2, S_test2 = train_test_split(X2, y2, S, test_size=0.20, random_state=0, stratify=y2)

# Train a classification model
print("Training model...")
ageless_model = DecisionTreeClassifier().fit(X_train2, y_train2)
print("Model trained.")

# View this model in Fairlearn's fairness dashboard, and see the disparities which appear:
FairlearnDashboard(sensitive_features=S_test2, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test2,
                   y_pred={"ageless_diabetes_model": ageless_model.predict(X_test2)})

In this scenario, simply removing the Age feature slightly reduces the disparity in recall, but increases the disparity in precision and accuracy.
This underlines one of the key difficulties in applying fairness to machine learning models: you must be clear about what fairness means in a particular context, and optimize for that.

## Register the Model and Upload the Dashboard Data to Azure ML
It might be useful to register the model in your Azure Machine Learning workspace and create an experiment to record the dashboard data so you can track and share your fairness analysis.

In [None]:
from azureml.core import Workspace, Experiment, Model
import joblib
import os

# Load the Azure ML workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=diabetes_model, filename=model_file)

# Register the model
print('Registering model...')
registered_model = Model.register(model_path=model_file,
                                  model_name='diabetes_classifier',
                                  workspace=ws)
model_id = registered_model.id

print('Model registered.', model_id)

Now you can use the Fairlearn package to create binary classification group metric sets for one or more models, and use an Azure Machine Learning experiment to upload the metrics.

In [None]:
from fairlearn.metrics._group_metric_set import _create_group_metric_set
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id

# Create a dictionary of the model(s) you want to assess for fairness
sf = {'Age': S_test.Age}
ys_pred = {model_id: diabetes_model.predict(X_test)}
dash_dict = _create_group_metric_set(y_true=y_test,
                                    predictions=ys_pred,
                                    sensitive_features=sf,
                                    prediction_type='binary_classification')

exp = Experiment(ws, "Diabetes_Fairness")
print(exp)

run = exp.start_logging()
# Upload the dashboard to Azure Machine Learning
try:
    dashboard_title = "Fairness insights of Diabetes Classifier"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))

    # To test the dashboard, you can download it
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
    print(downloaded_dict)
    
finally:
    run.complete()
    
from azureml.widgets import RunDetails

RunDetails(run).show()

## Mitigate Unfairness for the Model
Now that you've analyzed the model for fairness, you can use any of the mitigation techniques supported by the Fairlearn package to find a model that achieves the best balance of predictive performance and fairness.

In this exercise, you'll use the GridSearch feature, which trains multiple models in an attempt to minimize the disparity of predictive performance for the sensitive features in the dataset (in this case, the age groups). You'll optimize the models by applying the EqualizedOdds parity constraint, which tries to ensure that models exhibit similar true and false positive rates for each sensitive feature grouping.
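
To illustrate what "similar true and false positive rates" means, the following sketch measures how far a set of predictions is from equalized odds. It does not reimplement GridSearch; it only computes, for toy data, the larger of the between-group gaps in true positive rate and false positive rate. The helper names and the data are assumptions made for illustration.

```python
def rates_by_group(y_true, y_pred, groups):
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    result = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        tn = sum(1 for t, p in rows if t == 0 and p == 0)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        result[g] = (tpr, fpr)
    return result

def equalized_odds_gap(y_true, y_pred, groups):
    """Larger of the between-group gaps in TPR and FPR (0.0 means parity)."""
    rates = rates_by_group(y_true, y_pred, groups)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Toy data (made up): one set of true labels, two candidate models
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ['Over 50'] * 4 + ['50 or younger'] * 4
unmitigated_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # better recall for 'Over 50'
mitigated_pred = [1, 0, 0, 0, 1, 0, 0, 0]    # same rates in both groups

gap_unmitigated = equalized_odds_gap(y_true, unmitigated_pred, groups)
gap_mitigated = equalized_odds_gap(y_true, mitigated_pred, groups)
print(gap_unmitigated, gap_mitigated)
```

In this toy case the second set of predictions closes the gap entirely, but only by finding fewer of the actual positives in the older group, which hints at the trade-off involved in mitigation.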

In [None]:
from fairlearn.reductions import GridSearch, EqualizedOdds
import joblib
import os
print('Finding mitigated models...')

# Train multiple models
sweep = GridSearch(DecisionTreeClassifier(),
                   constraints=EqualizedOdds(),
                   grid_size=20)

sweep.fit(X_train, y_train, sensitive_features=S_train.Age)
models = sweep._predictors

# Save the models and get predictions from them (plus the original unmitigated one for comparison)
model_dir = 'mitigated_models'
os.makedirs(model_dir, exist_ok=True)
model_name = 'diabetes_unmitigated'
print(model_name)
joblib.dump(value=diabetes_model, filename=os.path.join(model_dir, '{0}.pkl'.format(model_name)))
predictions = {model_name: diabetes_model.predict(X_test)}
i = 0
for model in models:
    i += 1
    model_name = 'diabetes_mitigated_{0}'.format(i)
    print(model_name)
    joblib.dump(value=model, filename=os.path.join(model_dir, '{0}.pkl'.format(model_name)))
    predictions[model_name] = model.predict(X_test)

Now you can use the Fairlearn dashboard to compare the mitigated models. The following code visualizes recall by age group so that you can measure the disparity.

In [None]:
FairlearnDashboard(sensitive_features=S_test, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test,
                   y_pred=predictions)

The models are shown on a scatter plot. You can compare the models by measuring the disparity in predictions (in other words, the selection rate) or the disparity in the selected performance metric (in this case, recall). In this scenario, we expect disparity in selection rates (because we know that age is a factor in diabetes, with more positive cases in the older age group). What we're interested in is the disparity in predictive performance, so select the option to measure Disparity in recall.

The chart shows clusters of models with the overall recall metric on the X axis, and the disparity in recall on the Y axis. Therefore, the ideal model (with high recall and low disparity) would be at the bottom right corner of the plot. You can choose the right balance of predictive performance and fairness for your particular needs, and select an appropriate model to see its details.

An important point to reinforce is that applying fairness to a model is a trade-off between overall predictive performance and disparity across sensitive feature groups: in general, you must sacrifice some overall predictive performance to ensure that the model predicts fairly for all segments of the population.
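
One way to reason about this trade-off outside the dashboard is to treat each model as a point (overall recall, disparity in recall) and keep only the candidates that no other model beats on both axes. The sketch below is a hypothetical illustration: the helper functions are not part of Fairlearn, and the numbers are invented, not results from this exercise.

```python
def pareto_front(models):
    """Names of models that no other model beats on both recall and disparity."""
    front = []
    for name, (recall, disp) in models.items():
        dominated = any(
            r >= recall and d <= disp and (r > recall or d < disp)
            for other, (r, d) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

def best_under_threshold(models, max_disparity):
    """Highest-recall model whose disparity stays within the given limit."""
    eligible = {n: v for n, v in models.items() if v[1] <= max_disparity}
    if not eligible:
        return None
    return max(eligible, key=lambda n: eligible[n][0])

# Invented (overall recall, disparity in recall) values for illustration only
candidates = {
    'diabetes_unmitigated': (0.85, 0.20),
    'diabetes_mitigated_1': (0.80, 0.08),
    'diabetes_mitigated_2': (0.78, 0.12),
    'diabetes_mitigated_3': (0.70, 0.02),
}

front = pareto_front(candidates)
choice = best_under_threshold(candidates, max_disparity=0.10)
print(front)
print(choice)
```

Which point on the front is "best" remains a judgment call; the disparity limit of 0.10 used here is an arbitrary example, and the right limit depends on the context in which the model is used.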

## Uploading the Mitigation Dashboard Metrics to Azure ML

In [None]:
# Register the models
registered_model_predictions = dict()
for model_name, prediction_data in predictions.items():
    model_file = os.path.join(model_dir, model_name + ".pkl")
    registered_model = Model.register(model_path=model_file,
                                      model_name=model_name,
                                      workspace=ws)
    registered_model_predictions[registered_model.id] = prediction_data

# Create a group metric set for binary classification based on the Age feature for all of the models
sf = {'Age': S_test.Age}
dash_dict = _create_group_metric_set(y_true=y_test,
                                     predictions=registered_model_predictions,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')

exp = Experiment(ws, "Diabetes_Fairness_Mitigation")
print(exp)

run = exp.start_logging()
RunDetails(run).show()

# Upload the dashboard to Azure Machine Learning
try:
    dashboard_title = "Fairness Comparison of Diabetes Models"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))
finally:
    run.complete()