# Fairness on COMPAS using Inferred Attributes and AutoGluon


We demonstrate how to enforce a wide range of fairness definitions on the COMPAS dataset. This dataset records parolees caught violating the terms of their parole. Because it measures who was caught, it is strongly influenced by policing and environmental biases, and should not be confused with a measurement of who actually violated the terms of their parole. See [this paper](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/file/92cc227532d17e56e07902b254dfad10-Paper-round1.pdf) for a discussion of its limitations and caveats.

We use it because it is a standard fairness dataset, and it captures such strong differences in outcome between people identified as African-American and everyone else that classifiers trained on it violate most definitions of fairness.

As many of the ethnic groups are too small for reliable statistical estimation, we only consider differences in outcomes between African-Americans and everyone else (labeled as Other).
We load and preprocess the COMPAS dataset, splitting it into train, validation, and test partitions holding roughly 30%, 35%, and 35% of the rows:


In [1]:
import pandas as pd
import numpy as np
from autogluon.tabular import TabularPredictor
from oxonfair import FairPredictor, inferred_attribute_builder
from oxonfair.utils import group_metrics as gm
all_data = pd.read_csv('https://github.com/propublica/compas-analysis/raw/master/compas-scores-two-years.csv')
condensed_data=all_data[['sex','race','age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count', 'age_cat', 'c_charge_degree','two_year_recid']].copy()
condensed_data.replace({'Caucasian':'Other', 'Hispanic':'Other', 'Native American':'Other', 'Asian':'Other'},inplace=True)
train=condensed_data.sample(frac=0.3, random_state=0)
val_and_test=condensed_data.drop(train.index)
val=val_and_test.sample(frac=0.5, random_state=0)
test=val_and_test.drop(val.index)



To enforce fairness constraints without access to protected attributes at test time, we train two classifiers on the remaining features: one predicts two-year recidivism and the other infers ethnicity.

In [2]:
predictor2, protected = inferred_attribute_builder(train, 'two_year_recid', 'race', time_limit=5)

No path specified. Models will be saved in: "AutogluonModels/ag-20240617_142348"


No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.


Beginning AutoGluon training ... Time limit = 5s


AutoGluon will save models to "AutogluonModels/ag-20240617_142348"


AutoGluon Version:  1.1.0
Python Version:     3.10.13
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
CPU Count:          10
Memory Avail:       8.22 GB / 16.00 GB (51.4%)
Disk Space Avail:   363.57 GB / 460.43 GB (79.0%)


Train Data Rows:    2164


Train Data Columns: 8


Label Column:       two_year_recid


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).


	2 unique label values:  [0, 1]


	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])


Problem Type:       binary


Preprocessing data ...


Selected class <--> label mapping:  class 1 = 1, class 0 = 0


Using Feature Generators to preprocess the data ...


Fitting AutoMLPipelineFeatureGenerator...


	Available Memory:                    8421.78 MB


	Train Data (Original)  Memory Usage: 0.47 MB (0.0% of available memory)


	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.


	Stage 1 Generators:


		Fitting AsTypeFeatureGenerator...


			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.


	Stage 2 Generators:


		Fitting FillNaFeatureGenerator...


	Stage 3 Generators:


		Fitting IdentityFeatureGenerator...


		Fitting CategoryFeatureGenerator...


			Fitting CategoryMemoryMinimizeFeatureGenerator...


	Stage 4 Generators:


		Fitting DropUniqueFeatureGenerator...


	Stage 5 Generators:


		Fitting DropDuplicatesFeatureGenerator...


	Types of features in original data (raw dtype, special dtypes):


		('int', [])    : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('object', []) : 3 | ['sex', 'age_cat', 'c_charge_degree']


	Types of features in processed data (raw dtype, special dtypes):


		('category', [])  : 1 | ['age_cat']


		('int', [])       : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('int', ['bool']) : 2 | ['sex', 'c_charge_degree']


	0.0s = Fit runtime


	8 features in original data used to generate 8 features in processed data.


	Train Data (Processed) Memory Usage: 0.09 MB (0.0% of available memory)


Data preprocessing and feature engineering runtime = 0.03s ...


AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'


	To change this, specify the eval_metric parameter of Predictor()


Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1731, Val Rows: 433


User-specified model hyperparameters to be fit:
{
	'NN_TORCH': {},
	'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
	'CAT': {},
	'XGB': {},
	'FASTAI': {},
	'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}


Fitting 13 L1 models ...


Fitting model: KNeighborsUnif ... Training model for up to 4.97s of the 4.97s of remaining time.


	0.6351	 = Validation score   (accuracy)


	1.26s	 = Training   runtime


	0.03s	 = Validation runtime


Fitting model: KNeighborsDist ... Training model for up to 3.67s of the 3.67s of remaining time.


	0.6259	 = Validation score   (accuracy)


	0.0s	 = Training   runtime


	0.01s	 = Validation runtime


Fitting model: LightGBMXT ... Training model for up to 3.65s of the 3.65s of remaining time.





	0.6882	 = Validation score   (accuracy)


	3.9s	 = Training   runtime


	0.0s	 = Validation runtime


Fitting model: WeightedEnsemble_L2 ... Training model for up to 4.97s of the -0.3s of remaining time.


	Ensemble Weights: {'LightGBMXT': 1.0}


	0.6882	 = Validation score   (accuracy)


	0.01s	 = Training   runtime


	0.0s	 = Validation runtime


AutoGluon training complete, total runtime = 5.34s ... Best model: "WeightedEnsemble_L2"


TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20240617_142348")


No path specified. Models will be saved in: "AutogluonModels/ag-20240617_142353-001"


No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.


Beginning AutoGluon training ... Time limit = 5s


AutoGluon will save models to "AutogluonModels/ag-20240617_142353-001"


AutoGluon Version:  1.1.0
Python Version:     3.10.13
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
CPU Count:          10
Memory Avail:       8.10 GB / 16.00 GB (50.6%)
Disk Space Avail:   363.56 GB / 460.43 GB (79.0%)


Train Data Rows:    2164


Train Data Columns: 8


Label Column:       race


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).


	2 unique label values:  ['African-American', 'Other']


	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])


Problem Type:       binary


Preprocessing data ...


Selected class <--> label mapping:  class 1 = Other, class 0 = African-American


	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive (Other) vs negative (African-American) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.


Using Feature Generators to preprocess the data ...


Fitting AutoMLPipelineFeatureGenerator...


	Available Memory:                    8296.98 MB


	Train Data (Original)  Memory Usage: 0.47 MB (0.0% of available memory)


	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.


	Stage 1 Generators:


		Fitting AsTypeFeatureGenerator...


			Note: Converting 2 features to boolean dtype as they only contain 2 unique values.


	Stage 2 Generators:


		Fitting FillNaFeatureGenerator...


	Stage 3 Generators:


		Fitting IdentityFeatureGenerator...


		Fitting CategoryFeatureGenerator...


			Fitting CategoryMemoryMinimizeFeatureGenerator...


	Stage 4 Generators:


		Fitting DropUniqueFeatureGenerator...


	Stage 5 Generators:


		Fitting DropDuplicatesFeatureGenerator...


	Types of features in original data (raw dtype, special dtypes):


		('int', [])    : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('object', []) : 3 | ['sex', 'age_cat', 'c_charge_degree']


	Types of features in processed data (raw dtype, special dtypes):


		('category', [])  : 1 | ['age_cat']


		('int', [])       : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('int', ['bool']) : 2 | ['sex', 'c_charge_degree']


	0.0s = Fit runtime


	8 features in original data used to generate 8 features in processed data.


	Train Data (Processed) Memory Usage: 0.09 MB (0.0% of available memory)


Data preprocessing and feature engineering runtime = 0.05s ...


AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'


	To change this, specify the eval_metric parameter of Predictor()


Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1731, Val Rows: 433


User-specified model hyperparameters to be fit:
{
	'NN_TORCH': {},
	'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
	'CAT': {},
	'XGB': {},
	'FASTAI': {},
	'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}


Fitting 13 L1 models ...


Fitting model: KNeighborsUnif ... Training model for up to 4.95s of the 4.95s of remaining time.


	0.6051	 = Validation score   (accuracy)


	0.0s	 = Training   runtime


	0.02s	 = Validation runtime


Fitting model: KNeighborsDist ... Training model for up to 4.93s of the 4.92s of remaining time.


	0.5889	 = Validation score   (accuracy)


	0.0s	 = Training   runtime


	0.02s	 = Validation runtime


Fitting model: LightGBMXT ... Training model for up to 4.9s of the 4.9s of remaining time.


	Ran out of time, early stopping on iteration 583. Best iteration is:
	[360]	valid_set's binary_error: 0.344111


	0.6559	 = Validation score   (accuracy)


	4.92s	 = Training   runtime


	0.01s	 = Validation runtime


Fitting model: WeightedEnsemble_L2 ... Training model for up to 4.95s of the -0.13s of remaining time.


	Ensemble Weights: {'LightGBMXT': 0.8, 'KNeighborsUnif': 0.2}


	0.6582	 = Validation score   (accuracy)


	0.02s	 = Training   runtime


	0.0s	 = Validation runtime


AutoGluon training complete, total runtime = 5.2s ... Best model: "WeightedEnsemble_L2"


TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20240617_142353-001")


From these, we can train a single predictor that maximizes accuracy while reducing the demographic parity violation to less than 2.5% by running:

In [3]:
fpredictor=FairPredictor(predictor2, val, 'race', inferred_groups=protected)
fpredictor.fit(gm.accuracy, gm.demographic_parity, 0.025)
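
To make the 2.5% target concrete, here is a minimal plain-Python sketch of a demographic parity violation for two groups, assuming it is measured as the absolute difference in positive prediction rates. The function names and the toy data are illustrative and are not part of `oxonfair`, whose implementation may differ in detail.

```python
def positive_rate(predictions):
    """Fraction of binary predictions that are positive (equal to 1)."""
    return sum(predictions) / len(predictions)


def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive prediction rate between two groups.

    Illustrative helper (an assumption, not the oxonfair implementation).
    """
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    rates = [
        positive_rate([p for p, g in zip(predictions, groups) if g == label])
        for label in labels
    ]
    return abs(rates[0] - rates[1])


# Toy data: 3 of 4 positive predictions in one group, 1 of 4 in the other.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["African-American"] * 4 + ["Other"] * 4

gap = demographic_parity_gap(predictions, groups)
print(gap)  # 0.5
```

Under this reading, a threshold of 0.025 asks that the two groups' positive prediction rates differ by at most 2.5 percentage points on the validation set.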

Now we will show how a family of fairness measures can be individually optimized. First, we consider the measures of SageMaker Clarify.

The following code prints a table showing how accuracy and each fairness measure change on a held-out test set when we constrain the measure to less than 0.025 on the validation set. The exception is disparate impact, which we instead raise to above 0.975.
We define a helper function for evaluation:

In [4]:
def evaluate(fpredictor, use_metrics):
    "Print a table showing the accuracy drop that comes with enforcing fairness"
    extra_metrics = {**use_metrics, 'accuracy': gm.accuracy}
    collect = pd.DataFrame(columns=['Measure (original)', 'Measure (updated)', 'Accuracy (original)', 'Accuracy (updated)'])
    for key, metric in use_metrics.items():
        # Lower-is-better measures are pushed below 0.025; the others are raised above 0.975.
        if metric.greater_is_better is False:
            fpredictor.fit(gm.accuracy, metric, 0.025)
        else:
            fpredictor.fit(gm.accuracy, metric, 1-0.025)
        # Score on the held-out test set and record the measure and accuracy before and after.
        tmp = fpredictor.evaluate_fairness(test, metrics=extra_metrics, verbose=False)
        collect.loc[metric.name] = np.concatenate((np.asarray(tmp.loc[key]), np.asarray(tmp.loc['accuracy'])), 0)
    print(collect.to_markdown())
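
The helper treats disparate impact differently because it is a ratio, where values near 1 indicate parity, rather than a difference, where values near 0 do. The sketch below illustrates this in plain Python, assuming disparate impact is the ratio of the lower group positive rate to the higher one. The function names are hypothetical, and the exact `oxonfair` definition may differ.

```python
def group_positive_rates(predictions, groups):
    """Positive prediction rate for each group label."""
    rates = {}
    for label in set(groups):
        members = [p for p, g in zip(predictions, groups) if g == label]
        rates[label] = sum(members) / len(members)
    return rates


def disparate_impact_ratio(predictions, groups):
    """Lowest group positive rate divided by the highest (1.0 means parity)."""
    rates = group_positive_rates(predictions, groups).values()
    highest = max(rates)
    # If no group receives positive predictions, treat the groups as equal.
    return min(rates) / highest if highest > 0 else 1.0


def satisfies(value, greater_is_better, tolerance=0.025):
    """Mirror the helper's rule: below 0.025, or above 0.975 for ratio measures."""
    if greater_is_better:
        return value >= 1 - tolerance
    return value <= tolerance


predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["African-American"] * 4 + ["Other"] * 4

ratio = disparate_impact_ratio(predictions, groups)  # 0.25 / 0.75
print(round(ratio, 3), satisfies(ratio, greater_is_better=True))
```

With this convention, a difference-style measure of 0.01 and a ratio-style measure of 0.985 both count as meeting their targets.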

We can now contrast the behavior of a fair classifier that relies on access to the protected attribute at test time with one that infers it.

In [5]:
# We first create a classifier that uses the protected attribute
predictor=TabularPredictor(label='two_year_recid').fit(train_data=train,time_limit=5)
fpredictor = FairPredictor(predictor, val, 'race')
evaluate(fpredictor, gm.clarify_metrics)


No path specified. Models will be saved in: "AutogluonModels/ag-20240617_142359"


No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets.
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='best_quality'   : Maximize accuracy. Default time_limit=3600.
	presets='high_quality'   : Strong accuracy with fast inference speed. Default time_limit=3600.
	presets='good_quality'   : Good accuracy with very fast inference speed. Default time_limit=3600.
	presets='medium_quality' : Fast training time, ideal for initial prototyping.


Beginning AutoGluon training ... Time limit = 5s


AutoGluon will save models to "AutogluonModels/ag-20240617_142359"


AutoGluon Version:  1.1.0
Python Version:     3.10.13
Operating System:   Darwin
Platform Machine:   arm64
Platform Version:   Darwin Kernel Version 23.5.0: Wed May  1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020
CPU Count:          10
Memory Avail:       5.94 GB / 16.00 GB (37.1%)
Disk Space Avail:   363.52 GB / 460.43 GB (79.0%)


Train Data Rows:    2164


Train Data Columns: 9


Label Column:       two_year_recid


AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).


	2 unique label values:  [0, 1]


	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])


Problem Type:       binary


Preprocessing data ...


Selected class <--> label mapping:  class 1 = 1, class 0 = 0


Using Feature Generators to preprocess the data ...


Fitting AutoMLPipelineFeatureGenerator...


	Available Memory:                    6083.89 MB


	Train Data (Original)  Memory Usage: 0.61 MB (0.0% of available memory)


	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.


	Stage 1 Generators:


		Fitting AsTypeFeatureGenerator...


			Note: Converting 3 features to boolean dtype as they only contain 2 unique values.


	Stage 2 Generators:


		Fitting FillNaFeatureGenerator...


	Stage 3 Generators:


		Fitting IdentityFeatureGenerator...


		Fitting CategoryFeatureGenerator...


			Fitting CategoryMemoryMinimizeFeatureGenerator...


	Stage 4 Generators:


		Fitting DropUniqueFeatureGenerator...


	Stage 5 Generators:


		Fitting DropDuplicatesFeatureGenerator...


	Types of features in original data (raw dtype, special dtypes):


		('int', [])    : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('object', []) : 4 | ['sex', 'race', 'age_cat', 'c_charge_degree']


	Types of features in processed data (raw dtype, special dtypes):


		('category', [])  : 1 | ['age_cat']


		('int', [])       : 5 | ['age', 'juv_fel_count', 'juv_misd_count', 'juv_other_count', 'priors_count']


		('int', ['bool']) : 3 | ['sex', 'race', 'c_charge_degree']


	0.1s = Fit runtime


	9 features in original data used to generate 9 features in processed data.


	Train Data (Processed) Memory Usage: 0.09 MB (0.0% of available memory)


Data preprocessing and feature engineering runtime = 0.07s ...


AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'


	To change this, specify the eval_metric parameter of Predictor()


Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1731, Val Rows: 433


User-specified model hyperparameters to be fit:
{
	'NN_TORCH': {},
	'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
	'CAT': {},
	'XGB': {},
	'FASTAI': {},
	'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
	'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}


Fitting 13 L1 models ...


Fitting model: KNeighborsUnif ... Training model for up to 4.93s of the 4.93s of remaining time.


	0.6351	 = Validation score   (accuracy)


	0.0s	 = Training   runtime


	0.02s	 = Validation runtime


Fitting model: KNeighborsDist ... Training model for up to 4.87s of the 4.87s of remaining time.


	0.6259	 = Validation score   (accuracy)


	0.01s	 = Training   runtime


	0.02s	 = Validation runtime


Fitting model: LightGBMXT ... Training model for up to 4.81s of the 4.81s of remaining time.


	0.6975	 = Validation score   (accuracy)


	4.78s	 = Training   runtime


	0.0s	 = Validation runtime


Fitting model: WeightedEnsemble_L2 ... Training model for up to 4.93s of the -0.03s of remaining time.


	Ensemble Weights: {'LightGBMXT': 1.0}


	0.6975	 = Validation score   (accuracy)


	0.02s	 = Training   runtime


	0.0s	 = Validation runtime


AutoGluon training complete, total runtime = 5.12s ... Best model: "WeightedEnsemble_L2"


TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20240617_142359")


|                                                         |   Measure (original) |   Measure (updated) |   Accuracy (original) |   Accuracy (updated) |
|:--------------------------------------------------------|---------------------:|--------------------:|----------------------:|---------------------:|
| Demographic Parity                                      |          0.33658     |          0.0102499  |              0.661386 |             0.646337 |
| Disparate Impact                                        |          0.424223    |          0.985274   |              0.661386 |             0.638812 |
| Average Group Difference in Conditional Acceptance Rate |          0.644987    |          0.055467   |              0.661386 |             0.670891 |
| Average Group Difference in Conditional Rejectance Rate |          0.354906    |          0.00356274 |              0.661386 |             0.666535 |
| Average Group Difference in Accuracy                    |          0.0484221   |      

In contrast, when using inferred attributes we see a much greater drop in accuracy as fairness is enforced, even though the base classifiers have similar accuracy. This is consistent with [Lipton et al.](https://arxiv.org/pdf/1711.07076.pdf) (N.B. the base classifier is not directly trained to maximize accuracy, which is why it can have higher accuracy when it doesn't use race.)


In [6]:
# Now using the inferred attributes
fpredictor2 = FairPredictor(predictor2, val, 'race', inferred_groups=protected)
evaluate(fpredictor2, gm.clarify_metrics)

|                                                         |   Measure (original) |   Measure (updated) |   Accuracy (original) |   Accuracy (updated) |
|:--------------------------------------------------------|---------------------:|--------------------:|----------------------:|---------------------:|
| Demographic Parity                                      |            0.229454  |         0.0231172   |              0.667327 |             0.570297 |
| Disparate Impact                                        |            0.572729  |         0.98373     |              0.667327 |             0.56396  |
| Average Group Difference in Conditional Acceptance Rate |            0.275048  |         0.000524707 |              0.667327 |             0.610693 |
| Average Group Difference in Conditional Rejectance Rate |            0.161647  |         0.00165289  |              0.667327 |             0.653069 |
| Average Group Difference in Accuracy                    |            0.0398999 |      
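
To quantify the difference between the two settings, the short snippet below computes the accuracy drop for the demographic parity and disparate impact rows. The accuracy values are copied from the two tables above; the dictionary layout and the helper name are only illustrative.

```python
# (original accuracy, updated accuracy) copied from the two tables above.
results = {
    "Demographic Parity": {
        "true attribute": (0.661386, 0.646337),
        "inferred attribute": (0.667327, 0.570297),
    },
    "Disparate Impact": {
        "true attribute": (0.661386, 0.638812),
        "inferred attribute": (0.667327, 0.56396),
    },
}


def accuracy_drop(original, updated):
    """Accuracy lost on the test set after enforcing the fairness constraint."""
    return original - updated


drops = {
    (measure, setting): accuracy_drop(original, updated)
    for measure, settings in results.items()
    for setting, (original, updated) in settings.items()
}

for (measure, setting), drop in drops.items():
    print(f"{measure:20s} {setting:20s} drop = {drop:.3f}")
```

With access to the protected attribute the drop is roughly 1.5 to 2.3 percentage points, whereas with inferred attributes it is roughly 10 percentage points for both measures.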

Similar results can be obtained using the metrics of Verma and Rubin, by running:

In [7]:
evaluate(fpredictor, gm.verma_metrics)


|                                                 |   Measure (original) |   Measure (updated) |   Accuracy (original) |   Accuracy (updated) |
|:------------------------------------------------|---------------------:|--------------------:|----------------------:|---------------------:|
| Statistical Parity                              |          0.33658     |           0.0102499 |              0.661386 |             0.646337 |
| Predictive Parity                               |          0.000623377 |           0.0172127 |              0.661386 |             0.666139 |
| Equal Opportunity                               |          0.306877    |           0.0652084 |              0.661386 |             0.640396 |
| Average Group Difference in False Negative Rate |          0.306877    |           0.0652084 |              0.661386 |             0.640396 |
| Equalized Odds                                  |          0.301797    |           0.0154928 |              0.661386 |             0.6
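
The Equal Opportunity and False Negative Rate rows in the table above are identical. This is expected if equal opportunity is measured as the gap in true positive rate between groups: within each group the false negative rate is one minus the true positive rate, so the two gaps always coincide. A minimal sketch with made-up labels and predictions (the helper name is hypothetical):

```python
def tpr_and_fnr(y_true, y_pred):
    """True positive rate and false negative rate over the actual positives."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    true_positives = sum(predictions_for_positives)
    false_negatives = len(predictions_for_positives) - true_positives
    total = len(predictions_for_positives)
    return true_positives / total, false_negatives / total


# Toy (labels, predictions) pairs for two groups, four actual positives in each.
group_a = ([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])  # TPR 0.75, FNR 0.25
group_b = ([1, 1, 1, 1, 0, 0], [1, 0, 0, 0, 0, 0])  # TPR 0.25, FNR 0.75

tpr_a, fnr_a = tpr_and_fnr(*group_a)
tpr_b, fnr_b = tpr_and_fnr(*group_b)

tpr_gap = abs(tpr_a - tpr_b)
fnr_gap = abs(fnr_a - fnr_b)
print(tpr_gap, fnr_gap)  # 0.5 0.5
```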

And again for inferred attributes:

In [8]:
evaluate(fpredictor2, gm.verma_metrics)


No solutions satisfy the constraint found, selecting the closest solution.


|                                                 |   Measure (original) |   Measure (updated) |   Accuracy (original) |   Accuracy (updated) |
|:------------------------------------------------|---------------------:|--------------------:|----------------------:|---------------------:|
| Statistical Parity                              |            0.229454  |          0.0231172  |              0.667327 |             0.570297 |
| Predictive Parity                               |            0.042318  |          0.0965697  |              0.667327 |             0.607921 |
| Equal Opportunity                               |            0.188465  |          0.00202066 |              0.667327 |             0.589703 |
| Average Group Difference in False Negative Rate |            0.188465  |          0.00202066 |              0.667327 |             0.589703 |
| Equalized Odds                                  |            0.190124  |          0.0223392  |              0.667327 |             0.5