# Sklearn-compatible Exponentiated Gradient Reduction

Exponentiated gradient reduction is an in-processing technique that reduces fair classification to a sequence of cost-sensitive classification problems, returning a randomized classifier with the lowest empirical error subject to fair classification constraints. The code for exponentiated gradient reduction wraps the source class `fairlearn.reductions.ExponentiatedGradient`, available in the [fairlearn](https://github.com/fairlearn/fairlearn) library, licensed under the MIT License, Copyright Microsoft Corporation.
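
To make "randomized classifier" concrete, here is a minimal sketch of one way such a classifier could produce predictions: it holds several base classifiers with mixture weights and, for each sample, uses the prediction of one classifier drawn according to those weights. The function `randomized_predict` and the threshold classifiers `h1` and `h2` are hypothetical illustrations, not part of aif360 or fairlearn.

```python
import numpy as np

def randomized_predict(classifiers, weights, X, rng):
    """Per sample, return the prediction of a classifier drawn with probability `weights`."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize to a probability distribution
    # Column j holds the predictions of classifier j for every sample.
    all_preds = np.column_stack([clf(X) for clf in classifiers])
    chosen = rng.choice(len(classifiers), size=len(X), p=weights)
    return all_preds[np.arange(len(X)), chosen]

# Hypothetical base classifiers: simple thresholds on a single feature.
h1 = lambda X: (X > 0.5).astype(int)
h2 = lambda X: (X > 0.8).astype(int)

X_demo = np.array([0.1, 0.6, 0.9])
rng = np.random.default_rng(0)
mixed = randomized_predict([h1, h2], [0.7, 0.3], X_demo, rng)
only_h1 = randomized_predict([h1, h2], [1.0, 0.0], X_demo, rng)
```

With all the weight on one classifier, the randomized classifier reduces to that classifier; with mixed weights, samples on which the base classifiers disagree (here `0.6`) can receive either label.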

In [1]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
import numpy as np
import pandas as pd

from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from aif360.sklearn.inprocessing import ExponentiatedGradientReduction

from aif360.sklearn.datasets import fetch_adult
from aif360.sklearn.metrics import average_odds_error

### Loading data

Datasets are formatted as separate `X` (# samples x # features) and `y` (# samples x # labels) DataFrames. The index of each DataFrame contains protected attribute values per sample. Datasets may also load a `sample_weight` object to be used with certain algorithms/metrics. This layout keeps aif360 compatible with scikit-learn objects.

For example, we can easily load the Adult dataset from UCI with the following line:

In [3]:
X, y, sample_weight = fetch_adult()
X.head()

| race (index) | sex (index) | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-white | Male | 25.0 | Private | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States |
| White | Male | 38.0 | Private | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States |
| White | Male | 28.0 | Local-gov | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States |
| Non-white | Male | 44.0 | Private | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States |
| White | Male | 34.0 | Private | 10th | 6.0 | Never-married | Other-service | Not-in-family | White | Male | 0.0 | 0.0 | 30.0 | United-States |
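
The index-based layout can be reproduced on a toy frame. The sketch below is only an illustration with made-up values (the names `X_toy` and `y_toy` are hypothetical); it shows that protected attributes stored in a `MultiIndex` can be read without touching the feature columns, and that `X` and `y` share the same index.

```python
import pandas as pd

# Hypothetical toy data: protected attributes live in the index, not only in columns.
index = pd.MultiIndex.from_arrays(
    [["White", "Non-white", "White"], ["Male", "Female", "Female"]],
    names=["race", "sex"],
)
X_toy = pd.DataFrame({"age": [38, 25, 44], "hours-per-week": [50, 40, 30]}, index=index)
y_toy = pd.Series([">50K", "<=50K", "<=50K"], index=index)

# Read a protected attribute for every sample straight from the index.
sex_values = X_toy.index.get_level_values("sex")
print(list(sex_values))
```

Because the labels carry the same index, a metric that receives only `y_true` and `y_pred` can still look up each sample's group by index level name.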


To match the old version, we also remap the "race" feature to "White"/"Non-white":

In [4]:
X.race = X.race.cat.set_categories(['Non-white', 'White'], ordered=True).fillna('Non-white')

We can then map the protected attributes in the index to integers:

In [5]:
X.index = pd.MultiIndex.from_arrays(X.index.codes, names=X.index.names)
y.index = pd.MultiIndex.from_arrays(y.index.codes, names=y.index.names)

We also map the target classes to 0/1:

In [6]:
y = pd.Series(y.factorize(sort=True)[0], index=y.index)

Next, we split the dataset into training and test sets:

In [7]:
(X_train, X_test,
 y_train, y_test) = train_test_split(X, y, train_size=0.7, random_state=1234567)
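
The two encoding steps above (index labels to integer codes, target classes to 0/1) can be checked on a small example. This is a sketch on made-up data; `y_toy` and its values are hypothetical. It assumes, as in the notebook, that the index levels are categorical, so the integer codes follow the category order.

```python
import pandas as pd

# Hypothetical toy labels with ordered categorical protected attributes in the index.
race = pd.Categorical(["White", "Non-white", "White"],
                      categories=["Non-white", "White"], ordered=True)
sex = pd.Categorical(["Male", "Female", "Female"],
                     categories=["Female", "Male"], ordered=True)
y_toy = pd.Series([">50K", "<=50K", "<=50K"],
                  index=pd.MultiIndex.from_arrays([race, sex], names=["race", "sex"]))

# Replace the index labels with their integer codes (position in the category list).
y_toy.index = pd.MultiIndex.from_arrays(y_toy.index.codes, names=y_toy.index.names)
# Encode the target classes as integers, numbered in sorted order of the class labels.
y_toy = pd.Series(y_toy.factorize(sort=True)[0], index=y_toy.index)
print(y_toy)
```

In this toy case "Non-white" and "Female" become 0, "White" and "Male" become 1, and the label "<=50K" sorts before ">50K", so it becomes class 0.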

We use scikit-learn's one-hot encoding so that the columns associated with protected attributes are easy to reference by name, which Exponentiated Gradient Reduction requires.

In [8]:
# On scikit-learn >= 1.2, use `sparse_output=False` instead of `sparse=False`.
ohe = make_column_transformer(
    (OneHotEncoder(sparse=False), X_train.dtypes == 'category'),
    remainder='passthrough', verbose_feature_names_out=False)
# Rebuild DataFrames so the encoded columns keep their names and the protected-attribute index.
X_train = pd.DataFrame(ohe.fit_transform(X_train), columns=ohe.get_feature_names_out(), index=X_train.index)
X_test = pd.DataFrame(ohe.transform(X_test), columns=ohe.get_feature_names_out(), index=X_test.index)

X_train.head()

| race (index) | sex (index) | workclass_Federal-gov | workclass_Local-gov | workclass_Private | workclass_Self-emp-inc | workclass_Self-emp-not-inc | workclass_State-gov | workclass_Without-pay | education_10th | education_11th | education_12th | ... | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 58.0 | 11.0 | 0.0 | 0.0 | 42.0 |
| 1 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 51.0 | 12.0 | 0.0 | 0.0 | 30.0 |
| 1 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 26.0 | 14.0 | 0.0 | 1887.0 | 40.0 |
| 1 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 44.0 | 3.0 | 0.0 | 0.0 | 40.0 |
| 1 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 33.0 | 6.0 | 0.0 | 0.0 | 40.0 |


The protected attribute information is also replicated in the labels:

In [9]:
y_train.head()

race  sex
1     1      0
      0      1
      1      1
      1      0
      1      0
dtype: int64

### Running metrics

With the data in this format, we can easily train a scikit-learn model and get predictions for the test data:

In [10]:
y_pred = LogisticRegression(solver='liblinear').fit(X_train, y_train).predict(X_test)
lr_acc = accuracy_score(y_test, y_pred)
lr_acc

0.8460234392275374

We can assess how close the predictions are to equality of odds.

`average_odds_error()` computes the (unweighted) average of the absolute values of the true positive rate (TPR) difference and false positive rate (FPR) difference, i.e.:

$$ \tfrac{1}{2}\left(|FPR_{D = \text{unprivileged}} - FPR_{D = \text{privileged}}| + |TPR_{D = \text{unprivileged}} - TPR_{D = \text{privileged}}|\right) $$
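
As a cross-check of the formula, the quantity can be computed by hand with NumPy. The function below is a minimal sketch for two groups and binary labels; `average_odds_error_sketch` and the toy arrays are hypothetical and are not the aif360 implementation.

```python
import numpy as np

def average_odds_error_sketch(y_true, y_pred, group, priv_group=1):
    """Mean of |FPR difference| and |TPR difference| between two groups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))

    def rates(mask):
        # TPR: share of actual positives predicted positive.
        # FPR: share of actual negatives predicted positive.
        tpr = y_pred[mask & (y_true == 1)].mean()
        fpr = y_pred[mask & (y_true == 0)].mean()
        return tpr, fpr

    tpr_p, fpr_p = rates(group == priv_group)
    tpr_u, fpr_u = rates(group != priv_group)
    return 0.5 * (abs(fpr_u - fpr_p) + abs(tpr_u - tpr_p))

# Toy data: group 0 has TPR 0.5 and FPR 0.0; group 1 has TPR 1.0 and FPR 0.5.
y_true_toy = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred_toy = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
aoe = average_odds_error_sketch(y_true_toy, y_pred_toy, group_toy)
print(aoe)
```

Both rate differences are 0.5 in this toy case, so their average is 0.5. A value of 0 would mean the two groups have identical TPR and FPR.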

In [11]:
lr_aoe_sex = average_odds_error(y_test, y_pred, prot_attr='sex')
lr_aoe_sex

0.09335303807799161

In [12]:
lr_aoe_race = average_odds_error(y_test, y_pred, prot_attr='race')
lr_aoe_race

0.06751597777565721

### Exponentiated Gradient Reduction

Choose a base model for the randomized classifier:

In [13]:
estimator = LogisticRegression(solver='liblinear')

Determine the columns associated with the protected attribute(s):

In [14]:
prot_attr_cols = [colname for colname in X_train if "sex" in colname or "race" in colname]
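
The selection above relies on substring matching against the one-hot encoded column names. The sketch below uses a hypothetical list of column names to show what that match returns and a stricter prefix-based alternative; the column names are made up for illustration.

```python
# Hypothetical column names resembling one-hot encoded output.
columns = ["workclass_Private", "race_Non-white", "race_White",
           "sex_Female", "sex_Male", "age", "embrace_score"]

# Substring match, as in the cell above: any column containing "sex" or "race".
substring_cols = [c for c in columns if "sex" in c or "race" in c]

# Stricter alternative: only columns whose name starts with the attribute prefix.
prefix_cols = [c for c in columns if c.startswith(("sex_", "race_"))]

print(substring_cols)
print(prefix_cols)
```

The substring version also picks up the unrelated `embrace_score` column here, so a prefix match may be safer when other feature names could contain "sex" or "race".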

Train the randomized classifier and observe test accuracy. Other options for `constraints` include "DemographicParity", "TruePositiveRateParity", "FalsePositiveRateParity", and "ErrorRateParity".

In [15]:
np.random.seed(0)  # for reproducibility
exp_grad_red = ExponentiatedGradientReduction(prot_attr=prot_attr_cols, 
                                              estimator=estimator, 
                                              constraints="EqualizedOdds",
                                              drop_prot_attr=False)
exp_grad_red.fit(X_train, y_train)
egr_acc = exp_grad_red.score(X_test, y_test)
print(egr_acc)

# Check that accuracy is comparable
assert abs(lr_acc-egr_acc)<=0.03

0.834303825458834
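
The constraint names listed earlier differ in which group statistic they try to equalize. As a rough illustration, and not the fairlearn formulation (which may differ in detail, for example in how slack is handled), the sketch below compares two such statistics on toy data: the selection rate, which relates to demographic parity, and the error rate, which relates to error rate parity. The helper `max_group_gap` is hypothetical.

```python
import numpy as np

def max_group_gap(values, group):
    """Largest difference between the per-group means of `values`."""
    values = np.asarray(values, dtype=float)
    group = np.asarray(group)
    means = [values[group == g].mean() for g in np.unique(group)]
    return max(means) - min(means)

y_true_toy = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Selection rate: fraction of samples predicted positive in each group.
dp_gap = max_group_gap(y_pred_toy, group_toy)
# Error rate: fraction of samples misclassified in each group.
er_gap = max_group_gap(y_pred_toy != y_true_toy, group_toy)
print(dp_gap, er_gap)
```

Here both groups have the same error rate (one mistake out of four), but group 1 is selected three times as often as group 0, so the same predictions satisfy one fairness notion and violate another. This is why the choice of `constraints` matters.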


In [16]:
egr_aoe_sex = average_odds_error(y_test, exp_grad_red.predict(X_test), prot_attr='sex')
print(egr_aoe_sex)

# Check for improvement in average odds error for sex
assert egr_aoe_sex<lr_aoe_sex

0.02361168550972803


In [17]:
egr_aoe_race = average_odds_error(y_test, exp_grad_red.predict(X_test), prot_attr='race')
print(egr_aoe_race)

# Check for improvement in average odds error for race
assert egr_aoe_race<lr_aoe_race

0.024975550258025947
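
To put the three comparisons side by side, the snippet below tabulates the values printed in the cells above and computes the relative change from the plain logistic regression (LR) to the exponentiated gradient reduction (EGR). The numbers are copied from this notebook's outputs; nothing is re-trained.

```python
# Values copied from the cell outputs in this notebook: (LR, EGR).
results = {
    "accuracy": (0.8460234392275374, 0.834303825458834),
    "average odds error (sex)": (0.09335303807799161, 0.02361168550972803),
    "average odds error (race)": (0.06751597777565721, 0.024975550258025947),
}

for name, (lr, egr) in results.items():
    rel_change = (egr - lr) / lr
    print(f"{name}: LR={lr:.4f}, EGR={egr:.4f}, relative change={rel_change:+.1%}")
```

On this split, the average odds error drops by roughly three quarters for sex and by about 63% for race, while accuracy falls by a little over one percentage point.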


Number of calls made to the base model algorithm:

In [18]:
exp_grad_red.model_.n_oracle_calls_

29

Maximum number of iterations permitted:

In [19]:
exp_grad_red.max_iter

50

Instead of passing in a string value for `constraints`, we can also pass a `fairlearn.reductions.Moment` object. You could use a predefined moment, as we do below, or create a custom moment using the fairlearn library.

In [20]:
import fairlearn.reductions as red 

np.random.seed(0)  # needed for reproducibility
exp_grad_red2 = ExponentiatedGradientReduction(prot_attr=prot_attr_cols, 
                                               estimator=estimator, 
                                               constraints=red.EqualizedOdds(),
                                               drop_prot_attr=False)
exp_grad_red2.fit(X_train, y_train)
exp_grad_red2.score(X_test, y_test)

0.834303825458834

In [21]:
average_odds_error(y_test, exp_grad_red2.predict(X_test), prot_attr='sex')

0.02361168550972803

In [22]:
average_odds_error(y_test, exp_grad_red2.predict(X_test), prot_attr='race')

0.024975550258025947