
Commit

adding more documentation
Muhammad Bilal Zafar committed Apr 16, 2016
1 parent ea30b24 commit 8d2978e
Showing 3 changed files with 56 additions and 10 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -1,12 +1,12 @@
#Learning Fair Classifiers

- This repository provides a logistic regression implementation in python for our fair classification mechanism introduced in (Zafar et al., 2016). Please cite the paper when using the code.
+ This repository provides a logistic regression implementation in Python for our fair classification mechanism introduced in [(Zafar et al., 2016)](http://arxiv.org/abs/1507.05259v3). Please cite the paper when using the code.

**Dependencies:** numpy, matplotlib, scipy

##1. Fair classification demo

- Fair classification corresponds to a scenerio where we are learning classifiers from a dataset that is biased towards/against a specific demographic group, yet the classifier results are fair. For more details, have a look at Section 2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).
+ Fair classification corresponds to a scenario where we are learning classifiers from a dataset that is biased towards/against a specific demographic group, yet the classifier predictions are fair and do not reflect the biases contained in the data. For more details, have a look at Section 2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).

###1.1. Generating a biased dataset
Let's start off by generating a sample dataset where the class labels are biased towards a certain group.
@@ -20,7 +20,7 @@ The code will generate a dataset with a multivariate normal distribution. The da
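For intuition, here is a minimal sketch of how such a biased dataset can be generated. This is not the repository's exact generation code (that sits in the collapsed part of the demo script); the means, covariances and correlation strengths below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
n = 1000  # points per class (the demo uses 2000 points in total)

# two Gaussian clusters, one per class label (+1 / -1)
X_pos = np.random.multivariate_normal([2, 2], [[5, 1], [1, 5]], n)
X_neg = np.random.multivariate_normal([-2, -2], [[10, 1], [1, 3]], n)
X = np.vstack((X_pos, X_neg))
y = np.hstack((np.ones(n), -np.ones(n)))

# binary sensitive feature z (0 = protected, 1 = non-protected), sampled so that
# positive-class points are mostly non-protected and negative-class points are
# mostly protected -- this correlation is the "bias" in the data
z = np.hstack((np.random.binomial(1, 0.8, n),   # P(non-protected | positive class)
               np.random.binomial(1, 0.2, n)))  # P(non-protected | negative class)
```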

<img src="synthetic_data_demo/img/data.png" width="500px" style="float: right;">

- Green color denotes the positive class while red denotes negative. Circles represent the protected group while crosses represent the non-protected group. It can be seen that class labels (green and red) are highly correlated with the sensitive feature value (protected and non-protected), that is, most of the green points are in the non-protected class while most of red points are in protected class. **Close the figure** for the code to continue. Next, the code will also output the following details about the dataset:
+ Green denotes the positive class and red the negative class. Circles represent the non-protected group while crosses represent the protected group. Note that the class labels (green and red) are highly correlated with the sensitive feature value (protected and non-protected): most of the green (positive class) points belong to the non-protected group, while most of the red (negative class) points belong to the protected group. **Close the figure** for the code to continue. Next, the code will output the following details about the dataset:

```
Total data points: 2000
@@ -61,11 +61,11 @@ P-rule achieved: 48%
Covariance between sensitive feature and decision from distance boundary : 0.809
```

- We can see that the classifier decisions reflect the biases contained in the original data, and the p-rule is 48%, showing the unfairness of classifier outcomes. The reason why the classifier shows similar biases as ones contained in the data is that the classifier model tries to minimize the loss (or maximize accuracy) on the training data by learning the patterns in the data as much as possible. One of the patterns was the unfairness w.r.t. the sensitive feature, and the classifier ended up copying that as well.
+ We can see that the classifier decisions reflect the biases contained in the original data: the p-rule is 48%, showing the unfairness of the classifier outcomes. The classifier shows biases similar to those in the data because it tries to minimize the loss (or maximize the accuracy) on the training data by learning the patterns in the data as well as possible. One of those patterns was the unfairness w.r.t. the sensitive feature, and the classifier ended up learning that as well.
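For reference, the p% rule above simply compares the rates at which the two groups receive the positive decision. A minimal sketch of the computation (the repository computes this with `ut.compute_p_rule`; here `y_pred` and `z` are assumed to hold the +1/-1 decisions and the binary sensitive feature):

```python
import numpy as np

def p_rule(y_pred, z):
    """p% rule: the smaller of the two ratios between the groups'
    positive-decision rates, expressed as a percentage.
    Assumes both groups receive at least some positive decisions."""
    y_pred, z = np.asarray(y_pred), np.asarray(z)
    rate_protected = np.mean(y_pred[z == 0] == 1)      # P(decision = +1 | protected)
    rate_non_protected = np.mean(y_pred[z == 1] == 1)  # P(decision = +1 | non-protected)
    return 100.0 * min(rate_protected / rate_non_protected,
                       rate_non_protected / rate_protected)
```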

###1.3. Optimizing classifier accuracy subject to fairness constraints

- Next, we will try to make these outcomes fair by still **optimizing for classifier accuracy**, but **subject it to fairness constraints**. Refer to Section 3.2 of our paper for more details.
+ Next, we will try to make these outcomes fair while still **optimizing for classifier accuracy**, but **subjecting it to fairness constraints**. Refer to Section 3.2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf) for more details.

```python
apply_fairness_constraints = 1 # set this flag to one since we want to optimize accuracy subject to fairness constraints
```
@@ -96,7 +96,7 @@ The figure shows the original decision boundary (without any constraints) and th
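What happens under the hood here (Section 3.2 of the paper) is a constrained optimization: minimize the logistic loss subject to a bound on the covariance between the sensitive feature and the signed distance from the decision boundary. The following is a simplified sketch with scipy, not the exact implementation in `fair_classification/utils.py`:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss(w, X, y):
    # average logistic loss for labels y in {-1, +1}
    return np.mean(np.logaddexp(0, -y * np.dot(X, w)))

def boundary_cov(w, X, z):
    # empirical covariance between the sensitive feature and the
    # signed distance to the boundary: mean((z_i - mean(z)) * w.x_i)
    return np.mean((z - np.mean(z)) * np.dot(X, w))

def train_fair(X, y, z, cov_threshold=0.0):
    # |cov| <= cov_threshold, expressed as two 'ineq' constraints (>= 0 when satisfied)
    cons = [{'type': 'ineq', 'fun': lambda w: cov_threshold - boundary_cov(w, X, z)},
            {'type': 'ineq', 'fun': lambda w: cov_threshold + boundary_cov(w, X, z)}]
    res = minimize(logistic_loss, x0=np.zeros(X.shape[1]), args=(X, y),
                   method='SLSQP', constraints=cons)
    return res.x
```

Setting `cov_threshold` to 0 pushes the decisions towards zero correlation with the sensitive feature, which is what yields the perfectly fair (but less accurate) classifier discussed in Section 1.6 below.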

###1.4. Optimizing fairness subject to accuracy constraints

- Now lets try to **optimize fairness** (that does not necessarily correspond to a 100% p-rule) **subject to a deterministic loss in accuracy**. The details can be found in Section 3.3 of our paper.
+ Now let's try to **optimize fairness** (which does not necessarily correspond to a 100% p-rule) **subject to a bounded loss in accuracy**. The details can be found in Section 3.3 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).

```python
apply_fairness_constraints = 0 # flag for fairness constraint is set back to 0 since we want to apply the accuracy constraint now
```
@@ -125,7 +125,7 @@ You can experiment with more values of gamma to see how allowing more loss in ac
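The accuracy-constrained variant of Section 3.3 swaps the objective and the constraint: it minimizes the (absolute) boundary covariance while allowing the loss to grow by at most a factor of (1 + gamma) over the unconstrained classifier's loss -- this is what the `constraint_gamma_all` helper further down in `fair_classification/utils.py` encodes. A sketch reusing the helpers from the previous snippet (`gamma = 0.5` is just an illustrative value):

```python
def train_fair_gamma(X, y, z, w_unconstrained, gamma=0.5):
    # baseline loss of the unconstrained classifier
    base_loss = logistic_loss(w_unconstrained, X, y)
    # the loss may grow by at most a factor of (1 + gamma)
    cons = [{'type': 'ineq',
             'fun': lambda w: (1.0 + gamma) * base_loss - logistic_loss(w, X, y)}]
    res = minimize(lambda w: abs(boundary_cov(w, X, z)), x0=w_unconstrained,
                   method='SLSQP', constraints=cons)
    return res.x
```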

###1.5. Constraints on misclassifying positive examples

- Next, lets try to train a fair classifier, however, lets put an additional constraint: do not misclassify any points that were classified in positive class by the original (unconstrained) classifier! The idea here is that we only want to promote the examples from protected group to the positive class, without demoting any points from the positive class. Details of this formulation can be found in Section 3.3 of our paper. The code works as follows:
+ Next, let's try to train a fair classifier, but with an additional constraint: do not misclassify any non-protected points that were classified into the positive class by the original (unconstrained) classifier! The idea here is that we only want to promote examples from the protected group to the positive class, without demoting any non-protected points from the positive class (this might be a business necessity in many scenarios). Details of this formulation can be found in Section 3.3 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf). The code works as follows:

```python
apply_fairness_constraints = 0 # flag for fairness constraint is set back to 0 since we want to apply the accuracy constraint now
```
@@ -148,7 +148,7 @@ Covariance between sensitive feature and decision from distance boundary : 0.075

<img src="synthetic_data_demo/img/a_cons_fine.png" width="500px" style="float: right;">

- Notice the movement of decision boundary: we are only moving points to the positive class to achieve fairness (and not moving anything that was classified as positive by the original classifier to the negative class).
+ Notice the movement of the decision boundary: we only move points to the positive class to achieve fairness (and do not move any non-protected point that was classified as positive by the original classifier to the negative class).
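Concretely, this is done by adding one inequality constraint per non-protected point that the unconstrained classifier put in the positive class, forcing it to stay on the positive side of the new boundary (see the `constraint_protected_people` hunk of `fair_classification/utils.py` further down in this commit). A simplified sketch, where `y_pred_unconstrained` and `z` are assumed to hold the unconstrained predictions and the binary sensitive feature:

```python
constraints = []
for i in range(len(X)):
    if y_pred_unconstrained[i] == 1.0 and z[i] == 1.0:  # non-protected, predicted positive
        # w.x_i >= 0 keeps this point in the positive class; the default argument
        # pins down x_i at loop time (a plain closure would late-bind it)
        constraints.append({'type': 'ineq',
                            'fun': lambda w, xi=X[i]: np.dot(w, xi)})
```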

###1.6. Understanding trade-offs between fairness and accuracy
Remember, while optimizing for accuracy subject to fairness constraints, we forced the classifier to achieve perfect fairness by setting the covariance threshold to 0. This resulted in a perfectly fair classifier, but we had to incur a rather big loss in accuracy (0.71, down from 0.85). Let's see what kind of accuracy we can achieve if we try a range of fairness values (not necessarily a 100% p-rule). We will do that by trying a range of covariance thresholds (not only 0!). Execute the following command:
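(The exact command sits in the part of the README not shown in this diff.) Conceptually, the trade-off curve is produced by sweeping the covariance threshold and recording the accuracy and p-rule obtained at each value -- roughly as in the sketch below, which reuses `train_fair` and `p_rule` from the earlier snippets; the repository's own routine for this is `plot_cov_thresh_vs_acc_pos_ratio`, used in the new adult-data demo added by this commit:

```python
results = []
for c in [0.0, 0.1, 0.2, 0.5, 1.0]:            # illustrative thresholds
    w_c = train_fair(X, y, z, cov_threshold=c)
    y_hat = np.sign(np.dot(X, w_c))             # +1 / -1 predictions
    results.append((c, np.mean(y_hat == y), p_rule(y_hat, z)))
# plotting accuracy and p-rule against c visualizes the fairness/accuracy trade-off
```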
46 changes: 46 additions & 0 deletions adult_data_demo/fairness_acc_tradeoff.py
@@ -0,0 +1,46 @@
import os,sys
import numpy as np
from prepare_adult_data import *
sys.path.insert(0, '../fair_classification/') # the code for fair classification is in this directory
import utils as ut
import loss_funcs as lf # loss funcs that can be optimized subject to various constraints

NUM_FOLDS = 10 # we will show 10-fold cross validation accuracy as a performance measure

def test_synthetic_data():

""" Generate the synthetic data """
X, y, x_control = load_adult_data(load_data_size=None) # set the argument to none, or no arguments if you want to test with the whole data -- we are subsampling for performance speedup
ut.compute_p_rule(x_control["sex"], y) # compute the p-rule in the original data

""" Classify the data without any constraints """
apply_fairness_constraints = 0
apply_accuracy_constraint = 0
sep_constraint = 0

loss_function = lf._logistic_loss
X = ut.add_intercept(X) # add intercept to X before applying the linear classifier
test_acc_arr, train_acc_arr, correlation_dict_test_arr, correlation_dict_train_arr, cov_dict_test_arr, cov_dict_train_arr = ut.compute_cross_validation_error(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'], [{} for i in range(0,NUM_FOLDS)])
print
print "== Unconstrained (original) classifier =="
ut.print_classifier_fairness_stats(test_acc_arr, correlation_dict_test_arr, cov_dict_test_arr, "sex")


""" Now classify such that we achieve perfect fairness """
apply_fairness_constraints = 1
test_acc_arr, train_acc_arr, correlation_dict_test_arr, correlation_dict_train_arr, cov_dict_test_arr, cov_dict_train_arr = ut.compute_cross_validation_error(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'], [{'sex':0.0} for i in range(0,NUM_FOLDS)])
print
print "== Constrained (fair) classifier =="
ut.print_classifier_fairness_stats(test_acc_arr, correlation_dict_test_arr, cov_dict_test_arr, "sex")

""" Now plot a tradeoff between the fairness and accuracy """
ut.plot_cov_thresh_vs_acc_pos_ratio(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'])



def main():
test_synthetic_data()


if __name__ == '__main__':
main()
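Assuming `prepare_adult_data` (which provides `load_adult_data`) sits next to this script, the demo would presumably be run from the `adult_data_demo` directory as `python fairness_acc_tradeoff.py`; it prints 10-fold cross-validation accuracy and fairness statistics for the unconstrained and the constrained classifier, and then plots the fairness/accuracy trade-off.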
4 changes: 2 additions & 2 deletions fair_classification/utils.py
@@ -91,7 +91,7 @@ def constraint_gamma_all(w, x, y, initial_loss_arr):
    old_loss = sum(initial_loss_arr)
    return ((1.0 + gamma) * old_loss) - new_loss

- def constraint_protected_people(w,x,y): #
+ def constraint_protected_people(w,x,y): # don't confuse the "protected" here with the protected/non-protected values of the sensitive feature -- "protected" here means that these points should not be misclassified into the negative class
      return np.dot(w, x.T) # if this is positive, the constraint is satisfied
def constraint_unprotected_people(w,ind,old_loss,x,y):

@@ -105,7 +105,7 @@ def constraint_unprotected_people(w,ind,old_loss,x,y):
if sep_constraint == True: # separate gamma for different people
    for i in range(0, len(predicted_labels)):
        if predicted_labels[i] == 1.0 and x_control[sensitive_attrs[0]][i] == 1.0: # for now we are assuming just one sensitive attr for reverse constraint, later, extend the code to take into account multiple sensitive attrs
-           c = ({'type': 'ineq', 'fun': constraint_protected_people, 'args':(x[i], y[i])})
+           c = ({'type': 'ineq', 'fun': constraint_protected_people, 'args':(x[i], y[i])}) # this constraint makes sure that these people stay in the positive class even in the modified classifier
            constraints.append(c)
        else:
            c = ({'type': 'ineq', 'fun': constraint_unprotected_people, 'args':(i, unconstrained_loss_arr[i], x[i], y[i])})
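A note on the scipy convention used above, for readers new to `scipy.optimize.minimize`: an `'ineq'` constraint counts as satisfied when `'fun'` returns a non-negative value, and the `'args'` tuple is passed to `'fun'` as extra positional arguments (which is how `x[i]` and `y[i]` reach `constraint_protected_people`). A tiny standalone illustration, unrelated to the repository's data:

```python
import numpy as np
from scipy.optimize import minimize

x_point = np.array([1.0, -2.0])
cons = [{'type': 'ineq',
         'fun': lambda w, x: np.dot(w, x),  # >= 0 once the constraint holds
         'args': (x_point,)}]
# toy problem: pull w towards (1, 1) while keeping w.x_point >= 0
res = minimize(lambda w: np.sum((w - 1.0) ** 2), x0=np.zeros(2),
               method='SLSQP', constraints=cons)
w_opt = res.x  # lands on the w.x_point = 0 boundary, since (1, 1) violates the constraint
```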
