
Commit

adding more documentation
Muhammad Bilal Zafar committed Apr 16, 2016
1 parent ea30b24 commit 8d2978e
Showing 3 changed files with 56 additions and 10 deletions.
16 changes: 8 additions & 8 deletions README.md
@@ -1,12 +1,12 @@
#Learning Fair Classifiers

- This repository provides a logistic regression implementation in python for our fair classification mechanism introduced in (Zafar et al., 2016). Please cite the paper when using the code.
+ This repository provides a logistic regression implementation in Python for our fair classification mechanism introduced in [(Zafar et al., 2016)](http://arxiv.org/abs/1507.05259v3). Please cite the paper when using the code.

**Dependencies:** numpy, matplotlib, scipy

##1. Fair classification demo

- Fair classification corresponds to a scenerio where we are learning classifiers from a dataset that is biased towards/against a specific demographic group, yet the classifier results are fair. For more details, have a look at Section 2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).
+ Fair classification corresponds to a scenario where we are learning classifiers from a dataset that is biased towards/against a specific demographic group, yet the classifier predictions are fair and do not reflect the biases contained in the data. For more details, have a look at Section 2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).

###1.1. Generating a biased dataset
Let's start off by generating a sample dataset where the class labels are biased towards a certain group.
@@ -20,7 +20,7 @@ The code will generate a dataset with a multivariate normal distribution. The da
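For intuition, here is a minimal sketch of how such a biased dataset can be generated. This is not the repository's exact generation code (that sits in the collapsed part of the demo script); the means, covariances and correlation strengths below are made up for illustration:

```python
import numpy as np

np.random.seed(0)
n = 1000  # points per class (the demo uses 2000 points in total)

# two Gaussian clusters, one per class label (+1 / -1)
X_pos = np.random.multivariate_normal([2, 2], [[5, 1], [1, 5]], n)
X_neg = np.random.multivariate_normal([-2, -2], [[10, 1], [1, 3]], n)
X = np.vstack((X_pos, X_neg))
y = np.hstack((np.ones(n), -np.ones(n)))

# binary sensitive feature z (0 = protected, 1 = non-protected), sampled so that
# positive-class points are mostly non-protected and negative-class points are
# mostly protected -- this correlation is the "bias" in the data
z = np.hstack((np.random.binomial(1, 0.8, n),   # P(non-protected | positive class)
               np.random.binomial(1, 0.2, n)))  # P(non-protected | negative class)
```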

<img src="synthetic_data_demo/img/data.png" width="500px" style="float: right;">

- Green color denotes the positive class while red denotes negative. Circles represent the protected group while crosses represent the non-protected group. It can be seen that class labels (green and red) are highly correlated with the sensitive feature value (protected and non-protected), that is, most of the green points are in the non-protected class while most of red points are in protected class. **Close the figure** for the code to continue. Next, the code will also output the following details about the dataset:
+ Green denotes the positive class and red the negative class. Circles represent the non-protected group while crosses represent the protected group. Note that the class labels (green and red) are highly correlated with the sensitive feature value (protected and non-protected): most of the green (positive class) points belong to the non-protected group, while most of the red (negative class) points belong to the protected group. **Close the figure** for the code to continue. Next, the code will output the following details about the dataset:

```
Total data points: 2000
@@ -61,11 +61,11 @@ P-rule achieved: 48%
Covariance between sensitive feature and decision from distance boundary : 0.809
```

- We can see that the classifier decisions reflect the biases contained in the original data, and the p-rule is 48%, showing the unfairness of classifier outcomes. The reason why the classifier shows similar biases as ones contained in the data is that the classifier model tries to minimize the loss (or maximize accuracy) on the training data by learning the patterns in the data as much as possible. One of the patterns was the unfairness w.r.t. the sensitive feature, and the classifier ended up copying that as well.
+ We can see that the classifier decisions reflect the biases contained in the original data: the p-rule is 48%, showing the unfairness of the classifier outcomes. The classifier shows biases similar to those in the data because it tries to minimize the loss (or maximize the accuracy) on the training data by learning the patterns in the data as well as possible. One of those patterns was the unfairness w.r.t. the sensitive feature, and the classifier ended up learning that as well.
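For reference, the p% rule above simply compares the rates at which the two groups receive the positive decision. A minimal sketch of the computation (the repository computes this with `ut.compute_p_rule`; here `y_pred` and `z` are assumed to hold the +1/-1 decisions and the binary sensitive feature):

```python
import numpy as np

def p_rule(y_pred, z):
    """p% rule: the smaller of the two ratios between the groups'
    positive-decision rates, expressed as a percentage.
    Assumes both groups receive at least some positive decisions."""
    y_pred, z = np.asarray(y_pred), np.asarray(z)
    rate_protected = np.mean(y_pred[z == 0] == 1)      # P(decision = +1 | protected)
    rate_non_protected = np.mean(y_pred[z == 1] == 1)  # P(decision = +1 | non-protected)
    return 100.0 * min(rate_protected / rate_non_protected,
                       rate_non_protected / rate_protected)
```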

###1.3. Optimizing classifier accuracy subject to fairness constraints

- Next, we will try to make these outcomes fair by still **optimizing for classifier accuracy**, but **subject it to fairness constraints**. Refer to Section 3.2 of our paper for more details.
+ Next, we will try to make these outcomes fair while still **optimizing for classifier accuracy**, but **subjecting it to fairness constraints**. Refer to Section 3.2 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf) for more details.

```python
apply_fairness_constraints = 1 # set this flag to one since we want to optimize accuracy subject to fairness constraints
```
@@ -96,7 +96,7 @@ The figure shows the original decision boundary (without any constraints) and th
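What happens under the hood here (Section 3.2 of the paper) is a constrained optimization: minimize the logistic loss subject to a bound on the covariance between the sensitive feature and the signed distance from the decision boundary. The following is a simplified sketch with scipy, not the exact implementation in `fair_classification/utils.py`:

```python
import numpy as np
from scipy.optimize import minimize

def logistic_loss(w, X, y):
    # average logistic loss for labels y in {-1, +1}
    return np.mean(np.logaddexp(0, -y * np.dot(X, w)))

def boundary_cov(w, X, z):
    # empirical covariance between the sensitive feature and the
    # signed distance to the boundary: mean((z_i - mean(z)) * w.x_i)
    return np.mean((z - np.mean(z)) * np.dot(X, w))

def train_fair(X, y, z, cov_threshold=0.0):
    # |cov| <= cov_threshold, expressed as two 'ineq' constraints (>= 0 when satisfied)
    cons = [{'type': 'ineq', 'fun': lambda w: cov_threshold - boundary_cov(w, X, z)},
            {'type': 'ineq', 'fun': lambda w: cov_threshold + boundary_cov(w, X, z)}]
    res = minimize(logistic_loss, x0=np.zeros(X.shape[1]), args=(X, y),
                   method='SLSQP', constraints=cons)
    return res.x
```

Setting `cov_threshold` to 0 pushes the decisions towards zero correlation with the sensitive feature, which is what yields the perfectly fair (but less accurate) classifier discussed in Section 1.6 below.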

###1.4. Optimizing fairness subject to accuracy constraints

- Now lets try to **optimize fairness** (that does not necessarily correspond to a 100% p-rule) **subject to a deterministic loss in accuracy**. The details can be found in Section 3.3 of our paper.
+ Now let's try to **optimize fairness** (which does not necessarily correspond to a 100% p-rule) **subject to a bounded loss in accuracy**. The details can be found in Section 3.3 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf).

```python
apply_fairness_constraints = 0 # flag for fairness constraint is set back to 0 since we want to apply the accuracy constraint now
```
@@ -125,7 +125,7 @@ You can experiment with more values of gamma to see how allowing more loss in ac
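The accuracy-constrained variant of Section 3.3 swaps the objective and the constraint: it minimizes the (absolute) boundary covariance while allowing the loss to grow by at most a factor of (1 + gamma) over the unconstrained classifier's loss -- this is what the `constraint_gamma_all` helper further down in `fair_classification/utils.py` encodes. A sketch reusing the helpers from the previous snippet (`gamma = 0.5` is just an illustrative value):

```python
def train_fair_gamma(X, y, z, w_unconstrained, gamma=0.5):
    # baseline loss of the unconstrained classifier
    base_loss = logistic_loss(w_unconstrained, X, y)
    # the loss may grow by at most a factor of (1 + gamma)
    cons = [{'type': 'ineq',
             'fun': lambda w: (1.0 + gamma) * base_loss - logistic_loss(w, X, y)}]
    res = minimize(lambda w: abs(boundary_cov(w, X, z)), x0=w_unconstrained,
                   method='SLSQP', constraints=cons)
    return res.x
```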

###1.5. Constraints on misclassifying positive examples

- Next, lets try to train a fair classifier, however, lets put an additional constraint: do not misclassify any points that were classified in positive class by the original (unconstrained) classifier! The idea here is that we only want to promote the examples from protected group to the positive class, without demoting any points from the positive class. Details of this formulation can be found in Section 3.3 of our paper. The code works as follows:
+ Next, let's try to train a fair classifier, but with an additional constraint: do not misclassify any non-protected points that were classified into the positive class by the original (unconstrained) classifier! The idea here is that we only want to promote examples from the protected group to the positive class, without demoting any non-protected points from the positive class (this might be a business necessity in many scenarios). Details of this formulation can be found in Section 3.3 of our [paper](http://arxiv.org/pdf/1507.05259v3.pdf). The code works as follows:

```python
apply_fairness_constraints = 0 # flag for fairness constraint is set back to 0 since we want to apply the accuracy constraint now
```
@@ -148,7 +148,7 @@ Covariance between sensitive feature and decision from distance boundary : 0.075

<img src="synthetic_data_demo/img/a_cons_fine.png" width="500px" style="float: right;">

- Notice the movement of decision boundary: we are only moving points to the positive class to achieve fairness (and not moving anything that was classified as positive by the original classifier to the negative class).
+ Notice the movement of the decision boundary: we only move points to the positive class to achieve fairness (and do not move any non-protected point that was classified as positive by the original classifier to the negative class).
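Concretely, this is done by adding one inequality constraint per non-protected point that the unconstrained classifier put in the positive class, forcing it to stay on the positive side of the new boundary (see the `constraint_protected_people` hunk of `fair_classification/utils.py` further down in this commit). A simplified sketch, where `y_pred_unconstrained` and `z` are assumed to hold the unconstrained predictions and the binary sensitive feature:

```python
constraints = []
for i in range(len(X)):
    if y_pred_unconstrained[i] == 1.0 and z[i] == 1.0:  # non-protected, predicted positive
        # w.x_i >= 0 keeps this point in the positive class; the default argument
        # pins down x_i at loop time (a plain closure would late-bind it)
        constraints.append({'type': 'ineq',
                            'fun': lambda w, xi=X[i]: np.dot(w, xi)})
```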

###1.6. Understanding trade-offs between fairness and accuracy
Remember, while optimizing for accuracy subject to fairness constraints, we forced the classifier to achieve perfect fairness by setting the covariance threshold to 0. This resulted in a perfectly fair classifier, but we had to incur a rather big loss in accuracy (0.71, down from 0.85). Let's see what kind of accuracy we can achieve if we try a range of fairness values (not necessarily a 100% p-rule). We will do that by trying a range of covariance thresholds (not only 0!). Execute the following command:
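(The exact command sits in the part of the README not shown in this diff.) Conceptually, the trade-off curve is produced by sweeping the covariance threshold and recording the accuracy and p-rule obtained at each value -- roughly as in the sketch below, which reuses `train_fair` and `p_rule` from the earlier snippets; the repository's own routine for this is `plot_cov_thresh_vs_acc_pos_ratio`, used in the new adult-data demo added by this commit:

```python
results = []
for c in [0.0, 0.1, 0.2, 0.5, 1.0]:            # illustrative thresholds
    w_c = train_fair(X, y, z, cov_threshold=c)
    y_hat = np.sign(np.dot(X, w_c))             # +1 / -1 predictions
    results.append((c, np.mean(y_hat == y), p_rule(y_hat, z)))
# plotting accuracy and p-rule against c visualizes the fairness/accuracy trade-off
```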
46 changes: 46 additions & 0 deletions adult_data_demo/fairness_acc_tradeoff.py
@@ -0,0 +1,46 @@
import os,sys
import numpy as np
from prepare_adult_data import *
sys.path.insert(0, '../fair_classification/') # the code for fair classification is in this directory
import utils as ut
import loss_funcs as lf # loss funcs that can be optimized subject to various constraints

NUM_FOLDS = 10 # we will show 10-fold cross validation accuracy as a performance measure

def test_synthetic_data():

""" Generate the synthetic data """
X, y, x_control = load_adult_data(load_data_size=None) # set the argument to none, or no arguments if you want to test with the whole data -- we are subsampling for performance speedup
ut.compute_p_rule(x_control["sex"], y) # compute the p-rule in the original data

""" Classify the data without any constraints """
apply_fairness_constraints = 0
apply_accuracy_constraint = 0
sep_constraint = 0

loss_function = lf._logistic_loss
X = ut.add_intercept(X) # add intercept to X before applying the linear classifier
test_acc_arr, train_acc_arr, correlation_dict_test_arr, correlation_dict_train_arr, cov_dict_test_arr, cov_dict_train_arr = ut.compute_cross_validation_error(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'], [{} for i in range(0,NUM_FOLDS)])
print
print "== Unconstrained (original) classifier =="
ut.print_classifier_fairness_stats(test_acc_arr, correlation_dict_test_arr, cov_dict_test_arr, "sex")


""" Now classify such that we achieve perfect fairness """
apply_fairness_constraints = 1
test_acc_arr, train_acc_arr, correlation_dict_test_arr, correlation_dict_train_arr, cov_dict_test_arr, cov_dict_train_arr = ut.compute_cross_validation_error(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'], [{'sex':0.0} for i in range(0,NUM_FOLDS)])
print
print "== Constrained (fair) classifier =="
ut.print_classifier_fairness_stats(test_acc_arr, correlation_dict_test_arr, cov_dict_test_arr, "sex")

""" Now plot a tradeoff between the fairness and accuracy """
ut.plot_cov_thresh_vs_acc_pos_ratio(X, y, x_control, NUM_FOLDS, loss_function, apply_fairness_constraints, apply_accuracy_constraint, sep_constraint, ['sex'])



def main():
test_synthetic_data()


if __name__ == '__main__':
main()
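Assuming `prepare_adult_data` (which provides `load_adult_data`) sits next to this script, the demo would presumably be run from the `adult_data_demo` directory as `python fairness_acc_tradeoff.py`; it prints 10-fold cross-validation accuracy and fairness statistics for the unconstrained and the constrained classifier, and then plots the fairness/accuracy trade-off.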
4 changes: 2 additions & 2 deletions fair_classification/utils.py
@@ -91,7 +91,7 @@ def constraint_gamma_all(w, x, y, initial_loss_arr):
    old_loss = sum(initial_loss_arr)
    return ((1.0 + gamma) * old_loss) - new_loss

- def constraint_protected_people(w,x,y): #
+ def constraint_protected_people(w,x,y): # don't confuse the "protected" here with the protected/non-protected values of the sensitive feature -- "protected" here means that these points should not be misclassified into the negative class
      return np.dot(w, x.T) # if this is positive, the constraint is satisfied
def constraint_unprotected_people(w,ind,old_loss,x,y):

@@ -105,7 +105,7 @@ def constraint_unprotected_people(w,ind,old_loss,x,y):
if sep_constraint == True: # separate gamma for different people
    for i in range(0, len(predicted_labels)):
        if predicted_labels[i] == 1.0 and x_control[sensitive_attrs[0]][i] == 1.0: # for now we are assuming just one sensitive attr for reverse constraint, later, extend the code to take into account multiple sensitive attrs
-           c = ({'type': 'ineq', 'fun': constraint_protected_people, 'args':(x[i], y[i])})
+           c = ({'type': 'ineq', 'fun': constraint_protected_people, 'args':(x[i], y[i])}) # this constraint makes sure that these people stay in the positive class even in the modified classifier
            constraints.append(c)
        else:
            c = ({'type': 'ineq', 'fun': constraint_unprotected_people, 'args':(i, unconstrained_loss_arr[i], x[i], y[i])})
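A note on the scipy convention used above, for readers new to `scipy.optimize.minimize`: an `'ineq'` constraint counts as satisfied when `'fun'` returns a non-negative value, and the `'args'` tuple is passed to `'fun'` as extra positional arguments (which is how `x[i]` and `y[i]` reach `constraint_protected_people`). A tiny standalone illustration, unrelated to the repository's data:

```python
import numpy as np
from scipy.optimize import minimize

x_point = np.array([1.0, -2.0])
cons = [{'type': 'ineq',
         'fun': lambda w, x: np.dot(w, x),  # >= 0 once the constraint holds
         'args': (x_point,)}]
# toy problem: pull w towards (1, 1) while keeping w.x_point >= 0
res = minimize(lambda w: np.sum((w - 1.0) ** 2), x0=np.zeros(2),
               method='SLSQP', constraints=cons)
w_opt = res.x  # lands on the w.x_point = 0 boundary, since (1, 1) violates the constraint
```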
