# ADS Project 4: Machine Learning Fairness
## Spring 2022

Chang Lu, Jiaxin Yu, Marcus Loke, Xiran Lin, Zaigham Khan

## Overview

+ The cleaned COMPAS dataset is provided in `../output/compas-scores-two-years(cleaned).csv`. The EDA and cleaning process is described in `../doc/eda_cleaning.html` and `eda_cleaning.Rmd`.


+ Our team implemented three fairness-aware methods: constrained logistic regression (C-LR) and constrained SVM (C-SVM), both of which maximize accuracy under fairness constraints (A2), and information-theoretic measures for fairness-aware feature selection (FFS) (A7).

## Load modules and data

In [1]:
import os, sys
import numpy as np
import pandas as pd
import utils as ut
import loss_funcs as lf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import log_loss

In [2]:
df = pd.read_csv('../output/compas-scores-two-years(cleaned).csv')
df.head()

| | sex | age_cat | race | priors_count | c_charge_degree | length_of_stay | two_year_recid |
|---|---|---|---|---|---|---|---|
| 0 | Male | 25 - 45 | African-American | -0.733607 | F | -0.177294 | 1 |
| 1 | Male | < 25 | African-American | 0.055928 | F | -0.350235 | 1 |
| 2 | Male | 25 - 45 | Caucasian | 2.029767 | F | -0.254156 | 1 |
| 3 | Female | 25 - 45 | Caucasian | -0.733607 | M | -0.311803 | 0 |
| 4 | Male | < 25 | Caucasian | -0.536224 | F | -0.350235 | 1 |


Encode the categorical variables as integer codes:
+ `sex`: 1 for Male and 0 for Female
+ `age_cat`: 2 for > 45, 1 for 25 - 45 and 0 for < 25
+ `race`: 1 for Caucasian and 0 for African-American
+ `c_charge_degree`: 1 for F and 0 for M

In [3]:
df['sex'] = df['sex'].apply(lambda sex: 0 if sex == 'Female' else 1)
df['age_cat'] = df['age_cat'].apply(lambda age_cat: 2 if age_cat == '> 45' else(1 if age_cat == '25 - 45' else 0))
df['race'] = df['race'].apply(lambda race: 0 if race == 'African-American' else 1)
df['c_charge_degree'] = df['c_charge_degree'].apply(lambda c_charge_degree: 0 if c_charge_degree == 'M' else 1)
df.head()

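| | sex | age_cat | race | priors_count | c_charge_degree | length_of_stay | two_year_recid |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | -0.733607 | 1 | -0.177294 | 1 |
| 1 | 1 | 0 | 0 | 0.055928 | 1 | -0.350235 | 1 |
| 2 | 1 | 1 | 1 | 2.029767 | 1 | -0.254156 | 1 |
| 3 | 0 | 1 | 1 | -0.733607 | 0 | -0.311803 | 0 |
| 4 | 1 | 0 | 1 | -0.536224 | 1 | -0.350235 | 1 |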


Create a function that separates the data into the target variable, the sensitive attribute and the remaining features. The function also shuffles the rows (with a fixed seed) so that a simple 70/30 slice gives a random train/test split.

In [4]:
# Vars to store features
features = ['sex', 'age_cat', 'priors_count', 'c_charge_degree', 'length_of_stay']
sensitive = 'race'
target = 'two_year_recid'

# Function to process data
def process_df(df):
    y_label = df[target]
    protected_attr = df[sensitive]
    df_new = df[features]
    y_label, protected_attr, df_new = shuffle(y_label, protected_attr, df_new, random_state = 617)
    
    return y_label.to_numpy(), protected_attr.to_numpy(), df_new.to_numpy()

# Split data into train and test
y_label, protected_attr, df_new =  process_df(df)
train_index = int(len(df_new) * 0.7)
x_train, y_train, race_train = df_new[:train_index], y_label[:train_index], protected_attr[:train_index]
x_test, y_test, race_test = df_new[train_index:], y_label[train_index:],protected_attr[train_index:]
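
The same kind of 70/30 split could also be written with `train_test_split`, which is imported above but not used. The sketch below runs on hypothetical stand-in arrays (`X_demo`, `y_demo`, `race_demo`) rather than the COMPAS data, and it is not guaranteed to reproduce the exact partition obtained from `shuffle` followed by slicing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same layout as the notebook's arrays
rng = np.random.default_rng(617)
X_demo = rng.normal(size=(100, 5))        # 5 features per row
y_demo = rng.integers(0, 2, size=100)     # binary target
race_demo = rng.integers(0, 2, size=100)  # binary sensitive attribute

# Passing all three arrays together keeps their rows aligned after shuffling
X_tr, X_te, y_tr, y_te, race_tr, race_te = train_test_split(
    X_demo, y_demo, race_demo, train_size=0.7, random_state=617)

print(X_tr.shape, X_te.shape)
```

Splitting the features, labels and sensitive attribute in one call avoids the risk of shuffling them with different permutations.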

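We also define a function to compute the p%-rule: the ratio between the fraction of the protected group and the fraction of the non-protected group that receive a positive prediction (`y_pred == 1`). The function returns this ratio together with the two group-wise fractions.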

+ **Protected**: Caucasians (i.e., `race == 1`)
+ **Not protected**: African-Americans (i.e., `race == 0`)

In [5]:
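# Function to compute p-rule
def p_rule(sensitive_var, y_pred):
    # Returns (ratio, protected rate, not-protected rate), where each rate is the
    # fraction of that group predicted positive (y_pred == 1)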
    protected = np.where(sensitive_var == 1)[0]
    not_protected = np.where(sensitive_var == 0)[0]
    protected_pred = np.where(y_pred[protected] == 1)
    not_protected_pred = np.where(y_pred[not_protected] == 1)
    protected_percent = protected_pred[0].shape[0]/protected.shape[0]
    not_protected_percent = not_protected_pred[0].shape[0]/not_protected.shape[0]
    ratio = protected_percent/not_protected_percent
    
    return ratio, protected_percent, not_protected_percent
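
To make the p-rule concrete, the sketch below applies the same computation to a small hand-made example. The arrays `toy_race` and `toy_pred` are hypothetical and are not drawn from COMPAS; the function is restated in a compact form so that the block runs on its own.

```python
import numpy as np

def p_rule(sensitive_var, y_pred):
    """Ratio of positive-prediction rates: group 1 over group 0."""
    protected = np.where(sensitive_var == 1)[0]
    not_protected = np.where(sensitive_var == 0)[0]
    protected_percent = np.mean(y_pred[protected] == 1)
    not_protected_percent = np.mean(y_pred[not_protected] == 1)
    return protected_percent / not_protected_percent, protected_percent, not_protected_percent

# Hypothetical toy data: four people with race == 1, four with race == 0
toy_race = np.array([1, 1, 1, 1, 0, 0, 0, 0])
toy_pred = np.array([1, 0, 0, 0, 1, 1, 0, 0])

# 1 of 4 in group 1 and 2 of 4 in group 0 are predicted positive
ratio, protected_rate, not_protected_rate = p_rule(toy_race, toy_pred)
print(f"p-rule = {ratio:.0%}")
```

Note that this function divides in one direction only, so it can exceed 100% when group 1 has the higher positive rate. Zafar et al. define the p%-rule as the minimum of the ratio and its inverse, which always lies between 0 and 100%.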

## Logistic Regression

### Training unconstrained classifier

We first train an unconstrained logistic regression as a baseline.

In [6]:
clf = LogisticRegression(random_state = 0).fit(x_train, y_train)
coeff = clf.coef_
intercept = clf.intercept_
optimal_loss = log_loss(y_train, clf.predict_proba(x_train))

# Evaluate the p-rule once per split; each call returns (ratio, protected rate, not-protected rate)
p_train = p_rule(race_train, clf.predict(x_train))
p_test = p_rule(race_test, clf.predict(x_test))
print_results = {"Set": ["Train", "Test"],
                 "Accuracy (%)": [clf.score(x_train, y_train)*100, clf.score(x_test, y_test)*100],
                 "P-rule (%)": [p_train[0]*100, p_test[0]*100],
                 "Protected (%)": [p_train[1]*100, p_test[1]*100],
                 "Not protected (%)": [p_train[2]*100, p_test[2]*100]}
pd.DataFrame(print_results)

| | Set | Accuracy (%) | P-rule (%) | Protected (%) | Not protected (%) |
|---|---|---|---|---|---|
| 0 | Train | 66.932367 | 53.771942 | 29.312425 | 54.51249 |
| 1 | Test | 64.957746 | 61.64272 | 33.888889 | 54.976303 |


### Optimizing classifier accuracy subject to fairness constraints

Next, we maximize accuracy subject to a fairness constraint, following Section 3.2 of [Fairness Constraints: Mechanisms for Fair Classification](https://arxiv.org/abs/1507.05259). Setting `sensitive_attrs_to_cov_thresh = {'race': 0}` asks the optimizer to drive the covariance between the sensitive attribute (`race`) and the signed distance to the decision boundary to 0. A covariance of 0 means the two quantities are linearly uncorrelated; it does not by itself guarantee that they are statistically independent.
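
The covariance that the constraint bounds can be computed directly. The sketch below is a minimal illustration, assuming a linear decision boundary with weights `w` so that the signed distance is proportional to `x @ w`. The helper name `boundary_covariance` and the random data are hypothetical and are not part of `utils.py`.

```python
import numpy as np

def boundary_covariance(x, z, w):
    """Empirical covariance between sensitive attribute z and the linear score x @ w."""
    z = np.asarray(z, dtype=float)
    scores = x @ w
    return np.mean((z - z.mean()) * scores)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)  # hypothetical binary sensitive attribute

# Hypothetical features: column 0 is shifted by group, column 1 is independent of z
x = np.column_stack([rng.normal(size=1000) + z, rng.normal(size=1000)])

# Weights that rely on the group-dependent column give a clearly non-zero covariance
cov_dependent = boundary_covariance(x, z, np.array([1.0, 0.0]))
# Weights that rely only on the independent column give a covariance near zero
cov_independent = boundary_covariance(x, z, np.array([0.0, 1.0]))
print(cov_dependent, cov_independent)
```

A threshold of 0 therefore pushes the optimizer toward weights whose scores do not vary linearly with `race`, which can cost accuracy when the informative features are correlated with the sensitive attribute.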

In [7]:
apply_fairness_constraints = 1
apply_accuracy_constraint = 0
sep_constraint = 0
gamma = None
sensitive_attrs = ['race']
sensitive_attrs_to_cov_thresh = {'race': 0}
x_control = {'race': race_train}

np.random.seed(100)
w = ut.train_model(x_train,
                y_train,
                x_control,
                lf._logistic_loss,
                apply_fairness_constraints,
                apply_accuracy_constraint,
                sep_constraint,
                sensitive_attrs,
                sensitive_attrs_to_cov_thresh,
                gamma)

In [8]:
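# Load the learned weights into an sklearn LogisticRegression so we can reuse score() and predict()
# w holds one weight per column of x_train; no intercept column was added to x_train,
# so this model has no bias term and the intercept is fixed at 0
m = LogisticRegression()
m.coef_ = w.reshape((1,-1))
m.intercept_ = np.zeros(1)
m.classes_ = np.array([0, 1])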
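
Because the intercept is fixed at zero, the predictions of this model reduce to checking whether `x @ w` is positive. The sketch below shows that decision rule in plain NumPy on hypothetical weights and inputs (`w_demo`, `x_demo`, `y_demo`); it illustrates the rule only and is not the output of `ut.train_model`.

```python
import numpy as np

def predict_linear(x, w, intercept=0.0):
    """Predict class 1 when the linear score is strictly positive, else class 0."""
    return (x @ w + intercept > 0).astype(int)

w_demo = np.array([0.5, -1.0])   # hypothetical weights
x_demo = np.array([[2.0, 0.5],   # score  0.5 -> class 1
                   [1.0, 1.0],   # score -0.5 -> class 0
                   [0.0, 0.0]])  # score  0.0 -> class 0 (not strictly positive)
y_demo = np.array([1, 0, 1])     # hypothetical true labels

pred = predict_linear(x_demo, w_demo)
accuracy = np.mean(pred == y_demo)
print(pred, accuracy)
```

Without a bias term the decision boundary is forced through the origin, which limits how the classifier can trade accuracy against the fairness constraint.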

In [9]:
# Evaluate the p-rule once per split; each call returns (ratio, protected rate, not-protected rate)
p_train_clr = p_rule(race_train, m.predict(x_train))
p_test_clr = p_rule(race_test, m.predict(x_test))
print_results_clr = {"Set": ["Train", "Test"],
                 "Accuracy (%)": [m.score(x_train, y_train)*100, m.score(x_test, y_test)*100],
                 "P-rule (%)": [p_train_clr[0]*100, p_test_clr[0]*100],
                 "Protected (%)": [p_train_clr[1]*100, p_test_clr[1]*100],
                 "Not protected (%)": [p_train_clr[2]*100, p_test_clr[2]*100]}
pd.DataFrame(print_results_clr)

| | Set | Accuracy (%) | P-rule (%) | Protected (%) | Not protected (%) |
|---|---|---|---|---|---|
| 0 | Train | 48.454106 | 99.939857 | 99.819059 | 99.87913 |
| 1 | Test | 46.084507 | 99.955856 | 99.861111 | 99.905213 |


## Support Vector Machine (SVM)

### Training unconstrained classifier

In [10]:
# @Chang, @Jiaxin and @Ryan: Code goes here 



### Optimizing classifier accuracy subject to fairness constraints

In [11]:
# @Chang, @Jiaxin and @Ryan: Code goes here 



## Information Theoretic Measures for Fairness-aware Feature Selection (FFS)

In [12]:
# @Zaigham: Code goes here



## References

+ https://towardsdatascience.com/optimization-with-scipy-and-application-ideas-to-machine-learning-81d39c7938b8
+ https://github.com/mbilalzafar/fair-classification/tree/master/disparate_impact
+ https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis