# ADS Project 4: Machine Learning Fairness
## Spring 2022

Chang Lu, Jiaxin Yu, Marcus Loke, Xiran Lin, Zaigham Khan

## Overview

+ The cleaned COMPAS dataset is provided in `../output/compas-scores-two-years(cleaned).csv`. The EDA and cleaning process is described in `../doc/eda_cleaning.html` and `eda_cleaning.Rmd`.


+ Our team focused on three algorithms aimed at ensuring machine learning fairness. The algorithms are: maximizing accuracy under fairness constraints using C-LR and C-SVM (A2) and information theoretic measures for fairness-aware feature selection (FFS) (A7).

## Load modules and data

In [1]:
import numpy as np
import pandas as pd
import utils as ut
import loss_funcs as lf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('../output/compas-scores-two-years(cleaned).csv')
df

Unnamed: 0,sex,age_cat,race,priors_count,c_charge_degree,length_of_stay,two_year_recid
0,Male,25 - 45,African-American,-0.733607,F,-0.177294,1
1,Male,< 25,African-American,0.055928,F,-0.350235,1
2,Male,25 - 45,Caucasian,2.029767,F,-0.254156,1
3,Female,25 - 45,Caucasian,-0.733607,M,-0.311803,0
4,Male,< 25,Caucasian,-0.536224,F,-0.350235,1
...,...,...,...,...,...,...,...
5910,Male,25 - 45,African-American,-0.733607,M,-0.350235,1
5911,Male,< 25,African-American,-0.733607,F,-0.350235,0
5912,Male,< 25,African-American,-0.733607,F,-0.331019,0
5913,Male,< 25,African-American,-0.733607,F,-0.331019,0


Encode categorical variables with dummy variables:
+ `sex`: 1 for male and 0 for female
+ `age_cat`: 2 for > 45, 1 for 25 - 45 and 0 for < 25
+ `race`: 1 for caucasian and 0 for african-american
+ `c_charge_degree`: 1 for F and 0 for M

In [3]:
df['sex'] = df['sex'].apply(lambda sex: 0 if sex == 'Female' else 1)
df['age_cat'] = df['age_cat'].apply(lambda age_cat: 2 if age_cat == '> 45' else(1 if age_cat == '25 - 45' else 0))
df['race'] = df['race'].apply(lambda race: 0 if race == 'African-American' else 1)
df['c_charge_degree'] = df['c_charge_degree'].apply(lambda c_charge_degree: 0 if c_charge_degree == 'M' else 1)
df

Unnamed: 0,sex,age_cat,race,priors_count,c_charge_degree,length_of_stay,two_year_recid
0,1,1,0,-0.733607,1,-0.177294,1
1,1,0,0,0.055928,1,-0.350235,1
2,1,1,1,2.029767,1,-0.254156,1
3,0,1,1,-0.733607,0,-0.311803,0
4,1,0,1,-0.536224,1,-0.350235,1
...,...,...,...,...,...,...,...
5910,1,1,0,-0.733607,0,-0.350235,1
5911,1,0,0,-0.733607,1,-0.350235,0
5912,1,0,0,-0.733607,1,-0.331019,0
5913,1,0,0,-0.733607,1,-0.331019,0


Shuffle the data and split into train and test sets.

In [4]:
#x_train = train.loc[:, df.columns!='two_year_recid']
#y_train = train.loc[:, df.columns=='two_year_recid']
#x_control = {'race': x_train['race'].to_list()} # dictionary of the type {"s": [...]}; key "s" is the sensitive feature name

features = ['sex', 'age_cat', 'race', 'priors_count', 'c_charge_degree', 'length_of_stay']
target = 'two_year_recid'

train, test = train_test_split(df, test_size = 0.3, random_state = 100)
x_train = train.loc[:len(train), features]
y_train = train.loc[:len(train), target]
#x_test = test.loc[len(test):, features]
#y_test = test.loc[len(test):, to_predict]
x_control = {'race': x_train['race'].to_list()}

## Logistic regression (LR)

### Training an unconstrained LR on the biased data

In [5]:
# all constraint flags are set to 0 since we want to train an unconstrained (original) classifier
apply_fairness_constraints = 0
apply_accuracy_constraint = 0
sep_constraint = 0 # apply the fine grained accuracy constraint
sensitive_attrs = ['race'] # list of sensitive features for which to apply fairness constraint
sensitive_attrs_to_cov_thresh = {'race': 0}
gamma = 0 # controls the loss in accuracy we are willing to incur when using apply_accuracy_constraint and sep_constraint

# Train model
w = ut.train_model(x_train.to_numpy(),
                y_train.to_numpy(),
                x_control,
                lf._logistic_loss,
                apply_fairness_constraints,
                apply_accuracy_constraint,
                sep_constraint,
                sensitive_attrs,
                sensitive_attrs_to_cov_thresh,
                gamma)

In [6]:
w

array([ 85.22769084,  61.36409916, -35.24083563, -26.82642904,
        47.85521734,  -0.6037243 ])

### Optimizing LR accuracy subject to fairness constraints