# Reimplementing the Adversarially Reweighted Learning model by Lahoti et al. (2020) to improve fairness without demographics



This notebook contains the results presented in the paper by J. Mohazzab, L. Weytingh, C. Wortmann, and B. Brocades Zaalberg, which replicates [the paper by Lahoti et al.](https://arxiv.org/abs/2006.13114). It also includes the significance tests presented in Section 3.4.1 of the replication paper.

In [1]:
import copy
import time
import train
from argparser import DefaultArguments, get_optimal_parameters
from significance import test_significance

### Default Parameters

The default parameters are loaded below. They can be changed, e.g. to speed up training.

In [2]:
# Load the default arguments
default_args = DefaultArguments()

# Set to True to print the loss during training.
default_args.print_loss = False

# Change the number of runs over which the results are averaged here.
default_args.average_over = 10

# Change the number of training steps for each of the datasets here.
training_steps = {
    "uci_adult": 990,
    "law_school": 990,
    "compas": 470,
}

print("The default parameters are:\n", default_args.__dict__)

The default parameters are:
 {'average_over': 10, 'dataset': 'compas', 'train_steps': 1000, 'pretrain_steps': 250, 'batch_size': 32, 'optimizer': 'Adagrad', 'embedding_size': 32, 'lr_learner': 0.01, 'lr_adversary': 0.01, 'test_every': 5, 'seed': 42, 'log_dir': 'logs/', 'res_dir': 'results/', 'print_loss': False, 'model_name': 'ARL'}


## Replicability

The presented results for the PyTorch implementation are generated below for each classification task.
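
Each run below reports four numbers: the overall AUC, AUC(macro-avg), AUC(min), and AUC(minority). As a rough guide to reading them, the sketch below assumes these names carry the meanings used by Lahoti et al.: the mean of the per-group AUCs, the lowest per-group AUC, and the AUC of the smallest group. The functions `auc` and `subgroup_auc_summary` and the `groups` argument are hypothetical and written only for illustration; the code in `train` may compute the metrics differently.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroup_auc_summary(labels, scores, groups):
    """Summarise AUC overall and per group (hypothetical helper).

    `groups` holds one group identifier per example; this is an assumption
    about how subgroup membership could be represented.
    """
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)

    group_auc = {g: auc(ys, ss) for g, (ys, ss) in by_group.items()}
    smallest = min(by_group, key=lambda g: len(by_group[g][0]))

    return {
        "auc": auc(labels, scores),
        "auc_macro_avg": sum(group_auc.values()) / len(group_auc),
        "auc_min": min(group_auc.values()),
        "auc_minority": group_auc[smallest],
    }


# Toy data, made up for illustration only (not taken from any dataset here).
toy_labels = [1, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.4, 0.6]
toy_groups = ["a", "a", "a", "a", "b", "b"]
toy_summary = subgroup_auc_summary(toy_labels, toy_scores, toy_groups)
print(toy_summary)
```

In the toy data, group `b` is both the smallest and the worst-served group, so AUC(min) and AUC(minority) coincide. In the results below they differ, which indicates that the smallest group is not the one with the lowest AUC.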

### Adult dataset

In [3]:
# Load the optimal hyperparameters.
adult_params = get_optimal_parameters("uci_adult")
adult_params["train_steps"] = training_steps["uci_adult"]

# Load the arguments passed to the training function.
adult_args = copy.copy(default_args)
adult_args.dataset = "uci_adult"
adult_args.update(adult_params)

print("Parameters used for the Adult dataset:\n", adult_params)

Parameters used for the Adult dataset:
 {'batch_size': 256, 'lr_learner': 0.01, 'lr_adversary': 1, 'train_steps': 990}


In [4]:
# Start timing.
adult_start = time.time()

# Train the model.
train.main(adult_args)

# Save the timing results.
adult_time = (time.time() - adult_start) / adult_args.average_over
print(f"Training and evaluating took, on average, {adult_time:.0f} seconds per model iteration for Adult")

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.904 ± 0.0020
Average AUC(macro-avg): 0.914
Average AUC(min): 0.878
Average AUC(minority): 0.949
-----------------------------------

Training and evaluating took, on average, 42 seconds per model iteration for Adult


### LSAC dataset

In [5]:
# Load the optimal hyperparameters.
lsac_params = get_optimal_parameters("law_school")
lsac_params["train_steps"] = training_steps["law_school"]

# Load the arguments passed to the training function.
lsac_args = copy.copy(default_args)
lsac_args.dataset = "law_school"
lsac_args.update(lsac_params)

print("Parameters used for the LSAC dataset:\n", lsac_params)

Parameters used for the LSAC dataset:
 {'batch_size': 256, 'lr_learner': 0.1, 'lr_adversary': 0.01, 'train_steps': 990}


In [6]:
# Start timing.
lsac_start = time.time()

# Train the model.
train.main(lsac_args)

# Save the timing results.
lsac_time = (time.time() - lsac_start) / lsac_args.average_over
print(f"Training and evaluating took, on average, {lsac_time:.0f} seconds per model iteration for LSAC")

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.820 ± 0.0091
Average AUC(macro-avg): 0.817
Average AUC(min): 0.795
Average AUC(minority): 0.829
-----------------------------------

Training and evaluating took, on average, 25 seconds per model iteration for LSAC


### COMPAS dataset

In [7]:
# Load the optimal hyperparameters.
compas_params = get_optimal_parameters("compas")
compas_params["train_steps"] = training_steps["compas"]

# Load the arguments passed to the training function.
compas_args = copy.copy(default_args)
compas_args.dataset = "compas"
compas_args.update(compas_params)

print("Parameters used for the COMPAS dataset:\n", compas_params)

Parameters used for the COMPAS dataset:
 {'batch_size': 32, 'lr_learner': 0.01, 'lr_adversary': 1, 'train_steps': 470}


In [8]:
# Start timing.
compas_start = time.time()

# Train the model.
train.main(compas_args)

# Save the timing results.
compas_time = (time.time() - compas_start) / compas_args.average_over
print(f"Training took, on average, {compas_time:.0f} seconds per model iteration for COMPAS")

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.721 ± 0.0065
Average AUC(macro-avg): 0.702
Average AUC(min): 0.616
Average AUC(minority): 0.754
-----------------------------------

Training took, on average, 9 seconds per model iteration for COMPAS


### Average runtime

In [9]:
avg_runtime = sum([adult_time, lsac_time, compas_time]) / 3
print(f"The average runtime per model iteration is {avg_runtime:.0f} seconds.")

The average runtime per model iteration is 25 seconds.


## Significance testing

The significance of the results of the ARL model is tested against a baseline model, i.e. the same model trained without the adversary.
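
The notebook does not show which statistical test `test_significance` applies, so the following is only a sketch of one possible way to compare the per-run AUCs of two models: a two-sided permutation test on the difference in means. The function `permutation_p_value` and the example AUC lists are hypothetical and are not taken from the `significance` module or from the results in this notebook.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=42):
    """Two-sided permutation test on the difference in means (illustrative).

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if both samples came from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        # Small tolerance guards against floating-point noise.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-run AUC values for illustration only.
example_arl_aucs = [0.721, 0.715, 0.730, 0.718, 0.725]
example_baseline_aucs = [0.720, 0.717, 0.728, 0.719, 0.724]
example_p = permutation_p_value(example_arl_aucs, example_baseline_aucs)
print(f"Illustrative p-value: {example_p:.3f}")
```

A p-value close to 1, as reported in the cells below, means the observed difference between the two models is well within what random variation between runs would produce.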

### Adult dataset

In [10]:
# Train the baseline model, i.e. the model without the adversary.
adult_args.model_name = "baseline"
train.main(adult_args)

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.904 ± 0.0021
Average AUC(macro-avg): 0.913
Average AUC(min): 0.877
Average AUC(minority): 0.948
-----------------------------------



In [11]:
# Test the significance.
test_significance("uci_adult", adult_args.res_dir)

The p-value for uci_adult is 0.963
This difference is not significant


### LSAC dataset

In [12]:
# Train the baseline model, i.e. the model without the adversary.
lsac_args.model_name = "baseline"
train.main(lsac_args)

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.820 ± 0.0075
Average AUC(macro-avg): 0.818
Average AUC(min): 0.798
Average AUC(minority): 0.834
-----------------------------------



In [13]:
# Test the significance.
test_significance("law_school", lsac_args.res_dir)

The p-value for law_school is 0.977
This difference is not significant


### COMPAS dataset

In [14]:
# Train the baseline model, i.e. the model without the adversary.
compas_args.model_name = "baseline"
train.main(compas_args)

Training model 1/10
Training model 2/10
Training model 3/10
Training model 4/10
Training model 5/10
Training model 6/10
Training model 7/10
Training model 8/10
Training model 9/10
Training model 10/10
Done training

-----------------------------------
Results

Average AUC: 0.721 ± 0.0072
Average AUC(macro-avg): 0.702
Average AUC(min): 0.624
Average AUC(minority): 0.748
-----------------------------------



In [15]:
# Test the significance.
test_significance("compas", compas_args.res_dir)

The p-value for compas is 0.977
This difference is not significant
