# DR vs GBR: Distributional Repair & Group-Blind Repair (Adult dataset)

This notebook compares two dataset-repair approaches:

- **Distributional Repair (DR)**, group-aware: learns two optimal-transport (OT) maps, one for S=0 and one for S=1, to a shared barycenter, conditional on U.

- **Group-Blind Repair (GBR)**, group-blind: learns one OT map from the pooled distribution to a target (we align this target with DR's barycenter so the comparison is meaningful). Variants:

    - `baseline` (no fairness vector)

    - `partial` / `total` (use fairness vector *V*), optional.
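To make the group-aware idea concrete, here is a minimal one-dimensional sketch of mapping two groups onto a shared barycenter. It is not the `DistributionalRepair` implementation: it ignores the conditioning on U, assumes equal barycenter weights for the two groups, and relies on the fact that in 1-D the Wasserstein-2 barycenter's quantile function is the weighted average of the group quantile functions. The function name `barycenter_repair_1d` and the grid size `n_q` are hypothetical.

```python
import numpy as np

def barycenter_repair_1d(x, s, n_q=101):
    """Map each group's 1-D samples onto an equal-weight W2 barycenter.

    x: 1-D float array of feature values; s: 0/1 array of group labels.
    """
    qs = np.linspace(0.0, 1.0, n_q)       # shared grid of quantile levels
    q0 = np.quantile(x[s == 0], qs)       # empirical quantiles of group S=0
    q1 = np.quantile(x[s == 1], qs)       # empirical quantiles of group S=1
    q_bar = 0.5 * (q0 + q1)               # barycenter quantiles (equal weights: an assumption)

    repaired = np.empty_like(x, dtype=float)
    for g, q_g in ((0, q0), (1, q1)):
        m = s == g
        levels = np.interp(x[m], q_g, qs)           # approximate F_g(x) within the group
        repaired[m] = np.interp(levels, qs, q_bar)  # push through the barycenter quantiles
    return repaired
```

Each group gets its own map, which is why S must be known when the repair is applied. A group-blind method instead has to use a single map for all samples.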

**Data:** Adult (AIF360).  
**Protected (S):** sex (0 = Female, 1 = Male)  
**Unprotected (U):** college_educated (0/1)  
**Features (X):** age, hours-per-week (continuous)

We compute the U-mean KLD, the KL divergence between P(X|S=1,U) and P(X|S=0,U) averaged over the values of *U*, on:

- **Research** split (used to learn transport)

- **Archive** split (held-out to check generalization)

Smaller values indicate closer group distributions conditional on *U* (0 = identical).
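The metric can be sketched for a single continuous feature as follows. This is an illustration under stated assumptions, not the notebook's evaluation code: it estimates densities with a shared histogram rather than a kernel density estimate, takes an unweighted mean over the observed values of U, and the names `kl_hist`, `u_mean_kld`, `bins`, and `eps` are hypothetical.

```python
import numpy as np

def kl_hist(x1, x0, bins=20, eps=1e-9):
    """Histogram estimate of KL(P1 || P0) for 1-D samples on a shared grid."""
    lo = min(x1.min(), x0.min())
    hi = max(x1.max(), x0.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(x1, bins=edges)
    q, _ = np.histogram(x0, bins=edges)
    # eps keeps empty bins from producing log(0) or division by zero
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def u_mean_kld(x, s, u, bins=20):
    """Mean over values of U of KL(P(X|S=1,U=u) || P(X|S=0,U=u))."""
    per_u = []
    for u_val in np.unique(u):
        in_u = u == u_val
        per_u.append(kl_hist(x[in_u & (s == 1)], x[in_u & (s == 0)], bins))
    return float(np.mean(per_u))
```

Calling `u_mean_kld` on the same feature before and after repair, once on the research split and once on the archive split, gives the kind of comparison described above. The histogram estimate is sensitive to `bins`, so values are only comparable when the binning is held fixed.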

## Imports & config

In [1]:
import time
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ot

from sklearn.neighbors import KernelDensity
from aif360.datasets import AdultDataset

from humancompatible.repair.distributional_repair import DistributionalRepair
from humancompatible.repair.group_blind_repair import GroupBlindRepair

np.random.seed(0)

## Dataset loader

In [2]:
def load_adult_dataset(s, u, x, y):
    """Load Adult via AIF360, keeping protected s, unprotected u, features x (a list) and label y."""
    def custom_preprocessing(df):
        pd.set_option('future.no_silent_downcasting', True)

        # Binarize race: 1.0 for "White", 0.0 otherwise
        df['race'] = (df['race'] == "White").astype(float)

        # Encode 'sex' column as numerical values
        df['sex'] = df['sex'].map({'Female': 0.0, 'Male': 1.0})

        df['Income Binary'] = df['income-per-year']
        df['Income Binary'] = df['Income Binary'].replace(to_replace='>50K.', value=1, regex=True)
        df['Income Binary'] = df['Income Binary'].replace(to_replace='>50K', value=1, regex=True)
        df['Income Binary'] = df['Income Binary'].replace(to_replace='<=50K.', value=0, regex=True)
        df['Income Binary'] = df['Income Binary'].replace(to_replace='<=50K', value=0, regex=True)
        # 1 if education-num is greater than 9, 0 otherwise
        df['college_educated'] = (df['education-num'] > 9).astype(int)

        # Drop rows with missing values
        df = df.dropna()

        return df

    adult = AdultDataset(
        label_name=y,
        favorable_classes=[1],
        protected_attribute_names=[s],
        privileged_classes=[[1.0]],
        instance_weights_name=None,
        categorical_features=[],
        features_to_keep=[s]+[u]+x,
        na_values=[],
        custom_preprocessing=custom_preprocessing,
        features_to_drop=[],
        metadata={}
    )
    return adult