# Resample imbalanced dataset

Datasets with multiple sensitive attributes often display some level of class imbalance. It is often helpful to address this imbalance before model training, but balancing multiple sensitive attributes can be more challenging than the traditional problem of balancing a single class. Because the sensitive attributes form a multi-label problem, it is not straightforward to reduce the imbalance in one label without accidentally making it worse in another. As a result, multi-label resamplers can generally reduce the class imbalance but not remove it entirely.
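To make the coupling between labels concrete, here is a minimal sketch on a hypothetical toy dataset (the `toy` frame and its values are invented for illustration and are not part of the package or the synthetic dataset used later). Every female row in it is also caucasian, so duplicating female rows to balance `Sex` necessarily shifts the `Race` distribution.

```python
import pandas as pd

# Hypothetical toy data: all 'female' rows happen to be 'caucasian'.
toy = pd.DataFrame({
    "Sex": ["male"] * 6 + ["female"] * 2,
    "Race": ["african-american"] * 3 + ["caucasian"] * 3 + ["caucasian"] * 2,
})

# Naively balance Sex by appending two extra copies of the female rows.
females = toy[toy.Sex == "female"]
balanced_sex = pd.concat([toy, females, females], ignore_index=True)

sex_counts = balanced_sex.Sex.value_counts().to_dict()
race_counts = balanced_sex.Race.value_counts().to_dict()
print(f"Sex: {sex_counts}")
print(f"Race: {race_counts}")
```

In this toy case `Sex` ends up perfectly balanced at 6 and 6, while caucasian now outnumbers african-american 9 to 3 instead of 5 to 3.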

This package includes multiple methods that attempt to address this dilemma. Here, we walk through one of the simpler methods (multi-label random oversampling) using some synthetic data.

In [2]:
# Required for notebook path to be at head of project for torch_fairness imports
import os
import sys
sys.path.insert(0, os.path.abspath('../../..'))
os.chdir('../../..')

In [40]:
import os
from typing import List

import pandas as pd
import numpy as np

from torch_fairness.resampling import MLROS
from torch_fairness.resampling import imbalance_ratio
from torch_fairness.data import SensitiveMap
from torch_fairness.data import SensitiveTransformer

In [47]:
def print_sample_sizes(data: pd.DataFrame) -> None:
    print(f"Sample size: {data.shape[0]}")
    print(f"Sex: {data.Sex.value_counts().to_dict()}")
    print(f"Race: {data.Race.value_counts().to_dict()}")

In [48]:
def print_imbalance(data: pd.DataFrame) -> None:
    sample_sizes = np.array([*data.Sex.value_counts().to_dict().values(), *data.Race.value_counts().to_dict().values()])
    print(f"Imbalance ratios: {imbalance_ratio(sample_sizes=sample_sizes)}")
    print(f"Mean imbalance ratios: {imbalance_ratio(sample_sizes=sample_sizes).mean()}")

## Data

This synthetic dataset involves two sensitive attributes: `Race` and `Sex`.

In [49]:
feature_cols = ['X0', 'X1', 'X2']
sensitive_cols = ['Race', 'Sex']
data = pd.read_csv(os.path.join('datasets', 'synthetic_imbalanced_labels.csv'))
print_sample_sizes(data)

Sample size: 200
Sex: {'male': 158, 'female': 42}
Race: {'african-american': 97, 'caucasian': 66, 'hispanic': 27}


## Resampling

One of the simplest resampling methods is the multi-label random oversampler (MLROS), which identifies the groups with the greatest class imbalance and oversamples them until either (a) they are no longer marked as imbalanced or (b) the maximum number of clones has been created.
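The loop described above can be sketched in plain NumPy. This is a simplified illustration, not the `torch_fairness` implementation: the function names, the use of the mean imbalance ratio as the "imbalanced" threshold, and the toy indicator matrix are all assumptions made for the example. It also assumes every label occurs at least once.

```python
import numpy as np

def imbalance_ratios(labels: np.ndarray) -> np.ndarray:
    """Largest label count divided by each label's count (assumed definition)."""
    counts = labels.sum(axis=0)
    return counts.max() / counts

def random_oversample_sketch(labels, max_clone_percentage=0.5, seed=0):
    """Return row indices (originals plus clones) for a binary indicator matrix."""
    rng = np.random.default_rng(seed)
    indices = list(range(labels.shape[0]))
    # (b) Cap the number of clones at a fraction of the original sample size.
    budget = int(labels.shape[0] * max_clone_percentage)
    while budget > 0:
        ratios = imbalance_ratios(labels[indices])
        # (a) Labels whose ratio exceeds the mean are treated as imbalanced.
        minority = np.flatnonzero(ratios > ratios.mean())
        if minority.size == 0:
            break
        for label in minority:
            if budget == 0:
                break
            # Clone one random original row that carries this label.
            candidates = np.flatnonzero(labels[:, label] == 1)
            indices.append(int(rng.choice(candidates)))
            budget -= 1
    return np.array(indices)

# Hypothetical indicator matrix; columns: male, female, group_a, group_b.
toy_labels = np.array(
    [[1, 0, 1, 0]] * 5 + [[1, 0, 0, 1]] + [[0, 1, 1, 0]] + [[0, 1, 0, 1]]
)
resampled_idx = random_oversample_sketch(toy_labels, max_clone_percentage=0.5, seed=1)
print("Before:", imbalance_ratios(toy_labels))
print("After: ", imbalance_ratios(toy_labels[resampled_idx]))
```

Because a cloned row carries all of its labels, each clone can also raise the count of a label that was not being targeted, which is why the loop re-evaluates the ratios after every pass.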

In [23]:
sensitive_map = SensitiveMap.infer(data[sensitive_cols], minimum_sample_size=15)
sensitive_transformer = SensitiveTransformer(sensitive_map=sensitive_map, minimum_sample_size=15)
sensitive_transformer.fit(data[sensitive_cols])
dummy_sensitive = sensitive_transformer.transform(data[sensitive_cols])

In [50]:
resampler = MLROS(max_clone_percentage=0.5, random_state=1, sample_size_chunk=1)
new_data = resampler.balance(labels=dummy_sensitive, features=data[feature_cols])
new_data = pd.DataFrame(sensitive_transformer.inverse_transform(new_data['labels']), columns=sensitive_cols)
print_sample_sizes(new_data)

Sample size: 300
Sex: {'male': 199, 'female': 101}
Race: {'african-american': 123, 'hispanic': 91, 'caucasian': 73}


While the sample sizes look a bit better after the resampling, it can sometimes be challenging to tell by eye. A commonly used measure in the multi-label imbalance literature is the imbalance ratio, where the smallest possible value is 1 and larger values indicate more imbalance. It can be useful for understanding and comparing the impact of different resampling methods.
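As a rough sketch of the measure, assume the imbalance ratio of a group is the largest group count divided by that group's count, so the largest group always scores 1. The helper `imbalance_ratio_sketch` below is a hypothetical stand-in written for this example, not the package's `imbalance_ratio`. The counts are the original `Sex` and `Race` group sizes printed earlier.

```python
import numpy as np

def imbalance_ratio_sketch(sample_sizes) -> np.ndarray:
    """Largest group count divided by each group's count (assumed definition)."""
    sample_sizes = np.asarray(sample_sizes, dtype=float)
    return sample_sizes.max() / sample_sizes

# Group sizes from the original data: male, female, african-american, caucasian, hispanic.
counts = np.array([158, 42, 97, 66, 27])
ratios = imbalance_ratio_sketch(counts)
print(f"Imbalance ratios: {ratios}")
print(f"Mean imbalance ratio: {ratios.mean()}")
```

Under this assumption the sketch agrees with the values the notebook reports for the original data.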

In [54]:
print('Original data')
print_imbalance(data)
print('\nResampled data')
print_imbalance(new_data)

Original data
Imbalance ratios: [1.         3.76190476 1.62886598 2.39393939 5.85185185]
Mean imbalance ratios: 2.9273123974154904

Resampled data
Imbalance ratios: [1.         1.97029703 1.61788618 2.18681319 2.7260274 ]
Mean imbalance ratios: 1.9002047585276436


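Comparing the imbalance ratios, we see a substantial improvement for the two smallest groups: female falls from 3.76 to 1.97 and hispanic from 5.85 to 2.19. (The ratios follow the `value_counts` ordering, so hispanic and caucasian swap positions in the resampled output.) This is the expected behavior, as MLROS oversamples groups whose imbalance ratio falls above a certain threshold. The caucasian ratio rises slightly, from 2.39 to 2.73, because the largest group (male) also grew. This illustrates why multi-label resampling reduces imbalance without removing it entirely, even though the mean imbalance ratio drops from 2.93 to 1.90.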