# Folktables: Real-World Bias Detection with MSD (1)

In this notebook we load real-world American Community Survey (ACS) data (via [folktables](https://github.com/socialfoundations/folktables) and the [Census Bureau's ACS program](https://www.census.gov/programs-surveys/acs/data.html "American Community Survey data")) and use **Maximum Subgroup Discrepancy (MSD)** to:

1. Find the subgroup that is most disproportionately represented between outcome classes within a given state (e.g., on the ACS Income > \$50 000 classification task)  
_This notebook: `02_folktables_within-state.ipynb`_

2. Find the subgroup where the population of two states differs *most*  
_Next notebook: `03_folktables_cross-state.ipynb`_

**MSD** finds the most disproportionately represented intersection of protected groups (e.g., age×marital-status×race…). It returns both the **value** (probability gap) and an **interpretable rule** (a conjunction of feature-value pairs).
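To make the definition concrete, here is a minimal brute-force sketch on made-up toy data. It enumerates every conjunction of feature-value pairs and keeps the one with the largest absolute gap between the subgroup's share among positives and its share among negatives. The helper names (`subgroup_gap`, `brute_force_msd`) and the data are hypothetical. This is only an illustration of the quantity being maximised, not how `humancompatible.detect` searches (the notebook configures a solver for that).

```python
from itertools import combinations, product


def subgroup_gap(rows, labels, rule):
    """Signed gap: subgroup share among positives minus its share among negatives."""
    in_group = [all(row[k] == v for k, v in rule.items()) for row in rows]
    pos = [g for g, y in zip(in_group, labels) if y == 1]
    neg = [g for g, y in zip(in_group, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def brute_force_msd(rows, labels, features):
    """Try every conjunction of feature=value pairs; return (largest |gap|, rule)."""
    values = {f: sorted({row[f] for row in rows}) for f in features}
    best_val, best_rule = 0.0, {}
    for size in range(1, len(features) + 1):
        for feats in combinations(features, size):
            for combo in product(*(values[f] for f in feats)):
                rule = dict(zip(feats, combo))
                gap = abs(subgroup_gap(rows, labels, rule))
                if gap > best_val:
                    best_val, best_rule = gap, rule
    return best_val, best_rule


# Toy data, invented for illustration only (codes mimic SEX and RAC1P).
rows = [
    {"SEX": 1, "RAC1P": 1}, {"SEX": 1, "RAC1P": 1}, {"SEX": 1, "RAC1P": 2},
    {"SEX": 2, "RAC1P": 1}, {"SEX": 2, "RAC1P": 2}, {"SEX": 2, "RAC1P": 1},
    {"SEX": 1, "RAC1P": 1}, {"SEX": 2, "RAC1P": 2},
]
labels = [1, 1, 0, 0, 0, 0, 1, 1]

msd_value, msd_rule = brute_force_msd(rows, labels, ["SEX", "RAC1P"])
print(msd_value, msd_rule)
```

Exhaustive enumeration grows exponentially with the number of protected attributes, so it is only practical for tiny examples like this one.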


## Configuration & Imports

In [1]:
import numpy as np
import pandas as pd

# We'll run the Income > 50k task for each state:
from folktables import ACSIncome

from humancompatible.detect import detect_and_score
from humancompatible.detect import most_biased_subgroup, evaluate_biased_subgroup
from humancompatible.detect import most_biased_subgroup_two_samples, evaluate_biased_subgroup_two_samples
from humancompatible.detect.helpers import report_subgroup_bias

from humancompatible.detect.helpers.prepare import prepare_dataset
from humancompatible.detect.helpers.utils import signed_subgroup_discrepancy
from humancompatible.detect.methods.msd.mapping_msd import subgroup_map_from_conjuncts_dataframe
from humancompatible.detect.helpers.utils import signed_subgroup_prevalence_diff

from supports.folktables_utils import (
    load_state_data,
    ProtectedOnly,
    FEATURE_PROCESSING,
    CONTINUOUS_FEATURES, 
    FEATURE_NAMES, 
    PROTECTED_VALUES_MAP,
)

import logging
logging.basicConfig(level=logging.INFO, format="[%(levelname)s] %(message)s")

## PUMS Data Dictionary

For a complete list of all ACS PUMS variables, their codes and labels (e.g. every state code for `POBP`, every education level for `SCHL`, etc.), see the official 2018 PUMS Data Dictionary:

> **PUMS Data Dictionary (2018)**  
> https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf

## State selection

In [2]:
# ────────── Which two states to compare ──────────
state1, state2 = "FL", "NH" # Florida and New Hampshire

## Global Configuration

In [3]:
# ────────── Which columns are protected ──────────
print("Features can be selected from:", ACSIncome.features)
print("Explanations of their abbreviations can be found below.")
protected_attrs = ['AGEP', 'SCHL', 'OCCP', 'SEX', 'RAC1P']

# ────────── MSD / Solver settings ──────────
n_samples = 1_000   # number of rows to subsample for faster computation
method = "MSD"
seed = 42  # fixed for demo purposes
method_kwargs = {
    "solver": "gurobi",  # comment out if you don't have a license
}

Features can be selected from: ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']
Explanations of their abbreviations can be found below.
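The `n_samples` and `seed` settings above control a reproducible subsample. The sketch below shows the general idea with the standard library only. `subsample_indices` is a hypothetical helper, not part of `humancompatible.detect`, and the library's own sampling (which the later cells seed through NumPy) will pick different rows.

```python
import random


def subsample_indices(n_rows, n_samples, seed):
    """Pick n_samples distinct row indices; the same seed gives the same rows."""
    if n_samples >= n_rows:
        return list(range(n_rows))  # nothing to subsample
    rng = random.Random(seed)       # local generator, leaves global state alone
    return sorted(rng.sample(range(n_rows), n_samples))


# 98_925 is the Florida row count reported further down in this notebook.
idx_a = subsample_indices(98_925, 1_000, seed=42)
idx_b = subsample_indices(98_925, 1_000, seed=42)
print(len(idx_a), idx_a == idx_b)
```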


In [4]:
# ────────── Utility Setup ──────────
display(FEATURE_NAMES)
continuous_feats = [f for f in ACSIncome.features if f in CONTINUOUS_FEATURES]
feature_map = FEATURE_PROCESSING  # optional custom binning - to reduce the number of bins

{'SEX': 'Sex',
 'RAC1P': 'Race',
 'AGEP': 'Age',
 'MAR': 'Marital status',
 'POBP': 'Place of birth',
 'DIS': 'Disability',
 'CIT': 'Citizenship',
 'MIL': 'Military service',
 'ANC': 'Ancestry',
 'NATIVITY': 'Foreign or US native',
 'DEAR': 'Difficulty hearing',
 'DEYE': 'Difficulty seeing',
 'DREM': 'Cognitive difficulty',
 'FER': 'Gave birth last year',
 'POVPIP': 'Income / Poverty threshold',
 'COW': 'Class of worker',
 'SCHL': 'Educational attainment',
 'OCCP': 'Occupation recode',
 'RELP': 'Relationship',
 'WKHP': 'Usual hours worked per week past 12 months'}
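The cell above describes `feature_map` as optional custom binning. As a rough illustration of what such binning could look like, the sketch below assumes a map from column name to a function applied to each raw value. The actual structure of `FEATURE_PROCESSING` in `supports.folktables_utils` is not shown in this notebook and may differ. `age_decade`, `example_feature_map` and `apply_feature_map` are hypothetical names.

```python
def age_decade(age):
    """Collapse an age in years to the start of its decade (27 -> 20)."""
    return int(age // 10) * 10


# Hypothetical map: column name -> function applied to every raw value.
example_feature_map = {"AGEP": age_decade}


def apply_feature_map(rows, feature_map):
    """Return new rows; columns without an entry in the map pass through unchanged."""
    return [
        {col: feature_map[col](val) if col in feature_map else val
         for col, val in row.items()}
        for row in rows
    ]


rows = [
    {"AGEP": 20.0, "SEX": 1.0},
    {"AGEP": 27.0, "SEX": 2.0},
    {"AGEP": 18.0, "SEX": 1.0},
]
binned = apply_feature_map(rows, example_feature_map)
print(binned)
```

Coarser bins mean fewer distinct feature values, and therefore fewer candidate subgroups to search.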

## Download & Prepare the Two-State Dataset

In [5]:
X1, y1 = load_state_data(state1, problem_cls=ACSIncome)
X2, y2 = load_state_data(state2, problem_cls=ACSIncome)

print(f"{state1} shape:", X1.shape)
print(f"{state2} shape:", X2.shape)

display(X1.head())

FL shape: (98925, 10)
NH shape: (7966, 10)


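|   | AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | SEX | RAC1P |
|---|------|-----|------|-----|------|------|------|------|-----|-------|
| 0 | 20.0 | 1.0 | 16.0 | 5.0 | 5240.0 | 11.0 | 17.0 | 40.0 | 1.0 | 9.0 |
| 1 | 18.0 | 1.0 | 18.0 | 5.0 | 4622.0 | 36.0 | 17.0 | 40.0 | 2.0 | 2.0 |
| 2 | 18.0 | 1.0 | 18.0 | 5.0 | 4130.0 | 34.0 | 17.0 | 40.0 | 1.0 | 1.0 |
| 3 | 25.0 | 5.0 | 20.0 | 5.0 | 9825.0 | 26.0 | 17.0 | 50.0 | 1.0 | 1.0 |
| 4 | 27.0 | 2.0 | 17.0 | 1.0 | 2060.0 | 365.0 | 17.0 | 65.0 | 2.0 | 1.0 |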


## Within-State Bias Detection

> **Task: ACS Income (> \$50 000) Classification**  
> We use the **ACSIncome** problem from **folktables**, which predicts whether an individual’s personal income (`PINCP`) exceeds \$50 000 per year.  
> 
> - **Features used**:  
>   - `AGEP` (Age)  
>   - `COW`  (Class of worker)  
>   - `SCHL` (Educational attainment)  
>   - `MAR`  (Marital status)  
>   - `OCCP` (Occupation recode)  
>   - `POBP` (Place of birth)  
>   - `RELP` (Relationship)  
>   - `WKHP` (Usual hours worked per week past 12 months)  
>   - `SEX`  (Sex)  
>   - `RAC1P`(Race)  
> - **Target**: 
>     1 if `PINCP > 50 000`, else 0 (Indicator of whether one has income above $50k)
>
> - **Preprocessing** (handled by `ACSIncome` and our pipeline):  
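>   - Keep only individuals older than 16 (`AGEP > 16`)  
>   - Keep only records with personal income above \$100 (`PINCP > 100`), positive working hours (`WKHP > 0`) and a person weight of at least 1 (`PWGTP >= 1`)  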
>   - Map any remaining missing categorical codes to -1  
> 
> **MSD** then finds which protected subgroup (e.g. "White & Male", "Doctorate holders born abroad", etc.) is **most disproportionately represented** in the high- vs. low-income classes.
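The target definition above is a strict threshold, which a few lines make explicit. This is a small sketch with made-up income values. `income_label` is a hypothetical helper, not a folktables function.

```python
INCOME_THRESHOLD = 50_000  # dollars per year, as stated in the task description


def income_label(pincp):
    """Return 1 if personal income strictly exceeds the threshold, else 0."""
    return int(pincp > INCOME_THRESHOLD)


incomes = [12_000, 50_000, 50_001, 98_500]  # invented example values
labels = [income_label(v) for v in incomes]
print(labels)  # [0, 0, 1, 1]
```

Note that an income of exactly \$50 000 falls in the negative class.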

In [6]:
rule_1, msd_val_1 = detect_and_score(
    X1, y1,
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

[INFO] Seeding the run with seed=42 for searching the `rule`.
[INFO] Set parameter Username
[INFO] Set parameter LicenseID to value 2649381
[INFO] Academic license - for non-commercial use only - expires 2026-04-09
[INFO] Seeding the run with seed=42 for searching the `value`.


In [7]:
report_subgroup_bias(
    f"State {state1}",
    msd_val_1,
    rule_1,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

State FL
MSD = 0.193
Rule: RAC1P = 1.0 AND SEX = 1.0
Explained rule: Race = White AND Sex = Male


In [8]:
# --- signed gap (Δ) ----------------------------------------------------------
np.random.seed(seed)                      # keep the subsampling reproducible
_, X1_sub, y1_sub = prepare_dataset(
    X1,
    y1,
    n_max=n_samples,
    protected_attrs=protected_attrs,
    continuous_feats=continuous_feats,
    feature_processing=feature_map,
)

mask_1 = subgroup_map_from_conjuncts_dataframe(rule_1, X1_sub)
delta_1 = signed_subgroup_discrepancy(mask_1, y1_sub.to_numpy().ravel())

print(f"Δ = {delta_1:.3f} -> subgroup is {'over-represented' if delta_1 > 0 else 'under-represented'} among high-income earners in {state1}.")

Δ = 0.193 -> subgroup is over-represented among high-income earners in FL.


> **How to read the numbers**  
> * **MSD = 0.193** is the *magnitude* of the largest gap: the share of White males
>   among high-income individuals differs from their share among low-income individuals
>   by 19.3 percentage points.  
> * **Δ = +0.193** has the same magnitude **but keeps the sign**. A positive Δ tells
>   us the subgroup is **over-represented in the positive class**  
>   (here: high-income). A negative Δ would mean under-representation.
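The relationship between the signed gap Δ and the unsigned magnitude described in the note above can be checked on a tiny hand-made example. `signed_gap` below is a hand-rolled stand-in written for this illustration. It is not the implementation of `signed_subgroup_discrepancy`, whose internals are not shown in this notebook.

```python
def signed_gap(mask, y):
    """Subgroup share among positives (y == 1) minus its share among negatives."""
    pos = [m for m, label in zip(mask, y) if label == 1]
    neg = [m for m, label in zip(mask, y) if label == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


# Invented data: mask marks subgroup membership, y marks the high-income class.
mask = [True, True, False, True, False, False, True, False, False, False]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

delta = signed_gap(mask, y)                           # 3/5 - 1/5
delta_flipped = signed_gap(mask, [1 - v for v in y])  # swap the two classes

print(f"delta = {delta:+.3f}, magnitude = {abs(delta):.3f}")
print(f"flipped delta = {delta_flipped:+.3f}, magnitude = {abs(delta_flipped):.3f}")
```

Swapping the classes flips the sign of Δ but leaves the magnitude unchanged, which is why the magnitude alone cannot say whether the subgroup is over- or under-represented.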


In [9]:
rule_2, msd_val_2 = detect_and_score(
    X2, y2,
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

report_subgroup_bias(
    f"State {state2}",
    msd_val_2,
    rule_2,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

[INFO] Seeding the run with seed=42 for searching the `rule`.
[INFO] Seeding the run with seed=42 for searching the `value`.


State NH
MSD = 0.217
Rule: RAC1P = 1.0 AND SEX = 1.0
Explained rule: Race = White AND Sex = Male


In [10]:
# --- signed gap (Δ) ----------------------------------------------------------
np.random.seed(seed)                      # keep the subsampling reproducible
_, X2_sub, y2_sub = prepare_dataset(
    X2,
    y2,
    n_max=n_samples,
    protected_attrs=protected_attrs,
    continuous_feats=continuous_feats,
    feature_processing=feature_map,
)

mask_2 = subgroup_map_from_conjuncts_dataframe(rule_2, X2_sub)
delta_2 = signed_subgroup_discrepancy(mask_2, y2_sub.to_numpy().ravel())

print(f"Δ = {delta_2:.3f} -> subgroup is {'over-represented' if delta_2 > 0 else 'under-represented'} among high-income earners in {state2}.")

Δ = 0.217 -> subgroup is over-represented among high-income earners in NH.


> **New Hampshire - what the two numbers mean**  
> * **MSD = 0.217** is the *magnitude* of the largest gap: the share of White males among high-income individuals differs from their share among low-income individuals by 0.217 (21.7 percentage points).  
> * **Δ = +0.217** keeps the sign. The "+" tells us the subgroup is **over-represented among high-income earners** in New Hampshire.  
>
> (A negative Δ would signal under-representation in the positive class.)

## Interpret the Rules

The ACS PUMS data served by folktables encode `RAC1P` (race) as:

| Code | Meaning |
|------|---------|
| 1    | White   |
| 2    | Black or African American |
| 3    | American Indian alone |
| 4    | Alaska Native alone |
| 5    | American Indian and Alaska Native tribes specified, or American Indian or Alaska Native not specified and no other races |
| 6    | Asian   |
| 7    | Native Hawaiian and Other Pacific Islander alone |
| 8    | Some Other Race |
| 9    | Two or More Races |

And `SEX` as:

| Code | Meaning |
|------|---------|
| 1    | Male    |
| 2    | Female  |

- **State FL:**  
  `RAC1P = 1.0 AND SEX = 1.0` -> "White & Male" are over-represented by **19.3 pp** in the high-income class.  
- **State NH:**  
  `RAC1P = 1.0 AND SEX = 1.0` -> "White & Male" are over-represented by **21.7 pp** in the high-income class.  
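Translating a coded rule into the readable form used above is a simple lookup. The sketch below assumes a rule is a list of `(feature, value)` pairs and uses two small label dictionaries copied from the tables in this section. The rule object returned by `detect_and_score` and the structure of `PROTECTED_VALUES_MAP` may differ. `explain_rule`, `FEATURE_LABELS` and `VALUE_LABELS` are hypothetical names.

```python
# Labels copied from the tables above (only the codes needed for the demo).
FEATURE_LABELS = {"RAC1P": "Race", "SEX": "Sex"}
VALUE_LABELS = {
    "RAC1P": {1: "White", 2: "Black"},
    "SEX": {1: "Male", 2: "Female"},
}


def explain_rule(rule):
    """Turn [(feature, code), ...] into 'Name = Label AND Name = Label'."""
    parts = []
    for feature, value in rule:
        name = FEATURE_LABELS.get(feature, feature)
        # Fall back to the raw code when no label is known.
        label = VALUE_LABELS.get(feature, {}).get(int(value), value)
        parts.append(f"{name} = {label}")
    return " AND ".join(parts)


explained = explain_rule([("RAC1P", 1.0), ("SEX", 1.0)])
print(explained)  # Race = White AND Sex = Male
```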

> ## _Continued in the next notebook_