# Folktables: Real-World Bias Detection with MSD

In this notebook we load real-world American Community Survey (ACS) data (via [folktables](https://github.com/socialfoundations/folktables) and the [Census Bureau’s ACS program](https://www.census.gov/programs-surveys/acs/data.html "American Community Survey data")) and use **Maximum Subgroup Discrepancy (MSD)** to:

1. Find the most disadvantaged subgroup *within* each state (on the ACS Income > $50 000 classification task)  
2. Find the subgroup where two states differ *most*  

**MSD** scans *all* intersectional protected groups (e.g. age × marital status × race) with sample complexity that is **linear** in the number of protected attributes, and returns both a **value** (a percentage-point gap) and an **interpretable rule** (a conjunction of feature-value tests).
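
To make the definition concrete, here is a minimal brute-force sketch of the quantity MSD measures: the largest absolute difference in a subgroup's share between two samples, taken over all conjunctions of feature-value tests. It is an illustration only. The function names and toy data are hypothetical, the enumeration is exponential in the number of features, and it is not how `humancompatible.detect` computes MSD.

```python
from itertools import combinations, product

def share(rows, rule):
    """Fraction of rows that satisfy every (feature, value) test in rule."""
    if not rows:
        return 0.0
    hits = sum(all(r[f] == v for f, v in rule) for r in rows)
    return hits / len(rows)

def brute_force_msd(sample_a, sample_b, features):
    """Enumerate every conjunction over `features`; return (max gap, rule)."""
    best_gap, best_rule = 0.0, ()
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            # Candidate values: everything observed for each chosen feature.
            domains = [sorted({r[f] for r in sample_a + sample_b}) for f in subset]
            for values in product(*domains):
                rule = tuple(zip(subset, values))
                gap = abs(share(sample_a, rule) - share(sample_b, rule))
                if gap > best_gap:
                    best_gap, best_rule = gap, rule
    return best_gap, best_rule

# Made-up toy samples: the marginals of SEX and RAC1P are identical,
# but the intersections differ.
sample_a = [{"SEX": 1, "RAC1P": 1}] * 2 + [{"SEX": 2, "RAC1P": 2}] * 2
sample_b = [{"SEX": 2, "RAC1P": 1}] * 2 + [{"SEX": 1, "RAC1P": 2}] * 2

gap, rule = brute_force_msd(sample_a, sample_b, ["SEX", "RAC1P"])
print(gap, rule)  # 0.5 (('SEX', 1), ('RAC1P', 1))
```

In this toy example neither `SEX` nor `RAC1P` alone shows any gap, while their intersection shows a 50 pp gap, which is why scanning intersectional subgroups matters.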


## Configuration & Imports

In [1]:
import numpy as np
import pandas as pd

from folktables import ACSDataSource
from humancompatible.detect import detect_bias, detect_bias_two_samples

## PUMS Data Dictionary

For a complete list of all ACS PUMS variables, their codes and labels (e.g. every state code for `POBP`, every education level for `SCHL`, etc.), see the official 2018 PUMS Data Dictionary:

> **PUMS Data Dictionary (2018)**  
> https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf

## Parameters

In [2]:
# ────────── Which two states to compare ──────────
state1, state2 = "FL", "NH"

# We’ll run the Income > 50k task for each state:
from folktables import ACSIncome as Dataset

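> **Before continuing, run the cells at the bottom of the notebook** (*Advanced Settings*, *State-Code Utility*, and *Explanation helpers*).
>
> They define the advanced parameters and helper functions used by the cells below.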

## Download & Prepare the Two-State Dataset

In [8]:
def load_state_data(state_abbrev: str):
    """
    Load data for a single state via folktables,
    then return only our selected columns and the target series.
    """
    ds = ACSDataSource(
        survey_year=survey_year,
        horizon=horizon,
        survey="person",
        root_dir=data_root,
    )
    try:
        raw = ds.get_data(states=[state_abbrev], download=True)
    except Exception as e:
        print("\n⚠️  Automatic download failed:")
        print(f"    {e!r}\n")
        print("→ Please manually download this file and unzip it under:")
        print(f"    {data_root}/{survey_year}/{horizon}/csv_p{state_abbrev.lower()}.zip")
        print("\nYou can get it from:")
        print(f"https://www2.census.gov/programs-surveys/acs/data/pums/{survey_year}/{horizon}/\n")
        raw = ds.get_data(states=[state_abbrev], download=False)
    
    X_full, y_full, _ = Dataset.df_to_pandas(raw)
    X_sel = X_full[selected_columns]
    return X_sel, y_full

In [9]:
X1, y1 = load_state_data(state1)
X2, y2 = load_state_data(state2)

print(f"{state1} shape:", X1.shape)
print(f"{state2} shape:", X2.shape)

display(X1.head())

FL shape: (98925, 5)
NH shape: (7966, 5)


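|   | AGEP | SCHL | OCCP | SEX | RAC1P |
|---|------|------|------|-----|-------|
| 0 | 20.0 | 16.0 | 5240.0 | 1.0 | 9.0 |
| 1 | 18.0 | 18.0 | 4622.0 | 2.0 | 2.0 |
| 2 | 18.0 | 18.0 | 4130.0 | 1.0 | 1.0 |
| 3 | 25.0 | 20.0 | 9825.0 | 1.0 | 1.0 |
| 4 | 27.0 | 17.0 | 2060.0 | 2.0 | 1.0 |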


## Within-State Bias Detection

> **Task: ACS Income (> \$50 000) Classification**  
> We use the **ACSIncome** problem from **folktables**, which predicts whether an individual’s personal income (`PINCP`) exceeds \$50 000 per year.  
> 
> - **Features used**:  
>   - `AGEP` (Age in years, must be > 16)  
>   - `SCHL` (Educational attainment)  
>   - `OCCP` (Occupation recode)  
>   - `SEX` (Male / Female)  
>   - `RAC1P` (Race recode)  
> - **Target**: 1 if `PINCP > 50 000`, else 0  
>
> - **Preprocessing** (built‐in to `ACSIncome` and our solver):  
>   - Filter out individuals aged 16 or younger  
>   - Filter out zero or missing income (`PINCP`)  
>   - Map any remaining missing categories to –1  
> 
> Our within‐state **MSD** then finds the protected subgroup (e.g. "White & Male") whose high-vs-low-income rate differs the most from its complement.
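
The filtering and labeling described above can be sketched on plain records as follows. This is a simplified illustration based only on the rules listed in this section (`AGEP > 16`, non-zero and non-missing `PINCP`, target `PINCP > 50 000`). The function name `acs_income_label` is hypothetical, and the actual `ACSIncome` preprocessing in folktables may apply further filters.

```python
def acs_income_label(person, threshold=50_000):
    """Return the 1/0 target, or None if the record is filtered out."""
    age, income = person.get("AGEP"), person.get("PINCP")
    if age is None or age <= 16:        # keep only AGEP > 16
        return None
    if income is None or income == 0:   # drop zero or missing income
        return None
    return int(income > threshold)      # strictly greater than the threshold

# Made-up records, not ACS data.
people = [
    {"AGEP": 15, "PINCP": 0},
    {"AGEP": 30, "PINCP": None},
    {"AGEP": 42, "PINCP": 50_000},
    {"AGEP": 55, "PINCP": 72_000},
]
labels = [acs_income_label(p) for p in people]
print(labels)  # [None, None, 0, 1]
```

Note that an income of exactly \$50 000 yields label 0, since the target uses a strict inequality.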

In [None]:
msd_val_1, rule_1 = detect_bias(
    X1, y1,
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

report_subgroup_bias(
    f"State {state1}",
    msd_val_1,
    rule_1,
    selected_columns,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

[INFO] Seeding the run with seed=42


State FL
MSD (within-state) = 0.193
Rule: RAC1P = 1.0 AND SEX = 1.0
Explained rule: Race = White AND Sex = Male


> You’re seeing both the raw rule (e.g. `RAC1P = 1.0 AND SEX = 1.0`) and a human-readable version (“Race = White AND Sex = Male”) produced by our `report_subgroup_bias` helper.  


In [None]:
msd_val_2, rule_2 = detect_bias(
    X2, y2,
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

report_subgroup_bias(
    f"State {state2}",
    msd_val_2,
    rule_2,
    selected_columns,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

[INFO] Seeding the run with seed=42


State NH
MSD (within-state) = 0.217
Rule: RAC1P = 1.0 AND SEX = 1.0
Explained rule: Race = White AND Sex = Male




## Interpret the Rules

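The ACS PUMS data loaded by folktables encodes `RAC1P` (race) as:

| Code | Meaning |
|------|---------|
| 1    | White alone |
| 2    | Black or African American alone |
| 3    | American Indian alone |
| 4    | Alaska Native alone |
| 5    | American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races |
| 6    | Asian alone |
| 7    | Native Hawaiian and Other Pacific Islander alone |
| 8    | Some Other Race alone |
| 9    | Two or More Races |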

And `SEX` as:

| Code | Meaning |
|------|---------|
| 1    | Male    |
| 2    | Female  |

- **State FL:**  
  `RAC1P = 1.0 AND SEX = 1.0` -> "White & Male" is the most disproportionately represented subgroup, with a gap of **19.3 pp**.  
- **State NH:**  
  `RAC1P = 1.0 AND SEX = 1.0` -> "White & Male" is the most disproportionately represented subgroup, with a gap of **21.7 pp**.  
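
The translation from a raw rule to readable labels can be sketched as below. This is not the implementation of `report_subgroup_bias`. `CODE_LABELS` and `explain_rule` are hypothetical names, and the lookup covers only a few codes, with shortened labels, for illustration.

```python
# Partial lookup with shortened labels, for illustration only.
CODE_LABELS = {
    "RAC1P": ("Race", {1: "White", 2: "Black"}),
    "SEX": ("Sex", {1: "Male", 2: "Female"}),
}

def explain_rule(raw_rule):
    """Translate 'FEATURE = code AND ...' into readable labels."""
    parts = []
    for test in raw_rule.split(" AND "):
        feature, code_text = (s.strip() for s in test.split("="))
        name, labels = CODE_LABELS.get(feature, (feature, {}))
        code = int(float(code_text))  # '1.0' -> 1
        # Fall back to the raw code when no label is known.
        parts.append(f"{name} = {labels.get(code, code_text)}")
    return " AND ".join(parts)

explained = explain_rule("RAC1P = 1.0 AND SEX = 1.0")
print(explained)  # Race = White AND Sex = Male
```

Unknown features or codes are passed through unchanged, so a missing lookup entry shows up as a raw code rather than a wrong label.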

## Cross-State Discrepancy

In [None]:
msd_cross, rule_cross = detect_bias_two_samples(
    X1, X2, 
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

report_subgroup_bias(
    f"{state1} vs {state2}",
    msd_cross,
    rule_cross,
    selected_columns,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

[INFO] Seeding the run with seed=42


FL vs NH
MSD = 0.177
Rule: SCHL = 16.0 AND RAC1P = 1.0
Explained rule: Educational attainment = Doctorate degree AND Race = White


The **FL vs NH MSD** of **0.177** means that the share of the subgroup  
**`SCHL = 16.0 AND RAC1P = 1.0`**  
differs by **17.7 pp** between the Florida and New Hampshire samples, the largest intersectional discrepancy between these two states. The value alone does not indicate which state has the larger share.

The helper labels this subgroup "Doctorate degree" **and** "White". Note, however, that the 2018 PUMS Data Dictionary lists `SCHL = 16` as "Regular high school diploma" (doctorate degree is code 24), so the printed education label should be checked against the dictionary.
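
Because the reported value is a magnitude, one way to see which sample has the larger share is to recompute the subgroup's share in each sample and look at the signed difference. The sketch below does this on made-up toy records. `subgroup_share` is a hypothetical helper, not part of `humancompatible.detect`.

```python
def subgroup_share(rows, rule):
    """Share of rows matching every feature=value test in rule (a dict)."""
    hits = sum(all(row[f] == v for f, v in rule.items()) for row in rows)
    return hits / len(rows)

rule = {"SCHL": 16, "RAC1P": 1}

# Made-up toy samples, not ACS data.
sample_1 = [{"SCHL": 16, "RAC1P": 1}] * 3 + [{"SCHL": 21, "RAC1P": 1}] * 7
sample_2 = [{"SCHL": 16, "RAC1P": 1}] * 1 + [{"SCHL": 21, "RAC1P": 2}] * 9

share_1 = subgroup_share(sample_1, rule)
share_2 = subgroup_share(sample_2, rule)
signed_gap_pp = round(100 * (share_1 - share_2), 1)
print(f"{share_1:.0%} vs {share_2:.0%}: {signed_gap_pp:+.1f} pp")
# 30% vs 10%: +20.0 pp
```

With the notebook's DataFrames, the same share could be computed as `((X1["SCHL"] == 16) & (X1["RAC1P"] == 1)).mean()`, and likewise for `X2`.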



## Conclusion

In this notebook, we have seen how **Maximum Subgroup Discrepancy (MSD)** can uncover both within-population and cross-population biases in a real-world dataset:

1. **Within-State Discrepancies**  
   - For **Florida**, the subgroup  
     `RAC1P = 1.0 AND SEX = 1.0`  
     ("White" **and** "Male") is the most disproportionately represented, with an MSD of **0.193** (19.3 pp gap vs its complement).  
   - For **New Hampshire**, the subgroup  
     `RAC1P = 1.0 AND SEX = 1.0`  
     ("White" **and** "Male") is the most disproportionately represented, with an MSD of **0.217**.

2. **Cross-State Discrepancy**  
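   - Comparing **Florida vs New Hampshire**, the subgroup  
     `SCHL = 16.0 AND RAC1P = 1.0`  
     (labeled "Doctorate degree" **and** "White" by the helper, although the 2018 PUMS Data Dictionary lists `SCHL` code 16 as "Regular high school diploma") shows the largest difference in share, with an MSD of **0.177** (17.7 pp gap in share).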

Feel free to play with different feature sets, years, or other folktables problems (ACSPublicCoverage, ACSMobility, ...).

## Advanced Settings

In [None]:
# ────────── ACS Data settings ──────────
survey_year   = "2018"
horizon       = "1-Year"
data_root     = "../data/folktables"

# ────────── Which columns to keep ──────────
print("Features can be selected from:", Dataset.features)
print("Explanations of their abbreviations can be found below.")
selected_columns = ['AGEP', 'SCHL', 'OCCP', 'SEX', 'RAC1P']

# ────────── MSD / Solver settings ──────────
n_samples     = 1_000   # number of rows to subsample for faster computation
method        = "MSD"
method_kwargs = {
    "solver": "gurobi",  # comment out if you don’t have a license
}

Features can be selected from: ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']
Explanations of their abbreviations can be found below.
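
The `n_samples` setting, together with a fixed seed, suggests reproducible subsampling of the data before the solver runs. A minimal sketch of that idea follows. It is an assumption about the general technique, not the library's actual sampling code, and `subsample` is a hypothetical helper.

```python
import random

def subsample(rows, n_samples, seed):
    """Return up to n_samples rows, drawn without replacement."""
    if n_samples >= len(rows):
        return list(rows)
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(rows, n_samples)

rows = list(range(10_000))           # stand-in for data rows
first = subsample(rows, 1_000, seed=42)
second = subsample(rows, 1_000, seed=42)
print(len(first), first == second)   # 1000 True
```

A smaller `n_samples` speeds up the search, but the reported MSD is then an estimate from the subsample and can vary with the seed.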


In [None]:
seed = 42

protected_attrs  = selected_columns.copy()

feature_map      = {}  # any custom binning

# ────────── Feature definitions ──────────
CONTINUOUS_FEATURES = ["AGEP", "PINCP", "WKHP", "JWMNP", "POVPIP"]
continuous_feats = [f for f in CONTINUOUS_FEATURES if f in selected_columns]
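
Since rules are conjunctions of feature-value tests, continuous features such as `AGEP` generally need to be discretized into intervals before they can appear in a rule. How `continuous_list` and `fp_map` handle this internally is not shown in this notebook. The sketch below only illustrates the general idea, with hypothetical cut points and a hypothetical `age_bin` helper.

```python
from bisect import bisect_right

AGE_EDGES = [25, 45, 65]  # hypothetical cut points, not the library's defaults

def age_bin(age, edges=AGE_EDGES):
    """Map an age to an interval label such as '25-44' or '65+'."""
    i = bisect_right(edges, age)  # number of edges <= age
    if i == 0:
        return f"<{edges[0]}"
    if i == len(edges):
        return f"{edges[-1]}+"
    return f"{edges[i - 1]}-{edges[i] - 1}"

binned = [age_bin(a) for a in (18, 25, 44, 45, 70)]
print(binned)  # ['<25', '25-44', '25-44', '45-64', '65+']
```

After binning, a test such as "age bin = 25-44" can be combined with categorical tests like `SEX = 1.0` in a single rule.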

## State-Code Utility

In [None]:
from humancompatible.detect.utils import state_to_pobp_code
# Reference list of POBP (place of birth) codes:
# https://www.icpsr.umich.edu/web/DSDR/studies/25042/datasets/0002/variables/POBP?archive=dsdr

## Explanation helpers

In [None]:
from humancompatible.detect.utils import report_subgroup_bias
from folktables_utils import feature_folktables, state_to_pobp_code
FEATURE_NAMES, PROTECTED_VALUES_MAP = feature_folktables()

FEATURE_NAMES

{'SEX': 'Sex',
 'RAC1P': 'Race',
 'AGEP': 'Age',
 'MAR': 'Marital status',
 'POBP': 'Place of birth',
 'DIS': 'Disability',
 'CIT': 'Citizenship',
 'MIL': 'Military service',
 'ANC': 'Ancestry',
 'NATIVITY': 'Foreign or US native',
 'DEAR': 'Difficulty hearing',
 'DEYE': 'Difficulty seeing',
 'DREM': 'Cognitive difficulty',
 'FER': 'Gave birth last year',
 'POVPIP': 'Income / Poverty threshold',
 'COW': 'Class of worker',
 'SCHL': 'Educational attainment',
 'OCCP': 'Occupation recode',
 'WKHP': 'Usual hours worked per week past 12 months'}