# Folktables: Real-World Bias Detection with MSD

In this notebook we load real-world ACS data (via [folktables](https://github.com/socialfoundations/folktables)) and use **Maximum Subgroup Discrepancy (MSD)** to:

1. Find the most disadvantaged subgroup *within* each state  
2. Find the subgroup where two states differ *most*  

**MSD** scans *all* intersectional protected groups (e.g. age × marital status × race …) with **linear** sample complexity and returns both a **value** (a percentage-point gap) and an **interpretable rule** (a conjunction of feature-value tests).
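
To make the definition concrete, here is a minimal brute-force sketch on made-up toy rows. It assumes the gap is the absolute difference in a subgroup's share between two samples (for the within-state case, rows with a positive outcome versus rows with a negative one). The helper names `subgroup_share` and `brute_force_msd` are hypothetical and are not part of `humancompatible.detect`, which solves a mixed-integer program instead of enumerating subgroups.

```python
from itertools import combinations, product

def subgroup_share(rows, rule):
    """Fraction of rows that satisfy every (feature, value) test in the rule."""
    if not rows:
        return 0.0
    hits = sum(all(row[f] == v for f, v in rule) for row in rows)
    return hits / len(rows)

def brute_force_msd(sample_a, sample_b, attrs):
    """Exhaustively search conjunctions over `attrs` for the largest share gap."""
    values = {a: sorted({row[a] for row in sample_a + sample_b}) for a in attrs}
    best_gap, best_rule = 0.0, ()
    for k in range(1, len(attrs) + 1):
        for feats in combinations(attrs, k):
            for combo in product(*(values[f] for f in feats)):
                rule = tuple(zip(feats, combo))
                gap = abs(subgroup_share(sample_a, rule) - subgroup_share(sample_b, rule))
                if gap > best_gap:
                    best_gap, best_rule = gap, rule
    return best_gap, best_rule

# Toy data (made up): rows with a positive outcome vs. rows with a negative one.
positives = [
    {"SEX": 1, "MAR": 1}, {"SEX": 1, "MAR": 1}, {"SEX": 1, "MAR": 1}, {"SEX": 2, "MAR": 5},
]
negatives = [
    {"SEX": 2, "MAR": 5}, {"SEX": 2, "MAR": 5}, {"SEX": 1, "MAR": 5}, {"SEX": 2, "MAR": 1},
]

gap, rule = brute_force_msd(positives, negatives, ["SEX", "MAR"])
print(f"MSD = {gap:.2f}, rule: " + " AND ".join(f"{f} = {v}" for f, v in rule))
```

The number of candidate conjunctions grows exponentially with the number of attributes, so this enumeration is only practical for tiny examples.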


## Configuration & Imports

In [1]:
import numpy as np
import pandas as pd

from folktables import ACSDataSource, ACSIncome
from humancompatible.detect import detect_bias, detect_bias_two_samples

## Parameters

In [2]:
# ────────── Which two states to compare ──────────
state1, state2 = "HI", "ME"

# ────────── ACS Data settings ──────────
survey_year   = "2018"
horizon       = "1-Year"
data_root     = "../data/folktables"

# ────────── Which columns to keep ──────────
# Available ACSIncome features: ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']
selected_columns = ["AGEP", "MAR", "POBP", "SEX", "RAC1P"]
protected_attrs  = selected_columns.copy()
continuous_feats = []
feature_map      = {}  # any custom binning

# ────────── Method settings ──────────
seed          = 42
n_samples     = 1_000
method        = "MSD"
method_kwargs = {"time_limit": 120}  # 2 min per solve

## Download & Prepare the Two-State Dataset

In [3]:
def load_state_manual():
    """
    Attempts to download via folktables; if that fails, expects you to have
    manually downloaded & unzipped the two CSV zips into data_root/{year}/{horizon}/
    """
    ds = ACSDataSource(
        survey_year=survey_year,
        horizon=horizon,
        survey="person",
        root_dir=data_root,
    )
    try:
        raw = ds.get_data(states=[state1, state2], download=True)
    except Exception as e:
        print("\n⚠️  Automatic download failed:")
        print(f"    {e!r}\n")
        print("→ Please manually download these two files and unzip them under:")
        print(f"    {data_root}/{survey_year}/{horizon}/csv_p{state1.lower()}.zip")
        print(f"    {data_root}/{survey_year}/{horizon}/csv_p{state2.lower()}.zip")
        print("\nYou can get them from:")
        print(f"https://www2.census.gov/programs-surveys/acs/data/pums/{survey_year}/{horizon}/\n")
        raw = ds.get_data(states=[state1, state2], download=False)
    return raw

In [4]:
data = load_state_manual()
X, y, _ = ACSIncome.df_to_pandas(data)

# keep only our selected columns
X = X[selected_columns]
print("Combined shape:", X.shape)
X.head()

Combined shape: (14733, 5)


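|   | AGEP | MAR | POBP | SEX | RAC1P |
|---|------|-----|------|-----|-------|
| 0 | 18.0 | 5.0 | 66.0 | 2.0 | 1.0   |
| 1 | 22.0 | 5.0 | 48.0 | 1.0 | 1.0   |
| 2 | 18.0 | 5.0 | 15.0 | 2.0 | 7.0   |
| 3 | 18.0 | 5.0 | 34.0 | 1.0 | 2.0   |
| 4 | 29.0 | 5.0 | 45.0 | 1.0 | 1.0   |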
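
`AGEP` holds raw ages, and `continuous_feats` is left empty in the parameters, so coarser age bands may give more readable rules. How `feature_map` expects custom bins to be specified is not shown in this notebook, so the sketch below bins by hand instead. The band edges and the `age_band` helper are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical band edges; chosen for illustration only.
AGE_EDGES = [25, 35, 50, 65]
AGE_LABELS = ["<25", "25-34", "35-49", "50-64", "65+"]

def age_band(age):
    """Map a numeric age to a coarse band label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

ages = [18.0, 22.0, 18.0, 18.0, 29.0]  # the AGEP values from the preview above
bands = [age_band(a) for a in ages]
print(bands)
```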


## State-Code Utility

In [5]:
_POBP_STATE_CODE = {
    "AL":  1,  # Alabama
    "AK":  2,  # Alaska
    "AZ":  4,  # Arizona
    "AR":  5,  # Arkansas
    "CA":  6,  # California
    "CO":  8,  # Colorado
    "CT":  9,  # Connecticut
    "DE": 10,  # Delaware
    "DC": 11,  # District of Columbia
    "FL": 12,  # Florida
    "GA": 13,  # Georgia
    "HI": 15,  # Hawaii
    "ID": 16,  # Idaho
    "IL": 17,  # Illinois
    "IN": 18,  # Indiana
    "IA": 19,  # Iowa
    "KS": 20,  # Kansas
    "KY": 21,  # Kentucky
    "LA": 22,  # Louisiana
    "ME": 23,  # Maine
    "MD": 24,  # Maryland
    "MA": 25,  # Massachusetts
    "MI": 26,  # Michigan
    "MN": 27,  # Minnesota
    "MS": 28,  # Mississippi
    "MO": 29,  # Missouri
    "MT": 30,  # Montana
    "NE": 31,  # Nebraska
    "NV": 32,  # Nevada
    "NH": 33,  # New Hampshire
    "NJ": 34,  # New Jersey
    "NM": 35,  # New Mexico
    "NY": 36,  # New York
    "NC": 37,  # North Carolina
    "ND": 38,  # North Dakota
    "OH": 39,  # Ohio
    "OK": 40,  # Oklahoma
    "OR": 41,  # Oregon
    "PA": 42,  # Pennsylvania
    "RI": 44,  # Rhode Island
    "SC": 45,  # South Carolina
    "SD": 46,  # South Dakota
    "TN": 47,  # Tennessee
    "TX": 48,  # Texas
    "UT": 49,  # Utah
    "VT": 50,  # Vermont
    "VA": 51,  # Virginia
    "WA": 53,  # Washington
    "WV": 54,  # West Virginia
    "WI": 55,  # Wisconsin
    "WY": 56,  # Wyoming
}

def state_to_pobp_code(abbrev: str) -> int:
    """
    Turn a two-letter state code (e.g. 'CA') into the ACS POBP recode.
    Raises a KeyError if the state isn't in the map.
    """
    st = abbrev.strip().upper()
    try:
        return _POBP_STATE_CODE[st]
    except KeyError:
        # `from None` hides the inner KeyError so only the helpful message is shown
        raise KeyError(f"Unknown state abbreviation '{abbrev}'. Valid codes are: "
                       + ", ".join(sorted(_POBP_STATE_CODE.keys()))) from None

## Split Out Each State

In [6]:
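# Note: POBP is the place of birth, so each mask selects people *born* in that
# state among the residents of the two loaded states.
mask1 = X["POBP"] == state_to_pobp_code(state1)
mask2 = X["POBP"] == state_to_pobp_code(state2)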

X1, y1 = X[mask1].drop(columns="POBP"), y[mask1]
X2, y2 = X[mask2].drop(columns="POBP"), y[mask2]

print(f"{state1} shape:", X1.shape)
print(f"{state2} shape:", X2.shape)

HI shape: (3747, 4)
ME shape: (4089, 4)
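
The split above keys on `POBP`, which records where a person was born rather than where they live. The toy sketch below, on made-up rows, shows how the two selections can differ. It assumes a state-of-residence column named `ST` holding the same numeric state codes; that column is not among the selected features here, so treat its use as an assumption to verify against the raw data.

```python
HI_CODE = 15  # Hawaii, as in the POBP map above

# Made-up rows: ST = state of residence (assumed column), POBP = place of birth.
rows = [
    {"ST": 15, "POBP": 15},  # born in Hawaii, lives in Hawaii
    {"ST": 15, "POBP": 6},   # born in California, lives in Hawaii
    {"ST": 23, "POBP": 15},  # born in Hawaii, lives in Maine
    {"ST": 23, "POBP": 23},  # born in Maine, lives in Maine
]

born_in_hi = [r for r in rows if r["POBP"] == HI_CODE]
living_in_hi = [r for r in rows if r["ST"] == HI_CODE]
both = [r for r in rows if r["POBP"] == HI_CODE and r["ST"] == HI_CODE]

print(len(born_in_hi), len(living_in_hi), len(both))
```

Which selection is appropriate depends on whether the comparison is meant to be about birthplace or about residence.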


## Within-State Bias Detection

In [7]:
msd_val_1, rule_1 = detect_bias(
    X1, y1,
    protected_list  = protected_attrs,
    continuous_list = continuous_feats,
    fp_map          = feature_map,
    seed            = seed,
    n_samples       = n_samples,
    method          = method,
    method_kwargs   = method_kwargs,
)

print(f"State {state1} MSD (within-state) = {msd_val_1:.3f}")
print("Rule:", " AND ".join(str(r) for _,r in rule_1))

[INFO] Seeding the run with seed=42
[INFO] Running HiGHS 1.11.0 (git hash: 364c83a): Copyright (c) 2025 HiGHS under MIT licence terms
[INFO] RUN!
[INFO] MIP  has 61743 rows; 777 cols; 180422 nonzeros; 90 integer variables (90 binary)
[INFO] Coefficient ranges:
[INFO]   Matrix [2e-03, 2e+00]
[INFO]   Cost   [1e+00, 1e+00]
[INFO]   Bound  [1e+00, 1e+00]
[INFO]   RHS    [1e+00, 2e+00]
[INFO] Presolving model
[INFO] 58998 rows, 777 cols, 176992 nonzeros  0s
[INFO] 58998 rows, 777 cols, 176992 nonzeros  0s
[INFO] 
[INFO] Solving MIP model with:
[INFO]    58998 rows
[INFO]    777 cols (90 binary, 0 integer, 0 implied int., 687 continuous, 0 domain fixed)
[INFO]    176992 nonzeros
[INFO] 
[INFO] Src: B => Branching; C => Central rounding; F => Feasibility pump; J => Feasibility jump;
[INFO]      H => Heuristic; L => Sub-MIP; P => Empty MIP; R => Randomized rounding; Z => ZI Round;
[INFO]      I => Shifting; S => Solve LP; T => Evaluate node; U => Unbounded; X => User solution;
[INFO]      z =

State HI MSD (within-state) = 0.337
Rule: MAR = 5.0


In [8]:
msd_val_2, rule_2 = detect_bias(
    X2, y2,
    protected_list  = protected_attrs,
    continuous_list = continuous_feats,
    fp_map          = feature_map,
    seed            = seed,
    n_samples       = n_samples,
    method          = method,
    method_kwargs   = method_kwargs,
)

print(f"State {state2} MSD (within-state) = {msd_val_2:.3f}")
print("Rule:", " AND ".join(str(r) for _,r in rule_2))

[INFO] Seeding the run with seed=42
[INFO] Running HiGHS 1.11.0 (git hash: 364c83a): Copyright (c) 2025 HiGHS under MIT licence terms
[INFO] RUN!
[INFO] MIP  has 38095 rows; 518 cols; 111284 nonzeros; 89 integer variables (89 binary)
[INFO] Coefficient ranges:
[INFO]   Matrix [1e-03, 2e+00]
[INFO]   Cost   [1e+00, 1e+00]
[INFO]   Bound  [1e+00, 1e+00]
[INFO]   RHS    [1e+00, 2e+00]
[INFO] Presolving model
[INFO] 36382 rows, 518 cols, 109144 nonzeros  0s
[INFO] 36382 rows, 518 cols, 109144 nonzeros  0s
[INFO] 
[INFO] Solving MIP model with:
[INFO]    36382 rows
[INFO]    518 cols (89 binary, 0 integer, 0 implied int., 429 continuous, 0 domain fixed)
[INFO]    109144 nonzeros
[INFO] 
[INFO] Src: B => Branching; C => Central rounding; F => Feasibility pump; J => Feasibility jump;
[INFO]      H => Heuristic; L => Sub-MIP; P => Empty MIP; R => Randomized rounding; Z => ZI Round;
[INFO]      I => Shifting; S => Solve LP; T => Evaluate node; U => Unbounded; X => User solution;
[INFO]      z =

State ME MSD (within-state) = 0.270
Rule: MAR = 1.0 AND SEX = 1.0


## Interpret the Rules

Folktables encodes `MAR` (marital status) as:

| Code | Meaning       |
|------|---------------|
| 1    | Married       |
| 2    | Widowed       |
| 3    | Divorced      |
| 4    | Separated     |
| 5    | Never married |

And `SEX` as:

| Code | Meaning |
|------|---------|
| 1    | Male    |
| 2    | Female  |
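
The code tables above can be applied mechanically to a rule. The sketch below assumes a rule is available as plain `(feature, value)` pairs; the objects actually returned by `detect_bias` are only shown through `str()` in this notebook, so that representation and the `describe` helper are hypothetical.

```python
MAR_LABELS = {1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married"}
SEX_LABELS = {1: "Male", 2: "Female"}
CODEBOOK = {"MAR": MAR_LABELS, "SEX": SEX_LABELS}

def describe(tests):
    """Render (feature, value) tests with human-readable labels where known."""
    parts = []
    for feature, value in tests:
        # Fall back to the raw value when the feature or code is not in the codebook
        label = CODEBOOK.get(feature, {}).get(int(value), value)
        parts.append(f"{feature} = {label}")
    return " AND ".join(parts)

hi_text = describe([("MAR", 5.0)])
me_text = describe([("MAR", 1.0), ("SEX", 1.0)])
print(hi_text)
print(me_text)
```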

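- **State HI:** `MAR = 5.0` → "Never married" people are the most discrepant subgroup, with a gap of **33.7 pp**.  
- **State ME:** `MAR = 1.0 AND SEX = 1.0` → "Married men" are the most discrepant subgroup, with a gap of **27.0 pp**.  

MSD reports the size of a gap rather than its sign, so reading a subgroup as underserved (rather than favoured) requires checking its outcome rate directly.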
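
Since an MSD value is the magnitude of a gap, it can help to confirm its direction by comparing the positive-outcome rate inside and outside the reported subgroup. The sketch below does this on made-up rows and labels; `positive_rate` is a hypothetical helper, and rules are again assumed to be plain `(feature, value)` pairs.

```python
def positive_rate(rows, labels, tests, inside=True):
    """Share of positive labels among rows inside (or outside) the subgroup."""
    selected = [
        y for row, y in zip(rows, labels)
        if all(row[f] == v for f, v in tests) == inside
    ]
    return sum(selected) / len(selected) if selected else float("nan")

# Made-up data: four never-married (MAR = 5) and four married (MAR = 1) people.
rows = [{"MAR": 5}, {"MAR": 5}, {"MAR": 5}, {"MAR": 5},
        {"MAR": 1}, {"MAR": 1}, {"MAR": 1}, {"MAR": 1}]
labels = [0, 0, 0, 1, 1, 1, 1, 0]  # 1 = positive outcome
tests = [("MAR", 5)]

rate_in = positive_rate(rows, labels, tests, inside=True)
rate_out = positive_rate(rows, labels, tests, inside=False)
direction = "lower" if rate_in < rate_out else "higher or equal"
print(f"inside: {rate_in:.2f}, outside: {rate_out:.2f} -> {direction} rate inside the subgroup")
```

A lower rate inside the subgroup supports reading it as disadvantaged; a higher rate suggests the opposite.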