# Folktables: Real-World Bias Detection with MSD

In this notebook we load real-world American Community Survey (ACS) data (via [folktables](https://github.com/socialfoundations/folktables) and the [Census Bureau’s ACS program](https://www.census.gov/programs-surveys/acs/data.html "American Community Survey data")) and use **Maximum Subgroup Discrepancy (MSD)** to:

1. Find the most disadvantaged subgroup *within* each state (on the ACS Income > $50 000 classification task)  
2. Find the subgroup where two states differ *most*  

**MSD** searches over *all* intersectional protected groups (e.g. age × marital status × race) with a sample complexity that is **linear** in the number of protected attributes, and returns both a **value** (the percentage-point gap) and an **interpretable rule** (a conjunction of feature-value tests).
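
To make the definition concrete, here is a minimal sketch of the quantity being maximised, written as a naive enumeration over every conjunction of feature-value tests. It is an illustration only: the helper names `subgroup_share` and `brute_force_msd` and the toy samples are hypothetical, and the exhaustive loop is not how the library's solver works (the notebook delegates the search to an optimisation solver).

```python
from itertools import combinations, product

def subgroup_share(rows, rule):
    """Fraction of rows that satisfy every (feature, value) test in `rule`."""
    if not rows:
        return 0.0
    hits = sum(all(row[f] == v for f, v in rule) for row in rows)
    return hits / len(rows)

def brute_force_msd(sample_a, sample_b, features):
    """Return (gap, rule) with the largest |share_a - share_b| over all conjunctions."""
    # Collect the observed values of each feature across both samples.
    values = {f: sorted({row[f] for row in sample_a + sample_b}) for f in features}
    best_gap, best_rule = 0.0, ()
    for k in range(1, len(features) + 1):
        for feats in combinations(features, k):
            for vals in product(*(values[f] for f in feats)):
                rule = tuple(zip(feats, vals))
                gap = abs(subgroup_share(sample_a, rule) - subgroup_share(sample_b, rule))
                if gap > best_gap:  # ties keep the first rule found
                    best_gap, best_rule = gap, rule
    return best_gap, best_rule

# Toy samples: the marginals of MAR are identical, the gap only shows up
# once MAR and SEX are combined.
sample_a = [{"MAR": 1, "SEX": 1}, {"MAR": 1, "SEX": 1}, {"MAR": 1, "SEX": 1}, {"MAR": 5, "SEX": 2}]
sample_b = [{"MAR": 1, "SEX": 2}, {"MAR": 1, "SEX": 2}, {"MAR": 5, "SEX": 1}, {"MAR": 1, "SEX": 1}]

gap, rule = brute_force_msd(sample_a, sample_b, ["MAR", "SEX"])
print(gap, " AND ".join(f"{f} = {v}" for f, v in rule))
```

The number of candidate conjunctions grows exponentially with the number of features, which is why this enumeration is only usable on toy inputs.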


## Configuration & Imports

In [1]:
import numpy as np
import pandas as pd

from folktables import ACSDataSource, ACSIncome
from humancompatible.detect import detect_bias, detect_bias_two_samples

## Parameters

In [2]:
# ────────── Which two states to compare ──────────
state1, state2 = "HI", "ME"

# ────────── ACS Data settings ──────────
survey_year   = "2018"
horizon       = "1-Year"
data_root     = "../data/folktables"

# ────────── Which columns to keep ──────────
# ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']
selected_columns = ["AGEP", "MAR", "POBP", "SEX", "RAC1P"]
protected_attrs  = selected_columns.copy()
continuous_feats = []
feature_map      = {}  # any custom binning

# ────────── Method settings ──────────
seed          = 42
n_samples     = 1_000  # if the code runs for too long, you can try to decrease this number
method        = "MSD"
solver        = "gurobi"  # requires a Gurobi licence
method_kwargs = {"time_limit": 120,  # 2 min per solve
                 "solver": solver,   # if you don't have the licence, comment this line
                 }
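
Before running the solver it can help to sanity-check the parameters above. The following is a small sketch of such a check; `check_config` is a hypothetical helper (not part of folktables or the detection library), and the rule that protected and continuous columns must be among the selected columns is an assumption about how the cells below use them.

```python
# Feature list taken from the comment in the parameters cell.
ACS_INCOME_FEATURES = ["AGEP", "COW", "SCHL", "MAR", "OCCP",
                       "POBP", "RELP", "WKHP", "SEX", "RAC1P"]

def check_config(selected, protected, continuous):
    """Return a list of problems found; an empty list means the config is consistent."""
    problems = []
    for col in selected:
        if col not in ACS_INCOME_FEATURES:
            problems.append(f"{col!r} is not an ACSIncome feature")
    for col in protected:
        if col not in selected:
            problems.append(f"protected attribute {col!r} is not a selected column")
    for col in continuous:
        if col not in selected:
            problems.append(f"continuous feature {col!r} is not a selected column")
    return problems

ok = check_config(["AGEP", "MAR", "POBP", "SEX", "RAC1P"],
                  ["AGEP", "MAR", "POBP", "SEX", "RAC1P"], [])
bad = check_config(["AGEP", "INCOME"], ["AGEP", "SEX"], [])
print(ok)
print(bad)
```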

## Download & Prepare the Two-State Dataset

In [3]:
def load_state_data(state_abbrev: str):
    """
    Load data for a single state via folktables,
    then return only our selected columns and the target series.
    """
    ds = ACSDataSource(
        survey_year=survey_year,
        horizon=horizon,
        survey="person",
        root_dir=data_root,
    )
    try:
        raw = ds.get_data(states=[state_abbrev], download=True)
    except Exception as e:
        print("\n⚠️  Automatic download failed:")
        print(f"    {e!r}\n")
        print("→ Please manually download this file and unzip it under:")
        print(f"    {data_root}/{survey_year}/{horizon}/csv_p{state_abbrev.lower()}.zip")
        print("\nYou can get it from:")
        print(f"https://www2.census.gov/programs-surveys/acs/data/pums/{survey_year}/{horizon}/\n")
        raw = ds.get_data(states=[state_abbrev], download=False)
    
    X_full, y_full, _ = ACSIncome.df_to_pandas(raw)
    X_sel = X_full[selected_columns]
    return X_sel, y_full

In [4]:
X1, y1 = load_state_data(state1)
X2, y2 = load_state_data(state2)

print(f"{state1} shape:", X1.shape)
print(f"{state2} shape:", X2.shape)

display(X1.head())

HI shape: (7731, 5)
ME shape: (7002, 5)


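|   | AGEP | MAR | POBP | SEX | RAC1P |
|---|------|-----|------|-----|-------|
| 0 | 18.0 | 5.0 | 66.0 | 2.0 | 1.0   |
| 1 | 22.0 | 5.0 | 48.0 | 1.0 | 1.0   |
| 2 | 18.0 | 5.0 | 15.0 | 2.0 | 7.0   |
| 3 | 18.0 | 5.0 | 34.0 | 1.0 | 2.0   |
| 4 | 29.0 | 5.0 | 45.0 | 1.0 | 1.0   |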


## State-Code Utility

In [5]:
_POBP_STATE_CODE = {
    "AL":  1,  # Alabama
    "AK":  2,  # Alaska
    "AZ":  4,  # Arizona
    "AR":  5,  # Arkansas
    "CA":  6,  # California
    "CO":  8,  # Colorado
    "CT":  9,  # Connecticut
    "DE": 10,  # Delaware
    "DC": 11,  # District of Columbia
    "FL": 12,  # Florida
    "GA": 13,  # Georgia
    "HI": 15,  # Hawaii
    "ID": 16,  # Idaho
    "IL": 17,  # Illinois
    "IN": 18,  # Indiana
    "IA": 19,  # Iowa
    "KS": 20,  # Kansas
    "KY": 21,  # Kentucky
    "LA": 22,  # Louisiana
    "ME": 23,  # Maine
    "MD": 24,  # Maryland
    "MA": 25,  # Massachusetts
    "MI": 26,  # Michigan
    "MN": 27,  # Minnesota
    "MS": 28,  # Mississippi
    "MO": 29,  # Missouri
    "MT": 30,  # Montana
    "NE": 31,  # Nebraska
    "NV": 32,  # Nevada
    "NH": 33,  # New Hampshire
    "NJ": 34,  # New Jersey
    "NM": 35,  # New Mexico
    "NY": 36,  # New York
    "NC": 37,  # North Carolina
    "ND": 38,  # North Dakota
    "OH": 39,  # Ohio
    "OK": 40,  # Oklahoma
    "OR": 41,  # Oregon
    "PA": 42,  # Pennsylvania
    "RI": 44,  # Rhode Island
    "SC": 45,  # South Carolina
    "SD": 46,  # South Dakota
    "TN": 47,  # Tennessee
    "TX": 48,  # Texas
    "UT": 49,  # Utah
    "VT": 50,  # Vermont
    "VA": 51,  # Virginia
    "WA": 53,  # Washington
    "WV": 54,  # West Virginia
    "WI": 55,  # Wisconsin
    "WY": 56,  # Wyoming
}

def state_to_pobp_code(abbrev: str) -> int:
    """
    Turn a two-letter state code (e.g. 'CA') into the ACS POBP recode.
    Raises a KeyError if the state isn't in the map.
    """
    st = abbrev.strip().upper()
    try:
        return _POBP_STATE_CODE[st]
    except KeyError:
        raise KeyError(f"Unknown state abbreviation '{abbrev}'. Valid codes are: "
                       + ", ".join(sorted(_POBP_STATE_CODE.keys())))
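
One way the state-code map could be put to use is the reverse direction: turning a `POBP` value that shows up in a rule back into a state abbreviation. The sketch below assumes that use case; `POBP_EXCERPT` repeats a few entries of the map so the snippet runs on its own, and `describe_pobp` is a hypothetical helper.

```python
# A few entries copied from the map above, so this snippet is self-contained.
POBP_EXCERPT = {"CA": 6, "HI": 15, "ME": 23, "TX": 48}

# Invert the map: POBP code -> state abbreviation.
CODE_TO_STATE = {code: abbrev for abbrev, code in POBP_EXCERPT.items()}

def describe_pobp(value):
    """Render a POBP value (e.g. 15.0 or '15.0') as a readable birthplace."""
    code = int(float(value))
    abbrev = CODE_TO_STATE.get(code)
    if abbrev is None:
        return f"POBP code {code} (not in excerpt)"
    return f"born in {abbrev}"

print(describe_pobp(15.0))
print(describe_pobp("48.0"))
print(describe_pobp(303))
```

Codes outside the 50 states and DC (territories and foreign countries) are not covered by the map, so the fallback branch matters on real data.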

## Within-State Bias Detection

> **Task: ACS Income (> \$50 000) Classification**  
> We use the **ACSIncome** problem from **folktables**, which predicts whether an individual’s personal income (`PINCP`) exceeds \$50 000 per year.  
> 
> - **Features used**:  
>   - `AGEP` (Age in years, must be > 16)  
>   - `MAR` (Marital status)  
>   - `POBP` (Place of birth / state)  
>   - `SEX` (Male / Female)  
>   - `RAC1P` (Race recode)  
> - **Target**: 1 if `PINCP > 50 000`, else 0  
>
> - **Preprocessing** (built-in to ACSIncome and our solver):
>   - Keep only individuals older than 16 (`AGEP > 16`)  
>   - Filter out individuals with missing or near-zero income (`PINCP` of \$100 or less) or no usual weekly work hours  
>   - Normalize missing values to –1  
> 
> Our within-state **MSD** then finds the protected subgroup (e.g. "never married", "married men") whose share among high-income individuals differs the most from its share among everyone else.
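
The within-state comparison can be sketched for a single, fixed rule as follows: split the rows by target label and compare how often the rule matches in each part. This is a toy illustration under that reading of the method; `class_shares` and the data are hypothetical, and the real search over rules is left to `detect_bias`.

```python
def class_shares(rows, labels, rule):
    """Share of rule-matching rows among positives (y=1) and among negatives (y=0)."""
    pos = [r for r, y in zip(rows, labels) if y]
    neg = [r for r, y in zip(rows, labels) if not y]

    def share(group):
        if not group:  # avoid dividing by zero when a class is empty
            return 0.0
        return sum(all(r[f] == v for f, v in rule.items()) for r in group) / len(group)

    return share(pos), share(neg)

# Toy data: MAR = 5 is common among negatives and rare among positives.
rows = [{"MAR": 5}, {"MAR": 5}, {"MAR": 5}, {"MAR": 1},
        {"MAR": 1}, {"MAR": 5}, {"MAR": 1}, {"MAR": 1}]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pos_share, neg_share = class_shares(rows, labels, {"MAR": 5})
signed_gap = pos_share - neg_share  # the sign gives the direction of the gap
print(pos_share, neg_share, signed_gap)
```

The absolute value of `signed_gap` is the discrepancy for this rule; keeping the sign shows which outcome class the subgroup is over-represented in.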

In [6]:
msd_val_1, rule_1 = detect_bias(
    X1, y1,
    protected_list  = protected_attrs,
    continuous_list = continuous_feats,
    fp_map          = feature_map,
    seed            = seed,
    n_samples       = n_samples,
    method          = method,
    method_kwargs   = method_kwargs,
)

print(f"State {state1} MSD (within-state) = {msd_val_1:.3f}")
print("Rule:", " AND ".join(str(r) for _,r in rule_1))

[INFO] Seeding the run with seed=42
[INFO] Set parameter Username
[INFO] Set parameter LicenseID to value 2649381
[INFO] Academic license - for non-commercial use only - expires 2026-04-09


State HI MSD (within-state) = 0.272
Rule: MAR = 5.0


In [7]:
msd_val_2, rule_2 = detect_bias(
    X2, y2,
    protected_list  = protected_attrs,
    continuous_list = continuous_feats,
    fp_map          = feature_map,
    seed            = seed,
    n_samples       = n_samples,
    method          = method,
    method_kwargs   = method_kwargs,
)

print(f"State {state2} MSD (within-state) = {msd_val_2:.3f}")
print("Rule:", " AND ".join(str(r) for _,r in rule_2))

[INFO] Seeding the run with seed=42


State ME MSD (within-state) = 0.329
Rule: MAR = 1.0 AND SEX = 1.0


## Interpret the Rules

Folktables encodes `MAR` (marital status) as:

| Code | Meaning       |
|------|---------------|
| 1    | Married       |
| 2    | Widowed       |
| 3    | Divorced      |
| 4    | Separated     |
| 5    | Never married |

And `SEX` as:

| Code | Meaning |
|------|---------|
| 1    | Male    |
| 2    | Female  |

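- **State HI:** `MAR = 5.0` --> the share of "Never married" people differs by **27.2 pp** between the two income classes.  
- **State ME:** `MAR = 1.0 AND SEX = 1.0` --> the share of "Married men" differs by **32.9 pp** between the two income classes.  

The MSD value gives the size of each gap; to see which income class the subgroup is over-represented in, compare the subgroup's share in each class directly.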
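
The code tables above can also be applied mechanically. The sketch below decodes a rule as it is printed by the notebook (a string such as `MAR = 1.0 AND SEX = 1.0`); `decode_rule` is a hypothetical helper that assumes every test has the form `FEATURE = number`, and features without a label table are left untouched.

```python
MAR_LABELS = {1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated", 5: "Never married"}
SEX_LABELS = {1: "Male", 2: "Female"}
LABELS = {"MAR": MAR_LABELS, "SEX": SEX_LABELS}

def decode_rule(rule_text):
    """Replace numeric codes in a printed rule with their labels where known."""
    parts = []
    for test in rule_text.split(" AND "):
        feature, value = (s.strip() for s in test.split("="))
        code = int(float(value))  # "5.0" -> 5
        label = LABELS.get(feature, {}).get(code)
        parts.append(f"{feature} = {label}" if label else test.strip())
    return " AND ".join(parts)

print(decode_rule("MAR = 5.0"))
print(decode_rule("MAR = 1.0 AND SEX = 1.0"))
print(decode_rule("RAC1P = 1.0"))
```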

## Cross-State Discrepancy

In [8]:
msd_cross, rule_cross = detect_bias_two_samples(
    X1, X2, 
    protected_list=protected_attrs,
    continuous_list=continuous_feats,
    fp_map=feature_map,
    seed=seed,
    n_samples=n_samples,
    method=method,
    method_kwargs=method_kwargs
)
print(f"{state1} vs {state2} MSD = {msd_cross:.3f}")
print("Rule: " + " AND ".join(str(r) for _,r in rule_cross))

[INFO] Seeding the run with seed=42


HI vs ME MSD = 0.709
Rule: RAC1P = 1.0


The **HI vs ME MSD** of **0.709** means that the share of the subgroup  
**`RAC1P = 1.0`**  
("White" in the ACS race recode) differs by **70.9 percentage points** between the Hawaii and Maine samples.

In other words, "White" individuals make up almost 71 pp more of one state's sample than the other's. This is the largest gap the solver found over all intersectional subgroups of the two populations.

## Conclusion

In this notebook, we have seen how **Maximum Subgroup Discrepancy (MSD)** can uncover both within-population and cross-population biases in a real-world dataset:

1. **Within-State Biases**  
   - For **Hawaii**, the most disadvantaged subgroup was defined by  
     `MAR = 5.0`  
     ("Never married"), with an MSD of **0.272**, indicating that "never married" individuals appear 27.2 pp more often in one outcome than the other.  
   - For **Maine**, the worst subgroup was  
     `MAR = 1.0 AND SEX = 1.0`  
     ("Married men"), with an MSD of **0.329**.

2. **Cross-State Drift**  
   - Comparing **Hawaii vs Maine**, the top rule was  
     `RAC1P = 1.0`  
     ("White"), with an MSD of **0.709**, a 70.9 pp gap in racial composition.



Feel free to play with different feature sets, years, or other folktables problems (ACSPublicCoverage, ACSMobility, ...).