# Folktables: Real-World Bias Detection with MSD (2)

This is the **second** part of the folktables walkthrough.  
Here we focus on finding the subgroup in which two states differ most.
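To make the goal concrete, here is a minimal sketch of what "the subgroup where two samples differ most" means. It reads MSD as the largest absolute difference in subgroup share over conjunctions of attribute values, which is how this notebook interprets its results further down. The notebook itself delegates the search to `most_biased_subgroup_two_samples` with a solver; the sketch swaps in plain exhaustive enumeration, which is only feasible for tiny inputs. The toy data and the helper `brute_force_msd` are hypothetical and not part of `humancompatible.detect`.

```python
from itertools import combinations, product

import pandas as pd

# Toy samples standing in for two states (hypothetical data, not ACS records).
sample_a = pd.DataFrame({
    "SEX":   [1, 1, 1, 1, 2, 2, 2, 2],
    "RAC1P": [1, 1, 2, 2, 1, 1, 2, 2],
})
sample_b = pd.DataFrame({
    "SEX":   [1, 1, 1, 1, 1, 2, 2, 2],
    "RAC1P": [1, 1, 1, 1, 1, 1, 2, 2],
})


def brute_force_msd(a, b, columns):
    """Return (gap, rule) for the conjunction with the largest |share_b - share_a|."""
    best_gap, best_rule = 0.0, ()
    for k in range(1, len(columns) + 1):
        for cols in combinations(columns, k):
            # Candidate values: everything observed in either sample.
            values = [sorted(set(a[c].tolist()) | set(b[c].tolist())) for c in cols]
            for combo in product(*values):
                rule = tuple(zip(cols, combo))
                mask_a = pd.Series(True, index=a.index)
                mask_b = pd.Series(True, index=b.index)
                for col, val in rule:
                    mask_a &= a[col] == val
                    mask_b &= b[col] == val
                gap = abs(float(mask_b.mean()) - float(mask_a.mean()))
                if gap > best_gap:
                    best_gap, best_rule = gap, rule
    return best_gap, best_rule


gap, rule = brute_force_msd(sample_a, sample_b, ["SEX", "RAC1P"])
print(f"MSD = {gap:.3f}, rule = {rule}")
```

On this toy input the largest gap is 0.375, for the conjunction `SEX = 1 AND RAC1P = 1` (2 of 8 rows in the first sample, 5 of 8 in the second). The number of candidate conjunctions grows exponentially with the number of attributes, which is presumably why the real method relies on an optimisation solver instead.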


## Configuration & Imports

In [1]:
import numpy as np
import pandas as pd

# We'll run the Income >= 50k task for each state:
from folktables import ACSIncome

from humancompatible.detect import detect_and_score
from humancompatible.detect import most_biased_subgroup, evaluate_biased_subgroup
from humancompatible.detect import most_biased_subgroup_two_samples, evaluate_biased_subgroup_two_samples
from humancompatible.detect.helpers import report_subgroup_bias

from humancompatible.detect.helpers.prepare import prepare_dataset
from humancompatible.detect.helpers.utils import signed_subgroup_discrepancy
from humancompatible.detect.methods.msd.mapping_msd import subgroup_map_from_conjuncts_dataframe
from humancompatible.detect.helpers.utils import signed_subgroup_prevalence_diff

from supports.folktables_utils import (
    load_state_data,
    ProtectedOnly,
    FEATURE_PROCESSING,
    CONTINUOUS_FEATURES, 
    FEATURE_NAMES, 
    PROTECTED_VALUES_MAP,
)

## PUMS Data Dictionary

For a complete list of all ACS PUMS variables, their codes and labels (e.g. every state code for `POBP`, every education level for `SCHL`, etc.), see the official 2018 PUMS Data Dictionary:

> **PUMS Data Dictionary (2018)**  
> https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf

## State selection

In [2]:
# ────────── Which two states to compare ──────────
state1, state2 = "FL", "NH" # Florida and New Hampshire

## Global Configuration

In [3]:
# ────────── Which columns are protected ──────────
print("Features can be selected from:", ACSIncome.features)
print("Explanation of their abbreviations could be found below.")
protected_attrs = ['AGEP', 'SCHL', 'OCCP', 'SEX', 'RAC1P']

# ────────── MSD / Solver settings ──────────
n_samples = 1_000   # number of samples to subsample for faster computation
method = "MSD"
seed = 42 # fixed for demo purposes 
method_kwargs = {
    "solver": "gurobi",  # comment out if you don’t have a license
}

Features can be selected from: ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']
Explanation of their abbreviations could be found below.
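The `n_samples` and `seed` settings above control how many rows are kept and make the draw repeatable. How the library subsamples internally is not shown in this notebook, so the following is only an illustrative sketch of seeded subsampling with NumPy and pandas; the helper `subsample` and the stand-in table are hypothetical.

```python
import numpy as np
import pandas as pd

n_samples = 1_000
seed = 42

# Hypothetical stand-in for a large state table.
df = pd.DataFrame({"AGEP": np.arange(5_000), "SEX": np.tile([1, 2], 2_500)})


def subsample(frame, n_max, seed):
    """Return at most n_max rows, drawn without replacement using a fixed seed."""
    if len(frame) <= n_max:
        return frame
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frame), size=n_max, replace=False)
    return frame.iloc[np.sort(idx)]  # keep the original row order


first = subsample(df, n_samples, seed)
second = subsample(df, n_samples, seed)
print(first.shape, first.equals(second))
```

Using the same seed twice yields the same rows, which is what makes the reported MSD values comparable between runs.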


## Download & Prepare the Two-State Dataset

In [4]:
X_all1, _ = load_state_data(state1, problem_cls=ProtectedOnly)
X_all2, _ = load_state_data(state2, problem_cls=ProtectedOnly)

protected_attrs_all = ProtectedOnly.features
continuous_feats_prot = [a for a in protected_attrs_all if a in CONTINUOUS_FEATURES]
feature_map = FEATURE_PROCESSING  # optional custom binning - to reduce the number of bins

print(f"{state1} shape:", X_all1.shape)
print(f"{state2} shape:", X_all2.shape)

display(X_all1.head())

FL shape: (202160, 14)
NH shape: (13780, 14)


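|   | SEX | RAC1P | AGEP | POBP | DIS | CIT | MIL | ANC | NATIVITY | DEAR | DEYE | DREM | FER | POVPIP |
|---|-----|-------|------|------|-----|-----|-----|-----|----------|------|------|------|-----|--------|
| 0 | 1.0 | 8.0 | 64.0 | 327.0 | 2.0 | 5.0 | 4.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 1 | 2.0 | 1.0 | 95.0 | 12.0 | 1.0 | 1.0 | 4.0 | 1.0 | 1.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 2.0 | 15.0 | 12.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 3 | 1.0 | 9.0 | 20.0 | 11.0 | 2.0 | 1.0 | 2.0 | 2.0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 |
| 4 | 1.0 | 2.0 | 18.0 | 40.0 | 2.0 | 1.0 | 4.0 | 1.0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 |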


## Cross-State Discrepancy

In [5]:
# find the most discrepant subgroup across the two states, then evaluate its MSD
from copy import deepcopy


rule_cross = most_biased_subgroup_two_samples(
    X_all1, X_all2, 
    protected_list=protected_attrs_all,
    continuous_list=continuous_feats_prot,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=method_kwargs
)

m_kwargs = deepcopy(method_kwargs)
m_kwargs = {**m_kwargs, "rule": rule_cross}
msd_cross = evaluate_biased_subgroup_two_samples(
    X_all1, X_all2, 
    protected_list=protected_attrs_all,
    continuous_list=continuous_feats_prot,
    fp_map=feature_map,
    n_samples=n_samples,
    seed=seed,
    method=method,
    method_kwargs=m_kwargs
)

report_subgroup_bias(
    f"{state1} vs {state2}",
    msd_cross,
    rule_cross,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

[INFO] Seeding the run with seed=42
[INFO] Set parameter Username
[INFO] Set parameter LicenseID to value 2649381
[INFO] Academic license - for non-commercial use only - expires 2026-04-09
[INFO] Seeding the run with seed=42


FL vs NH
MSD = 0.226
Rule: RAC1P = 1.0 AND CIT = 1.0 AND DREM = 2.0 AND DEAR = 2.0
Explained rule: Race = White AND Citizenship = Born in the US AND Cognitive difficulty = No AND Difficulty hearing = No
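A rule such as the one printed above is a conjunction of `column = value` conditions, so it selects exactly the rows that satisfy every condition. The sketch below applies the printed rule to the five Florida rows from the earlier preview using plain pandas. The list-of-pairs encoding and the helper `conjunction_mask` are hypothetical illustrations, not the library's rule object or its `subgroup_map_from_conjuncts_dataframe` function, and the values are the raw previewed ones rather than the processed ones the solver sees.

```python
import pandas as pd

# First five Florida rows from the preview, restricted to the rule's columns.
head = pd.DataFrame({
    "RAC1P": [8.0, 1.0, 2.0, 9.0, 2.0],
    "CIT":   [5.0, 1.0, 1.0, 1.0, 1.0],
    "DREM":  [2.0, 1.0, 2.0, 2.0, 2.0],
    "DEAR":  [2.0, 2.0, 2.0, 2.0, 2.0],
})

# Hypothetical plain-Python encoding of the printed rule.
rule = [("RAC1P", 1.0), ("CIT", 1.0), ("DREM", 2.0), ("DEAR", 2.0)]


def conjunction_mask(frame, conjuncts):
    """Boolean mask of rows that satisfy every (column, value) condition."""
    mask = pd.Series(True, index=frame.index)
    for column, value in conjuncts:
        mask &= frame[column] == value
    return mask


mask = conjunction_mask(head, rule)
print(mask.tolist())
```

None of the five previewed rows belongs to the subgroup: row 1 matches on race, citizenship and hearing but has `DREM = 1.0`, so the conjunction fails for it.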


### Signed Cross‐State Gap (Δ)

Here we recompute Δ for the cross-state rule.  
To do so, we first prepare the datasets the same way `evaluate_biased_subgroup*()` does, and then compare the subgroup's prevalences directly:

- **Δ > 0** -> subgroup is more prevalent in **NH**  
- **Δ < 0** -> subgroup is more prevalent in **FL**  


In [6]:
# y_cross: 0 for the first state, 1 for the second state
X_cross = pd.concat([X_all1, X_all2], ignore_index=True)
y_cross = np.concatenate([
    np.zeros(X_all1.shape[0], dtype=int),
    np.ones(X_all2.shape[0], dtype=int),
])
y_cross = pd.DataFrame(y_cross, columns=["target"])

np.random.seed(seed)                # keep the subsampling reproducible
_, X_cross_sub, y_cross_sub = prepare_dataset(
    X_cross,
    y_cross,
    n_max=n_samples,
    protected_attrs=protected_attrs_all,
    continuous_feats=continuous_feats_prot,
    feature_processing=feature_map,
)

In [7]:
mask_sub_1 = (y_cross_sub.values == 0)
mask_sub_2 = (y_cross_sub.values == 1)
X1_cross_sub = X_cross_sub[mask_sub_1]
X2_cross_sub = X_cross_sub[mask_sub_2]

mask_A = subgroup_map_from_conjuncts_dataframe(rule_cross, X1_cross_sub)
mask_B = subgroup_map_from_conjuncts_dataframe(rule_cross, X2_cross_sub)

delta_cross = signed_subgroup_prevalence_diff(mask_A, mask_B)
direction = (
    f"more prevalent in {state2}" if delta_cross > 0
    else f"more prevalent in {state1}"
)
print(f"Δ = {delta_cross:.3f} -> subgroup is {direction}.")


Δ = 0.226 -> subgroup is more prevalent in NH.


> **Florida and New Hampshire - interpreting MSD & Δ**  
> * **MSD = 0.226** is the *magnitude* of the biggest demographic shift between the two states.  
> * **Δ = +0.226** means the subgroup is **more common in New Hampshire than in Florida** (rows from NH were labelled `1`).  
>   If Δ were negative we would read it the other way around.
>
> Still curious about the exact codes for variables such as `RAC1P` or `CIT`?  
> See the [2018 PUMS Data Dictionary](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf).


The **FL vs NH MSD** of **0.226** means that the subgroup  
**`RAC1P = 1.0 AND CIT = 1.0 AND DREM = 2.0 AND DEAR = 2.0`**  
(White, US-born people without cognitive or hearing difficulties) is disproportionately represented: its share of the population differs by **22.6 pp** between Florida and New Hampshire.
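The percentage-point reading can be sketched with two boolean membership masks. The sign convention used here (share in the second sample minus share in the first) is inferred from this notebook's statement that Δ > 0 means "more prevalent in NH"; it is an assumption about, not a copy of, what `signed_subgroup_prevalence_diff` does. The masks are made-up toy values and `prevalence_gap` is a hypothetical helper.

```python
import numpy as np


def prevalence_gap(mask_first, mask_second):
    """Share of the subgroup in the second sample minus its share in the first."""
    return float(np.mean(mask_second) - np.mean(mask_first))


# Hypothetical masks: 2 of 8 rows match in the first sample, 5 of 8 in the second.
mask_fl = np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool)
mask_nh = np.array([1, 1, 1, 1, 1, 0, 0, 0], dtype=bool)

delta = prevalence_gap(mask_fl, mask_nh)
print(f"Δ = {delta:+.3f} ({delta * 100:+.1f} pp)")
```

Swapping the two arguments flips the sign while the magnitude stays the same, which is why the unsigned MSD and |Δ| coincide for a fixed rule.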


## Takeaways

Across this notebook and the previous one, we have seen how **Maximum Subgroup Discrepancy (MSD)** can uncover both within-population and cross-population biases in a real-world dataset:

1. **Within-State Discrepancies**  
   - For **Florida**, the subgroup  
     `RAC1P = 1.0 AND SEX = 1.0`  
     ("White" **and** "Male") is the most disproportionately represented, with an MSD of **0.193** (19.3 pp gap vs its complement).  
   - For **New Hampshire**, the subgroup  
     `RAC1P = 1.0 AND SEX = 1.0`  
     ("White" **and** "Male") is the most disproportionately represented, with an MSD of **0.217**.

2. **Cross-State Discrepancy**  
   - Comparing **Florida vs New Hampshire**, the subgroup  
     **`RAC1P = 1.0 AND CIT = 1.0 AND DREM = 2.0 AND DEAR = 2.0`**  
     (White, US-born people without cognitive or hearing difficulties) is the most disproportionately represented, with an MSD of **0.226** (22.6 pp gap in share).

Feel free to play with different feature sets, years, or other folktables problems (ACSPublicCoverage, ACSMobility, ...).

## Tips & next steps

- The signed value for a given rule can also be obtained by calling  
`evaluate_biased_subgroup*()` with the argument `signed=True`.  

- This notebook and the previous one present alternative solutions and show how the algorithms are coded.  

- The signatures of the functions and methods, and their attributes, are covered in more detail in the **next notebook**!