# Exploring Functionality (Folktables)

This notebook uses ACS data (via Folktables) to show how each argument changes results and runtime:

- input modes (DataFrame / CSV / two-sample)

- protected_list, continuous_list, fp_map

- n_samples, seed

- method="MSD" vs method="l_inf"

## Imports & basics

In [7]:
import numpy as np
import pandas as pd
from pathlib import Path

from time import perf_counter

from folktables import ACSIncome

from humancompatible.detect import (
    detect_and_score,
    most_biased_subgroup, evaluate_biased_subgroup,
    most_biased_subgroup_csv, evaluate_biased_subgroup_csv,
    most_biased_subgroup_two_samples, evaluate_biased_subgroup_two_samples,
)
from humancompatible.detect.helpers import report_subgroup_bias

from supports.folktables_utils import (
    load_state_data,
    ProtectedOnly,
    FEATURE_PROCESSING,
    CONTINUOUS_FEATURES,
    FEATURE_NAMES,
    PROTECTED_VALUES_MAP,
)

STATE_A = "CA"

## Quick API recap

**Core helper**  
`detect_and_score(...) -> (rule, value)` works in 3 modes:
- DataFrame: pass X, y
- CSV: pass csv_path, target_col
- Two-sample: pass X1, X2

Other shared options:
- protected_list, continuous_list, fp_map
- n_samples, seed
- method="MSD" (default) or "l_inf" (+ method_kwargs)

## DataFrame mode (ACSIncome, single state)

In [2]:
# Load ACSIncome for a single state (features & target)
X_df, y_df = load_state_data(STATE_A, problem_cls=ACSIncome)
print(STATE_A, X_df.shape, y_df.shape)
X_df.head()

CA (195665, 10) (195665, 1)


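| | AGEP | COW | SCHL | MAR | OCCP | POBP | RELP | WKHP | SEX | RAC1P |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.0 | 6.0 | 14.0 | 1.0 | 9610.0 | 6.0 | 16.0 | 40.0 | 1.0 | 8.0 |
| 1 | 21.0 | 4.0 | 16.0 | 5.0 | 1970.0 | 6.0 | 17.0 | 20.0 | 1.0 | 1.0 |
| 2 | 65.0 | 2.0 | 22.0 | 5.0 | 2040.0 | 6.0 | 17.0 | 8.0 | 1.0 | 1.0 |
| 3 | 33.0 | 1.0 | 14.0 | 3.0 | 9610.0 | 36.0 | 16.0 | 40.0 | 1.0 | 1.0 |
| 4 | 18.0 | 2.0 | 19.0 | 5.0 | 1021.0 | 6.0 | 17.0 | 18.0 | 2.0 | 1.0 |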


In [3]:
# Start simple: pick a handful of protected features from ACSIncome
candidates = ACSIncome.features
print("ACSIncome feature candidates:", candidates)

protected_list = ["AGEP", "SCHL", "SEX", "RAC1P"]
continuous_list = [c for c in protected_list if c in CONTINUOUS_FEATURES]
fp_map = None
n_samples = 1_000
seed = 42
solver_name = "gurobi"

rule, val = detect_and_score(
    X=X_df, y=y_df,
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=fp_map,
    n_samples=n_samples, seed=seed,
    method="MSD", method_kwargs={"solver": solver_name}
)

report_subgroup_bias(f"{STATE_A} (DataFrame mode)", val, rule, FEATURE_NAMES, PROTECTED_VALUES_MAP)


[INFO] Seeding the run with seed=42 for searching the `rule`.


ACSIncome feature candidates: ['AGEP', 'COW', 'SCHL', 'MAR', 'OCCP', 'POBP', 'RELP', 'WKHP', 'SEX', 'RAC1P']


[INFO] Set parameter Username
[INFO] Set parameter LicenseID to value 2649381
[INFO] Academic license - for non-commercial use only - expires 2026-04-09
[INFO] Seeding the run with seed=42 for searching the `value`.


CA (DataFrame mode)
MSD = 0.243
Rule: AGEP between (np.float64(17.0), np.float64(24.70077))
Explained rule: Age = (np.float64(17.0), np.float64(24.70077))


## Effect of `protected_list`

`protected_list` tells the system **which columns are considered protected attributes** for subgroup search and evaluation. Only these columns are binarized and explored when building a conjunctive rule.

- Adding more protected columns increases the search space (potentially better rules) but also increases runtime.

- Fewer, meaningful protected attributes yield simpler rules that are easier to communicate.

- Default behavior: If `protected_list=None`, all feature columns are treated as protected, which is convenient but may slow the search.
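
To get a feel for why runtime grows with the number of protected columns, the sketch below counts candidate conjunctions under a simplified model: each feature is either left out of the rule or fixed to exactly one of its values. The per-feature value counts are hypothetical, and the solver does not enumerate rules this way; the point is only that the space grows multiplicatively.

```python
from math import prod

def count_conjunctions(values_per_feature):
    """Count non-empty conjunctive rules when each feature is either
    left out or fixed to exactly one of its values."""
    return prod(k + 1 for k in values_per_feature) - 1

# Hypothetical numbers of distinct (binned) values per protected column
small = {"AGEP_bins": 5, "SEX": 2}
large = {"AGEP_bins": 5, "SEX": 2, "SCHL": 24, "RAC1P": 9}

n_small = count_conjunctions(small.values())
n_large = count_conjunctions(large.values())
print(n_small, n_large)  # 17 4499
```

Adding two columns here multiplies the number of candidate rules by more than 250, which is in line with the sharp runtime increase seen when `protected_list` gets longer.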

In [4]:
rows = []
for plist in [
    ["AGEP", "SEX"],
    ["AGEP", "SCHL", "SEX", "RAC1P"],
    ["AGEP", "COW", "SCHL", "OCCP", "RELP", "SEX", "RAC1P"],
]:
    print("protected_list =", plist)
    t0 = perf_counter()

    r, v = detect_and_score(
        X=X_df, y=y_df,
        protected_list=plist,
        continuous_list=[c for c in (plist or []) if c in CONTINUOUS_FEATURES],
        fp_map=None,
        n_samples=n_samples, seed=seed,
        method="MSD", method_kwargs={"solver": solver_name}
    )
    rule_str = " AND ".join(str(binop) for _, binop in r)
    
    dt = perf_counter() - t0
    rows.append({
        "protected_list": plist,
        "n_samples": n_samples,
        "MSD": v,
        "rule": rule_str,
        "time_s": dt
    })

pd.DataFrame(rows).sort_values("time_s", ascending=True)


[INFO] Seeding the run with seed=42 for searching the `rule`.


protected_list = ['AGEP', 'SEX']


[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


protected_list = ['AGEP', 'SCHL', 'SEX', 'RAC1P']


[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


protected_list = ['AGEP', 'COW', 'SCHL', 'OCCP', 'RELP', 'SEX', 'RAC1P']


[INFO] Seeding the run with seed=42 for searching the `value`.


| | protected_list | n_samples | MSD | rule | time_s |
|---|---|---|---|---|---|
| 0 | [AGEP, SEX] | 1000 | 0.242922 | AGEP between (np.float64(17.0), np.float64(24.... | 0.07784 |
| 1 | [AGEP, SCHL, SEX, RAC1P] | 1000 | 0.242922 | AGEP between (np.float64(17.0), np.float64(24.... | 2.577876 |
| 2 | [AGEP, COW, SCHL, OCCP, RELP, SEX, RAC1P] | 1000 | 0.281277 | RELP = 0.0 | 83.154872 |


## Effect of `continuous_list`

`continuous_list` marks which protected attributes should be treated as **continuous** during binning. Continuous features are automatically partitioned into intervals (e.g., `[a, b)`), so rules can look like `AGEP between (17.0, 24.7)`.

- Intervals can capture targeted ranges (e.g., young adults) that categorical bins would smear out.

- Very fine bins on small samples can be unstable; with larger n_samples results tend to stabilize.

- Default: If a feature is not in `continuous_list`, it's treated as categorical (exact-value equality / inequality).

If you’re pre-bucketing a numeric feature via `fp_map` (e.g., decades), leave it out of `continuous_list` so it's treated as categorical.
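
As a rough illustration of what interval binning does to a continuous column, here is a minimal pandas sketch. The bin edges are hypothetical and the library chooses its own cut points; `right=False` is used only to reproduce the half-open `[a, b)` convention mentioned above.

```python
import pandas as pd

ages = pd.Series([17, 19, 23, 31, 38, 45, 52, 64, 70, 88], name="AGEP")

# Hypothetical edges; right=False gives half-open intervals [a, b)
edges = [17, 25, 45, 65, 100]
binned = pd.cut(ages, bins=edges, right=False)

# One indicator column per interval, the kind of binary feature
# a conjunctive rule can select
indicators = pd.get_dummies(binned, prefix="AGEP")
print(indicators.sum())
```

Each row falls into exactly one interval, so a rule such as "AGEP in [17, 25)" corresponds to a single indicator column.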

In [5]:
for cont in [[], ["AGEP"]]:
    print("\ncontinuous_list =", cont)
    r, v = detect_and_score(
        X=X_df, y=y_df,
        protected_list=["AGEP","SCHL","SEX","RAC1P"],
        continuous_list=cont,
        fp_map=None,
        n_samples=n_samples, seed=seed,
        method="MSD", method_kwargs={"solver": solver_name}
    )
    report_subgroup_bias(f"{STATE_A} (continuous_list={cont})", v, r, FEATURE_NAMES, PROTECTED_VALUES_MAP)


[INFO] Seeding the run with seed=42 for searching the `rule`.



continuous_list = []


[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


CA (continuous_list=[])
MSD = 0.221
Rule: SCHL = 21.0
Explained rule: Educational attainment = 21.0

continuous_list = ['AGEP']


[INFO] Seeding the run with seed=42 for searching the `value`.


CA (continuous_list=['AGEP'])
MSD = 0.243
Rule: AGEP between (np.float64(17.0), np.float64(24.70077))
Explained rule: Age = (np.float64(17.0), np.float64(24.70077))


## Effect of `fp_map`

`fp_map` is a dict mapping a column name to a function that is applied to that column's values before binarization. Use it to:

- Normalize or clean raw codes (`"Male"/"M"` -> 1, `"Female"/"F"` -> 0).

- Bucket continuous values (e.g., `AGEP` -> decades).

- Collapse rare categories into an `"Other"` bin.

Why it matters:

- Human-meaningful codes produce cleaner rules.

- Well-chosen categories can significantly speed up search without losing signal.

Keep mappings deterministic and simple (no target leakage).
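
The sketch below shows one way such a mapping could be applied to a DataFrame, assuming each function is called value by value. `apply_fp_map` is a hypothetical helper written for this illustration, not part of the library.

```python
import pandas as pd

def apply_fp_map(df, fp_map):
    """Return a copy of df with each mapped column transformed value by value."""
    out = df.copy()
    for col, fn in (fp_map or {}).items():
        out[col] = out[col].map(fn)
    return out

toy = pd.DataFrame({
    "AGEP": [17.0, 24.0, 35.0, 61.0],
    "SEX": ["M", "Female", "F", "Male"],
})
fp_map_toy = {
    "AGEP": lambda a: int(a // 10),                   # bucket into decades
    "SEX": lambda s: 1 if s in ("Male", "M") else 0,  # normalize raw codes
}
mapped = apply_fp_map(toy, fp_map_toy)
print(mapped)
```

Both functions depend only on the value itself, so the mapping stays deterministic and cannot leak the target.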

In [6]:
# Bucket ages by decade to simplify
for fp_map_demo in [
    {},
    {
        "AGEP": lambda a: int(a // 10),  # decades
    },
]:
    r, v = detect_and_score(
        X=X_df, y=y_df,
        protected_list=["AGEP","SCHL","SEX","RAC1P"],
        continuous_list=[],     # treat AGEP as categorical after bucketing
        fp_map=fp_map_demo,
        n_samples=n_samples, seed=seed,
        method="MSD", method_kwargs={"solver": solver_name}
    )
    report_subgroup_bias(f"\n{STATE_A} (fp_map={fp_map_demo})", v, r, FEATURE_NAMES, PROTECTED_VALUES_MAP)


[INFO] Seeding the run with seed=42 for searching the `rule`.
[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.



CA (fp_map={})
MSD = 0.221
Rule: SCHL = 21.0
Explained rule: Educational attainment = 21.0


[INFO] Seeding the run with seed=42 for searching the `value`.



CA (fp_map={'AGEP': <function <lambda> at 0x0000028B82216E80>})
MSD = 0.229
Rule: AGEP = 2.0
Explained rule: Age = 2.0


## `n_samples` and `seed`

`n_samples` caps the number of rows via uniform downsampling; `seed` controls that randomness (and any solver randomness we expose).

Why it matters:

- Smaller samples run faster; larger samples give more stable rules/MSD values.

- Fix `seed` to reproduce runs exactly in the notebook.

Rules of thumb:

- Start with a modest `n_samples` (e.g., 1k-5k) to iterate quickly, then increase to validate stability.

- Expect runtime to grow with `n_samples` and with the size of `protected_list`, but treat single timings as noisy: in the run below, 2,000 rows finished faster than 1,000.
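
A minimal sketch of seeded uniform downsampling, to make the reproducibility point concrete. `downsample` is a hypothetical helper; the library's internal sampling may differ in detail.

```python
import numpy as np
import pandas as pd

def downsample(X, y, n_samples, seed):
    """Uniformly sample at most n_samples rows, keeping X and y aligned."""
    if n_samples is None or n_samples >= len(X):
        return X, y
    idx = X.sample(n=n_samples, random_state=seed).index
    return X.loc[idx], y.loc[idx]

# Toy data standing in for the ACS frames
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"AGEP": rng.integers(17, 90, size=1_000)})
y_toy = pd.DataFrame({"target": rng.integers(0, 2, size=1_000)})

Xa, ya = downsample(X_toy, y_toy, n_samples=100, seed=42)
Xb, yb = downsample(X_toy, y_toy, n_samples=100, seed=42)
print(len(Xa), Xa.index.equals(Xb.index))  # 100 True
```

With the same `seed`, both calls select the same rows, which is why fixing it makes the downstream rule and value repeatable.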

In [7]:
print("n_samples effect:")
for n in [1_000, 2_000, 5_000, 10_000, 20_000, 50_000]:
    t0 = perf_counter()
    r, v = detect_and_score(
        X=X_df, y=y_df,
        protected_list=["AGEP","SCHL","SEX","RAC1P"],
        continuous_list=["AGEP"],
        fp_map=None,
        n_samples=n, seed=42,
        method="MSD", method_kwargs={"solver": solver_name}
    )
    rule_str = " AND ".join(str(binop) for _, binop in r)
    dt = perf_counter() - t0
    print(f"  n_samples={n:>6} -> MSD={v:.3f}, rule='{rule_str}', time={dt:.3f}s\n")

[INFO] Seeding the run with seed=42 for searching the `rule`.


n_samples effect:


[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


  n_samples=  1000 -> MSD=0.243, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=3.777s



[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


  n_samples=  2000 -> MSD=0.205, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=3.101s



[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


  n_samples=  5000 -> MSD=0.202, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=9.429s



[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


  n_samples= 10000 -> MSD=0.199, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=7.670s



[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] Seeding the run with seed=42 for searching the `rule`.


  n_samples= 20000 -> MSD=0.197, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=11.131s



[INFO] Seeding the run with seed=42 for searching the `value`.


  n_samples= 50000 -> MSD=0.196, rule='AGEP between (np.float64(17.0), np.float64(24.70077))', time=37.120s



In [8]:
print("Seed sensitivity:")
for s in [1, 2, 3]:
    t0 = perf_counter()
    r, v = detect_and_score(
        X=X_df, y=y_df,
        protected_list=["AGEP","SCHL","SEX","RAC1P"],
        continuous_list=["AGEP"],
        fp_map=None,
        n_samples=50_000, seed=s,
        method="MSD", method_kwargs={"solver": solver_name}
    )
    rule_str = " AND ".join(str(binop) for _, binop in r)
    dt = perf_counter() - t0
    print(f"  seed={s} -> MSD={v:.3f}, rule={rule_str}, time={dt:.3f}s\n")

[INFO] Seeding the run with seed=1 for searching the `rule`.


Seed sensitivity:


[INFO] Seeding the run with seed=1 for searching the `value`.
[INFO] Seeding the run with seed=2 for searching the `rule`.


  seed=1 -> MSD=0.195, rule=AGEP between (np.float64(17.0), np.float64(24.70077)), time=13.254s



[INFO] Seeding the run with seed=2 for searching the `value`.
[INFO] Seeding the run with seed=3 for searching the `rule`.


  seed=2 -> MSD=0.195, rule=AGEP between (np.float64(17.0), np.float64(24.70077)), time=12.076s



[INFO] Seeding the run with seed=3 for searching the `value`.


  seed=3 -> MSD=0.198, rule=AGEP between (np.float64(17.0), np.float64(24.70077)), time=17.491s



## Method choice: MSD vs l∞

Two complementary questions:

- **MSD (search)**: "_Which subgroup has the largest outcome-rate gap?_"  
Output: a conjunctive rule (e.g., `AGEP between ... AND SEX = Male`) and its discrepancy value (signed or absolute).

- **l∞ (test)**: "_Does this specific subgroup differ from the population by more than δ, in any positive-class feature bin?_"  
Output: `1.0` if the sup-norm distance is ≤ δ (within tolerance), `0.0` otherwise.

When to use:

- Start with MSD to discover a candidate subgroup.

- Use l∞ to validate a particular subgroup against a policy tolerance (e.g., δ = 0.10).
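
To make the two quantities concrete, here is a toy sketch under stated assumptions. It assumes the discrepancy of a subgroup is the absolute difference in its prevalence between the two outcome classes (matching the two-sample description later in this notebook), and it reads the l∞ check as a sup-norm distance between two normalized histograms taken over positive-class rows. Which histograms the library actually compares is an assumption here, and both helper functions are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "SEX":  [1, 1, 1, 1, 2, 2, 2, 2],
    "SCHL": [21, 21, 16, 16, 21, 16, 16, 16],
    "y":    [1, 1, 1, 0, 1, 0, 0, 0],
})

def subgroup_discrepancy(mask, y):
    """|P(subgroup | y=1) - P(subgroup | y=0)| for a boolean subgroup mask."""
    pos = y == 1
    return abs(mask[pos].mean() - mask[~pos].mean())

def sup_norm_distance(a, b, bins):
    """Largest absolute gap between the relative frequencies of a and b over bins."""
    pa = a.value_counts(normalize=True).reindex(bins, fill_value=0.0)
    pb = b.value_counts(normalize=True).reindex(bins, fill_value=0.0)
    return float((pa - pb).abs().max())

# Discrepancy of the fixed subgroup SCHL = 21
mask = toy["SCHL"] == 21
msd_value = subgroup_discrepancy(mask, toy["y"])

# Sup-norm check on positives: SEX histogram inside the subgroup vs. all positives
positives = toy[toy["y"] == 1]
dist = sup_norm_distance(
    positives.loc[positives["SCHL"] == 21, "SEX"],
    positives["SEX"],
    bins=[1, 2],
)
delta = 0.10
within_tolerance = dist <= delta
print(msd_value, round(dist, 4), within_tolerance)
```

The first number answers "how large is the gap for this subgroup?", while the boolean answers "is the gap within the policy tolerance δ?".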

In [12]:
protected_list = ["COW", "SCHL", "OCCP", "SEX", "RAC1P"]
continuous_list = [c for c in protected_list if c in CONTINUOUS_FEATURES]
n_samples = 2_000
seed = 42

rule, val = detect_and_score(
    X=X_df, y=y_df,
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    n_samples=n_samples, seed=seed,
    method="MSD", method_kwargs={"solver": solver_name}
)
rule_str = " AND ".join(str(binop) for _, binop in rule)

print(rule_str)
print(val)

[INFO] Seeding the run with seed=42 for searching the `rule`.
[INFO] Seeding the run with seed=42 for searching the `value`.


SCHL = 21.0
0.16064135735776874


In [13]:
# MSD is already demonstrated above. Now l_inf:
linf_kwargs = {
    "feature_involved": "SCHL",
    "subgroup_to_check": 21.0,
    "delta": 0.1,
}
_ = detect_and_score(
    X=X_df, y=y_df,
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    n_samples=n_samples, seed=seed,
    method="l_inf", method_kwargs=linf_kwargs
)

[INFO] Seeding the run with seed=42 for searching the `value`.
[INFO] The most impacted subgroup bias <= 0.1


## Data input modes (DataFrame, CSV, two-sample)

- DataFrame mode: You already have `X` and `y` in memory. Fastest for iterative exploration.

- CSV mode: Provide `csv_path` and `target_col` if your data sits on disk. Handy for shareable, reproducible runs.

- Two-sample mode: Provide `X1` and `X2` (same columns). We synthesize a target (`0` for `X1`, `1` for `X2`) to find where the prevalence of a subgroup differs the most across datasets (e.g., state A vs. state B).
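
The target synthesis in two-sample mode can be sketched in plain pandas as follows. `stack_two_samples` is a hypothetical helper for illustration, and the prevalence gap shown is for one hand-picked subgroup rather than the result of a search.

```python
import pandas as pd

def stack_two_samples(X1, X2):
    """Concatenate two samples and label their rows 0 (from X1) and 1 (from X2)."""
    X = pd.concat([X1, X2], ignore_index=True)
    y = pd.Series([0] * len(X1) + [1] * len(X2), name="target")
    return X, y

X1_toy = pd.DataFrame({"RAC1P": [1, 1, 1, 2]})
X2_toy = pd.DataFrame({"RAC1P": [1, 2, 2, 2]})

X_all, y_all = stack_two_samples(X1_toy, X2_toy)

# Prevalence of the subgroup RAC1P = 1 in each sample, and the gap between them
in_group = X_all["RAC1P"] == 1
gap = abs(in_group[y_all == 0].mean() - in_group[y_all == 1].mean())
print(gap)  # 0.5
```

Once the label is in place, the problem has the same shape as DataFrame mode, so the same options (`protected_list`, `continuous_list`, `n_samples`, `seed`) apply.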

In [4]:
n_samples = 1_000
protected_list = ["AGEP", "SCHL", "SEX", "RAC1P"]
continuous_list = [c for c in protected_list if c in CONTINUOUS_FEATURES]

In [5]:
print("=== DataFrame mode ===")
t0 = perf_counter()
rule_df, msd_df = detect_and_score(
    X=X_df, y=y_df,
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    n_samples=n_samples, seed=seed,
    method="MSD", method_kwargs={"solver": solver_name},
)
dt = perf_counter() - t0
report_subgroup_bias("DataFrame", msd_df, rule_df, FEATURE_NAMES, PROTECTED_VALUES_MAP)
print(f"(elapsed: {dt:.3f}s)\n")

[INFO] Seeding the run with seed=42 for searching the `rule`.


=== DataFrame mode ===


[INFO] Seeding the run with seed=42 for searching the `value`.


DataFrame
MSD = 0.243
Rule: AGEP between (np.float64(17.0), np.float64(24.70077))
Explained rule: Age = (np.float64(17.0), np.float64(24.70077))
(elapsed: 2.546s)



In [None]:
print("=== CSV mode ===")
# Write the full dataset (features + target) to a CSV snapshot, then run the CSV helpers
csv_path = Path("acs_demo.csv")
pd.concat([X_df, y_df.rename(columns={y_df.columns[0]: "target"})], axis=1).to_csv(csv_path, index=False)

t0 = perf_counter()
rule_csv = most_biased_subgroup_csv(
    csv_path=csv_path, target_col="target",
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    seed=seed, n_samples=n_samples,
    method="MSD", method_kwargs={"solver": solver_name},
)
msd_csv = evaluate_biased_subgroup_csv(
    csv_path=csv_path, target_col="target",
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    seed=seed, n_samples=n_samples,
    method="MSD", method_kwargs={"rule": rule_csv},
)
dt = perf_counter() - t0
report_subgroup_bias("CSV", msd_csv, rule_csv, FEATURE_NAMES, PROTECTED_VALUES_MAP)
print(f"(elapsed: {dt:.3f}s)\n")

=== CSV mode ===


[INFO] Seeding the run with seed=42 for searching the `rule`.


acs_demo.csv


[INFO] Seeding the run with seed=42 for searching the `value`.


CSV
MSD = 0.243
Rule: AGEP between (np.float64(17.0), np.float64(24.70077))
Explained rule: Age = (np.float64(17.0), np.float64(24.70077))
(elapsed: 2.771s)



In [14]:
print("=== Two-sample mode ===")
# Compare two states (same columns). Feel free to change STATE_B.
STATE_B = "NY"
X_b, y_b = load_state_data(STATE_B, problem_cls=ACSIncome)

t0 = perf_counter()
# Find the subgroup whose prevalence differs most between X_df (A) and X_b (B)
rule_2s = most_biased_subgroup_two_samples(
    X1=X_df[protected_list], X2=X_b[protected_list],
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    seed=seed, n_samples=n_samples,
    method="MSD", method_kwargs={"solver": solver_name},
)

# Now quantify that gap on the combined data
msd_2s = evaluate_biased_subgroup_two_samples(
    X1=X_df[protected_list], X2=X_b[protected_list],
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=None,
    seed=seed, n_samples=n_samples,
    method="MSD", method_kwargs={"rule": rule_2s},
)
dt = perf_counter() - t0
report_subgroup_bias(f"Two-sample ({STATE_A} vs {STATE_B})", msd_2s, rule_2s, FEATURE_NAMES, PROTECTED_VALUES_MAP)
print(f"(elapsed: {dt:.3f}s)")

=== Two-sample mode ===


[INFO] Seeding the run with seed=42 for searching the `rule`.
[INFO] Seeding the run with seed=42 for searching the `value`.


Two-sample (CA vs NY)
MSD = 0.116
Rule: RAC1P = 1.0
Explained rule: Race = White
(elapsed: 2.387s)
