## How to run `dnf_bias.py`

First, we need to tell Python where to find our `humancompatible.detect` package.

In [1]:
from pathlib import Path
import sys

PROJECT_ROOT = Path().resolve().parent

sys.path.insert(0, str(PROJECT_ROOT))
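
Before going further, it can help to confirm that the path change took effect. The following is a minimal sketch using the standard library's `importlib.util.find_spec`; the helper name `is_importable` is hypothetical and not part of the package.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate `module_name` on the current sys.path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package of a dotted name cannot be found
        return False


# Hypothetical check; should print True once PROJECT_ROOT is on sys.path
print(is_importable("humancompatible.detect"))
```

If this prints `False`, `PROJECT_ROOT` probably does not point at the directory that contains the `humancompatible` folder.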

Next, point to your CSV and choose (or create) a folder to collect results:

In [2]:
dataset_path = PROJECT_ROOT / "humancompatible" / "detect" / "data" / "ACSIncome_CA.csv"
result_folder = PROJECT_ROOT / "results_dir"

if not dataset_path.exists():
    raise FileNotFoundError(f"Dataset not found at {dataset_path}")
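
The cell above checks that the dataset exists but does not create the results folder. Whether `dnf_bias` creates it on its own is not confirmed here, so creating it up front is a cautious option. A minimal sketch, with a hypothetical temporary path standing in for `result_folder`:

```python
from pathlib import Path
import tempfile

# Hypothetical location used only for this sketch; in the notebook this would be `result_folder`
demo_folder = Path(tempfile.gettempdir()) / "results_dir_demo"

# parents=True creates missing parent directories; exist_ok=True makes the call safe to repeat
demo_folder.mkdir(parents=True, exist_ok=True)

print(demo_folder.is_dir())
```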

Here we tell the pipeline:

- `target`: the column that holds the binary label
- `protected_list`: the protected features (subgroups are enumerated over these only)
- `continuous_list`: the features to treat as numeric ranges rather than categories
- `feature_processing`: integer-division rules that collapse high-cardinality fields

In [3]:
target = "PINCP"

protected_list = [
    "SEX", "RAC1P", "AGEP", "POBP", "_POBP",
    "DIS", "CIT", "MIL", "ANC", "NATIVITY",
    "DEAR", "DEYE", "DREM", "FER", "POVPIP",
]

continuous_list = ["AGEP", "PINCP", "WKHP", "JWMNP", "POVPIP"]

feature_processing = {
    "POBP": 100,
    "OCCP": 100,
    "PUMA": 100,
    "POWPUMA": 1000,
}
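
To make the `feature_processing` rules concrete, the sketch below assumes each rule is applied as floor division of the raw code by the listed divisor, so that many distinct codes fall into one coarser bucket. The helper `collapse` and the sample record are hypothetical; the pipeline's own implementation may differ.

```python
feature_processing = {"POBP": 100, "OCCP": 100, "PUMA": 100, "POWPUMA": 1000}


def collapse(record: dict, fp_map: dict) -> dict:
    """Floor-divide every field listed in fp_map; leave other fields untouched."""
    return {
        col: (val // fp_map[col] if col in fp_map else val)
        for col, val in record.items()
    }


# Made-up values, for illustration only
sample = {"POBP": 215, "OCCP": 4720, "AGEP": 37}
collapsed = collapse(sample, feature_processing)
print(collapsed)  # {'POBP': 2, 'OCCP': 47, 'AGEP': 37}
```

Under this assumption, a subgroup literal on a processed column refers to the collapsed value, not the raw code.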


In [4]:
from humancompatible.detect.dnf_bias import dnf_bias

max_dist, max_msd, best_literals = dnf_bias(
    csv=dataset_path,
    out=result_folder,
    target=target,
    protected_list=protected_list,
    continuous_list=continuous_list,
    fp_map=feature_processing,
)



Result saved to C:\personal\work\detect\results_dir\output.txt


In [5]:
print(f"Max distance: {max_dist}")
print(f"Max MSD: {max_msd}")
print(f"Max subgroup: ({' AND '.join(sorted(map(str, best_literals)))}) \n")


Max distance: 0.6992058987801011
Max MSD: 3.691621021357009e-05
Max subgroup: (POBP = 2 AND RAC1P = 2.0) 
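
To inspect the rows behind a reported subgroup, one option is to turn each literal back into a column/value pair and filter on it. This sketch assumes every literal prints as `COLUMN = value`, as in the output above, and that values compare numerically; it does not handle any other literal form. The names `parse_literal` and `in_subgroup` and the sample rows are hypothetical. Since `POBP` appears in `feature_processing`, `POBP = 2` presumably refers to the processed value rather than the raw code.

```python
def parse_literal(literal: str) -> tuple[str, float]:
    """Split a literal such as 'RAC1P = 2.0' into ('RAC1P', 2.0)."""
    column, value = literal.split(" = ")
    return column.strip(), float(value)


def in_subgroup(row: dict, literals: list[str]) -> bool:
    """Return True if the row satisfies every equality literal (logical AND)."""
    return all(float(row[col]) == val for col, val in map(parse_literal, literals))


literals = ["POBP = 2", "RAC1P = 2.0"]

# Made-up rows, for illustration only
rows = [
    {"POBP": 2, "RAC1P": 2},
    {"POBP": 2, "RAC1P": 1},
    {"POBP": 0, "RAC1P": 2},
]

matches = [row for row in rows if in_subgroup(row, literals)]
print(matches)  # [{'POBP': 2, 'RAC1P': 2}]
```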



See `output.txt` for more detailed results.

In [6]:
dnf_bias?

Signature:
dnf_bias(
    csv: pathlib._local.Path,
    out: pathlib._local.Path,
    target: str,
    protected_list: List[str],
    continuous_list: List[str],
    fp_map: Dict[str, int],
    model: str = 'MMD',
    seed: int = 0,
    n_samples: int = 1000000,
    train_samples: int = 100000,
    time_limit: int = 600,
    n_min: int = 10,
) -> Tuple[float, float, List[str]]
Docstring:
Running DNF bias detection experiments on given CSV dataset.

Args:
    csv (Path): Path to the input CSV file.
    out (Path): Directory where 'output.txt' will be saved.
    target (str): Name of the column to treat as the binary target variable.
    protected_list (List[str]): Comma-separated list of columns to treat as protected attributes; subgroups
        are defined over these attributes.
    continuous_list (List[str]): List of columns to treat as continuous features.
    fp_map (Dict[str, int]): Mapping of column n