## How to run `dnf_bias.py`

### Example run:

```bash
# Each line ends with a backslash so the shell treats this as one command.
python src/detect/dnf_bias/dnf_bias.py \
        dataset_path=src/detect/dnf_bias/data/ACSIncome_CA.csv \
        result_folder=results_dir \
        target=PINCP \
        protected=SEX,RAC1P,AGEP,POBP,_POBP,DIS,CIT,MIL,ANC,NATIVITY,DEAR,DEYE,DREM,FER,POVPIP \
        continuous=AGEP,PINCP,WKHP,JWMNP,POVPIP \
        feature_processing=POBP:100,OCCP:100,PUMA:100,POWPUMA:1000 \
        model=MMD \
        seed=0 \
        n_samples=1000000 \
        train_samples=100000 \
        time_limit=600 \
        n_min=10
```
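When runs are launched from another script, the `key=value` tokens can be assembled programmatically instead of typed by hand. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of `dnf_bias.py`, and it assumes the script accepts its options exactly as written in the example run above.

```python
import sys


def build_command(script, **options):
    """Turn keyword options into key=value tokens (hypothetical helper)."""
    cmd = [sys.executable, script]
    for key, value in options.items():
        # Lists become comma-separated values, e.g. protected=SEX,RAC1P
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_command(
    "src/detect/dnf_bias/dnf_bias.py",
    dataset_path="src/detect/dnf_bias/data/ACSIncome_CA.csv",
    result_folder="results_dir",
    target="PINCP",
    protected=["SEX", "RAC1P", "AGEP"],
    model="MMD",
    seed=0,
    time_limit=600,
)
print(" ".join(cmd[1:]))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would then start the run without relying on shell line continuations.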

### Command-line Flags

- **`dataset_path`**  
  Path to the input CSV file. Must include a header row and all feature columns, including the one named by `target`.

- **`result_folder`**  
  Directory where `output.txt` will be written. It will be created if it doesn’t exist.

- **`target`**  
  Name of the column in your CSV to treat as the binary target variable.

- **`protected`**  
  Comma-separated list of column names to treat as *protected* attributes (e.g. `SEX,RAC1P,AGEP`).  
  Subgroups are only generated over these features.

- **`continuous`**  
  Comma-separated list of column names to treat as *continuous*. Those columns will use a numeric range (`min`,`max`) rather than per-value categories.

- **`feature_processing`**  
  Comma-separated `col:divisor` pairs, e.g. `POBP:100,OCCP:100`.  
  Each matching column is integer-divided by `divisor` before binarization, which reduces its cardinality.

- **`model`**  
  Which distance metric to use for scoring subgroups. One of:
  - `MSD` – Mean Subgroup Difference  
  - `W1` – 1-Wasserstein (Earth Mover’s Distance)  
  - `W2` – 2-Wasserstein  
  - `TV` – Total Variation  
  - `MMD` – Maximum Mean Discrepancy

- **`seed`**  
  Integer seed for any random sampling (subsampling rows, tie-breaking, etc.) to ensure reproducibility.

- **`n_samples`**  
  Maximum number of rows to sample from your dataset (default `1000000`). If the CSV has more rows, it is randomly subsampled down to this size.

- **`train_samples`**  
  Reserved. Intended to limit the number of training samples; the value is parsed but currently not used.

- **`time_limit`**  
  Time budget (in seconds) for the enumerative search. The loop will terminate early once this limit is reached.

- **`n_min`**  
  Minimum required subgroup support (number of rows). Any subgroup with fewer than `n_min` rows is automatically pruned.
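To make the effect of `feature_processing`, `n_samples`/`seed`, and `n_min` concrete, here is a minimal sketch in plain Python. It is not the code in `dnf_bias.py`: the function names (`parse_feature_processing`, `preprocess`, `has_enough_support`) and the row-as-dict representation are assumptions made for illustration, and the actual script may order or implement these steps differently.

```python
import random


def parse_feature_processing(spec):
    """Parse 'POBP:100,OCCP:100' into {'POBP': 100, 'OCCP': 100}."""
    pairs = (item.split(":") for item in spec.split(",") if item)
    return {col: int(divisor) for col, divisor in pairs}


def preprocess(rows, feature_processing, n_samples, seed):
    """Subsample rows reproducibly, then integer-divide the listed columns."""
    divisors = parse_feature_processing(feature_processing)
    if len(rows) > n_samples:
        # A seeded generator gives the same subsample on every run.
        rows = random.Random(seed).sample(rows, n_samples)
    return [
        {col: (val // divisors[col] if col in divisors else val)
         for col, val in row.items()}
        for row in rows
    ]


def has_enough_support(rows, literals, n_min):
    """True if at least n_min rows match every (column, value) literal."""
    support = sum(all(row[c] == v for c, v in literals) for row in rows)
    return support >= n_min


rows = [
    {"SEX": 1, "POBP": 6},
    {"SEX": 2, "POBP": 36},
    {"SEX": 1, "POBP": 303},
    {"SEX": 2, "POBP": 315},
]
processed = preprocess(rows, "POBP:100", n_samples=1_000_000, seed=0)
print([r["POBP"] for r in processed])  # [0, 0, 3, 3]
print(has_enough_support(processed, [("POBP", 3)], n_min=2))  # True
print(has_enough_support(processed, [("SEX", 1), ("POBP", 3)], n_min=2))  # False
```

Dividing `POBP` by 100 collapses four distinct codes into two buckets, which is the cardinality reduction the flag is meant to achieve; the last call shows a subgroup that would be pruned because only one row supports it.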

---

### Output

After running, you’ll find:

- **`<result_folder>/output.txt`**  
  A summary of the search: max distance, best subgroup’s literals, total/skipped/checked counts, and elapsed time.
