# Data Loading & Preprocessing Demo

This notebook demonstrates how to use the data loading, preprocessing and grouping utilities:

- `fairness.data` (`load_csv`, `load_heart_csv`, `make_dataset_bundle`)
- `fairness.preprocess` (`add_age_group`, `map_binary_column`, `apply_transforms`, `preprocess_tabular`, `make_train_test_split`)
- `fairness.groups` (`create_intersectional_groups`, `warn_small_groups`)

The goal is to produce aligned objects for fairness analysis:

- `groups[i]` - protected group label for test individual *i*
- `y_pred[i]` - model prediction for test individual *i* 
- `y_test[i]` - true label for test individual *i*

> Note: here, model training is shown only as an example to generate `y_pred`.  
> The core pipeline modules are model-agnostic.


In [None]:
from pathlib import Path

from fairness.data import load_heart_csv, make_dataset_bundle
from fairness.preprocess import add_age_group, map_binary_column, apply_transforms, preprocess_tabular, make_train_test_split
from fairness.groups import create_intersectional_groups, warn_small_groups

import pandas as pd
import numpy as np

## Load the dataset

This demo uses the `heart.csv` file, loaded with `load_heart_csv`.


In [None]:
DATA_PATH = Path("fairness/data/heart.csv")  

df = load_heart_csv(DATA_PATH)
df.head()

## Fairness-oriented preprocessing

Continuous protected attributes (like age) are binned into a small number of categories
to produce interpretable groups and avoid tiny subgroup sample sizes.

A binary protected attribute (e.g. `Sex`) can optionally be mapped from `"M"/"F"` to `1/0`,
depending on how the dataset encodes it.
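
As a rough illustration of what binning and mapping involve, the sketch below uses `pandas.cut` and `Series.map` directly on a toy frame. It is not the implementation of `add_age_group` or `map_binary_column`; the toy values and the right-closed interval convention are assumptions.

```python
import pandas as pd

# Toy data; the column names mirror the demo, the values are made up.
toy = pd.DataFrame({"Age": [29, 55, 56, 71], "Sex": ["M", "F", "F", "M"]})

# pandas.cut uses right-closed intervals by default: (0, 55] and (55, 120].
toy["age_group"] = pd.cut(toy["Age"], bins=[0, 55, 120], labels=["young", "older"])

# Map a string-encoded binary attribute to integers.
toy["Sex"] = toy["Sex"].map({"M": 1, "F": 0})

print(toy)
```

With right-closed bins, an age of exactly 55 falls into the first category; whether the library follows the same convention should be checked against its own documentation.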


In [None]:
# Add a derived protected attribute for fairness analysis (Age -> age_group)
df_fair = add_age_group(df, age_col="Age", new_col="age_group", bins=(0, 55, 120), labels=("young", "older"))

# map binary/categorical encodings if needed (only run if your dataset has M/F)
if "Sex" in df_fair.columns and df_fair["Sex"].dtype == object:
    df_fair = map_binary_column(df_fair, col="Sex", mapping={"M": 1, "F": 0})

df_fair[["Age", "age_group", "Sex"]].head()

### Using `apply_transforms`

`apply_transforms` allows multiple `DataFrame -> DataFrame` operations to be chained together.
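
One way such chaining could work is a left fold over the list of transforms. The helper below, `chain_transforms`, is a hypothetical name used only for this sketch and is not taken from the `fairness` package.

```python
from functools import reduce
from typing import Callable, Iterable

import pandas as pd

Transform = Callable[[pd.DataFrame], pd.DataFrame]


def chain_transforms(df: pd.DataFrame, transforms: Iterable[Transform]) -> pd.DataFrame:
    """Apply each transform in order, feeding each output into the next."""
    # Start from a copy so the caller's frame is left untouched.
    return reduce(lambda acc, fn: fn(acc), transforms, df.copy())


toy = pd.DataFrame({"Age": [40, 60]})
out = chain_transforms(
    toy,
    [
        lambda d: d.assign(age_plus_one=d["Age"] + 1),
        lambda d: d.assign(is_older=d["Age"] > 55),
    ],
)
print(out)
```

Because every step takes and returns a DataFrame, the order of the list is the order of execution, and the input frame keeps its original columns.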


In [None]:
df_fair2 = apply_transforms(
    df,
    transforms=[
        lambda d: add_age_group(d, age_col="Age", new_col="age_group"),
        lambda d: map_binary_column(d, col="Sex", mapping={"M": 1, "F": 0}),
    ],
)

df_fair2[["Age", "age_group"]].head()

## Model-oriented preprocessing

Convert a mixed-type DataFrame into numeric features (one-hot encode categoricals).
Protected attributes (e.g. `age_group`) are kept in the DataFrame for *grouping*,
but excluded from model features during splitting.
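
The idea can be sketched with `pandas.get_dummies` on a toy frame. This is only an illustration under assumed column names and values; how `preprocess_tabular` treats each column may differ.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": [40, 60, 52],
        "ChestPainType": ["ATA", "NAP", "ATA"],  # hypothetical categorical feature
        "age_group": ["young", "older", "young"],  # protected attribute, kept as-is
    }
)

# Encode only the non-protected categorical column.
encoded = pd.get_dummies(toy, columns=["ChestPainType"], dtype=int)

# Model features exclude the protected attribute; `encoded` still carries it for grouping.
features = encoded.drop(columns=["age_group"])
print(features.columns.tolist())
```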


In [None]:
df_model = preprocess_tabular(df_fair)
df_model.head()

## Train/test split 

`make_train_test_split` returns an immutable `SplitData` container:
- `X_train`, `X_test`
- `y_train`, `y_test`

Derived protected attributes (e.g. `age_group`) are dropped from the model features for training.
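
For readers unfamiliar with immutable containers, the sketch below pairs a frozen dataclass with scikit-learn's `train_test_split`. `ToySplit` is a hypothetical stand-in, not the library's `SplitData`, and the toy data is made up.

```python
from dataclasses import dataclass

import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass(frozen=True)
class ToySplit:  # hypothetical stand-in, not the library's SplitData
    X_train: pd.DataFrame
    X_test: pd.DataFrame
    y_train: pd.Series
    y_test: pd.Series


toy = pd.DataFrame(
    {
        "x": range(20),
        "age_group": ["young", "older"] * 10,
        "target": [0, 1] * 10,
    }
)

X = toy.drop(columns=["target", "age_group"])  # drop target and protected column
y = toy["target"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
split_demo = ToySplit(X_tr, X_te, y_tr, y_te)

# The original row index survives the split, which later alignment relies on.
print(split_demo.X_test.index.equals(split_demo.y_test.index))
```

Freezing the dataclass means its fields cannot be reassigned after construction, which guards against accidentally swapping a train set for a test set later in a notebook.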


In [None]:
split = make_train_test_split(
    df_model,
    target_col="HeartDisease",
    drop_cols=("age_group",), 
    test_size=0.3,
    random_state=42,
    stratify=True,
)

split.X_train.shape, split.X_test.shape, split.y_train.shape, split.y_test.shape

## Create intersectional groups for the test set

Group labels are created for the test-set rows by selecting them with the index of `X_test`.
This keeps the following aligned by position:

`groups[i]  |  split.X_test.iloc[i]  |  split.y_test.iloc[i]`
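
A minimal sketch of how intersectional labels can be built while preserving the row index is shown below. The label format (`"Sex=1|age_group=young"`) and the toy data are assumptions for illustration; `create_intersectional_groups` may encode groups differently.

```python
import pandas as pd

# Toy protected attributes with a non-contiguous index, as a test split would have.
protected_toy = pd.DataFrame(
    {"Sex": [1, 0, 1, 0], "age_group": ["young", "older", "older", "older"]},
    index=[7, 2, 9, 4],
)

# One label per row: join the attribute values, e.g. "Sex=1|age_group=young".
labels = protected_toy.apply(
    lambda row: "|".join(f"{col}={row[col]}" for col in protected_toy.columns),
    axis=1,
)

# Group sizes, useful for spotting subgroups that are too small to analyse.
sizes = labels.value_counts()

print(labels.tolist())
print(sizes.to_dict())
```

Since the labels inherit the index of the input frame, they stay in the same row order as the frame they were derived from.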


In [None]:
protected = ["Sex", "age_group"]

protected_test = df_fair.loc[split.X_test.index, protected]

groups, group_map, counts = create_intersectional_groups(protected_test, protected=protected)

counts

In [None]:
msg = warn_small_groups(counts, min_size=20)
msg

## Train a model to generate `y_pred`

(This step is outside of the data loading and processing modules.
It is included to show how `y_pred` can be produced for fairness metrics.)


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = LogisticRegression(max_iter=1000)
model.fit(split.X_train, split.y_train)

y_pred = model.predict(split.X_test)

print("Lengths:", len(groups), len(y_pred), len(split.y_test))
print("\nClassification report:")
print(classification_report(split.y_test, y_pred))

## Check alignment

The following assertions should pass. If they fail, group labels are not aligned with predictions.


In [None]:
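# Same number of rows, and the test features and labels share one index
assert len(groups) == len(y_pred) == len(split.y_test)
assert split.X_test.index.equals(split.y_test.index)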

# show the first few records
preview = pd.DataFrame({
    "group": groups[:10],
    "y_pred": y_pred[:10],
    "y_true": split.y_test.iloc[:10].to_list(),
})
preview

## Next steps

`y_pred` and `groups` can be passed to fairness metric functions.
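
As a hedged illustration of what such a metric function might compute, the sketch below derives per-group selection rate and accuracy with a pandas `groupby`. The toy arrays stand in for `groups`, `y_pred` and `y_test`; none of this is taken from the `fairness` package.

```python
import numpy as np
import pandas as pd

# Toy inputs standing in for `groups`, `y_pred` and `y_test`; values are made up.
groups_toy = np.array(["a", "a", "b", "b", "b", "a"])
y_pred_toy = np.array([1, 0, 1, 1, 1, 1])
y_true_toy = np.array([1, 0, 0, 1, 1, 1])

frame = pd.DataFrame({"group": groups_toy, "y_pred": y_pred_toy, "y_true": y_true_toy})

# Selection rate: share of positive predictions within each group.
selection_rate = frame.groupby("group")["y_pred"].mean()

# Accuracy within each group.
accuracy = (frame["y_pred"] == frame["y_true"]).groupby(frame["group"]).mean()

# Demographic parity difference: gap between the highest and lowest selection rate.
dp_diff = selection_rate.max() - selection_rate.min()
print(selection_rate.to_dict(), accuracy.to_dict(), dp_diff)
```

These computations only make sense when the three arrays are aligned row by row, which is why the alignment check precedes this step.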
