# Data Loading & Preprocessing Demo

This notebook demonstrates how to use the data loading, preprocessing and grouping utilities:

- `fairness.data` (`load_csv`, `load_heart_csv`, `make_dataset_bundle`)
- `fairness.preprocess` (`add_age_group`, `map_binary_column`, `apply_transforms`, `preprocess_tabular`, `make_train_test_split`)
- `fairness.groups` (`create_intersectional_groups`, `warn_small_groups`)

The goal is to produce aligned objects for fairness analysis:

- `groups[i]` - protected group label for test individual *i*
- `y_pred[i]` - model prediction for test individual *i* 
- `y_test[i]` - true label for test individual *i*

> Note: here, model training is shown only as an example to generate `y_pred`.  
> The core pipeline modules are model-agnostic.
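
Because the three objects share positional indices, a per-group metric reduces to iterating over them in parallel. The following is a minimal sketch with made-up toy values; the helper name `accuracy_by_group` is hypothetical and not part of the `fairness` package.

```python
from collections import defaultdict

# Toy aligned inputs (illustrative values, not taken from the heart dataset).
groups = ["Sex=1|age_group=young", "Sex=0|age_group=older",
          "Sex=1|age_group=young", "Sex=0|age_group=older"]
y_pred = [1, 0, 0, 1]
y_test = [1, 0, 1, 1]


def accuracy_by_group(groups, y_pred, y_test):
    """Return {group label: fraction of correct predictions in that group}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    # Position i in each sequence refers to the same test individual.
    for g, pred, true in zip(groups, y_pred, y_test):
        total[g] += 1
        correct[g] += int(pred == true)
    return {g: correct[g] / total[g] for g in total}


per_group = accuracy_by_group(groups, y_pred, y_test)
print(per_group)
```

If the sequences were misaligned, the loop would still run but would silently pair the wrong individuals, which is why alignment is checked explicitly later in this notebook.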


In [1]:
import sys
from pathlib import Path

from fairness.data import load_heart_csv, make_dataset_bundle
from fairness.preprocess import add_age_group, map_binary_column, apply_transforms, preprocess_tabular, make_train_test_split
from fairness.groups import create_intersectional_groups, warn_small_groups

import pandas as pd
import numpy as np

## Load the dataset

This demo uses the `heart.csv` file, loaded with `load_heart_csv`.


In [2]:
DATA_PATH = Path("fairness/data/heart.csv")  

df = load_heart_csv(DATA_PATH)
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


## Fairness-oriented preprocessing

Continuous protected attributes (like age) are binned into a small number of categories
to produce interpretable groups and avoid tiny subgroup sample sizes.

There is an optional mapping for a binary protected attribute (e.g. `Sex` from `"M"/"F"` to `1/0`),
depending on how the dataset encodes it.


In [3]:
# Add a protected attribute for fairness analysis 
df_fair = add_age_group(df, age_col="Age", new_col="age_group", bins=(0, 55, 120), labels=("young", "older"))

# map binary/categorical encodings if needed (if dataset has M/F)
if "Sex" in df_fair.columns and df_fair["Sex"].dtype == object:
    df_fair = map_binary_column(df_fair, col="Sex", mapping={"M": 1, "F": 0})

df_fair[["Age", "age_group", "Sex"]].head()

Unnamed: 0,Age,age_group,Sex
0,40,young,1
1,49,young,0
2,37,young,1
3,48,young,0
4,54,young,1
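
One common way to bin a continuous attribute is `pandas.cut`. The sketch below is an assumption about how such a helper might work, not the actual implementation of `add_age_group`. In particular, the output above does not show which group the boundary age 55 falls into; here it follows the default right-closed intervals of `pd.cut`.

```python
import pandas as pd

# Toy ages chosen to include the bin boundary (55).
ages = pd.Series([40, 49, 55, 56, 77], name="Age")

# Default right-closed intervals: (0, 55] -> "young", (55, 120] -> "older".
age_group = pd.cut(ages, bins=[0, 55, 120], labels=["young", "older"])

print(age_group.tolist())
```

Keeping the number of bins small is what keeps the later intersectional groups large enough to analyse.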


### Using `apply_transforms`

`apply_transforms` allows multiple `DataFrame -> DataFrame` operations to be chained together.


In [4]:
df_fair2 = apply_transforms(
    df,
    transforms=[
        lambda d: add_age_group(d, age_col="Age", new_col="age_group"),
        lambda d: map_binary_column(d, col="Sex", mapping={"M": 1, "F": 0}),
    ],
)

df_fair2[["Age", "age_group","Sex"]].head()

Unnamed: 0,Age,age_group,Sex
0,40,young,1
1,49,young,0
2,37,young,1
3,48,young,0
4,54,young,1
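
Chaining of this kind can be expressed as a left fold over the list of callables. The sketch below uses a hypothetical helper, `chain_transforms`, and plain pandas operations in place of the package functions; it illustrates the pattern and is not the source of `apply_transforms`.

```python
from functools import reduce

import pandas as pd


def chain_transforms(df, transforms):
    """Apply each DataFrame -> DataFrame callable in order (hypothetical helper)."""
    return reduce(lambda current, fn: fn(current), transforms, df)


toy = pd.DataFrame({"Age": [40, 60], "Sex": ["M", "F"]})

result = chain_transforms(
    toy,
    transforms=[
        # assign() returns a new DataFrame, so the input is left untouched.
        lambda d: d.assign(
            age_group=pd.cut(d["Age"], bins=[0, 55, 120], labels=["young", "older"])
        ),
        lambda d: d.assign(Sex=d["Sex"].map({"M": 1, "F": 0})),
    ],
)

print(result)
```

Each step receives the output of the previous one, so the order of the list matters whenever a later transform depends on a column created earlier.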


## Model-oriented preprocessing

In the raw dataset, variables are represented using a mixture of numeric and categorical encodings, reflecting how the data were originally defined and collected.

Binary clinical indicators such as `FastingBS (0, 1)` are passed through unchanged.

Variables that represent categorical concepts with two or more possible values, such as `ChestPainType (TA, ATA, NAP, ASY)`, are converted into numeric features using one-hot encoding. This creates binary indicator columns that take the value `True` if the category applies to the individual and `False` otherwise.

Machine learning models can then use these features directly.

Some protected characteristics, such as sex, may be clinically relevant predictors
and are therefore retained in the model inputs. Derived protected attributes used
only for fairness analysis (e.g. `age_group`) are excluded. 

In [5]:
df_model = preprocess_tabular(df_fair, drop_cols=("age_group",))
df_model.head()

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,ChestPainType_ATA,ChestPainType_NAP,ChestPainType_TA,RestingECG_Normal,RestingECG_ST,ExerciseAngina_Y,ST_Slope_Flat,ST_Slope_Up
0,40,1,140,289,0,172,0.0,0,True,False,False,True,False,False,False,True
1,49,0,160,180,0,156,1.0,1,False,True,False,True,False,False,True,False
2,37,1,130,283,0,98,0.0,0,True,False,False,False,True,False,False,True
3,48,0,138,214,0,108,1.5,1,False,False,False,True,False,True,True,False
4,54,1,150,195,0,122,0.0,0,False,True,False,True,False,False,False,True
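
The column pattern above (no `ChestPainType_ASY` column) is what `pandas.get_dummies` produces with `drop_first=True`. The sketch below shows this on a toy DataFrame; whether `preprocess_tabular` actually calls `get_dummies` this way is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "FastingBS": [0, 1, 0, 1],                   # already binary, left as-is
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
})

# drop_first=True removes the alphabetically first category (ASY here),
# which is then represented by all remaining indicators being false.
encoded = pd.get_dummies(toy, columns=["ChestPainType"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The indicator dtype depends on the pandas version: recent releases return booleans, as in the output above, while older ones return small integers.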


## Train/test split 

`make_train_test_split` returns a `SplitData` container:
- `X_train`, `X_test`
- `y_train`, `y_test`

Derived protected attributes (e.g. `age_group`) were already dropped by `preprocess_tabular` in the previous step, so they are not among the model features used for training.


In [6]:
split = make_train_test_split(
    df_model,
    target_col="HeartDisease",
    test_size=0.3,
    random_state=42,
    stratify=True,
)

split.X_train.shape, split.X_test.shape, split.y_train.shape, split.y_test.shape

((642, 15), (276, 15), (642,), (276,))
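
The effect of a stratified split, and the fact that the test rows keep their original index labels, can be illustrated with plain pandas. This is a swapped-in technique for illustration (per-class sampling with `groupby(...).sample`), not the implementation of `make_train_test_split`, and the toy DataFrame is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "feature": range(20),
    "HeartDisease": [0] * 10 + [1] * 10,
})

# Sample 30% of the rows within each class so the class balance is preserved.
test = toy.groupby("HeartDisease").sample(frac=0.3, random_state=42)
train = toy.drop(index=test.index)

print(len(train), len(test))
print(test["HeartDisease"].value_counts().to_dict())

# The test rows keep their original index labels, so they can be used
# to look up the matching rows in the source DataFrame.
print(toy.loc[test.index].equals(test))
```

Index preservation is what makes it possible to recover protected attributes for the test rows from a different DataFrame later on.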

## Create intersectional groups for the test set

Group labels are created for the test set rows, using the same indices as `X_test`.


In [7]:
protected = ["Sex", "age_group"]

protected_test = df_fair.loc[split.X_test.index, protected]

groups, group_map, counts = create_intersectional_groups(protected_test, protected=protected)

counts

group
Sex=1|age_group=young    129
Sex=1|age_group=older     91
Sex=0|age_group=young     40
Sex=0|age_group=older     16
Name: count, dtype: int64
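
Labels in the `Sex=1|age_group=young` format can be built by joining one `column=value` fragment per protected attribute. The sketch below is an assumption about how such labels could be constructed, using a toy DataFrame with a non-contiguous index; it is not the source of `create_intersectional_groups`.

```python
import pandas as pd

protected_toy = pd.DataFrame(
    {"Sex": [1, 1, 0, 1], "age_group": ["young", "older", "young", "young"]},
    index=[7, 2, 9, 4],  # non-contiguous, as after a train/test split
)
cols = ["Sex", "age_group"]

# One "column=value" fragment per attribute, joined with "|" for each row.
labels = pd.Series(
    [
        "|".join(f"{c}={v}" for c, v in zip(cols, values))
        for values in protected_toy[cols].itertuples(index=False, name=None)
    ],
    index=protected_toy.index,  # keep the original row labels
)
counts = labels.value_counts()

print(labels.tolist())
print(counts.to_dict())
```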

In [8]:
msg = warn_small_groups(counts, min_size=20)
msg

'Small intersectional groups detected (<20): Sex=0|age_group=older (n=16)'
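
A check of this kind amounts to filtering the counts by a threshold. The sketch below applies a hypothetical helper, `find_small_groups`, to the counts printed above; it is an illustration and not the source of `warn_small_groups`.

```python
def find_small_groups(counts, min_size=20):
    """Return {label: n} for groups with fewer than min_size samples."""
    return {label: n for label, n in counts.items() if n < min_size}


# Group sizes copied from the output above.
counts = {
    "Sex=1|age_group=young": 129,
    "Sex=1|age_group=older": 91,
    "Sex=0|age_group=young": 40,
    "Sex=0|age_group=older": 16,
}

small = find_small_groups(counts, min_size=20)
print(small)
```

With only 16 samples, a single misclassification moves that group's accuracy by 1/16, roughly 6 percentage points, so metrics for such groups should be read with caution.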

## Train a model to generate `y_pred`

This step is outside of the data loading and processing modules.
It is included here to show how `y_pred` can be 
produced for fairness metrics.


In [9]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000))
])

model.fit(split.X_train, split.y_train)
y_pred = model.predict(split.X_test)


print("Lengths:", len(groups), len(y_pred), len(split.y_test))
print("\nClassification report:")
print(classification_report(split.y_test, y_pred))

Lengths: 276 276 276

Classification report:
              precision    recall  f1-score   support

           0       0.90      0.84      0.87       123
           1       0.88      0.92      0.90       153

    accuracy                           0.88       276
   macro avg       0.89      0.88      0.88       276
weighted avg       0.88      0.88      0.88       276



In this example, the classifier reaches 0.88 accuracy on the 276 test rows, which is a reasonable baseline for fairness analysis.

## Check alignment

The following assertion should pass. If it fails, the group labels, predictions and true labels do not have the same length and cannot be aligned.


In [10]:
assert len(groups) == len(y_pred) == len(split.y_test)

# show the first few records
preview = pd.DataFrame({
    "group": groups[:10],
    "y_pred": y_pred[:10],
    "y_true": split.y_test.iloc[:10].to_list(),
})
preview

Unnamed: 0,group,y_pred,y_true
0,Sex=1|age_group=young,1,1
1,Sex=1|age_group=older,1,1
2,Sex=1|age_group=older,1,1
3,Sex=1|age_group=young,0,0
4,Sex=0|age_group=older,0,0
5,Sex=1|age_group=older,1,1
6,Sex=1|age_group=older,1,1
7,Sex=0|age_group=older,1,1
8,Sex=0|age_group=older,0,0
9,Sex=1|age_group=young,0,0
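
Equal lengths do not guarantee that rows are in the same order. When the inputs are pandas objects, comparing their indices gives a stricter check. The sketch below uses toy data and a hypothetical helper, `same_rows_same_order`.

```python
import pandas as pd

y_test_toy = pd.Series([1, 0, 1], index=[7, 2, 9], name="HeartDisease")
protected_ok = pd.DataFrame({"Sex": [1, 0, 1]}, index=[7, 2, 9])
# Same rows, different order: a length check alone would not catch this.
protected_shuffled = protected_ok.loc[[2, 9, 7]]


def same_rows_same_order(a, b):
    """True if both pandas objects have identical index labels in identical order."""
    return a.index.equals(b.index)


print(same_rows_same_order(protected_ok, y_test_toy))
print(same_rows_same_order(protected_shuffled, y_test_toy))
```

This check applies to the inputs before prediction; `y_pred` itself is a NumPy array without an index and inherits the row order of `X_test`.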


## Next steps

`y_pred` and `groups` can be passed to fairness metric functions.

- `groups`: `list[str]` of length `n_samples`
- Example element: `Sex=1|age_group=older` (one intersectional label per sample)

Some fairness functions may instead expect `group_dict: dict[str, list]`, where each key is a protected attribute and each value is a list of that attribute's value for every sample, e.g.



In [11]:
{
  "Sex":      [1, 0, 1, ...],
  "age_group": ["older", "young", "older", ...]
}

{'Sex': [1, 0, 1, Ellipsis],
 'age_group': ['older', 'young', 'older', Ellipsis]}

In that case, `group_dict` can be created from `protected_test`, which is a subset of the original DataFrame that contains only the rows used for testing and only the protected attributes of interest, such as sex and age group.

In [12]:
group_dict = {
    "Sex": protected_test["Sex"].tolist(),
    "age_group": protected_test["age_group"].tolist(),
}

{k: v[:10] for k, v in group_dict.items()}


{'Sex': [1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
 'age_group': ['young',
  'older',
  'older',
  'young',
  'older',
  'older',
  'older',
  'older',
  'older',
  'young']}
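
The two representations carry the same information, so one can be derived from the other. The sketch below rebuilds intersectional labels from a toy `group_dict`; the helper name `labels_from_group_dict` is hypothetical.

```python
def labels_from_group_dict(group_dict):
    """Join each sample's attribute values into one 'key=value|key=value' label."""
    keys = list(group_dict)  # dicts preserve insertion order
    columns = [group_dict[k] for k in keys]
    return [
        "|".join(f"{k}={v}" for k, v in zip(keys, row))
        for row in zip(*columns)  # one tuple of values per sample
    ]


toy_group_dict = {
    "Sex": [1, 1, 0],
    "age_group": ["young", "older", "older"],
}

toy_labels = labels_from_group_dict(toy_group_dict)
print(toy_labels)
```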

## Quick pipeline for internal / development use 

For convenience, an end-to-end pipeline function, `run_demo_pipeline`, is included at `utils/pipeline.py`. Calling it as shown below returns:
- a fitted model
- `y_test` (true values)
- `y_pred` (predicted values)
- `group_dict`

It is designed for our convenience while developing the package and is not intended to be part of the final package.

In [19]:
from fairness.preprocess import add_age_group, map_binary_column
from fairness.utils.pipeline import run_demo_pipeline

result = run_demo_pipeline(
    csv_path="fairness/data/heart.csv",
    target_col="HeartDisease",
    protected_cols=["Sex", "age_group"],
    fairness_transforms=[
        lambda d: add_age_group(d, age_col="Age", new_col="age_group", bins=(0, 55, 120), labels=("young", "older")),
        lambda d: map_binary_column(d, col="Sex", mapping={"M": 1, "F": 0}),
    ],
    drop_from_X=("age_group",),  
)

y_pred = result.y_pred
group_dict = result.group_dict   
y_test = result.split.y_test

print("Predictions, y_pred:", y_pred[:10])
print("protected characteristics, group_dict:", {k: v[:10] for k, v in group_dict.items()})
print("True outcomes, y_test:",y_test.iloc[:10].to_list())

Predictions, y_pred: [1 1 1 0 0 1 1 1 0 0]
protected characteristics, group_dict: {'Sex': [1, 1, 1, 1, 0, 1, 1, 0, 0, 1], 'age_group': ['young', 'older', 'older', 'young', 'older', 'older', 'older', 'older', 'older', 'young']}
True outcomes, y_test: [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
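
To show how these outputs fit together, the sketch below computes the share of positive predictions per `Sex` value from the ten rows printed above. The helper name `positive_rate_by_value` is hypothetical, and ten rows are far too few for a meaningful fairness estimate.

```python
from collections import defaultdict

# First ten values, copied from the output above.
y_pred_head = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
sex_head = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]


def positive_rate_by_value(values, y_pred):
    """Return {attribute value: fraction of samples predicted positive}."""
    positives = defaultdict(int)
    totals = defaultdict(int)
    for value, pred in zip(values, y_pred):
        totals[value] += 1
        positives[value] += int(pred == 1)
    return {value: positives[value] / totals[value] for value in totals}


rates = positive_rate_by_value(sex_head, y_pred_head)
print(rates)
```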
