# 1. Dataset Collection

$\mathcal{D} = \{(\vec{x}_n, y_n)\}_{n=1}^N \in \mathbb{D} \subset \mathbb{X} \times \mathbb{Y}$


## 1.1 OpenML

The example dataset is Adult, a common benchmark in the fairness literature.
We obtain it through OpenML, which returns the data as a pandas DataFrame together with some metadata about its content.

In [1]:
import openml

In [2]:
adult_dataset = openml.datasets.get_dataset(179)
adult_dataset

OpenML Dataset
Name.........: adult
Version......: 1
Format.......: ARFF
Upload Date..: 2014-04-23 13:13:24
Licence......: Public
Download URL.: https://api.openml.org/data/v1/download/3608/adult.arff
OpenML URL...: https://www.openml.org/d/179
# of features: None

In [3]:
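# get_data returns (X, y, categorical_indicator, attribute_names);
# y is None here because no target is passed
adult_X, _, z, col_name = adult_dataset.get_data(dataset_format="dataframe")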

In [4]:
adult_X.rename(columns={'class': 'income'}, inplace=True)

# 2. Fairlib Setup

Setting up the library is simple: import it and wrap the dataframe in `fl.DataFrame`, which adds features useful for the fairness workflow.

In [5]:
import fairlib as fl

INFO:fairlib:fairlib loaded


In [6]:
adult = fl.DataFrame(adult_X)

## 2.1 Target Features


$f: \mathbb{X} \rightarrow \mathbb{Y}$

Indicate the **target** feature $Y \subset \mathbb{Y}$ (e.g., "income" in Adult)

Depending on $\mathbb{Y}$, the task is:
*   Classification
    * Binary $\mathbb{Y} = \{0, 1\}$
    * Multi-class $\mathbb{Y} \subset \mathbb{N}$
*   Regression $\mathbb{Y} \subseteq \mathbb{R}$
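
The distinction above can be checked on a concrete target column. The following is only a heuristic sketch in plain pandas; `infer_task` is a hypothetical helper introduced for illustration and is not part of fairlib.

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def infer_task(target: pd.Series) -> str:
    """Guess the learning task from a target column (heuristic only)."""
    # Exactly two distinct values: treat the target as binary.
    if target.nunique(dropna=True) == 2:
        return "binary classification"
    # Real-valued targets with more than two values: treat as regression.
    if is_float_dtype(target):
        return "regression"
    # Everything else (strings, categories, integers): multi-class.
    return "multi-class classification"


print(infer_task(pd.Series([">50K", "<=50K", ">50K"])))
print(infer_task(pd.Series([0.5, 1.2, 3.3])))
```

Integer targets are ambiguous (class labels or counts), so in practice the task should be stated explicitly rather than inferred.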

In [7]:
adult.targets = 'income'

## 2.2 Sensitive Features

Indicate the **sensitive** features $\mathcal{X}_s = X_{[:, S]}$ (e.g., "sex" and "race" in Adult)

Each sensitive feature can be:
*   Categorical
    * Binary
    * Multi-class
*   Numerical

In [8]:
adult.sensitive = {'sex', 'race'}

## 2.3 Privileged Groups

Indicate the **privileged** groups (e.g., "sex = Male" and "race = White" in Adult)

The cases differ according to the type of the sensitive feature under consideration:
* Categorical
    * Binary: only one group is privileged
    * Multi-class: more than one group can be privileged (should we define a distribution?)
* Numerical:
    * only one group is privileged (then, specify an interval), or
    * more than one group is privileged (then, specify a distribution?)

In [9]:
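# Each rule returns True for the privileged group of a sensitive feature
# or, for the target "income", for the favorable outcome
privileged = {
    "sex": lambda x: x == "Male",
    "race": lambda x: x == "White",
    "income": lambda x: x == ">50K",
}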

In [10]:
# Binarize each column in place: 1 where the rule holds, 0 otherwise
for column, rule in privileged.items():
    adult[column] = adult[column].apply(rule).astype(int)
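
The cells above cover the categorical case. For a numerical sensitive feature with a single privileged group, the interval option could be sketched as follows. This uses plain pandas on a toy frame; the column name `age_privileged` and the bounds are hypothetical choices made only for illustration.

```python
import pandas as pd

# Toy data standing in for a numerical sensitive feature.
toy = pd.DataFrame({"age": [19, 25, 40, 59, 60, 72]})

# Hypothetical choice: ages in the half-open interval [25, 60) are privileged.
lower, upper = 25, 60
in_interval = (toy["age"] >= lower) & (toy["age"] < upper)

# Same encoding as the categorical rules: 1 = privileged, 0 = unprivileged.
toy["age_privileged"] = in_interval.astype(int)
print(toy)
```

Keeping the original `age` column next to the binarized one makes it easy to try different intervals later.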

# 3. Preliminary Analysis

## 3.1 Preliminary Assessment

Choose the fairness losses $\mathcal{L}_{\mathit{fair}}$ you are interested in:
* Group
    * statistical parity
    * equalized odds
    * disparate impact
* Individual
    * equal opportunity
* Counter-factual

Then evaluate them on your sensitive features $\mathcal{X}_s$.

In [11]:
adult.disparate_impact()



{(income=1, sex=1): 0.3596552625800337, (income=1, race=1): 0.6005915505110953}

In [12]:
adult.statistical_parity_difference()

{(income=1, sex=1): 0.19451574596420296, (income=1, race=1): 0.10144450514172723}
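
To make the two numbers above easier to interpret, here is a minimal sketch of both metrics in plain pandas on a toy frame. It is not fairlib's implementation. The ratio direction (unprivileged over privileged) and the sign of the difference (privileged minus unprivileged) are assumptions, chosen because they agree with a ratio below 1 and a positive difference in the outputs above.

```python
import pandas as pd


def positive_rate(df, target, sensitive, group):
    """Estimate P(target = 1 | sensitive = group) from the data."""
    return df.loc[df[sensitive] == group, target].mean()


def statistical_parity_difference(df, target, sensitive):
    # Assumed sign convention: privileged (1) minus unprivileged (0).
    return positive_rate(df, target, sensitive, 1) - positive_rate(df, target, sensitive, 0)


def disparate_impact(df, target, sensitive):
    # Assumed direction: unprivileged (0) rate over privileged (1) rate.
    return positive_rate(df, target, sensitive, 0) / positive_rate(df, target, sensitive, 1)


# Toy data: 3 of 4 privileged rows and 1 of 4 unprivileged rows are positive.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0],
    "income": [1, 1, 1, 0, 1, 0, 0, 0],
})

spd = statistical_parity_difference(toy, "income", "sex")  # 0.75 - 0.25
di = disparate_impact(toy, "income", "sex")                # 0.25 / 0.75
print(spd, di)
```

Under these conventions a perfectly balanced dataset gives a difference of 0 and a ratio of 1.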

## 3.2 Proxy Identification

Try different strategies to spot **proxy** features in the dataset:
* Correlation matrix
* Functional dependencies
* ...
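
As an illustration of the correlation-based strategy, the sketch below one-hot encodes the non-sensitive columns of a toy frame and ranks them by absolute Pearson correlation with the sensitive feature. The data, the column names and the 0.5 threshold are hypothetical; this is one possible approach in plain pandas, not a fairlib feature.

```python
import pandas as pd

# Toy data: "relationship" is strongly tied to "sex", "hours_per_week" is not.
toy = pd.DataFrame({
    "sex": [1, 1, 1, 1, 0, 0, 0, 0],
    "relationship": ["Husband", "Husband", "Husband", "Own-child",
                     "Wife", "Wife", "Unmarried", "Own-child"],
    "hours_per_week": [40, 45, 50, 20, 40, 35, 45, 20],
})

# One-hot encode categorical columns so every candidate column is numeric.
encoded = pd.get_dummies(toy.drop(columns="sex"), dtype=float)

# Absolute correlation of each encoded column with the sensitive feature.
corr_with_sex = encoded.corrwith(toy["sex"]).abs().sort_values(ascending=False)

threshold = 0.5  # hypothetical cut-off for flagging a proxy candidate
proxies = corr_with_sex[corr_with_sex > threshold].index.tolist()
print(corr_with_sex)
print(proxies)
```

Pearson correlation only captures linear, pairwise relations, so a column that passes this check can still act as a proxy in combination with others.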


# 4. Evaluation Protocol

We have to agree on a set of supported protocols:
* hold-out (train/test split)
* k-fold cross-validation (k splits)
* stratified k-fold cross-validation (each split has the same target distribution)
* stratified k-fold cross-validation for fairness (each split also has the same distribution with respect to the sensitive features)
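
The last protocol can be approximated with scikit-learn's `StratifiedKFold` by stratifying on a label that combines the target and the sensitive feature. The sketch below assumes this combined-label trick and uses a toy frame; it is one possible realization, not an agreed design.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy data: four (income, sex) combinations, four rows each.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0, 0, 0] * 2,
    "income": [1, 1, 0, 0, 1, 1, 0, 0] * 2,
})

# Combined label: one stratum per (target, sensitive) combination.
strata = toy["income"].astype(str) + "_" + toy["sex"].astype(str)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
folds = list(skf.split(toy, strata))

for train_idx, test_idx in folds:
    # Each test split should contain every stratum in equal proportion.
    print(strata.iloc[test_idx].value_counts().to_dict())
```

With several sensitive features the number of strata grows quickly, and small strata may not have enough rows for the chosen number of splits.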

# 5. ML Pipeline Definition

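The ML pipeline is "sklearn-like" and exposes "fit_predict". A pipeline configuration is scored by its average loss over the $k$ train/validation splits: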

$$\frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(\langle \mathcal{P}, \mathcal{A} \rangle_{\lambda} ( \mathcal{D}_{\mathit{train}}^{(i)}), \mathcal{D}_\mathit{valid}^{(i)} \right) \, .$$
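
The averaged loss above can be sketched in pure Python. The `fit_predict(X_train, y_train, X_valid)` signature, the majority-class placeholder pipeline and the toy data are all assumptions made for illustration; the actual pipeline interface is still to be defined.

```python
from statistics import mean


def cross_validated_loss(fit_predict, loss, X, y, folds):
    """Average the loss over k folds of (train indices, validation indices)."""
    fold_losses = []
    for train_idx, valid_idx in folds:
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_valid = [X[i] for i in valid_idx]
        y_valid = [y[i] for i in valid_idx]
        # Fit on the training split and predict on the validation split.
        y_pred = fit_predict(X_train, y_train, X_valid)
        fold_losses.append(loss(y_valid, y_pred))
    return mean(fold_losses)


def majority_fit_predict(X_train, y_train, X_valid):
    """Placeholder pipeline: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_valid)


def error_rate(y_true, y_pred):
    """Fraction of misclassified validation examples."""
    return mean(int(t != p) for t, p in zip(y_true, y_pred))


X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, 1, 0, 1, 0]
folds = [([0, 1, 2], [3, 4, 5]), ([3, 4, 5], [0, 1, 2])]  # k = 2

cv_loss = cross_validated_loss(majority_fit_predict, error_rate, X, y, folds)
print(cv_loss)
```

Passing the loss as an argument means the same loop can score accuracy and a fairness loss without changes.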


## 5.1 Pre-processing

## 5.2 In-processing

## 5.3 Post-processing

# 6. Optimization

Find the best pipelines in a multi-objective problem, i.e., provide a Pareto front: the pipelines are hyperparameter configurations that achieve the best accuracy/fairness trade-offs.
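
A Pareto front over two objectives can be extracted with a simple dominance check. In the sketch below each pipeline is summarized by a hypothetical `(error, unfairness)` pair, both to be minimized; the pipeline names and scores are made up for illustration.

```python
def dominates(q, p):
    """True if q is no worse than p on every objective and better on at least one."""
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))


def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    return [
        p for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (error, unfairness) scores of four pipeline configurations.
candidates = {
    "A": (0.15, 0.20),
    "B": (0.18, 0.05),
    "C": (0.20, 0.10),  # worse than B on both objectives
    "D": (0.25, 0.02),
}

front = pareto_front(list(candidates.values()))
print(front)
```

This quadratic check is fine for a handful of configurations; a multi-objective optimizer would normally maintain the front incrementally during the search.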