# Fairness statistics for adult income

## Task 1

We have two populations Blue (privileged) and Red (unprivileged), with the Blue population being 9 times larger than the Red population.

Individuals from both populations are requesting to attend XAI training to improve their competency in this important area. The number of places is limited. The administrators of the training have decided to give priority to enrolling individuals who may need this training in the future, although unfortunately it is difficult to predict who will benefit.

The decision rule adopted:
1. In the Red group, half of the people will find the skills useful in future and half will not. The administrators randomly allocate 50% of the people to training.
2. In the Blue group, 80% of people will find the training useful in future and 20% will not, although of course it is not known in advance who will find it useful. The administrators have built a predictive model, based on user behavior, that predicts for whom the training will be useful and for whom it will not. The model has the following performance:


| Blue                     | Will use XAI | Will not use XAI | Total |
|--------------------------|--------------|------------------|-------|
| Enrolled in training     | 60           | 5                | 65    |
| Not enrolled in training | 20           | 15               | 35    |
| Total                    | 80           | 20               | 100   |


Task: Calculate the demographic parity, equal opportunity, and predictive rate parity coefficients for this decision rule.

### Solution

$Y$ - individual will use XAI

$\hat{Y}$ - individual enrolled in training

#### Demographic parity

$$P(\hat{Y}=1|Red) = 0.5$$
$$P(\hat{Y}=1|Blue) = 65/100 = 0.65$$
$$P(\hat{Y}=1|Red) \ne P(\hat{Y}=1|Blue)$$
No demographic parity. Coefficient: $0.65/0.5=1.3$

#### Equal opportunity

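Enrollment in the Red group is random, so it is independent of $Y$ and half of those who will use XAI are enrolled.

$$P(\hat{Y}=1|Red, Y=1) = 0.5$$
$$P(\hat{Y}=1|Blue, Y=1) = 60/80 = 0.75$$
Coefficient: $0.75/0.5=1.5$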

#### Predictive rate parity
$$P(Y=1|Red, \hat{Y}=1) = 1/2$$
$$P(Y=1|Blue, \hat{Y}=1) = 60/65 = 12/13$$
Coefficient: $\frac{12/13}{1/2}=24/13 \approx 1.85$
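
The three coefficients can be double-checked numerically. The sketch below is only an illustration: the Blue counts come from the table above, while the Red counts (25 in each cell per 100 people) are an assumption that follows from random 50% enrollment being independent of future use. The names `groups`, `rates`, and `ratios` are hypothetical helpers, not part of the original solution.

```python
from fractions import Fraction

# Confusion counts per 100 people in each group.
# tp: enrolled and will use XAI, fp: enrolled and will not use XAI,
# fn: not enrolled and will use XAI, tn: not enrolled and will not use XAI.
# Blue is taken from the table; Red is an assumption derived from
# random 50% enrollment in a group where half will use XAI.
groups = {
    "Blue": {"tp": 60, "fp": 5, "fn": 20, "tn": 15},
    "Red": {"tp": 25, "fp": 25, "fn": 25, "tn": 25},
}


def rates(c):
    """Return the three per-group rates as exact fractions."""
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    return {
        "demographic_parity": Fraction(c["tp"] + c["fp"], total),      # P(Y_hat=1)
        "equal_opportunity": Fraction(c["tp"], c["tp"] + c["fn"]),     # P(Y_hat=1 | Y=1)
        "predictive_rate_parity": Fraction(c["tp"], c["tp"] + c["fp"]),  # P(Y=1 | Y_hat=1)
    }


blue, red = rates(groups["Blue"]), rates(groups["Red"])

# Coefficients are reported as Blue / Red, matching the solution above.
ratios = {name: blue[name] / red[name] for name in blue}

for name, value in ratios.items():
    print(f"{name}: {value} ~ {float(value):.2f}")
```

Using `Fraction` keeps the results exact, so the ratios can be compared directly with the hand-computed values.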


![](imgs/solution.png)

## Task 2

The dataset used is the Adult income dataset
([source](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset)).


I have preprocessed the dataset by one-hot encoding the categorical features.


## Appendix

### Install required packages

In [1]:
%%capture
%pip install dalex jinja2 kaleido numpy nbformat pandas plotly scikit-learn

### Imports and loading dataset

In [6]:
import dalex as dx
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

TARGET_COLUMN = "income"
df = pd.read_csv("adult.csv")
df.head()

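|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|-----|-----------|--------|-----------|-----------------|----------------|------------|--------------|------|--------|--------------|--------------|----------------|----------------|--------|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |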


Shuffling the data, extracting the target column, and one-hot encoding the categorical columns.

In [8]:
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

y = df[[TARGET_COLUMN]]

x = df.drop(TARGET_COLUMN, axis=1)

categorical_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "gender", "native-country"]
numerical_cols = list(set(x.columns) - set(categorical_cols))

x = pd.get_dummies(x, columns=categorical_cols, drop_first=True)
n_columns = len(x.columns)

categorical_cols, numerical_cols

(['workclass',
  'education',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'gender',
  'native-country'],
 ['capital-gain',
  'age',
  'fnlwgt',
  'capital-loss',
  'hours-per-week',
  'educational-num'])
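
To make the effect of `drop_first=True` in the one-hot encoding step concrete, here is a minimal sketch on a made-up three-row frame. The frame `toy` and its values are illustrative assumptions, not rows from the dataset. With `drop_first=True`, the first category in sorted order is dropped, so a two-valued column such as `gender` becomes a single indicator column.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame(
    {
        "age": [25, 38, 18],
        "gender": ["Male", "Male", "Female"],
    }
)

# One-hot encode "gender" and drop the first category ("Female"),
# leaving a single indicator column for "Male".
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(list(encoded.columns))
print(encoded["gender_Male"].astype(int).tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, at the cost of making the dropped category the implicit baseline.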

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.75, random_state=0, shuffle=True, stratify=y)

### Random Forest model

In [11]:
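# y_train is a single-column DataFrame; flatten it to a 1-D array,
# which is the shape scikit-learn expects for the target.
model = RandomForestClassifier(random_state=0).fit(x_train, y_train.values.ravel())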

accuracy_score(y_test, model.predict(x_test))

0.8554581934321513
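
Accuracy alone says nothing about how the model treats different groups. As a rough sketch of how the statistics from Task 1 could be computed for a classifier, the hypothetical helper `group_rates` below takes binary labels, binary predictions, and a group label per row. The toy arrays are made up for illustration. Applying it to this notebook would additionally require encoding `income` as 0/1 and recovering the protected attribute from the encoded features (for example a `gender_Male` column, which is an assumption about the column name produced by the encoding step).

```python
import numpy as np
import pandas as pd


def group_rates(y_true, y_pred, group):
    """Per-group selection rate, true positive rate and precision.

    Assumes y_true and y_pred are binary (0/1) and group holds one
    label per row. Hypothetical helper, not part of the notebook.
    """
    frame = pd.DataFrame(
        {
            "y": np.asarray(y_true),
            "pred": np.asarray(y_pred),
            "group": np.asarray(group),
        }
    )
    rows = {}
    for name, part in frame.groupby("group"):
        rows[name] = {
            # P(Y_hat=1 | group): demographic parity compares these
            "selection_rate": part["pred"].mean(),
            # P(Y_hat=1 | Y=1, group): equal opportunity compares these
            "tpr": part.loc[part["y"] == 1, "pred"].mean(),
            # P(Y=1 | Y_hat=1, group): predictive rate parity compares these
            "ppv": part.loc[part["pred"] == 1, "y"].mean(),
        }
    return pd.DataFrame(rows).T


# Made-up labels and predictions, four rows per group.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
group = ["Male"] * 4 + ["Female"] * 4

table = group_rates(y_true, y_pred, group)
print(table)
```

Dividing one group's row by the other's gives ratio coefficients of the same kind as in Task 1.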