# **Measuring Bias in Regression**


This notebook is a tutorial on auditing bias within a regression task. We will use the holisticai library throughout, introducing some of the functions we have created to help study algorithmic bias.

The sections are organised as follows:
1. Load the data: we load the student grades dataset as a pandas DataFrame
2. Data Exploration: some preliminary analysis of the data
3. Train a Model: we train a simple linear regression model (sklearn)
4. Measure Bias: we compute a few bias metrics and comment on their meaning

## **Load the data**

First, we import the packages required for our bias analysis. You will need the `holisticai` package installed on your system; you can install it by running:
```bash
!pip install holisticai[all]
```

In [1]:
# Imports
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

We host a few example datasets on the holisticai library for quick loading and experimentation. Here we load and use the Student dataset. The goal with this dataset is to predict the numerical attribute 'G3' (the student's mathematics grade in the 3rd trimester). The dataset contains a number of sensitive attributes, including sex, address, Mjob (mother's job) and Fjob (father's job).

Although the `load_dataset` function returns a preprocessed version, here we will perform this processing ourselves, since the default preprocessing is better suited to multiclass classification tasks.

In [15]:

# Only the student loader is needed in this notebook
from holisticai.datasets._dataloaders import load_student

bunch = load_student()
df = bunch["frame"]
df

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,'GP','F',18,'U','GT3','A',4,4,'at_home','teacher',...,4,3,4,1,1,3,6,5,6,6
1,'GP','F',17,'U','GT3','T',1,1,'at_home','other',...,5,3,3,1,1,3,4,5,5,6
2,'GP','F',15,'U','LE3','T',1,1,'at_home','other',...,4,3,2,2,3,3,10,7,8,10
3,'GP','F',15,'U','GT3','T',4,2,'health','services',...,3,2,2,1,1,5,2,15,14,15
4,'GP','F',16,'U','GT3','T',3,3,'other','other',...,4,3,2,1,2,5,4,6,10,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,'MS','M',20,'U','LE3','A',2,2,'services','services',...,5,5,4,4,5,4,11,9,9,9
391,'MS','M',17,'U','LE3','T',3,1,'services','services',...,2,4,5,3,4,2,3,14,16,16
392,'MS','M',21,'R','GT3','T',1,1,'other','other',...,5,5,3,3,3,3,3,10,8,7
393,'MS','M',18,'R','LE3','T',3,2,'services','other',...,4,4,1,3,4,5,0,11,12,10
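
The manual preprocessing mentioned above is not spelled out in this notebook. As a rough sketch only, one way to prepare such a frame for regression is to separate the target, set aside the protected attributes, drop G1 and G2, and one-hot encode the categorical columns. The toy frame below is made up for illustration (it only borrows a few column names from the dataset), and keeping `sex` and `address` among the features is an assumption, not something the library prescribes.

```python
import pandas as pd

# Toy rows for illustration only; not the real student data.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "address": ["U", "R", "U", "U"],
    "Mjob": ["at_home", "health", "other", "teacher"],
    "age": [18, 17, 15, 16],
    "G1": [5, 14, 7, 15],
    "G2": [6, 16, 8, 14],
    "G3": [6, 16, 10, 15],
})

# Target: the final grade.
toy_y = toy["G3"]

# Protected attributes are kept aside so groups can be built later.
toy_p_attr = toy[["sex", "address"]]

# Features: drop all grade columns, then one-hot encode the string columns.
toy_X = pd.get_dummies(toy.drop(columns=["G1", "G2", "G3"]), dtype=float)

print(toy_X.columns.tolist())
```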


In [13]:
from holisticai.datasets import load_dataset
dataset = load_dataset('student', protected_attribute='sex')
dataset

In [14]:
dataset['group_a']

0      'F'
1      'F'
2      'F'
3      'F'
4      'F'
      ... 
390    'M'
391    'M'
392    'M'
393    'M'
394    'M'
Name: group_a, Length: 395, dtype: object

## **Data Exploration**

We import some of the holisticai plotters for quick exploration of the data.

In [None]:
from holisticai.bias.plots import group_pie_plot
from holisticai.bias.plots import distribution_plot
from holisticai.bias.plots import success_rate_curves

In [None]:
group_pie_plot(dataset['group_a'])

The data is balanced in terms of sex.

In [None]:
# distribution of grades for male and female students
distribution_plot(dataset['y'], dataset['group_a'])

The mother's job attribute (Mjob) shows the largest differences in the grade densities. For instance, students whose mother works in health have a higher density at higher grades.

In [None]:
# protected attribute (father's job) and target (final grade) as arrays
p_attr = np.array(dataset['p_attr']['Fjob'])
y = np.array(dataset['y']['G3'])
success_rate_curves(p_attr, y, groups=['at_home', 'health', 'teacher'])

The plot above shows the success rate (success means exceeding the given threshold) as a function of the threshold for different subgroups of the population, here father's job (Fjob) in ['at_home', 'health', 'teacher']. We can observe that students whose father is a teacher are more likely to exceed high thresholds than the other groups.
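
To make the idea of a success rate curve concrete, here is a minimal NumPy sketch. It is not the implementation behind `success_rate_curves`; the helper name `success_rate_curve` and the toy grades are hypothetical, and "success" is taken to mean strictly exceeding the threshold.

```python
import numpy as np

def success_rate_curve(y, mask, thresholds):
    """Fraction of the selected group whose value exceeds each threshold."""
    y_group = np.asarray(y, dtype=float)[np.asarray(mask, dtype=bool)]
    return np.array([(y_group > t).mean() for t in thresholds])

# Toy grades and job labels, made up for illustration only.
toy_grades = np.array([6, 10, 15, 12, 18, 9, 14, 16])
toy_job = np.array(["teacher", "at_home", "teacher", "health",
                    "teacher", "at_home", "health", "health"])
thresholds = np.array([5, 10, 15])

teacher_curve = success_rate_curve(toy_grades, toy_job == "teacher", thresholds)
at_home_curve = success_rate_curve(toy_grades, toy_job == "at_home", thresholds)

print("teacher:", teacher_curve)
print("at_home:", at_home_curve)
```

Each curve is non-increasing in the threshold, so a group whose curve sits above another's at high thresholds is over-represented among the top grades.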

## **Preprocess Data and Train a model**

We use a sklearn linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Split the dataset into train and test sets
datasets = dataset.train_test_split(test_size=0.3, random_state=42)
datasets

In [None]:
# G3 is the student's final grade (G1 and G2 are dropped as well because they are highly correlated with G3)
X_train = datasets['train']['x']
X_test = datasets['test']['x']
y_train = datasets['train']['y']['G3']
y_test = datasets['test']['y']['G3']

p_attr_train = datasets['train']['p_attr']
p_attr_test = datasets['test']['p_attr']

# Train a simple linear regression model
model = LinearRegression()
model = model.fit(X_train, y_train)

# Predict values
y_pred = model.predict(X_test)

In [None]:
#from holisticai.metrics.efficacy import regression_efficacy_metrics
#regression_efficacy_metrics(y_pred, y_test)
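
Before looking at bias, it can help to check how accurate the model is overall. The sketch below computes two common error measures, MAE and RMSE, with plain NumPy. The arrays are toy values chosen for illustration; in the notebook you would pass `y_test` and `y_pred` instead.

```python
import numpy as np

# Toy true and predicted grades, for illustration only.
toy_true = np.array([10.0, 12.0, 8.0, 15.0])
toy_pred = np.array([11.0, 11.0, 9.0, 13.0])

residuals = toy_pred - toy_true

# Mean absolute error: average size of the errors.
mae = np.abs(residuals).mean()

# Root mean squared error: penalises large errors more heavily.
rmse = np.sqrt((residuals ** 2).mean())

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}")
```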

## **Measure bias**

In [None]:
# import some bias metrics
from holisticai.bias.metrics import statistical_parity_regression
from holisticai.bias.metrics import disparate_impact_regression
from holisticai.bias.metrics import mae_ratio
from holisticai.bias.metrics import rmse_ratio

In [None]:
# set up vectors for gender

group_a = np.array(p_attr_test['sex']=='M')
group_b = np.array(p_attr_test['sex']=='F')
y_pred  = np.array(model.predict(X_test))
y_true  = np.array(y_test)

In [None]:
# evaluate fairness metrics for gender
print ('Statistical Parity Q80   : ' + str(statistical_parity_regression(group_a, group_b, y_pred, q=0.8)))
print ('Disparate Impact Q80     : ' + str(disparate_impact_regression(group_a, group_b, y_pred, q=0.8)))
print ('MAE Ratio                : ' + str(mae_ratio(group_a, group_b, y_pred, y_true)))
print ('RMSE Ratio               : ' + str(rmse_ratio(group_a, group_b, y_pred, y_true)))
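
The two error-based ratios printed above compare how well the model fits each group. The following is a minimal sketch of that idea, not the library's code: the helper `group_error_ratio` is hypothetical, and the direction of the ratio (group_a over group_b) is an assumption that may differ from the convention used by `mae_ratio` and `rmse_ratio`.

```python
import numpy as np

def group_error_ratio(group_a, group_b, y_pred, y_true, kind="mae"):
    """Ratio of group_a's error to group_b's error (assumed direction)."""
    group_a = np.asarray(group_a, dtype=bool)
    group_b = np.asarray(group_b, dtype=bool)
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)

    def error(mask):
        if kind == "mae":
            return np.abs(err[mask]).mean()
        if kind == "rmse":
            return np.sqrt((err[mask] ** 2).mean())
        raise ValueError("kind must be 'mae' or 'rmse'")

    return error(group_a) / error(group_b)

# Toy data, for illustration only.
toy_true = np.array([10, 12, 8, 15, 9, 14])
toy_pred = np.array([11, 11, 9, 13, 9, 16])
toy_a = np.array([True, True, True, False, False, False])
toy_b = ~toy_a

toy_mae_ratio = group_error_ratio(toy_a, toy_b, toy_pred, toy_true, kind="mae")
toy_rmse_ratio = group_error_ratio(toy_a, toy_b, toy_pred, toy_true, kind="rmse")
print(toy_mae_ratio, toy_rmse_ratio)
```

A value close to 1 means the model is about equally accurate for both groups; values far from 1 indicate that one group receives noticeably worse predictions.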

All of the above metrics are within acceptable ranges, which suggests there is little bias between the subgroups of the sex column. Let's try the address attribute.

In [None]:
# set up vectors for address

group_a = np.array(p_attr_test['address']=='U')
group_b = np.array(p_attr_test['address']=='R')
y_pred  = np.array(model.predict(X_test))
y_true  = np.array(y_test)

In [None]:
# evaluate fairness metrics for address
print ('Statistical Parity Q80   : ' + str(statistical_parity_regression(group_a, group_b, y_pred, q=0.8)))
print ('Disparate Impact Q80     : ' + str(disparate_impact_regression(group_a, group_b, y_pred, q=0.8)))
print ('MAE Ratio                : ' + str(mae_ratio(group_a, group_b, y_pred, y_true)))
print ('RMSE Ratio               : ' + str(rmse_ratio(group_a, group_b, y_pred, y_true)))

The disparate impact at quantile 0.8 is outside the fair range (0.8, 1.2): students living in urban areas are 1.8 times more likely to be predicted in the top 20% of grades than students living in rural areas.
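
As a rough sketch of what a quantile-based disparate impact measures, the code below marks a prediction as a "success" when it exceeds the q-quantile of all predictions, then compares success rates between the two groups. This is an illustration under stated assumptions, not the library's implementation: the helper name is hypothetical, and the strict inequality and the use of the pooled quantile are assumptions that may differ from `disparate_impact_regression`.

```python
import numpy as np

def quantile_success_rates(group_a, group_b, y_pred, q=0.8):
    """Success rate of each group, where success is exceeding the q-quantile."""
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(y_pred, q)  # pooled over both groups (assumption)
    rate_a = (y_pred[np.asarray(group_a, dtype=bool)] > threshold).mean()
    rate_b = (y_pred[np.asarray(group_b, dtype=bool)] > threshold).mean()
    return rate_a, rate_b

# Toy predictions 1..20; group membership is made up for illustration.
toy_pred = np.arange(1, 21)
toy_a = np.isin(toy_pred, [2, 4, 6, 8, 10, 12, 14, 17, 18, 19])
toy_b = ~toy_a

rate_a, rate_b = quantile_success_rates(toy_a, toy_b, toy_pred, q=0.8)

toy_disparate_impact = rate_a / rate_b       # ratio of success rates
toy_statistical_parity = rate_a - rate_b     # difference of success rates
toy_within_fair_range = 0.8 < toy_disparate_impact < 1.2

print(toy_disparate_impact, toy_statistical_parity, toy_within_fair_range)
```

In this toy example three of group_a's ten predictions land above the threshold against one of group_b's ten, so the ratio is about 3 and falls outside the (0.8, 1.2) range.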

In [None]:
print ('Disparate Impact Q80     : ' + str(disparate_impact_regression(group_a, group_b, y_true, q=0.8)))

When we compute the metric on the true values, the pattern is even worse: students living in urban areas are actually 4.2 times more likely to be in the top 20% of grades than students living in rural areas.

**Equality of outcome metrics (batch computation)**

Use address as the protected attribute.

In [None]:
# set up vectors for address

group_a = np.array(p_attr_test['address']=='U')
group_b = np.array(p_attr_test['address']=='R')
y_pred  = np.array(model.predict(X_test))

In [None]:
from holisticai.bias.metrics import regression_bias_metrics
regression_bias_metrics(group_a, group_b, y_pred, metric_type='equal_outcome')

**Equality of opportunity metrics (batch computation)**

Use address as the protected attribute.

In [None]:
# set up vectors for address

group_a = p_attr_test['address']=='U'
group_b = p_attr_test['address']=='R'
y_pred  = model.predict(X_test)
y_true  = y_test

In [None]:
regression_bias_metrics(group_a, group_b, y_pred, y_true, metric_type='equal_opportunity')

In [None]:
regression_bias_metrics(group_a=group_a, group_b=group_b, y_pred=y_pred, y_true=y_true, metric_type='both')

We can show all individual bias metrics by setting `metric_type` to 'individual'.

In [None]:
regression_bias_metrics(group_a=group_a, group_b=group_b, y_pred=y_pred, y_true=y_true, metric_type='individual')