# **Measuring and mitigating bias with the Disparate impact remover**

This tutorial demonstrates how to implement the "Disparate impact remover" preprocessing method to enhance fairness in regression models using the `holisticai` library.

- [Traditional implementation](#traditional-implementation)
- [Pipeline implementation](#pipeline-implementation)

First, install the `holisticai` package if you haven't already:
```bash
pip install "holisticai[all]"
```
Then, import the necessary libraries.

```python
import warnings

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from holisticai.datasets import load_dataset
from holisticai.bias.mitigation import DisparateImpactRemover
from holisticai.bias.metrics import regression_bias_metrics

np.random.seed(0)  # fix the NumPy seed for reproducibility
warnings.filterwarnings("ignore")  # keep library warnings out of the cell output
```

Load the preprocessed "Communities and Crime" dataset with race as the protected attribute, then hold out 20% of the rows for testing.

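```python
# Load the dataset with group membership derived from the "race" attribute
dataset = load_dataset('us_crime', protected_attribute="race")

# 80/20 train/test split with a fixed seed
dataset = dataset.train_test_split(test_size=0.2, random_state=0)
train_data = dataset['train']
test_data = dataset['test']

dataset
```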

## **Bias Mitigation**
### **Traditional implementation**
We will apply the "Disparate impact remover" method, a preprocessing technique that transforms the input features before the regression model is trained.
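
To give some intuition for what such a remover does to the features, here is a minimal sketch of quantile-based repair: each feature value is replaced by the value found at the same within-group quantile of a shared target distribution, so that both groups end up with similar feature distributions. This is a simplified illustration and not the `holisticai` implementation. The function `repair_feature`, its `repair_level` parameter, and the choice of averaging the two groups' quantile functions as the target are all assumptions made for this sketch.

```python
import numpy as np


def repair_feature(x, group_a, group_b, repair_level=1.0):
    """Hypothetical, simplified quantile repair of one numeric feature."""
    x = np.asarray(x, dtype=float)
    repaired = x.copy()
    masks = [np.asarray(group_a, dtype=bool), np.asarray(group_b, dtype=bool)]

    # Assumed target: the average of the two per-group quantile functions.
    grid = np.linspace(0.0, 1.0, 101)
    target = np.mean([np.quantile(x[m], grid) for m in masks], axis=0)

    for m in masks:
        values = x[m]
        # Within-group rank of each value, scaled into (0, 1).
        ranks = (np.argsort(np.argsort(values)) + 0.5) / len(values)
        # Value of the target distribution at the same quantile.
        fully_repaired = np.interp(ranks, grid, target)
        # repair_level=0 keeps the original values, 1 applies the full repair.
        repaired[m] = (1 - repair_level) * values + repair_level * fully_repaired
    return repaired


# Toy data: the feature is shifted by about 2 units between the two groups.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
group_a = np.arange(1000) < 500
group_b = ~group_a

x_repaired = repair_feature(x, group_a, group_b)
print("mean gap before:", x[group_a].mean() - x[group_b].mean())
print("mean gap after: ", x_repaired[group_a].mean() - x_repaired[group_b].mean())
```

After a full repair the gap between the group means of this toy feature essentially vanishes, because values at the same within-group rank are mapped to the same target value. A partial `repair_level` trades some of that equalization for features that stay closer to the originals.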

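```python
LR = LinearRegression()

# Fit the remover on the training features and transform them
corr = DisparateImpactRemover()
processed_data = corr.fit_transform(train_data['X'], train_data['group_a'], train_data['group_b'])

# Train the regression model on the transformed features
LR.fit(processed_data, train_data['y'])

# Apply the fitted remover to the test features and predict
processed_data = corr.transform(test_data['X'], test_data['group_a'], test_data['group_b'])
y_pred = LR.predict(processed_data)

# Compute the regression bias metrics on the test predictions
df_dir = regression_bias_metrics(
    test_data['group_a'],
    test_data['group_b'],
    y_pred,
    test_data['y'],
    metric_type='both'
)
df_dir
```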

| Metric | Value | Reference |
|---|---|---|
| Disparate Impact Q90 | 0.13082 | 1 |
| Disparate Impact Q80 | 0.22756 | 1 |
| Disparate Impact Q50 | 0.478532 | 1 |
| Statistical Parity Q50 | -0.45059 | 0 |
| No Disparate Impact Level | 0.100759 | - |
| Average Score Difference | -0.198491 | 0 |
| Average Score Ratio | 0.502665 | 1 |
| Z Score Difference | -1.439428 | 0 |
| Max Statistical Parity | 0.494543 | 0 |
| Statistical Parity AUC | 0.31646 | 0 |
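
For intuition about how to read these metrics, the sketch below recomputes two of them by hand on toy data. It assumes that a "Disparate Impact" metric at quantile `q` is the ratio between the two groups' rates of predictions above the `q`-quantile of all predictions, and that "Average Score Difference" is the mean prediction of `group_a` minus that of `group_b`. These definitions and the helper names `disparate_impact_at_quantile` and `average_score_difference` are assumptions for illustration and may differ in detail from what `regression_bias_metrics` computes.

```python
import numpy as np


def disparate_impact_at_quantile(group_a, group_b, y_pred, q=0.8):
    """Assumed definition: ratio of the groups' rates of scores above the q-quantile."""
    y_pred = np.asarray(y_pred, dtype=float)
    a = np.asarray(group_a, dtype=bool)
    b = np.asarray(group_b, dtype=bool)
    threshold = np.quantile(y_pred, q)  # quantile over all predictions
    rate_a = np.mean(y_pred[a] > threshold)
    rate_b = np.mean(y_pred[b] > threshold)
    # Note: undefined if no group_b prediction exceeds the threshold.
    return rate_a / rate_b


def average_score_difference(group_a, group_b, y_pred):
    """Assumed definition: mean prediction of group_a minus that of group_b."""
    y_pred = np.asarray(y_pred, dtype=float)
    a = np.asarray(group_a, dtype=bool)
    b = np.asarray(group_b, dtype=bool)
    return y_pred[a].mean() - y_pred[b].mean()


# Toy predictions: group_b receives slightly higher scores than group_a.
y_pred = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
group_a = np.array([True, False] * 5)
group_b = ~group_a

di_q50 = disparate_impact_at_quantile(group_a, group_b, y_pred, q=0.5)
asd = average_score_difference(group_a, group_b, y_pred)
print("Disparate impact at q=0.5:", di_q50)
print("Average score difference:", asd)
```

Under these assumed definitions, a ratio of 1 and a difference of 0 indicate parity between the groups, which is how the Reference column in the table can be read.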


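```python
# RMSE as the square root of the MSE; this avoids the `squared=False`
# argument, which recent scikit-learn releases no longer accept.
dispimp_rmse = np.sqrt(mean_squared_error(test_data['y'], y_pred))
print("RMS error: {}".format(dispimp_rmse))
```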

RMS error: 0.16073221630461687


### **Pipeline implementation**

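The same workflow can be wrapped in a `holisticai` `Pipeline`, which chains the mitigator and the model so that fitting and prediction each take a single call.

```python
from holisticai.pipeline import Pipeline

model = LinearRegression()
pipeline = Pipeline(
    steps=[
        ("bm_preprocessing", DisparateImpactRemover()),
        ("model", model),
    ]
)

# Parameters prefixed with "bm__" are passed to the bias-mitigation step
fit_params = {
    "bm__group_a": train_data['group_a'],
    "bm__group_b": train_data['group_b'],
}

pipeline.fit(train_data['X'], train_data['y'], **fit_params)
pipeline
```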

```python
# Pass the test-set group membership to the bias-mitigation step
predict_params = {
    "bm__group_a": test_data['group_a'],
    "bm__group_b": test_data['group_b'],
}
y_pred = pipeline.predict(test_data['X'], **predict_params)

# Compute the regression bias metrics on the pipeline predictions
df_disprem = regression_bias_metrics(
    test_data['group_a'],
    test_data['group_b'],
    y_pred,
    test_data['y'],
    metric_type='both'
)
df_disprem
```

| Metric | Value | Reference |
|---|---|---|
| Disparate Impact Q90 | 0.13082 | 1 |
| Disparate Impact Q80 | 0.22756 | 1 |
| Disparate Impact Q50 | 0.478532 | 1 |
| Statistical Parity Q50 | -0.45059 | 0 |
| No Disparate Impact Level | 0.100759 | - |
| Average Score Difference | -0.198491 | 0 |
| Average Score Ratio | 0.502665 | 1 |
| Z Score Difference | -1.439428 | 0 |
| Max Statistical Parity | 0.494543 | 0 |
| Statistical Parity AUC | 0.31646 | 0 |


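```python
# RMSE as the square root of the MSE; this avoids the `squared=False`
# argument, which recent scikit-learn releases no longer accept.
dispimp_rmse = np.sqrt(mean_squared_error(test_data['y'], y_pred))
print("RMS error: {}".format(dispimp_rmse))
```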

RMS error: 0.16073221630461687
