# **Bias Mitigation Using Fairlet Clustering**


## **Load the data**

In [1]:
# Imports
import numpy as np
import pandas as pd

#sys
import sys
sys.path.append('../../')

We will start by loading the Adult dataset, which ships with our library. The Adult dataset contains information extracted from the 1994 US Census database, including personal attributes such as sex, race, and education. In this tutorial we will cluster the data with unsupervised learning, then measure whether the resulting clusters encode gender or race information (clustering bias).

In [2]:
# Get data
from holisticai.datasets import load_adult
df = load_adult()['frame']
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25.0 | Private | 226802.0 | 11th | 7.0 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0.0 | 0.0 | 40.0 | United-States | <=50K |
| 1 | 38.0 | Private | 89814.0 | HS-grad | 9.0 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0.0 | 0.0 | 50.0 | United-States | <=50K |
| 2 | 28.0 | Local-gov | 336951.0 | Assoc-acdm | 12.0 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0.0 | 0.0 | 40.0 | United-States | >50K |
| 3 | 44.0 | Private | 160323.0 | Some-college | 10.0 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688.0 | 0.0 | 40.0 | United-States | >50K |
| 4 | 18.0 | | 103497.0 | Some-college | 10.0 | Never-married | | Own-child | White | Female | 0.0 | 0.0 | 30.0 | United-States | <=50K |


## **Preprocess data and Train a model**

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
# Dataset
dataset = load_adult()

# Dataframe
df = pd.concat([dataset["data"], dataset["target"]], axis=1)
protected_variables = ["sex", "race"]
output_variable = ["class"]

# Simple preprocessing
y = df[output_variable].replace({">50K": 1, "<=50K": 0})
X = pd.get_dummies(df.drop(protected_variables + output_variable, axis=1))
group = ["sex"]
group_a = df[group] == "Female"
group_b = df[group] == "Male"
data = [X, y, group_a, group_b]

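# Train/test split: train_test_split returns a (train, test) pair for each input,
# so even positions hold the training parts and odd positions the test parts
splits = train_test_split(*data, test_size=0.2, shuffle=True)
train_data = splits[::2]
test_data = splits[1::2]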
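
The `[::2]` and `[1::2]` slices above rely on the order in which `train_test_split` returns its outputs: for every input array it yields the training part followed by the test part. The sketch below mimics that ordering with plain strings (the labels are stand-ins, not real arrays) to show which pieces each slice picks up.

```python
# Stand-in labels for the eight outputs produced when four inputs
# (X, y, group_a, group_b) are split: each input yields a train/test pair.
outputs = [
    "X_train", "X_test",
    "y_train", "y_test",
    "group_a_train", "group_a_test",
    "group_b_train", "group_b_test",
]

train_parts = outputs[::2]   # even positions: the training halves
test_parts = outputs[1::2]   # odd positions: the test halves

print(train_parts)
print(test_parts)
```

This is why `train_data` can later be unpacked as `X_train, _, group_a_train, group_b_train`.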

In [5]:

from sklearn.cluster import KMeans
from holisticai.bias.metrics import clustering_bias_metrics

X_train, _, group_a_train, group_b_train = train_data
# we choose to use 4 clusters
model = KMeans(n_clusters = 4)
model.fit(X_train)

# test data
X, _, group_a, group_b = test_data

# predict
y_pred = model.predict(X)

centroids = model.cluster_centers_

# Compute bias metrics on the training data.
# metric_type='both' matches the mitigated runs below, so every row of the
# final comparison table is available for the baseline as well.
dfm = clustering_bias_metrics(group_a_train,
                              group_b_train,
                              model.labels_,
                              data = X_train,
                              centroids = centroids,
                              metric_type = 'both')

# Compute bias metrics on the test data
dfm_ts = clustering_bias_metrics(group_a,
                                 group_b,
                                 y_pred,
                                 data = X,
                                 centroids = centroids,
                                 metric_type = 'both')
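
The metrics returned by `clustering_bias_metrics` describe how the two groups are spread across the clusters. As a rough illustration of what a balance-style metric captures, the sketch below uses the definition common in the fairlet literature: the smallest within-cluster ratio between the two groups. This is an assumption made for illustration; the library's exact formulas for Cluster Balance and Minimum Cluster Ratio may differ. `cluster_ratios` and `balance` are hypothetical helper names, and the labels are toy data.

```python
from collections import Counter


def cluster_ratios(labels, is_group_a):
    """Return {cluster: min(a/b, b/a)}, where a and b count group members in that cluster."""
    counts_a, counts_b = Counter(), Counter()
    for label, in_a in zip(labels, is_group_a):
        if in_a:
            counts_a[label] += 1
        else:
            counts_b[label] += 1
    ratios = {}
    for label in set(labels):
        a, b = counts_a[label], counts_b[label]
        # A cluster that misses one group entirely is maximally unbalanced
        ratios[label] = 0.0 if a == 0 or b == 0 else min(a / b, b / a)
    return ratios


def balance(labels, is_group_a):
    """Balance of a clustering: the worst (smallest) ratio over all clusters."""
    return min(cluster_ratios(labels, is_group_a).values())


# Toy data: cluster 0 holds 2 + 2 members, cluster 1 holds 1 + 5 members
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
is_group_a = [True, True, False, False, True, False, False, False, False, False]

print(cluster_ratios(labels, is_group_a))
print(balance(labels, is_group_a))
```

Under this definition a value of 1 means both groups are equally represented in every cluster, and values near 0 mean at least one cluster is dominated by a single group.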

## **Using Fairlet Clustering Preprocessing**

In [6]:
from holisticai.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from holisticai.bias.mitigation import FairletClusteringPreprocessing

# Fairlet decomposition applied as a preprocessing step before clustering
decomposition = FairletClusteringPreprocessing(decomposition='Scalable', p=10, q=21, seed=42)

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('bm_preprocessing', decomposition),
    ('cluster', KMeans(n_clusters=4))])

# The bm__ prefix passes the group membership vectors to the bias mitigation step
pipeline.fit(X_train, bm__group_a=group_a_train, bm__group_b=group_b_train)
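
The values `p=10` and `q=21` are not explained in this notebook. Assuming they encode a target group ratio of `p/q`, as in the fairlet literature, the sketch below shows how such a pair could be derived from group counts. `suggest_p_q` is a hypothetical helper and the counts are made up for illustration. The Fairlet-In (Train) Minimum Cluster Ratio in the results table at the end of this notebook (0.47619) is consistent with this reading.

```python
from fractions import Fraction


def suggest_p_q(n_minority, n_majority, max_q=25):
    """Approximate the minority/majority ratio by a fraction p/q with q <= max_q."""
    ratio = Fraction(n_minority, n_majority).limit_denominator(max_q)
    return ratio.numerator, ratio.denominator


# The ratio implied by the parameters used in this notebook
p, q = 10, 21
print(round(p / q, 5))

# Made-up group counts with the same ratio, for illustration only
print(suggest_p_q(1000, 2100))
```

Note that `limit_denominator` may round slightly above the true ratio, so a pair chosen this way should be checked against the actual group counts before use.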


In [7]:
centroids2 = pipeline['cluster'].cluster_centers_
y_pred2 = pipeline.predict(X)
dfm2 = clustering_bias_metrics(group_a_train, 
                               group_b_train, 
                               pipeline['cluster'].labels_, 
                               data = X_train, 
                               centroids = centroids2, 
                               metric_type = 'both')

dfm2_ts = clustering_bias_metrics(group_a, 
                                  group_b, 
                                  y_pred2, 
                                  data = X, 
                                  centroids = centroids2, 
                                  metric_type = 'both')

## **Using Fairlet Clustering In-processing**

In [16]:
from holisticai.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from holisticai.bias.mitigation import FairletClustering

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('bm_inprocessing', FairletClustering(decomposition='Scalable', clustering_model='KMedoids', 
                                          p=10, q=21, n_clusters=4, seed=42))])
pipeline.fit(X_train, bm__group_a = group_a_train, bm__group_b = group_b_train)
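
To make the idea behind fairlet-based clustering more concrete, here is a minimal sketch on 1-D toy points. It groups each minority point with its `t` nearest unassigned majority points into a "fairlet", then assigns whole fairlets to clusters, so every cluster inherits the fairlets' group ratio. This is a naive greedy illustration, not the `Scalable` decomposition used above, and all names (`greedy_fairlets`, `center`) and data are hypothetical.

```python
def greedy_fairlets(minority, majority, t=2):
    """Naive (1, t)-fairlet decomposition; assumes len(majority) == t * len(minority)."""
    remaining = sorted(majority)
    fairlets = []
    for m in sorted(minority):
        # Take the t closest majority points that are still unassigned
        nearest = sorted(remaining, key=lambda x: abs(x - m))[:t]
        for x in nearest:
            remaining.remove(x)
        fairlets.append({"minority": [m], "majority": nearest})
    return fairlets


def center(fairlet):
    """Mean of all points in the fairlet, used as its representative."""
    points = fairlet["minority"] + fairlet["majority"]
    return sum(points) / len(points)


minority = [1.0, 9.0]
majority = [0.0, 2.0, 8.0, 10.0]
fairlets = greedy_fairlets(minority, majority)

# Stand-in cluster centers; in practice these would come from a clustering model
cluster_centers = [1.0, 9.0]
assignments = [
    min(range(len(cluster_centers)), key=lambda k: abs(center(f) - cluster_centers[k]))
    for f in fairlets
]

for fairlet, cluster in zip(fairlets, assignments):
    print(cluster, fairlet)
```

Because fairlets are never split across clusters, each cluster here contains one minority point for every two majority points.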

In [17]:
centroids3 = pipeline['bm_inprocessing'].cluster_centers_
y_pred3 = pipeline.predict(X)
dfm3 = clustering_bias_metrics(group_a_train, 
                               group_b_train, 
                               pipeline['bm_inprocessing'].labels_, 
                               data = X_train, 
                               centroids = centroids3, 
                               metric_type = 'both')

dfm3_ts = clustering_bias_metrics(group_a, 
                                  group_b, 
                                  y_pred3, 
                                  data = X, 
                                  centroids = centroids3, 
                                  metric_type = 'both')

In [18]:
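# Collect the metric values of all runs side by side.
# dfm3_ts is concatenated whole so that its Reference column is kept as the last column.
metrics = pd.concat([dfm["Value"], dfm_ts["Value"],
                     dfm2["Value"], dfm2_ts["Value"],
                     dfm3["Value"], dfm3_ts], axis=1)
metrics.columns = ['Baseline (Train)', 'Baseline (Test)',
                   'Fairlet-Pre (Train)', 'Fairlet-Pre (Test)',
                   'Fairlet-In (Train)', 'Fairlet-In (Test)',
                   'Reference']
metrics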

| Metric | Baseline (Train) | Baseline (Test) | Fairlet-Pre (Train) | Fairlet-Pre (Test) | Fairlet-In (Train) | Fairlet-In (Test) | Reference |
|---|---|---|---|---|---|---|---|
| Cluster Balance | 0.819081 | 0.831794 | 0.965721 | 0.783658 | 0.972921 | 0.548712 | 1 |
| Minimum Cluster Ratio | 0.372822 | 0.380488 | 0.476947 | 0.350746 | 0.47619 | 0.222222 | 1 |
| Cluster Distribution Total Variation | 0.026917 | 0.022305 | 0.020125 | 0.139153 | 0.000129 | 0.001063 | 0 |
| Cluster Distribution KL Div | 0.002569 | 0.00205 | 0.000888 | 0.042045 | 3e-06 | 0.000406 | 0 |
| Social Fairness Ratio | 0.969292 | 0.948549 | 0.965064 | 0.973969 | 0.965064 | 0.973969 | 1 |
| Silhouette Difference | 0.003664 | 0.001604 | -0.000659 | -0.002045 | -0.000545 | -0.000778 | 0 |
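
One way to read the table is to compare train and test values for the same method. The sketch below does this for Cluster Balance, using the numbers copied from the table above. In this particular run both fairlet variants move closer to the reference value of 1 on the training data, but their test values fall below the baseline. The results will vary between runs, since neither the train/test split nor the baseline `KMeans` is seeded. `train_test_gap` is a hypothetical helper name.

```python
# Cluster Balance values copied from the results table (reference value: 1)
cluster_balance = {
    "Baseline": {"train": 0.819081, "test": 0.831794},
    "Fairlet-Pre": {"train": 0.965721, "test": 0.783658},
    "Fairlet-In": {"train": 0.972921, "test": 0.548712},
}


def train_test_gap(scores):
    """Positive values mean the metric is better on train than on test."""
    return scores["train"] - scores["test"]


gaps = {name: train_test_gap(scores) for name, scores in cluster_balance.items()}

# Print the methods from the largest to the smallest train/test gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} train-test gap in Cluster Balance: {gap:+.3f}")
```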
