# End to End Data Balance and Error Mitigation
This notebook demonstrates how to use the Data Balance Analysis capabilities and the error mitigation functions together on an example HR dataset. The dataset is tabular, with a label column that indicates whether or not a person is promoted based on attributes such as education, gender, number of trainings, and other factors.
The steps that we will take in this notebook are:
1. We will first conduct an analysis of how balanced the data is.
2. We will train an example model to see how it performs on the data.
3. We will try to balance the data to mitigate biases that may have resulted from unbalanced data.
4. We will then compare model performance and data balance metrics before and after rebalancing the data.

First we import all the dependencies needed in our analysis. This includes the classes that produce the data balance metrics, the sklearn functions for measuring model performance, and the error mitigation classes Rebalance and Split that we apply to the dataset itself.

In [None]:
!pip install -e ../../responsible-ai-mitigations

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from imblearn.over_sampling import SMOTE

from raimitigations.dataprocessing import Rebalance, Split

from raimitigations.databalanceanalysis import FeatureBalanceMeasure, AggregateBalanceMeasure, DistributionBalanceMeasure

from lightgbm import LGBMClassifier

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import OrdinalEncoder

### Train LightGBM Model
Now we import the tabular dataset used in this example and load it into a pandas dataframe that we can then modify and use for all the other steps. For the data balance analysis portion we need our label column and a list of sensitive columns that we are interested in checking for balance.

In [None]:
from urllib.request import urlretrieve
import zipfile
import pathlib

import pandas as pd

outdirname = 'mitigations-datasets.2.22.2022'
zipfilename = outdirname + '.zip'
if not pathlib.Path(outdirname).exists():
    urlretrieve('https://publictestdatasets.blob.core.windows.net/data/' + zipfilename, '../../' + zipfilename)
    with zipfile.ZipFile('../../' + zipfilename, 'r') as unzip:
        unzip.extractall('../../.')

data_dir = ('../' + outdirname + '/hr_promotion')
df =  pd.read_csv(data_dir + '/train.csv').drop(['employee_id'], axis=1)

We do some data transformation on the categorical columns to put the training data in the format that the LightGBM model expects. Although LightGBM can internally deal with categorical columns that the user specifies, we need to encode those categories as integers before we can train the LightGBM model on them.
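
As a rough illustration of what this encoding step does, the sketch below maps a few hypothetical category values to integer codes and back using plain Python. It assumes that categories are ordered lexicographically before codes are assigned, which is how a default ordinal encoder typically behaves; the toy values are illustrative and not taken from the HR dataset.

```python
# Hypothetical toy column; the values are illustrative only.
education = ["Bachelor's", "Master's & above", "Below Secondary", "Bachelor's"]

# Assumption: categories are sorted lexicographically before codes are assigned.
categories = sorted(set(education))
to_code = {category: code for code, category in enumerate(categories)}

# Encode each value as its integer code, then decode it again.
encoded = [to_code[value] for value in education]
decoded = [categories[code] for code in encoded]

print(encoded)
print(decoded == education)
```

Keeping the list of categories around is what makes it possible to map the integer codes back to the original string values later on.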

In [None]:
cols_of_interest = ['education', 'recruitment_channel']
categorical_cols = ["department", "gender", "education", "region", "recruitment_channel" ]
label_col = 'is_promoted'
seed = 42
# handle duplicates
# employee_id was already dropped when the CSV was loaded, so only duplicates are removed here
df = df.drop_duplicates()
df = df.dropna()
ord_enc = OrdinalEncoder(dtype = int)
df[categorical_cols] = ord_enc.fit_transform(df[categorical_cols])
df.head()

Here we split the data, train a LightGBM model, and see how the model does on held-out test data. The model does well on the negative (not promoted) examples, classifying 97.3% of them correctly, but it does a lot worse on the positive (promoted) examples, correctly identifying only about a third of them.
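
The per-class percentages quoted here are row-normalized confusion matrix entries: the share of actual negatives predicted as negative, and the share of actual positives predicted as positive. A minimal sketch of that calculation in plain Python, using hypothetical labels and predictions rather than the notebook's results:

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels encoded as 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
true_negative_rate = tn / (tn + fp)  # share of actual negatives predicted correctly
true_positive_rate = tp / (tp + fn)  # share of actual positives predicted correctly

print(f"TNR = {true_negative_rate:.2%}, TPR = {true_positive_rate:.2%}")
```

A high overall accuracy can hide a low true positive rate when the positive class is rare, which is why both rates are reported separately.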

In [None]:
## Train a model and get accuracy numbers

# data prep
def split_label(dataset):
    x = dataset.drop(['is_promoted'], axis=1)
    y = dataset['is_promoted']
    return x, y

dataset = df
target_index = dataset.columns.get_loc('is_promoted')
data_split =  Split(dataset,target_index , 0.9, 42, False, False, False, True)
train_data, test_data = data_split.split()
# splitting the training data
x_train, y_train = split_label(train_data)
# splitting the test data
x_test, y_test = split_label(test_data)

# LGBMClassifier Model
clf = LGBMClassifier(n_estimators=50)
model = clf.fit(x_train, y_train, categorical_feature = categorical_cols)

pred = model.predict(x_test)

def conf_matrix(y,pred):
    ((tn, fp), (fn, tp)) = metrics.confusion_matrix(y, pred)
    ((tnr,fpr),(fnr,tpr))= metrics.confusion_matrix(y, pred, normalize='true')
    return pd.DataFrame([[f'TP = {tp} ({tpr:1.2%})', f'FN = {fn} ({fnr:1.2%})'], 
                         [f'FP = {fp} ({fpr:1.2%})', f'TN = {tn} ({tnr:1.2%})']],
                        index=['True', 'False'], 
                        columns=['Pred 1', 'Pred 0'])

print("number of errors on test dataset: " + str(sum(pred != y_test)))

print(conf_matrix(y_test,pred))

print(classification_report(y_test, pred)) 


### Error Analysis on Baseline Model
Now that we have a baseline model to work with, we can see how this model is doing overall on the data and check whether there are any cohorts within the data that it performs worse on. We use the [Error Analysis Dashboard](https://erroranalysis.ai/) to determine which cohorts these are. Since the error analysis dashboard is interactive and too large to render on GitHub, we include screenshots from our analysis instead. From these screenshots we can see that when we zoom in on the cohorts where the model makes more errors, region, department and education are all attributes involved in those cohorts. For the purpose of this example, we chose to remove some of the other columns, such as KPIs_met, from the error analysis, since we want to focus on attributes that may lead to biases rather than on more measurable attributes. We will focus on analyzing and mitigating errors within the department and education columns for the rest of the analysis.

In [None]:
from raiwidgets import ErrorAnalysisDashboard
predictions = model.predict(x_test)
#ErrorAnalysisDashboard(dataset=x_test, true_y=y_test, features=x_test.columns, pred_y=predictions, categorical_features = categorical_cols)


![error_analysis1](images/error_analysis1.png)

### Data Balance Analysis
First we can take a look at the feature balance measures. These measures indicate the difference in the label column between different feature values. For example, the first row here indicates whether people with a "Masters & above" education have a different proportion of promoted outcomes than those with a Bachelor's. Lower values of these measures indicate that the proportions of people with a label of 1 in class A and in class B are similar. The t-test value can also tell us whether the difference we see is statistically significant.
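
To make the idea of a gap between two classes concrete, the sketch below computes two such gaps by hand on a hypothetical toy sample. It assumes the common definitions, where the demographic parity gap is the difference between the positive rates of class A and class B, and the pointwise mutual information gap is the difference of their logarithms; the library's exact formulas may differ.

```python
import math

# Hypothetical toy rows of (education, is_promoted); not taken from the HR dataset.
rows = [
    ("Bachelor's", 1), ("Bachelor's", 0), ("Bachelor's", 0), ("Bachelor's", 0),
    ("Master's & above", 1), ("Master's & above", 1),
    ("Master's & above", 0), ("Master's & above", 0),
]

def positive_rate(rows, value):
    """Share of rows with the given feature value whose label is 1."""
    labels = [label for feature, label in rows if feature == value]
    return sum(labels) / len(labels)

p_a = positive_rate(rows, "Master's & above")  # class A
p_b = positive_rate(rows, "Bachelor's")        # class B

# Assumed definitions: gaps between the two classes' positive rates.
dp_gap = p_a - p_b
pmi_gap = math.log(p_a) - math.log(p_b)

print(f"dp gap = {dp_gap:.3f}, pmi gap = {pmi_gap:.3f}")
```

Both gaps are zero when the two classes have the same positive rate, and they change sign when class A and class B are swapped.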

In [None]:
# work on a copy so that decoding the categories does not modify train_data in place
train_df = train_data.copy()
train_df[categorical_cols] = ord_enc.inverse_transform(train_df[categorical_cols])

In [None]:
feature_measures = FeatureBalanceMeasure( cols_of_interest, label_col)

feat_measures1 = feature_measures.measures(train_df)
feat_measures1.head()

In [None]:
%matplotlib inline
educations = train_df['education'].unique()
education_dp_values = feat_measures1[feat_measures1["FeatureName"] == 'education'][["ClassA", "ClassB", "pmi"]]
education_dp_array = np.zeros((len(educations), len(educations)))

for idx, row in education_dp_values.iterrows():
    class_a = row["ClassA"]
    class_b = row["ClassB"]
    dp_value = row["pmi"]
    # look up the matrix position of each class by its index in `educations`
    i, j = np.where(educations == class_a)[0][0], np.where(educations == class_b)[0][0]
    dp_value = round(dp_value, 2)
    education_dp_array[i, j] = dp_value
    education_dp_array[j, i] = -1 * dp_value

colormap = "RdBu"
dp_min, dp_max = -1.0, 1.0

fig, ax = plt.subplots()
im = ax.imshow(education_dp_array, vmin=dp_min, vmax=dp_max, cmap=colormap)

cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel("Pointwise Mutual Information", rotation=-90, va="bottom")

ax.set_xticks(np.arange(len(educations)))
ax.set_yticks(np.arange(len(educations)))
ax.set_xticklabels(educations)
ax.set_yticklabels(educations)

plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

for i in range(len(educations)):
    for j in range(len(educations)):
        text = ax.text(j, i, education_dp_array[i, j], ha="center", va="center", color="k")
    
ax.set_title("PMI of education in HR Dataset")
fig.tight_layout()
plt.show()

Next we can take a look at the distribution balance measures. These measures compare the distribution of each of the columns of interest that we selected to the uniform distribution over its values. Values closer to zero indicate a smaller difference between the actual distribution of the data and the uniform distribution.
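
As a rough sketch of what comparing a column to the uniform distribution means, the code below computes a few standard distances by hand for a hypothetical column. The dictionary keys mirror the measure names used in this notebook, but the formulas are the textbook definitions and are only assumed to match what the library computes; the Wasserstein distance is left out for brevity.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distance_to_uniform(values):
    """Compare the observed value frequencies to a uniform distribution."""
    counts = Counter(values)
    observed = [counts[key] / len(values) for key in sorted(counts)]
    uniform = [1 / len(observed)] * len(observed)
    midpoint = [(p + q) / 2 for p, q in zip(observed, uniform)]
    js_divergence = (0.5 * kl_divergence(observed, midpoint)
                     + 0.5 * kl_divergence(uniform, midpoint))
    return {
        "kl_divergence": kl_divergence(observed, uniform),
        "js_dist": math.sqrt(max(js_divergence, 0.0)),
        "inf_norm_dist": max(abs(p - q) for p, q in zip(observed, uniform)),
        "total_variation_dist": 0.5 * sum(abs(p - q) for p, q in zip(observed, uniform)),
    }

# Hypothetical columns: one skewed, one perfectly balanced.
skewed = ["sourcing"] * 5 + ["other"] * 4 + ["referred"] * 1
balanced = ["sourcing", "other", "referred"] * 3

print(distance_to_uniform(skewed))
print(distance_to_uniform(balanced))
```

For the balanced column every distance is zero, while the skewed column produces positive values for all of them.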

In [None]:
dist_measures = DistributionBalanceMeasure( cols_of_interest)
dist_measures1 = dist_measures.measures(train_df)
dist_measures1


In [None]:
%matplotlib inline
measures_of_interest = ["kl_divergence", "js_dist", "inf_norm_dist", "total_variation_dist", "wasserstein_dist"]
education_measures = dist_measures1[dist_measures1['FeatureName'] == 'education'].iloc[0]
recruitment_measures = dist_measures1[dist_measures1['FeatureName'] == 'recruitment_channel'].iloc[0]
education_array = [round(education_measures[measure], 4) for measure in measures_of_interest]
recruitment_array = [round(recruitment_measures[measure], 4) for measure in measures_of_interest]

x = np.arange(len(measures_of_interest))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, education_array, width, label="education")
rects2 = ax.bar(x + width/2, recruitment_array, width, label="recruitment channel")

ax.set_xlabel("Measure")
ax.set_ylabel("Value")
ax.set_title("Distribution Balance Measures of Education and Recruitment Channel in HR Dataset")
ax.set_xticks(x)
ax.set_xticklabels(measures_of_interest)
ax.legend()

plt.setp(ax.get_xticklabels(), rotation=20, ha="right", rotation_mode="default")

def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate('{}'.format(height),
                    xy=(rect.get_x() + rect.get_width() / 2, height),
                    xytext=(0, 1),  # 1 point vertical offset
                    textcoords="offset points",
                    ha='center', va='bottom')

autolabel(rects1)
autolabel(rects2)

fig.tight_layout()

plt.show()

We can also look at aggregate balance measures, which indicate a notion of overall inequality in the data. We can see that the Atkinson index is approximately 0.8. This means that in order to create a perfectly balanced dataset over these columns we would need to forgo roughly 80% of the data.
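
For intuition about how such an index behaves, the sketch below computes an Atkinson index by hand over the shares of each observed combination of two hypothetical sensitive columns. It assumes the standard form with inequality aversion parameter 1, that is, one minus the ratio of the geometric mean to the arithmetic mean of the shares; the parameter and the way the library builds the shares are assumptions here.

```python
import math
from collections import Counter

def atkinson_index(rows):
    """Atkinson index (epsilon = 1, assumed) over the shares of each observed combination."""
    counts = Counter(rows)
    shares = [count / len(rows) for count in counts.values()]
    arithmetic_mean = sum(shares) / len(shares)
    geometric_mean = math.exp(sum(math.log(share) for share in shares) / len(shares))
    return 1 - geometric_mean / arithmetic_mean

# Hypothetical (education, recruitment_channel) combinations.
skewed_rows = [("A", "x")] * 8 + [("A", "y")] * 1 + [("B", "x")] * 1
balanced_rows = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")] * 5

print(round(atkinson_index(skewed_rows), 4))
print(round(atkinson_index(balanced_rows), 4))
```

The index is zero when every combination is equally represented and grows toward one as a few combinations dominate the data.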

In [None]:
agg_measures = AggregateBalanceMeasure( cols_of_interest)
agg_measures1 = agg_measures.measures(train_df)
agg_measures1

### Error Mitigation: Rebalancing dataset
In order to rebalance the data we can choose from three different methods of under- or oversampling: SMOTE, Tomek and SMOTE-Tomek. SMOTE is an oversampling technique for the less represented class. Tomek is an undersampling technique that is applied to the more represented class. SMOTE-Tomek applies both of these methods in conjunction on the dataset. In this example, we use the SMOTE sampling technique on the columns of interest. The Rebalance function can only be applied to one column at a time, so in order to apply this rebalancing technique to a cohort of two sensitive columns instead of an individual column, we combine these two columns into a single column that can be balanced.
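
The core idea behind SMOTE-style oversampling can be sketched in a few lines: a synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified illustration in plain Python with hypothetical points and parameter names; it is not the imblearn implementation used in this notebook.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for idx, p in enumerate(minority) if idx != base_idx]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # random position on the segment between the base point and its neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (4.0, 4.0)]
synthetic = smote_like_samples(minority, n_new=4)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new rows stay inside the region already covered by that class rather than being exact duplicates.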

In [None]:
smote = SMOTE()
# these are the other balance algorithm objects we could use but
#  we are using the SMOTE resampling technique in this example
# smote_tomek = SMOTETomek()
# tomek = TomekLinks()

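def combine_cols(row):
    # join the two encoded values into a single cohort key, e.g. "0 * 2"
    return str(row.iloc[0]) + " * " + str(row.iloc[1])

train_df2 = train_df.copy()
train_df2[categorical_cols] = ord_enc.transform(train_df[categorical_cols])
# build the cohort column from the training rows themselves rather than from the full df
train_df2["education_recruitment_cohort"] = train_df2[["education", "recruitment_channel"]].apply(combine_cols, axis=1)
train_df2 = train_df2.drop(["education", "recruitment_channel"], axis=1)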

In [None]:
data_balance_smote = Rebalance(train_df2, 'education_recruitment_cohort', 'auto', 42, None, smote, None)

smote_df = data_balance_smote.rebalance()
smote_df['education'] = smote_df['education_recruitment_cohort'].apply(lambda x: int(x.split(" * ")[0]))
smote_df['recruitment_channel'] = smote_df['education_recruitment_cohort'].apply(lambda x: int(x.split(" * ")[1]))
smote_df = smote_df.drop(["education_recruitment_cohort"], axis=1)

TODO size of dataset

### New Model on Rebalanced Datasets
After applying SMOTE to the data, we train a new LightGBM model on the rebalanced data and check whether model performance changes. We compare the results below and find that the model trained on the rebalanced data does a better job of predicting true positives than the original model, and thus has greater recall and overall precision. So data rebalancing not only helps make the model less biased, it also helps the model fit the data and predict outcomes more accurately.
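
Since the comparison rests on precision and recall, the sketch below shows how the two metrics are computed for two sets of predictions on the same test labels. The labels and predictions are hypothetical and chosen only to illustrate the calculation; they are not the notebook's results.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical test labels and predictions from two models.
y_true      = [1, 1, 1, 1, 0, 0, 0, 0]
pred_before = [1, 0, 0, 0, 0, 0, 0, 0]
pred_after  = [1, 1, 1, 0, 1, 0, 0, 0]

before = precision_recall(y_true, pred_before)
after = precision_recall(y_true, pred_after)
print("before: precision=%.2f recall=%.2f" % before)
print("after:  precision=%.2f recall=%.2f" % after)
```

In this toy case recall rises while precision falls, which is why it is worth checking both metrics on the same test set rather than relying on a single number.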

In [None]:
target_index = smote_df.columns.get_loc('is_promoted')
data_split =  Split(smote_df,target_index , 0.9, 42, False, False, False, True)
train_data, test_data = data_split.split()
# splitting the training data
x_train2, y_train2 = split_label(train_data)
# splitting the test data
x_test2, y_test2 = split_label(test_data)

# LGBMClassifier Model
clf2 = LGBMClassifier(n_estimators=50)
model2 = clf2.fit(x_train2, y_train2, categorical_feature = categorical_cols)

pred2 = model2.predict(x_test2)
# reorder the columns to match the order the first model was trained on, since rebalancing rearranged them
pred_model1 = model.predict(x_test2.reindex(columns = x_test.columns))

def conf_matrix(y,pred):
    ((tn, fp), (fn, tp)) = metrics.confusion_matrix(y, pred)
    ((tnr,fpr),(fnr,tpr))= metrics.confusion_matrix(y, pred, normalize='true')
    return pd.DataFrame([[f'TP = {tp} ({tpr:1.2%})', f'FN = {fn} ({fnr:1.2%})'], 
                         [f'FP = {fp} ({fpr:1.2%})', f'TN = {tn} ({tnr:1.2%})']],
                        index=['True', 'False'], 
                        columns=['Pred 1', 'Pred 0'])


We compare the number of errors made by model 1, which was trained before rebalancing, and model 2, which was trained after rebalancing, and we find that overall model 2 makes fewer errors.

In [None]:
# Compare Results
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


print('')
print(color.PURPLE + color.BOLD + "BEFORE: " + color.END + "number of test dataset instances: " + color.BOLD   + color.GREEN + str(len(y_test2)) + color.END)
print("      : number of errors on test dataset: " + color.BOLD   + color.RED + str(sum(pred_model1 != y_test2)) + color.END)
print('')
print(color.PURPLE + color.BOLD + "AFTER:  " + color.END + "number of test dataset instances: " + color.BOLD   + color.GREEN + str(len(y_test2)) + color.END)
print("     :  number of errors on test dataset: " + color.BOLD  + color.RED + str(sum(pred2 != y_test2)) + color.END)
print('')

In [None]:
# compare conf matrices
print("-----------------------------------------------------------------------")
print('')
print(color.BLUE + color.BOLD +"BEFORE: conf_matrix:" + color.END)
print("--------------------")
print(conf_matrix(y_test2,pred_model1) )
print('')
print(color.BLUE + color.BOLD +"AFTER: conf_matrix:" + color.END)
print("-------------------")
print(conf_matrix(y_test2,pred2))
print("-----------------------------------------------------------------------")
print("-----------------------------------------------------------------------")
print('')

TODO call out which improved

In [None]:
# compare classification report
print(color.YELLOW + color.BOLD +"BEFORE: classification_report:" + color.END)
print("--------------------------------")
print(classification_report(y_test2, pred_model1)) 
print(color.YELLOW + color.BOLD +"AFTER: classification_report:" + color.END)
print("--------------------------------")
print(classification_report(y_test2, pred2)) 

We return the dataframe columns to their original, unencoded form after rebalancing the data. This lets us run the data balance analysis again on the rebalanced data and see how the measures differ before and after applying SMOTE.

In [None]:
smote_df[categorical_cols] = ord_enc.inverse_transform(smote_df[categorical_cols]) 

The feature balance measures before rebalancing do not indicate much discrepancy in outcomes between classes of a feature: measures such as demographic parity, which is on a 0 to 1 scale, are very close to zero. After rebalancing, the values remain similarly low; there is no significant improvement because they were already low to begin with.

todo feat balance comparison

In [None]:
feat_measures1.head()

In [None]:
feature_measures.measures(smote_df).head()

When we compare the distribution measures before and after rebalancing, we find that the data is much more evenly distributed (closer to the uniform distribution) for the two columns of interest after rebalancing the data using the SMOTE algorithm.

In [None]:
#before
dist_measures1

In [None]:
#after
dist_measures.measures(smote_df)


The Atkinson index, which gives us an overall notion of inequality, shows that after rebalancing we would no longer need to forgo any of the data in order to get a perfectly balanced dataset.

In [None]:
#before
agg_measures1

In [None]:
#after
agg_measures.measures(smote_df)

In [None]:
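# reuse model2, which was trained on the rebalanced data above; reorder the
# original test columns to match the order that model was trained on
predictions2 = model2.predict(x_test.reindex(columns=x_train2.columns))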
#ErrorAnalysisDashboard(dataset=x_test, true_y=y_test, features=x_test.columns, pred_y=predictions2, categorical_features = categorical_cols)

label the top as before and the bottom as after
highlight underneath the cohort
lower fill (meaning )

![error_analysis2](images/error_analysis2.png)