## End-to-End Data Balance Analysis and Error Mitigation
This notebook demonstrates how to use the Data Balance Analysis capabilities together with the error mitigation functions. The example is an HR dataset: a tabular dataset whose label column indicates whether or not a person is promoted, with attributes such as education, gender, and number of trainings.
The steps in this notebook are:
1. Analyze how balanced the data is.
2. Train an example model to see how it performs on the data.
3. Rebalance the data to mitigate biases that may have resulted from unbalanced data.
4. Compare model performance and data balance metrics before and after rebalancing the data.

First we import all the dependencies needed for the analysis: the classes that produce the data balance metrics, the sklearn functions that report model performance, and the error mitigation steps, DataRebalance and DataSplit, that we apply to the dataset itself.

In [9]:
import pandas as pd
import sys
sys.path.append('../../ResponsibleAIToolbox-Mitigation/')
from imblearn.over_sampling import SMOTE

from dataprocessing import DataRebalance
from dataprocessing import DataSplit

from databalanceanalysis.utils import undummify
from databalanceanalysis import FeatureBalanceMeasure
from databalanceanalysis import DistributionBalanceMeasure
from databalanceanalysis import AggregateBalanceMeasure

from lightgbm import LGBMClassifier

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [10]:
from raiwidgets import ResponsibleAIDashboard
from responsibleai import RAIInsights

Now we load the tabular dataset for this example into a pandas dataframe that we can modify and use in all the other steps. For the data balance analysis we need the label column and a list of sensitive columns that we want to check for balance. For this example, we chose the education and gender columns. We would not want a model to be biased against promoting a person because of their gender or how much education they have; the measurable outputs of their work that are in the data, such as the percentage of KPIs met, should matter more when deciding whether or not to promote them.

In [11]:
   
data_dir = '../datasets/hr_promotion'
df =  pd.read_csv(data_dir + '/train.csv').drop(['employee_id'], axis=1)
cols_of_interest = ['education', 'gender']
label_col = 'is_promoted'
seed = 42
# drop duplicate rows and rows with missing values
df = df.drop_duplicates()
df = df.dropna()
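
Before training anything, it can help to look at how the label and a sensitive column are distributed. The sketch below uses a small made-up dataframe (the values are illustrative, not taken from the HR dataset) and only standard pandas calls.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["m", "m", "m", "f", "f", "m", "f", "m"],
    "is_promoted": [0, 0, 1, 0, 0, 0, 0, 0],
})

# Share of each label value: a strongly skewed share signals class imbalance
label_share = toy["is_promoted"].value_counts(normalize=True)

# Row counts per (gender, label) combination; missing combinations become 0
group_counts = toy.groupby(["gender", "is_promoted"]).size().unstack(fill_value=0)

print(label_share)
print(group_counts)
```

The same two calls can be run on `df` with `label_col` and each entry of `cols_of_interest` in place of the toy column names.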

Here we do a basic split on the data, train a LightGBM model, and see how this model does on some test data. The model does well on the false values, getting 99.73% of them correct, but it does much worse on the true values, identifying only about a third (33.57%) of the actual positives.

In [12]:
## Train a model and get accuracy numbers

# data prep
def split_label(dataset):
    x = dataset.drop(['is_promoted'], axis=1)
    y = dataset['is_promoted']
    return x, y

# dataset = pd.get_dummies(df, drop_first=False)
dataset = df
target_index = dataset.columns.get_loc('is_promoted')
data_split =  DataSplit(dataset,target_index , 0.9, 42, True, False, False, True)
train_data, test_data = data_split.Split()
# splitting the training data
x_train, y_train = split_label(train_data)
# splitting the test data
x_test, y_test = split_label(test_data)

# LGBMClassifier Model
clf = LGBMClassifier(n_estimators=50)
model = clf.fit(x_train, y_train)

pred = model.predict(x_test)

def conf_matrix(y,pred):
    ((tn, fp), (fn, tp)) = metrics.confusion_matrix(y, pred)
    ((tnr,fpr),(fnr,tpr))= metrics.confusion_matrix(y, pred, normalize='true')
    return pd.DataFrame([[f'TP = {tp} ({tpr:1.2%})', f'FN = {fn} ({fnr:1.2%})'], 
                         [f'FP = {fp} ({fpr:1.2%})', f'TN = {tn} ({tnr:1.2%})']],
                        index=['True', 'False'], 
                        columns=['Pred 1', 'Pred 0'])

print("number of errors on test dataset: " + str(sum(pred != y_test)))

conf_matrix(y_test,pred)

print(classification_report(y_test, pred)) 


Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.


number of errors on test dataset: 293


|       | Pred 1            | Pred 0             |
|-------|-------------------|--------------------|
| True  | TP = 142 (33.57%) | FN = 281 (66.43%)  |
| False | FP = 12 (0.27%)   | TN = 4426 (99.73%) |


              precision    recall  f1-score   support

           0       0.94      1.00      0.97      4438
           1       0.92      0.34      0.49       423

    accuracy                           0.94      4861
   macro avg       0.93      0.67      0.73      4861
weighted avg       0.94      0.94      0.93      4861
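
The figures in the classification report follow directly from the four confusion matrix counts printed above. The short calculation below is a worked check in plain Python; the variable names are ours, and only the counts come from the output.

```python
# Counts copied from the confusion matrix output above
tp, fn, fp, tn = 142, 281, 12, 4426

errors = fn + fp                        # misclassified test rows
recall = tp / (tp + fn)                 # share of actual positives found
precision = tp / (tp + fp)              # share of predicted positives that are right
true_negative_rate = tn / (tn + fp)     # share of actual negatives found
accuracy = (tp + tn) / (tp + fn + fp + tn)
f1 = 2 * precision * recall / (precision + recall)

print(f"errors={errors}, recall={recall:.4f}, precision={precision:.4f}")
print(f"TNR={true_negative_rate:.4f}, accuracy={accuracy:.4f}, f1={f1:.4f}")
```

The high accuracy is driven almost entirely by the majority class, which is why recall on class 1 is the number to watch here.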



LightGBM and most other models require numerical inputs, so a data scientist would usually one-hot encode any categorical data before training the model. Unfortunately, one-hot encoding can make it difficult to understand each column individually and to work with columns by their original name. The data balance analysis metrics rely on categorical column names, so we use a function that collapses the dummy variables back into a single column before doing our analysis.

In [18]:
train_df = undummify(train_data, prefix_sep = "-")
train_df
test_df = undummify(test_data, prefix_sep = "-")

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,department,region,education,gender,recruitment_channel
24292,1,35,1.0,10,0,0,48,0,Sales & Marketing,region_4,Master's & above,f,sourcing
5724,1,29,3.0,3,0,0,74,0,Procurement,region_7,Bachelor's,f,other
40782,1,56,3.0,5,0,0,46,0,HR,region_2,Bachelor's,f,other
42178,1,35,2.0,9,0,0,67,0,Procurement,region_2,Master's & above,f,other
20870,1,34,4.0,7,1,0,91,1,Procurement,region_7,Master's & above,m,other
...,...,...,...,...,...,...,...,...,...,...,...,...,...
29463,1,45,5.0,6,0,0,58,0,Operations,region_2,Bachelor's,m,other
52013,1,33,5.0,5,0,0,99,1,Analytics,region_28,Bachelor's,m,sourcing
51353,1,34,5.0,4,0,1,63,1,Sales & Marketing,region_4,Bachelor's,m,other
48648,1,31,4.0,6,0,0,46,0,Sales & Marketing,region_21,Bachelor's,m,other
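
To make the collapsing step concrete, here is a minimal sketch of how one-hot columns can be merged back into one categorical column. The function `collapse_dummies` is a hypothetical stand-in written for this illustration; it is not the library's `undummify`, whose internals may differ. It assumes that every column containing the separator is a dummy column named `<feature><sep><value>`.

```python
import pandas as pd

def collapse_dummies(df, prefix_sep="-"):
    """Hypothetical helper: merge '<feature><sep><value>' columns into '<feature>'."""
    groups = {}
    for col in df.columns:
        if prefix_sep in col:
            feature, value = col.split(prefix_sep, 1)
            groups.setdefault(feature, []).append((col, value))

    # Keep the columns that were never one-hot encoded
    out = df[[c for c in df.columns if prefix_sep not in c]].copy()
    for feature, cols in groups.items():
        names = [c for c, _ in cols]
        values = {c: v for c, v in cols}
        # idxmax returns, per row, the name of the dummy column holding the 1
        out[feature] = df[names].idxmax(axis=1).map(values)
    return out

encoded = pd.DataFrame({
    "age": [35, 29],
    "gender-f": [1, 0],
    "gender-m": [0, 1],
    "education-Bachelor's": [0, 1],
    "education-Master's & above": [1, 0],
})
decoded = collapse_dummies(encoded)
print(decoded)
```

A column whose original name already contains the separator would be misread by this sketch, which is one reason to pick a separator such as `-` that does not occur in the feature names.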


## TODO Ask Mers for help on how to get this dashboard working 
## TODO step 3 (use rai dashboard to look at error analysis and determine what cohorts have more errors)

In [16]:
train_data

Unnamed: 0,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted,department-Analytics,department-Finance,...,education-Below Secondary,education-Master's & above,education-nan,gender-f,gender-m,gender-nan,recruitment_channel-other,recruitment_channel-referred,recruitment_channel-sourcing,recruitment_channel-nan
24292,1,35,1.0,10,0,0,48,0,0,0,...,0,1,0,1,0,0,0,0,1,0
5724,1,29,3.0,3,0,0,74,0,0,0,...,0,0,0,1,0,0,1,0,0,0
40782,1,56,3.0,5,0,0,46,0,0,0,...,0,0,0,1,0,0,1,0,0,0
42178,1,35,2.0,9,0,0,67,0,0,0,...,0,1,0,1,0,0,1,0,0,0
20870,1,34,4.0,7,1,0,91,1,0,0,...,0,1,0,0,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29463,1,45,5.0,6,0,0,58,0,0,0,...,0,0,0,0,1,0,1,0,0,0
52013,1,33,5.0,5,0,0,99,1,1,0,...,0,0,0,0,1,0,0,0,1,0
51353,1,34,5.0,4,0,1,63,1,0,0,...,0,0,0,0,1,0,1,0,0,0
48648,1,31,4.0,6,0,0,46,0,0,0,...,0,0,0,0,1,0,1,0,0,0


In [22]:
task_type = "classification"
categorical_cols = ["department", "region", "education", "gender", "recruitment_channel"]
rai_insights = RAIInsights(model, train_data, test_data, label_col, task_type)
rai_insights.explainer.add()
rai_insights.error_analysis.add()
rai_insights.compute()
ResponsibleAIDashboard(rai_insights)

ResponsibleAI started at http://localhost:5001


<raiwidgets.responsibleai_dashboard.ResponsibleAIDashboard at 0x23967ca2ee0>

## TODO add data balance visualizations (step 4 doing data balance analysis)

First we can take a look at the feature balance measures. These measures indicate how the label column differs between feature values. For example, the first row here indicates whether people with the "Master's & above" education receive the promoted outcome at a different rate than those with a Bachelor's. Values closer to zero indicate that the two feature values have similar proportions of each label class.

In [None]:
feature_measures = FeatureBalanceMeasure( cols_of_interest, label_col)

feat_measures1 = feature_measures.measures(train_df)
feat_measures1
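
As an illustration of what a feature balance measure captures, the sketch below computes one of the simplest gap measures, the difference in positive-label rates between two feature values (often called demographic parity difference). The data is made up, and whether `FeatureBalanceMeasure` computes this quantity in exactly this way is an assumption.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "education": ["Master's & above"] * 4 + ["Bachelor's"] * 6,
    "is_promoted": [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# Promotion rate within each education level
rates = toy.groupby("education")["is_promoted"].mean()

# Gap between the two groups: 0 means both are promoted at the same rate
parity_gap = rates["Master's & above"] - rates["Bachelor's"]
print(rates)
print(f"parity gap: {parity_gap:.4f}")
```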

Next we can take a look at the distribution balance measures. These measures compare the distribution of each column of interest that we selected to the uniform distribution of its values. Values closer to zero indicate a smaller difference between the actual distribution of the data and the uniform distribution.

In [None]:
dist_measures = DistributionBalanceMeasure( cols_of_interest)
dist_measures1 = dist_measures.measures(train_df)
dist_measures1
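
One common way to compare an observed distribution to the uniform one is the Kullback-Leibler divergence. The sketch below implements it in plain Python on made-up counts; `kl_from_uniform` is a hypothetical helper, and the set of measures that `DistributionBalanceMeasure` actually reports is not assumed here.

```python
import math

def kl_from_uniform(counts):
    """KL divergence (natural log) between observed shares and a uniform reference."""
    total = sum(counts.values())
    uniform = 1.0 / len(counts)
    divergence = 0.0
    for count in counts.values():
        share = count / total
        if share > 0:  # terms with zero share contribute nothing
            divergence += share * math.log(share / uniform)
    return divergence

balanced = kl_from_uniform({"m": 50, "f": 50})
skewed = kl_from_uniform({"m": 70, "f": 30})
print(balanced, skewed)
```

The balanced column yields exactly zero, and the value grows as the column moves further from an even split.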


We can also look at the aggregate balance measures, which give a notion of overall inequality in the data. Here the Atkinson Index is 0.648.

In [None]:
agg_measures = AggregateBalanceMeasure( cols_of_interest)
agg_measures1 = agg_measures.measures(train_df)
agg_measures1
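
For readers unfamiliar with the Atkinson index, the sketch below implements its textbook definition on a list of positive group sizes. The inequality-aversion parameter `epsilon` and its default of 1 are assumptions made for this illustration; the value that `AggregateBalanceMeasure` uses internally is not stated in this notebook.

```python
import math

def atkinson_index(values, epsilon=1.0):
    """Textbook Atkinson index for strictly positive values (0 = perfectly equal)."""
    n = len(values)
    mean = sum(values) / n
    if epsilon == 1.0:
        # Limit case: one minus geometric mean over arithmetic mean
        geo_mean = math.exp(sum(math.log(v) for v in values) / n)
        return 1.0 - geo_mean / mean
    power_mean = (sum(v ** (1.0 - epsilon) for v in values) / n) ** (1.0 / (1.0 - epsilon))
    return 1.0 - power_mean / mean

equal_groups = atkinson_index([10, 10, 10, 10])
unequal_groups = atkinson_index([1, 100])
print(equal_groups, unequal_groups)
```

Equal group sizes give an index of zero, and the index approaches one as a few groups dominate the data.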

To rebalance the data we can choose from three different sampling methods: SMOTE, Tomek, and SMOTE-Tomek. SMOTE is an oversampling technique for the less represented class. Tomek is an undersampling technique applied to the more represented class. SMOTE-Tomek applies both of these methods together on the dataset. In this example, we use the SMOTE sampling technique on the columns of interest. The Rebalance function can only be applied to one column at a time, so to rebalance a cohort defined by two sensitive columns instead of an individual column, we combine these two columns into a single column that can be balanced. SMOTE also needs a numerical dataset, so we apply one-hot encoding to the remaining categorical columns.

In [None]:
smote = SMOTE()
# these are the other balance algorithm objects we could use
# (SMOTETomek lives in imblearn.combine, TomekLinks in imblearn.under_sampling)
# smote_tomek = SMOTETomek()
# tomek = TomekLinks()
dummy_df = pd.get_dummies(train_df.drop(["education", "gender"], axis = 1), prefix_sep = "-")
dummy_df["education_gender_cohort"] = train_df["education"] + " * " + train_df["gender"]
dummy_df.head()
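
The core idea behind SMOTE is to create synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours, which is also why it needs numerical inputs. The NumPy sketch below illustrates only that idea; `smote_like_samples` is a hypothetical helper and is not how imbalanced-learn implements `SMOTE`.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Illustrative only: interpolate between minority points and near neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment between the two points
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

minority = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic row lies on a line segment between two existing minority rows, so no value falls outside the range already seen in that class.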

In [None]:
data_balance_smote =  DataRebalance(dummy_df, 'education_gender_cohort', 'auto', 42, None, smote, None)

smote_df = data_balance_smote.Rebalance()
print(smote_df.shape)
# smote_df
# print(smote_df.head)


After applying SMOTE to the data, we train a new LightGBM model on the newly balanced data and check whether model performance changes. We compare the results below and find that the model trained on the rebalanced data does a better job of predicting true positives than the original model, and thus has greater recall and overall precision. So data rebalancing not only helps make a model less biased, it also helps the model fit the data and predict the outcomes more accurately.

In [None]:
curr_smote_df = smote_df.drop(["education_gender_cohort"], axis = 1)
# look up the label position in the dataframe that is actually split
target_index = curr_smote_df.columns.get_loc('is_promoted')
data_split =  DataSplit(curr_smote_df,target_index , 0.9, 42, False, False, False, True)
train_data, test_data = data_split.Split()
# splitting the training data
x_train2, y_train2 = split_label(train_data)
# splitting the test data
x_test2, y_test2 = split_label(test_data)

# LGBMClassifier Model
clf2 = LGBMClassifier(n_estimators=50)
model2 = clf2.fit(x_train2, y_train2)

pred2 = model2.predict(x_test2)



In [None]:
# Compare Results
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'


print('')
print(color.PURPLE + color.BOLD + "BEFORE: " + color.END + "number of test dataset instances: " + color.BOLD   + color.GREEN + str(len(y_test)) + color.END)
print("      : number of errors on test dataset: " + color.BOLD   + color.RED + str(sum(pred != y_test)) + color.END)
print('')
print(color.PURPLE + color.BOLD + "AFTER:  " + color.END + "number of test dataset instances: " + color.BOLD   + color.GREEN + str(len(y_test2)) + color.END)
print("     :  number of errors on test dataset: " + color.BOLD  + color.RED + str(sum(pred2 != y_test2)) + color.END)
print('')
print("-----------------------------------------------------------------------")
print("-----------------------------------------------------------------------")
print('')
print(color.BLUE + color.BOLD +"BEFORE: conf_matrix:" + color.END)
print("--------------------")
conf_matrix(y_test,pred) 
print('')
print(color.BLUE + color.BOLD +"AFTER: conf_matrix:" + color.END)
print("-------------------")
conf_matrix(y_test2,pred2)
print("-----------------------------------------------------------------------")
print("-----------------------------------------------------------------------")
print('')
print(color.YELLOW + color.BOLD +"BEFORE: classification_report:" + color.END)
print("--------------------------------")
print(classification_report(y_test, pred)) 
print(color.YELLOW + color.BOLD +"AFTER: classification_report:" + color.END)
print("--------------------------------")
print(classification_report(y_test2, pred2)) 
      

We return the dataframe columns to their original form after rebalancing the data. We want to run the data balance analysis again on the newly rebalanced data and see how the measures differ before and after applying SMOTE.

In [None]:
post_df = undummify(smote_df, prefix_sep ="-")
# the cohort label was joined with " * ", so split on that same string
post_df["education"] = post_df["education_gender_cohort"].apply( lambda x: x.split(" * ")[0])
post_df["gender"] = post_df["education_gender_cohort"].apply( lambda x: x.split(" * ")[1])
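
The cohort column was built earlier by joining the two values with `" * "`, so the separator used when splitting determines whether the recovered values match the original category names. The plain-Python round trip below uses made-up values to show the difference.

```python
# Build a cohort label the same way the notebook does: value + " * " + value
education, gender = "Master's & above", "f"
cohort = education + " * " + gender

# Splitting on the bare asterisk keeps the surrounding spaces
loose = cohort.split("*")

# Splitting on the full separator recovers the original values
exact = cohort.split(" * ")

print(loose)
print(exact)
```

Stray spaces matter here because the balance measures group rows by exact string value, so a padded value and an unpadded one would be treated as different categories.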

In [None]:
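# with ast_node_interactivity = "all", both tables are displayed:
# first the measures after rebalancing, then the original ones
feature_measures.measures(post_df)
feat_measures1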

In [None]:
# after rebalancing first, then the original measures for comparison
dist_measures.measures(post_df)
dist_measures1

In [None]:
# after rebalancing first, then the original measures for comparison
agg_measures.measures(post_df)
agg_measures1

TODO: Insert the RAIDashboard again here and look at the error analysis for specific cohorts