# AIFair360 Demo: Binary Classification with the German Credit Dataset

The goal of this tutorial is to introduce the basic functionality of AI Fairness 360 to an interested developer who may not have a background in bias detection and mitigation.

## Biases and Machine Learning 

A machine learning model makes predictions of an outcome for a particular instance. (Given an instance of a loan application, predict if the applicant will repay the loan.) The model makes these predictions based on a training dataset, where many other instances (other loan applications) and actual outcomes (whether they repaid) are provided. Thus, a machine learning algorithm will attempt to find patterns, or generalizations, in the training dataset to use when a prediction for a new instance is needed. (For example, one pattern it might discover is "if a person has a salary > USD 40K and has outstanding debt < USD 5, they will repay the loan".) 

However, there are scenarios in which these patterns may not be fair. Imagine a situation in which our algorithms predict a positive outcome for male candidates 80% of the time, but only 20% of the time for female candidates, or vice versa. This is an example of bias in machine learning. It can arise in several ways, for example from underrepresentation of a group in the training data.



**AIFair360** is designed to help address this problem with fairness metrics and bias mitigators. 

1.  **Fairness metrics - can be used to check for bias in machine learning workflows.**

    The research literature identifies 3 different categories of fairness metrics:

    1.  Statistical measures: These are metrics that rely on statistical definitions such as True Positives, False Positives, False Negative Rate...

    2.  Similarity-based / Individual measures: Statistical definitions largely ignore all attributes of the classified subject except the sensitive attribute. This may hide unfairness. Similarity-based measures attempt to address such issues by not marginalizing insensitive attributes. 

    3.  Causal reasoning: These definitions assume a given causal graph in which nodes represent attributes and edges relationships between them. These relations are captured by structural equations that aim to build algorithms that ensure a tolerable level of fairness. 

    Most metrics that can be calculated using the `aif360.metrics` module fall under statistical measures. Before going into examples of such measures, it is worth mentioning that they depend on the definition of protected/sensitive attributes, and consequently (un)protected groups/classes.
    
    - Protected attributes are features that may not be used as the basis for decisions. Protected attributes could be chosen because of legal mandates or because of organizational values. Some common protected attributes include race, religion, national origin, gender, marital status, age, and socioeconomic status. 
    - (Un)protected groups (also known as (un)privileged groups/classes) are the groups formed by splitting the data points with regard to the protected attributes. For example, if "gender" is defined as a protected attribute (and it takes only the values "female" or "male"), the protected group would be all the data points defined as "male" in terms of this attribute.
    
    Furthermore, most of these metrics are calculated as differences or ratios of the following rates:   

    -   **Selection rate** = fraction of predicted labels matching the positive outcome

    -   **True positive rate** = fraction of positive cases correctly predicted to be in the positive class out of all actual positive cases

    -   **False positive rate** = fraction of negative cases incorrectly predicted to be in the positive class out of all actual negative cases

    -   **False negative rate** = fraction of positive cases incorrectly predicted to be in the negative class out of all actual positive cases

    -   **True negative rate** = fraction of negative cases correctly predicted to be in the negative class out of all actual negative cases

    The following are a few relevant examples of statistical measures calculated on the rates enumerated above:

    -   **Demographic parity difference** = difference between largest and smallest group-level selection rate (0 means all groups have the same selection rate)

        -   **Demographic parity ratio** = similar to above

    -   **False positive/negative rate difference**

    -   **Equalized odds difference** = quantifies the disparity in accuracy experienced by different demographics; the larger of the following: (1) false positive rate difference and (2) false negative rate difference

        -   **Equalized odds ratio** = similar to above

    A description of how AIFair360 approaches fairness assessment can be found here <https://aif360.mybluemix.net/resources#guidance>, while more metric examples are described here <https://aif360.readthedocs.io/en/latest/modules/metrics.html>.
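
The rates and differences defined above can be made concrete with a small sketch. It uses plain Python on hypothetical toy data (the labels, predictions, and group assignments are made up for illustration) and follows the definitions given in this section; it is not how AIF360 computes these metrics internally.

```python
# Toy illustration with hypothetical data (not the German credit dataset).
def rates(y_true, y_pred):
    """Return selection rate, TPR, FPR, FNR and TNR for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "selection_rate": (tp + fp) / len(pairs),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "fnr": fn / (tp + fn),
        "tnr": tn / (fp + tn),
    }

y_true = [1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
group = ["m"] * 5 + ["f"] * 5

# Compute the rates separately for each group
by_group = {}
for g in sorted(set(group)):
    idx = [i for i, x in enumerate(group) if x == g]
    by_group[g] = rates([y_true[i] for i in idx], [y_pred[i] for i in idx])

def spread(key):
    """Difference between the largest and smallest group-level value."""
    values = [r[key] for r in by_group.values()]
    return max(values) - min(values)

demographic_parity_difference = spread("selection_rate")
equalized_odds_difference = max(spread("fpr"), spread("fnr"))

print(by_group)
print(demographic_parity_difference, equalized_odds_difference)
```

In this toy example the selection rates are 0.6 for "m" and 0.2 for "f", so the demographic parity difference is 0.4, and the false positive rate gap (0.5) dominates the equalized odds difference.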


2.  **Bias mitigators - can be used to overcome bias in the workflow to produce a more fair outcome.**

    Bias can enter the system in 3 critical steps:

    -   Training data: outcomes may be biased towards particular kinds of instances
        - Example: Imagine that historically people under the age of 30 were always refused a loan because of stereotypes around young people. The training data collected in such times would then have very few (or even no) samples for this age group with accepted loans. If we now consider that age shouldn’t play a role anymore in this decision (thus young people should receive as many loans as older people), the training dataset wouldn’t be representative of the ideal situation, and would instead be biased. It is thus easy to imagine that a model trained on this dataset would also return biased outcomes. 

    -   Algorithm that creates the model: it may generate models that are weighted towards particular features in the input
        - Example: Regularization refers to techniques used to calibrate ML models by minimizing an adjusted loss function, preventing overfitting or underfitting. Regularization helps select a midpoint between high bias and high variance. The ideal of low bias and low variance is difficult, if not nearly impossible, to achieve; hence the need for a trade-off, and consequently the possibility of introducing bias. 

    -   Test data: it has expectations on correct answers that may be biased

        - Example: You can think about this in a similar way as for the training data.

    These 3 points in the machine learning process represent points for testing and mitigating bias. In AIF360, we call these points pre-processing, in-processing, and post-processing. The toolkit provides methods that can be applied at each of these stages; they are listed below:

    1.  **Preprocessing algorithms** transform the training data to mitigate possible unfairness. Preprocessing algorithms in AIFair360 follow the [sklearn.base.TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin) class, meaning that they can fit to the dataset and transform it.

         - **Correlation Remover** = projects away correlations between sensitive and non-sensitive features in the dataset while details from the original data are retained as much as possible

    2.  **In-processing algorithms** enable unfairness mitigation with respect to user-provided fairness constraints (currently only group-fairness constraints are supported, where group-fairness means that subjects from both protected and unprotected groups have equal probabilities of being assigned to the positive predictive class). 
        - **Prejudice remover** = adds a discrimination-aware regularization term to the learning objective

    3.  **Post-processing algorithms** modify results of previously trained classifiers to achieve the desired results on different groups.

        - **Calibrated equality of odds** =  optimizes over calibrated classifier score outputs to find probabilities with which to change output labels with an equalized odds objective

Note that these are just examples and the complete list of mitigation algorithms can be found at: <https://aif360.readthedocs.io/en/latest/modules/algorithms.html>. 

The rest of this notebook will guide you through a few examples on how to use these metrics and mitigation methods in practice.

## Contents

1. [What is Covered](#What-is-Covered)
1. [Introduction: Import statements](#Introduction)
1. [The German Credit Dataset](#The-German-Credit-Dataset)
1. [Using a Fairness Unaware Model](#Using-a-Fairness-Unaware-Model)
1. [Preprocessing mitigation with Reweighing](#Preprocessing-Mitigation-using-Reweighing)
1. [Inprocessing mitigation with Prejudice Remover](#Inprocessing-Mitigation-using-Prejudice-Remover)
1. [Conclusion](#Conclusion)

## What is Covered

* **Domain:**
  * Finance (loan decisions). 

* **ML task:**
  * Binary classification.

* **Fairness tasks:**
  * Assessment of unfairness using aif360 metrics.
  * Mitigation of unfairness using aif360 mitigation algorithms.

* **Performance metrics:**
  * Accuracy.
  * Balanced accuracy.
  * Error Rate (difference).

* **Fairness metrics:**
  * False-positive rate difference.
  * False-negative rate difference.
  * Statistical parity difference.
  * Averaged odds difference.

* **Mitigation algorithms:**

  * `aif360.algorithms.inprocessing.PrejudiceRemover`
  * `aif360.algorithms.preprocessing.Reweighing`

## Introduction

We consider a scenario where algorithmic tools are deployed to predict the likelihood that an applicant will default on a loan. For this, we will use the [German credit dataset](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29), a dataset that classifies loan applicants in Germany as good or bad credit risks, as a substitute dataset to replicate the desired workflow.

We train a fairness-unaware algorithm on this dataset and show that the model has a noticeably higher false-positive rate for the "male" group than for the "female" group. We then use AIF360 to mitigate this disparity using both the `Reweighing` and `PrejudiceRemover` algorithms.

### The German Credit Dataset
This dataset is publicly available on the UCI Machine Learning Repository. It consists of 1000 loan applicants, with no missing values. There are 20 attributes.


In [None]:
!pip install aif360[all]

In [None]:
# Load all necessary packages
import sys
sys.path.insert(1, "../")  

import numpy as np
np.random.seed(0)
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV

from aif360.datasets import GermanDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing
from aif360.algorithms.inprocessing import PrejudiceRemover

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from aif360.metrics import ClassificationMetric

from IPython.display import Markdown, display

### Load dataset, specifying protected attribute, and split dataset into train and test
We load the initial dataset, setting the protected attribute to be sex. We then split the original dataset into training, validation, and testing datasets. The model is trained on the training dataset, and the test dataset is used to assess its efficacy (accuracy, fairness, etc.). Finally, we set two variables for the privileged (1) and unprivileged (0) values for the sex attribute. These are key inputs for detecting and mitigating bias, which will be done later in the notebook.

In [None]:
def preproc_and_load_data_german():
    """
    Load and pre-process german credit dataset.
    Args: -
    Returns:
        GermanDataset: An instance of GermanDataset with required pre-processing.
    """
    def custom_preprocessing(df):
        """ Custom pre-processing for German Credit Data
        """

        def group_credit_hist(x):
            if x in ['A30', 'A31', 'A32']:
                return 'None/Paid'
            elif x == 'A33':
                return 'Delay'
            elif x == 'A34':
                return 'Other'
            else:
                return 'NA'

        def group_employ(x):
            if x == 'A71':
                return 'Unemployed'
            elif x in ['A72', 'A73']:
                return '1-4 years'
            elif x in ['A74', 'A75']:
                return '4+ years'
            else:
                return 'NA'

        def group_savings(x):
            if x in ['A61', 'A62']:
                return '<500'
            elif x in ['A63', 'A64']:
                return '500+'
            elif x == 'A65':
                return 'Unknown/None'
            else:
                return 'NA'

        def group_status(x):
            if x in ['A11', 'A12']:
                return '<200'
            elif x in ['A13']:
                return '200+'
            elif x == 'A14':
                return 'None'
            else:
                return 'NA'
        
        def group_personal_status(x):
            if x in ['A91']:
                return 'divorced/separated'
            elif x in ['A92']:
                return 'divorced/separated/married'
            elif x in ['A93', 'A95']:
                return 'single'
            elif x in ['A94']:
                return 'married/widowed'
            else:
                return 'NA'

        status_map = {'A91': 1.0, 'A93': 1.0, 'A94': 1.0,
                    'A92': 0.0, 'A95': 0.0}
        
        df['sex'] = df['personal_status'].replace(status_map)
        

        # group credit history, savings, employment, account status, and personal status
        df['credit_history'] = df['credit_history'].apply(group_credit_hist)
        df['savings'] = df['savings'].apply(group_savings)
        df['employment'] = df['employment'].apply(group_employ)
        # df['age'] = df['age'].apply(lambda x: float(x >= 26))  # optional: binarize age
        df['status'] = df['status'].apply(group_status)
        df['personal_status'] = df['personal_status'].apply(group_personal_status)
        
        return df

    # Feature partitions
    XD_features = ['number_of_credits', 'telephone',
                     'foreign_worker', 'people_liable_for', 'skill_level', 'credit_history',\
                   'installment_plans', 'residence_since', 'property', 'other_debtors', \
                   'purpose', 'savings', 'employment', 'sex', 'age', 'month']
    D_features = ['sex'] 
    Y_features = ['credit']
    X_features = list(set(XD_features)-set(D_features))
    categorical_features = ['installment_plans', 'telephone',
                     'foreign_worker', 'skill_level', 'credit_history', 'property',\
                            'other_debtors', 'purpose', 'savings', 'employment']

    # privileged classes
    all_privileged_classes = {"sex": [1.0]}

    # protected attribute maps
    all_protected_attribute_maps = {"sex": {1.0: 'Male', 0.0: 'Female'}}

    return GermanDataset(
        label_name=Y_features[0],
        favorable_classes=[1],
        protected_attribute_names=D_features,
        privileged_classes=[all_privileged_classes[x] for x in D_features],
        instance_weights_name=None,
        categorical_features=categorical_features,
        features_to_keep=X_features+Y_features+D_features,
        features_to_drop=[],
        metadata={ 'label_maps': [{1.0: 'Good Credit', 2.0: 'Bad Credit'}],
                   'protected_attribute_maps': [all_protected_attribute_maps[x]
                                for x in D_features]},
        custom_preprocessing=custom_preprocessing)

dataset_orig = preproc_and_load_data_german()
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]

#### Display dataset information

In [None]:
# Number of records:
print("Number of records: ",  dataset_orig.features.shape[0])
# Number of features:
print("Number of features: ",  dataset_orig.features.shape[1])
# Feature names:
print("Names of features: ",  dataset_orig.feature_names)

Number of records:  1000
Number of features:  43
Names of features:  ['month', 'residence_since', 'age', 'number_of_credits', 'people_liable_for', 'sex', 'credit_history=Delay', 'credit_history=None/Paid', 'credit_history=Other', 'purpose=A40', 'purpose=A41', 'purpose=A410', 'purpose=A42', 'purpose=A43', 'purpose=A44', 'purpose=A45', 'purpose=A46', 'purpose=A48', 'purpose=A49', 'savings=500+', 'savings=<500', 'savings=Unknown/None', 'employment=1-4 years', 'employment=4+ years', 'employment=Unemployed', 'other_debtors=A101', 'other_debtors=A102', 'other_debtors=A103', 'property=A121', 'property=A122', 'property=A123', 'property=A124', 'installment_plans=A141', 'installment_plans=A142', 'installment_plans=A143', 'skill_level=A171', 'skill_level=A172', 'skill_level=A173', 'skill_level=A174', 'telephone=A191', 'telephone=A192', 'foreign_worker=A201', 'foreign_worker=A202']


#### Split into training, validation, and testing data.

In [None]:
dataset_orig_train, dataset_orig_val, dataset_orig_test = \
    dataset_orig.split([0.6, 0.8], shuffle=True, seed=1)
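
The call above passes two numbers and receives three datasets, which suggests that the values are cumulative split points (60% training, 20% validation, 20% testing). A minimal sketch of that interpretation in plain Python is shown here; the helper name `split_by_cumulative_fractions` is hypothetical, shuffling is omitted, and this is an assumption about the behavior rather than AIF360's actual implementation.

```python
def split_by_cumulative_fractions(items, fractions):
    """Cut `items` at the given cumulative fractions of its length."""
    n = len(items)
    cuts = [0] + [int(f * n) for f in fractions] + [n]
    return [items[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

records = list(range(1000))  # stand-in for the 1000 loan applicants
train, val, test_part = split_by_cumulative_fractions(records, [0.6, 0.8])
print(len(train), len(val), len(test_part))
```

Under this reading, the 1000 records are divided into 600 training, 200 validation, and 200 testing instances.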

#### Learning a logistic regression classifier.

In [None]:
model = make_pipeline(StandardScaler(),
                      LogisticRegression(solver='liblinear', random_state=1))
fit_params = {'logisticregression__sample_weight': dataset_orig_train.instance_weights}

lr_orig = model.fit(dataset_orig_train.features, dataset_orig_train.labels.ravel(), **fit_params)

### Using a Fairness Unaware Model
Now that we've identified the protected attribute and defined privileged and unprivileged values, we can use aif360 to detect bias in the model's predictions. The code below computes the metrics suitable for this task.

In [None]:
from collections import defaultdict

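def test(dataset, model, thresh_arr):
    # Work on a copy so the ground-truth dataset is left untouched
    dataset_pred = dataset.copy(deepcopy=True)
    try:
        # scikit-learn classifier: take the probability of the favorable class
        pos_ind = np.where(model.classes_ == dataset.favorable_label)[0][0]
        dataset_pred.scores = model.predict_proba(dataset.features)[:, pos_ind].reshape(-1, 1)
    except AttributeError:
        # aif360 in-processing algorithm: predict() returns a dataset with scores
        dataset_pred.scores = model.predict(dataset).scores
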
    metric_arrs = defaultdict(list)
    for thresh in thresh_arr:
        fav_inds = dataset_pred.scores > thresh
        dataset_pred.labels[fav_inds] = dataset_pred.favorable_label
        dataset_pred.labels[~fav_inds] = dataset_pred.unfavorable_label

        # Computation of various metrics:
        metric = ClassificationMetric(
                dataset, dataset_pred,
                unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
        metric_arrs['accuracy'].append(metric.accuracy())
        metric_arrs['balanced_accuracy'].append((metric.true_positive_rate()
                                     + metric.true_negative_rate()) / 2)
        metric_arrs['avg_odds_diff'].append(metric.average_odds_difference())
        metric_arrs['disp_imp'].append(metric.disparate_impact())
        metric_arrs['stat_par_diff'].append(metric.statistical_parity_difference())
        metric_arrs['false_negative_rate_difference'].append(metric.false_negative_rate_difference())
        metric_arrs['false_positive_rate_difference'].append(metric.false_positive_rate_difference())
        metric_arrs['statistical_parity_difference'].append(metric.statistical_parity_difference())
        metric_arrs['selection_rate'].append(metric.selection_rate())
        metric_arrs['Error_Rate'].append(metric.error_rate())
        metric_arrs['Error_Rate_difference'].append(metric.error_rate_difference())
    return metric_arrs

thresh_arr = np.linspace(0.01, 0.99, 100)

In [None]:
def describe_metrics(metrics, thresh_arr):
    best_ind = np.argmax(metrics['balanced_accuracy'])
    print("Selection rate: {:6.4f}".format(metrics['selection_rate'][best_ind]))
    print("Statistical parity difference: {:6.4f}".format(metrics['stat_par_diff'][best_ind]))
    print("False Positive Rate Difference: {:6.4f}".format(metrics['false_positive_rate_difference'][best_ind]))
    print("False negative Rate Difference: {:6.4f}".format(metrics['false_negative_rate_difference'][best_ind]))
    print("Averaged odds difference: {:6.4f}".format(metrics['avg_odds_diff'][best_ind]))
    print("Balanced Accuracy: {:6.4f}".format(metrics['balanced_accuracy'][best_ind]))
    print("Accuracy: {:6.4f}".format(metrics['accuracy'][best_ind]))
    print("Error Rate: {:6.4f}".format(metrics['Error_Rate'][best_ind]))
    print("Error Rate difference: {:6.4f}".format(metrics['Error_Rate_difference'][best_ind]))

In [None]:
thresh_arr = np.linspace(0.01, 0.5, 50)
test_metrics = test(dataset=dataset_orig_test,
                   model=lr_orig,
                   thresh_arr=thresh_arr)
lr_orig_best_ind = np.argmax(test_metrics['balanced_accuracy'])

In [None]:
lr_orig_metrics = test(dataset=dataset_orig_test,
                       model=lr_orig,
                       thresh_arr=[thresh_arr[lr_orig_best_ind]])

In [None]:
describe_metrics(test_metrics, thresh_arr)

Selection rate: 0.8500
Statistical parity difference: -0.0865
False Positive Rate Difference: -0.1652
False negative Rate Difference: 0.0278
Averaged odds difference: -0.0965
Balanced Accuracy: 0.6106
Accuracy: 0.7250
Error Rate: 0.2750
Error Rate difference: -0.0012


As the overall performance metric we use balanced accuracy, which is suited to classification problems with a large imbalance between positive and negative examples. For a binary classifier with hard 0/1 predictions, this coincides with the area under the ROC curve (AUC) computed on those predictions.
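
To see why balanced accuracy is preferred under class imbalance, consider a minimal sketch on hypothetical data: a trivial classifier that always predicts the positive class on a 70/30 dataset. The helper name and the data are illustrative assumptions, not part of the notebook's pipeline.

```python
def accuracy_and_balanced_accuracy(y_true, y_pred):
    """Return (accuracy, balanced accuracy) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn)  # true positive rate
    tnr = tn / (tn + fp)  # true negative rate
    return accuracy, (tpr + tnr) / 2

# Hypothetical imbalanced data: 70 positives, 30 negatives
y_true = [1] * 70 + [0] * 30
y_pred = [1] * 100  # always predict the positive class

acc, bal_acc = accuracy_and_balanced_accuracy(y_true, y_pred)
print(acc, bal_acc)
```

Plain accuracy rewards the trivial classifier with 0.7, while balanced accuracy stays at 0.5, the value expected from a classifier with no discriminative power.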

As the fairness metric we use *averaged odds difference*, which quantifies the disparity in accuracy experienced by different demographics. Our goal is to assure that neither of the two groups has substantially larger false-positive rates or false-negative rates than the other group. The averaged odds difference can be computed as: average difference of false positive rate (false positives / negatives) and true positive rate (true positives / positives) between unprivileged and privileged groups.
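
As a quick check of this definition, the averaged odds difference can be recomputed from the rate differences printed above. The sketch assumes that each difference is taken as unprivileged minus privileged, and uses the fact that the true positive rate is one minus the false negative rate within each group, so the two differences have opposite signs.

```python
# Values copied from the printed output of describe_metrics above
fpr_diff = -0.1652  # false positive rate difference
fnr_diff = 0.0278   # false negative rate difference

# TPR = 1 - FNR within each group, so the TPR difference flips sign
tpr_diff = -fnr_diff

avg_odds_diff = 0.5 * (fpr_diff + tpr_diff)
print(round(avg_odds_diff, 4))
```

The result agrees with the averaged odds difference of -0.0965 reported in the output.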

The output above shows a balanced accuracy of 0.61 and an overall error rate of 0.28, both based on 0/1 predictions at the selected threshold. Both of these are satisfactory in our application context. As a sanity check, we also show the statistical parity difference (-0.09) and the averaged odds difference (-0.10). Both are negative, meaning the unprivileged group is predicted to have the favorable outcome less often than the privileged group; the false-positive rate difference (-0.17) is the largest disparity.

### Preprocessing Mitigation using Reweighing

The previous step showed that the privileged group was getting more positive outcomes from the fairness-unaware model. Since this is not desirable, we are going to try to mitigate this bias by adjusting the training dataset. As stated above, this is called _pre-processing_ mitigation because it happens before the creation of the model.

AI Fairness 360 implements several pre-processing mitigation algorithms. We will choose the Reweighing algorithm [1], which is implemented in the `Reweighing` class in the `aif360.algorithms.preprocessing` package. This algorithm assigns instance weights to the training examples so that the weighted dataset has more equity in positive outcomes on the protected attribute for the privileged and unprivileged groups.

We then call the fit and transform methods to perform the transformation, producing a newly transformed training dataset (dataset_transf_train).

In the cells below, we transform the data with reweighing, train a logistic regression model on it, and test it, with a description of the resulting metrics also found below.

`[1] F. Kamiran and T. Calders,  "Data Preprocessing Techniques for Classification without Discrimination," Knowledge and Information Systems, 2012.`
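
The weighting idea behind reweighing can be sketched in a few lines of plain Python. Following [1], each (group, label) combination receives the weight that its frequency would need in order for group and label to be statistically independent: the expected count under independence divided by the observed count. The toy data is hypothetical, and this is an illustration of the idea rather than AIF360's implementation.

```python
from collections import Counter

# Hypothetical toy data: (group, label) pairs.
# group 1 = privileged, group 0 = unprivileged; label 1 = favorable outcome.
samples = [(1, 1), (1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0), (0, 0)]

n = len(samples)
group_counts = Counter(g for g, _ in samples)
label_counts = Counter(y for _, y in samples)
joint_counts = Counter(samples)

# weight(g, y) = expected count under independence / observed count
weights = {
    (g, y): (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
    for (g, y) in joint_counts
}

def weighted_favorable_rate(g):
    """Weighted fraction of favorable labels within group g."""
    total = sum(weights[(gg, y)] for gg, y in samples if gg == g)
    favorable = sum(weights[(gg, y)] for gg, y in samples if gg == g and y == 1)
    return favorable / total

print(weights)
print(weighted_favorable_rate(1), weighted_favorable_rate(0))
```

Before weighting, the favorable rate is 0.75 for the privileged group and 0.25 for the unprivileged group; after weighting, both are 0.5. The features and labels themselves are unchanged, which is why the weights are passed to the classifier as `sample_weight` in the cells below.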

In [None]:
RW = Reweighing(unprivileged_groups=unprivileged_groups,
                privileged_groups=privileged_groups)
dataset_transf_train = RW.fit_transform(dataset_orig_train.copy())

In [None]:
dataset = dataset_transf_train
model = make_pipeline(StandardScaler(),
                      LogisticRegression(solver='liblinear', random_state=1))
fit_params = {'logisticregression__sample_weight': dataset.instance_weights}
lr_transf = model.fit(dataset.features, dataset.labels.ravel(), **fit_params)

In [None]:
thresh_arr = np.linspace(0.01, 0.5, 50)
test_metrics = test(dataset=dataset_orig_test,
                   model=lr_transf,
                   thresh_arr=thresh_arr)
lr_transf_best_ind = np.argmax(test_metrics['balanced_accuracy'])

In [None]:
lr_transf_metrics = test(dataset=dataset_orig_test,
                         model=lr_transf,
                         thresh_arr=[thresh_arr[lr_transf_best_ind]])

In [None]:
describe_metrics(lr_transf_metrics, [thresh_arr[lr_transf_best_ind]])

Selection rate: 0.8700
Statistical parity difference: 0.0014
False Positive Rate Difference: -0.0044
False negative Rate Difference: -0.0222
Averaged odds difference: 0.0089
Balanced Accuracy: 0.6021
Accuracy: 0.7250
Error Rate: 0.2750
Error Rate difference: 0.0222


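Reweighing greatly reduced the disparity across several fairness metrics: the statistical parity difference moved from -0.0865 to 0.0014, the false-positive rate difference from -0.1652 to -0.0044, and the averaged odds difference from -0.0965 to 0.0089. Accuracy stayed at 0.7250 and balanced accuracy dropped only slightly (0.6106 to 0.6021). The error rate difference, by contrast, grew slightly in magnitude (-0.0012 to 0.0222).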

### Conclusion
Now that we have trained a model on the reweighed dataset, we can check how effective the mitigation was by comparing the same metrics we computed for the fairness-unaware model in the steps above.

In [None]:
import pandas as pd
pd.set_option('display.multi_sparse', False)
results = [lr_orig_metrics, lr_transf_metrics]
debias = pd.Series(['Unaware']+['Reweighing'], name='Bias Mitigator')
pd.concat([pd.DataFrame(metrics) for metrics in results], axis=0).set_index([debias])

| Bias Mitigator | accuracy | balanced_accuracy | avg_odds_diff | disp_imp | stat_par_diff | false_negative_rate_difference | false_positive_rate_difference | statistical_parity_difference | selection_rate | Error_Rate | Error_Rate_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unaware | 0.725 | 0.610648 | -0.096512 | 0.90136 | -0.086489 | 0.027835 | -0.165188 | -0.086489 | 0.85 | 0.275 | -0.001169 |
| Reweighing | 0.725 | 0.602074 | 0.008865 | 1.001613 | 0.001403 | -0.022165 | -0.004435 | 0.001403 | 0.87 | 0.275 | 0.022207 |


In this notebook, we explored how a fairness-unaware logistic regression model performed on the classification task in contrast to a model trained on reweighed data.

Reweighing brought the statistical parity difference, false-positive rate difference, and averaged odds difference close to zero while leaving accuracy unchanged at 0.725.

After engaging with relevant stakeholders, the data scientist can deploy the model whose performance-fairness trade-off best meets the needs of the business.