# Interpretability and Algorithmic Fairness Project 

The goal of this project is to apply the techniques described in class to 
the setting of credit-worthiness prediction.

### Group members:
- Nicolas Barbier de la Serre
- Juien Bernardo
- Henrique Brito Leao
- Benjamin Derre
- Hippolyte Guigon 

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import tree
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

from xgboost import XGBRegressor

## Import data

In [2]:
# import data
data = pd.read_excel('../data/data_project.xlsx')

# set globals
CATEGORICAL_COLS = ['CreditHistory', 'EmploymentDuration', 'Housing', 
                    'Purpose', 'Savings', 'Group', 'Gender']
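# the target and the provided score are not features, so keep them out
TARGET_COLS = ['CreditRisk (y)', 'y_hat']
NUMERICAL_COLS = data.columns[~data.columns.isin(CATEGORICAL_COLS + TARGET_COLS)]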

## Step 1: Surrogate models

*Use the estimated probability to be classified as good type (no default) provided in the 
dataset (y_hat). Implement one or two surrogate method(s) to interpret the unknown model used to 
generate y_hat.*
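
One possible starting point is a global surrogate: a shallow decision tree trained to reproduce `y_hat` rather than the true label. The sketch below is a minimal illustration on synthetic data. `X_demo`, `y_hat_demo`, and the feature names are hypothetical stand-ins for the project columns, and the depth of 3 is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in for the project data (hypothetical feature names).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
feature_names = ["f0", "f1", "f2"]

# Stand-in for the y_hat column: scores produced by an "unknown" model.
y_hat_demo = 1.0 / (1.0 + np.exp(-(1.5 * X_demo[:, 0] - X_demo[:, 1] ** 2)))

# Shallow tree trained to mimic the black-box scores, not the true labels.
surrogate = DecisionTreeRegressor(max_depth=3, random_state=0)
surrogate.fit(X_demo, y_hat_demo)

# Fidelity: R^2 of the surrogate measured against the black-box output.
fidelity_r2 = surrogate.score(X_demo, y_hat_demo)
print(f"surrogate fidelity (R^2): {fidelity_r2:.3f}")
print(export_text(surrogate, feature_names=feature_names))
```

The fidelity score indicates how much of the unknown model's behaviour the tree captures. A low value would suggest that the printed rules should not be read as an explanation of `y_hat`.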

## Step 2: Model estimation

*Estimate your own black‐box machine learning model forecasting the 
probability to be classified as good type. For the train and test datasets, use a 70‐30 partition.*

In [3]:
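# split on the availability of y_hat: rows without y_hat form the training
# set, rows with y_hat form the test set
data_full = data.copy()
train_data = data_full.loc[data_full.y_hat.isna()]
test_data = data_full.loc[data_full.y_hat.notnull()]

y_train = train_data['CreditRisk (y)'].to_numpy()
y_test = test_data['CreditRisk (y)'].to_numpy()
X_train = train_data.drop(['CreditRisk (y)', 'y_hat'], axis=1)
X_test = test_data.drop(['CreditRisk (y)', 'y_hat'], axis=1)

# one-hot encode the categorical columns; the other columns pass through.
# handle_unknown='ignore' avoids an error on categories unseen during fit
transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), CATEGORICAL_COLS),
    remainder='passthrough'
)

# fit the encoder on the training set only, then reuse it on the test set
X_train_prep = transformer.fit_transform(X_train)
X_test_prep = transformer.transform(X_test)

reg = XGBRegressor()
reg.fit(X_train_prep, y_train)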

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.300000012,
             max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

## Step 3: Model performance

*Analyze the performance of your own model.*
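
A minimal sketch of how the performance could be summarised is given below. It assumes a binary target coded 1 for the good type and a model that outputs scores in [0, 1]; the arrays `y_true_demo` and `y_score_demo` are synthetic stand-ins, and the 0.5 threshold is an arbitrary choice. For probability forecasts, the mean squared error coincides with the Brier score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Synthetic stand-ins for the test labels and the predicted scores.
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=200)
noise = rng.normal(0.0, 0.2, size=200)
y_score_demo = np.clip(0.2 + 0.6 * y_true_demo + noise, 0.0, 1.0)

# Brier score: mean squared error between scores and binary outcomes.
brier = mean_squared_error(y_true_demo, y_score_demo)

# AUC: ranking quality, independent of any classification threshold.
auc = roc_auc_score(y_true_demo, y_score_demo)

# Accuracy at an (arbitrary) 0.5 threshold.
accuracy = np.mean((y_score_demo >= 0.5).astype(int) == y_true_demo)

print(f"Brier: {brier:.3f}  AUC: {auc:.3f}  accuracy: {accuracy:.3f}")
```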

# Global Interpretability

## Step 4: Surrogate models

*Implement one or two surrogate method(s) to interpret your own 
model. Compare the results provided in Steps 1 and 4.*
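
Besides a tree, a linear model can serve as a second surrogate. The sketch below fits a linear regression to the scores of a black box and reads off the coefficients; `black_box_demo` is a hypothetical stand-in for the fitted model's `predict` method, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))

def black_box_demo(X):
    """Hypothetical stand-in for the fitted model's predict method."""
    return 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))

# The surrogate is trained on the black-box scores, not on the labels.
scores = black_box_demo(X_demo)
linear_surrogate = LinearRegression().fit(X_demo, scores)

# Fidelity (R^2 against the scores) and the global effect of each feature.
fidelity_r2 = linear_surrogate.score(X_demo, scores)
coefficients = dict(zip(["f0", "f1", "f2"], linear_surrogate.coef_))
print(f"fidelity (R^2): {fidelity_r2:.3f}")
print(coefficients)
```

Running the same surrogate on `y_hat` and on the scores of the estimated model would give two sets of coefficients that can be compared feature by feature.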

## Step 5: PDP

*Implement the PDP method to interpret your own model.*
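
The partial dependence of a feature is the average prediction obtained when that feature is forced to a given value for every observation. A minimal hand-written sketch follows, assuming a dense NumPy feature matrix; `predict_demo` is a hypothetical stand-in for the fitted model, and the grid is an arbitrary choice.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is set to each grid value."""
    pd_values = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for all rows
        pd_values.append(predict(X_mod).mean())
    return np.array(pd_values)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))

def predict_demo(X):
    """Hypothetical stand-in for the fitted model's predict method."""
    return 2.0 * X[:, 0] + X[:, 1] ** 2

grid = np.linspace(-2, 2, 5)
pdp = partial_dependence_1d(predict_demo, X_demo, feature=0, grid=grid)
print(pdp)
```

Because the stand-in model is linear in the first feature, the curve is a straight line with slope 2, which gives a simple sanity check for the implementation.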

## Step 6: ALE

*Implement the ALE method to interpret your own model. Compare 
the results provided in Steps 5 and 6.*
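
ALE replaces the global averaging of the PDP with local prediction differences inside quantile bins, which avoids evaluating the model on unrealistic feature combinations when features are correlated. The sketch below is a simplified first-order ALE written by hand, not a library implementation; `predict_demo`, the synthetic correlated data, and the number of bins are assumptions.

```python
import numpy as np

def ale_1d(predict, X, feature, n_bins=10):
    """Simplified first-order ALE evaluated at quantile bin edges."""
    x = X[:, feature]
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    n_intervals = len(edges) - 1
    # Interval index of each row; the maximum falls in the last interval.
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_intervals - 1)
    effects = np.zeros(n_intervals)
    counts = np.zeros(n_intervals)
    for k in range(n_intervals):
        mask = idx == k
        if not mask.any():
            continue
        X_lo, X_hi = X[mask].copy(), X[mask].copy()
        X_lo[:, feature] = edges[k]
        X_hi[:, feature] = edges[k + 1]
        # Local effect: mean prediction change across the interval.
        effects[k] = (predict(X_hi) - predict(X_lo)).mean()
        counts[k] = mask.sum()
    ale = np.concatenate([[0.0], np.cumsum(effects)])
    # Center so that the count-weighted mean effect is zero.
    midpoints = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.sum(midpoints * counts) / counts.sum()
    return edges, ale

# Synthetic data with two correlated features.
rng = np.random.default_rng(0)
x0 = rng.normal(size=300)
x1 = 0.8 * x0 + 0.6 * rng.normal(size=300)
X_demo = np.column_stack([x0, x1])

def predict_demo(X):
    """Hypothetical stand-in for the fitted model's predict method."""
    return 2.0 * X[:, 0] + X[:, 1] ** 2

edges, ale = ale_1d(predict_demo, X_demo, feature=0)
```

Plotting `ale` against `edges` next to the PDP of the same feature would show where the two methods disagree.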

# Local Interpretability

## Step 7: ICE

*Implement the ICE method to interpret your own model.*
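
ICE draws one curve per observation instead of a single average, which can reveal heterogeneous effects. A minimal sketch, assuming a dense NumPy feature matrix: `predict_demo` is a hypothetical model with an interaction, chosen so that the individual curves differ from one another.

```python
import numpy as np

def ice_curves(predict, X, feature, grid):
    """One row per observation, one column per grid value."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # vary the feature, keep the rest fixed
        curves[:, j] = predict(X_mod)
    return curves

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))

def predict_demo(X):
    """Hypothetical model: the effect of feature 0 depends on feature 1."""
    return X[:, 0] * X[:, 1]

grid = np.linspace(-2, 2, 5)
curves = ice_curves(predict_demo, X_demo, feature=0, grid=grid)

# Averaging the ICE curves over observations recovers the PDP.
pdp_from_ice = curves.mean(axis=0)
```

Here each curve has a slope equal to the observation's value of the second feature, so curves go in opposite directions even though their average is nearly flat.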

## Step 8: SHAP

*Implement the SHAP method to interpret your own model. Compare 
the results provided in Steps 7 and 8.*
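
To make the idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force over all feature coalitions, replacing the features outside a coalition with values from a background sample. This is an illustration of the principle and not the `shap` library; it is only tractable for a handful of features. `predict_demo` and the data are hypothetical.

```python
import itertools
import math
import numpy as np

def shapley_values(predict, x, background):
    """Exact Shapley values for one observation `x` (brute force)."""
    n = x.shape[0]

    def value(subset):
        # Features in `subset` take the values of x; the others keep
        # their background values. The payoff is the mean prediction.
        X_mix = background.copy()
        for j in subset:
            X_mix[:, j] = x[j]
        return predict(X_mix).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(100, 3))
weights = np.array([2.0, -1.0, 0.5])

def predict_demo(X):
    """Hypothetical linear model, so the result can be checked by hand."""
    return X @ weights

x = np.array([1.0, 0.5, -2.0])
phi = shapley_values(predict_demo, x, background)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the average prediction over the background sample, which is the property that makes them comparable with the ICE curve of the same observation.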

# Fairness

## Step 9: Fairness assessment

*Use a Pearson statistic for the following three fairness 
definitions: Statistical Parity, Conditional Statistical Parity (groups are given in the dataset), and Equal 
Odds. Discuss your results.*
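
The three definitions can each be expressed as an independence hypothesis and tested with a Pearson chi-squared statistic. The sketch below is one possible way to do so, on synthetic data: the arrays `gender`, `group`, `y_true`, and `y_pred` are hypothetical, predictions are assumed to be binary, and summing the statistics and degrees of freedom across strata is an assumption about how the conditional tests are aggregated.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

def pearson_test(y_pred, protected):
    """Pearson chi-squared statistic for a 2x2 independence table."""
    table = np.array([[np.sum((protected == a) & (y_pred == b)) for b in (0, 1)]
                      for a in (0, 1)])
    stat, _, dof, _ = chi2_contingency(table, correction=False)
    return stat, dof

def stratified_test(y_pred, protected, strata):
    """Sum the statistics and degrees of freedom over the strata."""
    total_stat, total_dof = 0.0, 0
    for s in np.unique(strata):
        mask = strata == s
        stat, dof = pearson_test(y_pred[mask], protected[mask])
        total_stat += stat
        total_dof += dof
    return total_stat, total_dof, chi2.sf(total_stat, total_dof)

# Synthetic data in which acceptance depends on the protected attribute.
rng = np.random.default_rng(0)
n = 2000
gender = rng.integers(0, 2, size=n)
group = rng.integers(0, 3, size=n)
y_true = rng.integers(0, 2, size=n)
y_pred = (rng.random(n) < 0.3 + 0.4 * y_true + 0.2 * gender).astype(int)

# Statistical parity: prediction independent of gender.
sp_stat, sp_dof = pearson_test(y_pred, gender)
sp_p = chi2.sf(sp_stat, sp_dof)

# Conditional statistical parity: independence within each group.
csp_stat, csp_dof, csp_p = stratified_test(y_pred, gender, group)

# Equal odds: independence within each value of the true label.
eo_stat, eo_dof, eo_p = stratified_test(y_pred, gender, y_true)

print(f"SP p={sp_p:.3g}  CSP p={csp_p:.3g}  EO p={eo_p:.3g}")
```

A small p-value rejects the null hypothesis that the predictions are fair under the corresponding definition.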

## Step 10: FPDP

*Implement a FPDP using a fairness measure. Discuss your results.*
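
A fairness PDP follows the PDP recipe but records a fairness measure instead of the average prediction at each grid value. The sketch below uses the difference in acceptance rates between the two protected classes as that measure, which is an assumption; the p-value of a statistical parity test could be used instead. `predict_demo`, the synthetic data, and the 0.5 threshold are hypothetical.

```python
import numpy as np

def fairness_pdp(predict, X, protected, feature, grid, threshold=0.5):
    """Acceptance-rate gap between protected classes at each grid value."""
    gaps = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # force the feature for all rows
        accepted = predict(X_mod) >= threshold
        gap = accepted[protected == 1].mean() - accepted[protected == 0].mean()
        gaps.append(gap)
    return np.array(gaps)

# Synthetic data: the second feature is correlated with the protected class.
rng = np.random.default_rng(0)
n = 1000
protected = rng.integers(0, 2, size=n)
X_demo = np.column_stack([
    rng.normal(size=n),
    protected + rng.normal(0.0, 0.5, size=n),
])

def predict_demo(X):
    """Hypothetical stand-in for the fitted model's predict method."""
    return 1.0 / (1.0 + np.exp(-(X[:, 0] + 1.5 * (X[:, 1] - 0.5))))

grid = np.linspace(-10, 10, 5)
fpdp = fairness_pdp(predict_demo, X_demo, protected, feature=0, grid=grid)
print(dict(zip(grid, fpdp)))
```

In this example the gap vanishes at extreme values of the first feature, where everyone is rejected or accepted, and is largest in between, where the second feature decides the outcome.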