This repository contains our work on building fair and explainable models that predict credit default risk. The original dataset comes from Kaggle's Home Credit Default Risk competition. To run the code, all CSVs generated for modeling are stored in a folder named cleaned_tables, and CSVs generated for exploratory data analysis (EDA) are stored in a folder named EDA-helpers.
The Jupyter notebooks are numbered in the order in which we approached the problem and are described below:
- EDA of the main table containing applicant information, e.g., detecting outliers and treating null values
- EDA of five supplemental tables that contain additional information for some of the applicants, e.g., summarizing key information for each applicant
- Preparing datasets for modeling, e.g., checking for bias in the three protected attributes in the dataset (age, gender, marital status), merging all tables, setting up training and testing sets, and preparing a SMOTEd dataset
- Building BRCG and GLRM models (both from IBM's AIX 360 toolkit) based on the SMOTEd top 20 features found from a Random Forest (RF) model
- Building RF and XGBoost models based on the SMOTEd top 20 features identified by the RF model in #3 above; explaining the models with SHAP
- Building various Decision Tree, Logistic Regression, and Random Forest models to compare with the models built in #4 and #5.
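The SMOTE step in the dataset-preparation notebook can be illustrated with a minimal sketch of the underlying idea: synthesizing new minority-class samples by interpolating between existing minority samples and their nearest minority neighbors. The notebooks presumably use a library implementation (e.g., imbalanced-learn); the toy data, function name, and parameters below are hypothetical, not taken from the repository.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between each chosen sample and one of its k nearest
    minority-class neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Neighbor indices for each point, dropping the point itself (column 0)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)      # base minority points
    pick = neigh[base, rng.integers(0, k, size=n_new)]  # one neighbor per base
    gap = rng.random((n_new, 1))                        # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Toy imbalanced data: 100 majority samples, 10 minority samples, 3 features
rng = np.random.default_rng(1)
X_maj = rng.normal(0, 1, (100, 3))
X_min = rng.normal(3, 1, (10, 3))

X_syn = smote_oversample(X_min, n_new=90)   # synthesize 90 minority samples
X_bal = np.vstack([X_maj, X_min, X_syn])    # balanced: 100 vs. 100
print(X_bal.shape)
```

Because each synthetic sample is a convex combination of two real minority samples, it always falls inside the minority class's feature-wise range, which is what distinguishes SMOTE from naive duplication.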
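The top-20 feature selection that feeds the modeling notebooks can be sketched as below, using scikit-learn's impurity-based feature importances from a Random Forest. The data here is synthetic and the column names are hypothetical; the real notebooks rank features from the merged Home Credit tables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for the merged application data: 500 rows, 40 candidate features
X, y = make_classification(n_samples=500, n_features=40,
                           n_informative=8, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(40)])

# Stratified split keeps the default/non-default ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Rank features by impurity-based importance and keep the top 20
importances = pd.Series(rf.feature_importances_, index=X.columns)
top20 = importances.sort_values(ascending=False).head(20).index.tolist()
print(top20)
```

Downstream models (BRCG, GLRM, RF, XGBoost) would then be trained on `X_train[top20]` only, which keeps the rule-based explainers tractable.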
For a more detailed description of the project, see here.