As the final task of your internship as a Data Scientist at ID/X Partners, you will work on a project for a lending company. You will collaborate with several other departments to deliver a technology solution for the company. You are asked to build a model that predicts credit risk using a dataset provided by the company, which contains records of both accepted and rejected loans. You also need to prepare visual media to present the solution to the client; make sure it is clear, easy to read, and communicative. You may develop this end-to-end solution in your preferred programming language, provided you follow the Data Science framework/methodology.
I use a Kaggle dataset of consumer loans granted from 2007 to 2014 by Lending Club, a peer-to-peer lending platform based in the United States.
- loan_data_2007_2014.csv
- LCDataDictionary.xlsx
Target Variable Description:
- Target variable = 0 → Rejected for a loan → Defaulter
- Target variable = 1 → Accepted for a loan → Non-Defaulter
- Programming language: Python.
- Data Tool: Jupyter Notebook.
- Reporting Tool: Microsoft PowerPoint.
- Problem Formulation
- Data Collecting
- Data Understanding
- Data Preprocessing
- Exploratory Data Analysis (EDA) and Data Visualization
- Model Selection and Building
- Scorecard Development
- Problem Formulation
- Data Collecting
- Data Understanding
- Data Preprocessing
- Exploratory Data Analysis (EDA) and Data Visualization
- Model Selection and Building
- The loan_data_2007_2014.csv file (466,285 rows and 74 columns) contains numerous missing values and outliers, which have been handled using the WOE binning technique.
- No duplicate values are present in the dataset.
- The target variable consists of 89.1% non-defaulters (accepted) and 10.9% defaulters (rejected).
- Feature selection has been performed using Weight of Evidence (WOE) and Information Value (IV).
- A logistic regression model was trained, yielding the following metrics: threshold ≈ 0.22, accuracy ≈ 0.90, precision ≈ 0.93, recall ≈ 0.96, F1 ≈ 0.95, AUROC ≈ 0.84, Gini ≈ 0.67, and AUCPR ≈ 0.97. These values indicate strong performance for a credit risk model.
- Consequently, the company is expected to save around 1,000,000,000 USD while incurring a loss of approximately 9,000,000 USD.
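
The WOE and IV feature selection mentioned above can be illustrated with a minimal sketch. It is not the notebook's actual code: the function name `woe_iv`, the toy `grades` data, and the choice to skip categories with no goods or no bads are assumptions made for illustration. It uses the target convention described earlier (1 = non-defaulter, 0 = defaulter), with WOE defined as ln(share of goods / share of bads) per category and IV as the sum of (share of goods − share of bads) × WOE.

```python
import math
from collections import Counter


def woe_iv(categories, targets):
    """Return ({category: WOE}, IV) for one categorical or pre-binned feature.

    targets: 1 = non-defaulter (good), 0 = defaulter (bad).
    """
    good, bad = Counter(), Counter()
    for cat, y in zip(categories, targets):
        if y == 1:
            good[cat] += 1
        else:
            bad[cat] += 1

    total_good = sum(good.values())
    total_bad = sum(bad.values())

    woe, iv = {}, 0.0
    for cat in set(good) | set(bad):
        share_good = good[cat] / total_good  # fraction of all goods in this category
        share_bad = bad[cat] / total_bad     # fraction of all bads in this category
        if share_good == 0 or share_bad == 0:
            # WOE is undefined here; this sketch skips the category
            # (assumption: a real pipeline would merge it into a neighbouring bin).
            continue
        woe[cat] = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * woe[cat]
    return woe, iv


# Hypothetical toy data: grade A has 90 goods and 10 bads, grade B has 60 and 40.
grades = ["A"] * 100 + ["B"] * 100
targets = [1] * 90 + [0] * 10 + [1] * 60 + [0] * 40

woe, iv = woe_iv(grades, targets)
print(woe, iv)
```

A positive WOE marks a category with proportionally more non-defaulters than defaulters; features with a very low IV carry little predictive signal and are the usual candidates for removal.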
- The loan_data_2007_2014.csv file (containing 466,285 rows and 74 columns) contains numerous missing values and outliers, which have been handled through data imputation methods, such as using the mean for numerical variables and the mode for categorical variables.
- No duplicate values are present in the dataset.
- The target variable consists of 89.9% non-defaulters (accepted) and 10.1% defaulters (rejected).
- Feature selection has been performed using the Chi-Square Test, ANOVA, and Correlation Matrix.
- Several machine learning models were fitted to the data: logistic regression, ridge classifier, SGD classifier, passive-aggressive classifier, linear discriminant analysis, quadratic discriminant analysis, decision tree, extra tree, AdaBoost, Gaussian naive Bayes, and LGBM classifier.
- Among these, the LGBM classifier achieved the highest AUROC score, 0.99. It was carried forward and produced the following metrics: threshold ≈ 0.5, accuracy ≈ 0.95, precision ≈ 0.99, recall ≈ 0.95, F1 ≈ 0.97, AUROC ≈ 0.98, Gini ≈ 0.97, and AUCPR ≈ 0.99. These values indicate strong performance for a credit risk model.
- Consequently, the company is expected to save around 1,000,000,000 USD while incurring a loss of approximately 9,000,000 USD.
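
Two of the reported metrics are derived from others: the Gini coefficient is 2 × AUROC − 1, and F1 is the harmonic mean of precision and recall. The sketch below recomputes both from the rounded figures quoted for the two models. The helper names and the `reported` dictionary are assumptions for illustration. Because the inputs are already rounded to two decimals, the recomputed values can differ from the quoted ones in the last digit.

```python
def gini_from_auroc(auroc):
    """Gini coefficient of a classifier, derived from its AUROC."""
    return 2 * auroc - 1


def f1_from_precision_recall(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded metrics as quoted in the conclusions above.
reported = {
    "logistic regression": {"precision": 0.93, "recall": 0.96, "auroc": 0.84},
    "LGBM classifier": {"precision": 0.99, "recall": 0.95, "auroc": 0.98},
}

for name, m in reported.items():
    f1 = f1_from_precision_recall(m["precision"], m["recall"])
    gini = gini_from_auroc(m["auroc"])
    print(f"{name}: F1 = {f1:.3f}, Gini = {gini:.2f}")
```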