ML Project for Data 245 (Predicting Loan Status)


10/28/21 - 10/29/21
I took a closer look at the features (for data understanding purposes) and removed redundant/duplicate variables.
Then I checked the correlation between the numerical variables; some pairs, such as Amount and MonthlyPayment, were strongly correlated.
However, these pairs weren't perfectly correlated and turned out to be useful/important according to RandomForest, so they were kept.
After cleaning, the full dataset consisted of 130 variables, while the truncated dataset contained the 19 most important features selected by RandomForest.
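
A minimal sketch of this step, assuming the cleaned data sits in a local CSV (hypothetical file name and column names, including a binary LoanStatus target); the exact settings in the script may differ:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical file name and target column.
df = pd.read_csv("bondora_clean.csv")
X = df.drop(columns=["LoanStatus"]).select_dtypes("number")
y = df["LoanStatus"]

# Pairwise correlation between numerical variables
# (e.g. Amount vs. MonthlyPayment: strongly but not perfectly correlated).
corr = X.corr()
print(corr.loc["Amount", "MonthlyPayment"])

# Rank features by RandomForest importance and keep the top 19.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
top_features = importances.nlargest(19).index.tolist()
```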

Estimators tested were LR, DT, MLP, SVC, SGD, HGB, and RF. HistGradientBoosting and RandomForest performed similarly and outperformed the base models.
This held for precision (~0.75), recall (~0.70), Matthews correlation coefficient (~0.45), and AUC (~0.70).
I also tested LightGBM, which was the inspiration for scikit-learn's HistGradientBoostingClassifier.
Its performance was essentially identical to HistGradientBoosting.
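
The comparison can be reproduced roughly along these lines (a sketch only; `X`, `y`, and `top_features` refer to the feature-selection step above, and the actual splits and hyperparameters may differ):

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=500),
    "SVC": LinearSVC(),
    "SGD": SGDClassifier(),
    "HGB": HistGradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
}

# Same metrics as reported above.
scoring = ["precision_macro", "recall_macro", "matthews_corrcoef", "roc_auc"]
for name, clf in models.items():
    scores = cross_validate(clf, X[top_features], y, scoring=scoring, cv=5)
    print(name, {k: v.mean() for k, v in scores.items() if k.startswith("test_")})
```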

Then, the best models (HGB, RF, GBM) were tuned using HalvingGridSearchCV.
There are more details in the script, but generally the learning rate, tree depth, and regularization parameters were tuned.
Performance uplift from hyperparameter tuning was very small.
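
An illustrative tuning setup for the HGB model (the actual parameter grids live in the script and may differ):

```python
# HalvingGridSearchCV is still behind an experimental import flag.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative grid over learning rate, tree depth, and regularization.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [None, 5, 10],
    "l2_regularization": [0.0, 0.1, 1.0],
}
search = HalvingGridSearchCV(
    HistGradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X[top_features], y)
print(search.best_params_, search.best_score_)
```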

I used the plotting functions from scikit-learn's metrics module to display the confusion matrices, ROC curves, and precision-recall curves.
RF and GBM also have built-in feature importances, so those were extracted and plotted as well.
Finally, the individual decision trees from RF can be plotted, so I went ahead and did that.
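
One way to produce those plots with scikit-learn's Display utilities (scikit-learn >= 1.0; the script may use the older plot_* helpers instead). `best_rf`, `X_test`, and `y_test` stand in for the tuned model and a held-out split:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    PrecisionRecallDisplay,
    RocCurveDisplay,
)
from sklearn.tree import plot_tree

# Confusion matrix, ROC curve, and precision-recall curve for the tuned RF.
ConfusionMatrixDisplay.from_estimator(best_rf, X_test, y_test)
RocCurveDisplay.from_estimator(best_rf, X_test, y_test)
PrecisionRecallDisplay.from_estimator(best_rf, X_test, y_test)

# Built-in feature importances from the fitted forest.
pd.Series(best_rf.feature_importances_, index=X_test.columns).nlargest(19).plot.barh()

# One individual tree from the ensemble (truncated for readability).
plt.figure(figsize=(12, 6))
plot_tree(best_rf.estimators_[0], max_depth=2, feature_names=list(X_test.columns), filled=True)
plt.show()
```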

RandomForest was slightly better than the boosting models:
- Macro-averaged precision: 0.76
- Macro-averaged recall: 0.72
- Matthews correlation coefficient: 0.48
- AUC: 0.72

Important features are mostly numerical, such as Interest, Age, IncomeTotal, and MonthlyPayment.




10/16/21

Changed to the Bondora dataset (https://ieee-dataport.org/open-access/bondora-peer-peer-lending-data#files)

The main issue with the LendingClub dataset was that the data quality was not sufficient to train good models.
Specifically, when testing several models:
  LogisticRegression, LinearSVC, MLPClassifier, SGDClassifier, DecisionTreeClassifier, RandomForestClassifier, HistGradientBoostingClassifier
the resulting accuracy (~0.5), AUC (~0.5), and Matthews correlation coefficient (~0) were all low; the models were no better than random guessing.
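
For reference, those scores could be computed roughly like this (a sketch; `clf`, `X_test`, and `y_test` are placeholders for the LendingClub setup):

```python
from sklearn.metrics import accuracy_score, matthews_corrcoef, roc_auc_score

y_pred = clf.predict(X_test)
# Use decision_function for models without predict_proba (LinearSVC, SGDClassifier).
y_score = clf.predict_proba(X_test)[:, 1]

print("accuracy:", accuracy_score(y_test, y_pred))  # ~0.5
print("AUC:", roc_auc_score(y_test, y_score))       # ~0.5
print("MCC:", matthews_corrcoef(y_test, y_pred))    # ~0
```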

The Bondora set also contains P2P lending data, covering loans from March 2009 to January 2020.
Extra preprocessing was done on this dataset, and features were selected using RandomForest.