Supervised Machine Learning - Predicting Credit Risk

In this project, I built a machine learning model that predicts whether a loan from LendingClub will become high risk or not.

Background

LendingClub is a peer-to-peer lending services company that allows individual investors to partially fund personal loans as well as buy and sell notes backing the loans on a secondary market. LendingClub offers their previous data through an API.

I used this data to create machine learning models to classify the risk level of given loans. Specifically, I compared the Logistic Regression model and Random Forest Classifier.

Retrieve the data

Utilizing to .csv files in the resource file, I will be using data from 2019 to predict the credit risk of loans in the first quarter of 2020.

2019loans.csv
2020Q1loans.csv

Preprocessing: Convert categorical data to numeric

I created a training set from the 2019 loans using pd.get_dummies() to convert the categorical data to numeric columns. Similarly, create a testing set from the 2020 loans, also using pd.get_dummies(). The data structures in the .csv files were slightly different, so I used code in Jupyter Notebooks to make the dataframes similar so that the prediction model would work on both datasets.

Consider the models

After looking at the dataset and considering the two model types, I believe that the logistic regression model will perform better for this dataset. Both Random Forest and Logistic Regression models are good for classification problems, but I believe with over 80 variables to consider, the Logistic model will win out. Based on my research, Random Forest is recommended for simpler classification problems.

Fit a LogisticRegression model and RandomForestClassifier model

Without scaling the data, the random forest classifier model is prodcuing the most accurate model at this point, defying my prediction.

Revisit the Preprocessing: Scale the data

Once I scaled the data, the logistic regression became the better classifier. It had an accuracy score of .66.
The Random Forest model decreased in accuracy from .64 to .5. From what I understand, this is to be expected as Random Forest modeling doesn't rely on scaled data to make accurate calculations, while logistic regression models do.

While I was able to produce a model with a .66 accuracy metric, I would be concerned about employing this out into the wild for credit worthiness rating purposes. There is still far too much inaccuracy, in my opinion, and money would be lost on a consistent basis if loan decisions were based on this model.

References

LendingClub (2019-2020) Loan Stats. Retrieved from: https://resources.lendingclub.com/

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
Resources		Resources
Credit Risk Evaluator.ipynb		Credit Risk Evaluator.ipynb
README.md		README.md
writeup.txt		writeup.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Supervised Machine Learning - Predicting Credit Risk

Background

Retrieve the data

Preprocessing: Convert categorical data to numeric

Consider the models

Fit a LogisticRegression model and RandomForestClassifier model

Revisit the Preprocessing: Scale the data

References

About

Releases

Packages

Languages

gpawlows/Supervised_Learning-Challenge

Folders and files

Latest commit

History

Repository files navigation

Supervised Machine Learning - Predicting Credit Risk

Background

Retrieve the data

Preprocessing: Convert categorical data to numeric

Consider the models

Fit a LogisticRegression model and RandomForestClassifier model

Revisit the Preprocessing: Scale the data

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages