An organization wants to predict which customers are more likely to default on its consumer loan product. It has collected data on the observed behavior of previous customers and intends to use it to estimate the risk level of new customers at acquisition, so that it can tell higher-risk customers from lower-risk ones.
Loan Prediction Based on Customer Behavior (Kaggle)
- ID: id of the user
- income: income of the user
- age: age of the user
- experience: professional experience of the user in years
- profession: profession
- married: whether married or single
- house_ownership: owned or rented or neither
- car_ownership: whether the person owns a car
- current_job_years: years of experience in the current job
- current_house_years: number of years in the current residence
- city: city of residence
- state: state of residence
- risk_flag: defaulted on a loan (target variable)
- risk_flag indicates whether the customer has defaulted in the past.
- risk_flag = 1 → defaulter: a person who fails to fulfill a duty, obligation, or undertaking, especially to pay a debt.
- risk_flag = 0 → non-defaulter
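To see how the two `risk_flag` values are distributed, the share of each class can be computed with pandas. This is a minimal sketch: the helper name `risk_flag_shares` is hypothetical, and the four-row sample is made up for illustration, not taken from the Kaggle data.

```python
import pandas as pd


def risk_flag_shares(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of rows per risk_flag value.

    0 = non-defaulter, 1 = defaulter (as described in the column list).
    """
    return df["risk_flag"].value_counts(normalize=True).mul(100).round(1)


# Made-up sample for illustration only (NOT the real dataset).
sample = pd.DataFrame({"risk_flag": [0, 0, 0, 1]})
shares = risk_flag_shares(sample)
print(shares)
```

On the real data, the same call would be applied to the DataFrame loaded from the Kaggle CSV file.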
Jupyter notebook:
- data exploration and cleaning
- data visualization
- selecting the most suitable machine learning model
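The exploration and cleaning step could include checks like the ones sketched below. This is an assumption about how such checks might look, not the notebook's actual code: the helper `quality_report` is hypothetical, the 1.5 × IQR rule is only one common outlier convention, and the sample rows are made up.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, numeric_cols: list) -> dict:
    """Count missing cells, fully duplicated rows, and IQR outliers per column."""
    missing = int(df.isna().sum().sum())
    duplicates = int(df.duplicated().sum())

    outliers = {}
    for col in numeric_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers.
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}


# Made-up sample for illustration only (NOT the real dataset).
sample = pd.DataFrame(
    {
        "age": [25, 30, 35, 40, 45],
        "income": [100, 200, 300, 400, 500],
    }
)
report = quality_report(sample, numeric_cols=["age", "income"])
print(report)
```

Note that `DataFrame.duplicated()` compares whole rows, so a unique ID column would need to be dropped first for duplicate customers to be detected.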
Because of GitHub's file size limits, the trained_model_w_pipe.joblib file (200 MB) is available for download from this Google Drive link.
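Once downloaded, the file can be loaded with `joblib`. The sketch below assumes the file sits in the working directory; the helper `load_model` is hypothetical. Since joblib files are pickle-based, only load a file from a source you trust, and preferably with the same scikit-learn version that was used for training.

```python
from pathlib import Path

import joblib


def load_model(path="trained_model_w_pipe.joblib"):
    """Load the saved model from disk, with a clear error if it is missing."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; download it first and place it in this directory."
        )
    return joblib.load(path)


# Usage (assumes the download is complete and that the saved object is a
# scikit-learn pipeline accepting a DataFrame with the columns listed above):
# model = load_model()
# predictions = model.predict(new_customers)
```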
- The datasets contain no missing values, duplicates, or outliers.
- 87.7% of people are non-defaulters (risk_flag = 0), while the remaining 12.3% are defaulters (risk_flag = 1).
- The Random Forest Classifier is the most suitable model for the dataset, with an average cross-validation score of 89.30%. The Extra Trees Classifier is a close alternative, with a score of 89.22%.
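A comparison of this kind can be sketched with scikit-learn's `cross_val_score`. Everything below is an assumption for illustration: the data is synthetic (generated with roughly the same class imbalance as above), the scoring metric is taken to be accuracy, and the fold count and hyperparameters are arbitrary, so the printed scores will not match the figures reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features and target (NOT the Kaggle data),
# with about 87.7% of samples in class 0 and 12.3% in class 1.
X, y = make_classification(
    n_samples=500,
    n_features=10,
    weights=[0.877, 0.123],
    random_state=42,
)

# Stratified folds keep the class ratio similar in every split,
# which matters when one class is much rarer than the other.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
    "ExtraTreesClassifier": ExtraTreesClassifier(n_estimators=50, random_state=42),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```

With a target this imbalanced, accuracy alone can look high even for a model that rarely predicts a default, so it may be worth also passing a metric such as `scoring="f1"` or `scoring="roc_auc"`.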