# Team 3 Final Report Notebook (Kahsai, Nichols, Pellerito)

# Introduction
Home Credit Group was founded in the late 1990s in Eastern Europe. The company's business model aims to provide loans to people with little or no credit history. Now operating in several countries across Europe and Asia, including the Czech Republic, Slovakia, Russia, Kazakhstan, Ukraine, Belarus, China, India, Indonesia, the Philippines, and Vietnam, Home Credit Group has served over 135 million customers. Ultimately, the company aims to provide the unbanked population with a safe and positive loan experience.

Home Credit asked Kagglers to use machine learning to examine its data and build a model that can predict a person's ability to repay a loan, so that loans are given with terms that borrowers can meet. A successful model allows Home Credit to avoid losses and improves its potential for profit.

(Source: https://en.wikipedia.org/wiki/Home_Credit)

# Project Data

<img src="https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png" width="800"></img>

### Data Table Descriptions
* application_train and application_test are the main tables; each row describes one loan and its applicant
* bureau contains data on loans that the client took from other credit institutions and that were reported to the credit bureau
* bureau_balance contains the monthly balances of previous credits
* credit_card_balance contains monthly balance snapshots of previous credit cards that the applicant has with Home Credit
* installments_payments contains the repayment history for previously disbursed credits with Home Credit
* previous_application contains information about previous loans with Home Credit by the same client
* POS_CASH_balance contains monthly balance snapshots of previous point-of-sale and cash loans that the applicant had with Home Credit

### Additional Notes
* SK_ID_CURR connects application_train|test with bureau, previous_application, POS_CASH_balance, installments_payments and credit_card_balance
* SK_ID_PREV connects previous_application to POS_CASH_balance, installments_payments and credit_card_balance
* SK_ID_BUREAU connects bureau with bureau_balance
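
The key relationships above imply a two-step merge path: collapse a child table to one row per key, attach it to its parent, and repeat until everything hangs off SK_ID_CURR. The following is a minimal sketch of that idea on toy data, not the team's actual pipeline. Only the three key columns come from the notes above; `credit_amount`, `months_balance`, and every output column name are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real tables; only the key columns come from the report.
application = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "SK_ID_BUREAU": [10, 11, 12],
    "credit_amount": [500.0, 1500.0, 800.0],  # hypothetical column
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [10, 10, 11, 12],
    "months_balance": [-1, -2, -1, -1],       # hypothetical column
})

# Step 1: collapse bureau_balance to one row per SK_ID_BUREAU.
bb_agg = (
    bureau_balance.groupby("SK_ID_BUREAU")
    .agg(bb_month_count=("months_balance", "size"))
    .reset_index()
)

# Step 2: attach to bureau, then collapse to one row per SK_ID_CURR.
bureau_full = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
bureau_agg = (
    bureau_full.groupby("SK_ID_CURR")
    .agg(
        bureau_loan_count=("SK_ID_BUREAU", "count"),
        bureau_credit_mean=("credit_amount", "mean"),
        bureau_month_total=("bb_month_count", "sum"),
    )
    .reset_index()
)

# Step 3: left-join so applicants with no bureau history are kept (as NaN).
merged = application.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

A left join is used in the last step so that applicants without any bureau records stay in the training data with missing values, rather than being dropped.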


# EDA Summary
To predict the likelihood of an applicant defaulting on a loan, our team performed predictive analysis to uncover meaningful relationships among variables within the Home Credit Default Risk dataset. We analyzed and evaluated existing features, as well as several developed through feature engineering, to gauge their predictive power. Then, to discover patterns and identify outliers, we created visualizations of the most significant features. This exploratory data analysis informed our choices for model construction and allowed us to improve model performance.

## [Training Data EDA Notebook 1 - Training Data](https://www.kaggle.com/megannichols/team-3-final-eda-notebook-1)
## [Training Data EDA Notebook 2 - Supplemental Data](https://www.kaggle.com/salemgirmai/team-3-eda-notebook-2)

Here are a few important discoveries we made in our EDA and Model notebooks.

The dataset is imbalanced: the two classes of the target variable are not evenly represented.

![target dist.png](attachment:834eb976-4e8e-4b4b-8259-370da7f02eca.png)

Borrowers who are younger, who take smaller loans, and who have lower incomes appear to be more likely to default on their loans.

![dist age with default.png](attachment:4a0bc1d8-6d38-493d-8ce4-414e1b1411a9.png)
![loan amount with default.png](attachment:69696637-8989-490d-9b33-dc1ff3149e40.png)
![income with default.png](attachment:39ae41ae-af6b-4ede-9f4a-e9c248ba3467.png)


These are the most important features of all those used in our final model.

![feature importance.png](attachment:fe948556-fef3-4841-bf04-84b22e2637c2.png)

# Project Challenges

The Home Credit Default Risk competition offered several unique challenges. The first challenge we came across was the unbalanced training dataset. Because the distribution of the target was uneven, accuracy was not a meaningful measure of model performance. This led us to familiarize ourselves with ROC AUC and to learn more about model parameters that support unbalanced data.

The next challenge was the extensive list of supplemental tables. We knew it was important to incorporate these tables into our model, but there was no simple way to construct the merges. Many of the tables had multiple observations for each borrower and therefore had to be aggregated so that each borrower was represented by a single observation. Merging these condensed tables created another challenge, because the tables did not all share the same key features. Mapping the merge path allowed us to overcome these issues.

Of all the technical obstacles we faced, memory management was the most consistent and significant. Garbage collection proved valuable at multiple points: when categorical features were encoded with dummy variables, when numeric features were aggregated, and when tables were merged with the training dataset.
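
The encode, aggregate, and free-memory steps described above can be sketched as follows. This is a simplified illustration on a toy table, not the code from our Data Generation Notebook: the column names `contract_type` and `amount` are hypothetical, and the choice of `uint8` for the dummy columns is an assumption made here to keep memory use low.

```python
import gc
import pandas as pd

# Toy supplemental table: several rows per borrower (columns are hypothetical).
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "contract_type": ["cash", "card", "cash"],
    "amount": [100.0, 300.0, 50.0],
})

# One-hot encode the categorical column; a small dtype keeps memory down.
encoded = pd.get_dummies(prev, columns=["contract_type"], dtype="uint8")

# Aggregate to one row per borrower: summed dummies become per-category counts.
prev_agg = encoded.groupby("SK_ID_CURR").agg(
    amount_mean=("amount", "mean"),
    n_cash=("contract_type_cash", "sum"),
    n_card=("contract_type_card", "sum"),
).reset_index()

# Drop the large intermediates and ask Python to reclaim memory right away.
del prev, encoded
gc.collect()

print(prev_agg)
```

Deleting the intermediate frames before calling `gc.collect()` matters: the collector can only reclaim objects that no longer have live references.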

# Modeling Techniques
To reach our final score, we used many modeling methods to predict whether a borrower would default. The methods included Logistic Regression, Random Forest, Decision Tree, XGBoost, LightGBM, Extra Trees, and an ensemble model. We were able to run each of these models successfully, but our best results came from Logistic Regression, LightGBM, and the ensemble model. Because of the unbalanced distribution of the target values, we scored our models on area under the ROC curve rather than accuracy.
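
To illustrate why accuracy is misleading on an unbalanced target, the sketch below compares accuracy and ROC AUC for a "model" that always predicts no default. The labels, scores, and the 1-in-10 class ratio are made up for illustration. The `roc_auc` helper is a hypothetical pure-Python implementation of the pairwise definition of AUC (the probability that a randomly chosen defaulter is scored above a randomly chosen non-defaulter, with ties counting half); in practice a library routine such as scikit-learn's `roc_auc_score` would be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half. This pairwise version is O(n^2), so it is meant
    for illustration only, not for the full dataset.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels: 1 default in 10 loans (the class ratio is illustrative).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# A "model" that always predicts no default.
always_zero = [0] * 10
acc = accuracy(y_true, always_zero)          # high despite being useless
auc_constant = roc_auc(y_true, always_zero)  # no better than chance

# A model that ranks the defaulter above most, but not all, non-defaulters.
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.9, 0.2, 0.8]
auc_ranked = roc_auc(y_true, scores)

print(acc, auc_constant, auc_ranked)
```

The constant predictor looks strong on accuracy while its AUC sits at the chance level, which is the behavior that makes AUC the more informative metric here.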

This is how our model construction progressed during this project (scores are ROC AUC on the test data, shown as percentages):

Week 3 - 63.491%
* only used the application_train data set (no supplemental tables)
* only used about a dozen features
* no data cleansing (i.e. still had bogus "365243" numbers everywhere)
* no engineered features
* only used three model types: logistic regression, random forest, decision tree

Week 4 - 71.358%
* added a LightGBM model (although we believe our official submission that week came from a random forest)
* lots of data cleansing, some engineered features
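
As one example of this kind of cleansing, the "365243" placeholder mentioned under Week 3 can be turned into a proper missing value. The sketch below is an assumption about how such a step might look, not the exact code from our Data Generation Notebook; in particular, `DAYS_EMPLOYED` is assumed here to be an affected column, and the `DAYS_EMPLOYED_ANOM` flag is a hypothetical addition.

```python
import numpy as np
import pandas as pd

# Toy frame; DAYS_EMPLOYED is assumed to be one of the affected columns.
df = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -30, 365243]})

PLACEHOLDER = 365243  # sentinel value mentioned in the report

# Keep a flag so a model can still learn from "the value was a placeholder".
df["DAYS_EMPLOYED_ANOM"] = (df["DAYS_EMPLOYED"] == PLACEHOLDER).astype("uint8")

# Replace the sentinel with NaN so it no longer distorts means and scaling.
df["DAYS_EMPLOYED"] = df["DAYS_EMPLOYED"].replace(PLACEHOLDER, np.nan)

print(df)
```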

Week 5 - 76.715%
* merged in all the model features from other tables
* added XGBoost model

Week 6 - 77.073%
* added calculation for interest rates (this week was more about getting organized and streamlined)
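
This report does not spell out how the interest rate was calculated, so the following is only one plausible approach, shown as a hedged sketch: treat the loan as a fixed-payment annuity and solve numerically for the per-period rate that matches the observed payment. The function names, the bisection method, and the example numbers are all assumptions introduced here.

```python
def annuity_payment(principal, rate, n_payments):
    """Fixed per-period payment for a fully amortizing loan."""
    if rate == 0:
        return principal / n_payments
    return principal * rate / (1 - (1 + rate) ** -n_payments)


def implied_rate(principal, payment, n_payments, tol=1e-10):
    """Per-period rate implied by a payment, found by bisection."""
    if payment * n_payments <= principal:
        return 0.0  # total payments do not exceed principal: no positive rate
    lo, hi = 0.0, 1.0  # assumes the per-period rate is below 100%
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The payment grows with the rate, so move toward the matching side.
        if annuity_payment(principal, mid, n_payments) < payment:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Round-trip check with made-up numbers: 1% per period over 12 periods.
payment = annuity_payment(10_000, 0.01, 12)
rate = implied_rate(10_000, payment, 12)
print(round(payment, 2), round(rate, 6))
```

Bisection is used because the annuity formula has no closed-form inverse for the rate, and the payment increases monotonically with the rate, so the search is guaranteed to converge within the assumed bounds.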

Week 7 - 78.637%
* cleaned up an error that was causing sub-optimal performance on the test data
* improved performance of the Logistic Regression and XGBoost models using Optuna and other model-tuning techniques

Week 8 - 78.737%
* added Extra Trees model
* experimented with ensemble models to find best combination 


## [Data Generation Notebook](https://www.kaggle.com/cloycebox/team-3-data-generation-notebook) 
This notebook handles all of our data cleansing, feature engineering and table merges. 

## [Model Notebook](https://www.kaggle.com/cloycebox/team-3-model-notebook)
This notebook includes each of the modeling techniques we used and ranks each model's performance.

## [Final Submission Notebook](https://www.kaggle.com/cloycebox/team-3-submission-notebook)
This notebook combines the information from both the Data Generation Notebook and the Model Notebook and submits our best model to the Home Credit Default Risk Competition.

# Final Results
Our best-performing model on validation was a LightGBM model. It achieved the following ROC AUC scores:
* 0.77977 on 5-fold cross-validation
* 0.78810 on the validation data set
* 0.78360 on the test data set

Our best-scoring model on the test data was an ensemble model that combined the LightGBM, logistic regression, and random forest models. It scored 0.78737.
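
One common way to combine models like these is to average their predicted default probabilities, optionally with weights. The sketch below shows that idea with made-up predictions; the `blend` helper and the weights are hypothetical, and the blending scheme shown is not necessarily the one used in our Model Notebook.

```python
import numpy as np


def blend(predictions, weights=None):
    """Weighted average of per-model probability arrays."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_rows)
    return np.average(stacked, axis=0, weights=weights)


# Made-up predicted default probabilities for four applicants.
lgbm_pred = np.array([0.10, 0.80, 0.30, 0.60])
logreg_pred = np.array([0.20, 0.70, 0.40, 0.40])
rf_pred = np.array([0.30, 0.60, 0.20, 0.50])

# Equal weights, then a version that trusts the first model twice as much.
equal_blend = blend([lgbm_pred, logreg_pred, rf_pred])
weighted_blend = blend([lgbm_pred, logreg_pred, rf_pred], weights=[2, 1, 1])

print(equal_blend)
print(weighted_blend)
```

Because ROC AUC depends only on how applicants are ranked, averaging helps when the models make different ranking mistakes that partly cancel out.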

![leaderboard.png](attachment:f7f0b266-bea1-4d5c-8a5a-1b232f865d46.png)

## [Final Model Evaluation](https://www.kaggle.com/cloycebox/team-3-final-model-evaluation)
This notebook includes:
* ROC AUC and Precision-Recall Plots
* Confusion Matrix, Recall Matrix, and Precision Matrix
* Boxplot showing the distribution of predicted probabilities
* Density plots of predicted probabilities
* Team 3's place on the competition leaderboard