<h1 style="text-align:center">INFSCI 2595 Machine Learning Project</h1>
<h2 style="text-align:center">Home Credit Default Risk</h2>
<h5 style="text-align:center">Members: Chih Ying Chang, Xinghao Huang, Yuanyuan Zhang</h5>

# Section One: Introduction and Baseline Model

## 1. Abstract

Evaluating a person's creditworthiness is important for lenders. In most cases, a complete credit record makes this evaluation reliable; however, some people have no such record, so it is hard to know whether they will repay in the future.

In this project, we use the historical loan application data provided by Home Credit to predict whether an applicant will be able to repay a loan. In section one, we build a baseline model using simple data preprocessing and logistic regression. In the second section, we add new datasets and apply various feature engineering strategies to improve accuracy. We also apply LightGBM, a Gradient Boosting Decision Tree (GBDT) implementation with GOSS and EFB that is effective and efficient on large datasets. The results show that these methods are effective. In the third section, we select the best features based on the results of the previous two sections by removing collinear, mostly missing, and unimportant features. The results show that the feature selection methods are effective.

## 2. Introduction

Many applicants want to take out a loan, so an automated system for evaluating creditworthiness is necessary. Many related studies have been done in recent years.

Khandani, Kim, and Lo [1] applied machine learning models to forecast consumer credit risk. They combined customer transactions and credit bureau data from 2005 to 2009 and constructed out-of-sample forecasts that significantly improve the classification rates of credit-card-holder delinquencies and defaults, with linear regression R² values of forecasted/realized delinquencies of 85%.

Galindo and Tamayo [2] introduced a specific modeling methodology based on the study of error curves. Their results show that CART decision-tree models provide the best estimation for default, with an average error rate of 8.31% for a training sample of 2,000 records.

In the same study, neural networks provided the second-best results with an average error rate of 11.00%, and the K-Nearest Neighbor algorithm had an average error rate of 14.95%. All of these outperformed the standard Probit algorithm, which attained an average error rate of 15.13%.

One challenge for their study is the data size. With more data available for training, the accuracy would likely be higher.

## 3. Data Description

The data is provided by Home Credit, a service dedicated to providing lines of credit (loans) to the unbanked population.

Three data sources are used in this section:
1. application_train.csv
2. application_test.csv
3. codebook.csv

application_train/application_test data: the main training and testing data, with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET column, where 0 means the loan was repaid and 1 means the loan was not repaid.

codebook data: describes the features in the datasets, including each feature's full name and description.

Other datasets are used later in this project. They can be joined to each other through key features, as shown in the diagram below.

![](https://i.postimg.cc/MH3Jjn5d/download.png)

## 4. Methods

Not everyone has a reliable credit history, which makes it hard for banks and other institutions to judge whether a person is a trustworthy borrower. For people with insufficient or no credit history, we can only use other records to evaluate their credibility. Home Credit is a company that takes advantage of various data, such as transactional information, to predict the reliability of those people, especially their ability to repay.

In this project, we use the historical loan application data provided by Home Credit to predict whether an applicant will be able to repay a loan. Because the data size is very large, traditional machine learning models are not a good fit. Instead, we use LightGBM as the classifier [3].

LightGBM is a fast, distributed, efficient gradient boosting framework based on decision tree algorithms. It grows trees leaf-wise (best-first), splitting the leaf with the largest loss reduction, whereas most other boosting algorithms grow trees depth-wise (level-wise). For the same number of leaves, leaf-wise growth can reduce more loss than level-wise growth and therefore often achieves better accuracy.

Leaf-wise growth increases model complexity and may lead to overfitting. This can be mitigated by setting the `max_depth` parameter, which limits how deep splitting can go. Because LightGBM can easily overfit small datasets, it is best suited to large ones.
![](https://i.postimg.cc/XqY0Jffr/leaf.png)
![](https://i.postimg.cc/DwX38yXP/depth.png)
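
The trade-off between leaf-wise growth and overfitting can be made concrete with a small parameter sketch. The helper `leafwise_params` below is hypothetical and not part of this project's code. It uses LightGBM's `num_leaves` and `max_depth` parameter names and relies on the fact that a binary tree of depth `d` has at most `2**d` leaves, so `num_leaves` is kept below that bound. The specific values are illustrative assumptions, not tuned settings.

```python
def leafwise_params(max_depth=6, leaf_fraction=0.5):
    """Hypothetical helper: build LightGBM-style parameters that cap tree complexity."""
    max_leaves = 2 ** max_depth  # upper bound on leaves for a tree of this depth
    num_leaves = max(2, int(max_leaves * leaf_fraction))
    return {
        "objective": "binary",      # binary classification (repaid vs. not repaid)
        "boosting_type": "gbdt",
        "max_depth": max_depth,     # limits how deep leaf-wise splitting can go
        "num_leaves": num_leaves,   # main complexity control for leaf-wise growth
        "min_child_samples": 20,    # illustrative value, not tuned
        "learning_rate": 0.05,      # illustrative value, not tuned
    }


params = leafwise_params(max_depth=6)
print(params["num_leaves"], "leaves allowed out of", 2 ** params["max_depth"], "possible")
```

If `lightgbm` is installed, a dictionary like this could be passed on as `lightgbm.LGBMClassifier(**params)`.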

In addition to LightGBM, we use several feature engineering and feature selection methods to improve accuracy and efficiency. In the second section, we use two additional datasets to improve prediction accuracy; details are given there. In section three, we try several methods to reduce the number of features, which improves both efficiency and accuracy; those methods are described in that section.

<hr>

## 5. Baseline Model

# Import Packages

In [None]:
import numpy as np 
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import LeaveOneOut
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC
from sklearn.model_selection import KFold
import gc
from sklearn.preprocessing import StandardScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures

import time
import os
print(os.listdir("../input"))

# Read Data

In [None]:
app_train = pd.read_csv('../input/application_train.csv')
app_train.head()

In [None]:
app_test = pd.read_csv('../input/application_test.csv')
app_test.head()

# Training Data

In [None]:
target = app_train.loc[:, 'TARGET']
columns_feature = app_train.columns.tolist()
columns_feature.remove('TARGET')
features = app_train.loc[:, columns_feature]
features.shape

>Training Data Basic Information

| Item         | Number           |
| -------      | ---------------- |
| Observation  | 307511           |
| Features     | 121              |

# Testing Data

Comment: the testing data does not have the `TARGET` column, which is the value to be predicted. However, we can still use the testing data.

We can upload our model's predictions for the testing data, and Kaggle will report the score of those predictions.

In [None]:
features = app_test
features.shape

>Testing Data Basic Information

| Item         | Number           |
| -------      | ---------------- |
| Observation  | 48744           |
| Features     | 121              |

# TARGET Distribution

In [None]:
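# Count loans per TARGET value; sort_index keeps 0 (repaid) before 1 (not repaid)
counts = app_train['TARGET'].value_counts().sort_index()
df = pd.DataFrame({'frequency': counts.values, 'repaid': ['repaid', 'not repaid']},
                  index=counts.index)
df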

0 means the loan was repaid on time; 1 means the loan was not repaid on time.

In [None]:
sns.set(style="whitegrid")
size=(5, 5)
fig, ax = plt.subplots(figsize=size)
ax.set_title('TARGET distribution')
sns.barplot(x="repaid", y="frequency", data=df, ax=ax)

This is clearly an imbalanced classification problem: far fewer loans were not repaid than were repaid.
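
Class imbalance matters for how a model is judged. The toy example below is a sketch with an assumed 92/8 split chosen for illustration, not the real class counts. It shows that a model giving every applicant the same score still gets high accuracy, while its ROC AUC stays at chance level.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy labels with an assumed imbalance: 92 repaid (0), 8 not repaid (1)
y_true = np.array([0] * 92 + [1] * 8)

# A useless model that gives every applicant the same score
constant_scores = np.zeros(len(y_true))
constant_labels = (constant_scores >= 0.5).astype(int)

acc = accuracy_score(y_true, constant_labels)
auc = roc_auc_score(y_true, constant_scores)
print(f"accuracy = {acc:.2f}, ROC AUC = {auc:.2f}")
```

A ranking-based metric such as ROC AUC is therefore a more informative check than plain accuracy on data like this.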

# Data Completeness

In [None]:
# calculate null function
def calCompleteness(data):
    row_num = data.shape[0]

    nul = data.isnull().sum()
    nul = pd.DataFrame({'features': nul.index, 'null_number': nul.values})

    comp = []
    for index, row in nul.iterrows():
        temp = row['null_number']
        re = float(row_num - temp) / row_num * 100
        comp.append(re)

    nul['completeness'] = pd.to_numeric(comp, downcast='float')
    nul = nul.sort_values(by=['completeness'], ascending=False)
    nul = nul.reset_index(drop=True)
    return nul

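# `features` was reassigned to the testing data above, so pass the training features explicitly
complete = calCompleteness(app_train.drop(columns=['TARGET']))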

In [None]:
sns.set(style="whitegrid")
size=(20, 40)
fig, ax = plt.subplots(figsize=size)
ax.set_title('Training Data Completeness')
sns.barplot(x="completeness", y="features", data=complete, ax=ax)
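
The same completeness table can also be computed without a row-by-row loop. The function `completeness_table` below is a hypothetical vectorized sketch, shown on a toy frame rather than the application data. It relies on the fact that the mean of a boolean non-null mask is the fraction of non-null values.

```python
import numpy as np
import pandas as pd


def completeness_table(data):
    """Percentage of non-null values per column, most complete first."""
    pct = data.notnull().mean() * 100  # fraction of non-null rows, as a percentage
    table = pct.rename("completeness").rename_axis("features").reset_index()
    return table.sort_values("completeness", ascending=False).reset_index(drop=True)


toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, np.nan, np.nan, 4],
    "c": [np.nan, 2, 3, 4],
})
table = completeness_table(toy)
print(table)
```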

# Data Type

In [None]:
app_train.dtypes.value_counts()

# Check Object Type Data

In [None]:
# pd.Series.nunique: Return number of unique elements in the object.
temp = app_train.select_dtypes('object').apply(pd.Series.nunique, axis = 0)
pd.DataFrame({'feature': temp.index, 'Number of Categories': temp.values})

Most categorical variables have a small number of categories. 

# Encode Categorical Variables

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
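
Fitting the encoder on the training column and then transforming both datasets guarantees that the same category maps to the same integer in each. The toy values below are assumptions for illustration. Note that `LabelEncoder` raises a `ValueError` if the testing data contains a category it did not see during `fit`.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["Y", "N", "N", "Y"])  # classes are stored in sorted order

train_codes = encoder.transform(["Y", "N", "N", "Y"])
test_codes = encoder.transform(["N", "Y"])  # reuses the mapping learned from training

print(list(encoder.classes_), list(train_codes), list(test_codes))
```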

In [None]:
# one-hot encoding
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

# Aligning Training and Testing Data

One-hot encoding created more columns in the training data because some categories appear in the training data but not in the testing data. It is therefore necessary to remove the columns that are not present in the testing data.

In [None]:
train_labels = app_train['TARGET']
# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)
# Add the target back in
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
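
The effect of the alignment step is easier to see on a toy example. The frames and column names below are made up for illustration. The category `blue` appears only in the toy training data, so its dummy column is dropped by the inner join, and `TARGET` has to be saved and restored because it is also missing from the testing data.

```python
import pandas as pd

toy_train = pd.DataFrame({"x": [1, 2, 3],
                          "color": ["red", "green", "blue"],
                          "TARGET": [0, 1, 0]})
toy_test = pd.DataFrame({"x": [4, 5], "color": ["red", "green"]})

# One-hot encode each frame separately, as in the cells above
toy_train = pd.get_dummies(toy_train)
toy_test = pd.get_dummies(toy_test)

# Keep only the columns present in both frames, then restore the target
labels = toy_train["TARGET"]
toy_train, toy_test = toy_train.align(toy_test, join="inner", axis=1)
toy_train["TARGET"] = labels

print(sorted(toy_train.columns))
print(sorted(toy_test.columns))
```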

# Anomalies Analysis

In [None]:
# DAYS_BIRTH analysis
# age description
(app_train['DAYS_BIRTH'] / -365).describe()

In [None]:
# DAYS_EMPLOYED analysis
# get employed years
(app_train['DAYS_EMPLOYED'] / 365).describe()

In [None]:
(app_train['DAYS_EMPLOYED'] / 365).plot.hist(title = 'Years Employment Histogram');
plt.xlabel('Years Employment');

Some applicants appear to have been employed for about 1,000 years, which is clearly incorrect. In addition, all of these anomalous entries share the same value.

In [None]:
# Create an anomalous flag column
app_train['DAYS_EMPLOYED_ANOM'] = app_train["DAYS_EMPLOYED"] == 365243

# Replace the anomalous values with nan (assign back instead of a chained inplace replace)
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})

app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment');

In [None]:
# do same thing on the testing data
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"] = app_test["DAYS_EMPLOYED"].replace({365243: np.nan})

print('There are %d anomalies in the test data out of %d entries' % (app_test["DAYS_EMPLOYED_ANOM"].sum(), len(app_test)))
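
Because the same two steps are applied to both datasets, they can be wrapped in one function. The helper `flag_employment_anomaly` below is a hypothetical sketch, shown on a toy frame. It returns a modified copy and leaves its input unchanged.

```python
import numpy as np
import pandas as pd

ANOMALY = 365243  # sentinel value observed in DAYS_EMPLOYED


def flag_employment_anomaly(df, column="DAYS_EMPLOYED"):
    """Return a copy with a boolean flag column and the sentinel replaced by NaN."""
    out = df.copy()
    out[column + "_ANOM"] = out[column] == ANOMALY
    out[column] = out[column].replace({ANOMALY: np.nan})
    return out


toy = pd.DataFrame({"DAYS_EMPLOYED": [-1200, 365243, -300]})
cleaned = flag_employment_anomaly(toy)
print(cleaned)
```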

# Correlation Analysis

In [None]:
# Find correlations with the target and sort
corr = app_train.corr()
correlations = corr['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))

# Baseline
## Logistic Regression Implementation

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; use SimpleImputer
from sklearn.impute import SimpleImputer
# Drop the target from the training data
if 'TARGET' in app_train:
    train = app_train.drop(columns = ['TARGET'])
else:
    train = app_train.copy()

# Feature names
features = list(train.columns)

# Copy of the testing data
test = app_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy = 'median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range = (0, 1))

# Fit on the training data
imputer.fit(train)

# Transform both training and testing data
train = imputer.transform(train)
test = imputer.transform(app_test)

# Repeat with the scaler
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)

print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
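
The key point of the cell above is that the medians and the min/max ranges are learned from the training data only and then reused on the testing data. The sketch below shows the same idea on toy arrays, with the two steps chained in a scikit-learn pipeline. The pipeline is an alternative arrangement assumed here for illustration, not what the cell above does.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

prep = make_pipeline(SimpleImputer(strategy="median"),
                     MinMaxScaler(feature_range=(0, 1)))

toy_train = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]])
toy_test = np.array([[np.nan, 4.0]])

train_scaled = prep.fit_transform(toy_train)  # statistics learned from training rows only
test_scaled = prep.transform(toy_test)        # reuses the training medians and ranges

print(train_scaled)
print(test_scaled)
```

The missing value in the second training column is filled with that column's median (6) before scaling, and the missing testing value is filled with the training median of the first column (3).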

In [None]:
from sklearn.linear_model import LogisticRegression

# Make the model with the specified regularization parameter
log_reg = LogisticRegression(C = 0.0001, solver='lbfgs')

# Train on the training data
log_reg.fit(train, train_labels)

# Make predictions
# Make sure to select the second column only
log_reg_pred = log_reg.predict_proba(test)[:, 1]
# Submission dataframe
submit = app_test.loc[:, ['SK_ID_CURR']]
submit['TARGET'] = log_reg_pred

submit.head()

In [None]:
# Save the submission to a csv file
submit.to_csv('baseling_submission.csv', index = False)

![](https://i.postimg.cc/3NRP7Khp/baseline.png)
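
Since the testing data has no labels, a local estimate of model quality needs to come from the training data. The sketch below uses stratified cross-validation with ROC AUC as the score. It runs on a synthetic, imbalanced dataset from `make_classification` as a stand-in (an assumption for illustration), so the printed numbers say nothing about the real baseline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the preprocessed training matrix (an assumption)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_clusters_per_class=1, weights=[0.9, 0.1],
                           random_state=0)

# Same model settings as the baseline cell above
model = LogisticRegression(C=0.0001, solver="lbfgs")

# Stratified folds keep the class ratio similar in every split
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="roc_auc")

print("ROC AUC per fold:", scores.round(3))
print("Mean ROC AUC:", scores.mean().round(3))
```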

## References:
[1] Khandani, A. E., Kim, A. J., & Lo, A. W. (2010). Consumer credit-risk models via machine-learning algorithms. Journal of Banking & Finance, 34(11), 2767-2787.

[2] Galindo, J., & Tamayo, P. (2000). Credit risk assessment using statistical and machine learning: basic methodology and risk modeling applications. Computational Economics, 15(1-2), 107-143.

[3] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems (pp. 3146-3154).