# Home Credit Default Risk

## Problem Definition

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

## Data

There are 7 different sources of data:

application_train: 
- Main training data with information about each loan application at Home Credit 
- Every loan has its own row (`SK_ID_CURR`)
- The value predict is given in `TARGET` column indicating 0 (the loan was repaid) or 1 (the loan was not repaid)

application_test: 
- Main testing data with information about each loan application at Home Credit 
- Every loan has its own row (`SK_ID_CURR`)

bureau: 
- Contains data about client's previous credits from other financial institutions
- Each previous credit has its own row in bureau (identified by `SK_ID_BUREAU`)
- One loan (`SK_ID_CURR`) in the application data can have multiple previous credits

bureau_balance: 
- Contains monthly data about the previous credits in bureau
- Each row is one month of a previous credit (`SK_ID_BUREAU`)
- A single previous credit (`SK_ID_BUREAU`) can have multiple rows, one for each month of the credit length 

previous_application: 
- Contains previous applications for loans (`SK_ID_PREV`) at Home Credit of clients who have loans in the application data (`SK_ID_CURR`)
- Each current loan (`SK_ID_CURR`) in the application data can have multiple previous loans (`SK_ID_PREV`)

POS_CASH_BALANCE: 
- Contains monthly data about previous point of sale or cash loans clients have had with Home Credit 
- Each row is one month of a previous point of sale or cash loan, and a single previous loan (`SK_ID_PREV`) can have many rows

credit_card_balance: 
- Monthly data about previous credit cards clients have had with Home Credit
- Each row is one month of a credit card balance, and a single credit card can have many rows

installments_payment: 
- Contains payment history for previous loans (`SK_ID_PREV`) at Home Credit
- There is one row for every made payment and one row for every missed payment

In [1]:
import pandas as pd
import numpy as np
from seaborn import countplot, kdeplot
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt    

In [2]:
%matplotlib inline

In [3]:
application_train_df = pd.read_csv('data/application_train.csv')
application_test_df = pd.read_csv('data/application_test.csv')
bureau_df = pd.read_csv('data/bureau.csv')
bureau_balance_df = pd.read_csv('data/bureau_balance.csv')
previous_application_df = pd.read_csv('data/previous_application.csv')
installments_payments_df = pd.read_csv('data/installments_payments.csv')
credit_card_balance_df = pd.read_csv('data/credit_card_balance.csv')
pos_cash_balance_path = pd.read_csv('data/POS_CASH_balance.csv')

In [4]:
LABEL_NAME = 'TARGET'
CURRENT_LOAN_ID = 'SK_ID_CURR'
PREVIOUS_LOAN_ID = 'SK_ID_PREV'
BUREAU_LOAN_ID = 'SK_ID_BUREAU'
ID_FEATURE_NAMES = [CURRENT_LOAN_ID, PREVIOUS_LOAN_ID, BUREAU_LOAN_ID]
NUMERICAL_AGGREGATIONS = ['mean', 'max', 'min', 'sum']
CATEGORICAL_AGGREGATIONS = ['mean', 'sum']

In [5]:
train_feature_names = list(set(application_train_df.columns).difference(LABEL_NAME))
test_feature_names = list(application_test_df.columns)

In [6]:
labels = application_train_df[LABEL_NAME]
train_features = application_train_df[train_feature_names]
test_features = application_train_df[test_feature_names]

In [7]:
train_features.shape

(307511, 122)

In [8]:
def create_aggregate_column_names(column_names: list, exclude_item: str):
    new_column_names = []
    for column_name, agg_name in column_names:
        if column_name is exclude_item:
            new_column_names.append(column_name)
        else:
            new_column_names.append('{}_{}'.format(column_name, agg_name))
    return new_column_names

In [9]:
def extend_features_one_level_aggregates(df: pd.DataFrame, exclude_feature: str, grouping_feature: str):
    categorical_columns = df.select_dtypes(include=['object']).columns
    numerical_columns = df.columns.difference([exclude_feature]).difference(categorical_columns)
    
    numerical_group = df[numerical_columns].groupby(grouping_feature)
    numercial_aggregate = numerical_group.agg(NUMERICAL_AGGREGATIONS).reset_index()
    numercial_aggregate.columns = create_aggregate_column_names(numercial_aggregate.columns.ravel(), 
                                                                exclude_item=grouping_feature)
    
    category_df = pd.get_dummies(df[categorical_columns])
    category_df[grouping_feature] = df[grouping_feature]
    category_group = category_df.groupby(grouping_feature)
    category_aggregate = category_group.agg([CATEGORICAL_AGGREGATIONS]).reset_index()
    category_aggregate.columns = create_aggregate_column_names(category_aggregate.columns.ravel(), 
                                                               exclude_item=grouping_feature)
    
    combined_df = numercial_aggregate.merge(category_aggregate, on=grouping_feature)
    if exclude_feature:
        combined_df = combined_df.merge(df[[grouping_feature, exclude_feature]], on=grouping_feature)
    return combined_df

In [10]:
bureau_aggregate = extend_features_one_level_aggregates(df=bureau_df, 
                                                        exclude_feature=BUREAU_LOAN_ID, 
                                                        grouping_feature=CURRENT_LOAN_ID)

In [11]:
bureau_aggregate.head()

Unnamed: 0,SK_ID_CURR,AMT_ANNUITY_mean,AMT_ANNUITY_max,AMT_ANNUITY_min,AMT_ANNUITY_sum,AMT_CREDIT_MAX_OVERDUE_mean,AMT_CREDIT_MAX_OVERDUE_max,AMT_CREDIT_MAX_OVERDUE_min,AMT_CREDIT_MAX_OVERDUE_sum,AMT_CREDIT_SUM_mean,...,CREDIT_TYPE_Loan for business development_mean,CREDIT_TYPE_Loan for purchase of shares (margin lending)_mean,CREDIT_TYPE_Loan for the purchase of equipment_mean,CREDIT_TYPE_Loan for working capital replenishment_mean,CREDIT_TYPE_Microloan_mean,CREDIT_TYPE_Mobile operator loan_mean,CREDIT_TYPE_Mortgage_mean,CREDIT_TYPE_Real estate loan_mean,CREDIT_TYPE_Unknown type of loan_mean,SK_ID_BUREAU
0,100001,3545.357143,10822.5,0.0,24817.5,,,,,207623.571429,...,0,0,0,0,0,0,0,0,0,5896630
1,100001,3545.357143,10822.5,0.0,24817.5,,,,,207623.571429,...,0,0,0,0,0,0,0,0,0,5896631
2,100001,3545.357143,10822.5,0.0,24817.5,,,,,207623.571429,...,0,0,0,0,0,0,0,0,0,5896632
3,100001,3545.357143,10822.5,0.0,24817.5,,,,,207623.571429,...,0,0,0,0,0,0,0,0,0,5896633
4,100001,3545.357143,10822.5,0.0,24817.5,,,,,207623.571429,...,0,0,0,0,0,0,0,0,0,5896634


In [12]:
bureau_balance_aggregate = extend_features_one_level_aggregates(df=bureau_balance_df, 
                                                                exclude_feature='', 
                                                                grouping_feature=BUREAU_LOAN_ID)

In [13]:
train_features_extended = train_features.merge(bureau_aggregate, 
                                               on=CURRENT_LOAN_ID, 
                                               how='left').merge(bureau_balance_aggregate, 
                                                                 on=BUREAU_LOAN_ID, 
                                                                 how='left')

In [14]:
train_features_extended.shape

(1509345, 206)

In [15]:
train_features_extended.head()

Unnamed: 0,REG_REGION_NOT_LIVE_REGION,AMT_CREDIT,FLAG_DOCUMENT_19,FLAG_EMP_PHONE,FLAG_DOCUMENT_11,CODE_GENDER,LANDAREA_MEDI,FLAG_DOCUMENT_4,REG_CITY_NOT_LIVE_CITY,NONLIVINGAREA_MODE,...,MONTHS_BALANCE_min,MONTHS_BALANCE_sum,STATUS_0_mean,STATUS_1_mean,STATUS_2_mean,STATUS_3_mean,STATUS_4_mean,STATUS_5_mean,STATUS_C_mean,STATUS_X_mean
0,0,406597.5,0,1,0,M,0.0375,0,0,0.0,...,-36.0,-561.0,18.0,1.0,0.0,0.0,0.0,0.0,2.0,1.0
1,0,406597.5,0,1,0,M,0.0375,0,0,0.0,...,-15.0,-120.0,3.0,0.0,0.0,0.0,0.0,0.0,13.0,0.0
2,0,406597.5,0,1,0,M,0.0375,0,0,0.0,...,-47.0,-632.0,5.0,6.0,0.0,0.0,0.0,0.0,2.0,3.0
3,0,406597.5,0,1,0,M,0.0375,0,0,0.0,...,-36.0,-456.0,5.0,6.0,0.0,0.0,0.0,0.0,2.0,3.0
4,0,406597.5,0,1,0,M,0.0375,0,0,0.0,...,-21.0,-78.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
