# Home Credit Default Risk

## Problem Definition

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

## Data

There are 7 different sources of data:

application_train: 
- Main training data with information about each loan application at Home Credit 
- Every loan has its own row (`SK_ID_CURR`)
- The value predict is given in `TARGET` column indicating 0 (the loan was repaid) or 1 (the loan was not repaid)

application_test: 
- Main testing data with information about each loan application at Home Credit 
- Every loan has its own row (`SK_ID_CURR`)

bureau: 
- Contains data about client's previous credits from other financial institutions
- Each previous credit has its own row in bureau (identified by `SK_ID_BUREAU`)
- One loan (`SK_ID_CURR`) in the application data can have multiple previous credits

In [1]:
import gc
import lightgbm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from seaborn import countplot, kdeplot
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import Imputer, MinMaxScaler

In [2]:
%matplotlib inline

In [3]:
LABEL_NAME = 'TARGET'
CURRENT_LOAN_ID = 'SK_ID_CURR'
PREVIOUS_LOAN_ID = 'SK_ID_PREV'
BUREAU_LOAN_ID = 'SK_ID_BUREAU'
ID_FEATURE_NAMES = [CURRENT_LOAN_ID, PREVIOUS_LOAN_ID, BUREAU_LOAN_ID]
NUMERICAL_AGGREGATIONS = ['mean', 'max', 'min', 'sum']
CATEGORICAL_AGGREGATIONS = ['mean', 'sum']

In [4]:
def get_train_test_features(train_data_path: str, test_data_path: str):
    train_features = pd.read_csv(train_data_path)
    test_features = pd.read_csv(test_data_path)
    labels = train_features[LABEL_NAME]
    
    return train_features.drop(LABEL_NAME, axis=1), test_features, labels

In [5]:
train_data_path = 'data/application_train.csv'
test_data_path = 'data/application_test.csv'
train_features, test_features, labels = get_train_test_features(train_data_path, test_data_path)

In [6]:
train_features.shape, labels.shape, test_features.shape

((307511, 121), (307511,), (48744, 121))

In [7]:
def create_aggregate_column_names(column_names: list, exclude_item: str):
    new_column_names = []
    for column_name, agg_name in column_names:
        if column_name is exclude_item:
            new_column_names.append(column_name)
        else:
            new_column_names.append('{}_{}'.format(column_name, agg_name))
    return new_column_names

In [8]:
def extend_features(df: pd.DataFrame, exclude_feature: str, grouping_feature: str):
    categorical_columns = df.select_dtypes(include=['object']).columns
    numerical_columns = df.columns.difference([exclude_feature]).difference(categorical_columns)
    
    numerical_group = df[numerical_columns].groupby(grouping_feature)
    numercial_aggregate = numerical_group.agg(NUMERICAL_AGGREGATIONS).reset_index()
    numercial_aggregate.columns = create_aggregate_column_names(numercial_aggregate.columns.ravel(), 
                                                                exclude_item=grouping_feature)
    
    category_df = pd.get_dummies(df[categorical_columns])
    category_df[grouping_feature] = df[grouping_feature]
    category_group = category_df.groupby(grouping_feature)
    category_aggregate = category_group.agg([CATEGORICAL_AGGREGATIONS]).reset_index()
    category_aggregate.columns = create_aggregate_column_names(category_aggregate.columns.ravel(), 
                                                               exclude_item=grouping_feature)
    
    combined_df = numercial_aggregate.merge(category_aggregate, on=grouping_feature)
    if exclude_feature:
        combined_df = combined_df.merge(df[[grouping_feature, exclude_feature]], on=grouping_feature)
    return combined_df

In [9]:
def extend_train_test_data_two_level(level_one_data_path: str, 
                                     level_one_exclude_feature: str, 
                                     level_one_grouping_feature: str,
                                     level_two_data_path: str, 
                                     level_two_exclude_feature: str, 
                                     level_two_grouping_feature: str,
                                     train_df: pd.DataFrame, 
                                     test_df: pd.DataFrame):
    level_one_df = pd.read_csv(level_one_data_path)
    level_one_aggregated = extend_features(df=level_one_df,
                                           exclude_feature=level_one_exclude_feature, 
                                           grouping_feature=level_one_grouping_feature
                                          ).drop_duplicates(level_one_grouping_feature)
    level_two_df = pd.read_csv(level_two_data_path)
    level_two_df_extended = level_two_df.merge(level_one_aggregated, how='left', on=level_one_grouping_feature)
    level_two_aggregated = extend_features(df=level_two_df_extended, 
                                           exclude_feature=level_two_exclude_feature, 
                                           grouping_feature=level_two_grouping_feature
                                          ).drop_duplicates(level_two_grouping_feature)
    train_df = train_df.merge(level_two_aggregated, how='left', on=level_two_grouping_feature)
    test_df = test_df.merge(level_two_aggregated, how='left', on=level_two_grouping_feature)
        
    return train_df, test_df

In [10]:
bureau_balance_data_path = 'data/bureau_balance.csv'
bureau_data_path = 'data/bureau.csv'

bureau_balance_exclude_feature = ''
bureau_exclude_feature = BUREAU_LOAN_ID

bureau_balance_grouping_feature = BUREAU_LOAN_ID
bureau_grouping_feature = CURRENT_LOAN_ID

train_features, test_features = extend_train_test_data_two_level(
    level_one_data_path=bureau_balance_data_path,
    level_one_exclude_feature=bureau_balance_exclude_feature, 
    level_one_grouping_feature=bureau_balance_grouping_feature,
    level_two_data_path=bureau_data_path, 
    level_two_exclude_feature=bureau_exclude_feature, 
    level_two_grouping_feature=bureau_grouping_feature,
    train_df=train_features, 
    test_df=test_features
)

In [11]:
train_features.shape, test_features.shape

((307511, 241), (48744, 241))

In [12]:
def extend_train_test_data_one_level(df_path: str, 
                                     exclude_feature: str, 
                                     grouping_feature: str,
                                     train_df: pd.DataFrame, 
                                     test_df: pd.DataFrame):
    extend_df = pd.read_csv(df_path)
    extend_aggregate = extend_features(df=extend_df, 
                                       exclude_feature=exclude_feature, 
                                       grouping_feature=grouping_feature).drop_duplicates(CURRENT_LOAN_ID)
    train_df = train_df.merge(extend_aggregate, how='left', on=grouping_feature)
    test_df = test_df.merge(extend_aggregate, how='left', on=grouping_feature)
        
    return train_df, test_df

Extend train and test data with auxiliary data

In [13]:
df_path = 'data/previous_application.csv'
train_features, test_features = extend_train_test_data_one_level(df_path=df_path,
                                                                 exclude_feature=PREVIOUS_LOAN_ID, 
                                                                 grouping_feature=CURRENT_LOAN_ID, 
                                                                 train_df=train_features, 
                                                                 test_df=test_features)
train_features.shape, test_features.shape

((307511, 461), (48744, 461))

In [14]:
train_features.to_csv('input/train.csv')
test_features.to_csv('input/test.csv')
labels.to_csv('input/labels.csv')

Deal with category data by one hot encoding

In [15]:
def make_one_hot_encoded(train_features: pd.DataFrame, test_features: pd.DataFrame):
    train_1h = pd.get_dummies(train_features)
    test_1h = pd.get_dummies(test_features)
    
    return train_1h.align(test_1h, join='inner', axis=1)

In [16]:
train_features, test_features = make_one_hot_encoded(train_features, test_features)
train_features.shape, test_features.shape

((307511, 582), (48744, 582))

Deal with missing values

In [17]:
def impute(train_data: pd.DataFrame, test_data: pd.DataFrame, strategy: str) -> tuple:
    imputer = Imputer(strategy=strategy)
    
    train_imputed = imputer.fit_transform(train_data)
    train_features = pd.DataFrame(train_imputed, columns=train_data.columns)
    
    test_imputed = imputer.transform(test_data) 
    test_features = pd.DataFrame(test_imputed, columns=test_data.columns)
    
    return train_features, test_features

In [18]:
train_features, test_features = impute(train_features, test_features, strategy='median')
train_features.shape, test_features.shape

((307511, 582), (48744, 582))

In [19]:
def scale(train_data: pd.DataFrame, test_data: pd.DataFrame, feature_range):
    
    scaler = MinMaxScaler(feature_range=feature_range)
    train_ids = train_data[CURRENT_LOAN_ID].apply(int)
    test_ids = test_data[CURRENT_LOAN_ID].apply(int)
    train_scaled = scaler.fit_transform(train_data)
    test_scaled = scaler.transform(test_data)
    train_data_scaled = pd.DataFrame(train_scaled, columns=train_data.columns )
    test_data_scaled = pd.DataFrame(test_scaled, columns=test_data.columns)
    
    train_data_scaled[CURRENT_LOAN_ID] = train_ids
    test_data_scaled[CURRENT_LOAN_ID] = test_ids
    
    return test_data_scaled, test_data_scaled

In [20]:
train_features, test_features = scale(train_features, test_features, feature_range=(0, 1))
train_features.shape, test_features.shape

((48744, 582), (48744, 582))

In [None]:
def gbm_model(train: pd.DataFrame, target: pd.DataFrame, test: pd.DataFrame, id_field: str, n_splits: int):
    feature_columns = list(set(train.columns).difference(id_field))
    train_ids = train[id_field]
    train_wo_ids = train[feature_columns]
    test_wo_ids = test[feature_columns]
    
    train_matrix = train_wo_ids.as_matrix()
    target_matrix = target.as_matrix()
    test_matrix = test_wo_ids.as_matrix()
    
    k_fold = KFold(n_splits=n_splits, shuffle=False, random_state=50)
    test_predictions = np.zeros(test_wo_ids.shape[0])
    feature_importances = np.zeros(len(feature_columns))
    out_of_fold = np.zeros(train.shape[0])
    
    valid_scores = []
    train_scores = []
    
    for train_indices, valid_indices in k_fold.split(train_matrix):
        model = lightgbm.LGBMClassifier(n_estimators=10000, 
                                        objective='binary', 
                                        class_weight='balanced', 
                                        learning_rate=0.05, 
                                        reg_alpha=0.1, 
                                        reg_lambda=0.1, 
                                        subsample=0.8, 
                                        n_jobs=-1, 
                                        random_state=50)
        train_feature_sample = train_matrix[train_indices]
        train_target_sample = target_matrix[train_indices]
        valid_feature_sample = train_matrix[valid_indices]
        valid_target_sample = target_matrix[valid_indices]
        
        model.fit(train_feature_sample, 
                  train_target_sample, 
                  eval_metric='auc', 
                  eval_set=[(valid_feature_sample, valid_target_sample), 
                            (train_feature_sample, train_target_sample)], 
                  eval_names=['valid', 'train'], 
                  categorical_feature='auto', 
                  early_stopping_rounds=100, 
                  verbose=200)
    
        feature_importances += model.feature_importances_ / k_fold.n_splits
        
        test_predictions += model.predict_proba(test_matrix, num_iteration=model.best_iteration_)[ :,1]/k_fold.n_splits
        
        out_of_fold[valid_indices] = model.predict_proba(valid_feature_sample, num_iteration=model.best_iteration_)[ :,1]
        
        valid_scores.append(model.best_score_['valid']['auc'])
        train_scores.append(model.best_score_['train']['auc'])
        
        gc.enable()
        del model, train_feature_sample, valid_feature_sample
        gc.collect()
        
    submission = pd.DataFrame({id_field: test[id_field], 'TARGET': test_predictions})
    
    feature_importances_pd = pd.DataFrame({'feature': feature_columns, 'importance': feature_importances})
    
    valid_auc = roc_auc_score(target_matrix, out_of_fold)    
    
    metrics = pd.DataFrame({'fold': list(range(n_splits)), 
                            'train': train_scores.append(np.mean(train_scores)),
                            'valid': valid_scores.append(valid_auc)})
    
    return submission, feature_importances_pd, metrics

In [None]:
submission, feature_importances, metrics = gbm_model(train=train_features, 
                                                     target=labels, 
                                                     test=test_features, 
                                                     id_field=CURRENT_LOAN_ID, 
                                                     n_splits=5)

Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[76]	valid's auc: 0.499569	train's auc: 0.93345
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[2]	valid's auc: 0.519155	train's auc: 0.574629
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[10]	valid's auc: 0.512492	train's auc: 0.742154
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is:
[41]	valid's auc: 0.516183	train's auc: 0.888119
Training until validation scores don't improve for 100 rounds.


In [None]:
submission.head()

In [None]:
submission.shape

In [None]:
submission.to_csv('gbm_feature_submission_6.csv', index=False)