"Home Credit Default Risk" project with automatic feature engineering library featuretools
=======
This project comes from the Kaggle competition https://www.kaggle.com/c/home-credit-default-risk/overview and is good practice for working with a large relational database. Doing feature engineering by hand on this kind of dataset is possible but exhausting. The automatic feature engineering library featuretools ( https://docs.featuretools.com/en/stable/ ) can save a great deal of effort. Before using the library, we still need exploratory data analysis (EDA) to understand the datasets, at least the main table. Understanding the data helps us preprocess it better before feature engineering. The EDA is done in the separate file Home_Credit_Default_Risk_app.ipynb.

In [111]:
import pandas as pd
import matplotlib.pyplot as plt

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import numpy as np
import seaborn
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder,LabelEncoder
from sklearn.metrics import mean_squared_error, roc_auc_score, accuracy_score
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import xgboost as xgb
import pickle

import featuretools as ft
import featuretools.variable_types as vtypes

In [2]:
app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')
bureau = pd.read_csv('bureau.csv')
bureau_balance = pd.read_csv('bureau_balance.csv')
cash = pd.read_csv('POS_CASH_balance.csv')
credit = pd.read_csv('credit_card_balance.csv')
previous = pd.read_csv('previous_application.csv')
installments = pd.read_csv('installments_payments.csv')

In [3]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
app = pd.concat([app_train, app_test], ignore_index = True, sort = True)

In [4]:
app.shape

(356255, 122)

Data anomaly handling
--
From the analysis of the app data in Home_Credit_Default_Risk_app.ipynb, we know that an anomaly occurs in the day-count fields: the value 365243 days (365243 / 365.25 ≈ 1000 years). Therefore, we replace 365243 with np.nan in every dataframe.

In [5]:
app = app.replace({365243: np.nan})
bureau = bureau.replace({365243: np.nan})
bureau_balance = bureau_balance.replace({365243: np.nan})
cash = cash.replace({365243: np.nan})
credit = credit.replace({365243: np.nan})
previous = previous.replace({365243: np.nan})
installments = installments.replace({365243: np.nan})
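
The blanket `replace` above touches every column, including amounts that might legitimately equal 365243. A more conservative variant restricts the replacement to day-count columns. The sketch below assumes the sentinel only matters in columns whose names start with `DAYS_`; the toy frame and the helper name `replace_days_sentinel` are hypothetical.

```python
import numpy as np
import pandas as pd

SENTINEL = 365243  # placeholder value found in the day-count fields


def replace_days_sentinel(df, prefix="DAYS_"):
    """Return a copy where SENTINEL becomes NaN in columns starting with `prefix` only."""
    out = df.copy()
    day_cols = [c for c in out.columns if c.startswith(prefix)]
    out[day_cols] = out[day_cols].replace(SENTINEL, np.nan)
    return out


# Toy data (made up): the same number appears as a day count and as an amount.
toy = pd.DataFrame({
    "DAYS_EMPLOYED": [-1200, 365243, -30],
    "AMT_CREDIT": [365243, 50000, 120000],
})
cleaned = replace_days_sentinel(toy)
print(cleaned)
print(SENTINEL / 365.25)  # roughly 1000 years
```

Whether the narrower replacement is preferable depends on the data; the notebook itself keeps the blanket version.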

Memory Reduction
--
Converting columns to a smaller subtype, for example float64 to float32, reduces memory usage. This step is important when the datasets are large: without it, the feature matrix calculation might fail because the machine runs out of memory. In addition, we need to make sure that the keys used to define the relationships have the same data type in every table.

In [6]:
print('type of bureau[SK_ID_BUREAU]',bureau['SK_ID_BUREAU'].dtypes)
print('type of bureau_balance[SK_ID_BUREAU]',bureau_balance['SK_ID_BUREAU'].dtypes)
print('type of app[SK_ID_CURR]',app['SK_ID_CURR'].dtypes)
print('type of bureau[SK_ID_CURR]',bureau['SK_ID_CURR'].dtypes)
print('type of previous[SK_ID_CURR]',previous['SK_ID_CURR'].dtypes)
print('type of previous[SK_ID_PREV]',previous['SK_ID_PREV'].dtypes)
print('type of cash[SK_ID_PREV]',cash['SK_ID_PREV'].dtypes)
print('type of installments[SK_ID_PREV]',installments['SK_ID_PREV'].dtypes)
print('type of credit[SK_ID_PREV]',credit['SK_ID_PREV'].dtypes)

type of bureau[SK_ID_BUREAU] float64
type of bureau_balance[SK_ID_BUREAU] int64
type of app[SK_ID_CURR] float64
type of bureau[SK_ID_CURR] float64
type of previous[SK_ID_CURR] float64
type of previous[SK_ID_PREV] float64
type of cash[SK_ID_PREV] float64
type of installments[SK_ID_PREV] float64
type of credit[SK_ID_PREV] int64


In [7]:
bureau = bureau.astype({'SK_ID_BUREAU': 'float32','SK_ID_CURR':'float32'})
bureau_balance = bureau_balance.astype({'SK_ID_BUREAU': 'float32'})
app = app.astype({'SK_ID_CURR': 'float32'})
previous = previous.astype({'SK_ID_CURR': 'float32','SK_ID_PREV': 'float32'})
cash = cash.astype({'SK_ID_PREV': 'float32'})
installments = installments.astype({'SK_ID_PREV': 'float32'})
credit = credit.astype({'SK_ID_PREV': 'float32'})
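
One caveat when casting keys to `float32`: a 32-bit float has a 24-bit significand, so integers are represented exactly only up to 2**24 = 16,777,216. The IDs printed in this notebook (for example 5715448 and 2562384) are below that limit, so the cast loses nothing for those values. The check below is a small sketch for verifying this on other data; the helper name `safe_for_float32` is hypothetical.

```python
import numpy as np


def safe_for_float32(ids):
    """Return True if every integer ID survives a round trip through float32."""
    ids = np.asarray(ids, dtype=np.int64)
    round_trip = ids.astype(np.float32).astype(np.int64)
    return bool(np.array_equal(round_trip, ids))


print(safe_for_float32([5715448, 2562384]))  # IDs that appear in this notebook
print(safe_for_float32([2**24 + 1]))         # smallest positive integer float32 cannot hold
```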

In [13]:
def convert_types(df):
    """Convert data types in a pandas dataframe. Purpose is to reduce size of dataframe."""
    for col in df:
        if ('SK_ID' in col):
            df[col] = df[col].fillna(0).astype(np.int32)
        elif (df[col].dtype == 'object') and (df[col].nunique() < df.shape[0]):
            df[col] = df[col].astype('category')
        elif list(df[col].unique()) == [1, 0]:
            df[col] = df[col].astype(bool)
        elif df[col].dtype == float:
            df[col] = df[col].astype(np.float32)
        elif df[col].dtype == int:
            df[col] = df[col].astype(np.int32)
    return df

In [14]:
app = convert_types(app)
bureau = convert_types(bureau)
bureau_balance = convert_types(bureau_balance)
cash = convert_types(cash)
credit = convert_types(credit)
previous = convert_types(previous)
installments = convert_types(installments)
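
To see how much such conversions actually save, the memory footprint can be measured before and after. The following is a minimal sketch on made-up data; the helper `memory_mb` and the toy columns are hypothetical, and the exact numbers depend on the pandas version.

```python
import numpy as np
import pandas as pd


def memory_mb(df):
    """Total memory of a dataframe in megabytes, including string contents."""
    return df.memory_usage(deep=True).sum() / 1024**2


rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "AMT": rng.normal(size=100_000),                           # stored as float64
    "STATUS": rng.choice(["Active", "Closed"], size=100_000),  # stored as strings
})

before = memory_mb(toy)
toy["AMT"] = toy["AMT"].astype(np.float32)        # half the bytes per value
toy["STATUS"] = toy["STATUS"].astype("category")  # small integer codes plus a lookup table
after = memory_mb(toy)

print(f"{before:.2f} MB -> {after:.2f} MB")
```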

In [15]:
credit.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.970001,135000,0.0,877.5,0.0,877.5,1700.324951,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.554688,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.554688,64875.554688,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.224609,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085938,31460.085938,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.109375,225000,2250.0,2250.0,0.0,0.0,11795.759766,...,233048.96875,233048.96875,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.46875,450000,0.0,11547.0,0.0,11547.0,22924.890625,...,453919.46875,453919.46875,0.0,1,0.0,1.0,101.0,Active,0,0


In [14]:
bureau_balance.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [16]:
bureau.dtypes

SK_ID_CURR                   int32
SK_ID_BUREAU                 int32
CREDIT_ACTIVE             category
CREDIT_CURRENCY           category
DAYS_CREDIT                float32
CREDIT_DAY_OVERDUE         float32
DAYS_CREDIT_ENDDATE        float32
DAYS_ENDDATE_FACT          float32
AMT_CREDIT_MAX_OVERDUE     float32
CNT_CREDIT_PROLONG         float32
AMT_CREDIT_SUM             float32
AMT_CREDIT_SUM_DEBT        float32
AMT_CREDIT_SUM_LIMIT       float32
AMT_CREDIT_SUM_OVERDUE     float32
CREDIT_TYPE               category
DAYS_CREDIT_UPDATE         float32
AMT_ANNUITY                float32
dtype: object

Feature Engineering with the usage of featuretools
--

In [17]:
# Empty entityset
es = ft.EntitySet(id = 'clients')

# Entities with a unique index
es = es.entity_from_dataframe(entity_id = 'app', dataframe = app, index = 'SK_ID_CURR')

es = es.entity_from_dataframe(entity_id = 'bureau', dataframe = bureau, index = 'SK_ID_BUREAU')

es = es.entity_from_dataframe(entity_id = 'previous', dataframe = previous, index = 'SK_ID_PREV')

# Entities that do not have a unique index
es = es.entity_from_dataframe(entity_id = 'bureau_balance', dataframe = bureau_balance, 
                              make_index = True, index = 'bureaubalance_index')

es = es.entity_from_dataframe(entity_id = 'cash', dataframe = cash, 
                              make_index = True, index = 'cash_index')

es = es.entity_from_dataframe(entity_id = 'installments', dataframe = installments,
                              make_index = True, index = 'installments_index')

es = es.entity_from_dataframe(entity_id = 'credit', dataframe = credit,
                              make_index = True, index = 'credit_index')

In [18]:
# Relationship between app and bureau
r_app_bureau = ft.Relationship(es['app']['SK_ID_CURR'], es['bureau']['SK_ID_CURR'])

# Relationship between bureau and bureau balance
r_bureau_balance = ft.Relationship(es['bureau']['SK_ID_BUREAU'], es['bureau_balance']['SK_ID_BUREAU'])

# Relationship between current app and previous apps
r_app_previous = ft.Relationship(es['app']['SK_ID_CURR'], es['previous']['SK_ID_CURR'])

# Relationships between previous apps and cash, installments, and credit
r_previous_cash = ft.Relationship(es['previous']['SK_ID_PREV'], es['cash']['SK_ID_PREV'])
r_previous_installments = ft.Relationship(es['previous']['SK_ID_PREV'], es['installments']['SK_ID_PREV'])
r_previous_credit = ft.Relationship(es['previous']['SK_ID_PREV'], es['credit']['SK_ID_PREV'])

# Add in the defined relationships
es = es.add_relationships([r_app_bureau, r_bureau_balance, r_app_previous,
                           r_previous_cash, r_previous_installments, r_previous_credit])

In [19]:
agg_primitives =  ["sum", "max", "min", "mean", "count", "percent_true", "num_unique", "mode"]
# Deep feature synthesis 
feature_names = ft.dfs(entityset=es, target_entity='app',
                       agg_primitives = agg_primitives,
                       n_jobs = 1, verbose = 1,
                       features_only = True,
                       max_depth = 2)

Built 1294 features
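
To make the generated features less of a black box: an aggregation primitive applied across a parent-child relationship amounts to a group-by on the child table followed by a join onto the parent. The sketch below builds by hand what features such as `MEAN(bureau.AMT_CREDIT_SUM)` would contain. It uses plain pandas on made-up toy data (only the column names come from the tables above), so it illustrates the idea rather than the library's internals.

```python
import pandas as pd

# Toy parent table (one row per client) and child table (one row per bureau record).
app_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
bureau_toy = pd.DataFrame({
    "SK_ID_BUREAU": [10, 11, 12],
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})

# Aggregate the child rows per client, then name the columns in DFS style.
agg = (
    bureau_toy.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
    .agg(["mean", "max", "count"])
    .rename(columns={
        "mean": "MEAN(bureau.AMT_CREDIT_SUM)",
        "max": "MAX(bureau.AMT_CREDIT_SUM)",
        "count": "COUNT(bureau)",
    })
    .reset_index()
)

# Left join keeps clients without any bureau rows; their aggregates are NaN here.
features = app_toy.merge(agg, on="SK_ID_CURR", how="left")
print(features)
```

With `max_depth = 2`, the same idea is applied one level further down, for example aggregating `bureau_balance` per bureau record first and then aggregating those results per client.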


In [20]:
import datetime

In [22]:
now1 = datetime.datetime.now()

feature_matrix = ft.calculate_feature_matrix(feature_names, 
                                             entityset=es, 
                                             n_jobs = 1, 
                                             verbose = 0,
                                             chunk_size = 0.2 )

now2 = datetime.datetime.now()
print(now1,now2)

2020-03-30 08:17:52.288074 2020-03-30 12:17:56.572946
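
The two timestamps are easier to read as a duration. A small sketch using the values printed above:

```python
import datetime

start = datetime.datetime(2020, 3, 30, 8, 17, 52, 288074)
end = datetime.datetime(2020, 3, 30, 12, 17, 56, 572946)

elapsed = end - start
print(elapsed)                         # 4:00:04.284872
print(elapsed.total_seconds() / 3600)  # about 4 hours
```

So the single-process feature matrix calculation took roughly four hours in this run.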


In [23]:
import os
feature_matrix.reset_index(inplace = True)

if not os.path.exists('feature_matrix'):
    os.makedirs('feature_matrix')

feature_matrix.to_csv('feature_matrix/feature_matrix.csv', index = False)

In [67]:
# Read the saved feature matrix.
feature_matrix=pd.read_csv('feature_matrix/feature_matrix.csv')
feature_matrix = convert_types(feature_matrix)

In [68]:
print('Shape: ', feature_matrix.shape)

Shape:  (356255, 1295)


Feature selection
--

**Remove any columns built on the SK_ID_CURR**

In [69]:
cols_with_curr_id = [col for col in feature_matrix.columns if 'SK_ID_CURR' in col]
cols_with_bureau_id = [col for col in feature_matrix.columns if 'SK_ID_BUREAU' in col]
cols_with_previous_id = [col for col in feature_matrix.columns if 'SK_ID_PREV' in col]
print('There are {} columns that contain SK_ID_CURR'.format( len(cols_with_curr_id)) )
print('There are {} columns that contain SK_ID_BUREAU'.format( len(cols_with_bureau_id)) )
print('There are {} columns that contain SK_ID_PREV'.format( len(cols_with_previous_id)) )

There are 60 columns that contain SK_ID_CURR
There are 0 columns that contain SK_ID_BUREAU
There are 0 columns that contain SK_ID_PREV


In [70]:
feature_matrix = feature_matrix.drop(columns = cols_with_curr_id)
print('Shape: ', feature_matrix.shape)

Shape:  (356255, 1235)


**Remove Missing Values**

We need a threshold on the fraction of missing values above which a column is removed. Here the threshold is 60%: any column with more than 60% missing values is dropped.

In [74]:
# Fraction of missing values per column (0.0 to 1.0)
feature_missing = (feature_matrix.isnull().sum() / len(feature_matrix)).sort_values(ascending = False)
feature_missing.head()

MIN(credit.previous.RATE_INTEREST_PRIVILEGED)     1.0
MAX(credit.previous.RATE_INTEREST_PRIMARY)        1.0
MEAN(credit.previous.RATE_INTEREST_PRIVILEGED)    1.0
MAX(credit.previous.DAYS_LAST_DUE_1ST_VERSION)    1.0
MAX(credit.previous.RATE_INTEREST_PRIVILEGED)     1.0
dtype: float64

In [75]:
# Identify missing values above threshold
missing_threshold = 0.6
feature_missing_0 = feature_missing.index[feature_missing > 0.5/feature_matrix.shape[0] ]
feature_missing = feature_missing.index[feature_missing > missing_threshold]

print('There are {} columns with missing values'.format( len(feature_missing_0) ) )
print('There are {} columns with more than {}% missing values'.format( len(feature_missing), missing_threshold*100) )

There are 946 columns with missing values
There are 397 columns with more than 60.0% missing values


In [78]:
feature_matrix = feature_matrix.drop(columns = feature_missing)
print('Shape: ', feature_matrix.shape)

Shape:  (356255, 838)


**Remove Collinear Variables**

Collinear features can decrease generalization performance on the test set. To address this, only one feature from each group of collinear features is kept and the others are removed. We first compute the absolute correlation matrix, then look at its strictly upper triangular part and drop every column that has a correlation above the threshold (here 0.9) with some earlier column.

In [79]:
# Absolute value correlation matrix
corr_matrix = feature_matrix.corr().abs()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.head()

Unnamed: 0,AMT_ANNUITY,AMT_CREDIT,AMT_GOODS_PRICE,AMT_INCOME_TOTAL,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,...,SUM(credit.previous.AMT_CREDIT),SUM(credit.previous.AMT_GOODS_PRICE),SUM(credit.previous.NFLAG_INSURED_ON_APPROVAL),SUM(credit.previous.SELLERPLACE_AREA),SUM(credit.previous.DAYS_LAST_DUE),SUM(credit.previous.DAYS_FIRST_DRAWING),SUM(credit.previous.AMT_ANNUITY),SUM(credit.previous.HOUR_APPR_PROCESS_START),SUM(credit.previous.DAYS_DECISION),PERCENT_TRUE(credit.previous.NFLAG_LAST_APPL_IN_DAY)
AMT_ANNUITY,,0.762521,0.768123,0.204391,0.001472,0.003162,0.029877,0.017053,0.009583,0.015458,...,0.030761,0.009676,0.003224,0.013448,0.020788,0.013295,0.03133,0.019343,0.015796,0.00561
AMT_CREDIT,,,0.987159,0.16659,0.005411,0.002333,0.05805,0.004341,0.002961,0.047656,...,0.044191,0.000673,0.008461,0.018851,0.041174,0.041111,0.044895,0.041083,0.044078,0.005019
AMT_GOODS_PRICE,,,,0.169445,0.005816,0.001688,0.059851,0.004691,0.003229,0.05008,...,0.037172,0.004166,0.007913,0.017994,0.039672,0.036732,0.037896,0.035151,0.03954,0.005478
AMT_INCOME_TOTAL,,,,,0.002743,0.000767,0.022736,0.006712,0.001447,0.011153,...,0.033342,0.020313,0.000648,0.005811,0.006089,0.006888,0.033191,0.016191,0.007592,0.014562
AMT_REQ_CREDIT_BUREAU_DAY,,,,,,0.227493,0.003313,0.005853,0.214474,0.003224,...,0.002717,0.001974,0.001722,0.002056,0.00405,0.002201,0.002812,0.000183,0.001099,0.003246


In [80]:
corr_matrix.head()

Unnamed: 0,AMT_ANNUITY,AMT_CREDIT,AMT_GOODS_PRICE,AMT_INCOME_TOTAL,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_YEAR,...,SUM(credit.previous.AMT_CREDIT),SUM(credit.previous.AMT_GOODS_PRICE),SUM(credit.previous.NFLAG_INSURED_ON_APPROVAL),SUM(credit.previous.SELLERPLACE_AREA),SUM(credit.previous.DAYS_LAST_DUE),SUM(credit.previous.DAYS_FIRST_DRAWING),SUM(credit.previous.AMT_ANNUITY),SUM(credit.previous.HOUR_APPR_PROCESS_START),SUM(credit.previous.DAYS_DECISION),PERCENT_TRUE(credit.previous.NFLAG_LAST_APPL_IN_DAY)
AMT_ANNUITY,1.0,0.762521,0.768123,0.204391,0.001472,0.003162,0.029877,0.017053,0.009583,0.015458,...,0.030761,0.009676,0.003224,0.013448,0.020788,0.013295,0.03133,0.019343,0.015796,0.00561
AMT_CREDIT,0.762521,1.0,0.987159,0.16659,0.005411,0.002333,0.05805,0.004341,0.002961,0.047656,...,0.044191,0.000673,0.008461,0.018851,0.041174,0.041111,0.044895,0.041083,0.044078,0.005019
AMT_GOODS_PRICE,0.768123,0.987159,1.0,0.169445,0.005816,0.001688,0.059851,0.004691,0.003229,0.05008,...,0.037172,0.004166,0.007913,0.017994,0.039672,0.036732,0.037896,0.035151,0.03954,0.005478
AMT_INCOME_TOTAL,0.204391,0.16659,0.169445,1.0,0.002743,0.000767,0.022736,0.006712,0.001447,0.011153,...,0.033342,0.020313,0.000648,0.005811,0.006089,0.006888,0.033191,0.016191,0.007592,0.014562
AMT_REQ_CREDIT_BUREAU_DAY,0.001472,0.005411,0.005816,0.002743,1.0,0.227493,0.003313,0.005853,0.214474,0.003224,...,0.002717,0.001974,0.001722,0.002056,0.00405,0.002201,0.002812,0.000183,0.001099,0.003246


In [82]:
# Set the threshold for removing correlated variables
corr_threshold = 0.9
# Select columns with correlations above threshold
feature_collinear = [col for col in upper.columns if any(upper[col] > corr_threshold)]

print('There are {} columns to remove.'.format( len(feature_collinear) ) )

There are 381 columns to remove.


In [83]:
feature_matrix = feature_matrix.drop(columns = feature_collinear)
print('Shape: ', feature_matrix.shape)

Shape:  (356255, 457)
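
Which member of a correlated pair survives follows from the upper-triangle trick: entry (i, j) with i < j lives in column j, so the later column of each pair is flagged and the earlier one is kept. A minimal sketch on made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2 * a + 1,          # perfectly correlated with "a"
    "noise": rng.normal(size=500),  # generated independently of "a"
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

Only `a_scaled` is flagged; `a` comes first in column order and is therefore retained. This also means the result depends on the order of the columns in the feature matrix.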


Encode the categorical variables
--
Categorical variables with 2 unique values (or 1 category plus np.nan) are label encoded; categorical variables with more than 2 unique values are one-hot encoded.

In [92]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in feature_matrix.columns:
    if (feature_matrix[col].dtype.name == 'object' or feature_matrix[col].dtype.name == 'category'):
        # If 2 or fewer unique categories
        if len(list(feature_matrix[col].unique())) <= 2:
            print(col ,feature_matrix[col].unique())
            if( feature_matrix[col].dtype.name == 'category' and feature_matrix[col].isnull().sum() > 0 ):
                feature_matrix[col] = feature_matrix[col].cat.add_categories(['NA'])
            # Assign the result back; in-place fillna on a selected column is unreliable in recent pandas
            feature_matrix[col] = feature_matrix[col].fillna('NA')
            le.fit(feature_matrix[col])
            feature_matrix[col] = le.transform(feature_matrix[col])
            # Count the label-encoded columns
            le_count += 1
print('{} columns were label encoded.'.format(le_count) )

2 ['Y' nan]
2 ['XAP' nan]
2 ['Approved' nan]
2 ['Y' nan]
2 ['Approved' nan]
5 columns were label encoded.


In [93]:
# One-hot encode the remaining categorical columns
feature_matrix = pd.get_dummies( feature_matrix, dummy_na=True )
print('shape', feature_matrix.shape )

shape (356255, 1004)
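
The two encodings can be seen side by side on a toy frame. This sketch uses made-up data and `pd.factorize` in place of scikit-learn's `LabelEncoder`, which is a substitution for brevity: `factorize` numbers categories in order of appearance, whereas `LabelEncoder` sorts them, so the integer codes may differ.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "FLAG": ["Y", np.nan, "Y"],              # one category plus NaN
    "STATUS": ["Active", "Closed", np.nan],  # two categories plus NaN
})

# Integer-code the near-binary column; NaN is turned into its own label first.
codes, labels = pd.factorize(toy["FLAG"].fillna("NA"))
toy["FLAG"] = codes

# One-hot encode what is still categorical; dummy_na adds an indicator column for NaN.
encoded = pd.get_dummies(toy, dummy_na=True)
print(list(encoded.columns))
print(encoded)
```

Note that `dummy_na=True` adds a NaN indicator for every encoded column, even when that column has no missing values, which partly explains the jump in column count above.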


Split the train and submission data
--

In [97]:
print(feature_matrix.TARGET.isnull()[feature_matrix.TARGET.isnull()==True].index)

Int64Index([307511, 307512, 307513, 307514, 307515, 307516, 307517, 307518,
            307519, 307520,
            ...
            356245, 356246, 356247, 356248, 356249, 356250, 356251, 356252,
            356253, 356254],
           dtype='int64', length=48744)


In [98]:
N_train = feature_matrix.TARGET.isnull()[feature_matrix.TARGET.isnull()==True].index[0]
print(N_train)

307511


In [99]:
train_data = feature_matrix[:N_train]
submission_data = feature_matrix[N_train:]
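
The positional slice works because all labelled rows come before the unlabelled ones. A boolean mask on `TARGET` does not rely on row order; a minimal sketch on made-up data (the `_toy` names are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "feature": [0.1, 0.2, 0.3, 0.4],
    "TARGET": [0.0, np.nan, 1.0, np.nan],  # labelled and unlabelled rows interleaved
})

is_submission = toy["TARGET"].isnull()
train_toy = toy[~is_submission]
submission_toy = toy[is_submission].drop(columns="TARGET")

print(len(train_toy), len(submission_toy))
```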

Split the train and test data
--

In [100]:
# Every column except the label is a model input
features_input = [col for col in train_data.columns if col != 'TARGET']

In [101]:
X_train, X_test, y_train, y_test = train_test_split( train_data[features_input] , train_data['TARGET'] , test_size = 0.2  )

Train the XGBoost classification model with grid search
--
Many missing values remain that we do not want to handle manually, so we should use a package that handles missing values natively. XGBoost is chosen here; LightGBM would also be a good choice. A tree-based model is used as the base model because its feature_importances_ attribute is convenient for inspecting the features.

In [108]:
# Next step: try more values of max_depth
param_grid = {
    'max_depth': [ 5],
    'n_estimators': [100]
}

model = GridSearchCV( xgb.XGBClassifier() , param_grid = param_grid , scoring  = 'roc_auc', cv = 5, verbose=4, n_jobs=1)
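
Before widening the grid, it helps to estimate the cost: grid search fits one model per parameter combination per fold. The sketch below counts fits for a hypothetical wider grid; the grid values and the per-fit time are assumptions for illustration, and larger `n_estimators` or `max_depth` would make individual fits slower than assumed.

```python
from itertools import product

# Hypothetical wider grid, not the one used in this notebook.
param_grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 200]}
cv_folds = 5
minutes_per_fit = 34  # assumed rough time for one fit on this feature matrix

candidates = list(product(*param_grid.values()))
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "fits")
print("approx. hours:", n_fits * minutes_per_fit / 60)
```

With `refit=True`, one more fit on the full training data follows the search.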

In [109]:
model.fit( X_train, y_train )

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] max_depth=5, n_estimators=100 ...................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ....... max_depth=5, n_estimators=100, score=0.774, total=34.2min


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed: 34.2min remaining:    0.0s


[CV] max_depth=5, n_estimators=100 ...................................
[CV] ....... max_depth=5, n_estimators=100, score=0.771, total=34.6min
[CV] max_depth=5, n_estimators=100 ...................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 68.9min remaining:    0.0s


[CV] ....... max_depth=5, n_estimators=100, score=0.778, total=30.0min
[CV] max_depth=5, n_estimators=100 ...................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 98.9min remaining:    0.0s


[CV] ....... max_depth=5, n_estimators=100, score=0.769, total=34.1min
[CV] max_depth=5, n_estimators=100 ...................................
[CV] ....... max_depth=5, n_estimators=100, score=0.771, total=33.4min


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 166.4min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                     colsample_bylevel=1, colsample_bynode=1,
                                     colsample_bytree=1, gamma=0,
                                     learning_rate=0.1, max_delta_step=0,
                                     max_depth=3, min_child_weight=1,
                                     missing=None, n_estimators=100, n_jobs=1,
                                     nthread=None, objective='binary:logistic',
                                     random_state=0, reg_alpha=0, reg_lambda=1,
                                     scale_pos_weight=1, seed=None, silent=None,
                                     subsample=1, verbosity=1),
             iid='warn', n_jobs=1,
             param_grid={'max_depth': [5], 'n_estimators': [100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_a

In [114]:
#save the trained model
pkl_filename = "feature_matrix/xgb_cv.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(model, file)

In [None]:
# Load from file. If a trained model already exists and we want to use it directly, uncomment the following lines.
#pkl_filename = "feature_matrix/xgb_cv.pkl"
#with open(pkl_filename, 'rb') as file:
#    pickle_model = pickle.load(file)

In [115]:
print(model.best_estimator_)
print(model.best_score_)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
0.7727704642388527


Make prediction on the submission data
--

In [127]:
prediction = model.best_estimator_.predict_proba(X_test)[:, 1]

In [117]:
roc_auc = roc_auc_score(y_true=y_test, y_score= prediction)
print("roc_auc",roc_auc)

roc_auc 0.7703464935083872
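
For intuition about this number: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below computes it directly from that definition on a tiny made-up example. The helper `auc_by_pairs` is hypothetical and builds a full positive-by-negative matrix, so it is only suitable for small inputs.

```python
import numpy as np


def auc_by_pairs(y_true, y_score):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # one entry per (positive, negative) pair
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


print(auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ranking of scores matters, AUC is unaffected by the absolute scale of the predicted probabilities.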


In [118]:
submission_prediction = model.best_estimator_.predict_proba(submission_data[features_input])[:, 1]

In [125]:
print(len(submission_prediction),submission_prediction)

48744 [0.05690423 0.13115685 0.02771353 ... 0.01806962 0.03141293 0.17831348]


In [120]:
app_test['SK_ID_CURR']

0        100001
1        100005
2        100013
3        100028
4        100038
5        100042
6        100057
7        100065
8        100066
9        100067
10       100074
11       100090
12       100091
13       100092
14       100106
15       100107
16       100109
17       100117
18       100128
19       100141
20       100150
21       100168
22       100169
23       100170
24       100171
25       100172
26       100184
27       100187
28       100212
29       100222
          ...  
48714    455963
48715    455965
48716    456007
48717    456008
48718    456009
48719    456010
48720    456011
48721    456013
48722    456028
48723    456058
48724    456111
48725    456114
48726    456115
48727    456116
48728    456119
48729    456120
48730    456122
48731    456123
48732    456166
48733    456167
48734    456168
48735    456169
48736    456170
48737    456189
48738    456202
48739    456221
48740    456222
48741    456223
48742    456224
48743    456250
Name: SK_ID_CURR, Length

In [123]:
submission = pd.DataFrame({'SK_ID_CURR':app_test['SK_ID_CURR'], 'TARGET': submission_prediction})

In [124]:
submission

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.056904
1,100005,0.131157
2,100013,0.027714
3,100028,0.038743
4,100038,0.166512
5,100042,0.043595
6,100057,0.019209
7,100065,0.028271
8,100066,0.022901
9,100067,0.079798


In [126]:
submission.to_csv('submission.csv', index = False)