# Problem Definition

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

- Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

### Data Description

See https://www.kaggle.com/c/home-credit-default-risk/data

There are 7 different sources of data:

* <b>application_train.csv</b>: contains the main training data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. The training application data comes with the `TARGET` indicating 0 (the loan was repaid) or 1 (the loan was not repaid). 
* <b>application_test.csv</b>: contains the main testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature `SK_ID_CURR`. 
* <b>bureau.csv</b>: contains data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
* <b>bureau_balance.csv</b>: contains monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length. 
* <b>previous_application.csv</b>: contains previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature `SK_ID_PREV`. 
* <b>POS_CASH_BALANCE.csv</b>: contains monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
* <b>credit_card_balance.csv</b>: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
* <b>installments_payment.csv</b>: contains payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment. 

#### Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder, Imputer, PolynomialFeatures, MinMaxScaler

# Data

#### Main train data

In [2]:
def get_train_data(path: str, target_name: str) -> tuple:
    df = pd.read_csv(path)
    target = df[target_name]    
    features = df.drop(target_name, axis=1)
    return features, target

In [3]:
path = 'data/application_train.csv'
target_name = 'TARGET'
train_features, train_target = get_train_data(path, target_name)

In [4]:
train_features.shape, train_target.shape

((307511, 121), (307511,))

In [5]:
train_features.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,...,0,0,0,0,,,,,,
4,100007,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
train_target.head()

0    1
1    0
2    0
3    0
4    0
Name: TARGET, dtype: int64

#### Auxiliary training data

Previous out of network loans

In [7]:
bureau_df = pd.read_csv('data/bureau.csv')
bureau_df.shape

(1716428, 17)

In [8]:
bureau_df.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,


Extend main train data with previous out of network loans

In [9]:
previous_loan_counts = bureau_df.groupby('SK_ID_CURR')['SK_ID_CURR'].count()
previous_loan_counts.shape

(305811,)

In [10]:
previous_loan_counts.head()

SK_ID_CURR
100001    7
100002    8
100003    4
100004    2
100005    3
Name: SK_ID_CURR, dtype: int64

In [11]:
previous_loan_df = previous_loan_counts.to_frame().rename(columns={'SK_ID_CURR': 'LOAN_COUNT_BUREAU'}).reset_index()
previous_loan_df.shape

(305811, 2)

In [12]:
previous_loan_df.head()

Unnamed: 0,SK_ID_CURR,LOAN_COUNT_BUREAU
0,100001,7
1,100002,8
2,100003,4
3,100004,2
4,100005,3


In [13]:
train_features_bureau = train_features.merge(previous_loan_df, on='SK_ID_CURR', how='left')
train_features_bureau.shape

(307511, 122)

In [14]:
train_features_bureau[['SK_ID_CURR', 'LOAN_COUNT_BUREAU']].head()

Unnamed: 0,SK_ID_CURR,LOAN_COUNT_BUREAU
0,100002,8.0
1,100003,4.0
2,100004,2.0
3,100006,
4,100007,1.0


In [15]:
bureau_balance_df = pd.read_csv('data/bureau_balance.csv')
bureau_balance_df.shape

(27299925, 3)

In [16]:
bureau_balance_df.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [17]:
max_loan_per_month = bureau_balance_df.groupby(
    'SK_ID_BUREAU', 
    as_index=False)['MONTHS_BALANCE'].max().rename(columns={'MONTHS_BALANCE': 'MAX_MONTHS_BALANCE'})

In [18]:
max_loan_per_month.head()

Unnamed: 0,SK_ID_BUREAU,MAX_MONTHS_BALANCE
0,5001709,0
1,5001710,0
2,5001711,0
3,5001712,0
4,5001713,0


In [19]:
max_loan_month_df = max_loan_per_month.merge(
    bureau_df[['SK_ID_CURR', 'SK_ID_BUREAU']], 
    on='SK_ID_BUREAU',
    how='left'
)

In [20]:
max_loan_month_df.head()

Unnamed: 0,SK_ID_BUREAU,MAX_MONTHS_BALANCE,SK_ID_CURR
0,5001709,0,
1,5001710,0,162368.0
2,5001711,0,162368.0
3,5001712,0,162368.0
4,5001713,0,150635.0


In [21]:
mean_max_monthly_loan = max_loan_month_df.groupby(
    'SK_ID_CURR', 
    as_index=False)['MAX_MONTHS_BALANCE'].mean().rename(columns={'MAX_MONTHS_BALANCE': 'MEAN_MAX_MONTHS_BALANCE'})

In [22]:
mean_max_monthly_loan.head()

Unnamed: 0,SK_ID_CURR,MEAN_MAX_MONTHS_BALANCE
0,100001.0,0.0
1,100002.0,-15.5
2,100005.0,0.0
3,100010.0,-28.5
4,100013.0,0.0


In [23]:
train_features_bureau_ext = train_features_bureau.merge(
    mean_max_monthly_loan[['SK_ID_CURR', 'MEAN_MAX_MONTHS_BALANCE']], 
    on='SK_ID_CURR',
    how='left'
)

In [24]:
train_features_bureau_ext.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR,LOAN_COUNT_BUREAU,MEAN_MAX_MONTHS_BALANCE
0,100002,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,351000.0,...,0,0,0.0,0.0,0.0,0.0,0.0,1.0,8.0,-15.5
1,100003,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,1129500.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,
2,100004,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,135000.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,
3,100006,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,297000.0,...,0,0,,,,,,,,
4,100007,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,513000.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,


#### Test data

In [25]:
test_features = pd.read_csv('data/application_test.csv')
test_features.shape

(48744, 121)

In [26]:
test_features.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


#### Dealing with categorical values

Convert categorical features to one-hot encoded features. This is due to the fact that implementations of most machine learning models cannot directly deal with categorical data. We will use pandas native one hot encoder. Howerver, to align train and test data, we will simply remove the additional columns, since they reflect only a small portion of data.

In [27]:
def make_one_hot_encoded(train_features: pd.DataFrame, test_features: pd.DataFrame) -> tuple:
    train_1h = pd.get_dummies(train_features)
    test_1h = pd.get_dummies(test_features)
    return train_1h.align(test_1h, join='inner', axis=1)

In [30]:
train_1h, test_1h = make_one_hot_encoded(train_features, test_features)
assert train_1h.shape[1]==test_1h.shape[1]

In [31]:
train_1h.shape, test_1h.shape

((307511, 242), (48744, 242))

Lets look at the one hot encoded colum s

In [32]:
def get_extra_columns(before: np.ndarray, after: np.ndarray) -> list:
    return list(set(after).difference(set(before)))

In [33]:
train_1h[get_extra_columns(train_features.columns.values, train_1h.columns.values)].head()

Unnamed: 0,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_Agriculture,NAME_EDUCATION_TYPE_Lower secondary,NAME_FAMILY_STATUS_Single / not married,NAME_CONTRACT_TYPE_Cash loans,ORGANIZATION_TYPE_Industry: type 2,ORGANIZATION_TYPE_Trade: type 2,NAME_FAMILY_STATUS_Married,ORGANIZATION_TYPE_Trade: type 3,NAME_INCOME_TYPE_Unemployed,...,WALLSMATERIAL_MODE_Others,NAME_TYPE_SUITE_Family,ORGANIZATION_TYPE_Industry: type 3,ORGANIZATION_TYPE_Industry: type 11,OCCUPATION_TYPE_Cooking staff,ORGANIZATION_TYPE_Industry: type 8,ORGANIZATION_TYPE_XNA,ORGANIZATION_TYPE_Military,ORGANIZATION_TYPE_Medicine,"WALLSMATERIAL_MODE_Stone, brick"
0,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [34]:
test_1h[get_extra_columns(test_features.columns.values, test_1h.columns.values)].head()

Unnamed: 0,ORGANIZATION_TYPE_University,ORGANIZATION_TYPE_Agriculture,NAME_EDUCATION_TYPE_Lower secondary,NAME_FAMILY_STATUS_Single / not married,NAME_CONTRACT_TYPE_Cash loans,ORGANIZATION_TYPE_Industry: type 2,ORGANIZATION_TYPE_Trade: type 2,NAME_FAMILY_STATUS_Married,ORGANIZATION_TYPE_Trade: type 3,NAME_INCOME_TYPE_Unemployed,...,WALLSMATERIAL_MODE_Others,NAME_TYPE_SUITE_Family,ORGANIZATION_TYPE_Industry: type 3,ORGANIZATION_TYPE_Industry: type 11,OCCUPATION_TYPE_Cooking staff,ORGANIZATION_TYPE_Industry: type 8,ORGANIZATION_TYPE_XNA,ORGANIZATION_TYPE_Military,ORGANIZATION_TYPE_Medicine,"WALLSMATERIAL_MODE_Stone, brick"
0,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Imputing

We learned earlier that a lot of features are missing values. We need to deal with the missing values before we derive the polynomial feature. 

In [35]:
type(Imputer(strategy='median'))

sklearn.preprocessing.imputation.Imputer

In [36]:
def impute(train_data: pd.DataFrame, test_data: pd.DataFrame, strategy: str) -> tuple:
    imputer = Imputer(strategy=strategy)
    
    train_imputed = imputer.fit_transform(train_data)
    train_features = pd.DataFrame(train_imputed, columns=train_data.columns)
    
    test_imputed = imputer.transform(test_data) 
    test_features = pd.DataFrame(test_imputed, columns=test_data.columns)
    
    return train_features, test_features

In [37]:
train_imputed, test_imputed = impute(train_1h, test_1h, strategy='median')

In [38]:
def count_missing_stats(data: pd.DataFrame) -> int:
    return len([(row, stat) for row, stat in (data.isnull().sum()/data.shape[0]).items() if stat>0])    

In [39]:
print ("Columns having null valies: {} (before Imputing), {} (after Imputing)"
       .format(count_missing_stats(train_1h),
               count_missing_stats(train_imputed)))

Columns having null valies: 61 (before Imputing), 0 (after Imputing)


In [40]:
print ("Columns having null valies: {} (before Imputing), {} (after Imputing)"
       .format(count_missing_stats(test_1h), 
               count_missing_stats(test_imputed)))

Columns having null valies: 58 (before Imputing), 0 (after Imputing)


### Feature Engineering

#### Polynomial feature from correlated features

In our analysis, we learned that `TARGET` is positively correlated with `DAYS_BIRTH` and negatively correlated with `EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`. It is worth trying out polynomial features. `scikit-learn` provides a utitlity to generate one.

In [41]:
def EngieerPolynomialFeatures(train_data: pd.DataFrame, 
                              test_data: pd.DataFrame, 
                              feature_columns: list, 
                              degree: int, 
                              merge_id: str, 
                              merge_how:str) -> tuple:
    poly_transformer = PolynomialFeatures(degree=degree)
    
    train_features = poly_transformer.fit_transform(train_data[feature_columns])
    test_features = poly_transformer.transform(test_data[feature_columns])
    
    engineered_column_names = poly_transformer.get_feature_names(feature_columns)
    
    poly_df_train = pd.DataFrame(train_features, columns=engineered_column_names)
    poly_df_train[merge_id] = train_data[merge_id]
    poly_train_features = train_data.merge(poly_df_train, how=merge_how, on=merge_id)
    
    poly_df_test = pd.DataFrame(test_features, columns=engineered_column_names)
    poly_df_test[merge_id] = test_data[merge_id]
    poly_test_features = test_data.merge(poly_df_test, how=merge_how, on=merge_id)
    
    return poly_train_features, poly_test_features

In [42]:
feature_columns = ['DAYS_BIRTH', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']
degree = 3    
merge_id = 'SK_ID_CURR'
merge_how = 'left'
poly_train_features, poly_test_features = EngieerPolynomialFeatures(train_data=train_imputed,
                                                                    test_data=test_imputed,
                                                                    feature_columns=feature_columns, 
                                                                    degree=degree, 
                                                                    merge_id=merge_id, 
                                                                    merge_how=merge_how)

In [43]:
poly_train_features.shape, poly_test_features.shape

((307511, 277), (48744, 277))

Lets look at the polynomial features

In [44]:
poly_train_features[get_extra_columns(train_imputed.columns.values, poly_train_features.columns.values)].head()

Unnamed: 0,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_3^2,DAYS_BIRTH EXT_SOURCE_2,DAYS_BIRTH EXT_SOURCE_3^2,EXT_SOURCE_1^3,DAYS_BIRTH^2 EXT_SOURCE_2,DAYS_BIRTH_x,EXT_SOURCE_1_x,DAYS_BIRTH EXT_SOURCE_1,EXT_SOURCE_3_y,...,EXT_SOURCE_1 EXT_SOURCE_2,EXT_SOURCE_1 EXT_SOURCE_3^2,EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3,EXT_SOURCE_1^2 EXT_SOURCE_3,DAYS_BIRTH EXT_SOURCE_2^2,DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_3,DAYS_BIRTH^2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_3,EXT_SOURCE_2_x,EXT_SOURCE_2^2
0,0.009637,0.019426,-2487.756636,-183.785678,0.000573,23536670.0,-9461.0,0.083037,-785.612748,0.139376,...,0.021834,0.001613,0.003043,0.000961,-654.152107,-109.49539,12475600.0,0.011573,0.262949,0.069142
1,0.207254,0.286521,-10431.950422,-4803.518937,0.030158,174891600.0,-16765.0,0.311267,-5218.396475,0.535276,...,0.193685,0.089185,0.103675,0.051861,-6491.237078,-2793.283699,150447500.0,0.166614,0.622246,0.38719
2,0.225464,0.532268,-10587.90154,-10137.567875,0.129553,201657200.0,-19046.0,0.505998,-9637.236584,0.729567,...,0.28129,0.269326,0.20522,0.186794,-5885.942404,-7031.006802,264650400.0,0.369159,0.555912,0.309038
3,0.226462,0.286521,-12361.644326,-5445.325225,0.129553,234933100.0,-19005.0,0.505998,-9616.490669,0.535276,...,0.329122,0.144979,0.176171,0.137049,-8040.528832,-5147.479068,193336400.0,0.270849,0.650442,0.423074
4,0.055754,0.286521,-6432.819536,-5710.929881,0.129553,128219000.0,-19932.0,0.505998,-10085.550751,0.535276,...,0.163305,0.144979,0.087413,0.137049,-2076.117157,-5398.55579,212657000.0,0.270849,0.322738,0.10416


In [45]:
poly_test_features[get_extra_columns(test_imputed.columns.values, poly_test_features.columns.values)].head()

Unnamed: 0,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_3^2,DAYS_BIRTH EXT_SOURCE_2,DAYS_BIRTH EXT_SOURCE_3^2,EXT_SOURCE_1^3,DAYS_BIRTH^2 EXT_SOURCE_2,DAYS_BIRTH_x,EXT_SOURCE_1_x,DAYS_BIRTH EXT_SOURCE_1,EXT_SOURCE_3_y,...,EXT_SOURCE_1 EXT_SOURCE_2,EXT_SOURCE_1 EXT_SOURCE_3^2,EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3,EXT_SOURCE_1^2 EXT_SOURCE_3,DAYS_BIRTH EXT_SOURCE_2^2,DAYS_BIRTH EXT_SOURCE_1 EXT_SOURCE_3,DAYS_BIRTH^2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_3,EXT_SOURCE_2_x,EXT_SOURCE_2^2
0,0.099469,0.025446,-15193.73937,-489.615795,0.426302,292342700.0,-19241.0,0.752614,-14481.055414,0.15952,...,0.594305,0.019151,0.094803,0.090356,-11997.802403,-2310.011305,59056700.0,0.120057,0.789654,0.623554
1,0.036829,0.187456,-5268.46553,-3386.201665,0.180353,95169560.0,-18064.0,0.56499,-10205.983005,0.432962,...,0.164783,0.105911,0.071345,0.138207,-1536.577117,-4418.799416,141278900.0,0.244619,0.291656,0.085063
2,0.299203,0.37331,-14022.328504,-7480.393855,0.129553,280979400.0,-20038.0,0.505998,-10139.186531,0.610991,...,0.354091,0.188894,0.216346,0.156434,-9812.640816,-6194.955045,245326100.0,0.30916,0.699787,0.489702
3,0.159163,0.375406,-7123.246872,-5246.681115,0.145311,99554500.0,-13976.0,0.525734,-7347.658072,0.612704,...,0.267955,0.197364,0.164177,0.169349,-3630.555667,-4501.941285,119678600.0,0.322119,0.509677,0.259771
4,0.096997,0.286521,-5550.962315,-3736.229463,0.00826,72384550.0,-13040.0,0.202145,-2635.970697,0.535276,...,0.086051,0.057919,0.046061,0.021873,-2362.974127,-1410.972511,91019230.0,0.108203,0.425687,0.18121


In [47]:
scaler_poly = MinMaxScaler(feature_range=(0, 1))

In [48]:
train_scaled_poly = scaler_poly.fit_transform(poly_train_features)
test_scaled_poly = scaler_poly.transform(poly_test_features)

In [69]:
train_scaled_poly_df = pd.DataFrame(train_scaled_poly, columns=poly_train_features.columns)
train_scaled_poly_df.head()

Unnamed: 0,SK_ID_CURR,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH_x,DAYS_EMPLOYED,DAYS_REGISTRATION,...,EXT_SOURCE_1^3,EXT_SOURCE_1^2 EXT_SOURCE_2,EXT_SOURCE_1^2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_2^2,EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_3^2,EXT_SOURCE_2^3,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_2 EXT_SOURCE_3^2,EXT_SOURCE_3^3
0,0.0,0.0,0.001512,0.090287,0.090032,0.077441,0.256321,0.888839,0.045086,0.85214,...,0.000638,0.002563,0.001302,0.009369,0.00492,0.002314,0.029088,0.01633,0.008212,0.003764
1,3e-06,0.0,0.002089,0.311736,0.132924,0.271605,0.045016,0.477114,0.043648,0.951929,...,0.033798,0.085217,0.070268,0.196667,0.167608,0.127953,0.385468,0.351196,0.286645,0.213204
2,6e-06,0.0,0.000358,0.022472,0.020025,0.023569,0.134897,0.348534,0.046161,0.827335,...,0.145203,0.201187,0.253093,0.255173,0.331772,0.386401,0.274866,0.382054,0.475732,0.53983
3,1.1e-05,0.0,0.000935,0.066837,0.109477,0.063973,0.107023,0.350846,0.038817,0.601451,...,0.145203,0.235398,0.185692,0.349332,0.28481,0.208001,0.440278,0.383744,0.299633,0.213204
4,1.4e-05,0.0,0.000819,0.116854,0.078975,0.117845,0.39288,0.298591,0.03882,0.825268,...,0.145203,0.1168,0.185692,0.086005,0.141318,0.208001,0.053784,0.094477,0.148673,0.213204


In [70]:
test_scaled_poly_df = pd.DataFrame(test_scaled_poly, columns=poly_test_features.columns)
test_scaled_poly_df.head()

Unnamed: 0,SK_ID_CURR,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH_x,DAYS_EMPLOYED,DAYS_REGISTRATION,...,EXT_SOURCE_1^3,EXT_SOURCE_1^2 EXT_SOURCE_2,EXT_SOURCE_1^2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_2^2,EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3,EXT_SOURCE_1 EXT_SOURCE_3^2,EXT_SOURCE_2^3,EXT_SOURCE_2^2 EXT_SOURCE_3,EXT_SOURCE_2 EXT_SOURCE_3^2,EXT_SOURCE_3^3
0,-3e-06,0.0,0.000935,0.130787,0.073886,0.102132,0.257,0.337542,0.04067,0.790451,...,0.477807,0.632235,0.122427,0.765808,0.153265,0.027476,0.787795,0.168552,0.032307,0.005643
1,8e-06,0.0,0.000627,0.044387,0.061443,0.034792,0.491595,0.40389,0.035085,0.630431,...,0.202141,0.131598,0.187261,0.078425,0.11534,0.15195,0.039693,0.062408,0.087901,0.112826
2,3.1e-05,0.0,0.001512,0.154373,0.26583,0.147026,0.260475,0.292616,0.035114,0.911843,...,0.145203,0.253256,0.211958,0.404346,0.34976,0.271006,0.548276,0.507007,0.420012,0.317079
3,7.3e-05,0.105263,0.002474,0.382022,0.184872,0.382716,0.361433,0.634329,0.041879,0.918936,...,0.162865,0.199124,0.229456,0.222859,0.265419,0.283157,0.21183,0.269705,0.307626,0.319753
4,0.000101,0.052632,0.00132,0.144944,0.118761,0.145903,0.134897,0.687091,0.04103,0.837873,...,0.009255,0.024587,0.029636,0.059775,0.074465,0.083096,0.123417,0.164364,0.196098,0.213204


Model creation

In [52]:
import lightgbm

In [54]:
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import gc

In [141]:
def gbm_model(train: pd.DataFrame, target: pd.DataFrame, test: pd.DataFrame, id_field: str, n_splits: int):
    feature_columns = list(set(train.columns).difference(id_field))
    train_ids = train[id_field]
    train_wo_ids = train[feature_columns]
    test_wo_ids = test[feature_columns]
    
    train_matrix = train_wo_ids.values
    target_matrix = target.values
    test_matrix = test_wo_ids.values
    
    k_fold = KFold(n_splits=n_splits, shuffle=False, random_state=50)
    test_predictions = np.zeros(test_wo_ids.shape[0])
    feature_importances = np.zeros(len(feature_columns))
    out_of_fold = np.zeros(train.shape[0])
    
    valid_scores = []
    train_scores = []
    
    for train_indices, valid_indices in k_fold.split(train_matrix):
        model = lightgbm.LGBMClassifier(n_estimators=10000, 
                                        objective='binary', 
                                        class_weight='balanced', 
                                        learning_rate=0.05, 
                                        reg_alpha=0.1, 
                                        reg_lambda=0.1, 
                                        subsample=0.8, 
                                        n_jobs=-1, 
                                        random_state=50)
        train_feature_sample = train_matrix[train_indices]
        train_target_sample = target_matrix[train_indices]
        valid_feature_sample = train_matrix[valid_indices]
        valid_target_sample = target_matrix[valid_indices]
        
        model.fit(train_feature_sample, 
                  train_target_sample, 
                  eval_metric='auc', 
                  eval_set=[(valid_feature_sample, valid_target_sample), 
                            (train_feature_sample, train_target_sample)], 
                  eval_names=['valid', 'train'], 
                  categorical_feature='auto', 
                  early_stopping_rounds=100, 
                  verbose=200)
    
        feature_importances += model.feature_importances_ / k_fold.n_splits
        
        test_predictions += model.predict_proba(test_matrix, num_iteration=model.best_iteration_)[ :,1]/k_fold.n_splits
        
        out_of_fold[valid_indices] = model.predict_proba(valid_feature_sample, num_iteration=model.best_iteration_)[ :,1]
        
        valid_scores.append(model.best_score_['valid']['auc'])
        train_scores.append(model.best_score_['train']['auc'])
        
        gc.enable()
        del model, train_feature_sample, valid_feature_sample
        gc.collect()
        
    submission = pd.DataFrame({id_field: test[id_field], 'TARGET': test_predictions})
    
    feature_importances_pd = pd.DataFrame({'feature': feature_columns, 'importance': feature_importances})
    
    valid_auc = roc_auc_score(target_matrix, out_of_fold)    
    
    metrics = pd.DataFrame({'fold': list(range(n_splits)), 
                            'train': train_scores.append(np.mean(train_scores)),
                            'valid': valid_scores.append(valid_auc)})
    
    return submission, feature_importances_pd, metrics

In [142]:
submission, feature_importances, metrics = gbm_model(train=train_scaled_poly_df, 
                                                     target=train_target, 
                                                     test=test_scaled_poly_df, 
                                                     id_field='SK_ID_CURR', 
                                                     n_splits=5)

Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.757611	train's auc: 0.798551
Early stopping, best iteration is:
[261]	valid's auc: 0.758111	train's auc: 0.809237
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.760094	train's auc: 0.798631
Early stopping, best iteration is:
[194]	valid's auc: 0.760178	train's auc: 0.797521
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.751336	train's auc: 0.800254
Early stopping, best iteration is:
[242]	valid's auc: 0.751495	train's auc: 0.807426
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.759823	train's auc: 0.798622
Early stopping, best iteration is:
[181]	valid's auc: 0.759916	train's auc: 0.795287
Training until validation scores don't improve for 100 rounds.
[200]	valid's auc: 0.759447	train's auc: 0.798124
[400]	valid's auc: 0.759946	train's auc: 0.830254
Early stopping, best iteration is:
[340]	vali

In [149]:
submission['SK_ID_CURR'] = test_features['SK_ID_CURR']

In [150]:
submission.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.3471
1,100005,0.625327
2,100013,0.171674
3,100028,0.259154
4,100038,0.635496


In [152]:
submission.to_csv('gbm_submission_1.csv', index=False)