# Score:
CV 0.78879859253631179 [3800]

LB 0.783

<a id="toc"></a>

# <u>Table of Contents</u>
1.) [TODO](#todo)  
2.) [Imports](#imports)  
3.) [Bureau](#bureau)  
&nbsp;&nbsp;&nbsp;&nbsp; 3.1.) [Data Processing](#bureau_process)  
4.) [Bureau Balance](#bureau_bal)  
&nbsp;&nbsp;&nbsp;&nbsp; 4.1.) [Merge into Bureau](#merge_bureau_bal)  
5.) [Previous Application](#prev_app)  
&nbsp;&nbsp;&nbsp;&nbsp; 5.1.) [Data Processing](#prev_process)  
6.) [POS CASH balance](#pos_cash)  
&nbsp;&nbsp;&nbsp;&nbsp; 6.1.) [Data Processing](#pos_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 6.2.) [Merge into Previous Application](#merge_pos_cash)  
7.) [Installment Payments](#install_pay)  
&nbsp;&nbsp;&nbsp;&nbsp; 7.1.) [Merge into Previous Application](#merge_install_pay)  
8.) [Credit Card Balance](#credit)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.1.) [Data Processing](#credit_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.2.) [Merge into Previous Application](#merge_credit)  
9.) [Miscellaneous clean up](#misc)  
10.) [Final Data Prep](#final_merge)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.1.) [Data Processing](#final_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.2.) [Create Features](#train_feat)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.3.) [Categorical values](#train_cat)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.4.) [Merge Previous Application with Full](#merge_prev)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.5.) [Merge Bureau with Full](#merge_bureau)  
11.) [Modeling](#models)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.1.) [Feature Reduction](#feat_reduction)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.2.) [Most important features](#important_feats)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.3.) [Parameter tuning](#param_tuning)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.4.) [CV Score](#cv)  
12.) [Final submission](#final)  
&nbsp;&nbsp;&nbsp;&nbsp; 12.1.) [Final predictions](#final_pred)  

<a id="todo"></a>

# [^](#toc) <u>TODO</u>

- Fix skew on columns
- Tinker with the best way to replace missing values (dropping cols?)
- Look for outliers
- Include timeline relationships like MONTHS_BALANCE
- Add a coded value of -1 for missing numerical values
- Address [this](https://www.kaggle.com/c/home-credit-default-risk/discussion/57248)

---
<a id="imports"></a>

# [^](#toc) <u>Imports</u>

In [1]:
### Standard imports
import pandas as pd
import numpy as np

# Time keeper
import time

# Randomize seeds
import random

# Garbage collector
import gc

# Progress bar
from tqdm import tqdm

### Removes warnings from output
import warnings
warnings.filterwarnings('ignore')

### Setup

In [2]:
def get_dummies(df, cats):
    for col in cats:
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    return df 

def factorize_df(df, cats):
    for col in cats:
        df[col], _ = pd.factorize(df[col])
    return df 

DATA_PATH = "../data/home_default/"
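
To make the behaviour of the two helpers concrete, here is a small sketch on a toy frame. The frame and its values are invented for illustration. Note that `get_dummies` keeps the source column alongside the new indicator columns, and `factorize_df` encodes missing values as -1.

```python
import pandas as pd

def get_dummies(df, cats):
    # Append one indicator column per category value; the source column is kept
    for col in cats:
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    return df

def factorize_df(df, cats):
    # Replace each column with integer codes (missing values become -1)
    for col in cats:
        df[col], _ = pd.factorize(df[col])
    return df

# Toy data, not taken from the competition files
toy = pd.DataFrame({"STATUS": ["C", "0", "C", None], "KIND": ["a", "b", "a", "b"]})

encoded = get_dummies(toy.copy(), ["STATUS"])
dummy_cols = sorted(c for c in encoded.columns if c.startswith("STATUS_"))

coded = factorize_df(toy.copy(), ["STATUS", "KIND"])
status_codes = coded["STATUS"].tolist()
```

Because the source column survives `get_dummies`, the later cells drop it explicitly before aggregating.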

---
<a id="bureau"></a>

# [^](#toc) Bureau

In [3]:
bureau   = pd.read_csv(DATA_PATH + "bureau.csv")
print("Shape of bureau:", bureau.shape)

print("\nColumns of bureau:")
print(" --- ".join(bureau.columns.values))

Shape of bureau: (1716428, 17)

Columns of bureau:
SK_ID_CURR --- SK_ID_BUREAU --- CREDIT_ACTIVE --- CREDIT_CURRENCY --- DAYS_CREDIT --- CREDIT_DAY_OVERDUE --- DAYS_CREDIT_ENDDATE --- DAYS_ENDDATE_FACT --- AMT_CREDIT_MAX_OVERDUE --- CNT_CREDIT_PROLONG --- AMT_CREDIT_SUM --- AMT_CREDIT_SUM_DEBT --- AMT_CREDIT_SUM_LIMIT --- AMT_CREDIT_SUM_OVERDUE --- CREDIT_TYPE --- DAYS_CREDIT_UPDATE --- AMT_ANNUITY


<a id="bureau_process"></a>

### [^](#toc) Data Processing

In [4]:
### Lump together values with low counts
# CREDIT_CURRENCY
cols = ["currency 3", "currency 4"]
bureau.CREDIT_CURRENCY = bureau.CREDIT_CURRENCY.map(lambda x: "MISC" if x in cols else x)

# # CREDIT_TYPE
# cols = ["Cash loan (non-earmarked)", "Real estate loan", "Loan for the purchase of equipment",
#         "Loan for purchase of shares (margin lending)", "Interbank credit", "Mobile operator loan"]
# bureau.CREDIT_TYPE = bureau.CREDIT_TYPE.map(lambda x: "MISC" if x in cols else x)

<a id="bureau_bal"></a>

# [^](#toc) <u>Bureau Balance</u>

In [5]:
bureau_balance = pd.read_csv(DATA_PATH + "bureau_balance.csv")
print("Shape of bureau_balance:",  bureau_balance.shape)

print("\nColumns of bureau_balance:")
print(" --- ".join(bureau_balance.columns.values))

Shape of bureau_balance: (27299925, 3)

Columns of bureau_balance:
SK_ID_BUREAU --- MONTHS_BALANCE --- STATUS


<a id="merge_bureau_bal"></a>

### [^](#toc) <u>Merge into Bureau</u>

In [6]:
### Get sum of counts in categorical column
merge_df = get_dummies(bureau_balance, ["STATUS"])
cols = ['STATUS_0', 'STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5', 'STATUS_C', 'STATUS_X']
for col in cols:
    merge_df[col] = merge_df[col] / (merge_df["MONTHS_BALANCE"] - 1)
merge_df = merge_df.drop(["MONTHS_BALANCE", "STATUS"], axis=1)
merge_df = merge_df.groupby("SK_ID_BUREAU").sum().reset_index()

### Add the median of the rest of the columns
right    = bureau_balance.groupby("SK_ID_BUREAU").median(numeric_only=True).reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_BUREAU").set_index("SK_ID_BUREAU")

### Prefix column names
merged_cols = ['bur_bal_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
bureau = bureau.merge(right=merge_df.reset_index(), how='left', on='SK_ID_BUREAU')

# Mark missing values
bureau["no_bureau_bal"] = bureau[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del bureau_balance, merge_df, merged_cols, right
gc.collect()

126
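
The division by `MONTHS_BALANCE - 1` in the cell above acts as a recency weight. The sketch below shows the effect on toy data, assuming (as in this toy frame) that `MONTHS_BALANCE` is never positive, so the divisor is at most -1. The resulting weights are negative and shrink toward zero for older months.

```python
import pandas as pd

# Toy data, not taken from bureau_balance.csv
toy = pd.DataFrame({
    "SK_ID_BUREAU":   [1, 1, 1, 2],
    "MONTHS_BALANCE": [0, -1, -3, 0],
    "STATUS":         ["1", "0", "1", "C"],
})

dummies = pd.get_dummies(toy["STATUS"], prefix="STATUS").astype(float)
# month 0 -> weight -1, month -1 -> -1/2, month -3 -> -1/4
weighted = dummies.div(toy["MONTHS_BALANCE"] - 1, axis=0)
weighted["SK_ID_BUREAU"] = toy["SK_ID_BUREAU"]
per_loan = weighted.groupby("SK_ID_BUREAU").sum()
```

For loan 1, the two `STATUS_1` months contribute -1 and -1/4, so the aggregated value is -1.25. A more negative value means the status occurred more often or more recently.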

---
<a id="prev_app"></a>

# [^](#toc) <u>Previous Application</u>

In [7]:
prev_app = pd.read_csv(DATA_PATH + "previous_application.csv")
print("Shape of prev_app:",  prev_app.shape)

print("\nColumns of prev_app:")
print(" --- ".join(prev_app.columns.values))

Shape of prev_app: (1670214, 37)

Columns of prev_app:
SK_ID_PREV --- SK_ID_CURR --- NAME_CONTRACT_TYPE --- AMT_ANNUITY --- AMT_APPLICATION --- AMT_CREDIT --- AMT_DOWN_PAYMENT --- AMT_GOODS_PRICE --- WEEKDAY_APPR_PROCESS_START --- HOUR_APPR_PROCESS_START --- FLAG_LAST_APPL_PER_CONTRACT --- NFLAG_LAST_APPL_IN_DAY --- RATE_DOWN_PAYMENT --- RATE_INTEREST_PRIMARY --- RATE_INTEREST_PRIVILEGED --- NAME_CASH_LOAN_PURPOSE --- NAME_CONTRACT_STATUS --- DAYS_DECISION --- NAME_PAYMENT_TYPE --- CODE_REJECT_REASON --- NAME_TYPE_SUITE --- NAME_CLIENT_TYPE --- NAME_GOODS_CATEGORY --- NAME_PORTFOLIO --- NAME_PRODUCT_TYPE --- CHANNEL_TYPE --- SELLERPLACE_AREA --- NAME_SELLER_INDUSTRY --- CNT_PAYMENT --- NAME_YIELD_GROUP --- PRODUCT_COMBINATION --- DAYS_FIRST_DRAWING --- DAYS_FIRST_DUE --- DAYS_LAST_DUE_1ST_VERSION --- DAYS_LAST_DUE --- DAYS_TERMINATION --- NFLAG_INSURED_ON_APPROVAL


### Create Features

In [8]:
prev_app["ASKED_ACTUAL_RATIO"] = prev_app["AMT_APPLICATION"] / prev_app["AMT_CREDIT"]

<a id="prev_process"></a>

### [^](#toc) Data Processing

In [9]:
### Fill in values that should be null
# Assign the result back instead of calling replace(inplace=True) on a column
# selection, which newer pandas versions may not propagate to prev_app
sentinel_cols = ['DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
                 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
prev_app[sentinel_cols] = prev_app[sentinel_cols].replace(365243, np.nan)

### Lump together values with low counts
# NAME_GOODS_CATEGORY
prev_app.NAME_GOODS_CATEGORY = prev_app.NAME_GOODS_CATEGORY.map(
    lambda x: "MISC" if x in ["Weapon", "Insurance"] else x)

# NAME_CASH_LOAN_PURPOSE
prev_app.NAME_CASH_LOAN_PURPOSE = prev_app.NAME_CASH_LOAN_PURPOSE.map(
    lambda x: "MISC" if x in ["Buying a garage", "Misc"] else x)

# Create features (same ratio as ASKED_ACTUAL_RATIO; min_max_cols refers to this name later)
prev_app["APP_CREDIT_PERC"] = prev_app['AMT_APPLICATION'] / prev_app['AMT_CREDIT']

---
<a id="pos_cash"></a>

# [^](#toc) <u>POS CASH balance</u>

In [10]:
pcb = pd.read_csv(DATA_PATH + "POS_CASH_balance.csv")
print("Shape of pcb:",  pcb.shape)

print("\nColumns of pcb:")
print(" --- ".join(pcb.columns.values))

Shape of pcb: (10001358, 8)

Columns of pcb:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- CNT_INSTALMENT --- CNT_INSTALMENT_FUTURE --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="pos_process"></a>

### [^](#toc) Data Processing

In [11]:
# Remove Outliers
pcb = pcb.drop(pcb[pcb.NAME_CONTRACT_STATUS.isin(["XNA", "Canceled"])].index)

<a id="merge_pos_cash"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [12]:
### Get Dummies
merge_df = pcb[["SK_ID_PREV", "NAME_CONTRACT_STATUS"]]
merge_df = get_dummies(merge_df, ["NAME_CONTRACT_STATUS"])
merge_df = merge_df.drop("NAME_CONTRACT_STATUS", axis=1)

# Prep for merge
count    = merge_df.groupby("SK_ID_PREV").count()
merge_df = merge_df.groupby("SK_ID_PREV").sum().reset_index()
merge_df["N"] = list(count.iloc[:,0])

### Add the median of the rest of the columns
right    = pcb.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median(numeric_only=True).reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['pos_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_pcb"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del pcb, count, merge_df, merged_cols, right
gc.collect()

149
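
Each merge step in this notebook ends by flagging rows that found no match. A minimal sketch of that pattern on toy data follows. The frames are invented, and `isna().astype(int)` is used here as an equivalent of the `map(lambda ...)` form above.

```python
import pandas as pd

# Toy stand-ins for prev_app and the aggregated POS CASH frame
prev = pd.DataFrame({"SK_ID_PREV": [10, 11, 12]})
agg = pd.DataFrame({"SK_ID_PREV": [10, 12], "pos_N": [3, 1]})

merged = prev.merge(agg, how="left", on="SK_ID_PREV")
# Rows without a match get NaN in every aggregated column; flag them explicitly
merged["no_pcb"] = merged["pos_N"].isna().astype(int)
flags = merged["no_pcb"].tolist()
```

The flag lets a tree model distinguish "no history at all" from a history whose aggregates happen to be missing.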

---
<a id="install_pay"></a>

# [^](#toc) <u>Installment Payments</u>

In [13]:
install_pay = pd.read_csv(DATA_PATH + "installments_payments.csv")
print("Shape of install_pay:",  install_pay.shape)

print("\nColumns of install_pay:")
print(" --- ".join(install_pay.columns.values))

Shape of install_pay: (13605401, 8)

Columns of install_pay:
SK_ID_PREV --- SK_ID_CURR --- NUM_INSTALMENT_VERSION --- NUM_INSTALMENT_NUMBER --- DAYS_INSTALMENT --- DAYS_ENTRY_PAYMENT --- AMT_INSTALMENT --- AMT_PAYMENT


<a id="merge_install_pay"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [14]:
### Create new feature
install_pay["AMT_MISSING"]  = install_pay["AMT_INSTALMENT"]     - install_pay["AMT_PAYMENT"]
install_pay['PAYMENT_PERC'] = install_pay['AMT_PAYMENT']        / install_pay['AMT_INSTALMENT']

# Days past due and days before due (no negative values)
install_pay['DPD']          = install_pay['DAYS_ENTRY_PAYMENT'] - install_pay['DAYS_INSTALMENT']
install_pay['DBD']          = install_pay['DAYS_INSTALMENT']    - install_pay['DAYS_ENTRY_PAYMENT']
install_pay['DPD']          = install_pay['DPD'].apply(lambda x: x if x > 0 else 0)
install_pay['DBD']          = install_pay['DBD'].apply(lambda x: x if x > 0 else 0)

# Amount of values missing in AMT_PAYMENT
install_pay["temp"]         = install_pay["AMT_PAYMENT"].map(lambda x: 1 if np.isnan(x) else 0)

### Select important features
merge_df = pd.DataFrame({
    "missing_max": install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].max(),
    "missing_min": install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].min(),
    "payment_max": install_pay.groupby("SK_ID_PREV")['PAYMENT_PERC'].max(),
    "payment_min": install_pay.groupby("SK_ID_PREV")['PAYMENT_PERC'].min(),
    
    "payment_nan": install_pay.groupby("SK_ID_PREV")["temp"].sum(),
    "N":           install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].count(),
    
    "unique_ver":  install_pay.groupby("SK_ID_PREV")["NUM_INSTALMENT_VERSION"].nunique()
})

# Delete temp column
install_pay = install_pay.drop("temp", axis=1)

# Select median of everything
right = install_pay.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median().reset_index()

### Merge the two
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['install_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_install"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del install_pay, merge_df, merged_cols, right
gc.collect()

56
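
The days-past-due and days-before-due features can also be written with `clip`, as sketched below on toy data. One difference to be aware of: `clip` leaves missing values as NaN, whereas the `apply(lambda ...)` form above turns them into 0.

```python
import pandas as pd

# Toy data, not taken from installments_payments.csv
toy = pd.DataFrame({
    "DAYS_INSTALMENT":    [-100, -70, -40],
    "DAYS_ENTRY_PAYMENT": [-105, -70, -33],
})

diff = toy["DAYS_ENTRY_PAYMENT"] - toy["DAYS_INSTALMENT"]
toy["DPD"] = diff.clip(lower=0)      # days past due: late payments only
toy["DBD"] = (-diff).clip(lower=0)   # days before due: early payments only
```

The first payment was made 5 days early, the second on time, and the third 7 days late.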

---
<a id="credit"></a>

# [^](#toc) <u>Credit Card Balance</u>

In [15]:
credit_card = pd.read_csv(DATA_PATH + "credit_card_balance.csv")
print("Shape of credit_card:",  credit_card.shape)

print("\nColumns of credit_card:")
print(" --- ".join(credit_card.columns.values))

Shape of credit_card: (3840312, 23)

Columns of credit_card:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- AMT_BALANCE --- AMT_CREDIT_LIMIT_ACTUAL --- AMT_DRAWINGS_ATM_CURRENT --- AMT_DRAWINGS_CURRENT --- AMT_DRAWINGS_OTHER_CURRENT --- AMT_DRAWINGS_POS_CURRENT --- AMT_INST_MIN_REGULARITY --- AMT_PAYMENT_CURRENT --- AMT_PAYMENT_TOTAL_CURRENT --- AMT_RECEIVABLE_PRINCIPAL --- AMT_RECIVABLE --- AMT_TOTAL_RECEIVABLE --- CNT_DRAWINGS_ATM_CURRENT --- CNT_DRAWINGS_CURRENT --- CNT_DRAWINGS_OTHER_CURRENT --- CNT_DRAWINGS_POS_CURRENT --- CNT_INSTALMENT_MATURE_CUM --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="credit_process"></a>

### [^](#toc) <u>Data Processing</u>

In [16]:
# Gets indices with outlier values
temp = credit_card[credit_card.NAME_CONTRACT_STATUS.isin(["Refused", "Approved"])].index

# Drops outlier values
credit_card = credit_card.drop(temp, axis=0)

### Create Features

In [17]:
credit_card["AVG_DRAWINGS_ATM_CURRENT"]   = (credit_card["AMT_DRAWINGS_ATM_CURRENT"] /
                                             credit_card["CNT_DRAWINGS_ATM_CURRENT"])

credit_card["AVG_DRAWINGS_CURRENT"]       = (credit_card["AMT_DRAWINGS_CURRENT"] /
                                             credit_card["CNT_DRAWINGS_CURRENT"])

credit_card["AVG_DRAWINGS_OTHER_CURRENT"] = (credit_card["AMT_DRAWINGS_OTHER_CURRENT"] /
                                             credit_card["CNT_DRAWINGS_OTHER_CURRENT"])

credit_card["AVG_DRAWINGS_POS_CURRENT"]   = (credit_card["AMT_DRAWINGS_POS_CURRENT"] /
                                             credit_card["CNT_DRAWINGS_POS_CURRENT"])
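
The average-drawing ratios divide an amount by a count, and the count can be zero. The following sketch, on invented values, shows what pandas produces in that case and one possible way to normalise it. Replacing `inf` with NaN is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from credit_card_balance.csv
toy = pd.DataFrame({
    "AMT_DRAWINGS_CURRENT": [300.0, 0.0, 50.0],
    "CNT_DRAWINGS_CURRENT": [3.0, 0.0, 0.0],
})

ratio = toy["AMT_DRAWINGS_CURRENT"] / toy["CNT_DRAWINGS_CURRENT"]
# 0/0 gives NaN and 50/0 gives inf; map inf to NaN so both read as "undefined"
toy["AVG_DRAWINGS_CURRENT"] = ratio.replace([np.inf, -np.inf], np.nan)
avg = toy["AVG_DRAWINGS_CURRENT"]
```

Leaving `inf` in place would distort the later `median` and `max` aggregations for any loan that contains such a row.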

<a id="merge_credit"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [18]:
### Create features
merge_df = pd.DataFrame({
    "AMT_BALANCE": credit_card.groupby("SK_ID_PREV").AMT_BALANCE.mean(),
    "SK_DPD":      credit_card.groupby("SK_ID_PREV").SK_DPD.max(),
    "SK_DPD_DEF":  credit_card.groupby("SK_ID_PREV").SK_DPD_DEF.max(),
    "N":           credit_card.groupby("SK_ID_PREV").count().iloc[:,0]
})

### Categorical column
temp = get_dummies(credit_card, ["NAME_CONTRACT_STATUS"])
cols = ['NAME_CONTRACT_STATUS_Active',
       'NAME_CONTRACT_STATUS_Completed', 'NAME_CONTRACT_STATUS_Demand',
       'NAME_CONTRACT_STATUS_Sent proposal', 'NAME_CONTRACT_STATUS_Signed']
for col in cols:
    temp[col] = temp[col] / (temp["MONTHS_BALANCE"] - 1)
cols.extend(["SK_ID_PREV"])
temp = temp[cols]
temp = temp.groupby("SK_ID_PREV").sum()

# Merge categorical and numerical df
merge_df = temp.join(merge_df)

### Add the rest of the columns
right = credit_card.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median(numeric_only=True).reset_index()
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['credit_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_credit"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del credit_card, merge_df, merged_cols, right
gc.collect()

56

---
<a id="misc"></a>

# [^](#toc) <u>Miscellaneous clean up</u>

In [19]:
### Drop unneeded ID columns
prev_app = prev_app.drop("SK_ID_PREV", axis=1)
bureau   = bureau.drop("SK_ID_BUREAU", axis=1)

---
<a id="final_merge"></a>

# [^](#toc) <u>Final Data Prep</u>

In [35]:
train = pd.read_csv(DATA_PATH + "train.csv")
test  = pd.read_csv(DATA_PATH + "test.csv")

print("Shape of train:", train.shape)
print("Shape of test:",  test.shape)

Shape of train: (307511, 122)
Shape of test: (48744, 121)


### Split into predictors, target, and id

In [36]:
train_y = train.TARGET
train_x = train.drop(["TARGET"], axis=1)

test_id = test.SK_ID_CURR
test_x  = test

### Merge train and test data

In [37]:
full    = pd.concat([train_x, test_x])
train_N = len(train_x)

<a id="final_process"></a>

### [^](#toc) <u>Data Processing</u>

In [38]:
### Replace maxed values with NaN
full['DAYS_EMPLOYED'] = full['DAYS_EMPLOYED'].replace(365243, np.nan)

# ### Fill in outlier values
# full["CODE_GENDER"]        = full["CODE_GENDER"].map(lambda x: "F" if x == "XNA" else x)
# full["NAME_FAMILY_STATUS"] = full["NAME_FAMILY_STATUS"].map(lambda x: "Married" if x == "Unknown" else x)

# # NAME_INCOME_TYPE
# cols = ["Unemployed", "Student", "Businessman", "Maternity leave"]
# full["NAME_INCOME_TYPE"] = full["NAME_INCOME_TYPE"].map(lambda x: "MISC" if x in cols else x)

# # ORGANIZATION_TYPE
# cols = ["Trade: type 4", "Trade: type 5"]
# full["ORGANIZATION_TYPE"] = full["ORGANIZATION_TYPE"].map(lambda x: "MISC Trade" if x in cols else x)
# cols = ["Industry: type 13", "Industry: type 8"]
# full["ORGANIZATION_TYPE"] = full["ORGANIZATION_TYPE"].map(lambda x: "MISC Industry" if x in cols else x)

<a id="train_cat"></a>

### [^](#toc) Categorical values

In [39]:
# ### Get dummies
# cols  = ["WALLSMATERIAL_MODE", "NAME_TYPE_SUITE", "NAME_INCOME_TYPE", "NAME_FAMILY_STATUS",
#                "NAME_HOUSING_TYPE", "OCCUPATION_TYPE", "WEEKDAY_APPR_PROCESS_START", "ORGANIZATION_TYPE",
#                "FONDKAPREMONT_MODE", "NAME_EDUCATION_TYPE"]
# full = get_dummies(full, cols)
# full = full.drop(cols, axis=1)

# ### Factorize the dataframe
# cols = ["NAME_CONTRACT_TYPE", "CODE_GENDER", "FLAG_OWN_CAR",
#                "FLAG_OWN_REALTY", "HOUSETYPE_MODE", "EMERGENCYSTATE_MODE"]
# full = factorize_df(full, cols)

<a id="train_feat"></a>

### [^](#toc) Create Features

FIXME:

- Scale age and create an age-in-years feature
- Monthly debt payments, alimony, and living costs divided by monthly gross income
- Total balance on credit cards and personal lines of credit (excluding real estate and installment debt such as car loans) divided by the sum of credit limits

In [40]:
full['DAYS_EMPLOYED_PERC']  = full['DAYS_EMPLOYED']    / full['DAYS_BIRTH']
full['INCOME_CREDIT_PERC']  = full['AMT_INCOME_TOTAL'] / full['AMT_CREDIT']
full['INCOME_PER_PERSON']   = full['AMT_INCOME_TOTAL'] / full['CNT_FAM_MEMBERS']
full['ANNUITY_INCOME_PERC'] = full['AMT_ANNUITY']      / full['AMT_INCOME_TOTAL']
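# FLAG_OWN_CAR / FLAG_OWN_REALTY still hold "Y"/"N" strings here, so count the "Y" flags
# (adding the raw columns would concatenate the strings instead of counting)
full['NUM_PROPERTY']        = ((full['FLAG_OWN_CAR'] == 'Y').astype(int) +
                               (full['FLAG_OWN_REALTY'] == 'Y').astype(int))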

### Create feature marking number of enquiries
full["NUM_ENQUIRIES"]       = np.zeros(len(full))
for enquiry in tqdm(("AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_DAY",
                     "AMT_REQ_CREDIT_BUREAU_WEEK", "AMT_REQ_CREDIT_BUREAU_MON",
                     "AMT_REQ_CREDIT_BUREAU_QRT",  "AMT_REQ_CREDIT_BUREAU_YEAR")):
    full["NUM_ENQUIRIES"] += full[enquiry]
    
### Create feature marking number of discrepancies
full["DISCREPANCIES"]       = np.zeros(len(full))
for discrepancy in ("REG_REGION_NOT_LIVE_REGION", "REG_REGION_NOT_WORK_REGION",
                    "LIVE_REGION_NOT_WORK_REGION", "REG_CITY_NOT_LIVE_CITY",
                    "REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY"):
    full["DISCREPANCIES"] += full[discrepancy]
    
### Create feature marking number of info provided
full["PROVIDE_INFO"]       = np.zeros(len(full))
for info in ("FLAG_MOBIL", "FLAG_EMP_PHONE", "FLAG_WORK_PHONE",
             "FLAG_PHONE", "FLAG_EMAIL"):
    full["PROVIDE_INFO"] += full[info]

### Create feature marking number of flags
full["NUM_FLAGS"]           = np.zeros(len(full))
for flag in tqdm(('FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
                 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
                 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
                 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
                 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
                 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
                 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21')):
    full["NUM_FLAGS"] += full[flag]

100%|██████████| 6/6 [00:00<00:00, 187.38it/s]
100%|██████████| 20/20 [00:00<00:00, 245.62it/s]
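
The counting loops above can also be expressed as a single row-wise sum. This is a sketch on toy data with only three of the flag columns. Passing `skipna=False` mirrors the `+=` loop, where one missing flag makes the whole row total NaN.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the application files
toy = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 0],
    "FLAG_PHONE": [0, 1, 0],
    "FLAG_EMAIL": [1, np.nan, 0],
})

flag_cols = ["FLAG_MOBIL", "FLAG_PHONE", "FLAG_EMAIL"]
# skipna=False: a missing flag propagates to the total, as in the loop version
toy["PROVIDE_INFO"] = toy[flag_cols].sum(axis=1, skipna=False)
```

With the default `skipna=True`, the second row would instead count only the flags that are present.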


<a id="merge_prev"></a>

### [^](#toc) Merge Previous Application with Full

In [41]:
cat_cols = [
        "NAME_CONTRACT_TYPE", "WEEKDAY_APPR_PROCESS_START",
        "FLAG_LAST_APPL_PER_CONTRACT", "NAME_CASH_LOAN_PURPOSE",
        "NAME_CONTRACT_STATUS", "NAME_PAYMENT_TYPE",
        "CODE_REJECT_REASON", "NAME_TYPE_SUITE", "NAME_CLIENT_TYPE",
        "NAME_GOODS_CATEGORY", "NAME_PORTFOLIO", "NAME_PRODUCT_TYPE",
        "CHANNEL_TYPE", "NAME_SELLER_INDUSTRY", "NAME_YIELD_GROUP",
        "PRODUCT_COMBINATION", "SK_ID_CURR"]
min_max_cols = ['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT',
                 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'HOUR_APPR_PROCESS_START',
                 'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
                 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
                 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
                 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
                 'NFLAG_INSURED_ON_APPROVAL', 'APP_CREDIT_PERC', "SK_ID_CURR"]
num_cols = [col for col in prev_app.columns if col not in cat_cols]
num_cols.append("SK_ID_CURR")

### numerical columns - median
merge_df         = prev_app[num_cols].groupby('SK_ID_CURR').median()
merge_df.columns = ["MED_" + col for col in merge_df.columns]
print("Selected median of numerical columns")

### numerical columns - max
right         = prev_app[min_max_cols].groupby("SK_ID_CURR").max()
right.columns = ["MAX_" + col for col in right.columns]
print("Selected max of numerical columns")

### Merge median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged median and max")

### numerical columns - min
right = prev_app[min_max_cols].groupby("SK_ID_CURR").min()
right.columns = ["MIN_" + col for col in right.columns]
print("Selected min of numerical columns")

### Merge min with median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged min with median and max")

### Categorical columns
right = prev_app[cat_cols].set_index("SK_ID_CURR")
right = pd.get_dummies(right).reset_index()
right = right.groupby("SK_ID_CURR").sum().reset_index()
print("Selected categorical columns")

### Merge categorical and numerical
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged categorical and numerical")

### Prefix column names
merge_df["N"]    = prev_app.groupby('SK_ID_CURR').count().iloc[:,0]
merged_cols      = ['p_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')
print("Merged into full")

# Mark missing values
full["no_prev_app"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)
print("Marked missing values")

### Delete old variables
del merge_df, merged_cols, right, cat_cols, num_cols
gc.collect()

Selected median of numerical columns
Selected max of numerical columns
Merged median and max
Selected min of numerical columns
Merged min with median and max
Selected categorical columns
Merged categorical and numerical
Merged into full
Marked missing values


451
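
The median / max / min steps above can be condensed into one `agg` call. The sketch below uses invented values and applies all three statistics to every column, whereas the notebook takes max and min only over `min_max_cols`. The `MED_` / `MAX_` / `MIN_` prefixes follow the naming used in the cell.

```python
import pandas as pd

# Toy data, not taken from previous_application.csv
toy = pd.DataFrame({
    "SK_ID_CURR":  [1, 1, 1, 2],
    "AMT_CREDIT":  [100.0, 300.0, 200.0, 50.0],
    "CNT_PAYMENT": [12.0, 24.0, 6.0, 10.0],
})

agg = toy.groupby("SK_ID_CURR").agg(["median", "max", "min"])
# Flatten the (column, statistic) MultiIndex into names like "MED_AMT_CREDIT"
prefix = {"median": "MED", "max": "MAX", "min": "MIN"}
agg.columns = [f"{prefix[stat]}_{col}" for col, stat in agg.columns]
# Number of previous applications per client
agg["N"] = toy.groupby("SK_ID_CURR").size()
```

A single `agg` avoids the repeated `reset_index` / `merge` / `set_index` round trips, at the cost of computing every statistic for every column.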

<a id="merge_bureau"></a>

### [^](#toc) Merge Bureau with Full

In [42]:
cat_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE', 'SK_ID_CURR']
num_cols = [col for col in bureau.columns if col not in cat_cols]
num_cols.append("SK_ID_CURR")

### Numeric columns - median
merge_df         = bureau[num_cols].groupby('SK_ID_CURR').median()
merge_df.columns = ["MED_" + col for col in merge_df.columns]
print("Selected median of numerical columns")

### Numeric columns - max
right         = bureau[num_cols].groupby("SK_ID_CURR").max()
right.columns = ["MAX_" + col for col in right.columns]
print("Selected max of numerical columns")

### Merge median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged median and max")

### Numeric columns - min
right = bureau[num_cols].groupby("SK_ID_CURR").min()
right.columns = ["MIN_" + col for col in right.columns]
print("Selected min of numerical columns")

### Merge min with median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged min with median and max")

### Categorical columns
right = bureau[cat_cols].set_index("SK_ID_CURR")
right = pd.get_dummies(right).reset_index()
right = right.groupby("SK_ID_CURR").sum()
right["NUM_UNQ_CREDIT"] = bureau.groupby("SK_ID_CURR")["CREDIT_TYPE"].nunique()
print("Selected categorical columns")

### Merge categorical and numeric
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged categorical and numerical")

### Prefix column names
merge_df["N"] = bureau.groupby('SK_ID_CURR').count().iloc[:,0]
merged_cols      = ['b_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')
print("Merged into full")

# Mark missing values
full["no_bureau"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)
print("Marked missing values")

### Delete old variables
del merge_df, merged_cols, right, cat_cols, num_cols
gc.collect()

Selected median of numerical columns
Selected max of numerical columns
Merged median and max
Selected min of numerical columns
Merged min with median and max
Selected categorical columns
Merged categorical and numerical
Merged into full
Marked missing values


63

### Delete unneeded columns

In [43]:
full = full.drop("SK_ID_CURR", axis=1)

### Factorize and save Categorical columns

In [44]:
cat_cols = [col for col in full.columns if full[col].dtype == object]
full     = factorize_df(full, cat_cols)

### Split full back into train and test

In [45]:
train_x = full[:train_N]
test_x = full[train_N:]

### Processed data look
train_x.head()

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,NAME_TYPE_SUITE,...,b_CREDIT_TYPE_Loan for the purchase of equipment,b_CREDIT_TYPE_Loan for working capital replenishment,b_CREDIT_TYPE_Microloan,b_CREDIT_TYPE_Mobile operator loan,b_CREDIT_TYPE_Mortgage,b_CREDIT_TYPE_Real estate loan,b_CREDIT_TYPE_Unknown type of loan,b_NUM_UNQ_CREDIT,b_N,no_bureau
0,0,0,0,0,0,202500.0,406597.5,24700.5,351000.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,8.0,0
1,0,1,0,1,0,270000.0,1293502.5,35698.5,1129500.0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,4.0,0
2,1,0,1,0,0,67500.0,135000.0,6750.0,135000.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0
3,0,1,0,0,0,135000.0,312682.5,29686.5,297000.0,0,...,,,,,,,,,,1
4,0,0,0,0,0,121500.0,513000.0,21865.5,513000.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0
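
The concat-then-split pattern used for `full` relies on row order being preserved between the two steps. A minimal sketch with toy frames is shown here. `ignore_index=True` is an addition of this sketch, and `x2` stands in for the shared feature engineering.

```python
import pandas as pd

# Toy stand-ins for train_x and test_x
train_x = pd.DataFrame({"x": [1, 2, 3]})
test_x = pd.DataFrame({"x": [10, 20]})

full = pd.concat([train_x, test_x], ignore_index=True)
train_N = len(train_x)
full["x2"] = full["x"] * 2          # stand-in for shared feature engineering

# Positional split: the first train_N rows are train, the rest are test
train_back = full.iloc[:train_N]
test_back = full.iloc[train_N:]
```

The split stays valid only as long as every intermediate step keeps the row count and order, which holds for left merges on a key that is unique in the right-hand frame.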


<a id="models"></a>

# [^](#toc) <u>Modeling</u>

### Machine Learning Imports

In [46]:
from sklearn.model_selection import train_test_split 
from sklearn.metrics         import roc_auc_score, precision_recall_curve, roc_curve
from sklearn.model_selection import KFold

import lightgbm as lgb
from lightgbm                import LGBMClassifier

<a id="feat_reduction"></a>

### [^](#toc) Feature Reduction

In [48]:
training_x, val_x, training_y, val_y = train_test_split(train_x, train_y, test_size=0.2, random_state=17)

lgb_train = lgb.Dataset(data=training_x, label=training_y, categorical_feature=cat_cols)
lgb_eval  = lgb.Dataset(data=val_x, label=val_y, categorical_feature=cat_cols)

# try feature_fraction
params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.03, 'num_leaves': 55, 'num_iteration': 2000, 'verbose': 0 ,
          'subsample':.9, 'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
          'min_split_gain':.05, 'min_child_weight':1, "min_data_in_leaf": 40,
          "feature_fraction":0.5}

start = time.time()
model = lgb.train(params, lgb_train, valid_sets=lgb_eval, early_stopping_rounds=150, verbose_eval=200)
print("Training took {} seconds".format(round(time.time() - start)))

Training until validation scores don't improve for 150 rounds.
[200]	valid_0's auc: 0.740518
[400]	valid_0's auc: 0.758108
[600]	valid_0's auc: 0.7682
[800]	valid_0's auc: 0.774756
[1000]	valid_0's auc: 0.778598
[1200]	valid_0's auc: 0.780772
[1400]	valid_0's auc: 0.782256
[1600]	valid_0's auc: 0.783186
[1800]	valid_0's auc: 0.783862
[2000]	valid_0's auc: 0.784419
[2200]	valid_0's auc: 0.784848
[2400]	valid_0's auc: 0.785169
[2600]	valid_0's auc: 0.78544
[2800]	valid_0's auc: 0.785672
[3000]	valid_0's auc: 0.785815
[3200]	valid_0's auc: 0.785878
[3400]	valid_0's auc: 0.785963
[3600]	valid_0's auc: 0.786049
[3800]	valid_0's auc: 0.786095
Early stopping, best iteration is:
[3773]	valid_0's auc: 0.786124
Training took 1753 seconds
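
The log above ends with an early stop and a reported best iteration. As a rough illustration of that stopping rule, here is a simplified pure-Python sketch. It is not LightGBM's implementation: `best_iteration` and the AUC values are hypothetical.

```python
def best_iteration(scores, patience):
    """Return (1-based best round, best score) under a patience-based stop rule."""
    best_round, best_score = 0, float("-inf")
    for round_no, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = round_no, score
        elif round_no - best_round >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_round, best_score

# Invented validation AUCs; the final 0.80 is never reached
aucs = [0.70, 0.74, 0.76, 0.755, 0.758, 0.759, 0.80]
result = best_iteration(aucs, patience=3)
```

Training halts at round 6, three rounds after the best score at round 3, so the later improvement is never seen. That is the trade-off a larger patience such as 150 rounds is meant to soften.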


<a id="important_feats"></a>

### [^](#toc) Most important features

In [49]:
NUM_FEATS = 350

feats = sorted(list(zip(model.feature_importance(), train_x.columns)))
feats = list(list(zip(*feats[-NUM_FEATS:]))[1])
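
The two lines above keep the `NUM_FEATS` columns with the highest importance. The same idiom is sketched below on a toy list with invented importance values, written out so that the sort order and tie-breaking are visible.

```python
# Invented importances for five column names
importances = [12, 0, 7, 30, 7]
columns = ["AMT_CREDIT", "FLAG_MOBIL", "DAYS_BIRTH", "DAYS_EMPLOYED", "AMT_ANNUITY"]
NUM_FEATS = 3

# Sort (importance, name) pairs ascending, then keep the names of the last NUM_FEATS
ranked = sorted(zip(importances, columns))
feats = [name for _, name in ranked[-NUM_FEATS:]]
```

Ties are broken by column name because tuples sort lexicographically, which is why `DAYS_BIRTH` is kept here and `AMT_ANNUITY` is not.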

<a id="param_tuning"></a>

### [^](#toc) Parameter tuning

#### OLD 

lambda tuning (L2 regularization)

<div hidden>

lambda = 10, num_iter = 2500
- [2500] train: 0.871737, test: 0.78456

lambda = 20, num_iter = 5000
- [3270] train: 0.882077, test: 0.785751

lambda = 40, num_iter = 3000
- [3000] train: 0.868533, test: 0.78583

##### Implemented random_state=17

lambda = 80, num_iter = 3000
- [3000] train: 0.859416, test: 0.785235

lambda = 160, num_iter = 4000
- [3789] train: 0.86189, test: 0.785736

lambda = 0.1, num_iter = 4000
- [2603] train: 0.895891, test: 0.784205

##### change max_depth from 6 to 7




Other variables
--------------

learning_rate = 0.01,
num_leaves = 48,
colsample_bytree = 0.8,
subsample = 0.9,
max_depth = 6,
reg_alpha = 0.1,
min_split_gain = 0.01,
min_child_weight = 1,

</div>

max_depth tuning

<div hidden>

max_depth = 6, num_iter = 3000
- [3000] train: 0.859416, test: 0.785235

max_depth = 7, num_iter = 4000
- [3415] train: 0.878744, test: 0.786247

max_depth = 8, num_iter = 4000
- [2965] train: 0.875784, test: 0.786024

##### Change reg_alpha from 0.1 to 40

max_depth = 8, num_iter = 4000
- [3420] train: 0.858222, test: 0.785642

max_depth = 9, num_iter = 4000
- [] train: , test: 

max_depth = 10, num_iter = 4000
- [] train: , test: 

Other variables
--------------

learning_rate = 0.01,
num_leaves = 48,
colsample_bytree = 0.8,
subsample = 0.9,
reg_alpha = 0.1,
reg_lambda = 80,
min_split_gain = 0.01,
min_child_weight = 1,
random_state=17

</div>

#### 16062018 - Reduce Overfitting 1

<div hidden>

Starting params

'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
'learning_rate': 0.01, 'num_leaves': 55, 'num_iteration': 4000, 'verbose': 0 ,
'subsample':.9, 'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
'min_split_gain':.05, 'min_child_weight':1, "min_data_in_leaf": 40,
"bagging_freq": 4, "bagging_fraction":0.5

___Initial___

 - Validation: 0.786
 - Test:       0.779

Remove bagging_freq (=4), bagging_fraction (=0.5), add feature_fraction (=0.5).

 - Validation: 0.789602
 - Test:       0.785184954089
 
2nd run

 - Validation: 0.789004
 - Test:       0.784571125086
 
Change min_data_in_leaf (40 --> 60)

 - Validation: 0.789002
 - Test:       0.784891310112
 
Change min_data_in_leaf (60 --> 20)

 - Validation: 0.788626
 - Test:       0.78465452505
 
Change min_data_in_leaf (20 --> 40) and feature_fraction (0.5 --> 0.6)

 - Validation: 0.788533
 - Test:       0.784679323774
 
Change max_depth (7 --> 6) and feature_fraction (0.6 --> 0.5)

 - Validation: 0.788849
 - Test:       0.784880849041
 
Change learning_rate (0.01 --> 0.03) and max_depth (6 --> 7)

 - Validation: 0.788195
 - Test:       0.784589015086

</div>

#### 16062018 - Reduce Overfitting 2

<div hidden>

Starting params

'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
'learning_rate': 0.01, 'num_leaves': 55, 'num_iteration': 4000, 'verbose': 0 ,
'subsample':0.9, 'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
'min_split_gain':0.05, 'min_child_weight':1, "min_data_in_leaf": 40,
"feature_fraction":0.5

___Initial___

 - Validation: 0.788926
 - Test:       0.784734949879
 
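Change subsample (0.9 --> 0.7)

 - Validation: 0.788926
 - Test:       0.784734949879

The scores are identical to the initial run, most likely because `subsample` (an alias of `bagging_fraction`) only takes effect when `bagging_freq` is non-zero, and `bagging_freq` is not set in these params.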
 
Change min_child_weight (1 --> 4) and remove subsample

 - Validation: 0.78953
 - Test:       0.785232093044
 
Change min_split_gain (0.05 --> 0.1)
 
 - Validation: 0.789338
 - Test:       0.784930068808
 
Change min_child_weight (4 --> 10) and min_split_gain (0.1 --> 0.05)
 
 - Validation: 0.788998
 - Test:       0.784947160688
 
</div>
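The "Reduce Overfitting 2" runs can be compared programmatically instead of by eye. This is a minimal sketch that copies the logged scores into a list and picks the run with the highest test AUC; the `runs` structure and the labels are illustrative, only the numbers come from the log.

```python
# (description, validation AUC, test AUC), copied from the log above.
runs = [
    ("initial", 0.788926, 0.784734949879),
    ("subsample 0.9 -> 0.7", 0.788926, 0.784734949879),
    ("min_child_weight 1 -> 4, subsample removed", 0.78953, 0.785232093044),
    ("min_split_gain 0.05 -> 0.1", 0.789338, 0.784930068808),
    ("min_child_weight 4 -> 10, min_split_gain 0.1 -> 0.05", 0.788998, 0.784947160688),
]

baseline_test = runs[0][2]

# Pick the run with the highest test AUC and report its gain over the baseline.
best = max(runs, key=lambda run: run[2])
gain = best[2] - baseline_test

print("Best run:", best[0])
print("Test AUC gain over initial: {:.6f}".format(gain))
```

The selected run (min_child_weight = 4, no subsample) matches the params used in the training cell that follows. The differences between runs are small, so they may be within run-to-run noise.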

In [54]:
training_x, testing_x, training_y, testing_y = train_test_split(train_x[feats], train_y, test_size=0.2, random_state=17)
training_x, val_x, training_y, val_y = train_test_split(training_x, training_y, test_size=0.25, random_state=17)

lgb_train = lgb.Dataset(data=training_x,
                        label=training_y,
                        categorical_feature=cat_cols)
lgb_eval  = lgb.Dataset(data=val_x,
                        label=val_y,
                        categorical_feature=cat_cols)

params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.01, 'num_leaves': 55, 'num_iteration': 4000, 'verbose': 0 ,
          'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
          'min_split_gain':0.05, 'min_child_weight':4, "min_data_in_leaf": 40,
          "feature_fraction":0.5}

start = time.time()
model = lgb.train(params, lgb_train, valid_sets=lgb_eval, early_stopping_rounds=150, verbose_eval=100)
print("Training took {} seconds".format(round(time.time() - start)))

prediction = model.predict(testing_x)

score = roc_auc_score(testing_y, prediction)

print("Testing score:", score)

Training until validation scores don't improve for 150 rounds.
[100]	valid_0's auc: 0.735437
[200]	valid_0's auc: 0.747542
[300]	valid_0's auc: 0.756961
[400]	valid_0's auc: 0.763622
[500]	valid_0's auc: 0.769285
[600]	valid_0's auc: 0.773791
[700]	valid_0's auc: 0.776977
[800]	valid_0's auc: 0.77961
[900]	valid_0's auc: 0.781509
[1000]	valid_0's auc: 0.782992
[1100]	valid_0's auc: 0.784065
[1200]	valid_0's auc: 0.785
[1300]	valid_0's auc: 0.785791
[1400]	valid_0's auc: 0.786315
[1500]	valid_0's auc: 0.786691
[1600]	valid_0's auc: 0.787025
[1700]	valid_0's auc: 0.787238
[1800]	valid_0's auc: 0.787481
[1900]	valid_0's auc: 0.787766
[2000]	valid_0's auc: 0.787977
[2100]	valid_0's auc: 0.788174
[2200]	valid_0's auc: 0.788328
[2300]	valid_0's auc: 0.788517
[2400]	valid_0's auc: 0.788673
[2500]	valid_0's auc: 0.788729
[2600]	valid_0's auc: 0.788802
[2700]	valid_0's auc: 0.788826
[2800]	valid_0's auc: 0.788905
[2900]	valid_0's auc: 0.788972
[3000]	valid_0's auc: 0.788943
[3100]	valid_0's auc
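The two chained `train_test_split` calls above produce a three-way split. A quick check of the resulting proportions, using only the `test_size` values from the cell (the helper name is made up for illustration):

```python
from fractions import Fraction


def three_way_split_fractions(test_size, val_size_of_rest):
    """Hypothetical helper: overall train/val/test fractions for two chained splits."""
    test = Fraction(test_size)
    rest = 1 - test
    val = Fraction(val_size_of_rest) * rest
    train = rest - val
    return train, val, test


# test_size=0.2 for the first split, test_size=0.25 for the second (as in the cell above).
train_frac, val_frac, test_frac = three_way_split_fractions("0.2", "0.25")
print(train_frac, val_frac, test_frac)
```

So the model trains on 60% of the rows, early stopping is monitored on 20%, and the reported testing score comes from the remaining 20%.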

<a id="cv"></a>

### [^](#toc) CV Score

In [63]:
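def get_score(train_x, train_y, usecols, params, dropcols=None):
    """Return the best mean AUC from 5-fold stratified CV with early stopping."""
    dropcols = dropcols or []  # avoid a mutable default argument
    dtrain = lgb.Dataset(train_x[usecols].drop(dropcols, axis=1),
                         train_y,
                         categorical_feature=cat_cols)

    # Note: 'num_iteration' in params is an alias for the number of boosting
    # rounds and appears to take precedence over num_boost_round below.
    cv_results = lgb.cv(params,
                        dtrain,
                        nfold=5,
                        stratified=True,
                        num_boost_round=20000,
                        early_stopping_rounds=200,
                        verbose_eval=100,
                        seed=5,
                        show_stdv=True)
    return max(cv_results['auc-mean'])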

params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.01, 'num_leaves': 55, 'num_iteration': 4000, 'verbose': 0 ,
          'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
          'min_split_gain':0.05, 'min_child_weight':4, "min_data_in_leaf": 40,
          "feature_fraction":0.5,}
    
get_score(train_x, train_y, feats, params)

[100]	cv_agg's auc: 0.734262 + 0.0013434
[200]	cv_agg's auc: 0.746387 + 0.00153842
[300]	cv_agg's auc: 0.755584 + 0.00145228
[400]	cv_agg's auc: 0.762399 + 0.00152256
[500]	cv_agg's auc: 0.767869 + 0.00156887
[600]	cv_agg's auc: 0.771965 + 0.00153819
[700]	cv_agg's auc: 0.775268 + 0.00145693
[800]	cv_agg's auc: 0.777931 + 0.00136613
[900]	cv_agg's auc: 0.779935 + 0.00133051
[1000]	cv_agg's auc: 0.781432 + 0.00134249
[1100]	cv_agg's auc: 0.7826 + 0.00137212
[1200]	cv_agg's auc: 0.783581 + 0.0013923
[1300]	cv_agg's auc: 0.784306 + 0.0013814
[1400]	cv_agg's auc: 0.784915 + 0.00140205
[1500]	cv_agg's auc: 0.785411 + 0.00139166
[1600]	cv_agg's auc: 0.785825 + 0.00137735
[1700]	cv_agg's auc: 0.786204 + 0.00137201
[1800]	cv_agg's auc: 0.786534 + 0.00133517
[1900]	cv_agg's auc: 0.786841 + 0.00131041
[2000]	cv_agg's auc: 0.78711 + 0.00128903
[2100]	cv_agg's auc: 0.787321 + 0.00125792
[2200]	cv_agg's auc: 0.7875 + 0.00123477
[2300]	cv_agg's auc: 0.787682 + 0.00120538
[2400]	cv_agg's auc: 0.78783

0.78879859253631179
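`get_score` returns only the best mean AUC. To also recover the boosting round at which it occurred, something like the following could be applied to the dictionary returned by `lgb.cv`. This is a sketch under assumptions: the `auc-mean` key is the one used in this notebook (other LightGBM releases may name it differently), and the `toy_results` values are made up for illustration, not real output.

```python
def best_round(cv_results, metric_key="auc-mean"):
    """Hypothetical helper: 1-based round and value of the best mean score."""
    scores = cv_results[metric_key]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return best_idx + 1, scores[best_idx]


# Toy values for illustration only, one entry per boosting round.
toy_results = {"auc-mean": [0.70, 0.75, 0.78, 0.77]}

round_number, score = best_round(toy_results)
print("Best round: {}, mean AUC: {}".format(round_number, score))
```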

---
<a id="final"></a>

# [^](#toc) <u>Final submission</u>

In [59]:
folds       = KFold(n_splits=5, shuffle=True, random_state=17)
predictions = np.zeros(test_x.shape[0])

for n_fold, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    trn_x, trn_y = train_x[feats].iloc[trn_idx], train_y.iloc[trn_idx]
    val_x, val_y = train_x[feats].iloc[val_idx], train_y.iloc[val_idx]
    
    lgb_train = lgb.Dataset(data=trn_x, label=trn_y, categorical_feature=cat_cols)
    lgb_eval  = lgb.Dataset(data=val_x, label=val_y, categorical_feature=cat_cols)

    params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
              'learning_rate': 0.01, 'num_leaves': 55, 'num_iteration': 4000, 'verbose': 0 ,
              'max_depth':7, 'reg_alpha':20, 'reg_lambda':20, 
              'min_split_gain':0.05, 'min_child_weight':4, "min_data_in_leaf": 40,
              "feature_fraction":0.5,
              "random_state": random.randint(1, 100)}  # Recommended to make the seed random


    model = lgb.train(params, lgb_train, valid_sets=lgb_eval,
                      early_stopping_rounds=150, verbose_eval=100) 
    
    predictions += model.predict(test_x[feats]) / folds.n_splits
    
    print("\n Done with {} fold".format(n_fold + 1))
    
    del model, trn_x, trn_y, val_x, val_y
    gc.collect()

Training until validation scores don't improve for 150 rounds.
[100]	valid_0's auc: 0.728036
[200]	valid_0's auc: 0.739861
[300]	valid_0's auc: 0.749731
[400]	valid_0's auc: 0.757379
[500]	valid_0's auc: 0.763477
[600]	valid_0's auc: 0.768177
[700]	valid_0's auc: 0.77187
[800]	valid_0's auc: 0.77496
[900]	valid_0's auc: 0.777106
[1000]	valid_0's auc: 0.778788
[1100]	valid_0's auc: 0.77996
[1200]	valid_0's auc: 0.780998
[1300]	valid_0's auc: 0.781803
[1400]	valid_0's auc: 0.782387
[1500]	valid_0's auc: 0.782914
[1600]	valid_0's auc: 0.783389
[1700]	valid_0's auc: 0.783822
[1800]	valid_0's auc: 0.784147
[1900]	valid_0's auc: 0.784533
[2000]	valid_0's auc: 0.784751
[2100]	valid_0's auc: 0.784945
[2200]	valid_0's auc: 0.785191
[2300]	valid_0's auc: 0.785439
[2400]	valid_0's auc: 0.785655
[2500]	valid_0's auc: 0.785796
[2600]	valid_0's auc: 0.78597
[2700]	valid_0's auc: 0.786039
[2800]	valid_0's auc: 0.786111
[2900]	valid_0's auc: 0.786182
[3000]	valid_0's auc: 0.78627
[3100]	valid_0's auc:
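The loop above builds the final prediction by adding each fold model's test predictions divided by the number of folds, which is a running mean across folds. A minimal pure-Python sketch of that accumulation, with made-up toy predictions and an illustrative helper name:

```python
def average_fold_predictions(fold_predictions):
    """Hypothetical helper: mean of per-fold prediction lists, accumulated fold by fold."""
    n_folds = len(fold_predictions)
    averaged = [0.0] * len(fold_predictions[0])
    for preds in fold_predictions:
        for i, p in enumerate(preds):
            # Same pattern as `predictions += model.predict(...) / folds.n_splits`.
            averaged[i] += p / n_folds
    return averaged


# Toy predictions from two folds for two test rows (illustrative values only).
toy_folds = [[0.2, 0.8], [0.4, 0.6]]
blended = average_fold_predictions(toy_folds)
print(blended)
```

Averaging models trained on different folds (and here with different seeds) tends to reduce variance compared with a single model, at the cost of training five times.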

<a id="final_pred"></a>

### [^](#toc) Final predictions

In [61]:
pd.DataFrame({
    "SK_ID_CURR": test_id,
    "TARGET": predictions
}).to_csv("../submissions/more_tuning.csv", index=False)