# Score:

<a id="toc"></a>

# <u>Table of Contents</u>
1.) [TODO](#todo)  
2.) [Imports](#imports)  
3.) [Bureau](#pos_cash)  
&nbsp;&nbsp;&nbsp;&nbsp; 3.1.) [Data Processing](#bureau_process)  
4.) [Bureau Balance](#bureau_bal)  
&nbsp;&nbsp;&nbsp;&nbsp; 4.1.) [Merge into Bureau](#merge_bureau_bal)  
5.) [Previous Application](#prev_app)  
&nbsp;&nbsp;&nbsp;&nbsp; 5.1.) [Data Processing](#prev_process)  
6.) [POS CASH balance](#pos_cash)  
&nbsp;&nbsp;&nbsp;&nbsp; 6.1.) [Data Processing](#pos_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 6.2.) [Merge into Previous Application](#merge_pos_cash)  
7.) [Installment Payments](#install_pay)  
&nbsp;&nbsp;&nbsp;&nbsp; 7.1.) [Merge into Previous Application](#merge_install_pay)  
8.) [Credit Card Balance](#credit)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.1.) [Data Processing](#credit_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.2.) [Merge into Previous Application](#merge_credit)  
9.) [Miscellaneous clean up](#misc)  
10.) [Final Data Prep](#final_merge)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.1.) [Data Processing](#final_process)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.2.) [Create Features](#train_feat)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.3.) [Categorical values](#train_cat)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.4.) [Merge Previous Application with Full](#merge_prev)  
&nbsp;&nbsp;&nbsp;&nbsp; 10.5.) [Merge Bureau with Full](#merge_bureau)  
11.) [Modeling](#models)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.1.) [Feature Reduction](#feat_reduction)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.2.) [Most important features](#important_feats)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.3.) [Parameter tuning](#param_tuning)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.4.) [CV Score](#cv)  
12.) [Final submission](#final)  
&nbsp;&nbsp;&nbsp;&nbsp; 12.1.) [Final predictions](#final_pred)  

<a id="todo"></a>

# [^](#toc) <u>TODO</u>

- Fix skew on columns
- Tinker with the best way to replace missing values (dropping cols?)
- Look for outliers
- Include timeline relatoinships like MONTHS_BALANCE
- Address [this](https://www.kaggle.com/c/home-credit-default-risk/discussion/57248)

---
<a id="imports"></a>

# [^](#toc) <u>Imports</u>

In [1]:
### Standard imports
import pandas as pd
import numpy as np

# Time keeper
import time

# Randomize seeds
import random

# Garbage collector
import gc

# Progress bar
from tqdm import tqdm

### Removes warnings from output
import warnings
warnings.filterwarnings('ignore')

### Setup

In [2]:
def get_dummies(df, cats):
    for col in cats:
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    return df 

def factorize_df(df, cats):
    for col in cats:
        df[col], _ = pd.factorize(df[col])
    return df 

DATA_PATH = "../data/home_default/"

---
<a id="bureau"></a>

# [^](#toc) Bureau

In [3]:
bureau   = pd.read_csv(DATA_PATH + "bureau.csv")
print("Shape of bureau:", bureau.shape)

print("\nColumns of bureau:")
print(" --- ".join(bureau.columns.values))

Shape of bureau: (1716428, 17)

Columns of bureau:
SK_ID_CURR --- SK_ID_BUREAU --- CREDIT_ACTIVE --- CREDIT_CURRENCY --- DAYS_CREDIT --- CREDIT_DAY_OVERDUE --- DAYS_CREDIT_ENDDATE --- DAYS_ENDDATE_FACT --- AMT_CREDIT_MAX_OVERDUE --- CNT_CREDIT_PROLONG --- AMT_CREDIT_SUM --- AMT_CREDIT_SUM_DEBT --- AMT_CREDIT_SUM_LIMIT --- AMT_CREDIT_SUM_OVERDUE --- CREDIT_TYPE --- DAYS_CREDIT_UPDATE --- AMT_ANNUITY


<a id="bureau_process"></a>

### [^](#toc) Data Processing

In [4]:
### Lump together values with low counts
# CREDIT_CURRENCY
cols = ["currency 3", "currency 4"]
bureau.CREDIT_CURRENCY = bureau.CREDIT_CURRENCY.map(lambda x: "MISC" if x in cols else x)

# CREDIT_TYPE
cols = ["Cash loan (non-earmarked)", "Real estate loan", "Loan for the purchase of equipment",
        "Loan for purchase of shares (margin lending)", "Interbank credit", "Mobile operator loan"]
bureau.CREDIT_TYPE = bureau.CREDIT_TYPE.map(lambda x: "MISC" if x in cols else x)

<a id="bureau_bal"></a>

# [^](#toc) <u>Bureau Balance</u>

In [5]:
bureau_balance = pd.read_csv(DATA_PATH + "bureau_balance.csv")
print("Shape of bureau_balance:",  bureau_balance.shape)

print("\nColumns of bureau_balance:")
print(" --- ".join(bureau_balance.columns.values))

Shape of bureau_balance: (27299925, 3)

Columns of bureau_balance:
SK_ID_BUREAU --- MONTHS_BALANCE --- STATUS


<a id="merge_bureau_bal"></a>

### [^](#toc) <u>Merge into Bureau</u>

In [6]:
### Get sum of counts in categorical column
merge_df = get_dummies(bureau_balance, ["STATUS"])
cols = ['STATUS_0', 'STATUS_1', 'STATUS_2', 'STATUS_3', 'STATUS_4', 'STATUS_5', 'STATUS_C', 'STATUS_X']
for col in cols:
    merge_df[col] = merge_df[col] / (merge_df["MONTHS_BALANCE"] - 1)
merge_df = merge_df.drop(["MONTHS_BALANCE", "STATUS"], axis=1)
merge_df = merge_df.groupby("SK_ID_BUREAU").sum().reset_index()

### Add the median of the rest of the columns
right    = bureau_balance.groupby("SK_ID_BUREAU").median().reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_BUREAU").set_index("SK_ID_BUREAU")

### Prefix column names
merged_cols = ['bur_bal_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
bureau = bureau.merge(right=merge_df.reset_index(), how='left', on='SK_ID_BUREAU')

# Mark missing values
bureau["no_bureau_bal"] = bureau[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del bureau_balance, merge_df, merged_cols, right
gc.collect()

133

---
<a id="prev_app"></a>

# [^](#toc) <u>Previous Application</u>

In [7]:
prev_app = pd.read_csv(DATA_PATH + "previous_application.csv")
print("Shape of prev_app:",  prev_app.shape)

print("\nColumns of prev_app:")
print(" --- ".join(prev_app.columns.values))

Shape of prev_app: (1670214, 37)

Columns of prev_app:
SK_ID_PREV --- SK_ID_CURR --- NAME_CONTRACT_TYPE --- AMT_ANNUITY --- AMT_APPLICATION --- AMT_CREDIT --- AMT_DOWN_PAYMENT --- AMT_GOODS_PRICE --- WEEKDAY_APPR_PROCESS_START --- HOUR_APPR_PROCESS_START --- FLAG_LAST_APPL_PER_CONTRACT --- NFLAG_LAST_APPL_IN_DAY --- RATE_DOWN_PAYMENT --- RATE_INTEREST_PRIMARY --- RATE_INTEREST_PRIVILEGED --- NAME_CASH_LOAN_PURPOSE --- NAME_CONTRACT_STATUS --- DAYS_DECISION --- NAME_PAYMENT_TYPE --- CODE_REJECT_REASON --- NAME_TYPE_SUITE --- NAME_CLIENT_TYPE --- NAME_GOODS_CATEGORY --- NAME_PORTFOLIO --- NAME_PRODUCT_TYPE --- CHANNEL_TYPE --- SELLERPLACE_AREA --- NAME_SELLER_INDUSTRY --- CNT_PAYMENT --- NAME_YIELD_GROUP --- PRODUCT_COMBINATION --- DAYS_FIRST_DRAWING --- DAYS_FIRST_DUE --- DAYS_LAST_DUE_1ST_VERSION --- DAYS_LAST_DUE --- DAYS_TERMINATION --- NFLAG_INSURED_ON_APPROVAL


<a id="prev_process"></a>

### [^](#toc) Data Processing

In [8]:
### Fill in values that should be null
prev_app['DAYS_FIRST_DRAWING'       ].replace(365243, np.nan, inplace= True)
prev_app['DAYS_FIRST_DUE'           ].replace(365243, np.nan, inplace= True)
prev_app['DAYS_LAST_DUE_1ST_VERSION'].replace(365243, np.nan, inplace= True)
prev_app['DAYS_LAST_DUE'            ].replace(365243, np.nan, inplace= True)
prev_app['DAYS_TERMINATION'         ].replace(365243, np.nan, inplace= True)

### Lump together values with low counts
# NAME_GOODS_CATEGORY
prev_app.NAME_GOODS_CATEGORY = prev_app.NAME_GOODS_CATEGORY.map(
    lambda x: "MISC" if x in ["Weapon", "Insurance"] else x)

# NAME_CASH_LOAN_PURPOSE
prev_app.NAME_CASH_LOAN_PURPOSE = prev_app.NAME_CASH_LOAN_PURPOSE.map(
    lambda x: "MISC" if x in ["Buying a garage", "Misc"] else x)

# Create features
prev_app["APP_CREDIT_PERC"] = prev_app['AMT_APPLICATION'] / prev_app['AMT_CREDIT']

---
<a id="pos_cash"></a>

# [^](#toc) <u>POS CASH balance</u>

In [9]:
pcb = pd.read_csv(DATA_PATH + "POS_CASH_balance.csv")
print("Shape of pcb:",  pcb.shape)

print("\nColumns of pcb:")
print(" --- ".join(pcb.columns.values))

Shape of pcb: (10001358, 8)

Columns of pcb:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- CNT_INSTALMENT --- CNT_INSTALMENT_FUTURE --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="pos_process"></a>

### [^](#toc) Data Processing

In [10]:
# Remove Outliers
pcb = pcb.drop(pcb[pcb.NAME_CONTRACT_STATUS.isin(["XNA", "Canceled"])].index)

<a id="merge_pos_cash"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [11]:
### Get Dummies
merge_df = pcb[["SK_ID_PREV", "NAME_CONTRACT_STATUS"]]
merge_df = get_dummies(merge_df, ["NAME_CONTRACT_STATUS"])
merge_df = merge_df.drop("NAME_CONTRACT_STATUS", axis=1)

# Prep for merge
count    = merge_df.groupby("SK_ID_PREV").count()
merge_df = merge_df.groupby("SK_ID_PREV").sum().reset_index()
merge_df["N"] = list(count.iloc[:,0])

### Add the median of the rest of the columns
right    = pcb.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median().reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['pos_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_pcb"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del pcb, count, merge_df, merged_cols, right
gc.collect()

142

---
<a id="install_pay"></a>

# [^](#toc) <u>Installment Payments</u>

In [12]:
install_pay = pd.read_csv(DATA_PATH + "installments_payments.csv")
print("Shape of install_pay:",  install_pay.shape)

print("\nColumns of install_pay:")
print(" --- ".join(install_pay.columns.values))

Shape of install_pay: (13605401, 8)

Columns of install_pay:
SK_ID_PREV --- SK_ID_CURR --- NUM_INSTALMENT_VERSION --- NUM_INSTALMENT_NUMBER --- DAYS_INSTALMENT --- DAYS_ENTRY_PAYMENT --- AMT_INSTALMENT --- AMT_PAYMENT


<a id="merge_install_pay"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [13]:
### Create new feature
install_pay["AMT_MISSING"]  = install_pay["AMT_INSTALMENT"]     - install_pay["AMT_PAYMENT"]
install_pay['PAYMENT_PERC'] = install_pay['AMT_PAYMENT']        / install_pay['AMT_INSTALMENT']

# Days past due and days before due (no negative values)
install_pay['DPD']          = install_pay['DAYS_ENTRY_PAYMENT'] - install_pay['DAYS_INSTALMENT']
install_pay['DBD']          = install_pay['DAYS_INSTALMENT']    - install_pay['DAYS_ENTRY_PAYMENT']
install_pay['DPD']          = install_pay['DPD'].apply(lambda x: x if x > 0 else 0)
install_pay['DBD']          = install_pay['DBD'].apply(lambda x: x if x > 0 else 0)

# Amount of values missing in AMT_PAYMENT
install_pay["temp"]         = install_pay["AMT_PAYMENT"].map(lambda x: 1 if np.isnan(x) else 0)

### Select important features
merge_df = pd.DataFrame({
    "missing_max": install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].max(),
    "missing_min": install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].min(),
    "payment_max": install_pay.groupby("SK_ID_PREV")['PAYMENT_PERC'].max(),
    "payment_min": install_pay.groupby("SK_ID_PREV")['PAYMENT_PERC'].min(),
    
    "payment_nan": install_pay.groupby("SK_ID_PREV")["temp"].sum(),
    "N":           install_pay.groupby("SK_ID_PREV")["AMT_MISSING"].count(),
    "unique_ver":  install_pay.groupby("SK_ID_PREV")["NUM_INSTALMENT_VERSION"].unique()
})

# Delete temp column
install_pay = install_pay.drop("temp", axis=1)

# Select median of everything
right = install_pay.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median().reset_index()

### Merge the two
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['install_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_install"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del install_pay, merge_df, merged_cols, right
gc.collect()

64

---
<a id="credit"></a>

# [^](#toc) <u>Credit Card Balance</u>

In [14]:
credit_card = pd.read_csv(DATA_PATH + "credit_card_balance.csv")
print("Shape of credit_card:",  credit_card.shape)

print("\nColumns of credit_card:")
print(" --- ".join(credit_card.columns.values))

Shape of credit_card: (3840312, 23)

Columns of credit_card:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- AMT_BALANCE --- AMT_CREDIT_LIMIT_ACTUAL --- AMT_DRAWINGS_ATM_CURRENT --- AMT_DRAWINGS_CURRENT --- AMT_DRAWINGS_OTHER_CURRENT --- AMT_DRAWINGS_POS_CURRENT --- AMT_INST_MIN_REGULARITY --- AMT_PAYMENT_CURRENT --- AMT_PAYMENT_TOTAL_CURRENT --- AMT_RECEIVABLE_PRINCIPAL --- AMT_RECIVABLE --- AMT_TOTAL_RECEIVABLE --- CNT_DRAWINGS_ATM_CURRENT --- CNT_DRAWINGS_CURRENT --- CNT_DRAWINGS_OTHER_CURRENT --- CNT_DRAWINGS_POS_CURRENT --- CNT_INSTALMENT_MATURE_CUM --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="credit_process"></a>

### [^](#toc) <u>Data Processing</u>

In [15]:
# Gets indices with outlier values
temp = credit_card[credit_card.NAME_CONTRACT_STATUS.isin(["Refused", "Approved"])].index

# Drops outlier values
credit_card = credit_card.drop(temp, axis=0)

<a id="merge_credit"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [16]:
### Create features
merge_df = pd.DataFrame({
    "AMT_BALANCE": credit_card.groupby("SK_ID_PREV").AMT_BALANCE.mean(),
    "SK_DPD":      credit_card.groupby("SK_ID_PREV").SK_DPD.max(),
    "SK_DPD_DEF":  credit_card.groupby("SK_ID_PREV").SK_DPD_DEF.max(),
    "N":           credit_card.groupby("SK_ID_PREV").count().iloc[:,0]
})

### Categorical column
temp = get_dummies(credit_card, ["NAME_CONTRACT_STATUS"])
cols = ['NAME_CONTRACT_STATUS_Active',
       'NAME_CONTRACT_STATUS_Completed', 'NAME_CONTRACT_STATUS_Demand',
       'NAME_CONTRACT_STATUS_Sent proposal', 'NAME_CONTRACT_STATUS_Signed']
for col in cols:
    temp[col] = temp[col] / (temp["MONTHS_BALANCE"] - 1)
cols.extend(["SK_ID_PREV"])
temp = temp[cols]
temp = temp.groupby("SK_ID_PREV").sum()

# Merge categorical and numerical df
merge_df = temp.join(merge_df)

### Add the rest of the columns
right = credit_card.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median().reset_index()
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Prefix column names
merged_cols = ['credit_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

# Mark missing values
prev_app["no_credit"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

### Delete old variables
del credit_card, merge_df, merged_cols, right
gc.collect()

199

---
<a id="misc"></a>

# [^](#toc) <u>Miscellaneous clean up</u>

In [17]:
### Drop unneeded ID columns
prev_app = prev_app.drop("SK_ID_PREV", axis=1)
bureau   = bureau.drop("SK_ID_BUREAU", axis=1)

---
<a id="final_merge"></a>

# [^](#toc) <u>Final Data Prep</u>

In [18]:
train = pd.read_csv(DATA_PATH + "train.csv")
test  = pd.read_csv(DATA_PATH + "test.csv")

print("Shape of train:", train.shape)
print("Shape of test:",  test.shape)

Shape of train: (307511, 122)
Shape of test: (48744, 121)


### Split into predictors, target, and id

In [19]:
train_y = train.TARGET
train_x = train.drop(["TARGET"], axis=1)

test_id = test.SK_ID_CURR
test_x  = test

### Merge train and test data

In [20]:
full    = pd.concat([train_x, test_x])
train_N = len(train_x)

<a id="final_process"></a>

### [^](#toc) <u>Data Processing</u>

In [21]:
### Replace maxed values with NaN
full['DAYS_EMPLOYED'].replace(365243, np.nan, inplace=True)

### Fill in outlier values
full["CODE_GENDER"]        = full["CODE_GENDER"].map(lambda x: "F" if x == "XNA" else x)
full["NAME_FAMILY_STATUS"] = full["NAME_FAMILY_STATUS"].map(lambda x: "Married" if x == "Unknown" else x)

# NAME_INCOME_TYPE
cols = ["Unemployed", "Student", "Businessman", "Maternity leave"]
full["NAME_INCOME_TYPE"] = full["NAME_INCOME_TYPE"].map(lambda x: "MISC" if x in cols else x)

# ORGANIZATION_TYPE
cols = ["Trade: type 4", "Trade: type 5"]
full["ORGANIZATION_TYPE"] = full["ORGANIZATION_TYPE"].map(lambda x: "MISC Trade" if x in cols else x)
cols = ["Industry: type 13", "Industry: type 8"]
full["ORGANIZATION_TYPE"] = full["ORGANIZATION_TYPE"].map(lambda x: "MISC Industry" if x in cols else x)

<a id="train_cat"></a>

### [^](#toc) Categorical values

In [22]:
### Get dummies
cols  = ["WALLSMATERIAL_MODE", "NAME_TYPE_SUITE", "NAME_INCOME_TYPE", "NAME_FAMILY_STATUS",
               "NAME_HOUSING_TYPE", "OCCUPATION_TYPE", "WEEKDAY_APPR_PROCESS_START", "ORGANIZATION_TYPE",
               "FONDKAPREMONT_MODE", "NAME_EDUCATION_TYPE"]
full = get_dummies(full, cols)
full = full.drop(cols, axis=1)

### Factorize the dataframe
cols = ["NAME_CONTRACT_TYPE", "CODE_GENDER", "FLAG_OWN_CAR",
               "FLAG_OWN_REALTY", "HOUSETYPE_MODE", "EMERGENCYSTATE_MODE"]
full = factorize_df(full, cols)

<a id="train_feat"></a>

### [^](#toc) Create Features

In [23]:
full['DAYS_EMPLOYED_PERC']  = full['DAYS_EMPLOYED']    / full['DAYS_BIRTH']
full['INCOME_CREDIT_PERC']  = full['AMT_INCOME_TOTAL'] / full['AMT_CREDIT']
full['INCOME_PER_PERSON']   = full['AMT_INCOME_TOTAL'] / full['CNT_FAM_MEMBERS']
full['ANNUITY_INCOME_PERC'] = full['AMT_ANNUITY']      / full['AMT_INCOME_TOTAL']

### Create feature marking number of flags
full["NUM_FLAGS"]           = np.zeros(len(full))
for flag in ('FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4',
             'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
             'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10',
             'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13',
             'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16',
             'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19',
             'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21'):
    full["NUM_FLAGS"] += full[flag]

<a id="merge_prev"></a>

### [^](#toc) Merge Previous Application with Full

In [24]:
cat_cols = [
        "NAME_CONTRACT_TYPE", "WEEKDAY_APPR_PROCESS_START",
        "FLAG_LAST_APPL_PER_CONTRACT", "NAME_CASH_LOAN_PURPOSE",
        "NAME_CONTRACT_STATUS", "NAME_PAYMENT_TYPE",
        "CODE_REJECT_REASON", "NAME_TYPE_SUITE", "NAME_CLIENT_TYPE",
        "NAME_GOODS_CATEGORY", "NAME_PORTFOLIO", "NAME_PRODUCT_TYPE",
        "CHANNEL_TYPE", "NAME_SELLER_INDUSTRY", "NAME_YIELD_GROUP",
        "PRODUCT_COMBINATION", "SK_ID_CURR"]
min_max_cols = ['AMT_ANNUITY', 'AMT_APPLICATION', 'AMT_CREDIT',
                 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE', 'HOUR_APPR_PROCESS_START',
                 'NFLAG_LAST_APPL_IN_DAY', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
                 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'SELLERPLACE_AREA',
                 'CNT_PAYMENT', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE',
                 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION',
                 'NFLAG_INSURED_ON_APPROVAL', 'APP_CREDIT_PERC', "SK_ID_CURR"]
num_cols = [col for col in prev_app.columns if col not in cat_cols]
num_cols.append("SK_ID_CURR")

### numerical columns - median
merge_df         = prev_app[num_cols].groupby('SK_ID_CURR').median()
merge_df.columns = ["MED_" + col for col in merge_df.columns]
print("Selected median of numerical columns")

### numerical columns - max
right         = prev_app[min_max_cols].groupby("SK_ID_CURR").max()
right.columns = ["MAX_" + col for col in right.columns]
print("Selected max of numerical columns")

### Merge median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged median and max")

### numerical columns - min
right = prev_app[min_max_cols].groupby("SK_ID_CURR").min()
right.columns = ["MIN_" + col for col in right.columns]
print("Selected min of numerical columns")

### Merge min with median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged min with median and max")

### Categorical columns
right = prev_app[cat_cols].set_index("SK_ID_CURR")
right = pd.get_dummies(right).reset_index()
right = right.groupby("SK_ID_CURR").sum().reset_index()
print("Selected categorical columns")

### Merge categorical and numerical
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged categorical and numerical")

### Prefix column names
merge_df["N"]    = prev_app.groupby('SK_ID_CURR').count().iloc[:,0]
merged_cols      = ['p_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')
print("Merged into full")

# Mark missing values
full["no_prev_app"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)
print("Marked missing values")

### Delete old variables
del merge_df, merged_cols, right, cat_cols, num_cols
gc.collect()

Selected median of numerical columns
Selected max of numerical columns
Merged median and max
Selected min of numerical columns
Merged min with median and max
Selected categorical columns
Merged categorical and numerical
Merged into full
Marked missing values


406

<a id="merge_bureau"></a>

### [^](#toc) Merge Bureau with Full

In [25]:
cat_cols = ['CREDIT_ACTIVE', 'CREDIT_CURRENCY', 'CREDIT_TYPE', 'SK_ID_CURR']
num_cols = [col for col in bureau.columns if col not in cat_cols]
num_cols.append("SK_ID_CURR")

### Numeric columns - median
merge_df         = bureau[num_cols].groupby('SK_ID_CURR').median()
merge_df.columns = ["MED_" + col for col in merge_df.columns]
print("Selected median of numerical columns")

### Numeric columns - max
right         = bureau[num_cols].groupby("SK_ID_CURR").max()
right.columns = ["MAX_" + col for col in right.columns]
print("Selected max of numerical columns")

### Merge median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged median and max")

### Numeric columns - min
right = bureau[num_cols].groupby("SK_ID_CURR").min()
right.columns = ["MIN_" + col for col in right.columns]
print("Selected min of numerical columns")

### Merge min with median and max
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right.reset_index(), how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged min with median and max")

### Categorical columns
right = bureau[cat_cols].set_index("SK_ID_CURR")
right = pd.get_dummies(right).reset_index()
right = right.groupby("SK_ID_CURR").sum().reset_index()
print("Selected categorical columns")

### Merge categorical and numeric
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_CURR").set_index("SK_ID_CURR")
print("Merged categorical and numerical")

### Prefix column names
merge_df["N"] = bureau.groupby('SK_ID_CURR').count().iloc[:,0]
merged_cols      = ['b_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')
print("Merged into full")

# Mark missing values
full["no_bureau"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)
print("Marked missing values")

### Delete old variables
del merge_df, merged_cols, right, cat_cols, num_cols
gc.collect()

Selected median of numerical columns
Selected max of numerical columns
Merged median and max
Selected min of numerical columns
Merged min with median and max
Selected categorical columns
Merged categorical and numerical
Merged into full
Marked missing values


117

### Delete unneeded columns

In [26]:
full = full.drop("SK_ID_CURR", axis=1)

### Split full back into train and test

In [27]:
train_x = full[:train_N]
test_x = full[train_N:]

### Processed data look
train_x.head()

Unnamed: 0,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,...,b_CREDIT_TYPE_Consumer credit,b_CREDIT_TYPE_Credit card,b_CREDIT_TYPE_Loan for business development,b_CREDIT_TYPE_Loan for working capital replenishment,b_CREDIT_TYPE_MISC,b_CREDIT_TYPE_Microloan,b_CREDIT_TYPE_Mortgage,b_CREDIT_TYPE_Unknown type of loan,b_N,no_bureau
0,0,0,0,0,0,202500.0,406597.5,24700.5,351000.0,0.018801,...,4.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0
1,0,1,0,1,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,...,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0
2,1,0,1,0,0,67500.0,135000.0,6750.0,135000.0,0.010032,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0
3,0,1,0,0,0,135000.0,312682.5,29686.5,297000.0,0.008019,...,,,,,,,,,,1
4,0,0,0,0,0,121500.0,513000.0,21865.5,513000.0,0.028663,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


<a id="models"></a>

# [^](#toc) <u>Modeling</u>

### Machine Learning Imports

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn.metrics         import roc_auc_score, precision_recall_curve, roc_curve
from sklearn.model_selection import KFold

import lightgbm as lgb
from lightgbm                import LGBMClassifier

<a id="feat_reduction"></a>

### [^](#toc) Feature Reduction

In [28]:
training_x, val_x, training_y, val_y = train_test_split(train_x, train_y, test_size=0.2, random_state=17)
lgb_train = lgb.Dataset(data=training_x, label=training_y)
lgb_eval  = lgb.Dataset(data=val_x, label=val_y)

# try feature_fraction
params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.01, 'num_leaves': 48, 'num_iteration': 5000, 'verbose': 0 ,
          'colsample_bytree':.8, 'subsample':.9, 'max_depth':7, 'reg_alpha':.1, 'reg_lambda':.1, 
          'min_split_gain':.01, 'min_child_weight':1}

start = time.time()
model = lgb.train(params, lgb_train, valid_sets=lgb_eval, early_stopping_rounds=150, verbose_eval=200)
print("Training took {} seconds".format(round(time.time() - start)))

Training until validation scores don't improve for 150 rounds.
[200]	valid_0's auc: 0.741063
[400]	valid_0's auc: 0.756603
[600]	valid_0's auc: 0.768568
[800]	valid_0's auc: 0.775112
[1000]	valid_0's auc: 0.778152
[1200]	valid_0's auc: 0.779786
[1400]	valid_0's auc: 0.780904
[1600]	valid_0's auc: 0.781531
[1800]	valid_0's auc: 0.781982
[2000]	valid_0's auc: 0.782147
[2200]	valid_0's auc: 0.782333
[2400]	valid_0's auc: 0.782354
[2600]	valid_0's auc: 0.782361
Early stopping, best iteration is:
[2498]	valid_0's auc: 0.782466
Training took 1062 seconds


<a id="important_feats"></a>

### [^](#toc) Most important features

In [30]:
NUM_FEATS = 350

feats = sorted(list(zip(model.feature_importance(), train_x.columns)))
feats = list(list(zip(*feats[-NUM_FEATS:]))[1])

<a id="param_tuning"></a>

### [^](#toc) Parameter tuning

TODO: Add parameters:

    min_data_in_leaf
    feature_fraction
    feature_fraction_seed

Step 1: lambda tuning (L2 regularization)

<div hidden>

lambda = 10, num_iter = 2500
- [2500] train: 0.871737, test: 0.78456

lambda = 20, num_iter = 5000
- [3270] train: 0.882077, test: 0.785751

lambda = 40, num_iter = 3000
- [3000] train: 0.868533, test: 0.78583

##### Implemented random_state=17

lambda = 80, num_iter = 3000
- [3000] train: 0.859416, test: 0.785235

lambda = 160, num_iter = 4000
- [3789] train: 0.86189, test: 0.785736

lambda = 0.1, num_iter = 4000
- [2603] train: 0.895891, test: 0.784205

##### change max_depth from 6 to 7




Other variables
--------------

learning_rate = 0.01,
num_leaves = 48,
colsample_bytree = 0.8,
subsample = 0.9,
max_depth = 6,
reg_alpha = 0.1,
min_split_gain = 0.01,
min_child_weight = 1,

</div>

Step 2: max_depth tuning

<div hidden>

max_depth = 6, num_iter = 3000
- [3000] train: 0.859416, test: 0.785235

max_depth = 7, num_iter = 4000
- [3415] train: 0.878744, test: 0.786247

max_depth = 8, num_iter = 4000
- [2965] train: 0.875784, test: 0.786024

##### Change reg_alpha from 0.1 to 40

max_depth = 8, num_iter = 4000
- [3420] train: 0.858222, test: 0.785642

max_depth = 9, num_iter = 4000
- [] train: , test: 

max_depth = 10, num_iter = 4000
- [] train: , test: 

Other variables
--------------

learning_rate = 0.01,
num_leaves = 48,
colsample_bytree = 0.8,
subsample = 0.9,
reg_alpha = 0.1,
reg_lambda = 80,
min_split_gain = 0.01,
min_child_weight = 1,
random_state=17

</div>

Step 3: alpha tuning (L1 regularization)

In [None]:
training_x, val_x, training_y, val_y = train_test_split(train_x[feats], train_y, test_size=0.2, random_state=17)

model = LGBMClassifier(
                        learning_rate = 0.01,
                        num_leaves = 48,
                        colsample_bytree = 0.8,
                        subsample = 0.9,
                        max_depth = 7,
                        reg_alpha = 0.1,
                        reg_lambda = 40,
                        min_split_gain = 0.01,
                        min_child_weight = 1,
                        num_iteration = 4000,
                        random_state=17
)

model.fit(training_x, training_y, 
          eval_set= [(training_x, training_y), (val_x, val_y)], 
          eval_metric='auc', verbose=100, early_stopping_rounds=100
)

<a id="cv"></a>

### [^](#toc) CV Score

In [37]:
def get_score(train_x, train_y, usecols, params, dropcols=[]):  
    dtrain = lgb.Dataset(train_x[usecols].drop(dropcols, axis=1), train_y)
    eval = lgb.cv(params,
             dtrain,
             nfold=5,
             stratified=True,
             num_boost_round=20000,
             early_stopping_rounds=200,
             verbose_eval=100,
             seed = 5,
             show_stdv=True)
    return max(eval['auc-mean'])

params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.01, 'num_leaves': 48, 'num_iteration': 3500, 'verbose': 0 ,
          'colsample_bytree':.8, 'subsample':.9, 'max_depth':7, 'reg_alpha':.1, 'reg_lambda':40, 
          'min_split_gain':.01, 'min_child_weight':1}
    
get_score(train_x, train_y, feats, params)

[100]	cv_agg's auc: 0.728685 + 0.00197275
[200]	cv_agg's auc: 0.742231 + 0.00223602
[300]	cv_agg's auc: 0.752246 + 0.00210872
[400]	cv_agg's auc: 0.761042 + 0.00188143
[500]	cv_agg's auc: 0.767395 + 0.00179124
[600]	cv_agg's auc: 0.772201 + 0.00170439
[700]	cv_agg's auc: 0.775731 + 0.00159968
[800]	cv_agg's auc: 0.778187 + 0.00156327
[900]	cv_agg's auc: 0.779986 + 0.00160404
[1000]	cv_agg's auc: 0.781391 + 0.00164479
[1100]	cv_agg's auc: 0.782433 + 0.00169232
[1200]	cv_agg's auc: 0.78324 + 0.00168115
[1300]	cv_agg's auc: 0.783937 + 0.00163168
[1400]	cv_agg's auc: 0.784519 + 0.00159277
[1500]	cv_agg's auc: 0.784978 + 0.0015947
[1600]	cv_agg's auc: 0.785432 + 0.00157407
[1700]	cv_agg's auc: 0.785781 + 0.0015493
[1800]	cv_agg's auc: 0.786105 + 0.00156533
[1900]	cv_agg's auc: 0.786391 + 0.00156107
[2000]	cv_agg's auc: 0.786633 + 0.00156858
[2100]	cv_agg's auc: 0.786865 + 0.00154762
[2200]	cv_agg's auc: 0.787007 + 0.00153578
[2300]	cv_agg's auc: 0.78718 + 0.00150114
[2400]	cv_agg's auc: 0.7

0.78785864122411065

---
<a id="final"></a>

# [^](#toc) <u>Final submission</u>

Help from [Dmitriy Kisil](https://www.kaggle.com/oysiyl) and [his kernel](https://www.kaggle.com/oysiyl/good-fun-with-ligthgbm/code)

In [42]:
folds = KFold(n_splits=5, shuffle=True, random_state=17)
oof_preds = np.zeros(train_x.shape[0])
sub_preds = np.zeros(test_x.shape[0])

for n_fold, (trn_idx, val_idx) in enumerate(folds.split(train_x)):
    trn_x, trn_y = train_x[feats].iloc[trn_idx], train_y.iloc[trn_idx]
    val_x, val_y = train_x[feats].iloc[val_idx], train_y.iloc[val_idx]
  

    model = LGBMClassifier(
        learning_rate = 0.01,
        num_leaves = 48,
        colsample_bytree = 0.8,
        subsample = 0.9,
        max_depth = 7,
        reg_alpha = 0.1,
        reg_lambda = 40,
        min_split_gain = 0.01,
        min_child_weight = 1,
        num_iteration = 4000,
        random_state=17 #random.randint(1, 100) # Recommended to make the seed random
    )
        
    model.fit(trn_x, trn_y, 
            eval_set= [(trn_x, trn_y), (val_x, val_y)], 
            eval_metric='auc', verbose=100, early_stopping_rounds=100
           )
    
    oof_preds[val_idx] = model.predict_proba(val_x, num_iteration=model.best_iteration_)[:, 1]
    sub_preds         += model.predict_proba(
                                             test_x[feats], 
                                             num_iteration=model.best_iteration_
                                            )[:, 1] / folds.n_splits
    
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))
    del model, trn_x, trn_y, val_x, val_y
    gc.collect()

Training until validation scores don't improve for 100 rounds.
[100]	training's auc: 0.739145	valid_1's auc: 0.72141
[200]	training's auc: 0.755157	valid_1's auc: 0.734595
[300]	training's auc: 0.771063	valid_1's auc: 0.746886
[400]	training's auc: 0.783347	valid_1's auc: 0.756159
[500]	training's auc: 0.793723	valid_1's auc: 0.763541
[600]	training's auc: 0.802127	valid_1's auc: 0.768948
[700]	training's auc: 0.808974	valid_1's auc: 0.772818
[800]	training's auc: 0.814616	valid_1's auc: 0.775662
[900]	training's auc: 0.819388	valid_1's auc: 0.777724
[1000]	training's auc: 0.823686	valid_1's auc: 0.779236
[1100]	training's auc: 0.827613	valid_1's auc: 0.780341
[1200]	training's auc: 0.831232	valid_1's auc: 0.781264
[1300]	training's auc: 0.834612	valid_1's auc: 0.781969
[1400]	training's auc: 0.837893	valid_1's auc: 0.782652
[1500]	training's auc: 0.840863	valid_1's auc: 0.783169
[1600]	training's auc: 0.843903	valid_1's auc: 0.783651
[1700]	training's auc: 0.846852	valid_1's auc: 0.78

[1500]	training's auc: 0.840345	valid_1's auc: 0.788182
[1600]	training's auc: 0.843377	valid_1's auc: 0.78869
[1700]	training's auc: 0.846347	valid_1's auc: 0.789097
[1800]	training's auc: 0.849075	valid_1's auc: 0.789491
[1900]	training's auc: 0.851941	valid_1's auc: 0.789835
[2000]	training's auc: 0.854655	valid_1's auc: 0.790127
[2100]	training's auc: 0.857227	valid_1's auc: 0.790321
[2200]	training's auc: 0.859682	valid_1's auc: 0.790518
[2300]	training's auc: 0.862293	valid_1's auc: 0.790769
[2400]	training's auc: 0.864877	valid_1's auc: 0.790947
[2500]	training's auc: 0.867407	valid_1's auc: 0.791072
[2600]	training's auc: 0.869817	valid_1's auc: 0.791187
[2700]	training's auc: 0.872195	valid_1's auc: 0.791243
[2800]	training's auc: 0.874527	valid_1's auc: 0.791368
[2900]	training's auc: 0.876742	valid_1's auc: 0.791338
Early stopping, best iteration is:
[2819]	training's auc: 0.874913	valid_1's auc: 0.791381
Fold  5 AUC : 0.791381


<a id="final_pred"></a>

### [^](#toc) Final predictions

In [43]:
pd.DataFrame({
    "SK_ID_CURR": test_id,
    "TARGET": sub_preds
}).to_csv("../submissions/lambda_40.csv", index=False)