<a id="toc"></a>

# <u>Table of Contents</u>
1.) [TODO](#todo)  
2.) [Imports](#imports)  
3.) [Load data](#load)  
4.) [Bureau](#bureau)  
5.) [Bureau Balance](#bureau_bal)  
&nbsp;&nbsp;&nbsp;&nbsp; 5.1.) [Merge into Bureau](#merge_bureau_bal)  
6.) [Previous Application](#prev_app)  
7.) [POS CASH balance](#pos_cash)  
&nbsp;&nbsp;&nbsp;&nbsp; 7.1.) [Missing values](#pos_nan)  
&nbsp;&nbsp;&nbsp;&nbsp; 7.2.) [Merge into Previous Application](#merge_pos_cash)  
8.) [Installment Payments](#install_pay)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.1.) [Missing values](#install_nan)  
&nbsp;&nbsp;&nbsp;&nbsp; 8.2.) [Merge into Previous Application](#merge_install_pay)  
9.) [Credit Card Balance](#credit)  
&nbsp;&nbsp;&nbsp;&nbsp; 9.1.) [Missing values](#credit_nan)  
&nbsp;&nbsp;&nbsp;&nbsp; 9.2.) [Merge into Previous Application](#merge_credit)  
10.) [Misc clean up](#clean_up)  
11.) [Final Data Prep](#final_merge)  
&nbsp;&nbsp;&nbsp;&nbsp; 11.1.) [Missing values](#final_nan)  
12.) [Modeling](#models)  
13.) [Predictions](#predictions)  
14.) [Save file to CSV](#save)  

<a id="todo"></a>

# [^](#toc) <u>TODO</u>

- Fix skew on columns
- Tinker with the best way to replace missing values (dropping cols?)
- Look for outliers
- Merge db together
- Include timeline relationships like MONTHS_BALANCE
- Tune model parameters
- Address [this](https://www.kaggle.com/c/home-credit-default-risk/discussion/57248)

---
<a id="imports"></a>

# [^](#toc) <u>Imports</u>

In [1]:
### Standard imports
import pandas as pd
import numpy as np

# Time keeper
import time

# Progress bar
from tqdm import tqdm

# Modeling imports
from sklearn.model_selection import train_test_split 
import lightgbm as lgb

### Removes warnings from output
import warnings
warnings.filterwarnings('ignore')

### Helper functions

In [2]:
# function to create dummy variables of categorical features
def get_dummies(df, cats):
    for col in cats:
        df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
    return df 

def fillna_num(df):
    missing_cols = [col for col in df.columns if any(df[col].isnull()) and df[col].dtype != object]
    for col in missing_cols:
        df[col] = df[col].fillna(df[col].median())
    return df

def fillna_cat(df):
    for col in [col for col in df if df[col].dtype==object]:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df

def factorize_df(df, cats):
    for col in cats:
        df[col], _ = pd.factorize(df[col])
    return df 
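
To make the imputation rules concrete, here is a minimal sketch on a toy frame (the data is made up for illustration). It applies the same two rules as `fillna_num` and `fillna_cat`: numeric gaps get the column median, categorical gaps get the most frequent value.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one numeric and one categorical gap
toy = pd.DataFrame({
    "amount": [1.0, np.nan, 3.0, 100.0],
    "status": ["Active", "Closed", None, "Closed"],
})

# Numeric columns: fill with the median, which is robust to the outlier 100.0
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# Non-numeric columns: fill with the most frequent value
for col in toy.select_dtypes(exclude="number").columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The median is used rather than the mean so that a single extreme value does not drag the fill value upward.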

---
<a id="load"></a>

# [^](#toc) <u>Data Path</u>

In [3]:
DATA_PATH = "../data/home_default/"

Shape of bureau: (1716428, 17)
Shape of prev_app: (1670214, 37)


---
<a id="bureau"></a>

# [^](#toc) <u>Bureau</u>

In [None]:
bureau   = pd.read_csv(DATA_PATH + "bureau.csv")

print("Shape of bureau:",    bureau.shape)

### Missing values

In [None]:
bureau = fillna_num(bureau)
bureau = fillna_cat(bureau)

sum(bureau.isnull().sum())

<a id="bureau_bal"></a>

# [^](#toc) <u>Bureau Balance</u>

In [7]:
bureau_balance = pd.read_csv(DATA_PATH + "bureau_balance.csv")
print("Shape of bureau_balance:",  bureau_balance.shape)

print("\nColumns of bureau_balance:")
print(" --- ".join(bureau_balance.columns.values))

Shape of bureau_balance: (27299925, 3)

Columns of bureau_balance:
SK_ID_BUREAU --- MONTHS_BALANCE --- STATUS


<a id="merge_bureau_bal"></a>

### [^](#toc) <u>Merge into Bureau</u>

In [8]:
# Setup bureau balance - get dummies
merge_df = get_dummies(bureau_balance, ["STATUS"])

merge_df = merge_df.drop(["MONTHS_BALANCE", "STATUS"], axis=1)

# prep for merge
merge_df = merge_df.groupby("SK_ID_BUREAU").sum()

### Add the max number of months
merge_df["max_months"] = bureau_balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].max()

### Remember added columns
merged_cols = ['bur_bal_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
bureau = bureau.merge(right=merge_df.reset_index(), how='left', on='SK_ID_BUREAU')
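
The aggregation above can be sketched on a few rows. This is a hedged miniature with hypothetical data, not the real files: one-hot encode `STATUS`, sum per loan to count how often each status occurs, then left-merge so loans without any history are kept.

```python
import pandas as pd

# Hypothetical miniature of bureau_balance: one row per loan per month
balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [-2, -1, 0, -1, 0],
    "STATUS": ["0", "1", "C", "0", "0"],
})
# Hypothetical miniature of bureau; loan 3 has no balance history
loans = pd.DataFrame({"SK_ID_BUREAU": [1, 2, 3]})

# One-hot encode STATUS, then sum per loan to count each status
dummies = pd.get_dummies(balance["STATUS"], prefix="STATUS").astype(int)
counts = dummies.groupby(balance["SK_ID_BUREAU"]).sum()
counts["max_months"] = balance.groupby("SK_ID_BUREAU")["MONTHS_BALANCE"].max()
counts = counts.add_prefix("bur_bal_")

# Left merge keeps every loan; loans without history get NaN
merged = loans.merge(counts.reset_index(), how="left", on="SK_ID_BUREAU")
print(merged)
```

Loan 3 comes out with NaN in every `bur_bal_` column, which is why a fill step is needed after the merge.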

### Fill in new missing values

In [9]:
bureau["no_bureau_bal"] = bureau[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)
bureau[merged_cols]     = bureau[merged_cols].fillna(0)
sum(bureau.isnull().sum())
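
The order of the two steps matters: the flag has to be computed before the NaNs are overwritten. A small sketch with made-up values, using the vectorized `isna()` as one possible alternative to the `map` with a lambda:

```python
import numpy as np
import pandas as pd

# Hypothetical result of a left merge: loan 3 had no balance rows
merged = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2, 3],
    "bur_bal_STATUS_0": [1.0, 2.0, np.nan],
    "bur_bal_max_months": [0.0, 0.0, np.nan],
})
new_cols = ["bur_bal_STATUS_0", "bur_bal_max_months"]

# Flag first: 1 where the merge found nothing for this loan
merged["no_bureau_bal"] = merged[new_cols[0]].isna().astype(int)

# Fill second: zero counts for loans without history
merged[new_cols] = merged[new_cols].fillna(0)
print(merged)
```

Keeping the flag lets a model distinguish "no history at all" from "history with zero counts".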

---
<a id="prev_app"></a>

# [^](#toc) <u>Previous Application</u>

In [None]:
prev_app = pd.read_csv(DATA_PATH + "previous_application.csv")

print("Shape of prev_app:",  prev_app.shape)

<a id="prev_nan"></a>

### [^](#toc) Missing values

In [None]:
prev_app = fillna_num(prev_app)
prev_app = fillna_cat(prev_app)

sum(prev_app.isnull().sum())

---
<a id="pos_cash"></a>

# [^](#toc) <u>POS CASH balance</u>

In [10]:
pcb = pd.read_csv(DATA_PATH + "POS_CASH_balance.csv")
print("Shape of pcb:",  pcb.shape)

print("\nColumns of pcb:")
print(" --- ".join(pcb.columns.values))

Shape of pcb: (10001358, 8)

Columns of pcb:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- CNT_INSTALMENT --- CNT_INSTALMENT_FUTURE --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="pos_nan"></a>

### [^](#toc) Missing Values

In [11]:
for col in ("CNT_INSTALMENT", "CNT_INSTALMENT_FUTURE"):
    pcb[col] = pcb[col].fillna(pcb[col].median())

### Remove Outliers

In [12]:
pcb = pcb.drop(pcb[pcb.NAME_CONTRACT_STATUS.isin(["XNA", "Canceled"])].index)

### Get Dummies

In [13]:
merge_df = pcb[["SK_ID_PREV", "NAME_CONTRACT_STATUS"]]

merge_df = get_dummies(merge_df, ["NAME_CONTRACT_STATUS"])
merge_df = merge_df.drop("NAME_CONTRACT_STATUS", axis=1)

<a id="merge_pos_cash"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [14]:
# prep for merge
count    = merge_df.groupby("SK_ID_PREV").count()
merge_df = merge_df.groupby("SK_ID_PREV").sum().reset_index()
merge_df["N"] = list(count.iloc[:,0])

# Add the median values.  MONTHS_BALANCE will be added as the max
right    = pcb.drop(["SK_ID_CURR", "MONTHS_BALANCE"], axis=1).groupby("SK_ID_PREV").median(numeric_only=True).reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

### Add the max number of months
merge_df["max_months"] = pcb.groupby("SK_ID_PREV").MONTHS_BALANCE.max()

merged_cols = ['pos_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

### Fill in missing values

In [19]:
prev_app["no_pcb"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

for col in tqdm(merged_cols):
    not_null      = prev_app[col].notnull()
    mode          = prev_app[not_null][col].mode().iloc[0]
    prev_app[col] = prev_app[col].fillna(mode)    
    
sum(prev_app.isnull().sum())




0

---
<a id="install_pay"></a>

# [^](#toc) <u>Installment Payments</u>

In [21]:
install_pay = pd.read_csv(DATA_PATH + "installments_payments.csv")
print("Shape of install_pay:",  install_pay.shape)

print("\nColumns of install_pay:")
print(" --- ".join(install_pay.columns.values))

Shape of install_pay: (13605401, 8)

Columns of install_pay:
SK_ID_PREV --- SK_ID_CURR --- NUM_INSTALMENT_VERSION --- NUM_INSTALMENT_NUMBER --- DAYS_INSTALMENT --- DAYS_ENTRY_PAYMENT --- AMT_INSTALMENT --- AMT_PAYMENT


<a id="install_nan"></a>

### [^](#toc) <u>Missing values</u>

In [22]:
for col in ("DAYS_ENTRY_PAYMENT", "AMT_PAYMENT"):
    install_pay[col + "_nan"] = install_pay[col].map(lambda x: 1 if np.isnan(x) else 0)
    install_pay[col] = install_pay[col].fillna(0)

### Setup for merge

In [23]:
install_pay["AMT_MISSING"] = install_pay["AMT_INSTALMENT"] - install_pay["AMT_PAYMENT"]
temp = install_pay.groupby("SK_ID_PREV")["AMT_MISSING"]

merge_df = pd.DataFrame({
    "INSTALL_missing_max": temp.max(),
    "INSTALL_missing_min": temp.min(),
    "INSTALL_missing_med": temp.median(),
    "INSTALL_payment_nan": install_pay.groupby("SK_ID_PREV")["AMT_PAYMENT_nan"].sum(),
    "INSTALL_N":           temp.count()
})
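
The same per-loan summary can also be written with pandas named aggregation. The sketch below uses hypothetical rows and is only an alternative formulation of the cell above, not a change to it.

```python
import pandas as pd

# Hypothetical installment rows: loan 10 has one short and one missed payment
pay = pd.DataFrame({
    "SK_ID_PREV": [10, 10, 10, 20],
    "AMT_INSTALMENT": [100.0, 100.0, 100.0, 50.0],
    "AMT_PAYMENT": [100.0, 80.0, 0.0, 50.0],
    "AMT_PAYMENT_nan": [0, 0, 1, 0],
})

# Shortfall per installment: amount due minus amount paid
pay["AMT_MISSING"] = pay["AMT_INSTALMENT"] - pay["AMT_PAYMENT"]

# One row per previous loan, one named column per statistic
summary = pay.groupby("SK_ID_PREV").agg(
    INSTALL_missing_max=("AMT_MISSING", "max"),
    INSTALL_missing_min=("AMT_MISSING", "min"),
    INSTALL_missing_med=("AMT_MISSING", "median"),
    INSTALL_payment_nan=("AMT_PAYMENT_nan", "sum"),
    INSTALL_N=("AMT_MISSING", "count"),
)
print(summary)
```

Named aggregation groups once instead of building each statistic from a separate `groupby` call, which may matter on a table with millions of rows.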

### Add the rest of the columns

In [24]:
right = install_pay.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median().reset_index()
merge_df = merge_df.reset_index()

merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")
merged_cols = merge_df.columns

<a id="merge_install_pay"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [25]:
# Merge
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

### Fill in missing values

In [26]:
prev_app["no_install"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

for col in tqdm(merged_cols):
    not_null      = prev_app[col].notnull()
    mode          = prev_app[not_null][col].mode().iloc[0]
    prev_app[col] = prev_app[col].fillna(mode)    
    
sum(prev_app.isnull().sum())




0

---
<a id="credit"></a>

# [^](#toc) <u>Credit Card Balance</u>

In [27]:
credit_card = pd.read_csv(DATA_PATH + "credit_card_balance.csv")
print("Shape of credit_card:",  credit_card.shape)

print("\nColumns of credit_card:")
print(" --- ".join(credit_card.columns.values))

Shape of credit_card: (3840312, 23)

Columns of credit_card:
SK_ID_PREV --- SK_ID_CURR --- MONTHS_BALANCE --- AMT_BALANCE --- AMT_CREDIT_LIMIT_ACTUAL --- AMT_DRAWINGS_ATM_CURRENT --- AMT_DRAWINGS_CURRENT --- AMT_DRAWINGS_OTHER_CURRENT --- AMT_DRAWINGS_POS_CURRENT --- AMT_INST_MIN_REGULARITY --- AMT_PAYMENT_CURRENT --- AMT_PAYMENT_TOTAL_CURRENT --- AMT_RECEIVABLE_PRINCIPAL --- AMT_RECIVABLE --- AMT_TOTAL_RECEIVABLE --- CNT_DRAWINGS_ATM_CURRENT --- CNT_DRAWINGS_CURRENT --- CNT_DRAWINGS_OTHER_CURRENT --- CNT_DRAWINGS_POS_CURRENT --- CNT_INSTALMENT_MATURE_CUM --- NAME_CONTRACT_STATUS --- SK_DPD --- SK_DPD_DEF


<a id="credit_nan"></a>

### [^](#toc) <u>Missing Values and Outliers</u>

In [28]:
# ------------------------------
### Remove outliers
# Gets indices with outlier values
temp = credit_card[credit_card.NAME_CONTRACT_STATUS.isin(["Refused", "Approved"])].index

# Drops outlier values
credit_card = credit_card.drop(temp, axis=0)

# ------------------------------
#### Fill in missing values
cols = [
        "AMT_DRAWINGS_ATM_CURRENT", "AMT_DRAWINGS_OTHER_CURRENT", "AMT_DRAWINGS_POS_CURRENT", 
        "AMT_INST_MIN_REGULARITY", "AMT_PAYMENT_CURRENT", "CNT_DRAWINGS_ATM_CURRENT", 
        "CNT_DRAWINGS_OTHER_CURRENT", "CNT_DRAWINGS_POS_CURRENT", "CNT_INSTALMENT_MATURE_CUM"
]
for col in tqdm(cols):
    not_null = credit_card[col].notnull()
    mode = credit_card.loc[not_null, col].mode().iloc[0]
    credit_card[col] = credit_card[col].fillna(mode)
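
One detail of mode imputation worth spelling out: `Series.mode()` returns a Series, because several values can tie for most frequent. A minimal sketch with made-up numbers, taking the first (smallest) of the tied values as the fill value:

```python
import numpy as np
import pandas as pd

# Two values tie for the mode; NaN is ignored by default
s = pd.Series([0.0, 0.0, 5.0, 5.0, np.nan])

modes = s.mode()            # Series of all tied modes, sorted ascending
fill_value = modes.iloc[0]  # pick the first one explicitly
filled = s.fillna(fill_value)
print(modes.tolist(), filled.tolist())
```

Picking the first tied value is an arbitrary but deterministic choice; any other rule would need to be stated just as explicitly.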




### Setup Categorical column

In [29]:
temp = credit_card[["SK_ID_PREV", "NAME_CONTRACT_STATUS"]]

temp = get_dummies(temp, ["NAME_CONTRACT_STATUS"])
temp = temp.drop("NAME_CONTRACT_STATUS", axis=1)
temp = temp.groupby("SK_ID_PREV").sum()

### Select columns

In [30]:
merge_df = pd.DataFrame({
    "mean_AMT_BALANCE": credit_card.groupby("SK_ID_PREV").AMT_BALANCE.mean(),
    "max_SK_DPD":      credit_card.groupby("SK_ID_PREV").SK_DPD.max(),
    "max_SK_DPD_DEF":  credit_card.groupby("SK_ID_PREV").SK_DPD_DEF.max(),
    "N":           credit_card.groupby("SK_ID_PREV").count().iloc[:,0]
})

merge_df = temp.join(merge_df)
del temp

### Add the rest of the columns

In [31]:
right = credit_card.drop("SK_ID_CURR", axis=1).groupby("SK_ID_PREV").median(numeric_only=True).reset_index()
merge_df = merge_df.reset_index()
merge_df = merge_df.merge(right=right, how="left", on="SK_ID_PREV").set_index("SK_ID_PREV")

<a id="merge_credit"></a>

### [^](#toc) <u>Merge into Previous Application</u>

In [32]:
# Merge
merged_cols = ['credit_' + col for col in merge_df.columns]
merge_df.columns = merged_cols
prev_app = prev_app.merge(right=merge_df.reset_index(), how='left', on='SK_ID_PREV')

### Fill in new NaN values

In [33]:
prev_app["no_credit"] = prev_app[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

for col in tqdm(merged_cols):
    not_null = prev_app[col].notnull()
    median = prev_app[not_null][col].median()
    prev_app[col] = prev_app[col].fillna(median)    
    
sum(prev_app.isnull().sum())




0

---
<a id="clean_up"></a>

# [^](#toc) <u>Misc clean up</u>

### Drop identification columns

Maybe I shouldn't? Dropping these columns may lose information.

In [34]:
### Drop unneeded SK_ID_PREV from prev_app
# prev_app = prev_app.drop("SK_ID_PREV", axis=1)
# bureau   = bureau.drop("SK_ID_BUREAU", axis=1)

print("Number of null in prev_app:", sum(prev_app.isnull().sum()))
print("Number of null in bureau:  ", sum(bureau.isnull().sum()))

Number of null in prev_app: 0
Number of null in bureau:   0


### Remove outliers

bureau.CREDIT_ACTIVE.value_counts()

    Closed      1079273
    Active       630607
    Sold           6527
    Bad debt         21
    
Maybe the "Bad debt" rows should stay?  Or merge them with "Sold"?

In [35]:
# bureau = bureau.drop(bureau[bureau.CREDIT_ACTIVE == "Bad debt"].index)

### Get dummies

In [36]:
prev_app = pd.get_dummies(prev_app)
bureau   = pd.get_dummies(bureau)

---
<a id="final_merge"></a>

# [^](#toc) <u>Final Data Prep</u>

In [37]:
train = pd.read_csv(DATA_PATH + "train.csv")
test  = pd.read_csv(DATA_PATH + "test.csv")

print("Shape of train:", train.shape)
print("Shape of test:",  test.shape)

Shape of train: (307511, 122)
Shape of test: (48744, 121)


### Split into predictors, target, and id

In [38]:
train_y = train.TARGET
train_x = train.drop(["TARGET"], axis=1)

test_id = test.SK_ID_CURR
test_x  = test

### Merge train and test data

In [39]:
full    = pd.concat([train_x, test_x])
train_N = len(train_x)

<a id="final_nan"></a>

### [^](#toc) <u>Missing values</u>

In [40]:
full = fillna_cat(full)
full = fillna_num(full)
sum(full.isnull().sum())

0

### Factorize

In [41]:
# Get categorical features
data_cats = [col for col in full.columns if full[col].dtype == 'object']

# Factorize the dataframe
full = factorize_df(full, data_cats)
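
For reference, this is what `pd.factorize` does to a single column. The values below are illustrative only.

```python
import pandas as pd

# Illustrative column with a repeated label and one missing entry
col = pd.Series(["Cash loans", "Revolving loans", "Cash loans", None])

# codes: integer label per row; uniques: the labels in order of first appearance
codes, uniques = pd.factorize(col)
print(codes.tolist(), list(uniques))
```

Missing entries are coded as -1. In this notebook the missing values are filled before factorizing, so that sentinel should not appear in `full`.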

### Merge Previous Application with full

In [42]:
merge_df      = prev_app.groupby('SK_ID_CURR').mean()
merge_df["N"] = prev_app.groupby('SK_ID_CURR').count().iloc[:,0]
merged_cols   = ['p_' + col for col in merge_df.columns]
merge_df.columns = merged_cols

full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')

#### Fill NaN values

In [43]:
full["no_prev_app"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

for col in tqdm(merged_cols):
    not_null  = full[col].notnull()
    median    = full[not_null][col].median()
    full[col] = full[col].fillna(median)    
    
sum(full.isnull().sum())




0

### Merge Bureau with full

In [44]:
# Average Values for all bureau features 
merge_df         = bureau.groupby('SK_ID_CURR').mean().sort_index()
merge_df['N']    = bureau.groupby('SK_ID_CURR').count().sort_index().iloc[:,0]

### Add the debt to overdue ratio
right = (bureau.groupby("SK_ID_CURR")['AMT_CREDIT_SUM_DEBT'].sum() /
         bureau.groupby("SK_ID_CURR")['AMT_CREDIT_SUM_OVERDUE'].sum() ).sort_index()
merge_df["debt_to_overdue"] = right

### Add the debt to credit ratio
right = (bureau.groupby("SK_ID_CURR")['AMT_CREDIT_SUM_DEBT'].sum() /
         bureau.groupby("SK_ID_CURR")['AMT_CREDIT_SUM'].sum() ).sort_index()
merge_df["debt_to_credit"] = right

merged_cols = ['b_' + f_ for f_ in merge_df.columns]
merge_df.columns = merged_cols

full = full.merge(right=merge_df.reset_index(), how='left', on='SK_ID_CURR')
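
One thing to watch with ratio features such as `debt_to_overdue`: if the denominator sums to zero for a client, pandas returns `inf` (or `NaN` for 0/0) rather than raising, and `fillna` leaves `inf` untouched. The sketch below uses hypothetical numbers and shows one possible way to handle it, by turning infinities into `NaN` so that a later fill step catches them.

```python
import numpy as np
import pandas as pd

# Hypothetical bureau rows: client 2 has debt but no overdue amount,
# client 3 has neither
bureau_toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 3],
    "AMT_CREDIT_SUM_DEBT": [100.0, 50.0, 200.0, 0.0],
    "AMT_CREDIT_SUM_OVERDUE": [10.0, 20.0, 0.0, 0.0],
})

sums = bureau_toy.groupby("SK_ID_CURR")[
    ["AMT_CREDIT_SUM_DEBT", "AMT_CREDIT_SUM_OVERDUE"]
].sum()

# Client 1: 150 / 30, client 2: 200 / 0 -> inf, client 3: 0 / 0 -> NaN
ratio = sums["AMT_CREDIT_SUM_DEBT"] / sums["AMT_CREDIT_SUM_OVERDUE"]

# Treat infinities as missing so median/zero filling can handle them
clean = ratio.replace([np.inf, -np.inf], np.nan)
print(clean)
```

Whether an infinite ratio should become missing, a large constant, or a separate flag is a modeling choice; the replacement here is only one option.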

#### Fill NaN values

In [45]:
full["no_bureau"] = full[merged_cols[0]].map(lambda x: 1 if np.isnan(x) else 0)

for col in tqdm(merged_cols):
    not_null  = full[col].notnull()
    median    = full[not_null][col].median()
    full[col] = full[col].fillna(median)    

sum(full.isnull().sum())




0

### Drop SK_ID_CURR

In [46]:
full = full.drop("SK_ID_CURR", axis=1)

### Split full back into train and test

In [47]:
train_x = full[:train_N]
test_x = full[train_N:]
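
The split relies on row order: the first `train_N` rows of `full` must still be the training rows. A toy round trip (hypothetical frames) illustrates the idea; it assumes every step in between preserves row order and row count, which a left merge does as long as the right-hand keys are unique.

```python
import pandas as pd

# Hypothetical stand-ins for train_x and test_x
train_toy = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "x": [0.1, 0.2, 0.3]})
test_toy = pd.DataFrame({"SK_ID_CURR": [4, 5], "x": [0.4, 0.5]})

# Stack the frames; ignore_index avoids duplicate index labels
full_toy = pd.concat([train_toy, test_toy], ignore_index=True)
n_train = len(train_toy)

# Shared feature engineering happens on the stacked frame
full_toy["x2"] = full_toy["x"] * 2

# Split back by position
train_back = full_toy.iloc[:n_train]
test_back = full_toy.iloc[n_train:]
print(train_back.shape, test_back.shape)
```

Processing train and test together keeps the encodings and fill values consistent between the two sets.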

### Processed data look

In [48]:
train_x.head()

| | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | ... | b_CREDIT_TYPE_Loan for purchase of shares (margin lending) | b_CREDIT_TYPE_Loan for the purchase of equipment | b_CREDIT_TYPE_Loan for working capital replenishment | b_CREDIT_TYPE_Microloan | b_CREDIT_TYPE_Mobile operator loan | b_CREDIT_TYPE_Mortgage | b_CREDIT_TYPE_Real estate loan | b_CREDIT_TYPE_Unknown type of loan | b_N | no_bureau |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |


---
<a id="models"></a>

# [^](#toc) <u>Models </u>

### sban's method

In [50]:
from sklearn.model_selection import train_test_split 
import lightgbm as lgb

# Maybe change the size of the validation split?

train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.2, random_state=17)
lgb_train = lgb.Dataset(data=train_x, label=train_y)
lgb_eval  = lgb.Dataset(data=val_x, label=val_y)

params = {'task': 'train', 'boosting_type': 'gbdt', 'objective': 'binary', 'metric': 'auc', 
          'learning_rate': 0.01, 'num_leaves': 48, 'num_iteration': 5000, 'verbose': 0 ,
          'colsample_bytree':.8, 'subsample':.9, 'max_depth':7, 'reg_alpha':.1, 'reg_lambda':.1, 
          'min_split_gain':.01, 'min_child_weight':1}

start = time.time()
model = lgb.train(params, lgb_train, valid_sets=lgb_eval, early_stopping_rounds=150, verbose_eval=200)
print("Training took {} seconds".format(round(time.time() - start)))

Training until validation scores don't improve for 150 rounds.
[200]	valid_0's auc: 0.738903
[400]	valid_0's auc: 0.755505
[600]	valid_0's auc: 0.76779
[800]	valid_0's auc: 0.773027
[1000]	valid_0's auc: 0.775646
[1200]	valid_0's auc: 0.777186
[1400]	valid_0's auc: 0.778163
[1600]	valid_0's auc: 0.778696
[1800]	valid_0's auc: 0.779073
[2000]	valid_0's auc: 0.77933
[2200]	valid_0's auc: 0.779568
[2400]	valid_0's auc: 0.779646
Early stopping, best iteration is:
[2265]	valid_0's auc: 0.779686
Training took 557 seconds


---
<a id="predictions"></a>

# [^](#toc) <u>Predictions</u>

In [51]:
predictions = model.predict(test_x)

---
<a id="save"></a>

# [^](#toc) <u>Save file to CSV</u>

In [52]:
pd.DataFrame({
    "SK_ID_CURR": test_id,
    "TARGET": predictions
}).to_csv("../submissions/same_filled_nan.csv", index=False)