# Experiment 3: Home Credit Default Risk

This notebook attempts to apply the lessons from Fast.ai Lessons 4 and 5 to a similar dataset.

## Contents

1. Explore and visualise dataset.
2. Prepare dataset.
3. Build and train model.
4. Evaluation.
5. Ideas for improvements.

In [14]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from pathlib import Path
import os

import pandas as pd

In [2]:
PATH = Path('./data/home-credit-default-risk')

In [4]:
# parents=True also creates ./data if it does not exist yet
PATH.mkdir(parents=True, exist_ok=True)

## 1. Explore and visualise dataset

### Download and extract dataset

In [6]:
# Get dataset
!kaggle competitions download -c home-credit-default-risk --path={PATH}

sample_submission.csv.zip: Downloaded 117KB of 117KB
application_test.csv.zip: Downloaded 6MB of 6MB
application_train.csv.zip: Downloaded 34MB of 34MB
bureau.csv.zip: Downloaded 36MB of 36MB
bureau_balance.csv.zip: Downloaded 61MB of 61MB
previous_application.csv.zip: Downloaded 74MB of 74MB
credit_card_balance.csv.zip: Downloaded 94MB of 94MB
POS_CASH_balance.csv.zip: Downloaded 106MB of 106MB
installments_payments.csv.zip: Downloaded 267MB of 267MB
HomeCredit_columns_description.csv: Downloaded 37KB of 37KB


In [12]:
for file in os.listdir(PATH):
    if not file.endswith('zip'):
        continue
    
    file_path = PATH / file

    !unzip -q -d {PATH} {file_path}
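
The loop above shells out to `unzip`, which may not be available on every system. As a portable alternative, here is a minimal sketch that does the same extraction with the standard library's `zipfile` module. The helper name `extract_archives` is hypothetical, and the demo runs on a throwaway temporary directory rather than the real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archives(directory):
    """Extract every .zip archive in `directory` in place; return the archive names."""
    extracted = []
    for archive in sorted(Path(directory).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(directory)
        extracted.append(archive.name)
    return extracted


# Self-contained demo on a temporary directory with one small archive.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    with zipfile.ZipFile(demo_dir / "demo.csv.zip", "w") as zf:
        zf.writestr("demo.csv", "a,b\n1,2\n")
    extracted = extract_archives(demo_dir)
    demo_contents = (demo_dir / "demo.csv").read_text()

print(extracted)
```

In the notebook, the equivalent call would be `extract_archives(PATH)`.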

In [13]:
!ls {PATH}

HomeCredit_columns_description.csv bureau_balance.csv.zip
POS_CASH_balance.csv               credit_card_balance.csv
POS_CASH_balance.csv.zip           credit_card_balance.csv.zip
application_test.csv               installments_payments.csv
application_test.csv.zip           installments_payments.csv.zip
application_train.csv              previous_application.csv
application_train.csv.zip          previous_application.csv.zip
bureau.csv                         sample_submission.csv
bureau.csv.zip                     sample_submission.csv.zip
bureau_balance.csv


### application_train

From the Kaggle data page:

*"This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample."*

In [15]:
train_df = pd.read_csv(PATH / 'application_train.csv')

In [16]:
train_df.head()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [18]:
len(train_df)

307511

Number of rows for each value of `TARGET`:

In [20]:
print("Num target = 1: ", len(train_df[train_df['TARGET'] == 1]))
print("Num target = 0: ", len(train_df[train_df['TARGET'] == 0]))

Num target = 1:  24825
Num target = 0:  282686
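
These counts show the two classes are far from balanced, which matters when choosing an evaluation metric later. The sketch below works out the positive rate, and the accuracy a model would get by always predicting 0. It uses only the counts printed above; the variable names are illustrative.

```python
# Counts copied from the cell output above.
n_positive = 24825    # rows with TARGET == 1
n_negative = 282686   # rows with TARGET == 0
n_total = n_positive + n_negative  # matches len(train_df) == 307511

# Share of loans with payment difficulties.
# In the notebook, train_df['TARGET'].mean() gives the same number for a 0/1 column.
positive_rate = n_positive / n_total

# Accuracy of a trivial model that always predicts TARGET == 0.
majority_baseline_accuracy = n_negative / n_total

print(f"Positive rate: {positive_rate:.2%}")                                 # 8.07%
print(f"Always-predict-0 accuracy: {majority_baseline_accuracy:.2%}")        # 91.93%
```

Because a constant prediction already scores about 92% accuracy, plain accuracy is probably not an informative metric for this dataset.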


In [21]:
test_df = pd.read_csv(PATH / 'application_test.csv')

In [22]:
test_df.head()

| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |


There's a file called `HomeCredit_columns_description.csv` that has a description of each column. Let's take a look at that.

In [29]:
column_desc = pd.read_csv(PATH / 'HomeCredit_columns_description.csv', encoding='latin_1')

In [30]:
column_desc.head()

| | Unnamed: 0 | Table | Row | Description | Special |
|---|---|---|---|---|---|
| 0 | 1 | application_{train\|test}.csv | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_{train\|test}.csv | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | 5 | application_{train\|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_{train\|test}.csv | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_{train\|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | NaN |


In [34]:
column_desc[column_desc['Table'] == 'application_{train|test}.csv']

| | Unnamed: 0 | Table | Row | Description | Special |
|---|---|---|---|---|---|
| 0 | 1 | application_{train\|test}.csv | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | 2 | application_{train\|test}.csv | TARGET | Target variable (1 - client with payment diffi... | NaN |
| 2 | 5 | application_{train\|test}.csv | NAME_CONTRACT_TYPE | Identification if loan is cash or revolving | NaN |
| 3 | 6 | application_{train\|test}.csv | CODE_GENDER | Gender of the client | NaN |
| 4 | 7 | application_{train\|test}.csv | FLAG_OWN_CAR | Flag if the client owns a car | NaN |
| 5 | 8 | application_{train\|test}.csv | FLAG_OWN_REALTY | Flag if client owns a house or flat | NaN |
| 6 | 9 | application_{train\|test}.csv | CNT_CHILDREN | Number of children the client has | NaN |
| 7 | 10 | application_{train\|test}.csv | AMT_INCOME_TOTAL | Income of the client | NaN |
| 8 | 11 | application_{train\|test}.csv | AMT_CREDIT | Credit amount of the loan | NaN |
| 9 | 12 | application_{train\|test}.csv | AMT_ANNUITY | Loan annuity | NaN |


In [37]:
column_desc[column_desc['Table'] == 'application_{train|test}.csv'].iloc[1]['Description']

'Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases)'
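
Looking descriptions up by position (`.iloc[1]`) works, but a lookup keyed by column name may be easier to reuse. The sketch below uses only the standard library `csv` module so it also runs outside the notebook. The helper name `load_descriptions` is hypothetical. The sample rows are copied from the output above, and the empty first header field is an assumption based on pandas labelling that column `Unnamed: 0`.

```python
import csv
import io


def load_descriptions(handle, table):
    """Return {column name: full description} for one table of the description file."""
    return {
        row["Row"]: row["Description"]
        for row in csv.DictReader(handle)
        if row["Table"] == table
    }


# In-memory stand-in for HomeCredit_columns_description.csv.
# For the real file, pass
# open(PATH / 'HomeCredit_columns_description.csv', encoding='latin_1', newline='')
# instead of this StringIO object.
sample = io.StringIO(
    ",Table,Row,Description,Special\n"
    "1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,\n"
    "6,application_{train|test}.csv,CODE_GENDER,Gender of the client,\n"
)

descriptions = load_descriptions(sample, "application_{train|test}.csv")
print(descriptions["CODE_GENDER"])  # Gender of the client
```

Filtering on `Table` first matters because the same column name (for example `SK_ID_CURR`) can appear in several of the dataset's files.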

#### Distribution of income

Plot the distribution of `AMT_INCOME_TOTAL` as a histogram.

In [38]:
train_df.hist(column='AMT_INCOME_TOTAL')