# Data Merging: Home Credit Default Risk

### by Michael

Step 1: Download all files with data from: https://www.kaggle.com/c/home-credit-default-risk/data

Step 2: Using the database schema below, which shows how the tables relate to each other, complete the following tasks:

![title](m4_w2.png)

In [1]:
import numpy as np
import pandas as pd
import glob

In [2]:
csv_files = glob.glob('*.csv')
print(csv_files)

['application_test.csv', 'application_train.csv', 'bureau.csv', 'bureau_balance.csv', 'credit_card_balance.csv', 'HomeCredit_columns_description.csv', 'installments_payments.csv', 'POS_CASH_balance.csv', 'previous_application.csv', 'sample_submission.csv']


## 1. Calculate the average loan size for all loan applicants from the application_train table

Applicants may have acquired loans from Home Credit, from other lenders (as reported to the Credit Bureau), or both. Therefore, an applicant may have two average loan sizes: one from the `bureau` table and one from the `previous_application` table.

In [3]:
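# Use explicit file names: positional indexing into csv_files depends on the order glob returns
appl = pd.read_csv('application_train.csv', usecols=['SK_ID_CURR'])
bureau = pd.read_csv('bureau.csv')
prev_app = pd.read_csv('previous_application.csv', usecols=['SK_ID_CURR', 'AMT_CREDIT', 'NAME_CONTRACT_STATUS'])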

In [4]:
appl.SK_ID_CURR.is_unique

True

In [5]:
appl_bureau = pd.merge(appl, bureau, on='SK_ID_CURR')
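
The uniqueness check on `SK_ID_CURR` can also be enforced inside the merge itself. A minimal sketch on made-up toy frames (`appl_toy` and `bureau_toy` are hypothetical stand-ins, not the real tables), using the `validate` parameter of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for appl and bureau (values are made up)
appl_toy = pd.DataFrame({'SK_ID_CURR': [1, 2]})
bureau_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'AMT_CREDIT_SUM': [100.0, 300.0, 50.0],
})

# validate='one_to_many' raises pandas.errors.MergeError
# if SK_ID_CURR is duplicated in the left table
merged = pd.merge(appl_toy, bureau_toy, on='SK_ID_CURR', validate='one_to_many')
print(merged)
```

With this option a duplicated applicant ID fails loudly at merge time instead of silently multiplying rows.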

In [6]:
# Average loan size per applicant according to the Credit Bureau (loans from other lenders)
avg_bureau_loan = appl_bureau.groupby('SK_ID_CURR').AMT_CREDIT_SUM.mean()
avg_bureau_loan

SK_ID_CURR
100002    108131.945625
100003    254350.125000
100004     94518.900000
100007    146250.000000
100008    156148.500000
              ...      
456247    471405.681818
456249    284142.973846
456253    990000.000000
456254     45000.000000
456255    345629.045455
Name: AMT_CREDIT_SUM, Length: 263491, dtype: float64

In [7]:
appl_prev_app = pd.merge(appl, prev_app, on='SK_ID_CURR')

In [8]:
# Average credit amount per applicant over all previous Home Credit applications (any contract status)
avg_prev_loan = appl_prev_app.groupby('SK_ID_CURR').AMT_CREDIT.mean()
avg_prev_loan

SK_ID_CURR
100002    179055.00
100003    484191.00
100004     20106.00
100006    291695.50
100007    166638.75
            ...    
456251     40455.00
456252     56821.50
456253     20625.75
456254    134439.75
456255    424431.00
Name: AMT_CREDIT, Length: 291057, dtype: float64
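
Since an applicant can have up to two averages, it may help to view them side by side. A minimal sketch, using small made-up Series in place of `avg_bureau_loan` and `avg_prev_loan` (the toy values and the column labels `avg_bureau_loan` and `avg_prev_loan` are illustrative assumptions):

```python
import pandas as pd

# Toy stand-ins for the two per-applicant averages (values are made up)
avg_bureau_toy = pd.Series({100002: 100.0, 100003: 250.0}, name='AMT_CREDIT_SUM')
avg_prev_toy = pd.Series({100002: 180.0, 100006: 290.0}, name='AMT_CREDIT')

# Align both Series on the applicant ID; an applicant missing
# from one source gets NaN in that column
avg_loans = pd.concat(
    [avg_bureau_toy.rename('avg_bureau_loan'), avg_prev_toy.rename('avg_prev_loan')],
    axis=1,
)
avg_loans.index.name = 'SK_ID_CURR'
print(avg_loans)
```

The NaN entries make it visible which applicants have history with only one of the two sources.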

## 2. Calculate how many applicants from the application_train table were previously rejected

In [9]:
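# A left merge from prev_app keeps every previous application,
# including those from applicants who are not in application_train
appl_prev_app = pd.merge(prev_app, appl, on='SK_ID_CURR', how='left')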

In [10]:
prev_rejected = appl_prev_app[appl_prev_app['NAME_CONTRACT_STATUS']=='Refused']

In [11]:
# Note: this counts refused previous-application rows, not distinct applicants
prev_rejected.shape[0]

290678
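
The figure above is a row count of refused previous applications, so an applicant refused several times is counted several times, and the left merge does not restrict the rows to applicants in `application_train`. A minimal sketch of counting distinct applicants instead, on made-up toy frames (`appl_toy` and `prev_app_toy` are hypothetical stand-ins for `appl` and `prev_app`):

```python
import pandas as pd

# Toy stand-ins for appl and prev_app (values are made up)
appl_toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
prev_app_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2, 9],
    'NAME_CONTRACT_STATUS': ['Refused', 'Refused', 'Approved', 'Refused'],
})

# Inner merge drops previous applications from applicants outside appl_toy (ID 9 here)
merged = pd.merge(appl_toy, prev_app_toy, on='SK_ID_CURR', how='inner')
refused = merged[merged['NAME_CONTRACT_STATUS'] == 'Refused']

n_refused_rows = refused.shape[0]                        # refused applications: 2
n_refused_applicants = refused['SK_ID_CURR'].nunique()   # distinct applicants: 1
print(n_refused_rows, n_refused_applicants)
```

Applied to the real tables, the same pattern would be `pd.merge(appl, prev_app, on='SK_ID_CURR')` followed by the `Refused` filter and `nunique()`; the resulting number is not shown here because it has not been run against the data.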

## 3. Calculate the average credit card balance for the applicants from the application_train table

Merge `appl` with the credit card balance table on `SK_ID_CURR`, then average `AMT_BALANCE` for each applicant.

In [12]:
# Explicit file name instead of a position in csv_files
credit_cards = pd.read_csv('credit_card_balance.csv')
credit_appl = pd.merge(appl, credit_cards, on='SK_ID_CURR')

In [13]:
avg_credit_bal = credit_appl.groupby('SK_ID_CURR')['AMT_BALANCE'].mean()
avg_credit_bal

SK_ID_CURR
100006         0.000000
100011     54482.111149
100021         0.000000
100023         0.000000
100036         0.000000
              ...      
456242    148232.328750
456244    131834.730732
456246     13136.731875
456247     23216.396211
456248         0.000000
Name: AMT_BALANCE, Length: 86905, dtype: float64
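
Because the merge is an inner join, applicants without any credit card records drop out of the result. A minimal sketch of keeping every applicant and of reducing the per-applicant averages to one figure, on made-up toy frames (`appl_toy` and `credit_toy` are hypothetical stand-ins; whether missing balances should stay NaN or be treated as zero is an assumption left open here):

```python
import pandas as pd

# Toy stand-ins for appl and credit_cards (values are made up)
appl_toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
credit_toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 3],
    'AMT_BALANCE': [0.0, 200.0, 50.0],
})

# Per-applicant mean balance over all monthly records
avg_bal = credit_toy.groupby('SK_ID_CURR')['AMT_BALANCE'].mean()

# Reindex to all applicants; those without credit card records get NaN
avg_bal_all = avg_bal.reindex(appl_toy['SK_ID_CURR'])

# Mean of the per-applicant averages; Series.mean() skips NaN by default
overall_avg = avg_bal_all.mean()
print(avg_bal_all)
print(overall_avg)
```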