# Table of Contents
* [Import Data from SQLite Database](#Import-Data-from-SQLite-Database)
* [Aggregate columns at SK_ID_CURR level](#Aggregate-columns-at-SK_ID_CURR-level)
    * [Part 1 -](#Part-1--)
    * [Part 2 -](#Part-2--)
    * [Part 3 -](#Part-3--)
    * [Part 4 -](#Part-4--)
    * [Part 5 -](#Part-5--)
    * [Merge and save all 5 data parts](#Merge-and-save-all-5-data-parts)

## Import Data from SQLite Database

In [1]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
%matplotlib inline  
import statistics
from scipy import stats
from scipy.stats import t
from scipy.stats import norm
import seaborn as sns

In [2]:
import sqlite3
from sqlite3 import Error
import csv
# open the connection to read in the datasets, remember to close the connection at the end of the code
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

In [3]:
# read in previous application data from sqlite database
sql_stmt = '''SELECT A.* FROM prev_sql as A '''
previous_app = pd.read_sql(sql_stmt, coerce_float=True, con=con)
# replace field that's entirely space (or empty) with NaN
previous_app.replace(r'^\s*$', np.nan, regex=True, inplace=True)
# close connection
con.close()

## Aggregate columns at SK_ID_CURR level

### Due to the large number of columns to be aggregated, divide into 5 groups of variables:

* Part 1 - Get total annuity, total amount of requests, total credit, down payment etc. information from all previous applications. Get maximum down payment and interest rate etc. from all of the previous applications.
<p>
    1. AMT_ANNUITY: sum <br>
    2. AMT_APPLICATION: sum <br>
    3. AMT_CREDIT: sum <br>
    4. AMT_DOWN_PAYMENT: sum <br>
    5. AMT_GOODS_PRICE: sum <br>
    6. RATE_DOWN_PAYMENT: max <br>
    7. RATE_INTEREST_PRIMARY: max <br>
    8. RATE_INTEREST_PRIVILEGED: max <br>
    9. SELLERPLACE_AREA: sum <br>
    10. CNT_PAYMENT: min, max </p>
<p>
* Part 2 - For the following categorical variables, first do one hot encoding at SK_ID_PREV level, then aggregate at SK_ID_CURR level by sum to get the total number of instances over all previous application IDs:
<p>
    1. NAME_CONTRACT_TYPE: 1 hot encoding,<br>
    2. NAME_CONTRACT_STATUS: 1 hot encoding,<br>
    3. NAME_PAYMENT_TYPE: 1 hot encoding,<br>
    4. CODE_REJECT_REASON: 1 hot encoding,<br>
    5. CHANNEL_TYPE: 1 hot encoding,<br>
    6. NAME_TYPE_SUITE: 1 hot encoding,<br>
    7. NAME_CLIENT_TYPE: 1 hot encoding,<br>
    8. NAME_GOODS_CATEGORY: 1 hot encoding,<br>
    9. NAME_PORTFOLIO: 1 hot encoding,<br>
    10. NAME_PRODUCT_TYPE: 1 hot encoding,<br>
    11. NAME_SELLER_INDUSTRY: 1 hot encoding,<br>
    12. NAME_YIELD_GROUP: 1 hot encoding,<br>
    13. PRODUCT_COMBINATION: 1 hot encoding<br>
<p>
* Part 3 - Get the min and max of days decisions (relative to current application date) over all previous applications for the same current application ID. Similarly, get min and max for every variable in the following list:
<p>
    1. DAYS_DECISION: min max,<br>
    2. DAYS_FIRST_DRAWING: min max,<br>
    3. DAYS_FIRST_DUE: min max,<br>
    4. DAYS_LAST_DUE_1ST_VERSION: min max,<br>
    5. DAYS_LAST_DUE: min max,<br>
    6. DAYS_TERMINATION: min max<br>
<p>
* Part 4 - For the following columns with high cardinality, first group some categories, then do 1 hot encoding at SK_ID_PREV level, then aggregate at SK_ID_CURR level by sum to get the total counts that previous application appeared in that category, within the same current application ID. As an example, variable "NAME_CASH_LOAN_PURPOSE" has over 20 categories, group the categories whose frequency are <= 20000. 
<p>
    1. WEEKDAY_APPR_PROCESS_START,<br>
    2. HOUR_APPR_PROCESS_START,<br>
    3. FLAG_LAST_APPL_PER_CONTRACT,<br>
    4. NFLAG_INSURED_ON_APPROVAL,<br>
    5. NAME_CASH_LOAN_PURPOSE<br>
<p>
* Part 5 - For the following variable, sum at SK_ID_CURR level to get the total number of last applications per day from the client.
<p>
    1. NFLAG_LAST_APPL_IN_DAY<br>

## Part 1 - 

In [4]:
prev_p1 = previous_app[['SK_ID_PREV','SK_ID_CURR','AMT_ANNUITY','AMT_APPLICATION',
                       'AMT_CREDIT','AMT_DOWN_PAYMENT','AMT_GOODS_PRICE',
                       'RATE_DOWN_PAYMENT','RATE_INTEREST_PRIMARY','RATE_INTEREST_PRIVILEGED',
                       'SELLERPLACE_AREA','CNT_PAYMENT']]

In [5]:
# open connection to sqlite database 
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

# write to sqlite database and save it, if don't have index=False, there will be an extra index column
prev_p1.to_sql(name='prev_p1', index=False, con=con)

# group by curr and prev IDs:
sql_sm = '''SELECT SK_ID_CURR,
                COUNT(SK_ID_PREV) AS NUM_PREV_APPS,
                
                SUM(AMT_ANNUITY) AS SUM_AMT_ANNUITY_PREV,
                SUM(AMT_APPLICATION) AS SUM_AMT_APPLICATION_PREV,
                SUM(AMT_CREDIT) AS SUM_AMT_CREDIT_PREV,
                SUM(AMT_DOWN_PAYMENT) AS SUM_AMT_DOWN_PAYMENT_PREV,
                SUM(AMT_GOODS_PRICE) AS SUM_AMT_GOODS_PRICE_PREV,
                MAX(RATE_DOWN_PAYMENT) AS MAX_RATE_DOWN_PAYMENT_PREV,
                MAX(RATE_INTEREST_PRIMARY) AS MAX_RATE_INTEREST_PRIMARY_PREV,
                MAX(RATE_INTEREST_PRIVILEGED) AS MAX_RATE_INTEREST_PRIVILEGED_PREV,
                SUM(SELLERPLACE_AREA) AS SUM_SELLERPLACE_AREA_PREV,
                MIN(CNT_PAYMENT) AS MIN_CNT_PAYMENT_PREV,
                MAX(CNT_PAYMENT) AS MAX_CNT_PAYMENT_PREV

            FROM prev_p1
            GROUP BY SK_ID_CURR'''
prev_p1_gp = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(prev_p1_gp.head())
print(prev_p1_gp.shape)

Unnamed: 0,SK_ID_CURR,NUM_PREV_APPS,SUM_AMT_ANNUITY_PREV,SUM_AMT_APPLICATION_PREV,SUM_AMT_CREDIT_PREV,SUM_AMT_DOWN_PAYMENT_PREV,SUM_AMT_GOODS_PRICE_PREV,MAX_RATE_DOWN_PAYMENT_PREV,MAX_RATE_INTEREST_PRIMARY_PREV,MAX_RATE_INTEREST_PRIVILEGED_PREV,SUM_SELLERPLACE_AREA_PREV,MIN_CNT_PAYMENT_PREV,MAX_CNT_PAYMENT_PREV
0,100001,1,3951.0,24835.5,23787.0,2520.0,24835.5,0.104326,,,23,8.0,8.0
1,100002,1,9251.775,179055.0,179055.0,0.0,179055.0,0.0,,,500,24.0,24.0
2,100003,3,169661.97,1306309.5,1452573.0,6885.0,1306309.5,0.100061,,,1599,6.0,12.0
3,100004,1,5357.25,24282.0,20106.0,4860.0,24282.0,0.212008,,,30,4.0,4.0
4,100005,2,4813.2,44617.5,40153.5,4464.0,44617.5,0.108964,,,36,12.0,12.0


(338857, 13)


## Part 2 - 

In [6]:
prev_p2 = previous_app[['SK_ID_PREV','SK_ID_CURR','NAME_CONTRACT_TYPE','NAME_CONTRACT_STATUS','NAME_PAYMENT_TYPE',
                        'CODE_REJECT_REASON','CHANNEL_TYPE','NAME_TYPE_SUITE','NAME_CLIENT_TYPE',
                        'NAME_GOODS_CATEGORY','NAME_PORTFOLIO','NAME_PRODUCT_TYPE','NAME_SELLER_INDUSTRY',
                        'NAME_YIELD_GROUP','PRODUCT_COMBINATION']]

In [7]:
tmp = prev_p2.copy()

# Define nan as 'UKN' to avoid one hot encoding errors
tmp['NAME_TYPE_SUITE'][tmp['NAME_TYPE_SUITE'].isna()] = 'UKN'
tmp['PRODUCT_COMBINATION'][tmp['PRODUCT_COMBINATION'].isna()] = 'UKN'

# one hot encoding
for co in tmp.columns[2:]:
    unique_value = tmp[co].unique()
    for val in unique_value:
        tmp.loc[tmp[co]==val,co+val] = 1
        tmp.loc[tmp[co]!=val,co+val] = 0
        
# drop the original columns that are used in the one hot encoding, as they are redundant
prev_p2_agg = tmp.drop(tmp.columns[[range(2,15)]], axis=1)
display(prev_p2_agg.head())

# the number of columns increased after one hot encoding
print(prev_p2_agg.shape)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
  result = getitem(key)


Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPEConsumer loans,NAME_CONTRACT_TYPECash loans,NAME_CONTRACT_TYPERevolving loans,NAME_CONTRACT_TYPEXNA,NAME_CONTRACT_STATUSApproved,NAME_CONTRACT_STATUSRefused,NAME_CONTRACT_STATUSCanceled,NAME_CONTRACT_STATUSUnused offer,...,PRODUCT_COMBINATIONPOS other with interest,PRODUCT_COMBINATIONCard X-Sell,PRODUCT_COMBINATIONPOS mobile without interest,PRODUCT_COMBINATIONCard Street,PRODUCT_COMBINATIONPOS industry with interest,PRODUCT_COMBINATIONCash Street: low,PRODUCT_COMBINATIONPOS industry without interest,PRODUCT_COMBINATIONCash Street: middle,PRODUCT_COMBINATIONPOS others without interest,PRODUCT_COMBINATIONUKN
0,2030495,271877,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2802425,108129,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2523466,122040,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2819243,176158,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1784265,202054,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


(1670214, 113)


In [8]:
# group by curr IDs
prev_p2_gp = prev_p2_agg.groupby(['SK_ID_CURR']).sum()

# drop the prev id column
prev_p2_gp.drop(['SK_ID_PREV'], axis=1, inplace=True)

# get number of prev id counts for the same curr id
num_prev_ids = prev_p2_agg.groupby(['SK_ID_CURR']).count().SK_ID_PREV

# add the num_prev_ids column to prev_p2_gp data
prev_p2_gp['num_prev_id'] = num_prev_ids

display(prev_p2_gp.head())
print(prev_p2_gp.shape)

Unnamed: 0_level_0,NAME_CONTRACT_TYPEConsumer loans,NAME_CONTRACT_TYPECash loans,NAME_CONTRACT_TYPERevolving loans,NAME_CONTRACT_TYPEXNA,NAME_CONTRACT_STATUSApproved,NAME_CONTRACT_STATUSRefused,NAME_CONTRACT_STATUSCanceled,NAME_CONTRACT_STATUSUnused offer,NAME_PAYMENT_TYPECash through the bank,NAME_PAYMENT_TYPEXNA,...,PRODUCT_COMBINATIONCard X-Sell,PRODUCT_COMBINATIONPOS mobile without interest,PRODUCT_COMBINATIONCard Street,PRODUCT_COMBINATIONPOS industry with interest,PRODUCT_COMBINATIONCash Street: low,PRODUCT_COMBINATIONPOS industry without interest,PRODUCT_COMBINATIONCash Street: middle,PRODUCT_COMBINATIONPOS others without interest,PRODUCT_COMBINATIONUKN,num_prev_id
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100001,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
100002,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
100003,2.0,1.0,0.0,0.0,3.0,0.0,0.0,0.0,2.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,3
100004,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
100005,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2


(338857, 112)


## Part 3 - 

In [9]:
prev_p3 = previous_app[['SK_ID_CURR','DAYS_DECISION','DAYS_FIRST_DRAWING','DAYS_FIRST_DUE',
                        'DAYS_LAST_DUE_1ST_VERSION','DAYS_LAST_DUE','DAYS_TERMINATION']]

cur.execute('DROP TABLE IF EXISTS prev_p3')
prev_p3.to_sql(name='prev_p3', index=False, con=con)

In [10]:
# get the min and max
sql_sm = '''SELECT A.SK_ID_CURR,
            MIN(DAYS_DECISION) AS MIN_DAYS_DECISION,
            MAX(DAYS_DECISION) AS MAX_DAYS_DECISION,
            MIN(DAYS_FIRST_DRAWING) AS MIN_DAYS_FIRST_DRAWING,
            MAX(DAYS_FIRST_DRAWING) AS MAX_DAYS_FIRST_DRAWING,
            MIN(DAYS_FIRST_DUE) AS MIN_DAYS_FIRST_DUE,
            MAX(DAYS_FIRST_DUE) AS MAX_DAYS_FIRST_DUE,
            MIN(DAYS_LAST_DUE_1ST_VERSION) AS MIN_DAYS_LAST_DUE_1ST_VERSION,
            MAX(DAYS_LAST_DUE_1ST_VERSION) AS MAX_DAYS_LAST_DUE_1ST_VERSION,
            MIN(DAYS_LAST_DUE) AS MIN_DAYS_LAST_DUE,
            MAX(DAYS_LAST_DUE) AS MAX_DAYS_LAST_DUE,
            MIN(DAYS_TERMINATION) AS MIN_DAYS_TERMINATION,
            MAX(DAYS_TERMINATION) AS MAX_DAYS_TERMINATION
            FROM prev_p3 AS A 
            GROUP BY SK_ID_CURR
            '''
prev_p3_gp = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(prev_p3_gp.head())
print(prev_p3_gp.shape)

Unnamed: 0,SK_ID_CURR,MIN_DAYS_DECISION,MAX_DAYS_DECISION,MIN_DAYS_FIRST_DRAWING,MAX_DAYS_FIRST_DRAWING,MIN_DAYS_FIRST_DUE,MAX_DAYS_FIRST_DUE,MIN_DAYS_LAST_DUE_1ST_VERSION,MAX_DAYS_LAST_DUE_1ST_VERSION,MIN_DAYS_LAST_DUE,MAX_DAYS_LAST_DUE,MIN_DAYS_TERMINATION,MAX_DAYS_TERMINATION
0,100001,-1740,-1740,365243.0,365243.0,-1709.0,-1709.0,-1499.0,-1499.0,-1619.0,-1619.0,-1612.0,-1612.0
1,100002,-606,-606,365243.0,365243.0,-565.0,-565.0,125.0,125.0,-25.0,-25.0,-17.0,-17.0
2,100003,-2341,-746,365243.0,365243.0,-2310.0,-716.0,-1980.0,-386.0,-1980.0,-536.0,-1976.0,-527.0
3,100004,-815,-815,365243.0,365243.0,-784.0,-784.0,-694.0,-694.0,-724.0,-724.0,-714.0,-714.0
4,100005,-757,-315,365243.0,365243.0,-706.0,-706.0,-376.0,-376.0,-466.0,-466.0,-460.0,-460.0


(338857, 13)


## Part 4 - 

In [11]:
prev_p4 = previous_app[['SK_ID_CURR','WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START',
                        'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_INSURED_ON_APPROVAL',
                        'NAME_CASH_LOAN_PURPOSE']]

In [13]:
tmp = prev_p4.copy()

# Define nan as -1 to avoid one hot encoding errors
tmp['NFLAG_INSURED_ON_APPROVAL'][tmp['NFLAG_INSURED_ON_APPROVAL'].isna()] = -1
# redefine the NFLAG_INSURED_ON_APPROVAL into 3 str categories:
tmp['NFLAG_INSURED_APPROVAL_3'] = 'Unknown'
# assign the values based on NFLAG_INSURED_ON_APPROVAL values: 0: 'Non_Request', 1: 'Requested', -1: 'Unknown'
tmp['NFLAG_INSURED_APPROVAL_3'][tmp['NFLAG_INSURED_ON_APPROVAL'].isin(tmp['NFLAG_INSURED_ON_APPROVAL'][tmp['NFLAG_INSURED_ON_APPROVAL'] == 0].index)] = 'Non_Request'
tmp['NFLAG_INSURED_APPROVAL_3'][tmp['NFLAG_INSURED_ON_APPROVAL'].isin(tmp['NFLAG_INSURED_ON_APPROVAL'][tmp['NFLAG_INSURED_ON_APPROVAL'] == 1].index)] = 'Requested'
print(tmp['NFLAG_INSURED_APPROVAL_3'].value_counts())

# drop the numeric variable NFLAG_INSURED_ON_APPROVAL, otherwise one hot encoding will error out
tmp.drop(['NFLAG_INSURED_ON_APPROVAL'],axis=1,inplace=True)


# define all the frequency <= 20000 categories as 'Others'
tmp.loc[tmp['NAME_CASH_LOAN_PURPOSE'].isin
        (tmp['NAME_CASH_LOAN_PURPOSE'].value_counts()[tmp['NAME_CASH_LOAN_PURPOSE'].value_counts() <= 20000].index),
        'NAME_CASH_LOAN_PURPOSE']='Others'


# assign a constant str to a new column to hold hour ranges
tmp['HOUR_PROCESS_START_RANGE'] = 'A'
# assign the values based on hour ranges: 0-7, 8-15, 16-23
tmp['HOUR_PROCESS_START_RANGE'][tmp['HOUR_APPR_PROCESS_START'].isin(tmp['HOUR_APPR_PROCESS_START'][tmp['HOUR_APPR_PROCESS_START'] <= 7].index)] = '0-7'
# assign 8-15 group
tmp['HOUR_PROCESS_START_RANGE'][tmp['HOUR_APPR_PROCESS_START'].isin(tmp['HOUR_APPR_PROCESS_START'][(tmp['HOUR_APPR_PROCESS_START'] >= 8) & (tmp['HOUR_APPR_PROCESS_START'] <= 15)].index)] = '8-15'
# assign 16-23 group
tmp['HOUR_PROCESS_START_RANGE'][tmp['HOUR_APPR_PROCESS_START'].isin(tmp['HOUR_APPR_PROCESS_START'][(tmp['HOUR_APPR_PROCESS_START'] >= 16) & (tmp['HOUR_APPR_PROCESS_START'] <= 23)].index)] = '16-23'
# check frequency in each range
print(tmp['HOUR_PROCESS_START_RANGE'].value_counts())

# drop the numeric variable HOUR_APPR_PROCESS_START, otherwise one hot encoding will error out
tmp.drop(['HOUR_APPR_PROCESS_START'],axis=1,inplace=True)


# create one hot coding for the categorical variables, and then drop them as they will be redundant
for co in tmp.columns[1:]:
    unique_value = tmp[co].unique()
    for val in unique_value:
        tmp.loc[tmp[co]==val,co+val] = 1
        tmp.loc[tmp[co]!=val,co+val] = 0
        
tmp.drop(['WEEKDAY_APPR_PROCESS_START',
          'FLAG_LAST_APPL_PER_CONTRACT',
          'NAME_CASH_LOAN_PURPOSE',
          'HOUR_PROCESS_START_RANGE',
          'NFLAG_INSURED_APPROVAL_3'], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unknown        673065
Non_Request    665527
Requested      331622
Name: NFLAG_INSURED_APPROVAL_3, dtype: int64


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


8-15     1241970
0-7       394155
16-23      34089
Name: HOUR_PROCESS_START_RANGE, dtype: int64


In [14]:
prev_p4 = tmp.copy()

# group by curr IDs
prev_p4_gp = prev_p4.groupby(['SK_ID_CURR']).sum()

display(prev_p4_gp.head())
print(prev_p4_gp.shape)

Unnamed: 0_level_0,WEEKDAY_APPR_PROCESS_STARTSATURDAY,WEEKDAY_APPR_PROCESS_STARTTHURSDAY,WEEKDAY_APPR_PROCESS_STARTTUESDAY,WEEKDAY_APPR_PROCESS_STARTMONDAY,WEEKDAY_APPR_PROCESS_STARTFRIDAY,WEEKDAY_APPR_PROCESS_STARTSUNDAY,WEEKDAY_APPR_PROCESS_STARTWEDNESDAY,FLAG_LAST_APPL_PER_CONTRACTY,FLAG_LAST_APPL_PER_CONTRACTN,NAME_CASH_LOAN_PURPOSEXAP,NAME_CASH_LOAN_PURPOSEXNA,NAME_CASH_LOAN_PURPOSERepairs,NAME_CASH_LOAN_PURPOSEOthers,NFLAG_INSURED_APPROVAL_3Non_Request,NFLAG_INSURED_APPROVAL_3Requested,NFLAG_INSURED_APPROVAL_3Unknown,HOUR_PROCESS_START_RANGE0-7,HOUR_PROCESS_START_RANGE8-15,HOUR_PROCESS_START_RANGE16-23
SK_ID_CURR,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
100001,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
100002,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
100003,1.0,0.0,0.0,0.0,1.0,1.0,0.0,3.0,0.0,2.0,1.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,0.0
100004,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
100005,0.0,1.0,0.0,0.0,1.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0


(338857, 19)


## Part 5 - 

In [15]:
prev_p5 = previous_app[['SK_ID_CURR','NFLAG_LAST_APPL_IN_DAY']]

# group by curr IDs
prev_p5_gp = prev_p5.groupby(['SK_ID_CURR']).sum()

display(prev_p5_gp.head())
print(prev_p5_gp.shape)

Unnamed: 0_level_0,NFLAG_LAST_APPL_IN_DAY
SK_ID_CURR,Unnamed: 1_level_1
100001,1
100002,1
100003,3
100004,1
100005,2


(338857, 1)


## Merge and save all 5 data parts

In [16]:
# need to make sure the index for each part are consistent, otherwise sql join will error out
prev_p2_gp = prev_p2_gp.reset_index()
prev_p4_gp = prev_p4_gp.reset_index()
prev_p5_gp = prev_p5_gp.reset_index()

# save all the aggregated to sqlite database and join together
cur.execute('DROP TABLE IF EXISTS prev_p1_gp')
cur.execute('DROP TABLE IF EXISTS prev_p2_gp')
cur.execute('DROP TABLE IF EXISTS prev_p3_gp')
cur.execute('DROP TABLE IF EXISTS prev_p4_gp')
cur.execute('DROP TABLE IF EXISTS prev_p5_gp')
prev_p1_gp.to_sql(name='prev_p1_gp', index=False, con=con)
prev_p2_gp.to_sql(name='prev_p2_gp', index=False, con=con)
prev_p3_gp.to_sql(name='prev_p3_gp', index=False, con=con)
prev_p4_gp.to_sql(name='prev_p4_gp', index=False, con=con)
prev_p5_gp.to_sql(name='prev_p5_gp', index=False, con=con)

sql_sm = '''SELECT A.*, B.*, C.*, D.*, E.NFLAG_LAST_APPL_IN_DAY
            FROM prev_p1_gp AS A 
            INNER JOIN
                prev_p2_gp AS B ON A.SK_ID_CURR = B.SK_ID_CURR
            INNER JOIN
                prev_p3_gp AS C ON A.SK_ID_CURR = C.SK_ID_CURR
            INNER JOIN
                prev_p4_gp AS D ON A.SK_ID_CURR = D.SK_ID_CURR
            INNER JOIN
                prev_p5_gp AS E ON A.SK_ID_CURR = E.SK_ID_CURR
            '''
prev_gp = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(prev_gp.head())
print(prev_gp.shape)

  dtype=dtype, method=method)


Unnamed: 0,SK_ID_CURR,NUM_PREV_APPS,SUM_AMT_ANNUITY_PREV,SUM_AMT_APPLICATION_PREV,SUM_AMT_CREDIT_PREV,SUM_AMT_DOWN_PAYMENT_PREV,SUM_AMT_GOODS_PRICE_PREV,MAX_RATE_DOWN_PAYMENT_PREV,MAX_RATE_INTEREST_PRIMARY_PREV,MAX_RATE_INTEREST_PRIVILEGED_PREV,...,NAME_CASH_LOAN_PURPOSEXNA,NAME_CASH_LOAN_PURPOSERepairs,NAME_CASH_LOAN_PURPOSEOthers,NFLAG_INSURED_APPROVAL_3Non_Request,NFLAG_INSURED_APPROVAL_3Requested,NFLAG_INSURED_APPROVAL_3Unknown,HOUR_PROCESS_START_RANGE0-7,HOUR_PROCESS_START_RANGE8-15,HOUR_PROCESS_START_RANGE16-23,NFLAG_LAST_APPL_IN_DAY
0,100001,1,3951.0,24835.5,23787.0,2520.0,24835.5,0.104326,,,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
1,100002,1,9251.775,179055.0,179055.0,0.0,179055.0,0.0,,,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
2,100003,3,169661.97,1306309.5,1452573.0,6885.0,1306309.5,0.100061,,,...,1.0,0.0,0.0,1.0,2.0,0.0,1.0,2.0,0.0,3
3,100004,1,5357.25,24282.0,20106.0,4860.0,24282.0,0.212008,,,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1
4,100005,2,4813.2,44617.5,40153.5,4464.0,44617.5,0.108964,,,...,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,2


(338857, 160)


In [17]:
# there are duplicate SK_ID_CURR columns due to the SQL join. Need to remove the duplicate columns
prev_gp['SK_ID_CURR_2'] = 'A'
prev_gp['SK_ID_CURR_2'] = prev_gp['SK_ID_CURR'] 
prev_gp.drop(['SK_ID_CURR'], axis = 1, inplace = True)
prev_gp.rename(columns = {'SK_ID_CURR_2':'SK_ID_CURR'}, inplace = True)
display(prev_gp.head())
print(prev_gp.shape)
# write the aggregated level data to sqlite database and save it, if don't have index=False, there will be an extra index column
prev_gp.to_sql(name='prev_gp', index=False, con=con)
con.commit()
con.close()

Unnamed: 0,NUM_PREV_APPS,SUM_AMT_ANNUITY_PREV,SUM_AMT_APPLICATION_PREV,SUM_AMT_CREDIT_PREV,SUM_AMT_DOWN_PAYMENT_PREV,SUM_AMT_GOODS_PRICE_PREV,MAX_RATE_DOWN_PAYMENT_PREV,MAX_RATE_INTEREST_PRIMARY_PREV,MAX_RATE_INTEREST_PRIVILEGED_PREV,SUM_SELLERPLACE_AREA_PREV,...,NAME_CASH_LOAN_PURPOSERepairs,NAME_CASH_LOAN_PURPOSEOthers,NFLAG_INSURED_APPROVAL_3Non_Request,NFLAG_INSURED_APPROVAL_3Requested,NFLAG_INSURED_APPROVAL_3Unknown,HOUR_PROCESS_START_RANGE0-7,HOUR_PROCESS_START_RANGE8-15,HOUR_PROCESS_START_RANGE16-23,NFLAG_LAST_APPL_IN_DAY,SK_ID_CURR
0,1,3951.0,24835.5,23787.0,2520.0,24835.5,0.104326,,,23,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1,100001
1,1,9251.775,179055.0,179055.0,0.0,179055.0,0.0,,,500,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1,100002
2,3,169661.97,1306309.5,1452573.0,6885.0,1306309.5,0.100061,,,1599,...,0.0,0.0,1.0,2.0,0.0,1.0,2.0,0.0,3,100003
3,1,5357.25,24282.0,20106.0,4860.0,24282.0,0.212008,,,30,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1,100004
4,2,4813.2,44617.5,40153.5,4464.0,44617.5,0.108964,,,36,...,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,2,100005


(338857, 157)


  dtype=dtype, method=method)
