# Table of Contents
* [Import Data from SQLite Database](#Import-Data-from-SQLite-Database)
* [Aggregate columns at SK_ID_PREV level](#Aggregate-columns-at-SK_ID_PREV-level)
* [Aggregate columns by SK_ID_CURR](#Aggregate-columns-by-SK_ID_CURR)

## Import Data from SQLite Database

In [1]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
%matplotlib inline  
import statistics
from scipy import stats
from scipy.stats import t
from scipy.stats import norm
import seaborn as sns

In [2]:
import sqlite3
from sqlite3 import Error
import csv
# open the connection to read in the datasets; remember to close the connection at the end of the code
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

In [3]:
# read in installment payments data (inst_pay_sql) from sqlite database
sql_stmt = '''SELECT A.* FROM inst_pay_sql as A '''
inst_pay = pd.read_sql(sql_stmt, coerce_float=True, con=con)
# replace fields that are entirely whitespace (or empty) with NaN
inst_pay.replace(r'^\s*$', np.nan, regex=True, inplace=True)
# close connection
con.close()
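
The cells in this notebook open and close the connection by hand. As a minimal sketch of an alternative, `contextlib.closing` closes the connection even if a query raises. The in-memory database, the `toy` table, and the name `con_demo` are assumptions made for illustration; they are not part of `pythonsqlite.db`.

```python
import sqlite3
from contextlib import closing

# Hypothetical in-memory database; the notebook itself connects to pythonsqlite.db
with closing(sqlite3.connect(":memory:")) as con_demo:
    con_demo.execute("CREATE TABLE toy (x INTEGER)")
    con_demo.execute("INSERT INTO toy VALUES (1)")
    total = con_demo.execute("SELECT SUM(x) FROM toy").fetchone()[0]

# The connection is closed once the with-block exits;
# any further query on con_demo raises sqlite3.ProgrammingError
print(total)
```

Note that using the connection object itself as a context manager (`with con:`) only manages the transaction and does not close the connection, which is why `closing` is used here.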

## Aggregate columns at SK_ID_PREV level

Use the following aggregation for each variable. The goal is to aggregate out the SK_ID_PREV field: first to one row per SK_ID_PREV, then to one row per SK_ID_CURR.
0. SK_ID_PREV: keep the total number of previous applications, COUNT(SK_ID_PREV) AS TOTAL_NUM_PREV_APPS. This count is computed in the SK_ID_CURR-level query.
1. NUM_INSTALMENT_VERSION: keep min and max. There may be better aggregation or transformation logic for this variable: the version number signifies changes in payment parameters, but no additional information is available to engineer this feature further.
2. NUM_INSTALMENT_NUMBER: keep min and max, as for NUM_INSTALMENT_VERSION.
3. DAYS_INSTALMENT & DAYS_ENTRY_PAYMENT: if DAYS_INSTALMENT >= DAYS_ENTRY_PAYMENT, the payment was made on or before the due date, which is good; otherwise it is a late payment. Aggregate by counting the number of late payments.
4. AMT_INSTALMENT: keep average
5. AMT_PAYMENT: keep average
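
The late-payment rule and the averages above can be tried on a few rows before running them against the full table. The sketch below uses an in-memory SQLite database with invented rows; the table name `toy_inst_pay` and the connection name `con_demo` are hypothetical, and only a subset of the columns is included.

```python
import sqlite3

# In-memory database with invented toy rows (not taken from inst_pay_sql)
con_demo = sqlite3.connect(":memory:")
con_demo.execute(
    "CREATE TABLE toy_inst_pay (SK_ID_CURR INTEGER, SK_ID_PREV INTEGER, "
    "DAYS_INSTALMENT REAL, DAYS_ENTRY_PAYMENT REAL, AMT_PAYMENT REAL)"
)
con_demo.executemany(
    "INSERT INTO toy_inst_pay VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, -30, -31, 100.0),  # paid one day before the due date
        (1, 10, -10, -8, 90.0),    # paid two days after the due date (late)
        (1, 11, -20, -20, 50.0),   # paid on the due date
        (2, 20, -5, None, None),   # no payment recorded
    ],
)

# Same CASE WHEN rule as the notebook query: late when due date < payment date
rows = con_demo.execute(
    """SELECT SK_ID_CURR, SK_ID_PREV,
              AVG(AMT_PAYMENT) AS AVG_AMT_PAYMENT,
              SUM(CASE WHEN DAYS_INSTALMENT < DAYS_ENTRY_PAYMENT THEN 1 ELSE 0 END)
                  AS NUM_LATE_PAYMENT
       FROM toy_inst_pay
       GROUP BY SK_ID_CURR, SK_ID_PREV
       ORDER BY SK_ID_CURR, SK_ID_PREV"""
).fetchall()
con_demo.close()

print(rows)
# → [(1, 10, 95.0, 1), (1, 11, 50.0, 0), (2, 20, None, 0)]
```

The last toy row shows one consequence of this rule: when DAYS_ENTRY_PAYMENT is NULL the comparison is NULL, so the row falls into the ELSE branch and is counted as not late. Whether that is the desired treatment of missing payments is a modelling decision the notebook does not address.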

In [4]:
# open connection to sqlite database 
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

In [5]:
# aggregate at SK_ID_PREV level
sql_sm = '''SELECT SK_ID_CURR, SK_ID_PREV,
                AVG(AMT_INSTALMENT) AS AVG_AMT_INSTALMENT,
                AVG(AMT_PAYMENT) AS AVG_AMT_PAYMENT,
                MIN(NUM_INSTALMENT_VERSION) AS MIN_NUM_INSTALMENT_VERSION,
                MAX(NUM_INSTALMENT_VERSION) AS MAX_NUM_INSTALMENT_VERSION,
                MIN(NUM_INSTALMENT_NUMBER) AS MIN_NUM_INSTALMENT_NUMBER,
                MAX(NUM_INSTALMENT_NUMBER) AS MAX_NUM_INSTALMENT_NUMBER,                
                SUM(CASE WHEN DAYS_INSTALMENT < DAYS_ENTRY_PAYMENT THEN 1 ELSE 0 END) AS NUM_LATE_PAYMENT
            FROM inst_pay_sql
            GROUP BY SK_ID_CURR, SK_ID_PREV       
            '''
inst_pay_cur_prev_gp = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(inst_pay_cur_prev_gp.head())
print(inst_pay_cur_prev_gp.shape)

# write inst_pay_cur_prev_gp to the sqlite database and save it;
# without index=False, an extra index column would be written
# note: to_sql raises ValueError if the table already exists (default if_exists='fail')
inst_pay_cur_prev_gp.to_sql(name='inst_pay_cur_prev_gp', index=False, con=con)

| | SK_ID_CURR | SK_ID_PREV | AVG_AMT_INSTALMENT | AVG_AMT_PAYMENT | MIN_NUM_INSTALMENT_VERSION | MAX_NUM_INSTALMENT_VERSION | MIN_NUM_INSTALMENT_NUMBER | MAX_NUM_INSTALMENT_NUMBER | NUM_LATE_PAYMENT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1369693 | 7312.725 | 7312.725 | 1.0 | 2.0 | 1 | 4 | 0 |
| 1 | 100001 | 1851984 | 3981.675 | 3981.675 | 1.0 | 1.0 | 2 | 4 | 1 |
| 2 | 100002 | 1038818 | 11559.247105 | 11559.247105 | 1.0 | 2.0 | 1 | 19 | 0 |
| 3 | 100003 | 1810518 | 164425.332857 | 164425.332857 | 1.0 | 2.0 | 1 | 7 | 0 |
| 4 | 100003 | 2396755 | 6731.115 | 6731.115 | 1.0 | 1.0 | 1 | 12 | 0 |


(997752, 9)


## Aggregate columns by SK_ID_CURR

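Note: The aggregation logic at the SK_ID_CURR level is the same as at the SK_ID_PREV level. AVG_AMT_INSTALMENT and AVG_AMT_PAYMENT are therefore averages of the per-SK_ID_PREV averages, so each previous application carries equal weight regardless of how many instalments it has.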
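
Because the second stage averages values that are already averages, it can differ from an average taken over all instalment rows. The sketch below illustrates the difference on invented data. The table `toy_prev_gp` and the row-count column `N_ROWS` are hypothetical; the notebook's `inst_pay_cur_prev_gp` table does not store a row count, so a weighted average would need that column added in the first stage.

```python
import sqlite3

con_demo = sqlite3.connect(":memory:")
# Toy stand-in for the SK_ID_PREV-level table; N_ROWS is an assumed extra column
con_demo.execute(
    "CREATE TABLE toy_prev_gp (SK_ID_CURR INTEGER, SK_ID_PREV INTEGER, "
    "N_ROWS INTEGER, AVG_AMT_PAYMENT REAL, NUM_LATE_PAYMENT INTEGER)"
)
con_demo.executemany(
    "INSERT INTO toy_prev_gp VALUES (?, ?, ?, ?, ?)",
    [
        (1, 10, 3, 100.0, 1),  # three instalments averaging 100
        (1, 11, 1, 200.0, 0),  # one instalment of 200
    ],
)

row = con_demo.execute(
    """SELECT SK_ID_CURR,
              COUNT(SK_ID_PREV) AS TOTAL_NUM_PREV_APPS,
              AVG(AVG_AMT_PAYMENT) AS AVG_OF_AVGS,
              SUM(N_ROWS * AVG_AMT_PAYMENT) / SUM(N_ROWS) AS ROW_WEIGHTED_AVG,
              SUM(NUM_LATE_PAYMENT) AS NUM_LATE_PAYMENT
       FROM toy_prev_gp
       GROUP BY SK_ID_CURR"""
).fetchone()
con_demo.close()

print(row)
# → (1, 2, 150.0, 125.0, 1)
```

Here the average of averages is 150.0 while the row-weighted average is 125.0. Counts, minima, maxima, and sums are unaffected by the two-stage approach; only the averages depend on the weighting.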

In [6]:
# aggregate at SK_ID_CURR level; adds the total number of previous applications
sql_sm = '''SELECT SK_ID_CURR, 
                COUNT(SK_ID_PREV) AS TOTAL_NUM_PREV_APPS,
                AVG(AVG_AMT_INSTALMENT) AS AVG_AMT_INSTALMENT,
                AVG(AVG_AMT_PAYMENT) AS AVG_AMT_PAYMENT,
                MIN(MIN_NUM_INSTALMENT_VERSION) AS MIN_NUM_INSTALMENT_VERSION,
                MAX(MAX_NUM_INSTALMENT_VERSION) AS MAX_NUM_INSTALMENT_VERSION,
                MIN(MIN_NUM_INSTALMENT_NUMBER) AS MIN_NUM_INSTALMENT_NUMBER,
                MAX(MAX_NUM_INSTALMENT_NUMBER) AS MAX_NUM_INSTALMENT_NUMBER,                
                SUM(NUM_LATE_PAYMENT) AS NUM_LATE_PAYMENT
            FROM inst_pay_cur_prev_gp
            GROUP BY SK_ID_CURR  
            '''
inst_pay_cur_gp = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(inst_pay_cur_gp.head())
print(inst_pay_cur_gp.shape)

# write inst_pay_cur_gp to the sqlite database and save it;
# without index=False, an extra index column would be written
# note: to_sql raises ValueError if the table already exists (default if_exists='fail')
inst_pay_cur_gp.to_sql(name='inst_pay_cur_gp', index=False, con=con)

con.commit()
con.close()

| | SK_ID_CURR | TOTAL_NUM_PREV_APPS | AVG_AMT_INSTALMENT | AVG_AMT_PAYMENT | MIN_NUM_INSTALMENT_VERSION | MAX_NUM_INSTALMENT_VERSION | MIN_NUM_INSTALMENT_NUMBER | MAX_NUM_INSTALMENT_NUMBER | NUM_LATE_PAYMENT |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 2 | 5647.2 | 5647.2 | 1.0 | 2.0 | 1 | 4 | 1 |
| 1 | 100002 | 1 | 11559.247105 | 11559.247105 | 1.0 | 2.0 | 1 | 19 | 0 |
| 2 | 100003 | 3 | 78558.479286 | 78558.479286 | 1.0 | 2.0 | 1 | 12 | 0 |
| 3 | 100004 | 1 | 7096.155 | 7096.155 | 1.0 | 2.0 | 1 | 3 | 0 |
| 4 | 100005 | 1 | 6240.205 | 6240.205 | 1.0 | 2.0 | 1 | 9 | 1 |


(339587, 9)
