# Table of Contents
* [Import Data from SQLite Database](#Import-Data-from-SQLite-Database)
* [Feature engineering](#Feature-engineering)
* [Aggregate columns by SK_ID_CURR](#Aggregate-columns-by-SK_ID_CURR)
* [Merge and save the lead application dataset with aggregated Bureau dataset](#Merge-and-save-the-lead-application-dataset-with-aggregated-Bureau-dataset)

## Import Data from SQLite Database

In [1]:
import pandas as pd
import numpy as np
from numpy.random import seed
import matplotlib.pyplot as plt
%matplotlib inline  
import statistics
from scipy import stats
from scipy.stats import t
from scipy.stats import norm
import seaborn as sns

In [2]:
import sqlite3
from sqlite3 import Error
import csv
# open the connection to read in the datasets, remember to close the connection at the end of the code
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()

In [3]:
# read in bureau data from sqlite database
sql_stmt = '''SELECT A.* FROM bureau_sql as A '''
bureau = pd.read_sql(sql_stmt, coerce_float=True, con=con)
# replace fields that are empty or contain only whitespace with NaN
bureau.replace(r'^\s*$', np.nan, regex=True, inplace=True)
# close connection
con.close()
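
As a minimal sketch of the same read-and-clean step, the snippet below uses `contextlib.closing` so the connection is closed even if the read fails (a `sqlite3` connection used directly as a context manager only commits or rolls back, it does not close). The in-memory database and the tiny `bureau_sql` table are hypothetical stand-ins for `pythonsqlite.db`, and `demo_con` and `demo_bureau` are illustrative names.

```python
import sqlite3
from contextlib import closing

import numpy as np
import pandas as pd

# Hypothetical stand-in for pythonsqlite.db: an in-memory database
# holding a three-row bureau_sql table with made-up values.
with closing(sqlite3.connect(":memory:")) as demo_con:
    pd.DataFrame(
        {
            "SK_ID_CURR": [1, 1, 2],
            "CREDIT_ACTIVE": ["Active", " ", ""],
        }
    ).to_sql("bureau_sql", demo_con, index=False)

    # Read while the connection is still open; closing() shuts it afterwards.
    demo_bureau = pd.read_sql("SELECT * FROM bureau_sql", con=demo_con)

# Empty and whitespace-only strings become NaN, as in the cell above.
demo_bureau = demo_bureau.replace(r"^\s*$", np.nan, regex=True)
print(demo_bureau)
```

Reading inside the `with` block and cleaning outside it keeps the connection open only as long as it is needed.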

## Feature engineering

1. Drop the CREDIT_CURRENCY column: 99% of records use currency 1, so the variable is not informative.
2. Consolidate every non-active status in CREDIT_ACTIVE ('Sold', 'Bad debt') into 'Closed'.
3. CREDIT_TYPE has 15 categories, many with very few records. Consolidate every type outside the two largest categories into a new category called 'Loan', since they are all some type of loan (car loans, microloans, and so on). This leaves three final categories: Consumer credit, Credit card, Loan.

In [4]:
bureau_agg = bureau.copy()

bureau_agg.drop(columns='CREDIT_CURRENCY', inplace=True)

# assign through .loc with a boolean mask so the change is written to bureau_agg itself,
# not to a temporary copy (chained indexing triggers SettingWithCopyWarning)
bureau_agg.loc[bureau_agg['CREDIT_ACTIVE'].isin(['Sold', 'Bad debt']), 'CREDIT_ACTIVE'] = 'Closed'
print(bureau_agg['CREDIT_ACTIVE'].value_counts())

bureau_agg.loc[~bureau_agg['CREDIT_TYPE'].isin(['Consumer credit', 'Credit card']), 'CREDIT_TYPE'] = 'Loan'
print(bureau_agg['CREDIT_TYPE'].value_counts())

Closed    1085821
Active     630607
Name: CREDIT_ACTIVE, dtype: int64

Consumer credit    1251615
Credit card         402195
Loan                 62618
Name: CREDIT_TYPE, dtype: int64
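
The consolidation rules can be tried out on a small frame before running them on the full table. The sketch below assumes made-up rows (only the category labels come from this notebook) and uses a hypothetical `demo` frame. It checks category shares with `value_counts(normalize=True)` and collapses categories with boolean-mask assignment through `.loc`.

```python
import pandas as pd

# Toy rows; only the category labels are taken from the notebook.
demo = pd.DataFrame(
    {
        "CREDIT_ACTIVE": ["Active", "Closed", "Sold", "Bad debt", "Closed"],
        "CREDIT_TYPE": [
            "Consumer credit",
            "Credit card",
            "Car loan",
            "Microloan",
            "Consumer credit",
        ],
    }
)

# Share of each category: a quick way to judge whether a column is
# dominated by one value or has many sparse levels.
print(demo["CREDIT_TYPE"].value_counts(normalize=True))

# Boolean-mask assignment through .loc writes to the frame itself.
demo.loc[demo["CREDIT_ACTIVE"].isin(["Sold", "Bad debt"]), "CREDIT_ACTIVE"] = "Closed"

# Everything outside the two largest categories is relabelled 'Loan'.
keep = ["Consumer credit", "Credit card"]
demo.loc[~demo["CREDIT_TYPE"].isin(keep), "CREDIT_TYPE"] = "Loan"

print(demo["CREDIT_ACTIVE"].value_counts())
print(demo["CREDIT_TYPE"].value_counts())
```

Note that `~isin(keep)` is also true for missing values, so any NaN in CREDIT_TYPE would be relabelled 'Loan' as well.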


## Aggregate columns by SK_ID_CURR

The variables below represent current credit amounts, debts, credit card limits, or the maximum or total amounts overdue reported by the Bureau, so summing them at the SK_ID_CURR level is reasonable. <br>
AMT_CREDIT_SUM: sum <br>
AMT_CREDIT_SUM_DEBT: sum <br>
AMT_CREDIT_SUM_LIMIT: sum <br>
AMT_CREDIT_SUM_OVERDUE: sum <br>
AMT_ANNUITY: sum <br>
AMT_CREDIT_MAX_OVERDUE: sum <br>

The variables below are measured relative to the time of application: days since the Bureau credit ended, days remaining on the Bureau credit, days before the current application that the client applied for the Bureau credit, days overdue, and so on. Aggregating them with max, min, or both is reasonable. <br>
DAYS_ENDDATE_FACT: max <br>
DAYS_CREDIT_ENDDATE: max <br>
DAYS_CREDIT: min and max (2 columns) <br>
DAYS_CREDIT_UPDATE: min and max (2 columns) <br>
CREDIT_DAY_OVERDUE: max (99% are 0, no missing values) <br>
CNT_CREDIT_PROLONG: max (99.5% are 0, no missing values) <br>
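
For readers who prefer to stay in pandas, the same rules could be sketched with `groupby` and named aggregation. This is an illustrative alternative to the notebook's SQL query, not a replacement for it. The `demo` frame, its values, and the helper column `IS_CLOSED` are assumptions; the output column names mirror the SQL aliases.

```python
import pandas as pd

# Toy rows with a subset of the bureau columns; values are made up.
demo = pd.DataFrame(
    {
        "SK_ID_CURR": [1, 1, 2],
        "SK_ID_BUREAU": [10, 11, 12],
        "AMT_CREDIT_SUM": [100.0, 50.0, 200.0],
        "DAYS_CREDIT": [-300, -20, -45],
        "CREDIT_ACTIVE": ["Active", "Closed", "Closed"],
    }
)

# A 0/1 indicator lets a plain sum act like
# SUM(CASE WHEN CREDIT_ACTIVE = 'Closed' THEN 1 ELSE 0 END).
demo["IS_CLOSED"] = (demo["CREDIT_ACTIVE"] == "Closed").astype(int)

demo_agg = (
    demo.groupby("SK_ID_CURR")
    .agg(
        BUREAU_ID_CT=("SK_ID_BUREAU", "count"),
        SUM_AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "sum"),
        MIN_DAYS_CREDIT=("DAYS_CREDIT", "min"),
        MAX_DAYS_CREDIT=("DAYS_CREDIT", "max"),
        CLOSED_CT=("IS_CLOSED", "sum"),
    )
    .reset_index()
)
print(demo_agg)
```

One difference to keep in mind: SQL `SUM` returns NULL when every value in a group is NULL, whereas pandas `sum` returns 0 for an all-NaN group unless `min_count=1` is passed.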

In [5]:
# open connection to sqlite database 
con = sqlite3.connect(r"pythonsqlite.db")
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS bureau_agg')

# write bureau_agg to the sqlite database; index=False avoids writing an extra index column
bureau_agg.to_sql(name='bureau_agg', index=False, con=con)

# aggregate at SK_ID_CURR level
sql_sm = '''SELECT SK_ID_CURR, 
                COUNT(SK_ID_BUREAU) AS BUREAU_ID_CT,
                SUM(AMT_CREDIT_SUM) AS SUM_AMT_CREDIT_SUM, 
                SUM(AMT_CREDIT_SUM_DEBT) AS SUM_AMT_CREDIT_SUM_DEBT,
                SUM(AMT_CREDIT_SUM_LIMIT) AS SUM_AMT_CREDIT_SUM_LIMIT,
                SUM(AMT_CREDIT_SUM_OVERDUE) AS SUM_AMT_CREDIT_SUM_OVERDUE,
                SUM(AMT_ANNUITY) AS SUM_AMT_ANNUITY,
                SUM(AMT_CREDIT_MAX_OVERDUE) AS SUM_AMT_CREDIT_MAX_OVERDUE,
                MAX(DAYS_ENDDATE_FACT) AS MAX_DAYS_ENDDATE_FACT,
                MAX(DAYS_CREDIT_ENDDATE) AS MAX_DAYS_CREDIT_ENDDATE,
                MAX(DAYS_CREDIT) AS MAX_DAYS_CREDIT,
                MAX(DAYS_CREDIT_UPDATE) AS MAX_DAYS_CREDIT_UPDATE,
                MAX(CREDIT_DAY_OVERDUE) AS MAX_CREDIT_DAY_OVERDUE,
                MAX(CNT_CREDIT_PROLONG) AS MAX_CNT_CREDIT_PROLONG,
                MIN(DAYS_CREDIT) AS MIN_DAYS_CREDIT,
                MIN(DAYS_CREDIT_UPDATE) AS MIN_DAYS_CREDIT_UPDATE,
                SUM(CASE WHEN CREDIT_ACTIVE = 'Closed' THEN 1 ELSE 0 END) AS CLOSED_CT,
                SUM(CASE WHEN CREDIT_ACTIVE = 'Active' THEN 1 ELSE 0 END) AS Active_CT,
                SUM(CASE WHEN CREDIT_TYPE = 'Consumer credit' THEN 1 ELSE 0 END) AS Consumer_Credit_type_CT,
                SUM(CASE WHEN CREDIT_TYPE = 'Credit card' THEN 1 ELSE 0 END) AS Credit_card_type_CT,
                SUM(CASE WHEN CREDIT_TYPE = 'Loan' THEN 1 ELSE 0 END) AS Loan_type_CT
            FROM bureau_agg 
            GROUP BY SK_ID_CURR'''
bureau_post_agg = pd.read_sql(sql_sm, coerce_float=True, con=con)

# check the columns are aggregated as expected
display(bureau_post_agg.head())
print(bureau_post_agg.shape)

# write bureau_post_agg to sqlite database and save it
cur.execute('DROP TABLE IF EXISTS bureau_post_agg')
bureau_post_agg.to_sql(name='bureau_post_agg', index=False, con=con)

| | SK_ID_CURR | BUREAU_ID_CT | SUM_AMT_CREDIT_SUM | SUM_AMT_CREDIT_SUM_DEBT | SUM_AMT_CREDIT_SUM_LIMIT | SUM_AMT_CREDIT_SUM_OVERDUE | SUM_AMT_ANNUITY | SUM_AMT_CREDIT_MAX_OVERDUE | MAX_DAYS_ENDDATE_FACT | MAX_DAYS_CREDIT_ENDDATE | ... | MAX_DAYS_CREDIT_UPDATE | MAX_CREDIT_DAY_OVERDUE | MAX_CNT_CREDIT_PROLONG | MIN_DAYS_CREDIT | MIN_DAYS_CREDIT_UPDATE | CLOSED_CT | Active_CT | Consumer_Credit_type_CT | Credit_card_type_CT | Loan_type_CT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 7 | 1453365.0 | 596686.5 | 0.0 | 0.0 | 24817.5 | NaN | -544.0 | 1778.0 | ... | -6 | 0 | 0 | -1572 | -155 | 4 | 3 | 7 | 0 | 0 |
| 1 | 100002 | 8 | 865055.565 | 245781.0 | 31988.565 | 0.0 | 0.0 | 8405.145 | -36.0 | 780.0 | ... | -7 | 0 | 0 | -1437 | -1185 | 6 | 2 | 4 | 4 | 0 |
| 2 | 100003 | 4 | 1017400.5 | 0.0 | 810000.0 | 0.0 | NaN | 0.0 | -540.0 | 1216.0 | ... | -43 | 0 | 0 | -2586 | -2131 | 3 | 1 | 2 | 2 | 0 |
| 3 | 100004 | 2 | 189037.8 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | -382.0 | -382.0 | ... | -382 | 0 | 0 | -1326 | -682 | 2 | 0 | 2 | 0 | 0 |
| 4 | 100005 | 3 | 657126.0 | 568408.5 | 0.0 | 0.0 | 4261.5 | 0.0 | -123.0 | 1324.0 | ... | -11 | 0 | 0 | -373 | -121 | 1 | 2 | 2 | 1 | 0 |


(305811, 21)


## Merge and save the lead application dataset with aggregated Bureau dataset

In [6]:
# left-join the lead application table app_sql with the aggregated bureau_post_agg table and save the result in the sqlite database
sql_sm = '''SELECT A.*, 
                B.BUREAU_ID_CT,
                B.SUM_AMT_CREDIT_SUM, 
                B.SUM_AMT_CREDIT_SUM_DEBT,
                B.SUM_AMT_CREDIT_SUM_LIMIT,
                B.SUM_AMT_CREDIT_SUM_OVERDUE,
                B.SUM_AMT_ANNUITY,
                B.SUM_AMT_CREDIT_MAX_OVERDUE,
                B.MAX_DAYS_ENDDATE_FACT,
                B.MAX_DAYS_CREDIT_ENDDATE,
                B.MAX_DAYS_CREDIT,
                B.MAX_DAYS_CREDIT_UPDATE,
                B.MAX_CREDIT_DAY_OVERDUE,
                B.MAX_CNT_CREDIT_PROLONG,
                B.MIN_DAYS_CREDIT,
                B.MIN_DAYS_CREDIT_UPDATE,
                B.CLOSED_CT,
                B.Active_CT,
                B.Consumer_Credit_type_CT,
                B.Credit_card_type_CT,
                B.Loan_type_CT
            FROM app_sql AS A 
            LEFT JOIN bureau_post_agg AS B ON A.SK_ID_CURR = B.SK_ID_CURR'''
app_bureau = pd.read_sql(sql_sm, coerce_float=True, con=con)

display(app_bureau.head())
print(app_bureau.shape)

# replace fields that are empty or contain only whitespace with NaN
app_bureau.replace(r'^\s*$', np.nan, regex=True, inplace=True)

# write app_bureau to sqlite database and save it
cur.execute('DROP TABLE IF EXISTS app_bureau')
app_bureau.to_sql(name='app_bureau', index=False, con=con)

# save the files permanently to the sqlite database
con.commit()
# close connection
con.close()

| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | MAX_DAYS_CREDIT_UPDATE | MAX_CREDIT_DAY_OVERDUE | MAX_CNT_CREDIT_PROLONG | MIN_DAYS_CREDIT | MIN_DAYS_CREDIT_UPDATE | CLOSED_CT | Active_CT | Consumer_Credit_type_CT | Credit_card_type_CT | Loan_type_CT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | -7.0 | 0.0 | 0.0 | -1437.0 | -1185.0 | 6.0 | 2.0 | 4.0 | 4.0 | 0.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | -43.0 | 0.0 | 0.0 | -2586.0 | -2131.0 | 3.0 | 1.0 | 2.0 | 2.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | -382.0 | 0.0 | 0.0 | -1326.0 | -682.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | -783.0 | 0.0 | 0.0 | -1149.0 | -783.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


(307511, 142)
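
The left join can also be sketched in pandas, which makes two of its effects easy to inspect: applicants without a Bureau record get NaN in every Bureau column, and integer counts are upcast to float because of those NaN values (which is why counts such as `6.0` appear in the output above). The two frames below are toy data with hypothetical names (`app`, `bureau_summary`). `validate="many_to_one"` is an optional guard that raises if the right-hand table has duplicate keys.

```python
import pandas as pd

# Toy frames; the IDs and values are illustrative only.
app = pd.DataFrame({"SK_ID_CURR": [100002, 100006], "TARGET": [1, 0]})
bureau_summary = pd.DataFrame({"SK_ID_CURR": [100002], "BUREAU_ID_CT": [8]})

merged = app.merge(
    bureau_summary,
    on="SK_ID_CURR",
    how="left",              # keep every application row
    validate="many_to_one",  # fail fast if bureau_summary repeats a key
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

print(merged)
print(merged["_merge"].value_counts())
```

Counting `left_only` rows gives the number of applicants with no Bureau history, which is a useful sanity check before modelling.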
