<h2> This notebook processes the data in the supplemental <TT>previous_application</TT> file. </h2>

In [1]:
import v11_common as com

import feather
import numpy as np
import pandas as pd

In [2]:
# Pending changes

In [3]:
PREV_APP_FILE = com.DATA_FILE_FOLDER + "previous_application.feather"

In [4]:
prev_app = pd.read_feather(PREV_APP_FILE)

<h4>A few columns contain the placeholder string "XNA", and several date-related columns contain the value 365243; both are treated as NA values.</h4>

In [5]:
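xna_cols = ["NAME_PAYMENT_TYPE","NAME_CLIENT_TYPE", "NAME_CONTRACT_TYPE"]
for col in xna_cols:
    # Cast to object rather than "str": astype("str") would turn existing NaN values
    # into the literal string "nan", which would then no longer count as missing.
    prev_app[col] = prev_app[col].astype(object).replace("XNA", np.nan)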

In [6]:
cols_with_365243 = ["DAYS_FIRST_DRAWING","DAYS_FIRST_DUE","DAYS_LAST_DUE_1ST_VERSION",
                    "DAYS_LAST_DUE","DAYS_TERMINATION"]
prev_app[cols_with_365243] = prev_app[cols_with_365243].replace(365243, np.nan)
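
The two sentinel replacements above can be checked on a toy frame. This is only a sketch: the rows below are invented for illustration and are not taken from `previous_application`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for previous_application; the values are made up.
toy = pd.DataFrame({
    "NAME_CLIENT_TYPE": ["New", "XNA", "Repeater"],
    "DAYS_TERMINATION": [-120.0, 365243.0, -45.0],
})

# String sentinel -> NaN in the categorical-like column.
toy["NAME_CLIENT_TYPE"] = toy["NAME_CLIENT_TYPE"].replace("XNA", np.nan)
# Numeric sentinel -> NaN in the day-offset column.
toy["DAYS_TERMINATION"] = toy["DAYS_TERMINATION"].replace(365243, np.nan)

# Number of missing values per column after the replacement.
na_counts = toy.isna().sum()
print(na_counts)
```

Counting missing values before and after the replacement is a quick way to confirm that the sentinels were actually present in the columns listed.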

In [7]:
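# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is not guaranteed to modify prev_app in recent pandas versions.
prev_app["AMT_CREDIT"] = prev_app["AMT_CREDIT"].fillna(prev_app["AMT_APPLICATION"])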

<h4>Comparing the amount applied for with the credit actually received could be useful, so we'll engineer a difference, a ratio, and a categorical label from those two columns.</h4>

In [8]:
app_credit_diff = prev_app["AMT_APPLICATION"] - prev_app["AMT_CREDIT"]
prev_app["APP_CREDIT_DIFF"] = app_credit_diff
app_credit_ratio = prev_app["AMT_APPLICATION"] / prev_app["AMT_CREDIT"]
prev_app["APP_CREDIT_RATIO"] = app_credit_ratio.replace([np.inf,-np.inf],np.nan)

conditions = [app_credit_diff < 0, app_credit_diff == 0, app_credit_diff > 0]
choices = ["MORE_CREDIT_THAN_ASKED","EQUAL_TO_CREDIT_ASKED","LESS_CREDIT_THAN_ASKED"]
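# An explicit string default is passed because recent NumPy versions can raise a TypeError
# when string choices are combined with the implicit integer default of 0.
# Rows where the difference is missing receive this label.
prev_app["RECEIVED_VS_APPLIED_CREDIT"] = np.select(conditions, choices, default="UNKNOWN")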
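
As a quick illustration of how the difference, ratio, and label behave, here is a sketch on invented amounts. The `"UNKNOWN"` default label is an assumption made for this example, covering rows where the difference is missing.

```python
import numpy as np
import pandas as pd

# Invented amounts; the last two rows have zero credit to exercise the edge cases.
toy = pd.DataFrame({
    "AMT_APPLICATION": [1000.0, 500.0, 800.0, 0.0, 300.0],
    "AMT_CREDIT":      [1200.0, 500.0, 600.0, 0.0, 0.0],
})

diff = toy["AMT_APPLICATION"] - toy["AMT_CREDIT"]
# 0/0 gives NaN and x/0 gives inf; both end up as NaN after the replace.
ratio = (toy["AMT_APPLICATION"] / toy["AMT_CREDIT"]).replace([np.inf, -np.inf], np.nan)

conditions = [diff < 0, diff == 0, diff > 0]
choices = ["MORE_CREDIT_THAN_ASKED", "EQUAL_TO_CREDIT_ASKED", "LESS_CREDIT_THAN_ASKED"]
# "UNKNOWN" is a hypothetical label for rows matching none of the conditions.
labels = np.select(conditions, choices, default="UNKNOWN")

print(pd.DataFrame({"diff": diff, "ratio": ratio, "label": labels}))
```

Note that a ratio can be missing while the label is still defined, since the label depends only on the difference.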

In [9]:
prev_app_gr = prev_app.groupby("SK_ID_CURR")

In [10]:
# Aggregations & new columns
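# String aggregation names behave like the builtins here; newer pandas versions warn
# when builtins such as min or np.mean are passed to agg.
# Note: if DAYS_DECISION is a negative day offset (as in the Home Credit data), "min" selects
# the earliest decision, even though the resulting column is named LAST_DECISION_DATE below.
prev_app_sub = prev_app_gr.agg({"AMT_ANNUITY":["min","max","mean","sum"],
                                "AMT_CREDIT":["max","min","mean","sum"],
                                "AMT_APPLICATION":["min","max","mean","sum",com.log_average],
                                "APP_CREDIT_DIFF":["max","min"],
                                "APP_CREDIT_RATIO":["min","mean","max",com.geom_mean],
                                "DAYS_DECISION":"min",
                                "NFLAG_LAST_APPL_IN_DAY":"sum",
                                "AMT_DOWN_PAYMENT":["min","max","mean"],
                                "DAYS_TERMINATION":["min","max"]})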
prev_app_sub.columns = ["MIN_AMT_ANNUITY","MAX_AMT_ANNUITY","AVG_AMT_ANNUITY","TOTAL_AMT_ANNUITY",
                        "MAX_CREDIT_RECEIVED","MIN_CREDIT_RECEIVED","AVG_CREDIT_RECEIVED","TOTAL_CREDIT_RECEIVED",
                        "MIN_AMT_APPLICATION","MAX_AMT_APPLICATION","AVG_AMT_APPLICATION","TOTAL_AMT_APPLICATION","LOG_AVG_AMT_APPLICATION",
                        "MAX_APP_CREDIT_DIFF", "MIN_APP_CREDIT_DIFF",
                        "MIN_APP_CREDIT_RATIO","AMEAN_APP_CREDIT_RATIO", "MAX_APP_CREDIT_RATIO","GMEAN_APP_CREDIT_RATIO",
                        "LAST_DECISION_DATE",
                        "NUM_APPS_IN_ON_LAST_DAY",
                        "MIN_AMT_DOWN_PAYMENT","MAX_AMT_DOWN_PAYMENT","AVG_AMT_DOWN_PAYMENT",
                        "OLDEST_TERMINATION_DATE", "NEWEST_TERMINATION_DATE"]

prev_app_sub["NUMBER_APPLICATIONS"] = prev_app_gr.size()
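
The helper `com.geom_mean` is defined in `v11_common` and is not shown in this notebook. Purely as an illustration of what a per-group geometric mean might look like, here is a sketch; the function name `geom_mean_sketch` and its handling of missing and non-positive values are assumptions, not the actual helper.

```python
import numpy as np
import pandas as pd

def geom_mean_sketch(values):
    """Hypothetical stand-in: geometric mean of the finite, strictly positive entries."""
    arr = np.asarray(values, dtype=float)
    arr = arr[np.isfinite(arr) & (arr > 0)]   # drop NaN, inf, zero, and negatives
    if arr.size == 0:
        return np.nan                         # nothing usable in this group
    return float(np.exp(np.log(arr).mean()))  # exp of the mean log

# Invented ratios for two applicants.
toy = pd.DataFrame({
    "SK_ID_CURR":       [1, 1, 1, 2, 2],
    "APP_CREDIT_RATIO": [1.0, 4.0, np.nan, 0.0, 9.0],
})
gmeans = toy.groupby("SK_ID_CURR")["APP_CREDIT_RATIO"].agg(geom_mean_sketch)
print(gmeans)
```

Filtering out zeros before taking logarithms avoids divide-by-zero warnings from `np.log`; whether the real helper does this is not known from the notebook.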


In [11]:
prev_app_contract = com.count_frac_cols(prev_app_gr,
                                        col = "NAME_CONTRACT_STATUS",
                                        middle_string = "CONTRACT_STATUS")

In [12]:
# Note: despite the variable name, this aggregates NAME_PAYMENT_TYPE, not NAME_CLIENT_TYPE.
prev_app_client_type = com.count_frac_cols(prev_app_gr,
                                           col = "NAME_PAYMENT_TYPE",
                                           middle_string = "PREV_PAYMENT_TYPE")

In [13]:
prev_app_amt_received = com.count_frac_cols(prev_app_gr,
                                            col = "RECEIVED_VS_APPLIED_CREDIT")

In [14]:
prev_app_contract_types = com.count_frac_cols(prev_app_gr,
                                              col = "NAME_CONTRACT_TYPE",
                                              middle_string = "CONTRACT_TYPE")

In [15]:
prev_app_data = pd.concat([prev_app_sub, prev_app_contract, prev_app_amt_received,
                           prev_app_contract_types, prev_app_client_type],
                          axis = 1)
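
The implementation of `com.count_frac_cols` lives in `v11_common` and is not shown here. As a rough sketch of the idea its name suggests (per-applicant counts and fractions for each category), something like the following could be used. The function `count_frac_sketch`, its signature, and the `FRAC_` prefix are assumptions; the `NUM_` prefix is inferred from the column name `NUM_CONTRACT_STATUS_Refused` used later in the notebook.

```python
import pandas as pd

def count_frac_sketch(df, key, col, middle_string):
    """Hypothetical stand-in: one count column and one fraction column per category."""
    counts = pd.crosstab(df[key], df[col])          # rows: key, columns: categories
    fracs = counts.div(counts.sum(axis=1), axis=0)  # row-wise share of each category
    counts.columns = ["NUM_" + middle_string + "_" + str(c) for c in counts.columns]
    fracs.columns = ["FRAC_" + middle_string + "_" + str(c) for c in fracs.columns]
    return pd.concat([counts, fracs], axis=1)

# Invented applications for two applicants.
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 1, 2],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Refused", "Approved"],
})
out = count_frac_sketch(toy, "SK_ID_CURR", "NAME_CONTRACT_STATUS", "CONTRACT_STATUS")
print(out)
```

Because the result is indexed by `SK_ID_CURR`, frames built this way line up with the other per-applicant aggregates when concatenated along `axis = 1`.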

<h4> Logistic regression predictions.  Several columns are stripped out due to multicollinearity issues (the threshold is a correlation greater than 0.75 between two variables).  A few other columns are transformed so that they correlate more strongly with the TARGET variable, which seems to help the models along; since these transformations leave the rank order of the points unchanged, the gradient boosting shouldn't be particularly affected. </h4>

<h4> LAST AUC VALUE: 0.6194 </h4>

High Correlation Pairs:
    
NOT CHECKED YET
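
One way the pending check could be carried out is sketched below. The helper name `high_correlation_pairs` is hypothetical, and the use of the absolute Pearson correlation is an assumption (the text only says "greater than 0.75"); the data is synthetic.

```python
import numpy as np
import pandas as pd

def high_correlation_pairs(df, threshold=0.75):
    """Hypothetical helper: list column pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, so each pair appears once
            value = corr.iloc[i, j]
            if value > threshold:           # NaN correlations compare False and are skipped
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Synthetic data: B is almost a multiple of A, C is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "A": a,
    "B": 2 * a + rng.normal(scale=0.1, size=200),
    "C": rng.normal(size=200),
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

For each reported pair, one of the two columns could then be added to `high_cor_columns`.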

In [16]:
# Load objects needed for logistic regression
target_df = pd.read_feather("target.feather")

prev_app_poly = {"NUM_CONTRACT_STATUS_Refused":0.5, "LAST_DECISION_DATE":2}

high_cor_columns = []
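
The rank-order argument behind these power transforms can be illustrated with a small sketch. `power_transform_sketch` is a hypothetical stand-in; the actual behaviour of `com.add_polynomial_terms` (for example, whether it replaces a column or adds a new one) is not shown in this notebook.

```python
import pandas as pd

def power_transform_sketch(df, powers):
    """Hypothetical stand-in: raise each listed column to the given power."""
    out = df.copy()
    for col, p in powers.items():
        out[col] = out[col] ** p
    return out

col = "NUM_CONTRACT_STATUS_Refused"
toy = pd.DataFrame({col: [0, 4, 1, 9]})
transformed = power_transform_sketch(toy, {col: 0.5})

# On non-negative values a positive power is increasing, so ranks are unchanged.
same_order = bool((toy[col].rank() == transformed[col].rank()).all())

# On negative values an even power is decreasing, so the ranks are reversed.
negative = pd.Series([-10.0, -3.0, -1.0])
ranks_after_square = list((negative ** 2).rank())

print(same_order, ranks_after_square)
```

A reversed order is still a monotone relabelling of the same points, so threshold-based splits can separate the same groups of rows either way.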

In [17]:
# Make logistic regression predictions
test_aucs = []
for _ in range(8):
    pred, auc = com.log_regress_other_files(com.add_polynomial_terms(prev_app_data.reset_index().copy(), prev_app_poly),
                                            target_df,
                                            high_cor_columns)
    test_aucs.append(auc)
    print(auc)
print("Avg AUC: " + str(np.mean(test_aucs)))

0.6184704044726084
0.61597934003202
0.6193075136664563
0.6208053429426919
0.6262680960960947
0.6190420311177428
0.6214467451844795
0.6167316480972425
Avg AUC: 0.6197563902011669


In [18]:
# pred holds the predictions from the final loop iteration only (earlier runs are overwritten).
prev_app_data["PREV_APP_LR_PREDS"] = pred

In [19]:
prev_app_data.reset_index().to_feather("previous_application_sub.feather")