This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

LightGBM on selected features

Kamil A. Kaczmarek edited this page Jul 11, 2018 · 27 revisions

Blossom 🌼

🌼 code

In this solution we focus on feature engineering. We made use of several files in the dataset: previous-application, application, pos_cash_balance, installments_payments, bureau.

Preprocessing

Application data

  • CODE_GENDER: replace XNA with np.nan
  • DAYS_EMPLOYED: replace 365243 with np.nan
  • NAME_FAMILY_STATUS: replace Unknown with np.nan
  • ORGANIZATION_TYPE: replace XNA with np.nan
  • No missing value imputation
  • Encode the following columns as categorical:
CATEGORICAL_COLUMNS = ['CODE_GENDER', 'EMERGENCYSTATE_MODE', 'FLAG_CONT_MOBILE',
                       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
                       'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_18',
                       'FLAG_EMAIL', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_PHONE', 'FLAG_WORK_PHONE',
                       'FONDKAPREMONT_MODE', 'HOUR_APPR_PROCESS_START', 'HOUSETYPE_MODE',
                       'LIVE_CITY_NOT_WORK_CITY', 'LIVE_REGION_NOT_WORK_REGION',
                       'NAME_CONTRACT_TYPE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
                       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
                       'OCCUPATION_TYPE', 'ORGANIZATION_TYPE',
                       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'REG_REGION_NOT_LIVE_REGION',
                       'REG_REGION_NOT_WORK_REGION',
                       'WALLSMATERIAL_MODE', 'WEEKDAY_APPR_PROCESS_START']
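
The cleaning steps above can be sketched with pandas as follows. This is a minimal sketch, not the project's actual code: the function name `clean_application` is hypothetical, the categorical list is abbreviated, and casting to the pandas `category` dtype is only one possible way to encode the columns.

```python
import numpy as np
import pandas as pd

# Abbreviated for the example; the full list is given above.
CATEGORICAL_COLUMNS = ['CODE_GENDER', 'NAME_FAMILY_STATUS', 'ORGANIZATION_TYPE']


def clean_application(df):
    """Replace placeholder values with NaN and mark categorical columns.

    Hypothetical helper written for illustration only.
    """
    df = df.copy()
    df['CODE_GENDER'] = df['CODE_GENDER'].replace('XNA', np.nan)
    df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].replace(365243, np.nan)
    df['NAME_FAMILY_STATUS'] = df['NAME_FAMILY_STATUS'].replace('Unknown', np.nan)
    df['ORGANIZATION_TYPE'] = df['ORGANIZATION_TYPE'].replace('XNA', np.nan)
    # No imputation: missing values stay as NaN.
    for column in CATEGORICAL_COLUMNS:
        if column in df.columns:
            df[column] = df[column].astype('category')
    return df


# Toy rows, for illustration only.
application = pd.DataFrame({
    'CODE_GENDER': ['M', 'XNA', 'F'],
    'DAYS_EMPLOYED': [-1200, 365243, -300],
    'NAME_FAMILY_STATUS': ['Married', 'Unknown', 'Single / not married'],
    'ORGANIZATION_TYPE': ['XNA', 'School', 'Bank'],
})
cleaned = clean_application(application)
```

The function works on a copy, so the raw frame keeps its placeholder values and the cleaning can be re-run safely.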

Feature Extraction

Application data -> eda-application.ipynb 📝

  • Raw numerical columns
  • Raw categorical columns
  • Engineered features
X['annuity_income_percentage'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
X['car_to_birth_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_BIRTH']
X['car_to_employ_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_EMPLOYED']
X['children_ratio'] = X['CNT_CHILDREN'] / X['CNT_FAM_MEMBERS']
X['credit_to_annuity_ratio'] = X['AMT_CREDIT'] / X['AMT_ANNUITY']
X['credit_to_goods_ratio'] = X['AMT_CREDIT'] / X['AMT_GOODS_PRICE']
X['credit_to_income_ratio'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
X['days_employed_percentage'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
X['ext_sources_mean'] = X[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
X['income_credit_percentage'] = X['AMT_INCOME_TOTAL'] / X['AMT_CREDIT']
X['income_per_child'] = X['AMT_INCOME_TOTAL'] / (1 + X['CNT_CHILDREN'])
X['income_per_person'] = X['AMT_INCOME_TOTAL'] / X['CNT_FAM_MEMBERS']
X['payment_rate'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
X['phone_to_birth_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_BIRTH']
X['phone_to_employ_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_EMPLOYED']
  • ext_sources_mean has the strongest correlation with the target. Check below:
ext_sources_mean             0.222052
credit_to_goods_ratio        0.069427
car_to_birth_ratio           0.048824
days_employed_percentage     0.042206
phone_to_birth_ratio         0.033991
credit_to_annuity_ratio      0.032102
car_to_employ_ratio          0.030553
children_ratio               0.021223
annuity_income_percentage    0.014265
payment_rate                 0.012704
income_per_child             0.012529
credit_to_income_ratio       0.007727
income_per_person            0.006571
phone_to_employ_ratio        0.004562
income_credit_percentage     0.001817
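
A ranking like the one above can be produced along the following lines. This is a sketch under assumptions: the page does not say which correlation is used, so we assume the absolute Pearson correlation with a `TARGET` column (all listed values are positive, which suggests absolute values). The helper name `rank_by_target_correlation` and the toy data are hypothetical.

```python
import pandas as pd


def rank_by_target_correlation(df, feature_columns, target_column='TARGET'):
    """Return absolute Pearson correlations with the target, largest first."""
    correlations = df[feature_columns].corrwith(df[target_column])
    return correlations.abs().sort_values(ascending=False)


# Toy data, for illustration only.
toy = pd.DataFrame({
    'TARGET': [0, 0, 1, 1, 0, 1],
    'ext_sources_mean': [0.8, 0.7, 0.2, 0.3, 0.6, 0.1],
    'payment_rate': [0.05, 0.04, 0.05, 0.06, 0.05, 0.04],
})
ranking = rank_by_target_correlation(toy, ['ext_sources_mean', 'payment_rate'])
print(ranking)
```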
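  • Aggregation features: numerical columns aggregated within groups of categorical columns, for example:
AGGREGATION_RECIPIES = [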
    (['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_ANNUITY', 'max'),
                                              ('AMT_CREDIT', 'max'),
                                              ('EXT_SOURCE_1', 'mean'),
                                              ('EXT_SOURCE_2', 'mean'),
                                              ('OWN_CAR_AGE', 'max'),
                                              ('OWN_CAR_AGE', 'sum')]),
]
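
One way to turn such a recipe into features is sketched below. The column naming (`<groupby columns>_<aggregation>_<column>`) follows the feature names listed on this page. The helper `add_groupby_features` is hypothetical, and using `transform` instead of an aggregate-then-merge step is our choice for brevity, not necessarily what the project does.

```python
import pandas as pd

# Shortened recipe, same structure as the one above.
AGGREGATION_RECIPIES = [
    (['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_CREDIT', 'max'),
                                              ('EXT_SOURCE_1', 'mean')]),
]


def add_groupby_features(df, recipes):
    """Attach one group-level statistic per (column, aggregation) pair."""
    df = df.copy()
    for groupby_cols, specs in recipes:
        grouped = df.groupby(groupby_cols)
        for select, agg in specs:
            name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, select)
            # transform broadcasts the group statistic back to every row.
            df[name] = grouped[select].transform(agg)
    return df


# Toy rows, for illustration only.
toy = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'M', 'M'],
    'NAME_EDUCATION_TYPE': ['Higher education', 'Higher education',
                            'Higher education', 'Secondary'],
    'AMT_CREDIT': [100.0, 300.0, 200.0, 50.0],
    'EXT_SOURCE_1': [0.2, 0.6, 0.5, 0.1],
})
features = add_groupby_features(toy, AGGREGATION_RECIPIES)
```

Each applicant gets the statistic of the group they belong to, so the new columns have the same number of rows as the input.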

Again, features constructed from EXT_SOURCE_X are the most important. Check correlations below:

NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1                0.089964
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_2    0.089235
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1    0.086676
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_1                                       0.083520
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_2                                       0.082742
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_ELEVATORS_AVG               0.078057
OCCUPATION_TYPE_mean_EXT_SOURCE_1                                                           0.076587
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_REQ_CREDIT_BUREAU_YEAR                         0.074528
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_YEARS_BUILD_AVG                                    0.073816
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_NONLIVINGAREA_AVG                                  0.073730
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_OWN_CAR_AGE                                        0.073535
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_APARTMENTS_AVG                                     0.072854
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_BASEMENTAREA_AVG                                   0.072231
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_1                                           0.071557
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_CREDIT                                         0.071023
OCCUPATION_TYPE_mean_EXT_SOURCE_2                                                           0.070659
CODE_GENDER_ORGANIZATION_TYPE_mean_EXT_SOURCE_1                                             0.070028
CODE_GENDER_NAME_EDUCATION_TYPE_max_OWN_CAR_AGE                                             0.067620
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_CNT_CHILDREN                                        0.066059
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_CREDIT                                              0.065317
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_ANNUITY                                             0.064173
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_2                                           0.063390
CODE_GENDER_NAME_EDUCATION_TYPE_sum_OWN_CAR_AGE                                             0.062637
CODE_GENDER_ORGANIZATION_TYPE_mean_DAYS_REGISTRATION                                        0.052398
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_ANNUITY                                              0.052341
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_INCOME_TOTAL                                         0.050268
OCCUPATION_TYPE_mean_DAYS_EMPLOYED                                                          0.050074
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_AMT_ANNUITY                                         0.048534
OCCUPATION_TYPE_mean_AMT_ANNUITY                                                            0.046566
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_DAYS_ID_PUBLISH                                     0.040932
OCCUPATION_TYPE_mean_DAYS_REGISTRATION                                                      0.035164
OCCUPATION_TYPE_mean_CNT_CHILDREN                                                           0.019836
OCCUPATION_TYPE_mean_EXT_SOURCE_3                                                           0.007225
OCCUPATION_TYPE_mean_CNT_FAM_MEMBERS                                                        0.005959
OCCUPATION_TYPE_mean_DAYS_BIRTH                                                             0.003795
OCCUPATION_TYPE_mean_DAYS_ID_PUBLISH                                                        0.002663
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3                                       0.001598

Bureau data

  • Hand crafted features (with correlation with target):
bureau_credit_active_binary              0.105735
bureau_debt_credit_ratio                 0.096372
bureau_credit_enddate_percentage         0.053573
bureau_total_customer_overdue            0.052995
bureau_total_customer_credit             0.041768
bureau_total_customer_debt               0.019435
bureau_number_of_loan_types              0.018792
bureau_average_of_past_loans_per_type    0.014492
bureau_average_creditdays_prolonged      0.011719
bureau_overdue_debt_ratio                0.008374
bureau_number_of_past_loans              0.006160
  • statistical aggregations (mean, sum, min, max)
SK_ID_CURR_mean_DAYS_CREDIT               0.089729
SK_ID_CURR_min_DAYS_CREDIT                0.075248
SK_ID_CURR_mean_DAYS_CREDIT_UPDATE        0.068927
SK_ID_CURR_sum_DAYS_CREDIT_ENDDATE        0.053735
SK_ID_CURR_max_DAYS_CREDIT                0.049782
SK_ID_CURR_mean_DAYS_CREDIT_ENDDATE       0.046983
SK_ID_CURR_min_DAYS_CREDIT_UPDATE         0.042864
SK_ID_CURR_sum_DAYS_CREDIT                0.042000
SK_ID_CURR_sum_DAYS_CREDIT_UPDATE         0.041404
...
SK_ID_CURR_min_AMT_CREDIT_SUM_DEBT        0.000242
SK_ID_CURR_min_CNT_CREDIT_PROLONG         0.000182
SK_ID_CURR_min_AMT_CREDIT_SUM_OVERDUE     0.000003
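
The per-client statistical aggregations can be sketched as a group-by on SK_ID_CURR followed by a left join onto the application table. The helper name `aggregate_per_client` and the toy frames are hypothetical; the column naming follows the `SK_ID_CURR_<aggregation>_<column>` pattern seen in the listings.

```python
import pandas as pd


def aggregate_per_client(table, columns, aggregations=('mean', 'sum', 'min', 'max')):
    """Aggregate rows of a child table to one row per SK_ID_CURR."""
    aggregated = table.groupby('SK_ID_CURR')[columns].agg(list(aggregations))
    # Flatten the (column, aggregation) MultiIndex into flat feature names.
    aggregated.columns = ['SK_ID_CURR_{}_{}'.format(agg, column)
                          for column, agg in aggregated.columns]
    return aggregated.reset_index()


# Toy frames, for illustration only.
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-100, -300, -50],
})
application = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})

bureau_features = aggregate_per_client(bureau, ['DAYS_CREDIT'])
# A left join keeps clients without any bureau record; their features are NaN.
application = application.merge(bureau_features, on='SK_ID_CURR', how='left')
```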

Credit Card data -> eda-credit_card.ipynb 📝

  • Hand crafted features (with correlation with target):
credit_card_drawings_atm                   0.038106
credit_card_installments_per_loan          0.031622
credit_card_total_instalments              0.031304
credit_card_drawings_total                 0.023680
credit_card_number_of_loans                0.004388
credit_card_average_of_days_past_due       0.003195
credit_card_avg_loading_of_credit_limit    0.002944
credit_card_cash_card_ratio                0.002414
  • statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_mean_CNT_DRAWINGS_ATM_CURRENT      0.107692
SK_ID_CURR_max_CNT_DRAWINGS_CURRENT           0.101389
SK_ID_CURR_mean_AMT_BALANCE                   0.087177
SK_ID_CURR_mean_CNT_DRAWINGS_CURRENT          0.082520
SK_ID_CURR_max_AMT_BALANCE                    0.068798
SK_ID_CURR_min_AMT_BALANCE                    0.064163
SK_ID_CURR_max_CNT_DRAWINGS_ATM_CURRENT       0.063729
SK_ID_CURR_var_CNT_DRAWINGS_CURRENT           0.062892
SK_ID_CURR_mean_MONTHS_BALANCE                0.062081
SK_ID_CURR_min_MONTHS_BALANCE                 0.061359
SK_ID_CURR_var_CNT_DRAWINGS_ATM_CURRENT       0.061123
SK_ID_CURR_mean_AMT_DRAWINGS_ATM_CURRENT      0.059925
SK_ID_CURR_sum_MONTHS_BALANCE                 0.059051
SK_ID_CURR_var_MONTHS_BALANCE                 0.058817
SK_ID_CURR_mean_AMT_DRAWINGS_CURRENT          0.058732
SK_ID_CURR_max_AMT_DRAWINGS_CURRENT           0.052318
SK_ID_CURR_sum_CNT_DRAWINGS_CURRENT           0.050685
SK_ID_CURR_sum_CNT_DRAWINGS_ATM_CURRENT       0.049970
SK_ID_CURR_sum_AMT_CREDIT_LIMIT_ACTUAL        0.045460
SK_ID_CURR_sum_CNT_INSTALMENT_MATURE_CUM      0.042363
...
SK_ID_CURR_max_AMT_PAYMENT_CURRENT            0.000438
SK_ID_CURR_sum_CNT_DRAWINGS_OTHER_CURRENT     0.000227
SK_ID_CURR_max_CNT_DRAWINGS_OTHER_CURRENT     0.000008
SK_ID_CURR_min_SK_DPD                              NaN
SK_ID_CURR_min_SK_DPD_DEF                          NaN
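
A NaN correlation typically appears when a feature has zero variance (or is entirely missing), because the Pearson correlation is then undefined. Whether that is the cause for the two NaN rows above is our assumption; the page does not say. A minimal sketch with toy data shows the effect and one way to spot such columns:

```python
import numpy as np
import pandas as pd

# Toy data, for illustration only.
toy = pd.DataFrame({
    'TARGET': [0, 1, 0, 1],
    'SK_ID_CURR_min_SK_DPD': [0, 0, 0, 0],  # constant column
    'SK_ID_CURR_mean_AMT_BALANCE': [10.0, 90.0, 20.0, 70.0],
})

# Correlation with a constant column is undefined and comes out as NaN.
correlations = toy.drop(columns='TARGET').corrwith(toy['TARGET'])

# Columns with at most one distinct value carry no signal and can be dropped.
constant_columns = [column for column in toy.columns
                    if toy[column].nunique(dropna=True) <= 1]
```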

Installments Payments data -> eda-installments.ipynb 📝

  • statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_DAYS_ENTRY_PAYMENT         0.058794
SK_ID_CURR_min_DAYS_INSTALMENT            0.058648
SK_ID_CURR_var_DAYS_INSTALMENT            0.052273
SK_ID_CURR_var_DAYS_ENTRY_PAYMENT         0.052071
SK_ID_CURR_mean_DAYS_ENTRY_PAYMENT        0.043992
SK_ID_CURR_mean_DAYS_INSTALMENT           0.043509
SK_ID_CURR_sum_DAYS_ENTRY_PAYMENT         0.035227
SK_ID_CURR_sum_DAYS_INSTALMENT            0.035064
SK_ID_CURR_min_NUM_INSTALMENT_VERSION     0.032039
SK_ID_CURR_sum_NUM_INSTALMENT_VERSION     0.030063
SK_ID_CURR_mean_NUM_INSTALMENT_VERSION    0.027323
SK_ID_CURR_min_AMT_PAYMENT                0.025724
SK_ID_CURR_sum_AMT_PAYMENT                0.024375
SK_ID_CURR_mean_AMT_PAYMENT               0.023169
SK_ID_CURR_min_AMT_INSTALMENT             0.020257
SK_ID_CURR_sum_AMT_INSTALMENT             0.019811
SK_ID_CURR_max_NUM_INSTALMENT_VERSION     0.018611
SK_ID_CURR_mean_AMT_INSTALMENT            0.018409
SK_ID_CURR_sum_NUM_INSTALMENT_NUMBER      0.017441
SK_ID_CURR_var_NUM_INSTALMENT_VERSION     0.011427
SK_ID_CURR_mean_NUM_INSTALMENT_NUMBER     0.009537
SK_ID_CURR_max_NUM_INSTALMENT_NUMBER      0.006304
SK_ID_CURR_var_AMT_PAYMENT                0.003841
SK_ID_CURR_max_DAYS_INSTALMENT            0.003231
SK_ID_CURR_min_NUM_INSTALMENT_NUMBER      0.002334
SK_ID_CURR_max_AMT_INSTALMENT             0.002324
SK_ID_CURR_max_DAYS_ENTRY_PAYMENT         0.002298
SK_ID_CURR_var_AMT_INSTALMENT             0.002151
SK_ID_CURR_max_AMT_PAYMENT                0.001554
SK_ID_CURR_var_NUM_INSTALMENT_NUMBER      0.001040

POS cash balance data

  • statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_MONTHS_BALANCE     0.055307
SK_ID_CURR_var_MONTHS_BALANCE     0.048760
SK_ID_CURR_sum_MONTHS_BALANCE     0.040570
SK_ID_CURR_mean_MONTHS_BALANCE    0.034543
SK_ID_CURR_max_SK_DPD_DEF         0.009580
SK_ID_CURR_mean_SK_DPD_DEF        0.006496
SK_ID_CURR_min_SK_DPD             0.005444
SK_ID_CURR_mean_SK_DPD            0.005436
SK_ID_CURR_sum_SK_DPD_DEF         0.004950
SK_ID_CURR_max_SK_DPD             0.004763
SK_ID_CURR_sum_SK_DPD             0.004740
SK_ID_CURR_min_SK_DPD_DEF         0.004702
SK_ID_CURR_max_MONTHS_BALANCE     0.004321
SK_ID_CURR_var_SK_DPD_DEF         0.004076
SK_ID_CURR_var_SK_DPD             0.003361

Previous application data -> eda-previous_application.ipynb 📝

  • statistical aggregations (mean, sum, min, max, var)
SK_ID_CURR_min_DAYS_DECISION               0.053434
SK_ID_CURR_var_DAYS_DECISION               0.048513
SK_ID_CURR_mean_DAYS_DECISION              0.046864
SK_ID_CURR_var_CNT_PAYMENT                 0.041960
SK_ID_CURR_sum_RATE_DOWN_PAYMENT           0.041693
SK_ID_CURR_max_RATE_DOWN_PAYMENT           0.040096
SK_ID_CURR_mean_HOUR_APPR_PROCESS_START    0.035927
SK_ID_CURR_mean_AMT_ANNUITY                0.034871
SK_ID_CURR_mean_RATE_DOWN_PAYMENT          0.033601
SK_ID_CURR_min_AMT_ANNUITY                 0.032249
SK_ID_CURR_min_HOUR_APPR_PROCESS_START     0.031427
SK_ID_CURR_max_HOUR_APPR_PROCESS_START     0.030847
SK_ID_CURR_max_CNT_PAYMENT                 0.029439
...
SK_ID_CURR_sum_AMT_GOODS_PRICE             0.004662
SK_ID_CURR_sum_AMT_APPLICATION             0.004607
SK_ID_CURR_var_AMT_DOWN_PAYMENT            0.002022

Validation

  • Stratified 5-fold cross-validation (the number of folds is a parameter in the configuration file: neptune.yaml#L38)
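
Stratified splitting keeps the share of positive targets roughly equal in every fold, which matters for an imbalanced target. Below is a minimal sketch with scikit-learn's `StratifiedKFold`; the constant `N_FOLDS`, the shuffling, the seed, and the toy data are our assumptions, not the project's settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5  # hypothetical name; in the project the value comes from neptune.yaml

# Imbalanced toy data, for illustration only: 10 positives out of 100.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([1] * 10 + [0] * 90)

splitter = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=0)
fold_positive_rates = []
for train_idx, valid_idx in splitter.split(X, y):
    # Each validation fold should preserve the overall positive rate.
    fold_positive_rates.append(y[valid_idx].mean())
```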

Model

# Light GBM
  lgbm_random_search_runs: 0
  lgbm__device: cpu # gpu cpu
  lgbm__boosting_type: gbdt
  lgbm__objective: binary
  lgbm__metric: auc
  lgbm__number_boosting_rounds: 500
  lgbm__early_stopping_rounds: 50
  lgbm__learning_rate: 0.1
  lgbm__max_bin: 300
  lgbm__max_depth: -1
  lgbm__num_leaves: 100
  lgbm__min_child_samples: 600
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.1
  lgbm__min_gain_to_split: 0.5
  lgbm__reg_lambda: 50.0
  lgbm__reg_alpha: 0.0
  lgbm__scale_pos_weight: 1
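
To hand these settings to LightGBM, the `lgbm__` prefix has to be stripped first. The sketch below shows one way to do that in plain Python. The helper `split_lgbm_config` is hypothetical, and treating the two `*_rounds` entries as training-loop settings rather than model parameters is our assumption about how the config is consumed.

```python
def split_lgbm_config(config, prefix='lgbm__'):
    """Split prefixed config entries into model and training settings."""
    training_keys = {'number_boosting_rounds', 'early_stopping_rounds'}
    model_params, training_params = {}, {}
    for key, value in config.items():
        if not key.startswith(prefix):
            continue  # e.g. lgbm_random_search_runs uses a single underscore
        name = key[len(prefix):]
        target = training_params if name in training_keys else model_params
        target[name] = value
    return model_params, training_params


# A subset of the configuration above.
config = {
    'lgbm_random_search_runs': 0,
    'lgbm__objective': 'binary',
    'lgbm__metric': 'auc',
    'lgbm__number_boosting_rounds': 500,
    'lgbm__early_stopping_rounds': 50,
    'lgbm__learning_rate': 0.1,
    'lgbm__num_leaves': 100,
    'lgbm__min_child_samples': 600,
    'lgbm__colsample_bytree': 0.1,
    'lgbm__reg_lambda': 50.0,
}
model_params, training_params = split_lgbm_config(config)
```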

Remarks

  • The diagrams below show little diversity between folds (which is good) and quite a large gap between train and valid scores (which is badish 😉).

solution-3-lgb_monitor

  • Tweaking the parameters responsible for regularization may help:
  lgbm__min_child_samples: 600
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.1
  lgbm__min_gain_to_split: 0.5
  lgbm__reg_lambda: 50.0
  lgbm__reg_alpha: 0.0
  • One can also investigate somewhat different tree 🌴 architectures. As of now the tree is deep and wide:
  lgbm__max_depth: -1
  lgbm__num_leaves: 100
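
One way to explore the regularization and tree-shape parameters listed above is a small random search. The sketch below only samples candidate configurations; all ranges are hypothetical and chosen purely for illustration, and each candidate would still have to be scored with the same cross-validation as the baseline.

```python
import random

# Hypothetical search ranges, for illustration only.
SEARCH_SPACE = {
    'lgbm__min_child_samples': lambda rng: rng.randint(100, 1000),
    'lgbm__colsample_bytree': lambda rng: rng.uniform(0.05, 0.5),
    'lgbm__reg_lambda': lambda rng: 10 ** rng.uniform(0, 2),  # 1 to 100, log scale
    'lgbm__reg_alpha': lambda rng: rng.uniform(0.0, 10.0),
    'lgbm__max_depth': lambda rng: rng.choice([-1, 4, 6, 8]),
    'lgbm__num_leaves': lambda rng: rng.randint(15, 100),
}


def sample_configs(n_runs, seed=0):
    """Draw n_runs random configurations from SEARCH_SPACE, reproducibly."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in SEARCH_SPACE.items()}
            for _ in range(n_runs)]


candidates = sample_configs(3)
for candidate in candidates:
    print(candidate)
```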

pipeline diagram

For your reference, we put the entire pipeline here. solution-3