This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

LightGBM on selected features

Jakub edited this page Jun 28, 2018 · 27 revisions

Validation

  • 5-fold stratified K-fold
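A minimal sketch of this validation scheme, assuming scikit-learn's StratifiedKFold is used. The toy data and the shuffle and random_state settings are illustrative assumptions, not values taken from the project.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target (illustrative only): 10 positives, 90 negatives.
y = np.array([1] * 10 + [0] * 90)
X = np.arange(len(y)).reshape(-1, 1)

# shuffle and random_state are assumptions made for reproducibility.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, valid_idx in skf.split(X, y):
    # Stratification keeps the class ratio of each validation fold
    # close to the class ratio of the full data set.
    fold_positive_rates.append(y[valid_idx].mean())

print(fold_positive_rates)
```

Stratification matters here because the target is imbalanced; a plain K-fold split could leave some folds with noticeably fewer positives than others.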

Preprocessing

  • Application

  • DAYS_EMPLOYED value of 365243 -> np.nan
  • Encode the following columns as categorical:

CATEGORICAL_COLUMNS = ['CODE_GENDER', 'EMERGENCYSTATE_MODE', 'FLAG_CONT_MOBILE',
                       'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7',
                       'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_18',
                       'FLAG_EMAIL', 'FLAG_EMP_PHONE', 'FLAG_MOBIL', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_PHONE', 'FLAG_WORK_PHONE',
                       'FONDKAPREMONT_MODE', 'HOUR_APPR_PROCESS_START', 'HOUSETYPE_MODE',
                       'LIVE_CITY_NOT_WORK_CITY', 'LIVE_REGION_NOT_WORK_REGION',
                       'NAME_CONTRACT_TYPE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE',
                       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE',
                       'OCCUPATION_TYPE', 'ORGANIZATION_TYPE',
                       'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'REG_REGION_NOT_LIVE_REGION',
                       'REG_REGION_NOT_WORK_REGION',
                       'WALLSMATERIAL_MODE', 'WEEKDAY_APPR_PROCESS_START']
  • No missing value imputation
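The preprocessing steps above can be sketched as follows. This is an illustration under assumptions: the helper name preprocess_application is hypothetical, and the pandas 'category' dtype is only one possible way to encode the categorical columns; the project may encode them differently.

```python
import numpy as np
import pandas as pd


def preprocess_application(X, categorical_columns):
    """Replace the DAYS_EMPLOYED placeholder and mark categorical columns."""
    X = X.copy()
    # 365243 is the placeholder value mentioned above; treat it as missing.
    X['DAYS_EMPLOYED'] = X['DAYS_EMPLOYED'].replace(365243, np.nan)
    for col in categorical_columns:
        # Assumption: pandas 'category' dtype stands in for the encoding step.
        X[col] = X[col].astype('category')
    # Missing values are deliberately left as NaN (no imputation).
    return X


# Toy frame with a small subset of the columns listed above.
toy = pd.DataFrame({
    'DAYS_EMPLOYED': [-1200, 365243, -300],
    'CODE_GENDER': ['F', 'M', 'F'],
    'FLAG_OWN_CAR': ['Y', 'N', None],
})
clean = preprocess_application(toy, ['CODE_GENDER', 'FLAG_OWN_CAR'])
print(clean.dtypes)
```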

Feature Extraction

  • Application

Details can be found in notebooks/eda-application.ipynb.

  • Raw numerical columns
  • Raw categorical columns
  • Engineered features
X['annuity_income_percentage'] = X['AMT_ANNUITY'] / X['AMT_INCOME_TOTAL']
X['car_to_birth_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_BIRTH']
X['car_to_employ_ratio'] = X['OWN_CAR_AGE'] / X['DAYS_EMPLOYED']
X['children_ratio'] = X['CNT_CHILDREN'] / X['CNT_FAM_MEMBERS']
X['credit_to_annuity_ratio'] = X['AMT_CREDIT'] / X['AMT_ANNUITY']
X['credit_to_goods_ratio'] = X['AMT_CREDIT'] / X['AMT_GOODS_PRICE']
X['credit_to_income_ratio'] = X['AMT_CREDIT'] / X['AMT_INCOME_TOTAL']
X['days_employed_percentage'] = X['DAYS_EMPLOYED'] / X['DAYS_BIRTH']
X['ext_sources_mean'] = X[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)
X['income_credit_percentage'] = X['AMT_INCOME_TOTAL'] / X['AMT_CREDIT']
X['income_per_child'] = X['AMT_INCOME_TOTAL'] / (1 + X['CNT_CHILDREN'])
X['income_per_person'] = X['AMT_INCOME_TOTAL'] / X['CNT_FAM_MEMBERS']
X['payment_rate'] = X['AMT_ANNUITY'] / X['AMT_CREDIT']
X['phone_to_birth_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_BIRTH']
X['phone_to_employ_ratio'] = X['DAYS_LAST_PHONE_CHANGE'] / X['DAYS_EMPLOYED']

ext_sources_mean has the strongest correlation with the target:

TARGET                       1.000000
ext_sources_mean             0.222052
credit_to_goods_ratio        0.069427
car_to_birth_ratio           0.048824
days_employed_percentage     0.042206
phone_to_birth_ratio         0.033991
credit_to_annuity_ratio      0.032102
car_to_employ_ratio          0.030553
children_ratio               0.021223
annuity_income_percentage    0.014265
payment_rate                 0.012704
income_per_child             0.012529
credit_to_income_ratio       0.007727
income_per_person            0.006571
phone_to_employ_ratio        0.004562
income_credit_percentage     0.001817
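A ranking like the one above could be produced roughly as follows. This sketch assumes the listed values are absolute Pearson correlations (all of them are positive, but the page does not say so explicitly); the helper name correlation_with_target and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def correlation_with_target(X, target='TARGET'):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    corr = X.select_dtypes('number').corr()[target]
    # sort_values places NaN (e.g. from zero-variance columns) at the end.
    return corr.abs().sort_values(ascending=False)


toy = pd.DataFrame({
    'TARGET': [0, 0, 1, 1, 0, 1],
    'EXT_SOURCE_1': [0.8, 0.7, 0.2, np.nan, 0.6, 0.1],
    'EXT_SOURCE_2': [0.9, np.nan, 0.3, 0.2, 0.7, 0.4],
    'EXT_SOURCE_3': [0.5, 0.6, np.nan, 0.1, 0.9, 0.2],
    'constant': [1, 1, 1, 1, 1, 1],
})
# mean(axis=1) skips NaN, so the feature is defined whenever at least
# one of the three external sources is present.
toy['ext_sources_mean'] = toy[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']].mean(axis=1)

ranking = correlation_with_target(toy)
print(ranking)
```

A column with zero variance has an undefined correlation, which pandas reports as NaN.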
  • Aggregated features

Aggregations are constructed from recipes (all defined in pipeline_config.py), like this:

AGGREGATION_RECIPIES = [
    (['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_ANNUITY', 'max'),
                                              ('AMT_CREDIT', 'max'),
                                              ('EXT_SOURCE_1', 'mean'),
                                              ('EXT_SOURCE_2', 'mean'),
                                              ('OWN_CAR_AGE', 'max'),
                                              ('OWN_CAR_AGE', 'sum')]),
]

Again, features constructed from EXT_SOURCE_... have the strongest correlation with the target:

TARGET                                                                                      1.000000
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1                0.089964
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_2    0.089235
CODE_GENDER_NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_EXT_SOURCE_1    0.086676
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_1                                       0.083520
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_2                                       0.082742
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_REG_CITY_NOT_WORK_CITY_mean_ELEVATORS_AVG               0.078057
OCCUPATION_TYPE_mean_EXT_SOURCE_1                                                           0.076587
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_REQ_CREDIT_BUREAU_YEAR                         0.074528
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_YEARS_BUILD_AVG                                    0.073816
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_NONLIVINGAREA_AVG                                  0.073730
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_OWN_CAR_AGE                                        0.073535
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_APARTMENTS_AVG                                     0.072854
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_BASEMENTAREA_AVG                                   0.072231
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_1                                           0.071557
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_AMT_CREDIT                                         0.071023
OCCUPATION_TYPE_mean_EXT_SOURCE_2                                                           0.070659
CODE_GENDER_ORGANIZATION_TYPE_mean_EXT_SOURCE_1                                             0.070028
CODE_GENDER_NAME_EDUCATION_TYPE_max_OWN_CAR_AGE                                             0.067620
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_CNT_CHILDREN                                        0.066059
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_CREDIT                                              0.065317
CODE_GENDER_NAME_EDUCATION_TYPE_max_AMT_ANNUITY                                             0.064173
CODE_GENDER_NAME_EDUCATION_TYPE_mean_EXT_SOURCE_2                                           0.063390
CODE_GENDER_NAME_EDUCATION_TYPE_sum_OWN_CAR_AGE                                             0.062637
CODE_GENDER_ORGANIZATION_TYPE_mean_DAYS_REGISTRATION                                        0.052398
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_ANNUITY                                              0.052341
CODE_GENDER_ORGANIZATION_TYPE_mean_AMT_INCOME_TOTAL                                         0.050268
OCCUPATION_TYPE_mean_DAYS_EMPLOYED                                                          0.050074
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_AMT_ANNUITY                                         0.048534
OCCUPATION_TYPE_mean_AMT_ANNUITY                                                            0.046566
CODE_GENDER_REG_CITY_NOT_WORK_CITY_mean_DAYS_ID_PUBLISH                                     0.040932
OCCUPATION_TYPE_mean_DAYS_REGISTRATION                                                      0.035164
OCCUPATION_TYPE_mean_CNT_CHILDREN                                                           0.019836
OCCUPATION_TYPE_mean_EXT_SOURCE_3                                                           0.007225
OCCUPATION_TYPE_mean_CNT_FAM_MEMBERS                                                        0.005959
OCCUPATION_TYPE_mean_DAYS_BIRTH                                                             0.003795
OCCUPATION_TYPE_mean_DAYS_ID_PUBLISH                                                        0.002663
NAME_EDUCATION_TYPE_OCCUPATION_TYPE_mean_EXT_SOURCE_3                                       0.001598
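One way such recipes could be turned into features is sketched below. The helper apply_recipes is hypothetical, the use of groupby(...).transform is an assumption (the project may compute and merge the aggregates differently), and the naming pattern is inferred from the column names in the table above.

```python
import pandas as pd

# Shortened recipe in the same shape as the one shown in pipeline_config.py.
AGGREGATION_RECIPIES = [
    (['CODE_GENDER', 'NAME_EDUCATION_TYPE'], [('AMT_ANNUITY', 'max'),
                                              ('EXT_SOURCE_1', 'mean')]),
]


def apply_recipes(X, recipes):
    """Add one group-level statistic per (groupby columns, column, aggregation)."""
    X = X.copy()
    for groupby_cols, specs in recipes:
        grouped = X.groupby(groupby_cols)
        for select, agg in specs:
            # Inferred pattern: <groupby columns>_<aggregation>_<column>
            name = '{}_{}_{}'.format('_'.join(groupby_cols), agg, select)
            # transform broadcasts the group statistic back onto every row.
            X[name] = grouped[select].transform(agg)
    return X


toy = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'M', 'M'],
    'NAME_EDUCATION_TYPE': ['Higher', 'Higher', 'Secondary', 'Secondary'],
    'AMT_ANNUITY': [100.0, 300.0, 200.0, 50.0],
    'EXT_SOURCE_1': [0.2, 0.4, 0.5, 0.7],
})
out = apply_recipes(toy, AGGREGATION_RECIPIES)
print(out.filter(like='CODE_GENDER_NAME_EDUCATION_TYPE_'))
```

Each applicant ends up with the statistic of the group they belong to, so the new columns describe how an applicant's peers look rather than the applicant alone.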
  • Bureau

Details can be found in notebooks/eda-bureau.ipynb.

  • Hand-crafted features
TARGET                                   1.000000
bureau_credit_active_binary              0.105735
bureau_debt_credit_ratio                 0.096372
bureau_credit_enddate_percentage         0.053573
bureau_total_customer_overdue            0.052995
bureau_total_customer_credit             0.041768
bureau_total_customer_debt               0.019435
bureau_number_of_loan_types              0.018792
bureau_average_of_past_loans_per_type    0.014492
bureau_average_creditdays_prolonged      0.011719
bureau_overdue_debt_ratio                0.008374
bureau_number_of_past_loans              0.006160
  • Stat Aggregations
TARGET                                    1.000000
SK_ID_CURR_mean_DAYS_CREDIT               0.089729
SK_ID_CURR_min_DAYS_CREDIT                0.075248
SK_ID_CURR_mean_DAYS_CREDIT_UPDATE        0.068927
SK_ID_CURR_sum_DAYS_CREDIT_ENDDATE        0.053735
SK_ID_CURR_max_DAYS_CREDIT                0.049782
SK_ID_CURR_mean_DAYS_CREDIT_ENDDATE       0.046983
SK_ID_CURR_min_DAYS_CREDIT_UPDATE         0.042864
SK_ID_CURR_sum_DAYS_CREDIT                0.042000
SK_ID_CURR_sum_DAYS_CREDIT_UPDATE         0.041404
...
SK_ID_CURR_min_AMT_CREDIT_SUM_DEBT        0.000242
SK_ID_CURR_min_CNT_CREDIT_PROLONG         0.000182
SK_ID_CURR_min_AMT_CREDIT_SUM_OVERDUE     0.000003
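The statistical aggregations collapse the many bureau rows of each customer into a single row keyed by SK_ID_CURR, which can then be joined to the application table. A minimal sketch, assuming a left join and a hypothetical helper aggregate_per_customer; the column naming follows the pattern visible in the table above.

```python
import pandas as pd


def aggregate_per_customer(child, key, columns, aggs):
    """Collapse a one-to-many table into one row of statistics per key."""
    grouped = child.groupby(key)[columns].agg(aggs)
    # Flatten the (column, aggregation) MultiIndex into names such as
    # SK_ID_CURR_mean_DAYS_CREDIT.
    grouped.columns = ['{}_{}_{}'.format(key, agg, col)
                       for col, agg in grouped.columns]
    return grouped.reset_index()


# Toy bureau table: customer 1 has two past credits, customer 2 has one.
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 2],
    'DAYS_CREDIT': [-100, -300, -50],
    'AMT_CREDIT_SUM_DEBT': [0.0, 500.0, 20.0],
})
features = aggregate_per_customer(
    bureau, 'SK_ID_CURR',
    ['DAYS_CREDIT', 'AMT_CREDIT_SUM_DEBT'],
    ['mean', 'min', 'max', 'sum'])

# Customer 3 has no bureau history; the left join leaves NaN for them.
application = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
merged = application.merge(features, on='SK_ID_CURR', how='left')
print(merged)
```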
  • Credit Card

Details can be found in notebooks/eda-credit_card.ipynb.

  • Hand-crafted features
TARGET                                     1.000000
credit_card_drawings_atm                   0.038106
credit_card_installments_per_loan          0.031622
credit_card_total_instalments              0.031304
credit_card_drawings_total                 0.023680
credit_card_number_of_loans                0.004388
credit_card_average_of_days_past_due       0.003195
credit_card_avg_loading_of_credit_limit    0.002944
credit_card_cash_card_ratio                0.002414
  • Aggregated features
TARGET                                        1.000000
SK_ID_CURR_mean_CNT_DRAWINGS_ATM_CURRENT      0.107692
SK_ID_CURR_max_CNT_DRAWINGS_CURRENT           0.101389
SK_ID_CURR_mean_AMT_BALANCE                   0.087177
SK_ID_CURR_mean_CNT_DRAWINGS_CURRENT          0.082520
SK_ID_CURR_max_AMT_BALANCE                    0.068798
SK_ID_CURR_min_AMT_BALANCE                    0.064163
SK_ID_CURR_max_CNT_DRAWINGS_ATM_CURRENT       0.063729
SK_ID_CURR_var_CNT_DRAWINGS_CURRENT           0.062892
SK_ID_CURR_mean_MONTHS_BALANCE                0.062081
SK_ID_CURR_min_MONTHS_BALANCE                 0.061359
SK_ID_CURR_var_CNT_DRAWINGS_ATM_CURRENT       0.061123
SK_ID_CURR_mean_AMT_DRAWINGS_ATM_CURRENT      0.059925
SK_ID_CURR_sum_MONTHS_BALANCE                 0.059051
SK_ID_CURR_var_MONTHS_BALANCE                 0.058817
SK_ID_CURR_mean_AMT_DRAWINGS_CURRENT          0.058732
SK_ID_CURR_max_AMT_DRAWINGS_CURRENT           0.052318
SK_ID_CURR_sum_CNT_DRAWINGS_CURRENT           0.050685
SK_ID_CURR_sum_CNT_DRAWINGS_ATM_CURRENT       0.049970
SK_ID_CURR_sum_AMT_CREDIT_LIMIT_ACTUAL        0.045460
SK_ID_CURR_sum_CNT_INSTALMENT_MATURE_CUM      0.042363
...
SK_ID_CURR_max_AMT_PAYMENT_CURRENT            0.000438
SK_ID_CURR_sum_CNT_DRAWINGS_OTHER_CURRENT     0.000227
SK_ID_CURR_max_CNT_DRAWINGS_OTHER_CURRENT     0.000008
SK_ID_CURR_min_SK_DPD                              NaN
SK_ID_CURR_min_SK_DPD_DEF                          NaN

Model

  • LightGBM