This repository has been archived by the owner on Jun 22, 2022. It is now read-only.

LightGBM clean dynamic features

Jakub edited this page Sep 3, 2018 · 13 revisions

Sunflower 🌻

🌻 code

We continue working with a single model, LightGBM. Our primary focus is on feature engineering. Thanks to this approach we obtained significant gains on local CV and LB 🏆.

Validation

  • 5-fold stratified K-fold cross-validation (the number of folds, 5, is a parameter in the configuration file: neptune.yaml#L38)
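As a rough illustration of this validation scheme, here is a minimal sketch using scikit-learn's `StratifiedKFold`. The toy data, the `N_FOLDS` constant, and the random seed are assumptions made for the example and are not taken from the project code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

N_FOLDS = 5  # assumed to mirror the fold count set in neptune.yaml

# Toy data: 100 rows with 20% positives (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 20 + [0] * 80)

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

for fold_id, (train_idx, valid_idx) in enumerate(folds):
    # Stratification keeps the positive rate of each validation fold
    # close to the overall positive rate.
    print(fold_id, len(valid_idx), y[valid_idx].mean())
```

Stratification matters here because the target is imbalanced: a plain K-fold split could leave some folds with noticeably fewer positives, which makes per-fold AUC noisier.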

Preprocessing

We realized that we had previously been cleaning the tables only for hand-crafted features and not for aggregations, which was obviously a mistake. We have fixed it in this release.

Application data

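        # Treat 0 as missing; assign back instead of calling replace(inplace=True) on a column selection.
        X['DAYS_LAST_PHONE_CHANGE'] = X['DAYS_LAST_PHONE_CHANGE'].replace(0, np.nan)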

Bureau data

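        # Day offsets below -40000 (more than 100 years back) are set to missing.
        # .loc avoids chained assignment, which may silently modify a copy.
        bureau.loc[bureau['DAYS_CREDIT_ENDDATE'] < -40000, 'DAYS_CREDIT_ENDDATE'] = np.nan
        bureau.loc[bureau['DAYS_CREDIT_UPDATE'] < -40000, 'DAYS_CREDIT_UPDATE'] = np.nan
        bureau.loc[bureau['DAYS_ENDDATE_FACT'] < -40000, 'DAYS_ENDDATE_FACT'] = np.nan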

Credit Card data

        # Negative drawing amounts are set to missing (.loc avoids chained assignment).
        credit_card.loc[credit_card['AMT_DRAWINGS_ATM_CURRENT'] < 0, 'AMT_DRAWINGS_ATM_CURRENT'] = np.nan
        credit_card.loc[credit_card['AMT_DRAWINGS_CURRENT'] < 0, 'AMT_DRAWINGS_CURRENT'] = np.nan
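Since the point of this release is that cleaning must run before both hand-crafted features and aggregations, one option is to wrap the rules in small functions that are applied once, right after loading each table. The sketch below is only an illustration: the helper names `clean_bureau` and `clean_credit_card` and the toy values are hypothetical, while the thresholds are the ones shown above.

```python
import numpy as np
import pandas as pd


def clean_bureau(bureau, day_columns, threshold=-40000):
    """Return a copy with day offsets below the threshold replaced by NaN."""
    bureau = bureau.copy()
    for col in day_columns:
        bureau.loc[bureau[col] < threshold, col] = np.nan
    return bureau


def clean_credit_card(credit_card, amount_columns):
    """Return a copy with negative drawing amounts replaced by NaN."""
    credit_card = credit_card.copy()
    for col in amount_columns:
        credit_card.loc[credit_card[col] < 0, col] = np.nan
    return credit_card


# Toy frames with made-up values.
bureau = pd.DataFrame({'DAYS_CREDIT_ENDDATE': [-42000.0, -300.0, 120.0]})
credit_card = pd.DataFrame({'AMT_DRAWINGS_CURRENT': [-10.0, 0.0, 250.0]})

bureau_clean = clean_bureau(bureau, ['DAYS_CREDIT_ENDDATE'])
credit_card_clean = clean_credit_card(credit_card, ['AMT_DRAWINGS_CURRENT'])

print(bureau_clean)
print(credit_card_clean)
```

Returning a cleaned copy means every downstream feature extractor, hand-crafted or aggregation-based, can be given the same cleaned table.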

Feature Extraction

Altogether we use 1092 features in our solution (it does need some pruning :) ).

Application data -> eda-application.ipynb

  • Hand Crafted features
credit_per_child            0.033503
credit_per_person           0.023462
child_to_non_child_ratio    0.020943
credit_per_non_child        0.020244
cnt_non_child               0.012195
income_per_non_child        0.001947
  • Aggregation features: Adding more numerical values over which aggregations are performed helped considerably. We are now using the following:
cols_to_agg = ['AMT_CREDIT', 
               'AMT_ANNUITY',
               'AMT_INCOME_TOTAL',
               'AMT_GOODS_PRICE', 
               'EXT_SOURCE_1',
               'EXT_SOURCE_2',
               'EXT_SOURCE_3',
               'OWN_CAR_AGE',
               'REGION_POPULATION_RELATIVE',
               'DAYS_REGISTRATION',
               'CNT_CHILDREN',
               'CNT_FAM_MEMBERS',
               'DAYS_ID_PUBLISH',
               'DAYS_BIRTH',
               'DAYS_EMPLOYED'
]
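To make the aggregation idea concrete, here is a minimal sketch of group-level aggregation features in pandas. It is not the project's implementation: the function name `add_group_aggregations`, the grouping column, the chosen statistics, and the generated feature names are all assumptions for illustration. Only the numerical column names come from `cols_to_agg` above.

```python
import pandas as pd


def add_group_aggregations(df, groupby_cols, cols_to_agg, aggs=('mean', 'max')):
    """Add group-level statistics and each row's difference from them."""
    df = df.copy()
    group_key = '_'.join(groupby_cols)
    for col in cols_to_agg:
        for agg in aggs:
            agg_name = '{}_{}_{}'.format(group_key, agg, col)
            # transform broadcasts the group statistic back to every row.
            df[agg_name] = df.groupby(groupby_cols)[col].transform(agg)
            df['{}_diff'.format(agg_name)] = df[col] - df[agg_name]
    return df


# Toy application table with made-up values.
application = pd.DataFrame({
    'CODE_GENDER': ['F', 'F', 'M', 'M'],
    'AMT_CREDIT': [100.0, 300.0, 200.0, 600.0],
    'AMT_ANNUITY': [10.0, 30.0, 20.0, 40.0],
})

features = add_group_aggregations(
    application, ['CODE_GENDER'], ['AMT_CREDIT', 'AMT_ANNUITY'])
print(features.filter(like='AMT_CREDIT'))
```

Each numerical column added to `cols_to_agg` multiplies the number of generated features by the number of groupings and statistics, which is one way a feature count can grow past a thousand.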

Installment Payments data -> eda-installments.ipynb

  • Hand crafted features:

We added fraction features that divide each short-term feature by its long-term counterpart. They may not add much in terms of correlation, but they help tree-based models.

    def last_k_installment_features_with_fractions(gr, periods, period_fractions):
        features = InstallmentPaymentsFeatures.last_k_installment_features(gr, periods)

        for short_period, long_period in period_fractions:
            short_feature_names = get_feature_names_by_period(features, short_period)
            long_feature_names = get_feature_names_by_period(features, long_period)

            for short_feature, long_feature in zip(short_feature_names, long_feature_names):
                old_name_chunk = '_{}_'.format(short_period)
                new_name_chunk = '_{}by{}_fraction_'.format(short_period, long_period)
                fraction_feature_name = short_feature.replace(old_name_chunk, new_name_chunk)
                features[fraction_feature_name] = safe_div(features[short_feature], features[long_feature])
        return features
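The function above relies on helpers that are not shown on this page. The following self-contained sketch shows one way the same idea could work on a plain dictionary of features. The bodies of `safe_div` and `get_feature_names_by_period` are guesses at what such helpers might do, and `add_fraction_features` and the toy feature names are hypothetical.

```python
import numpy as np


def safe_div(a, b):
    """Divide a by b, returning NaN when b is zero."""
    try:
        return float(a) / float(b)
    except ZeroDivisionError:
        return np.nan


def get_feature_names_by_period(features, period):
    """Return sorted feature names that contain the '_<period>_' chunk."""
    chunk = '_{}_'.format(period)
    return sorted(name for name in features if chunk in name)


def add_fraction_features(features, period_fractions):
    """Add short-period / long-period ratios to a copy of the features."""
    features = dict(features)
    for short_period, long_period in period_fractions:
        short_chunk = '_{}_'.format(short_period)
        long_chunk = '_{}_'.format(long_period)
        new_chunk = '_{}by{}_fraction_'.format(short_period, long_period)
        for short_name in get_feature_names_by_period(features, short_period):
            # Pair features by name so the ratio never mixes statistics.
            long_name = short_name.replace(short_chunk, long_chunk)
            if long_name not in features:
                continue
            fraction_name = short_name.replace(short_chunk, new_chunk)
            features[fraction_name] = safe_div(
                features[short_name], features[long_name])
    return features


# Toy per-applicant features with made-up values.
features = {
    'last_5_paid_late_mean': 0.2,
    'last_20_paid_late_mean': 0.4,
    'last_5_paid_late_sum': 1.0,
    'last_20_paid_late_sum': 0.0,
}
result = add_fraction_features(features, [(5, 20)])
print(result['last_5by20_fraction_paid_late_mean'])
```

Pairing by name, instead of zipping two lists, is a choice made for this sketch: it does not depend on both periods producing features in the same order. A ratio well above 1 for a "paid late" mean would indicate that recent behaviour is worse than the long-run average, which a tree can split on directly.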

POS Cash Balance application data -> eda-pos_cash_balance.ipynb

Below are the 3 feature groups generated for POS cash balance:

last_10_pos_cash_paid_late_with_tolerance_mean             0.052731
last_50_pos_cash_paid_late_with_tolerance_mean             0.048322
all_installment_pos_cash_paid_late_with_tolerance_mean     0.047050
last_10_pos_cash_paid_late_mean                            0.043763
last_50_pos_cash_paid_late_with_tolerance_count            0.043304
last_50_pos_cash_paid_late_count                           0.043304
all_installment_pos_cash_paid_late_count                   0.035632
all_installment_pos_cash_paid_late_with_tolerance_count    0.035632
last_50_pos_cash_paid_late_mean                            0.032820
all_installment_pos_cash_paid_late_mean                    0.030616
...
last_1_SK_DPD_DEF_max                                      0.003592
last_50_SK_DPD_min                                         0.003585
last_50_SK_DPD_DEF_min                                     0.002355
last_1_pos_cash_paid_late_count                                 NaN
last_1_pos_cash_paid_late_with_tolerance_count                  NaN

last_loan_pos_cash_paid_late_with_tolerance_mean    0.049801
last_loan_pos_cash_paid_late_mean                   0.042730
last_loan_pos_cash_paid_late_with_tolerance_sum     0.028442
last_loan_pos_cash_paid_late_sum                    0.015614
last_loan_SK_DPD_std                                0.007400
last_loan_SK_DPD_DEF_std                            0.007180
last_loan_SK_DPD_max                                0.006939
last_loan_SK_DPD_DEF_max                            0.006845
last_loan_SK_DPD_mean                               0.006002
last_loan_SK_DPD_sum                                0.004737
last_loan_pos_cash_paid_late_count                  0.003446
last_loan_SK_DPD_DEF_mean                           0.003391
last_loan_SK_DPD_DEF_min                            0.003115
last_loan_SK_DPD_min                                0.002458
last_loan_SK_DPD_DEF_sum                            0.002150

60_period_trend_SK_DPD_DEF    0.010600
60_period_trend_SK_DPD        0.009394
6_period_trend_SK_DPD_DEF     0.004313
12_period_trend_SK_DPD_DEF    0.004157
30_period_trend_SK_DPD_DEF    0.003879
30_period_trend_SK_DPD        0.003474
6_period_trend_SK_DPD         0.001397
12_period_trend_SK_DPD        0.000429
1_period_trend_SK_DPD              NaN
1_period_trend_SK_DPD_DEF          NaN
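The feature names in these listings suggest two kinds of per-applicant computation: aggregations over the last k monthly records and trends over the last k periods. The sketch below shows how such features could be built. It is a guess at the approach, not the project's code: the "paid late" definitions (`SK_DPD > 0` and `SK_DPD_DEF > 0`), the use of a least-squares slope for the trend, and all function names are assumptions.

```python
import numpy as np
import pandas as pd


def add_late_flags(pos_cash):
    """Flag late payments (assumed definition: any days past due)."""
    pos_cash = pos_cash.copy()
    pos_cash['pos_cash_paid_late'] = (pos_cash['SK_DPD'] > 0).astype(int)
    pos_cash['pos_cash_paid_late_with_tolerance'] = (
        pos_cash['SK_DPD_DEF'] > 0).astype(int)
    return pos_cash


def last_k_features(group, periods):
    """Aggregate the k most recent monthly records of one applicant."""
    group = group.sort_values('MONTHS_BALANCE', ascending=False)
    features = {}
    for k in periods:
        recent = group.head(k)
        features['last_{}_pos_cash_paid_late_mean'.format(k)] = (
            recent['pos_cash_paid_late'].mean())
        features['last_{}_SK_DPD_max'.format(k)] = recent['SK_DPD'].max()
    return features


def trend_feature(group, k, column):
    """Slope of a least-squares line through the k most recent values."""
    recent = group.sort_values('MONTHS_BALANCE').tail(k)
    if len(recent) < 2:
        return np.nan  # a slope needs at least two points
    slope, _intercept = np.polyfit(
        recent['MONTHS_BALANCE'], recent[column], deg=1)
    return slope


# Toy history of a single applicant with made-up values.
history = add_late_flags(pd.DataFrame({
    'MONTHS_BALANCE': [-4, -3, -2, -1],
    'SK_DPD': [0, 0, 2, 4],
    'SK_DPD_DEF': [0, 0, 0, 4],
}))

features = last_k_features(history, periods=[1, 3])
trend_4 = trend_feature(history, 4, 'SK_DPD')
trend_1 = trend_feature(history, 1, 'SK_DPD')
print(features, trend_4, trend_1)
```

In this sketch a one-period trend is undefined and comes out as NaN, which may be why the `1_period_trend_*` rows in the listing show NaN.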

Model

  • We continue working with a single LightGBM model, implemented here: models.py#L80 💻
  • Results for the new set of features are rather nice 😉:
    • CV 0.7950 🎆
    • LB 0.804 🎉
  • We trained the model with the following hyper-parameters (check the config file 📒):
# Light GBM
  lgbm__boosting_type: gbdt
  lgbm__objective: binary
  lgbm__metric: auc
  lgbm__number_boosting_rounds: 5000
  lgbm__early_stopping_rounds: 100
  lgbm__learning_rate: 0.02
  lgbm__max_bin: 300
  lgbm__max_depth: -1
  lgbm__num_leaves: 30
  lgbm__min_child_samples: 70
  lgbm__subsample: 1.0
  lgbm__subsample_freq: 1
  lgbm__colsample_bytree: 0.05
  lgbm__min_gain_to_split: 0.5
  lgbm__reg_lambda: 100
  lgbm__reg_alpha: 0.0
  lgbm__scale_pos_weight: 1
  lgbm__is_unbalance: False
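For readers who want to try these settings outside the project pipeline, here is a hedged sketch of how the prefixed config keys could be turned into arguments for LightGBM's native `lgb.train` API. The `split_config` and `train` helpers are hypothetical and the project's own wiring in models.py may differ. The parameter names are passed through unchanged, relying on LightGBM accepting scikit-learn-style aliases such as `subsample` and `colsample_bytree`.

```python
CONFIG = {
    'lgbm__boosting_type': 'gbdt',
    'lgbm__objective': 'binary',
    'lgbm__metric': 'auc',
    'lgbm__number_boosting_rounds': 5000,
    'lgbm__early_stopping_rounds': 100,
    'lgbm__learning_rate': 0.02,
    'lgbm__max_bin': 300,
    'lgbm__max_depth': -1,
    'lgbm__num_leaves': 30,
    'lgbm__min_child_samples': 70,
    'lgbm__subsample': 1.0,
    'lgbm__subsample_freq': 1,
    'lgbm__colsample_bytree': 0.05,
    'lgbm__min_gain_to_split': 0.5,
    'lgbm__reg_lambda': 100,
    'lgbm__reg_alpha': 0.0,
    'lgbm__scale_pos_weight': 1,
    'lgbm__is_unbalance': False,
}


def split_config(config, prefix='lgbm__'):
    """Strip the prefix and separate training-loop settings from model params."""
    params = {key[len(prefix):]: value
              for key, value in config.items() if key.startswith(prefix)}
    num_boost_round = params.pop('number_boosting_rounds')
    early_stopping_rounds = params.pop('early_stopping_rounds')
    return params, num_boost_round, early_stopping_rounds


def train(config, X_train, y_train, X_valid, y_valid):
    """Train a booster with early stopping on the validation set."""
    import lightgbm as lgb  # imported lazily; only needed for training

    params, num_boost_round, early_stopping_rounds = split_config(config)
    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    return lgb.train(
        params,
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=early_stopping_rounds)],
    )


params, num_boost_round, early_stopping_rounds = split_config(CONFIG)
print(num_boost_round, early_stopping_rounds, params['colsample_bytree'])
```

Two settings stand out. With `colsample_bytree: 0.05` and 1092 features, each tree is built from roughly 55 randomly chosen features, and `reg_lambda: 100` applies strong L2 regularization. Both choices are consistent with a very wide, partly redundant feature set.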

Pipeline diagram

Since the diagram below is quite wide (it uses multiple input files), here is a link to the larger version.

HC-solution-5