# Do feature engineering to improve LightGBM prediction
This kernel closely follows https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro, but instead of running hyperparameter optimisation it uses optimal values from that kernel and thus runs faster. 

Several key points:
- **This kernel runs training on the heads of households only** (after extracting aggregates over households). This follows the announced scoring strategy: *Note that ONLY the heads of household are used in scoring. All household members are included in test + the sample submission, but only heads of households are scored.* (from the data description). 
- **It seems to be very important to balance class frequencies.** Without balancing, a trained model gives ~0.39 PLB / ~0.43 local test, while adding balancing leads to ~0.42 PLB / 0.47 local test. One can balance by hand or by undersampling, but the simplest option (and a more powerful one than undersampling) is to set `class_weight='balanced'` in the LightGBM model constructor in the sklearn API, which assigns each class a weight inversely proportional to its frequency. *Note that a better procedure would be to tune those weights in a CV loop instead of blindly assigning 1/n weights*
- **This kernel uses the macro F1 score for early stopping in training**. This is done to align with the scoring strategy.
- Categoricals are turned into numbers with proper mapping instead of blind label encoding. 
- **OHE is reversed into label encoding, as it is easier to digest for a tree model.** This trick would be harmful for non-tree models, so be careful.
- **idhogar is NOT used in training**. The only way it could carry any information would be through a data leak. We are fighting poverty here, and exploiting leaks will not reduce poverty in any way :)
- **Squared features (`SQBXXX` and `agesq`) are NOT used in training**. These would be useful for a linear model, but are useless for a tree-based model and only confuse it (when bagging and resampling are done)
- **There are aggregations done within households and new features are hand-crafted**. Note that not many features can be aggregated, as most are already given at household level.
- **NEW: There are geographical aggregates calculated from households**
- **NEW: Models are built and evaluated in a nested CV loop**. This is done to reduce fluctuations in the early-stopping criterion as well as to average over several performance estimates.
- **A voting classifier is used to average over several LightGBM models**
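
To make the class balancing point above concrete, here is a minimal sketch of the `'balanced'` heuristic as documented in scikit-learn, `n_samples / (n_classes * np.bincount(y))`. The label vector is made up for illustration and is not the competition data.

```python
import numpy as np

# Hypothetical, imbalanced label vector with four classes (0..3)
y = np.array([0] * 5 + [1] * 10 + [2] * 15 + [3] * 70)

counts = np.bincount(y)                      # number of samples per class
weights = len(y) / (len(counts) * counts)    # the 'balanced' heuristic

for cls, (n, w) in enumerate(zip(counts, weights)):
    print('class {}: n = {:2d}, weight = {:.3f}'.format(cls, n, w))
```

With these weights every class contributes the same total weight (`counts * weights` is constant), so rare classes are no longer drowned out by the dominant one.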

The main goal is to do feature engineering

In [None]:
import numpy as np # linear algebra
import pandas as pd 

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

import warnings
warnings.filterwarnings("ignore")

The following categorical mapping originates from [this kernel](https://www.kaggle.com/mlisovyi/categorical-variables-encoding-function).

In [None]:
from sklearn.preprocessing import LabelEncoder

def encode_data(df):
    '''
    The function does not return, but transforms the input pd.DataFrame
    
    Encodes the Costa Rican Household Poverty Level data 
    following studies in https://www.kaggle.com/mlisovyi/categorical-variables-in-the-data
    and the insight from https://www.kaggle.com/c/costa-rican-household-poverty-prediction/discussion/61403#359631
    
    The following columns get transformed: edjefe, edjefa, dependency, idhogar
    The user most likely will simply drop idhogar completely (after calculating household-level aggregates)
    '''
    
    yes_no_map = {'no': 0, 'yes': 1}
    
    df['dependency'] = df['dependency'].replace(yes_no_map).astype(np.float32)
    
    df['edjefe'] = df['edjefe'].replace(yes_no_map).astype(np.float32)
    df['edjefa'] = df['edjefa'].replace(yes_no_map).astype(np.float32)
    
    df['idhogar'] = LabelEncoder().fit_transform(df['idhogar'])
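
A small sketch of what the yes/no mapping does to a mixed column. The sample values are hypothetical, and the sketch uses `Series.map` with a fallback instead of `replace`, which should give the same result for such values.

```python
import numpy as np
import pandas as pd

yes_no_map = {'no': 0, 'yes': 1}

# Hypothetical raw values: numbers stored as strings, mixed with 'yes'/'no'
raw = pd.Series(['8', 'yes', 'no', '.5'], name='dependency')

# Map the two text labels, keep every other value, then cast to float
encoded = raw.map(lambda v: yes_no_map.get(v, v)).astype(np.float32)
print(encoded.tolist())
```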

**There is also feature engineering magic happening here:**

In [None]:
def do_features(df):
    feats_div = [('children_fraction', 'r4t1', 'r4t3'), 
                 ('working_man_fraction', 'r4h2', 'r4t3'),
                 ('all_man_fraction', 'r4h3', 'r4t3'),
                 ('human_density', 'tamviv', 'rooms'),
                 ('human_bed_density', 'tamviv', 'bedrooms'),
                 ('bed_density', 'bedrooms', 'rooms'),
                 ('rent_per_person', 'v2a1', 'r4t3'),
                 ('rent_per_room', 'v2a1', 'rooms'),
                 ('mobile_density', 'qmobilephone', 'r4t3'),
                 ('tablet_density', 'v18q1', 'r4t3'),
                 ('mobile_adult_density', 'qmobilephone', 'r4t2'),
                 ('tablet_adult_density', 'v18q1', 'r4t2'),
                 ('male_over_female', 'r4h3', 'r4m3'),
                 ('man12plus_over_women12plus', 'r4h2', 'r4m2'),
                 ('pesioner_over_working', 'hogar_mayor', 'hogar_adul'),
                 ('children_over_working', 'hogar_nin', 'hogar_adul'),
                 ('education_fraction', 'escolari', 'age')
                 #('', '', ''),
                ]
    
    feats_sub = [('people_not_living', 'tamhog', 'tamviv'),
                 ('non_bedrooms', 'rooms', 'bedrooms'),
                 ('people_weird_stat', 'tamhog', 'r4t3')]

    for f_new, f1, f2 in feats_div:
        df['fe_' + f_new] = (df[f1] / df[f2]).astype(np.float32)       
    for f_new, f1, f2 in feats_sub:
        df['fe_' + f_new] = (df[f1] - df[f2]).astype(np.float32)
    
    # aggregation rules over household
    aggs_num = {'age': ['min', 'max', 'mean', 'count'],
                'escolari': ['min', 'max', 'mean', 'std'],
                'fe_education_fraction': ['min', 'max', 'mean', 'std']
               }
    aggs_cat = {'dis': ['mean']}
    for s_ in ['estadocivil', 'parentesco', 'instlevel']:
        for f_ in [f_ for f_ in df.columns if f_.startswith(s_)]:
            aggs_cat[f_] = ['mean']
    # aggregation over household
    for name_, df_ in [('18', df.query('age >= 18'))]:
        df_agg = df_.groupby('idhogar').agg({**aggs_num, **aggs_cat}).astype(np.float32)
        df_agg.columns = pd.Index(['agg' + name_ + '_' + e[0] + "_" + e[1].upper() for e in df_agg.columns.tolist()])
        df = df.join(df_agg, how='left', on='idhogar')
        del df_agg
    # do something advanced above...
    
    # Drop SQB variables, as they are just squares of other vars 
    df.drop([f_ for f_ in df.columns if f_.startswith('SQB') or f_ == 'agesq'], axis=1, inplace=True)
    # Drop id's
    df.drop(['Id'], axis=1, inplace=True)
    # Drop repeated columns
    df.drop(['hhsize', 'female', 'area2'], axis=1, inplace=True)
    return df
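
The household aggregation step in `do_features` can be sketched on a toy table as follows. The rows are invented; only the pattern (aggregate adults per `idhogar`, flatten the column names, join back to every member) mirrors the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical household members; 'idhogar' identifies the household
df = pd.DataFrame({'idhogar': ['a', 'a', 'a', 'b', 'b'],
                   'age': [40, 38, 10, 65, 30],
                   'escolari': [11, 9, 4, 6, 16]})

aggs = {'age': ['min', 'max'], 'escolari': ['mean']}

# Aggregate over adults only, one row per household
df_agg = df.query('age >= 18').groupby('idhogar').agg(aggs).astype(np.float32)
# Flatten the two-level column index: ('age', 'min') -> 'agg18_age_MIN'
df_agg.columns = pd.Index(['agg18_' + e[0] + '_' + e[1].upper()
                           for e in df_agg.columns.tolist()])

# Broadcast the household values back to every member, children included
df = df.join(df_agg, how='left', on='idhogar')
print(df)
```

Note that the ratio features in `do_features` divide by columns that may contain zeros, so `inf` or `NaN` values can appear there; tree models usually cope with that, but it is worth keeping in mind.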

In [None]:
def convert_OHE2LE(df):
    tmp_df = df.copy(deep=True)
    for s_ in ['pared', 'piso', 'techo', 'abastagua', 'sanitario', 'energcocinar', 'elimbasu', 
               'epared', 'etecho', 'eviv', 'estadocivil', 'parentesco', 
               'instlevel', 'lugar', 'tipovivi',
               'manual_elec']:
        if 'manual_' not in s_:
            cols_s_ = [f_ for f_ in df.columns if f_.startswith(s_)]
        elif 'elec' in s_:
            cols_s_ = ['public', 'planpri', 'noelec', 'coopele']
        sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
        #deal with those OHE, where there is a sum over columns == 0
        if 0 in sum_ohe:
            print('The OHE in {} is incomplete. A new column will be added before label encoding'
                  .format(s_))
            # dummy column name to be added
            col_dummy = s_+'_dummy'
            # add the column to the dataframe
            tmp_df[col_dummy] = (tmp_df[cols_s_].sum(axis=1) == 0).astype(np.int8)
            # add the name to the list of columns to be label-encoded
            cols_s_.append(col_dummy)
            # proof-check, that now the category is complete
            sum_ohe = tmp_df[cols_s_].sum(axis=1).unique()
            if 0 in sum_ohe:
                 print("The category completion did not work")
        tmp_cat = tmp_df[cols_s_].idxmax(axis=1)
        tmp_df[s_ + '_LE'] = LabelEncoder().fit_transform(tmp_cat).astype(np.int16)
        if 'parentesco1' in cols_s_:
            cols_s_.remove('parentesco1')
        tmp_df.drop(cols_s_, axis=1, inplace=True)
    return tmp_df
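
Here is a minimal sketch of the OHE-to-label-encoding trick on a made-up group of `wall_*` columns (hypothetical names), including the dummy column for rows where no flag is set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical one-hot group 'wall_*'; the third row has no column set
df = pd.DataFrame({'wall_block': [1, 0, 0, 0],
                   'wall_wood':  [0, 1, 0, 1]})
cols = ['wall_block', 'wall_wood']

# Complete the group with a dummy column for rows that sum to 0
df['wall_dummy'] = (df[cols].sum(axis=1) == 0).astype(np.int8)
cols.append('wall_dummy')

# idxmax returns the name of the column holding the 1 in each row
cat = df[cols].idxmax(axis=1)
df['wall_LE'] = LabelEncoder().fit_transform(cat.tolist()).astype(np.int16)
df = df.drop(cols, axis=1)
print(df['wall_LE'].tolist())  # [0, 2, 1, 2]
```

`LabelEncoder` sorts the column names alphabetically, so the integer codes only carry an order if the original column names do.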

# Read in the data and clean it up

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
train.info()

In [None]:
def process_df(df_):
    # fix categorical features
    encode_data(df_)
    #fill in missing values based on https://www.kaggle.com/mlisovyi/missing-values-in-the-data
    for f_ in ['v2a1', 'v18q1', 'meaneduc', 'SQBmeaned']:
        df_[f_] = df_[f_].fillna(0)
    df_['rez_esc'] = df_['rez_esc'].fillna(-1)
    # do feature engineering and drop useless columns
    return do_features(df_)

train = process_df(train)
test = process_df(test)

In [None]:
train.info()

Note the change in the number of features of each type. What we did was:
- encoded categorical variables appropriately into numerical values;
- dropped a few irrelevant columns;
- added several columns with household aggregates and hand-crafted ratio and subtraction features

Now, let's define a `train_test_apply_func` helper function to apply a custom function to a concatenated train+test dataset

In [None]:
def train_test_apply_func(train_, test_, func_):
    test_['Target'] = 0
    xx = pd.concat([train_, test_])

    xx_func = func_(xx)
    train_ = xx_func.iloc[:train_.shape[0], :]
    test_  = xx_func.iloc[train_.shape[0]:, :].drop('Target', axis=1)

    del xx, xx_func
    return train_, test_
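
A short usage sketch of the concatenate, transform, split pattern. `apply_to_both` is a hypothetical variant of the helper that does not modify its inputs, and `add_region_mean` and the toy columns are invented for illustration.

```python
import pandas as pd

def apply_to_both(train_, test_, func_):
    # Hypothetical variant of train_test_apply_func that leaves its inputs untouched
    n_train = train_.shape[0]
    xx = pd.concat([train_, test_.assign(Target=0)])
    xx_func = func_(xx)
    # Rows keep their order, so the first n_train rows are the training part
    return xx_func.iloc[:n_train, :], xx_func.iloc[n_train:, :].drop('Target', axis=1)

def add_region_mean(df_):
    # Example transformation that needs to see train and test together
    return df_.assign(region_rooms_mean=df_.groupby('region')['rooms'].transform('mean'))

train = pd.DataFrame({'region': [0, 0, 1], 'rooms': [2, 4, 3], 'Target': [1, 4, 2]})
test = pd.DataFrame({'region': [0, 1], 'rooms': [6, 7]})

train_new, test_new = apply_to_both(train, test, add_region_mean)
print(train_new)
print(test_new)
```

The split relies on `func_` preserving the row order and count; a function that sorts or filters rows would break it.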

In [None]:
train, test = train_test_apply_func(train, test, convert_OHE2LE)

In [None]:
train.info()

Compare the number of features with `int64` type to the previous info summary. The difference comes from the conversion of OHE into LE (`convert_OHE2LE` function)

# Geo aggregates

In [None]:
cols_2_ohe = ['eviv_LE', 'etecho_LE', 'epared_LE', 'elimbasu_LE', 
              'energcocinar_LE', 'sanitario_LE', 'manual_elec_LE',
              'pared_LE']
cols_nums = ['age', 'meaneduc', 'dependency', 
             'hogar_nin', 'hogar_adul', 'hogar_mayor', 'hogar_total',
             'bedrooms', 'overcrowding']

def convert_geo2aggs(df_):
    tmp_df = pd.concat([df_[(['lugar_LE', 'idhogar']+cols_nums)],
                        pd.get_dummies(df_[cols_2_ohe], 
                                       columns=cols_2_ohe)],axis=1)
    geo_agg = tmp_df.groupby(['lugar_LE','idhogar']).mean().groupby('lugar_LE').mean().astype(np.float32)
    geo_agg.columns = pd.Index(['geo_' + e + '_MEAN' for e in geo_agg.columns.tolist()])
    
    del tmp_df
    return df_.join(geo_agg, how='left', on='lugar_LE')

train, test = train_test_apply_func(train, test, convert_geo2aggs)
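
The two-stage `groupby` in `convert_geo2aggs` first averages within each household and only then across households of a region. A toy sketch (invented rows) of why this differs from a plain mean over individuals:

```python
import pandas as pd

# Hypothetical individuals: region 0 has one large and one small household
df = pd.DataFrame({'lugar_LE': [0, 0, 0, 0, 1],
                   'idhogar':  ['a', 'a', 'a', 'b', 'c'],
                   'bedrooms': [1, 1, 1, 3, 2]})

# Naive mean over individuals: household 'a' is counted three times
naive = df.groupby('lugar_LE')[['bedrooms']].mean()

# Two-stage mean: one value per household, then the average over households
two_stage = (df.groupby(['lugar_LE', 'idhogar'])[['bedrooms']].mean()
               .groupby('lugar_LE').mean())

print(naive)
print(two_stage)
```

In region 0 the naive mean is 1.5 while the household-level mean is 2.0, so each household gets the same say regardless of its size.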

In [None]:
train.info()

# VERY IMPORTANT
> Note that ONLY the heads of household are used in scoring. All household members are included in test + the sample submission, but only heads of households are scored.

In [None]:
X = train.query('parentesco1==1')
#X = train

# pull out the target variable
y = X['Target'] - 1
X = X.drop(['Target'], axis=1)

In [None]:
cols_2_drop = ['abastagua_LE', 'agg18_estadocivil1_MEAN', 'agg18_instlevel6_MEAN', 'agg18_parentesco10_MEAN', 'agg18_parentesco11_MEAN', 'agg18_parentesco12_MEAN', 'agg18_parentesco4_MEAN', 'agg18_parentesco5_MEAN', 'agg18_parentesco6_MEAN', 'agg18_parentesco7_MEAN', 'agg18_parentesco8_MEAN', 'agg18_parentesco9_MEAN', 'fe_people_not_living', 'fe_people_weird_stat', 'geo_elimbasu_LE_3_MEAN', 'geo_elimbasu_LE_4_MEAN', 'geo_energcocinar_LE_0_MEAN', 'geo_energcocinar_LE_1_MEAN', 'geo_energcocinar_LE_2_MEAN', 'geo_epared_LE_0_MEAN', 'geo_epared_LE_2_MEAN', 'geo_etecho_LE_2_MEAN', 'geo_eviv_LE_0_MEAN', 'geo_hogar_mayor_MEAN', 'geo_hogar_nin_MEAN', 'geo_manual_elec_LE_1_MEAN', 'geo_manual_elec_LE_2_MEAN', 'geo_manual_elec_LE_3_MEAN', 'geo_pared_LE_0_MEAN', 'geo_pared_LE_1_MEAN', 'geo_pared_LE_3_MEAN', 'geo_pared_LE_4_MEAN', 'geo_pared_LE_5_MEAN', 'geo_pared_LE_6_MEAN', 'geo_pared_LE_7_MEAN', 'hacapo', 'hacdor', 'mobilephone', 'parentesco1', 'parentesco_LE', 'rez_esc', 'techo_LE', 'v14a', 'v18q']
#cols_2_drop = ['agg18_estadocivil1_MEAN', 'agg18_parentesco10_MEAN', 'agg18_parentesco11_MEAN', 'agg18_parentesco12_MEAN', 'agg18_parentesco4_MEAN', 'agg18_parentesco6_MEAN', 'agg18_parentesco7_MEAN', 'agg18_parentesco8_MEAN', 'fe_people_weird_stat', 'hacapo', 'hacdor', 'mobilephone', 'parentesco1', 'parentesco_LE', 'rez_esc', 'v14a']
#cols_2_drop=[]

X.drop((cols_2_drop+['idhogar']), axis=1, inplace=True)
test.drop((cols_2_drop+['idhogar']), axis=1, inplace=True)

## Let's look at the features most correlated with `Target`

In [None]:
XY = pd.concat([X,y], axis=1)
max_corr = XY.corr()['Target'].loc[lambda x: abs(x)>0.2].index
#min_corr = XY.corr()['Target'].loc[lambda x: abs(x)<0.05].index

In [None]:
_ = plt.figure(figsize=(10,7))
_ = sns.heatmap(XY[max_corr].corr(), vmin=-0.5, vmax=0.5, cmap='coolwarm')
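
A self-contained sketch of the correlation filter used above, on synthetic data (the `strong` and `noise` columns are made up). Keep in mind that this is a Pearson correlation with an ordinal target, so it only picks up roughly linear relations.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
target = pd.Series(rng.randint(0, 4, size=200), name='Target')

# One feature that follows the target closely and one that is pure noise
XY = pd.DataFrame({'strong': target + rng.normal(scale=0.5, size=200),
                   'noise': rng.normal(size=200),
                   'Target': target})

corr_with_target = XY.corr()['Target']
max_corr = corr_with_target.loc[lambda x: abs(x) > 0.2].index
print(corr_with_target)
print(max_corr.tolist())
```

`Target` itself always passes the cut (its correlation with itself is 1), which is why it shows up in the heatmap as well.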

# Model fitting

We will use the LightGBM classifier, as LightGBM allows building very sophisticated models with a very short training time.

## Use test subset for early stopping criterion

This allows us to avoid overtraining and we do not need to optimise the number of trees. We also use F1 macro-averaged score to decide when to stop


In [None]:
from sklearn.metrics import f1_score
def evaluate_macroF1_lgb(truth, predictions):  
    # this follows the discussion in https://github.com/Microsoft/LightGBM/issues/1483
    pred_labels = predictions.reshape(len(np.unique(truth)),-1).argmax(axis=0)
    f1 = f1_score(truth, pred_labels, average='macro')
    return ('macroF1', f1, True) 

def learning_rate_power_0997(current_iter):
    base_learning_rate = 0.1
    min_learning_rate = 0.02
    # exponential decay by a factor of 0.99 per iteration (despite the 0997 in the function name)
    lr = base_learning_rate  * np.power(.99, current_iter)
    return max(lr, min_learning_rate)

import lightgbm as lgb
# "eval_metric" was listed twice in this dict; only the last entry (the custom macro F1) took effect
fit_params={"early_stopping_rounds":300, 
            "eval_metric" : evaluate_macroF1_lgb, 
            #"eval_set" : [(X_train,y_train), (X_test,y_test)],
            'eval_names': ['train', 'early_stop'],
            'callbacks': [lgb.reset_parameter(learning_rate=learning_rate_power_0997)],
            'verbose': False,
            'categorical_feature': 'auto'}

#fit_params['verbose'] = 200
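
To see what `evaluate_macroF1_lgb` does with its input, here is a sketch on hand-made numbers. It assumes the flattened, class-by-class layout of predictions described in the linked GitHub issue; other LightGBM versions may hand a 2-D array to custom metrics instead, so this is worth checking against the installed version. The second part evaluates the same decay schedule as `learning_rate_power_0997` under a hypothetical name.

```python
import numpy as np
from sklearn.metrics import f1_score

truth = np.array([0, 1, 2, 1])
# Hypothetical class probabilities, one row per sample
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])
# Assumed layout: all class-0 scores first, then class 1, then class 2
predictions = proba.T.ravel()

pred_labels = predictions.reshape(len(np.unique(truth)), -1).argmax(axis=0)
f1 = f1_score(truth, pred_labels, average='macro')
print(pred_labels, f1)

def decayed_learning_rate(current_iter, base=0.1, floor=0.02, decay=0.99):
    # Exponential decay with a lower bound, as in learning_rate_power_0997
    return max(base * np.power(decay, current_iter), floor)

print([round(decayed_learning_rate(i), 4) for i in (0, 50, 100, 161, 1000)])
```

With these defaults the schedule reaches its floor of 0.02 after 161 iterations and stays there.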

# LightGBM optimal parameters

The parameters are optimised with a random search in this kernel: https://www.kaggle.com/mlisovyi/lighgbm-hyperoptimisation-with-f1-macro


In [None]:
#v8
#opt_parameters = {'colsample_bytree': 0.93, 'min_child_samples': 56, 'num_leaves': 19, 'subsample': 0.84}
#v9
#opt_parameters = {'colsample_bytree': 0.89, 'min_child_samples': 70, 'num_leaves': 17, 'subsample': 0.96}
#v14
#opt_parameters = {'colsample_bytree': 0.88, 'min_child_samples': 90, 'num_leaves': 16, 'subsample': 0.94}
#v17
opt_parameters = {'colsample_bytree': 0.89, 'min_child_samples': 90, 'num_leaves': 14, 'subsample': 0.96}

# Fit a voting classifier
Define a derived VotingClassifier class that uses pre-fitted classifiers

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.validation import has_fit_parameter, check_is_fitted

class VotingPrefitClassifier(VotingClassifier):
    '''
    This implements the VotingClassifier with prefitted classifiers
    '''
    def fit(self, X, y, sample_weight=None, **fit_params):
        self.estimators_ = [x[1] for x in self.estimators]
        self.le_ = LabelEncoder().fit(y)
        self.classes_ = self.le_.classes_
        
        return self    
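
The difference between soft and hard voting, and the effect of the weights, can be sketched in plain NumPy. The probabilities and per-model scores are invented, and `soft_vote` / `hard_vote` are hypothetical helpers that only mimic the idea, not the sklearn implementation itself.

```python
import numpy as np

# Hypothetical predicted probabilities: 3 models x 2 samples x 2 classes
probas = np.array([[[0.90, 0.10], [0.2, 0.8]],
                   [[0.40, 0.60], [0.3, 0.7]],
                   [[0.45, 0.55], [0.1, 0.9]]])

def soft_vote(probas, weights=None):
    # Average the probabilities over models, then pick the most likely class
    return np.average(probas, axis=0, weights=weights).argmax(axis=1)

def hard_vote(probas, weights=None):
    # Each model votes for its own argmax; votes are summed with the weights
    n_classes = probas.shape[2]
    votes = probas.argmax(axis=2)  # shape: (n_models, n_samples)
    return np.array([np.bincount(votes[:, i], weights=weights,
                                 minlength=n_classes).argmax()
                     for i in range(votes.shape[1])])

scores = np.array([0.45, 0.20, 0.20])  # hypothetical per-model F1 scores
w1 = scores / scores.sum()             # linear weights
w2 = scores**2 / np.sum(scores**2)     # squared weights favour the best model more

print('soft, no weights:', soft_vote(probas))
print('hard, no weights:', hard_vote(probas))
print('hard, w1        :', hard_vote(probas, w1))
print('hard, w2        :', hard_vote(probas, w2))
```

For the first sample one confident model outweighs two lukewarm ones in soft voting, while unweighted hard voting follows the majority; weighting by score can flip the hard vote back.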

In [None]:
from sklearn.model_selection import StratifiedKFold

def train_lgbm_model(X_, y_, random_state_=None, opt_parameters_={}, fit_params_={}):
    clf  = lgb.LGBMClassifier(max_depth=-1, learning_rate=0.1, objective='multiclass',
                             random_state=random_state_, silent=True, metric='None', 
                             n_jobs=4, n_estimators=5000, class_weight='balanced')
    clf.set_params(**opt_parameters_)
    return clf.fit(X_, y_, **fit_params_)

# the list of classifiers for the voting ensemble
clfs = []

# nested CV parameters
inner_seed = 31416
inner_n = 10
outer_seed = 314
outer_n = 10

# performance 
perf_eval = {'f1_oof': [],
             'f1_ave': [],
             'f1_std': [],
             'f1_early_stop_ave': [],
             'f1_early_stop': [],
             'f1_early_stop_vc_w0_soft': [],
             'f1_early_stop_vc_w0_hard': [],
             'f1_early_stop_vc_w1_soft': [],
             'f1_early_stop_vc_w1_hard': [],
             'f1_early_stop_vc_w2_soft': [],
             'f1_early_stop_vc_w2_hard': []
            }
# full-sample oof prediction
y_full_oof = pd.Series(np.zeros(shape=(X.shape[0],)), 
                      index=X.index)

outer_cv = StratifiedKFold(outer_n, shuffle=True, random_state=outer_seed)
for n_outer_fold, (outer_trn_idx, outer_val_idx) in enumerate(outer_cv.split(X,y)):
    print('--- Outer loop iteration: {} ---'.format(n_outer_fold))
    X_out, y_out = X.iloc[outer_trn_idx], y.iloc[outer_trn_idx]
    X_stp, y_stp = X.iloc[outer_val_idx], y.iloc[outer_val_idx]
    
    inner_cv = StratifiedKFold(inner_n, shuffle=True, random_state=inner_seed+n_outer_fold)
    # The out-of-fold (oof) prediction for the training part (k-1 folds) of the outer CV loop
    y_outer_oof = pd.Series(np.zeros(shape=(X_out.shape[0],)), 
                      index=X_out.index)
    f1_scores_inner = []
    clfs_inner = []
    
    for n_inner_fold, (inner_trn_idx, inner_val_idx) in enumerate(inner_cv.split(X_out,y_out)):
        X_trn, y_trn = X_out.iloc[inner_trn_idx], y_out.iloc[inner_trn_idx]
        X_val, y_val = X_out.iloc[inner_val_idx], y_out.iloc[inner_val_idx]
        
        # use _stp data for early stopping
        fit_params["eval_set"] = [(X_trn,y_trn), (X_stp,y_stp)]
        fit_params['verbose'] = False
        
        clf = train_lgbm_model(X_trn, y_trn, 314+n_inner_fold, opt_parameters, fit_params)
        
        # name each inner model by its inner fold, so that the names are unique
        clfs_inner.append(('lgbm{}_inner'.format(n_inner_fold), clf))
        # evaluate performance
        y_outer_oof.iloc[inner_val_idx] = clf.predict(X_val)        
        f1_scores_inner.append(f1_score(y_val, y_outer_oof.iloc[inner_val_idx], average='macro'))
        #cleanup
        del X_trn, y_trn, X_val, y_val
    # Do the predictions for early-stop sub-sample for comparison with VotingPrefitClassifier
    f1_score_inner_early_stop=[f1_score(y_stp, clf_.predict(X_stp), average='macro')
                               for _,clf_ in clfs_inner]
    
    # Store performance info for this outer fold
    perf_eval['f1_oof'].append(f1_score(y_out, y_outer_oof, average='macro'))
    perf_eval['f1_ave'].append(np.array(f1_scores_inner).mean())
    perf_eval['f1_std'].append(np.array(f1_scores_inner).std())
    perf_eval['f1_early_stop_ave'].append(np.mean(f1_score_inner_early_stop))
    # Record performance of Voting classifiers
    w = np.array(f1_scores_inner)
    for w_, w_name_ in [(None, '_w0'),
                        (w/w.sum(), '_w1'),
                        ((w**2)/np.sum(w**2), '_w2')
                       ]:
        vc = VotingPrefitClassifier(clfs_inner, weights=w_).fit(X_stp, y_stp)
        vc.voting = 'soft'
        perf_eval['f1_early_stop_vc{}_soft'.format(w_name_)].append(f1_score(y_stp, vc.predict(X_stp), average='macro'))
        vc.voting = 'hard'
        perf_eval['f1_early_stop_vc{}_hard'.format(w_name_)].append(f1_score(y_stp, vc.predict(X_stp), average='macro'))
    
    # Train main model for the voting average
    fit_params["eval_set"] = [(X_out,y_out), (X_stp,y_stp)]
    fit_params['verbose'] = 200
    print('Fit the final model on the outer loop iteration: ')
    clf = train_lgbm_model(X_out, y_out, 314+n_outer_fold, opt_parameters, fit_params)
    perf_eval['f1_early_stop'].append(f1_score(y_stp, clf.predict(X_stp), average='macro'))
    clfs.append(('lgbm{}'.format(n_outer_fold), clf))
    y_full_oof.iloc[outer_val_idx] = clf.predict(X_stp)
    # cleanup
    del inner_cv, X_out, y_out, X_stp, y_stp, clfs_inner
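
Stripped of the model fitting, the index bookkeeping of the nested loop looks roughly like this. It is a sketch on synthetic data with fewer folds than above; no model is trained, only the splits are checked.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 40 samples, two balanced classes
X = np.arange(40).reshape(-1, 1)
y = np.repeat([0, 1], 20)

outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=314)
outer_val_seen = []
n_models = 0
leak_found = False

for n_outer, (outer_trn_idx, outer_val_idx) in enumerate(outer_cv.split(X, y)):
    inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=31416 + n_outer)
    for inner_trn_idx, inner_val_idx in inner_cv.split(X[outer_trn_idx], y[outer_trn_idx]):
        # Inner indices are positions within the outer training part
        trn_rows = outer_trn_idx[inner_trn_idx]
        val_rows = outer_trn_idx[inner_val_idx]
        # The outer validation fold (used for early stopping) must stay unseen
        if np.intersect1d(np.concatenate([trn_rows, val_rows]), outer_val_idx).size:
            leak_found = True
        n_models += 1
    outer_val_seen.extend(outer_val_idx.tolist())

print('inner models:', n_models, '| leak found:', leak_found)
```

Each sample lands in exactly one outer validation fold, which is what makes the full-sample out-of-fold prediction (`y_full_oof`) possible.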

In [None]:
w = np.array(perf_eval['f1_early_stop'])
ws = [(None, '_w0'),
      (w/w.sum(), '_w1'),
      ((w**2)/np.sum(w**2), '_w2')
     ]
vc = {}
for w_, w_name_ in ws:
    vc['vc{}'.format(w_name_)] = VotingPrefitClassifier(clfs, weights=w_).fit(X, y)

clf_final = clfs[0][1]

In [None]:
global_score = np.mean(perf_eval['f1_oof'])
global_score_std = np.std(perf_eval['f1_oof'])

print('Mean validation score LGBM Classifier: {:.4f}'.format(global_score))
print('Std  validation score LGBM Classifier: {:.4f}'.format(global_score_std))
print('EarlyStop OOF score LGBM Classifier: {:.4f}'.format(f1_score(y, y_full_oof, average='macro')))
print('EarlyStop mean score LGBM Classifier: {:.4f}'.format(np.mean(perf_eval['f1_early_stop_ave'])))
print('EarlyStop VotingPrefit SOFT: {:.4f}'.format(np.mean(perf_eval['f1_early_stop_vc_w0_soft'])))
print('EarlyStop VotingPrefit HARD: {:.4f}'.format(np.mean(perf_eval['f1_early_stop_vc_w0_hard'])))

Look at the performance on individual folds:

In [None]:
perf_eval_df = pd.DataFrame(perf_eval)
perf_eval_df

# F1 score across different classes
Let's see if all classes show similar performance

In [None]:
from sklearn.metrics import precision_score, recall_score, classification_report

In [None]:
#print(classification_report(y_test, clf_final.predict(X_test)))

In [None]:
#vc.voting = 'hard'
#print(classification_report(y_test, vc.predict(X_test)))

In [None]:
#vc.voting = 'soft'
#print(classification_report(y_test, vc.predict(X_test)))

# Plot feature importances (using gain)
See whether the added features appear among the most significant ones

In [None]:
def display_importances(feature_importance_df_, doWorst=False, n_feat=50):
    # Plot feature importances
    if not doWorst:
        cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(
            by="importance", ascending=False)[:n_feat].index        
    else:
        cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(
            by="importance", ascending=False)[-n_feat:].index
    
    mean_imp = feature_importance_df_[["feature", "importance"]].groupby("feature").mean()
    df_2_neglect = mean_imp[mean_imp['importance'] < 1e-3]
    print('The list of features with 0 importance: ')
    print(df_2_neglect.index.values.tolist())
    del mean_imp, df_2_neglect
    
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    
    plt.figure(figsize=(8,10))
    sns.barplot(x="importance", y="feature", 
                data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features')
    plt.tight_layout()
    #plt.savefig('lgbm_importances.png')
    
importance_df = pd.DataFrame()
importance_df["feature"] = X.columns.tolist()      
importance_df["importance"] = clf_final.booster_.feature_importance('gain')
display_importances(feature_importance_df_=importance_df, n_feat=20)

In [None]:
#display_importances(feature_importance_df_=importance_df, doWorst=True, n_feat=20)

# Plot feature importances (using SHAP)
See whether the added features appear among the most significant ones

In [None]:
import shap
shap_values = shap.TreeExplainer(clf_final.booster_).shap_values(X)

#shap_df = pd.DataFrame()
#shap_df["feature"] = X_train.columns.tolist()    
#shap_df["importance"] = np.sum(np.abs(shap_values), 0)[:-1]

In [None]:
#display_importances(feature_importance_df_=shap_df, n_feat=20)

In [None]:
shap.summary_plot(shap_values, X, plot_type='bar')

# Prepare submission

In [None]:
y_subm = pd.read_csv('../input/sample_submission.csv')

In [None]:
from datetime import datetime
now = datetime.now()

sub_file = 'submission_LGB_{:.4f}_{}.csv'.format(global_score, str(now.strftime('%Y-%m-%d-%H-%M')))
y_subm['Target'] = clf_final.predict(test) + 1
y_subm.to_csv(sub_file, index=False)

# Store predictions with voting classifiers
for vc_name_,vc_ in vc.items():
    for vc_type_ in ['soft', 'hard']:
        vc_.voting = vc_type_
        name = '{}_{}'.format(vc_name_, vc_type_)
        y_subm_vc = y_subm.copy(deep=True)
        y_subm_vc.loc[:,'Target'] = vc_.predict(test) + 1
        sub_file = 'submission_{}_LGB_{:.4f}_{}.csv'.format(name, 
                                                            global_score, 
                                                            str(now.strftime('%Y-%m-%d-%H-%M'))
                                                           )
        y_subm_vc.to_csv(sub_file, index=False)

In [None]:
!ls