## Cross Validation Test for LightGBM
### New Cross Validation Scheme
- Just running cross validation on the whole training data and then averaging the fold predictions on the test set to submit to the leaderboard has a few flaws:
    - potential overfitting to the leaderboard, since we are using it for model validation
    - potential data leakage in each fold

- To do proper model validation, it is good practice to keep an extra holdout set for testing the model predictions. If the holdout performance is not too different from the CV performance, we can be more confident that the model generalises to new data.
- The disadvantage is that less data is left for training the model (this does not work well with a small dataset).
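
The holdout check described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the data is synthetic, the names `X_demo`, `y_demo`, `X_hold` and `y_hold` are made up for the example, and a plain `LogisticRegression` stands in for LightGBM to keep the sketch small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (assumption): rows are treated as already time-ordered.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Keep the last 15% of rows as a holdout; shuffle=False preserves the row order.
X_cv, X_hold, y_cv, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.15, shuffle=False
)

cv_aucs = []
hold_pred = np.zeros(len(X_hold))
kf = KFold(n_splits=5, shuffle=False)
for trn_idx, val_idx in kf.split(X_cv):
    # LogisticRegression is a swapped-in placeholder model, not LightGBM.
    model = LogisticRegression().fit(X_cv[trn_idx], y_cv[trn_idx])
    val_pred = model.predict_proba(X_cv[val_idx])[:, 1]
    cv_aucs.append(roc_auc_score(y_cv[val_idx], val_pred))
    # Average the holdout predictions over the folds.
    hold_pred += model.predict_proba(X_hold)[:, 1] / kf.get_n_splits()

cv_auc = float(np.mean(cv_aucs))
holdout_auc = roc_auc_score(y_hold, hold_pred)
print(f"CV AUC: {cv_auc:.4f}  holdout AUC: {holdout_auc:.4f}")
```

If the two numbers are close, the CV score is probably a reasonable proxy for performance on unseen data; a large gap would suggest leakage or overfitting to the folds.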

In [9]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import datetime
import missingno as msno
import lightgbm as lgb
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, TimeSeriesSplit, train_test_split,StratifiedKFold
import gc
from statistics import mean


### Setup Cross Validation
1. Divide the train set into subsets (cross validation folds + a holdout set, separate from the leaderboard test set)
2. Define the validation metric (in our case ROC-AUC)
3. Stop training when the validation metric stops improving
4. Take the average of each fold's predictions for the local test set.

* Make sure to set shuffle=False so that each fold stays a contiguous block of the time-sorted rows
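
To make the `shuffle=False` point and the fold-averaging step concrete, here is a small sketch on toy data. The ten-row array and the two hypothetical per-fold prediction vectors are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten rows standing in for transactions already sorted by time (illustrative).
rows = np.arange(10)

folds_demo = KFold(n_splits=5, shuffle=False)
val_blocks = [val_idx.tolist() for _, val_idx in folds_demo.split(rows)]
print(val_blocks)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# Step 4: average the per-fold predictions for the same test rows.
# These two vectors are hypothetical outputs from two fold models.
fold_preds = [np.array([0.25, 1.0]), np.array([0.75, 0.5])]
avg_pred = np.mean(fold_preds, axis=0)
print(avg_pred)
```

Without shuffling, every validation fold is a contiguous block of rows, so neighbouring transactions in time are not scattered across the training and validation sides of the same fold.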

In [10]:
train_full = pd.read_pickle('data/train_full.pkl')
test_full = pd.read_pickle('data/test_full.pkl')

train_full=train_full.sort_values('TransactionDT',ascending=True).reset_index(drop=True)


In [11]:
# Label encoding: fit each encoder on train + test combined so both share one mapping
for f in test_full.columns:
    if train_full[f].dtype=='object' or test_full[f].dtype=='object': 
        train_full[f] = train_full[f].fillna('unseen_before_label')
        test_full[f]  = test_full[f].fillna('unseen_before_label')
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_full[f].values) + list(test_full[f].values))
        train_full[f] = lbl.transform(list(train_full[f].values))
        test_full[f] = lbl.transform(list(test_full[f].values)) 
        
        
train_full = train_full.fillna(-999)
test_full = test_full.fillna(-999)

In [12]:
rm_cols = [
    'TransactionID','TransactionDT', 
    'isFraud'                         
]

# Final features
features_columns = [col for col in list(train_full.columns) if col not in rm_cols]

In [13]:
X = train_full[features_columns]
y = train_full['isFraud']

# # Split holdout as 15% of the train set - probably don't need this as the holdout performance is very similar anyway; let's just rely on the CV and the LB
# X, X_holdout, y, y_holdout = train_test_split(train_full[features_columns], train_full['isFraud'], 
#                                               test_size=0.15, random_state=42, shuffle=False)
# #                                               stratify = train_full['isFraud'])

del train_full
gc.collect()

288

In [14]:
params = {
                    'objective':'binary',
                    'boosting_type':'gbdt',
                    'metric':'auc',
                    'n_jobs':-1,
                    'learning_rate':0.01, # a higher rate (e.g. 0.05) trains faster but might sacrifice a bit of accuracy
                    'num_leaves':2**8, # reduce the number of leaves if overfitting
                    'max_depth': -1, # -1 means unconstrained; consider constraining it to limit overfitting
                    'tree_learner':'serial',
                    'colsample_bytree': 0.7,
                    'subsample_freq':1,
                    'subsample':0.7,
                    'n_estimators':800,
                    'max_bin':255, # fewer bins if overfitting
                    'verbose':-1,
                    'seed': 2019,
                    'early_stopping_rounds':100,
#                     'lambda_l1':5,
#                     'lambda_l2':5,
                } 

# params = {
#                     'objective':'binary',
#                     'boosting_type':'gbdt',
#                     'metric':'auc',
#                     'n_jobs':-1,
#                     'learning_rate':0.05, # speed up the learning rate a bit - might sacrifice a bit of accuracy
#                     'num_leaves':2**8, # reduce number of leaves to reduce overfitting
#                     'max_depth': 8, # max_depth should be constrained, -1 would mean unconstrained
#                     'tree_learner':'serial',
#                     'colsample_bytree': 0.7,
#                     'subsample_freq':1,
#                     'subsample':0.7,
#                     'n_estimators':1000,
#                     'max_bin':255, # less bins if overfitting
#                     'verbose':-1,
#                     'seed': 2019,
#                     'early_stopping_rounds':100,
# #                     'lambda_l1':5,
# #                     'lambda_l2':5,
#                 } 

In [15]:
NFOLDS = 5
# random_state has no effect when shuffle=False, and recent scikit-learn versions raise an error if it is set
# folds = StratifiedKFold(n_splits=NFOLDS, shuffle=False) # split by stratified folds
folds = KFold(n_splits=NFOLDS, shuffle=False) # plain K-fold: contiguous blocks, no shuffling
# folds = TimeSeriesSplit(n_splits=NFOLDS) # split by time - try timeseries split, perhaps less overfitting? result: worse overfitting

aucs = []
clfs=[]
pred_len = len(test_full)
prediction = np.zeros(pred_len)

for fold, (trn_idx, test_idx) in enumerate(folds.split(X,y)):
    print('Training on fold {}'.format(fold + 1))
    
    trn_data = lgb.Dataset(data=X.iloc[trn_idx], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(data=X.iloc[test_idx], label=y.iloc[test_idx])
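    # Note: verbose_eval was removed in LightGBM 4.0; on newer versions
    # pass callbacks=[lgb.log_evaluation(200)] instead.
    clf = lgb.train(params, 
                    trn_data, 
                    valid_sets = [trn_data, val_data], 
                    verbose_eval=200)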
    
    print('AUC for validation fold {}: {}'.format(fold+1, clf.best_score['valid_1']['auc']))
    aucs.append(clf.best_score['valid_1']['auc'])
    clfs.append(clf)  # keep each fold's model, e.g. for per-fold feature importance
    
#     holdout_pred = clf.predict(X_holdout)
#     print('AUC for holdout set - fold ', roc_auc_score(y_holdout, holdout_pred))
    
    prediction += clf.predict(test_full[features_columns])

print("Cross Validation AUC: ", sum(aucs)/NFOLDS)
final_predictions = prediction/NFOLDS

Training on fold 1




Training until validation scores don't improve for 100 rounds.
[200]	training's auc: 0.951756	valid_1's auc: 0.886563
[400]	training's auc: 0.979561	valid_1's auc: 0.904942
[600]	training's auc: 0.989616	valid_1's auc: 0.913059
[800]	training's auc: 0.994189	valid_1's auc: 0.916755
Did not meet early stopping. Best iteration is:
[800]	training's auc: 0.994189	valid_1's auc: 0.916755
AUC for validation fold 1: 0.9167548469992037
Training on fold 2
Training until validation scores don't improve for 100 rounds.
[200]	training's auc: 0.951624	valid_1's auc: 0.905164
[400]	training's auc: 0.98051	valid_1's auc: 0.923355
[600]	training's auc: 0.990884	valid_1's auc: 0.930018
[800]	training's auc: 0.995133	valid_1's auc: 0.933007
Did not meet early stopping. Best iteration is:
[800]	training's auc: 0.995133	valid_1's auc: 0.933007
AUC for validation fold 2: 0.9330067559668452
Training on fold 3
Training until validation scores don't improve for 100 rounds.
[200]	training's auc: 0.954347	valid

The average AUC for the time-series split is much lower, and the LB score is lower too. It might not be the correct CV scheme for this time series either. I still need to look for a good CV scheme.

KFold Performance: CV: 0.9317; LB: 0.9417
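
For reference, the sketch below shows how scikit-learn's `TimeSeriesSplit` arranges the folds mentioned above. The twelve-row array is a toy stand-in for the time-sorted training data, used only to show the split pattern.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve rows standing in for time-sorted transactions (illustrative only).
ts_rows = np.arange(12)

ts_folds = TimeSeriesSplit(n_splits=5)
ts_splits = list(ts_folds.split(ts_rows))
for fold, (trn_idx, val_idx) in enumerate(ts_splits, start=1):
    # Training rows always come strictly before the validation rows.
    print(f"fold {fold}: train 0..{trn_idx[-1]} ({len(trn_idx)} rows), "
          f"validate {val_idx.tolist()}")
```

Each fold trains only on rows that come before its validation block, so the early folds see much less training data than any KFold fold does. That may be one reason the average AUC comes out lower, although the notebook does not confirm this.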

In [None]:
fig, ax = plt.subplots(figsize=(15, 20))
lgb.plot_importance(clf,max_num_features=50,ax=ax)
# for i in range(NFOLDS):
#     fig, ax = plt.subplots(figsize=(15, 20))
#     xgb.plot_importance(clfs[i],max_num_features=50,ax=ax)

In [None]:
sample_submission = pd.read_csv('data/sample_submission.csv', index_col='TransactionID')
sample_submission['isFraud'] = final_predictions  # fold-averaged predictions (prediction holds the un-normalised sum)
sample_submission.to_csv('data/lightgbm_cv_kfold_noholdout.csv')