In this competition, there are many discussions and notebooks on hyper-parameter tuning, XGBoost and stacking, since these are the most effective techniques.

Yet, something is missing. For instance, little thought has been given to the best way to cross-validate.

Actually, if you look carefully at the data, the distributions seem to be segregated into specific portions of the feature space, which reminds me of the Madelon dataset created by Isabelle Guyon, one of the inventors of Support Vector Machines (for Guyon's contribution, see https://www.kdnuggets.com/2016/07/guyon-data-mining-history-svm-support-vector-machines.html; for the Madelon dataset, see https://archive.ics.uci.edu/ml/datasets/madelon).

I therefore tried to stratify my folds based on a k-means clustering of the non-noisy data (see https://www.kaggle.com/c/30-days-of-ml/discussion/267931). As a result, my local CV has become more reliable (highly correlated with the public leaderboard) and my models are performing much better on their CV predictions.

Try it and let me know if it also works on your models!

Happy Kaggling!

In [1]:
# Importing core libraries
import numpy as np
import pandas as pd
import joblib

# Importing from Scikit-Learn
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.preprocessing import OrdinalEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [2]:
# Loading data 
X_train = pd.read_csv("../input/30-days-of-ml/train.csv")
X_test = pd.read_csv("../input/30-days-of-ml/test.csv")

In [3]:
# Preparing data as a tabular matrix
y_train = X_train.target
X_train = X_train.set_index('id').drop('target', axis='columns')
X_test = X_test.set_index('id')

In [4]:
# Pointing out categorical features
categoricals = [item for item in X_train.columns if 'cat' in item]

In [5]:
# Dealing with categorical data using get_dummies
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
dummies = pd.get_dummies(pd.concat([X_train, X_test])[categoricals])
X_train[dummies.columns] = dummies.iloc[:len(X_train), :]
X_test[dummies.columns] = dummies.iloc[len(X_train): , :]
del dummies

In [6]:
# Dealing with categorical data using OrdinalEncoder (only when there are 3 or more levels)
ordinal_encoder = OrdinalEncoder()
X_train[categoricals[3:]] = ordinal_encoder.fit_transform(X_train[categoricals[3:]]).astype(int)
X_test[categoricals[3:]] = ordinal_encoder.transform(X_test[categoricals[3:]]).astype(int)
X_train = X_train.drop(categoricals[:3], axis="columns")
X_test = X_test.drop(categoricals[:3], axis="columns")

In [7]:
# Feature selection (https://www.kaggle.com/lucamassaron/tutorial-feature-selection-with-boruta-shap)
important_features = ['cat1_A', 'cat1_B', 'cat5', 'cat8', 'cat8_C', 'cat8_E', 'cont0', 
                      'cont1', 'cont10', 'cont11', 'cont12', 'cont13', 'cont2', 'cont3', 
                      'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9']

categoricals = ['cat5', 'cat8']

X_train = X_train[important_features]
X_test = X_test[important_features]

In [8]:
# Stratifying the data

pca = PCA(n_components=16, random_state=0)
km = KMeans(n_clusters=32, random_state=0)

pca.fit(X_train)
km.fit(pca.transform(X_train))

print(np.unique(km.labels_, return_counts=True))

y_stratified = km.labels_

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
      dtype=int32), array([ 8885, 23161, 18215, 13810, 13461, 18955,  8334,  8874,  3903,
       11152,  8617,  3471,  8492, 10460,  2726,  3428, 12002,  8632,
        7151,  7087, 14235, 10591,  2337, 12269,  3762, 10066,  1579,
       14104, 14622,  2005,  7083,  6531]))


In [9]:
# Creating your folds for repeated use (for instance, stacking)

folds = 10
seeds = [42, 0, 101]
fold_idxs = list()

for seed in seeds:
    skf = StratifiedKFold(n_splits=folds,
                          shuffle=True, 
                          random_state=seed)

    fold_idxs.append(list(skf.split(X_train, y_stratified)))
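
A quick way to see what stratifying on cluster labels buys you is to compare the share of each cluster inside every validation fold with its share in the whole dataset. The snippet below is only a minimal sketch on synthetic data: `X_demo`, `labels_demo` and the 8-cluster / 5-fold setup are illustrative assumptions, not the competition data or the settings used above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition features (assumption, not the real data)
X_demo, _ = make_blobs(n_samples=2000, centers=8, n_features=5, random_state=0)

# Cluster labels play the same role as y_stratified in the notebook
labels_demo = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_demo)

# Share of each cluster in the whole dataset
overall_share = np.bincount(labels_demo, minlength=8) / len(labels_demo)

# Stratify the folds on the cluster labels and track the worst mismatch
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
max_deviation = 0.0
for _, val_idx in skf_demo.split(X_demo, labels_demo):
    fold_share = np.bincount(labels_demo[val_idx], minlength=8) / len(val_idx)
    max_deviation = max(max_deviation, float(np.abs(fold_share - overall_share).max()))

print(f"largest gap between fold and overall cluster share: {max_deviation:.4f}")
```

If the gap stays close to zero, every validation fold covers the same regions of the feature space in roughly the same proportions as the full training set.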

In [10]:
# Checking the produced folds
for j, fold_idxs_ in enumerate(fold_idxs):
    print(f"\n--- seed={seeds[j]} ---")
    for k, (train_idx, validation_idx) in enumerate(fold_idxs_):
        print(f"fold {k} train idxs: {len(train_idx)} validation idxs: {len(validation_idx)} -> {validation_idx[:10]}")


--- seed=42 ---
fold 0 train idxs: 270000 validation idxs: 30000 -> [ 1 17 21 26 31 34 38 40 54 62]
fold 1 train idxs: 270000 validation idxs: 30000 -> [ 2 11 27 33 44 45 48 52 63 64]
fold 2 train idxs: 270000 validation idxs: 30000 -> [ 4  5 12 15 23 39 42 53 58 82]
fold 3 train idxs: 270000 validation idxs: 30000 -> [16 19 29 30 55 57 66 72 75 78]
fold 4 train idxs: 270000 validation idxs: 30000 -> [ 10  13  18  71  80  94  98 103 120 133]
fold 5 train idxs: 270000 validation idxs: 30000 -> [ 6  8 22 24 25 28 32 61 83 85]
fold 6 train idxs: 270000 validation idxs: 30000 -> [ 41  50  59  65  67  74  90  92 102 134]
fold 7 train idxs: 270000 validation idxs: 30000 -> [ 0  3 14 36 56 73 87 93 97 99]
fold 8 train idxs: 270000 validation idxs: 30000 -> [  9  37  46  47  69  76  77  96 115 123]
fold 9 train idxs: 270000 validation idxs: 30000 -> [ 7 20 35 43 49 51 60 68 70 86]

--- seed=0 ---
fold 0 train idxs: 270000 validation idxs: 30000 -> [  4  17  26  31  41  56  66  96 102 125]
fol

In [11]:
# Storing into the notebook for future use
joblib.dump(fold_idxs, './fold_idxs.job')

['./fold_idxs.job']

In [12]:
# Retrieving from the notebook
fold_idxs = joblib.load('./fold_idxs.job')
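
Once the folds are stored, they can be reused to build out-of-fold predictions for any model, which is what stacking needs. The following is a minimal sketch under stated assumptions: the data is synthetic (`X_demo`, `y_demo`), `Ridge` is just a placeholder model, and `demo_fold_idxs` mimics the structure of `fold_idxs` (one list of `(train_idx, validation_idx)` pairs per seed). With the real `X_train` DataFrame you would index rows with `.iloc` instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins (assumption): X_demo, y_demo replace X_train, y_train
X_demo, y_demo = make_regression(n_samples=1000, n_features=10,
                                 noise=10.0, random_state=0)
strata = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)

# Same structure as fold_idxs: one list of (train_idx, validation_idx) per seed
demo_seeds = [42, 0, 101]
demo_fold_idxs = [
    list(StratifiedKFold(n_splits=5, shuffle=True,
                         random_state=seed).split(X_demo, strata))
    for seed in demo_seeds
]

# Out-of-fold predictions, averaged over the seeds
oof = np.zeros(len(X_demo))
for seed_folds in demo_fold_idxs:
    for train_idx, validation_idx in seed_folds:
        model = Ridge(alpha=1.0)  # placeholder model, swap in your own
        model.fit(X_demo[train_idx], y_demo[train_idx])
        oof[validation_idx] += model.predict(X_demo[validation_idx]) / len(demo_fold_idxs)

rmse = float(np.sqrt(np.mean((y_demo - oof) ** 2)))
print(f"out-of-fold RMSE averaged over {len(demo_fold_idxs)} seeds: {rmse:.3f}")
```

Because every model is scored on exactly the same splits, their out-of-fold predictions are directly comparable and can be stacked without leakage between folds.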