potential data leakage in TargetEncoder #35

takashioya · 2019-08-03T23:27:25Z

Using the plain per-category average of the target in target encoding causes leakage, because each row's own target value contributes to its encoding. It would probably be better to use KFold in TargetEncoder.
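A minimal sketch of the leakage described above, using only numpy and pandas rather than kaggler's TargetEncoder. The data is synthetic and deliberately extreme (every row has its own category and the target is random noise), and the names `naive`, `oof`, and `folds` are illustrative assumptions, not part of any library:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "cat": np.arange(n).astype(str),   # every row has its own category
    "y": rng.integers(0, 2, size=n),   # target is pure noise
})

# Naive target encoding: per-category mean computed on the same rows it encodes.
naive = df.groupby("cat")["y"].transform("mean")

# Out-of-fold encoding: category means come only from the other folds.
folds = np.arange(n) % 5
oof = pd.Series(np.nan, index=df.index)
for k in range(5):
    trn, val = df[folds != k], df[folds == k]
    means = trn.groupby("cat")["y"].mean()
    # Categories unseen in the training folds fall back to the training mean.
    oof.loc[val.index] = val["cat"].map(means).fillna(trn["y"].mean())

naive_corr = np.corrcoef(naive, df["y"])[0, 1]
oof_corr = np.corrcoef(oof, df["y"])[0, 1]
print(f"naive: {naive_corr:.3f}, out-of-fold: {oof_corr:.3f}")
```

In this setup the naive encoding simply reproduces the target, so a model would see a seemingly perfect feature that carries no real signal, while the out-of-fold encoding stays close to uncorrelated with the target.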

jeongyoonlee · 2019-08-05T18:29:12Z

I agree that there can be leakage and that it should be used with caution. In practice, I use it with cross validation, as you suggested, as follows:

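import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from kaggler.preprocessing import TargetEncoder

...
cat_cols = [col for col in X.columns if X[col].dtype == 'object']

# Out-of-fold encodings. Assumes X has a default RangeIndex and y is the target Series.
X_cat = pd.DataFrame(np.zeros((X.shape[0], len(cat_cols))), columns=cat_cols)
cv = KFold(N_FOLD)
for i, (i_trn, i_val) in enumerate(cv.split(X, y)):
    te = TargetEncoder()
    # Fit on the training fold only, so validation targets never reach the encoder.
    te.fit(X.loc[i_trn, cat_cols], y.loc[i_trn])
    # Each row is in exactly one validation fold, so assign instead of averaging.
    X_cat.loc[i_val] = te.transform(X.loc[i_val, cat_cols])

X.loc[:, cat_cols] = X_cat.values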

@takashioya Do you think it would be helpful to have the cross validation routine inside the class?

takashioya · 2019-08-05T18:40:48Z

Yes, it would be helpful, because some people may not notice the leakage problem and make a mistake.
My preferred API looks like this:

te = TargetEncoder(folds, nfold, stratified, shuffle)

For details on these four arguments, please see the lightgbm.cv function: https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.cv.html

jeongyoonlee · 2019-08-05T19:43:22Z

That's fair. I will work on fit() and fit_transform() with the cross validation. In terms of API, I'd like to follow the scikit-learn style API and take the optional cv object as an input as follows:

cv = KFold(N_FOLD, shuffle=True, random_state=RANDOM_SEED)
te = TargetEncoder(cv)

Thanks for the suggestion!

BTW, if you want, please feel free to work on it and submit a PR. :)
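For illustration, the cv-object API discussed above could look roughly like the sketch below. This is not kaggler's actual implementation: the class name `CVTargetEncoderSketch`, its attributes, and the toy data are all hypothetical, and it assumes `X` is a pandas DataFrame of categorical columns and `cv` is any scikit-learn splitter with a `split(X, y)` method.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


class CVTargetEncoderSketch:
    """Hypothetical illustration only; not kaggler's TargetEncoder."""

    def __init__(self, cv=None):
        self.cv = cv  # optional scikit-learn style splitter

    def fit(self, X, y):
        # Full-data statistics, used later to encode unseen (test) data.
        y = pd.Series(np.asarray(y), index=X.index)
        self.global_mean_ = float(y.mean())
        self.maps_ = {col: y.groupby(X[col]).mean() for col in X.columns}
        return self

    def transform(self, X):
        # Categories not seen during fit fall back to the global target mean.
        out = {
            col: X[col].map(self.maps_[col]).fillna(self.global_mean_)
            for col in X.columns
        }
        return pd.DataFrame(out, index=X.index).astype(float)

    def fit_transform(self, X, y):
        self.fit(X, y)
        if self.cv is None:
            return self.transform(X)  # leaky: each row sees its own target
        y = pd.Series(np.asarray(y), index=X.index)
        encoded = pd.DataFrame(np.nan, index=X.index, columns=X.columns)
        for i_trn, i_val in self.cv.split(X, y):
            # Encode each validation fold with statistics from the other folds.
            fold = CVTargetEncoderSketch().fit(X.iloc[i_trn], y.iloc[i_trn])
            encoded.iloc[i_val] = fold.transform(X.iloc[i_val]).to_numpy()
        return encoded


# Toy usage with made-up data.
X = pd.DataFrame({"city": ["a", "a", "b", "b", "c", "c"]})
y = pd.Series([1, 0, 1, 1, 0, 0])
cv = KFold(3, shuffle=True, random_state=42)
te = CVTargetEncoderSketch(cv)
X_enc = te.fit_transform(X, y)   # out-of-fold encodings for training rows
X_full = te.transform(X)         # full-data encodings, as used for test data
```

The design choice in this sketch is that `fit_transform()` returns out-of-fold values for the training rows, while `transform()` uses statistics from the full training set, which is how new data would be encoded.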

jeongyoonlee added this to In progress in Kaggler AutoML Aug 8, 2019

jeongyoonlee mentioned this issue Aug 8, 2019

Add the CV option and test code for TargetEncoder #40

Merged

jeongyoonlee closed this as completed Aug 8, 2019

Kaggler AutoML automation moved this from In progress to Done Aug 8, 2019
