potential data leakage in TargetEncoder #35

Closed
takashioya opened this issue Aug 3, 2019 · 3 comments

@takashioya

Simply using the average of the target for target encoding will cause leakage, because each row's own target value contributes to its encoding. It is probably better to use KFold inside TargetEncoder.
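The leakage can be illustrated with a small, self-contained sketch. It does not use Kaggler at all: the data is synthetic, the column names `cat` and `y` are made up for the example, and the encoding is a plain group mean. The target is independent of the category, so an honest encoding should carry almost no signal.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "cat": rng.integers(0, 500, size=n).astype(str),  # high-cardinality noise
    "y": rng.integers(0, 2, size=n),                   # target independent of "cat"
})

# Naive encoding: each row's own target is part of its category mean.
naive = df.groupby("cat")["y"].transform("mean")

# Out-of-fold encoding: category means come from the other folds only.
oof = pd.Series(np.nan, index=df.index)
for i_trn, i_val in KFold(5, shuffle=True, random_state=0).split(df):
    trn = df.iloc[i_trn]
    means = trn.groupby("cat")["y"].mean()
    # Categories unseen in the training folds fall back to the training-fold mean.
    encoded = df["cat"].iloc[i_val].map(means).fillna(trn["y"].mean())
    oof.iloc[i_val] = encoded.to_numpy()

naive_corr = np.corrcoef(naive, df["y"])[0, 1]
oof_corr = np.corrcoef(oof, df["y"])[0, 1]
print(f"naive: {naive_corr:.3f}, out-of-fold: {oof_corr:.3f}")
```

The naive encoding correlates clearly with a target it has no real relationship to, while the out-of-fold encoding stays close to zero.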

@jeongyoonlee
Owner

jeongyoonlee commented Aug 5, 2019

I agree that there can be leakage and it should be used with caution. In practice, I'm using it with cross validation as you suggested as follows:

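import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from kaggler.preprocessing import TargetEncoder

...
cat_cols = [col for col in X.columns if X[col].dtype == 'object']

# Out-of-fold encodings. Assumes X has a default RangeIndex and y is the target.
X_cat = pd.DataFrame(np.zeros((X.shape[0], len(cat_cols))), columns=cat_cols)
cv = KFold(N_FOLD)
for i, (i_trn, i_val) in enumerate(cv.split(X)):
    te = TargetEncoder()
    # Fit on the training fold only, so validation rows never see their own targets.
    te.fit(X.loc[i_trn, cat_cols], y[i_trn])
    # Each row falls in exactly one validation fold, so no averaging over folds is needed.
    X_cat.loc[i_val] = te.transform(X.loc[i_val, cat_cols])

X.loc[:, cat_cols] = X_cat.values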

@takashioya Do you think it will be helpful to have the cross validation routine inside the class?

@takashioya
Author

takashioya commented Aug 5, 2019

Yes, it will be helpful, because some people may not notice the leakage problem and make a mistake.
My favorite API is like this:

te = TargetEncoder(folds, nfold, stratified, shuffle) 

For the details of these 4 arguments, please see the lightgbm.cv function at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.cv.html
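As a rough sketch of what those four arguments could mean for an encoder, the helper below turns them into a scikit-learn splitter. `make_splitter` and its `seed` parameter are hypothetical names for illustration only, and the defaults are assumptions rather than anything Kaggler or LightGBM defines.

```python
from sklearn.model_selection import KFold, StratifiedKFold


def make_splitter(folds=None, nfold=5, stratified=True, shuffle=True, seed=0):
    """Hypothetical helper: build a CV splitter from lightgbm.cv-style arguments."""
    if folds is not None:
        # An explicit splitter takes priority over the other arguments.
        return folds
    # scikit-learn rejects a random_state when shuffle is False.
    random_state = seed if shuffle else None
    splitter_cls = StratifiedKFold if stratified else KFold
    return splitter_cls(n_splits=nfold, shuffle=shuffle, random_state=random_state)


cv = make_splitter(nfold=3, stratified=False)
print(type(cv).__name__, cv.get_n_splits())
```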

@jeongyoonlee
Owner

That's fair. I will work on fit() and fit_transform() with cross validation. In terms of API, I'd like to follow the scikit-learn style and take an optional cv object as an input, as follows:

cv = KFold(N_FOLD, shuffle=True, random_state=RANDOM_SEED)
te = TargetEncoder(cv)

Thanks for the suggestion!

BTW, if you want, please feel free to work on it and submit a PR. :)
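One way the cv-object API proposed in this comment might behave is sketched below. `CVTargetEncoder` is a hypothetical name and a plain mean encoder, not Kaggler's actual TargetEncoder (which may smooth or handle missing values differently). The assumption here is that `fit_transform()` returns out-of-fold encodings for training rows, while `transform()` uses full-data means for new data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


class CVTargetEncoder:
    """Hypothetical mean-target encoder with an optional cv splitter."""

    def __init__(self, cv=None):
        self.cv = cv

    def fit(self, X, y):
        # Full-data category means, intended for unseen (e.g. test) rows.
        y = pd.Series(np.asarray(y), index=X.index)
        self.prior_ = y.mean()
        self.maps_ = {col: y.groupby(X[col]).mean() for col in X.columns}
        return self

    def transform(self, X):
        out = pd.DataFrame(index=X.index)
        for col in X.columns:
            # Unseen categories fall back to the global target mean.
            out[col] = X[col].map(self.maps_[col]).fillna(self.prior_)
        return out.astype(float)

    def fit_transform(self, X, y):
        self.fit(X, y)
        if self.cv is None:
            return self.transform(X)
        # With a splitter, each row is encoded by a model fit on the other folds.
        y = pd.Series(np.asarray(y), index=X.index)
        out = pd.DataFrame(np.nan, index=X.index, columns=X.columns, dtype=float)
        for i_trn, i_val in self.cv.split(X, y):
            fold = CVTargetEncoder().fit(X.iloc[i_trn], y.iloc[i_trn])
            out.iloc[i_val] = fold.transform(X.iloc[i_val]).to_numpy()
        return out


X = pd.DataFrame({"city": ["a", "a", "b", "b", "a", "b"]})
y = pd.Series([1, 0, 1, 1, 0, 0])
cv = KFold(3, shuffle=True, random_state=42)
X_enc = CVTargetEncoder(cv).fit_transform(X, y)
print(X_enc)
```

Keeping the splitter as a constructor argument means any scikit-learn splitter (KFold, StratifiedKFold, and so on) can be passed without the encoder needing its own fold options.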

@jeongyoonlee jeongyoonlee added this to In progress in Kaggler AutoML Aug 8, 2019
Kaggler AutoML automation moved this from In progress to Done Aug 8, 2019