potential data leakage in TargetEncoder #35
Comments
I agree that there can be leakage and it should be used with caution. In practice, I'm using it with cross validation as you suggested, as follows:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from kaggler.preprocessing import TargetEncoder
...
cat_cols = [col for col in X.columns if X[col].dtype == 'object']
X_cat = pd.DataFrame(np.zeros_like(X[cat_cols]), columns=cat_cols)
cv = KFold(N_FOLD)
for i, (i_trn, i_val) in enumerate(cv.split(X)):
    te = TargetEncoder()
    # fit on the training fold only; y is assumed to be the target Series aligned with X
    te.fit(X.loc[i_trn, cat_cols], y.loc[i_trn])
    # each row falls in exactly one validation fold, so assign its out-of-fold encoding directly
    X_cat.loc[i_val] = te.transform(X.loc[i_val, cat_cols])
X.loc[:, cat_cols] = X_cat.values

@takashioya Do you think it would be helpful to have the cross validation routine inside the class?
Yes, it would be helpful, because some people may not notice the leakage problem and make a mistake.
For the details of these 4 arguments, please see
That's fair. I will work on

cv = KFold(N_FOLD, shuffle=True, random_state=RANDOM_SEED)
te = TargetEncoder(cv)

Thanks for the suggestion! BTW, if you want, please feel free to work on it and submit a PR. :)
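For anyone following along, here is a rough sketch of what an encoder that takes a `cv` object could do internally. It is not kaggler's implementation: the class name `OutOfFoldTargetEncoder`, its `fit_transform` method, and the fallback to the fold's overall target mean for unseen categories are all assumptions made for illustration, written with plain pandas and scikit-learn's `KFold`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


class OutOfFoldTargetEncoder:
    """Hypothetical sketch, not kaggler's TargetEncoder."""

    def __init__(self, cv):
        self.cv = cv

    def fit_transform(self, X, y):
        X = X.reset_index(drop=True)
        y = pd.Series(np.asarray(y), dtype=float)
        encoded = pd.DataFrame(np.nan, index=X.index, columns=X.columns, dtype=float)
        for i_trn, i_val in self.cv.split(X):
            # Fallback for categories that do not appear in the training fold
            prior = y.iloc[i_trn].mean()
            for col in X.columns:
                # Category means are computed from the training fold only
                means = y.iloc[i_trn].groupby(X[col].iloc[i_trn]).mean()
                encoded.iloc[i_val, encoded.columns.get_loc(col)] = (
                    X[col].iloc[i_val].map(means).fillna(prior).to_numpy()
                )
        return encoded


# Toy data; N_FOLD and RANDOM_SEED values are made up for the example
N_FOLD, RANDOM_SEED = 3, 42
X = pd.DataFrame({"city": ["a", "a", "b", "b", "a", "b"]})
y = [1, 0, 1, 1, 0, 0]
cv = KFold(N_FOLD, shuffle=True, random_state=RANDOM_SEED)
X_enc = OutOfFoldTargetEncoder(cv).fit_transform(X, y)
```

With this shape, a row's encoding never uses that row's own target, which is the point of putting the cross validation inside the class.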
Simply using the average of the target in target encoding will cause leakage. It's probably better to use KFold in TargetEncoder.
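To make the leakage concrete, here is a small self-contained demo. It does not use kaggler at all; the data is synthetic and the column names `cat` and `y` are made up. Every row gets its own category, so the column carries no real signal, yet the naive mean encoding reproduces the target exactly, while the out-of-fold version falls back to each fold's training mean.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 200
# One unique category per row: an extreme high-cardinality column with no signal
df = pd.DataFrame({
    "cat": [f"id_{i}" for i in range(n)],
    "y": rng.integers(0, 2, size=n).astype(float),
})

# Naive encoding: category mean over all rows, including the row being encoded
naive = df.groupby("cat")["y"].transform("mean")

# Out-of-fold encoding: category means come from the other folds only
oof = pd.Series(np.nan, index=df.index)
for i_trn, i_val in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    trn = df.iloc[i_trn]
    means = trn.groupby("cat")["y"].mean()
    # Unseen categories fall back to the training fold's overall target mean
    oof.iloc[i_val] = df["cat"].iloc[i_val].map(means).fillna(trn["y"].mean()).to_numpy()

# The naive feature is a copy of the target, so a model can simply memorize it
naive_leaks = bool((naive == df["y"]).all())
```

Real categorical columns are less extreme than this, but rare categories behave the same way, which is why the naive version tends to overfit.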