Add Target Encoding to Sparkling Water Scala API #4549

Closed
exalate-issue-sync bot opened this issue May 22, 2023 · 7 comments

@exalate-issue-sync

Target Encoding is available in R/Python API. Add it to the Sparkling Water Scala API.

@exalate-issue-sync
Author

Marek Novotny commented: Target Encoding will be implemented as a separate SW estimator for Spark ML pipelines. [~accountid:557058:f0137791-c6cb-47bd-bcce-fc81ad4cfefa] Do you have any objections?

@exalate-issue-sync
Author

Megan Kurka commented: [~accountid:5c9943ec3a5542225fedb6b9] I think that makes sense. I think the target encoder would need to be a pre-processing step in a Spark ML pipeline. Is that still possible if it is implemented as a separate SW estimator?

@exalate-issue-sync
Author

Marek Novotny commented: Yes, I think we're talking about the same thing, just with slightly different terminology. :-) By a SW estimator, I meant a pre-processing stage that needs to be trained.

Just wanted to check that we're on the same page and we won't append the logic of a target encoder to H2O models.
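
To illustrate the "pre-processing stage that needs to be trained" idea, here is a minimal sketch in plain Scala with no Spark or H2O dependency. The names `TargetEncoder`, `TargetEncoderModel`, the `Row` alias, and the column parameters are hypothetical and are not the Sparkling Water API; the encoding is a plain per-category mean of the label, which is an assumption made only for this example.

```scala
object TargetEncodingSketch {
  // A row is modelled as a plain map from column name to value (assumption for this sketch).
  type Row = Map[String, String]

  // Trained stage: holds the per-category mean of the label and a fallback prior.
  final case class TargetEncoderModel(
      inputCol: String,
      outputCol: String,
      means: Map[String, Double],
      prior: Double) {
    // The same call is used for any dataset; unseen or missing categories get the prior.
    def transform(rows: Seq[Row]): Seq[Row] =
      rows.map { r =>
        val encoded = r.get(inputCol).flatMap(means.get).getOrElse(prior)
        r + (outputCol -> encoded.toString)
      }
  }

  // Untrained stage: fit learns the encoding and returns the trained model.
  // Assumes a non-empty training set with a numeric label column.
  final case class TargetEncoder(inputCol: String, labelCol: String, outputCol: String) {
    def fit(rows: Seq[Row]): TargetEncoderModel = {
      val labelled = rows.map(r => r(inputCol) -> r(labelCol).toDouble)
      val prior = labelled.map(_._2).sum / labelled.size
      val means = labelled.groupBy(_._1).map { case (category, pairs) =>
        category -> pairs.map(_._2).sum / pairs.size
      }
      TargetEncoderModel(inputCol, outputCol, means, prior)
    }
  }
}
```

The encoder is a stage of its own, so the model that follows it in a pipeline only ever sees the already encoded column.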

@exalate-issue-sync
Author

Megan Kurka commented: That sounds great! Thanks!

@exalate-issue-sync
Author

Marek Novotny commented: Hi [~accountid:557058:f0137791-c6cb-47bd-bcce-fc81ad4cfefa],
Maybe you've already heard that we are currently having trouble getting this task done, since we have hit several blockers in the current H2O-3 implementation. I just want to summarize what is missing from the current implementation of TE in H2O-3.

  • The trained encoding map is represented only as a Map<String, Frame>, and there is no API that allows us to transform it into a representation that doesn't require a running H2O cluster. This is crucial for SW pipelines deployed in PROD.
  • The same applies to the methods that apply the encoding table and transform training and testing datasets. Ideally, we would like to get something like a MOJO that contains the encoding map and can perform transformations at the record level.
  • The applyTargetEncoding functions go against the principles of Spark pipelines. When a dataset is transformed in a pipeline, it doesn't matter whether it is a training or a testing dataset: both are treated the same way, with the same set of parameters. The parameters of the applyTargetEncoding functions, however, differ depending on the type of dataset. I spent the whole day trying to find a workaround, but I didn't come up with anything robust enough.
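
As a rough illustration of the first two points, the sketch below shows what a cluster-free encoding map with record-level scoring could look like. It is plain Scala and does not use any H2O-3 or Sparkling Water API. The `Stats` layout (a label sum and a row count per level), the text format, and the `_te` output suffix are all assumptions made for the example.

```scala
object PortableEncodingSketch {
  // Per-level statistics: sum of the label and number of rows (assumed layout).
  final case class Stats(numerator: Double, denominator: Long)

  // Plain stand-in for Map<String, Frame>: column -> level -> stats.
  type EncodingMap = Map[String, Map[String, Stats]]

  // Writes one "column,level,numerator,denominator" line per entry.
  // Assumes column names and levels contain no commas or newlines.
  def serialize(m: EncodingMap): String = {
    val lines = for {
      (col, levels) <- m.toSeq
      (level, s) <- levels.toSeq
    } yield s"$col,$level,${s.numerator},${s.denominator}"
    lines.sorted.mkString("\n")
  }

  // Reads the format produced by serialize; malformed lines are skipped.
  def deserialize(text: String): EncodingMap =
    text.split("\n").toSeq.filter(_.nonEmpty).map(_.split(",")).collect {
      case Array(col, level, num, den) => (col, level, Stats(num.toDouble, den.toLong))
    }.groupBy(_._1).map { case (col, entries) =>
      col -> entries.map(e => e._2 -> e._3).toMap
    }

  // Encodes a single record without any cluster; unseen levels fall back to the prior.
  // Assumes every stored denominator is positive.
  def encode(m: EncodingMap, prior: Double)(record: Map[String, String]): Map[String, Double] =
    m.map { case (col, levels) =>
      val value = record.get(col).flatMap(levels.get)
        .map(s => s.numerator / s.denominator)
        .getOrElse(prior)
      s"${col}_te" -> value
    }
}
```

Because `encode` takes the same arguments for every record, it makes no distinction between training and testing data, which is the behaviour a Spark pipeline stage expects.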

Also I would like to mention two minor things that should be improved IMHO:

  • A user should be allowed to specify the names of the output columns.
  • If adding noise is disabled, I think there is no need to specify a seed.
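
One possible way to express these two points in a parameter object is sketched below. This is plain Scala and not an existing H2O-3 or Sparkling Water API: `Params`, `Noise`, `UniformNoise`, and the default `_te` suffix are hypothetical names chosen for the example. The seed lives inside the noise setting, so it only has to be supplied when noise is enabled.

```scala
object EncoderParamsSketch {
  // Noise configuration: a seed is only part of the variant that actually adds noise.
  sealed trait Noise
  case object NoNoise extends Noise
  final case class UniformNoise(amplitude: Double, seed: Long) extends Noise

  final case class Params(
      inputCols: Seq[String],
      outputCols: Option[Seq[String]] = None, // user-chosen output names, if any
      noise: Noise = NoNoise) {
    require(
      outputCols.forall(_.size == inputCols.size),
      "outputCols must have the same length as inputCols")

    // Falls back to "<input>_te" when the user did not name the outputs.
    def resolvedOutputCols: Seq[String] =
      outputCols.getOrElse(inputCols.map(_ + "_te"))
  }

  // Adds uniform noise in [-amplitude, amplitude) to each value, or leaves values untouched.
  def applyNoise(values: Seq[Double], noise: Noise): Seq[Double] = noise match {
    case NoNoise => values
    case UniformNoise(amplitude, seed) =>
      val rng = new scala.util.Random(seed)
      values.map(v => v + (rng.nextDouble() * 2 - 1) * amplitude)
  }
}
```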

If you want to talk about this in more detail, we can set up a call tomorrow.

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: SW-1207
Assignee: Marek Novotny
Reporter: Megan Kurka
State: Resolved
Fix Version: 3.26.2
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1380
#1382
#1192

@hasithjp
Member

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-04-15T14:26:51.873-0700
