Add Target Encoding to Sparkling Water Scala API #4549

exalate-issue-sync · 2023-05-22T16:20:21Z

Target Encoding is available in the R/Python API. Add it to the Sparkling Water Scala API.

exalate-issue-sync · 2023-05-22T16:20:23Z

Marek Novotny commented: Target Encoding will be implemented as a separate SW estimator for Spark ML pipelines. [~accountid:557058:f0137791-c6cb-47bd-bcce-fc81ad4cfefa] Do you have any objections?

exalate-issue-sync · 2023-05-22T16:20:24Z

Megan Kurka commented: [~accountid:5c9943ec3a5542225fedb6b9] I think that makes sense. I think the target encoder would need to be a pre-processing step in a Spark ML pipeline. Is that still possible if it is implemented as a separate SW estimator?

exalate-issue-sync · 2023-05-22T16:20:26Z

Marek Novotny commented: Yes, I think we're talking about the same thing, but with slightly different terminology. :-) By an SW estimator, I meant a pre-processing stage that needs to be trained.

Just wanted to check that we're on the same page and we won't append the logic of a target encoder to H2O models.

exalate-issue-sync · 2023-05-22T16:20:28Z

Megan Kurka commented: That sounds great! Thanks!

exalate-issue-sync · 2023-05-22T16:20:30Z

Marek Novotny commented: Hi [~accountid:557058:f0137791-c6cb-47bd-bcce-fc81ad4cfefa],
Maybe you've already heard that we are currently having trouble getting the task done, since we hit several blockers in the current H2O-3 implementation. I just want to summarize what is missing from the current implementation of TE in H2O-3.

The trained encoding map is represented only as a Map<String, Frame>, and there is no API that allows us to transform it into a representation that doesn't require a running H2O cluster. This is crucial for SW pipelines deployed in production.
The same applies to the methods that apply the encoding table and transform training and testing datasets. Ideally, we would like to get something like a MOJO that contains the encoding map and can perform transformations at the record level.
The applyTargetEncoding functions go against the principles of Spark pipelines. When a dataset is transformed in a pipeline, it doesn't matter whether it is a training or testing dataset; both are treated the same way, with the same set of parameters. But the parameters of the applyTargetEncoding functions differ depending on the type of dataset. I spent the whole day trying to find a workaround, but I didn't come up with anything robust enough.
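
For illustration, the cluster-free, record-level behaviour requested above could look roughly like the following. This is a minimal sketch in plain Python, independent of H2O-3 and Spark; the names `fit_encoding_map` and `encode_record` and the dictionary layout are hypothetical and not part of any H2O API. It uses a simple per-category mean and leaves out blending, noise, and any leakage handling for training data.

```python
import json
from collections import defaultdict


def fit_encoding_map(rows, column, target):
    """Learn per-category target means from non-empty training rows."""
    stats = defaultdict(lambda: [0.0, 0])  # category -> [sum of target, count]
    for row in rows:
        entry = stats[row[column]]
        entry[0] += row[target]
        entry[1] += 1
    total = sum(s for s, _ in stats.values())
    count = sum(n for _, n in stats.values())
    # Plain dicts and floats only, so no running cluster is needed to use it.
    return {
        "column": column,
        "prior": total / count,  # global mean, used for unseen categories
        "means": {cat: s / n for cat, (s, n) in stats.items()},
    }


def encode_record(encoding, record, output_column):
    """Encode a single record; the call is the same for any dataset."""
    category = record[encoding["column"]]
    value = encoding["means"].get(category, encoding["prior"])
    return {**record, output_column: value}


train = [
    {"city": "Prague", "label": 1},
    {"city": "Prague", "label": 0},
    {"city": "Brno", "label": 1},
]
encoding = fit_encoding_map(train, "city", "label")

# The encoding map is plain data, so it survives a JSON round trip.
restored = json.loads(json.dumps(encoding))

encoded = encode_record(restored, {"city": "Prague"}, "city_te")
unseen = encode_record(restored, {"city": "Ostrava"}, "city_te")
```

Because the fitted map is ordinary data, it can be stored alongside a pipeline and applied one record at a time, which is the property the MOJO-like representation above is asking for.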

Also I would like to mention two minor things that should be improved IMHO:

A user should be allowed to specify the names of the output columns.
If noise is disabled, I think there is no need to specify a seed.
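
The two minor points could be sketched as follows, again in plain Python with hypothetical names (`output_name`, `add_noise`) and a hypothetical `_te` default suffix; none of this reflects the actual H2O-3 signatures. The seed is optional and is only consulted when noise is enabled.

```python
import random
from typing import Optional


def output_name(input_column: str, output_column: Optional[str] = None) -> str:
    """Use the caller-supplied name if given, else a hypothetical default."""
    return output_column if output_column is not None else input_column + "_te"


def add_noise(value: float, noise: float = 0.0, seed: Optional[int] = None) -> float:
    """Perturb an encoded value; the seed only matters when noise > 0."""
    if noise == 0.0:
        return value  # deterministic path, the seed is ignored
    rng = random.Random(seed)
    return value + rng.uniform(-noise, noise)
```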

If you want to talk about that in more detail, we can set up a call tomorrow.

DinukaH2O · 2023-05-23T11:20:10Z

JIRA Issue Migration Info

Jira Issue: SW-1207
Assignee: Marek Novotny
Reporter: Megan Kurka
State: Resolved
Fix Version: 3.26.2
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#1380
#1382
#1192

hasithjp · 2023-05-29T14:28:24Z

JIRA Issue Migration Info Cont'd

Jira Issue Created Date: 2019-04-15T14:26:51.873-0700

DinukaH2O assigned mn-mikke May 23, 2023

DinukaH2O closed this as completed May 23, 2023

DinukaH2O added the fixVersion/3.26.2 label May 23, 2023
