[SW-1207] Implement H2OTargetEncoder for Scala API #1192
Conversation
labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
Blending and noise parameters are grouped together: if `blending` or `noise` is disabled, the corresponding feature is disabled. @jakubhava, I would like to know your opinion on this.
Good question! Does it semantically make sense to group this feature under one new, specific configuration object?
If you started to use this estimator and saw parameters like `smoothing` and `inflectionPoint`, would you implicitly understand that they are parameters of blending? If blending is disabled, they don't have any impact on anything. We can pose a similar question about the `seed` parameter for `noise`.
An alternative could be to name the parameters with some prefix that groups them, but I'm not sure which is better.
On the other hand, when I think about it, the users setting these properties probably have to know they require blending to be enabled. Maybe we can keep it simple like this.
In the end, it is easier for the user to set the options directly than to create some additional configuration object.
I would keep it like it is now.
@mn-mikke could you please elaborate on this comment? I don't see them grouped in the current code. Is this concern coming from the h2o-3 TE API?
The main point of my comment here is how to define `H2OTargetEncoderParams` in the best way, so that new users will quickly and easily understand that `smoothing` and `inflectionPoint` belong to blending and `seed` to noise.
Personally, I don't know what's better... the current solution, naming the parameters with some prefix, or something else? Welcome any suggestions :-)
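For illustration only, the two API shapes under discussion might look roughly like this (all names and default values here are hypothetical placeholders, not the final API):

```scala
// Option 1: flat parameters, grouped only by a naming prefix. A reader has
// to know that the blending* knobs only matter when blendingEnabled is true.
case class FlatParams(
  blendingEnabled: Boolean = false,
  blendingInflectionPoint: Double = 10.0, // ignored unless blendingEnabled
  blendingSmoothing: Double = 20.0,       // ignored unless blendingEnabled
  noiseAmount: Double = 0.01,
  noiseSeed: Long = -1L                   // ignored unless noiseAmount > 0
)

// Option 2: nested settings objects; None disables the whole feature, so the
// dependency between the sub-parameters and the feature is explicit in types.
case class BlendingSettings(inflectionPoint: Double = 10.0, smoothing: Double = 20.0)
case class NoiseSettings(amount: Double = 0.01, seed: Long = -1L)
case class GroupedParams(
  blending: Option[BlendingSettings] = None,
  noise: Option[NoiseSettings] = None
)
```

With the grouped shape, `GroupedParams(blending = Some(BlendingSettings()))` both enables blending and scopes its sub-parameters in one step; with the flat shape the connection exists only in documentation and naming.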
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
PR looks good, just a few comments.
Thank you @mn-mikke for such a tremendous effort!
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/models/H2OTargetEncoderModel.scala (outdated, resolved)
private def inTrainingMode: Boolean = {
  val stackTrace = Thread.currentThread().getStackTrace()
  stackTrace.exists(e => e.getMethodName == "fit" && e.getClassName == "org.apache.spark.ml.Pipeline")
}
hehe, you found a way in the end! 👍
@mn-mikke It is even more tricky than reflection :) It is IO state (it makes our model a stateful one)... can't we just use a boolean flag for storing the state instead of iterating through the stack trace?
Btw, what if `fit` was performed in a different thread?
This piece of code tries to distinguish between the cases when `H2OTargetEncoderModel.transform` is called from `Pipeline.fit` and from `PipelineModel.transform` (+ other callers). Unfortunately, we can't make any changes to `Pipeline` or `PipelineModel`.
Good point with the multiple threads, but it seems to me that this is not the case here. `H2OTargetEncoderModel.transform` is called for a training dataset straight after `H2OTargetEncoder.fit`, only from one method.
Personally, I'm not happy with this solution either, so I welcome any proposals. What's the idea with a flag? Who or what would set it?
@mn-mikke No... no way for a flag here, as you are using Spark's `Pipeline`. I just remember we were discussing in Confluence that SW has its own wrapper (`H2OPipeline`?) for pipelines (which is probably what @jakubhava is referring to here).
We do not have H2OPipeline anymore, and we definitely need to use the `Pipeline` provided by Spark.
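For readers following along, the stack-walking trick being debated reduces to a tiny standalone sketch (no Spark involved; the names here are made up for the demo):

```scala
object StackProbe {
  // True if any frame on the current thread's stack matches the given class
  // and method name -- the same test inTrainingMode performs against
  // org.apache.spark.ml.Pipeline#fit.
  def calledFrom(className: String, methodName: String): Boolean =
    Thread.currentThread().getStackTrace.exists { e =>
      e.getClassName == className && e.getMethodName == methodName
    }

  // Demo: a method can detect its own frame by method name alone.
  def probe(): Boolean =
    Thread.currentThread().getStackTrace.exists(_.getMethodName == "probe")
}
```

The thread-affinity concern raised above is visible here: `calledFrom` only sees frames of the thread it runs on, so a `fit` executing on a different thread would be invisible to it.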
One more note: @mn-mikke, could you please also add a tutorial to our documentation on how to use target encoding in Sparkling Water? A different PR is fine for this.
Yes, will do once this gets merged :-) Thanks @jakubhava!
Thank you @mn-mikke! I left some comments/questions... but overall it looks good! I would suggest adding more tests, though.
def getNoise(): H2OTargetEncoderNoiseSettings = $(noise)

//
// Others
@mn-mikke should this method be here? A class that, judging by the name, is supposed to keep parameters' values should not contain logic about how to transform anything.
Yep, will change as part of the comment here. Thanks!
override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
  val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
  val input = h2oContext.asH2OFrame(dataset.toDF())
  changeRelevantColumnsToCategorical(input)
@mn-mikke Should we do this implicitly for the user? In the h2o-3 TargetEncoder, we only check whether all the expected columns are categorical. If I got it correctly, in h2o-3 we prefer to ask the user to prepare the data.
Sparkling Water tries (with still some way to go to reach the goal) to hide tasks which can be done automatically. I believe we should do these automatically, not ask the user for explicit data prep.
@kuba I agree that we should try to do as much as possible automatically, but what if the user specified the wrong column for TE and it was a numerical one... the algorithm will silently convert it, but performance would be awful (not to mention that it is probably not what the user was planning to do). If the user did not have to specify the TE columns himself, I would totally agree with full automation. When a human is in the loop, I would make any transformations more explicit (log a warning or throw). Just my opinion.
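A middle ground consistent with this comment would be to convert automatically but surface the conversion. A rough sketch, with column-type metadata modeled as a plain `Map` rather than the actual SW frame API:

```scala
// Partition the user-specified TE columns by whether they already have the
// expected categorical type; warn about the ones that would need an implicit
// numeric-to-categorical conversion instead of converting silently.
def splitTeColumns(
    columnTypes: Map[String, String],
    teColumns: Seq[String]): (Seq[String], Seq[String]) = {
  val (categorical, needConversion) =
    teColumns.partition(c => columnTypes.get(c).contains("categorical"))
  needConversion.foreach { c =>
    Console.err.println(s"WARN: TE column '$c' is not categorical; converting it implicitly.")
  }
  (categorical, needConversion)
}
```

A stricter variant could throw for the second group instead of warning, which matches the "human in the loop" argument above.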
def this() = this(Identifiable.randomUID("H2OTargetEncoder"))

override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
  val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
@mn-mikke What about IoC (the Cake Pattern) here? We don't need multiple implementations yet, but part of this pattern (moving some logic into a trait) might be good for code reuse. We could introduce an `H2OContextProvider` trait and add an `h2oContext` method there. Wdyt?
Good idea, but it's not related to the core change. Let's focus on the TE part first, please :)
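For the record, the trait idea could be as small as this sketch (the names follow the suggestion above but are hypothetical, and a string stub stands in for the real `H2OContext` lookup):

```scala
// A provider trait: production code mixes in the real lookup, while tests
// can override it with a stub -- the code-reuse/testability point of the
// Cake Pattern suggestion.
trait H2OContextProvider {
  // In real code this would delegate to H2OContext.getOrCreate(...).
  def contextDescription: String = "H2OContext from getOrCreate"
}

class Encoder extends H2OContextProvider {
  def fit(): String = s"fitting with: $contextDescription"
}

// A test double overriding the provider method.
class EncoderUnderTest extends Encoder {
  override def contextDescription: String = "stubbed context"
}
```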
labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
@mn-mikke In Scala, using `null` is an antipattern... did you consider using `None`? And another pattern (maybe not a rule) that comes from the Spark framework is that there are no `null`s in default params. And maybe most to the point... we have a default value for the blending params in h2o-3, `new BlendingParams(10, 20)`. Maybe I can expose the default values in h2o-3 so that we can reuse them here as well. Same with the `noise = 0.01` default value.
These parameters are serialized by reflection and deserialized by PySpark. I'm not sure whether there is support for `Option`s already. Maybe @jakubhava could shed some light on that.
I quite like skipping `null` defaults, but if we do that, we should do the same for all SW algorithms. We need to double-check the consequences eventually.
Exposed default values in H2O-3 would be nice; at least they will always stay consistent if an H2O-3 default value changes in the future.
If blending has a default value, then H2O should define it and we should use it. Otherwise, use `null` as in the rest of the code, for consistency.
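To make the trade-off concrete (the 10/20 values come from the `new BlendingParams(10, 20)` default mentioned above; the case class itself is a stand-in, not the SW type):

```scala
// A stand-in for the blending settings object.
case class Blending(inflectionPoint: Double, smoothing: Double)

// Current style: a null default, defensively wrapped in Option when read.
val nullStyleDefault: Blending = null
val readBack: Option[Blending] = Option(nullStyleDefault) // None, no NPE

// Alternative style: reuse the h2o-3 default directly, so no null ever
// appears among the params.
val h2o3StyleDefault: Blending = Blending(inflectionPoint = 10, smoothing = 20)
```

The second style only works if h2o-3 actually exposes its defaults, which is exactly what the comment above offers to do.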
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
override def createSparkContext = new SparkContext("local[*]", "H2OTargetEncoderTest", conf = defaultSparkConf)

private def loadDataFrameFromCsv(path: String): DataFrame = {
@mn-mikke Is it really the very first time you needed helper methods like these? Don't you have something like them (`loadDataFrameFromCsvAsResource`, `assertDataFramesAreIdentical`) in TestUtils or somewhere?
Yep, something generic has already been created here recently.
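For illustration, the comparison helper boils down to something like this (a standalone sketch over collected rows modeled as `Seq[Map[String, Any]]`, since the real helper would operate on Spark DataFrames):

```scala
// Fail fast with a row index when two collected datasets differ. Rows are
// compared positionally, so callers should sort both sides first if row
// order is not deterministic.
def assertDatasetsIdentical(
    expected: Seq[Map[String, Any]],
    actual: Seq[Map[String, Any]]): Unit = {
  assert(expected.size == actual.size,
    s"row counts differ: ${expected.size} vs ${actual.size}")
  expected.zip(actual).zipWithIndex.foreach { case ((e, a), i) =>
    assert(e == a, s"row $i differs: expected $e, got $a")
  }
}
```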
assertDataFramesAreIdentical(expectedTestingDataset, transformedTestingDataset)
}

test("The target encoder doesn't apply noise on the testing dataset") {
@mn-mikke I'm having a hard time understanding this test. Could you please help me with it? So... we call `fit` on our pipeline... it means that in H2OTargetEncoderModel we are in `inTrainingMode`, right? And we have the following logic: `val noise = Option(getNoise()).getOrElse(H2OTargetEncoderNoiseSettings(amount = 0.0))`, and it feels like we should apply noise, as we set it to 0.5. And it is not `H2OTargetEncoderMojoModel`. What am I missing?
`inTrainingMode` is valid only during fitting. At that time, noise is applied according to the parameters. When the testing dataset is transformed, we are not `inTrainingMode` anymore and noise won't be applied.
Does it make sense?
@mn-mikke The question is how we switch from fitting/`inTrainingMode` to not `inTrainingMode`. Because it feels like a StackTraceElement which satisfies the `inTrainingMode` conditions could still be there.
> Because it feels like StackTraceElement which satisfies inTrainingMode conditions could be still there.

@deil87 Can you please give me an example?
ml/src/main/scala/org/apache/spark/ml/h2o/models/H2OTargetEncoderModel.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
"""If set, the target average becomes a weighted average of the group target value and the global target value of a given row. | ||
|The weight is determined by the size of the given group that the row belongs to. | ||
|Attributes: | ||
| InflectionPoint - The bigger number it's, the bigger groups will consider the global target value as a component in the weighted average. |
@mn-mikke Better rephrase InflectionPoint as well. We should reason from the size of the given group; it is relative, so we can't just say "the bigger, the bigger". Maybe say something like this about the difference: the bigger the difference between the size of the given group and the IP, the more weight will be put on the posterior average of the target.
@deil87 Would it make sense if we rephrased the sentence like this: "the bigger groups" -> "the groups relatively bigger to the overall dataset size will ..."?
@mn-mikke It should be relatively bigger than the inflection point: "The bigger a group is with respect to the inflection point, the more weight we will put onto its target average and the less weight onto the prior average." Take a look at the Py/R documentation for extra inspiration.
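For intuition, the sigmoid blending weight commonly used in target encoding (Micci-Barreca's formulation; assumed here to match h2o-3's behaviour in spirit, not necessarily in exact detail) behaves exactly as described:

```scala
import scala.math.exp

// Weight of the group's own (posterior) target average. When the group size
// equals the inflection point the weight is 0.5; larger groups push it
// towards 1, smaller groups towards 0. `smoothing` controls the steepness.
def blendingWeight(groupSize: Double, inflectionPoint: Double, smoothing: Double): Double =
  1.0 / (1.0 + exp((inflectionPoint - groupSize) / smoothing))

// Blended encoding: weighted average of the group (posterior) target mean
// and the global (prior) target mean.
def blendedEncoding(posterior: Double, prior: Double,
                    groupSize: Double, inflectionPoint: Double, smoothing: Double): Double = {
  val w = blendingWeight(groupSize, inflectionPoint, smoothing)
  w * posterior + (1.0 - w) * prior
}
```

So "bigger" is indeed relative to the inflection point: a group of 100 rows against an inflection point of 10 leans almost entirely on its own average, while a group of 5 leans mostly on the prior.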
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
@mn-mikke I changed the Jenkins files a bit; please rebase on top of master so we can run tests on this PR. Thanks!
@jakubhava Thanks for letting me know! I will do the rebase together with the next set of changes.
Force-pushed from b43d150 to 0a09b5d
This PR requires h2oai/h2o-3#3282 to be released for a successful build.
The implementation of pysparkling wrappers and detailed tests will come in subsequent PRs.