[SW-1207] Implement H2OTargetEncoder for Scala API #1192
Conversation
labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
Blending and noise parameters are grouped together: if `blending` or `noise` is disabled, the corresponding feature is disabled. @jakubhava, I would like to know your opinion on this.
Good question! Does it semantically make sense to group this feature under one new, specific configuration object?
If you started to use this estimator and saw parameters like `smoothing` and `inflectionPoint`, would you implicitly understand that they are parameters of blending? If blending is disabled, they don't have any impact on anything. We can pose a similar question about the `seed` parameter for `noise`.
An alternative could be to name the parameters with some prefix that groups them, but I'm not sure which is better.
On the other hand, when I think about it, the users setting these properties probably have to know they require blending to be enabled. Maybe we can keep it simple like this.
In the end, it is easier for the user to set the options directly than to create some additional configuration object.
I would keep it like it is now.
@mn-mikke could you please elaborate on this comment? I don't see them grouped in the current code. Is this concern coming from the h2o-3 TE API?
The main point of my comment here is how to define `H2OTargetEncoderParams` in the best way, so that new users will quickly and easily understand that `smoothing` and `inflectionPoint` belong to blending and `seed` to noise.
Personally, I don't know what's better... the current solution, naming the parameters with some prefix, or something else? Welcome any suggestions :-)
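For illustration only, the two API shapes under discussion might look roughly like this (all names and default values here are hypothetical placeholders, not the final API):

```scala
// Option 1: flat parameters, grouped only by a naming prefix. A reader has
// to know that the blending* knobs only matter when blendingEnabled is true.
case class FlatParams(
  blendingEnabled: Boolean = false,
  blendingInflectionPoint: Double = 10.0, // ignored unless blendingEnabled
  blendingSmoothing: Double = 20.0,       // ignored unless blendingEnabled
  noiseAmount: Double = 0.01,
  noiseSeed: Long = -1L                   // ignored unless noiseAmount > 0
)

// Option 2: nested settings objects; None disables the whole feature, so the
// dependency between the sub-parameters and the feature is explicit in types.
case class BlendingSettings(inflectionPoint: Double = 10.0, smoothing: Double = 20.0)
case class NoiseSettings(amount: Double = 0.01, seed: Long = -1L)
case class GroupedParams(
  blending: Option[BlendingSettings] = None,
  noise: Option[NoiseSettings] = None
)
```

With the grouped shape, `GroupedParams(blending = Some(BlendingSettings()))` both enables blending and scopes its sub-parameters in one step; with the flat shape the connection exists only in documentation and naming.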
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
PR looks good, just a few comments.
Thank you @mn-mikke for such a tremendous effort!
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/features/H2OTargetEncoder.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/models/H2OTargetEncoderModel.scala (outdated, resolved)
private def inTrainingMode: Boolean = {
  val stackTrace = Thread.currentThread().getStackTrace()
  stackTrace.exists(e => e.getMethodName == "fit" && e.getClassName == "org.apache.spark.ml.Pipeline")
}
hehe, you found a way in the end! 👍
@mn-mikke It is even more tricky than reflection :) It is IO state (it makes our model a stateful one)... can't we just use a boolean flag for storing the state instead of iterating through the stack trace?
Btw, what if `fit` was performed in a different thread?
This piece of code tries to distinguish between the cases when `H2OTargetEncoderModel.transform` is called from `Pipeline.fit` and from `PipelineModel.transform` (+ other callers). Unfortunately, we can't make any changes to `Pipeline` or `PipelineModel`.
Good point with the multiple threads, but it seems to me that this is not the case here. `H2OTargetEncoderModel.transform` is called for a training dataset straight after `H2OTargetEncoder.fit`, only from one method.
Personally, I'm not happy with this solution either, so I welcome any proposals. What's the idea with a flag? Who or what would set it?
@mn-mikke No... no way for a flag here, as you are using Spark's `Pipeline`. I just remember we were discussing in Confluence that SW has its own wrapper (`H2OPipeline`?) for pipelines (which is probably what @jakubhava is referring to here).
We do not have H2OPipeline anymore, and we definitely need to use the `Pipeline` provided by Spark.
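For readers following along, the stack-walking trick being debated reduces to a tiny standalone sketch (no Spark involved; the names here are made up for the demo):

```scala
object StackProbe {
  // True if any frame on the current thread's stack matches the given class
  // and method name -- the same test inTrainingMode performs against
  // org.apache.spark.ml.Pipeline#fit.
  def calledFrom(className: String, methodName: String): Boolean =
    Thread.currentThread().getStackTrace.exists { e =>
      e.getClassName == className && e.getMethodName == methodName
    }

  // Demo: a method can detect its own frame by method name alone.
  def probe(): Boolean =
    Thread.currentThread().getStackTrace.exists(_.getMethodName == "probe")
}
```

The thread-affinity concern raised above is visible here: `calledFrom` only sees frames of the thread it runs on, so a `fit` executing on a different thread would be invisible to it.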
One more note: @mn-mikke, could you please also add a tutorial to our documentation on how to use target encoding in Sparkling Water? A different PR is fine for this.
Yes, will do once this gets merged :-) Thanks @jakubhava!
Thank you @mn-mikke! I left some comments/questions... but overall it looks good! I would suggest adding more tests, though.
def getNoise(): H2OTargetEncoderNoiseSettings = $(noise)

//
// Others
@mn-mikke should this method be here? A class that, judging by the name, is supposed to keep parameters' values should not contain logic about how to transform anything.
Yep, will change as part of the comment here. Thanks!
override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
  val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
  val input = h2oContext.asH2OFrame(dataset.toDF())
  changeRelevantColumnsToCategorical(input)
@mn-mikke Should we do this implicitly for the user? In the h2o-3 TargetEncoder, we only check whether all the expected columns are categorical. If I got it correctly, in h2o-3 we prefer to ask the user to prepare the data.
Sparkling Water tries (with still some way to go to reach the goal) to hide tasks which can be done automatically. I believe we should do these automatically, not ask the user for explicit data prep.
@kuba I agree that we should try to do as much as possible automatically, but what if the user specified the wrong column for TE and it was a numerical one... the algorithm will silently convert it, but performance would be awful (not to mention that it is probably not what the user was planning to do). If the user did not have to specify the TE columns himself, I would totally agree with full automation. When a human is in the loop, I would make any transformations more explicit (log a warning or throw). Just my opinion.
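A middle ground consistent with this comment would be to convert automatically but surface the conversion. A rough sketch, with column-type metadata modeled as a plain `Map` rather than the actual SW frame API:

```scala
// Partition the user-specified TE columns by whether they already have the
// expected categorical type; warn about the ones that would need an implicit
// numeric-to-categorical conversion instead of converting silently.
def splitTeColumns(
    columnTypes: Map[String, String],
    teColumns: Seq[String]): (Seq[String], Seq[String]) = {
  val (categorical, needConversion) =
    teColumns.partition(c => columnTypes.get(c).contains("categorical"))
  needConversion.foreach { c =>
    Console.err.println(s"WARN: TE column '$c' is not categorical; converting it implicitly.")
  }
  (categorical, needConversion)
}
```

A stricter variant could throw for the second group instead of warning, which matches the "human in the loop" argument above.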
def this() = this(Identifiable.randomUID("H2OTargetEncoder"))

override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
  val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
@mn-mikke What about IoC (the Cake Pattern) here? We don't need multiple implementations yet, but part of this pattern (moving some logic into a trait) might be good for code reuse. We could introduce an `H2OContextProvider` trait and add an `h2oContext` method there. Wdyt?
Good idea, but it's not related to the core change. Let's focus on the TE part first, please :)
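For the record, the trait idea could be as small as this sketch (the names follow the suggestion above but are hypothetical, and a string stub stands in for the real `H2OContext` lookup):

```scala
// A provider trait: production code mixes in the real lookup, while tests
// can override it with a stub -- the code-reuse/testability point of the
// Cake Pattern suggestion.
trait H2OContextProvider {
  // In real code this would delegate to H2OContext.getOrCreate(...).
  def contextDescription: String = "H2OContext from getOrCreate"
}

class Encoder extends H2OContextProvider {
  def fit(): String = s"fitting with: $contextDescription"
}

// A test double overriding the provider method.
class EncoderUnderTest extends Encoder {
  override def contextDescription: String = "stubbed context"
}
```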
labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
@mn-mikke In Scala, using `null` is an antipattern... did you consider using `None`? And another pattern (maybe not a rule) that comes from the Spark framework is that there are no `null`s in default params. And maybe most to the point... we have a default value for the blending params in h2o-3, `new BlendingParams(10, 20)`. Maybe I can expose the default values in h2o-3 so that we can reuse them here as well. Same with the `noise = 0.01` default value.
These parameters are serialized by reflection and deserialized by PySpark. I'm not sure whether there is support for `Option`s already. Maybe @jakubhava could shed some light on that.
I quite like skipping `null` defaults, but if we do that, we should do the same for all SW algorithms. We need to double-check the consequences eventually.
Exposed default values in H2O-3 would be nice; at least they will always stay consistent if an H2O-3 default value changes in the future.
If blending has a default value, then H2O should define it and we should use it. Otherwise, use `null` as in the rest of the code, for consistency.
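To make the trade-off concrete (the 10/20 values come from the `new BlendingParams(10, 20)` default mentioned above; the case class itself is a stand-in, not the SW type):

```scala
// A stand-in for the blending settings object.
case class Blending(inflectionPoint: Double, smoothing: Double)

// Current style: a null default, defensively wrapped in Option when read.
val nullStyleDefault: Blending = null
val readBack: Option[Blending] = Option(nullStyleDefault) // None, no NPE

// Alternative style: reuse the h2o-3 default directly, so no null ever
// appears among the params.
val h2o3StyleDefault: Blending = Blending(inflectionPoint = 10, smoothing = 20)
```

The second style only works if h2o-3 actually exposes its defaults, which is exactly what the comment above offers to do.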
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
override def createSparkContext = new SparkContext("local[*]", "H2OTargetEncoderTest", conf = defaultSparkConf)

private def loadDataFrameFromCsv(path: String): DataFrame = {
@mn-mikke Is it really the very first time you needed helper methods like these? Don't you have something like them (`loadDataFrameFromCsvAsResource`, `assertDataFramesAreIdentical`) in TestUtils or somewhere?
Yep, something generic has already been created here recently.
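For illustration, the comparison helper boils down to something like this (a standalone sketch over collected rows modeled as `Seq[Map[String, Any]]`, since the real helper would operate on Spark DataFrames):

```scala
// Fail fast with a row index when two collected datasets differ. Rows are
// compared positionally, so callers should sort both sides first if row
// order is not deterministic.
def assertDatasetsIdentical(
    expected: Seq[Map[String, Any]],
    actual: Seq[Map[String, Any]]): Unit = {
  assert(expected.size == actual.size,
    s"row counts differ: ${expected.size} vs ${actual.size}")
  expected.zip(actual).zipWithIndex.foreach { case ((e, a), i) =>
    assert(e == a, s"row $i differs: expected $e, got $a")
  }
}
```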
assertDataFramesAreIdentical(expectedTestingDataset, transformedTestingDataset)
}

test("The target encoder doesn't apply noise on the testing dataset") {
@mn-mikke I'm having a hard time understanding this test. Could you please help me with it? So... we call `fit` on our pipeline... it means that in H2OTargetEncoderModel we are in `inTrainingMode`, right? And we have the following logic: `val noise = Option(getNoise()).getOrElse(H2OTargetEncoderNoiseSettings(amount = 0.0))`, and it feels like we should apply noise, as we set it to 0.5. And it is not `H2OTargetEncoderMojoModel`. What am I missing?
`inTrainingMode` is valid only during fitting. At that time, noise is applied according to the parameters. When the testing dataset is transformed, we are not `inTrainingMode` anymore and noise won't be applied.
Does it make sense?
@mn-mikke The question is how we switch from fitting/`inTrainingMode` to not `inTrainingMode`. Because it feels like a StackTraceElement which satisfies the `inTrainingMode` conditions could still be there.
> Because it feels like StackTraceElement which satisfies inTrainingMode conditions could be still there.

@deil87 Can you please give me an example?
ml/src/main/scala/org/apache/spark/ml/h2o/models/H2OTargetEncoderModel.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
"""If set, the target average becomes a weighted average of the group target value and the global target value of a given row. | ||
|The weight is determined by the size of the given group that the row belongs to. | ||
|Attributes: | ||
| InflectionPoint - The bigger number it's, the bigger groups will consider the global target value as a component in the weighted average. |
@mn-mikke Better rephrase InflectionPoint as well. We should reason from the size of the given group; it is relative, so we can't just say "the bigger, the bigger". Maybe say something like this about the difference: the bigger the difference between the size of the given group and the IP, the more weight will be put on the posterior average of the target.
@deil87 Would it make sense if we rephrased the sentence like this: "the bigger groups" -> "the groups relatively bigger to the overall dataset size will ..."?
@mn-mikke It should be relatively bigger than the inflection point: "The bigger a group is with respect to the inflection point, the more weight we will put onto its target average and the less weight onto the prior average." Take a look at the Py/R documentation for extra inspiration.
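For intuition, the sigmoid blending weight commonly used in target encoding (Micci-Barreca's formulation; assumed here to match h2o-3's behaviour in spirit, not necessarily in exact detail) behaves exactly as described:

```scala
import scala.math.exp

// Weight of the group's own (posterior) target average. When the group size
// equals the inflection point the weight is 0.5; larger groups push it
// towards 1, smaller groups towards 0. `smoothing` controls the steepness.
def blendingWeight(groupSize: Double, inflectionPoint: Double, smoothing: Double): Double =
  1.0 / (1.0 + exp((inflectionPoint - groupSize) / smoothing))

// Blended encoding: weighted average of the group (posterior) target mean
// and the global (prior) target mean.
def blendedEncoding(posterior: Double, prior: Double,
                    groupSize: Double, inflectionPoint: Double, smoothing: Double): Double = {
  val w = blendingWeight(groupSize, inflectionPoint, smoothing)
  w * posterior + (1.0 - w) * prior
}
```

So "bigger" is indeed relative to the inflection point: a group of 100 rows against an inflection point of 10 leans almost entirely on its own average, while a group of 5 leans mostly on the prior.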
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
ml/src/main/scala/org/apache/spark/ml/h2o/param/H2OTargetEncoderParams.scala (outdated, resolved)
@mn-mikke I changed the Jenkins files a bit; please rebase on top of master so we can run tests on this PR. Thanks!
@jakubhava Thanks for letting me know! I will do the rebase together with the next set of changes.
Force-pushed from b43d150 to 0a09b5d
This PR requires h2oai/h2o-3#3282 to be released for a successful build.
The implementation of pysparkling wrappers and detailed tests will come in subsequent PRs.