
[SW-1207] Implement H2OTargetEncoder for Scala API #1192

Merged
merged 31 commits into from Jul 26, 2019

Conversation

Collaborator

@mn-mikke mn-mikke commented May 8, 2019

This PR requires h2oai/h2o-3#3282 to be released for a successful build.

The implementation of pysparkling wrappers and detailed tests will come in subsequent PRs.

@mn-mikke mn-mikke changed the title [SW-1207][WIP] Implement H2OTargetEncoder [SW-1207][WIP] Implement H2OTargetEncoder for Scala API Jun 13, 2019
@mn-mikke mn-mikke requested a review from jakubhava June 13, 2019 10:30
labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
Collaborator Author

Blending and noise parameters are grouped together: if the blending or noise settings object is not set, the particular feature is disabled. @jakubhava, I would like to know your opinion on this.

Contributor

Good question! Does it semantically make sense to group this feature under one new, specific configuration object?

Collaborator Author

If you started to use this estimator and saw parameters like smoothing and inflectionPoint, would you implicitly understand that they are parameters of blending? If blending is disabled, they don't have any impact on anything. We can pose a similar question about the seed parameter for noise.

An alternative could be to name the parameters with some prefix that groups them, but I'm not sure what is better.

Contributor

On the other hand, when I think about it, the users setting these properties probably have to know they require blending to be enabled. Maybe we can keep it simple like this.

In the end, it is easier for the user to set the options directly than to create some additional configuration object.

I would keep it as it is now.

deil87

@mn-mikke could you please elaborate on this comment? I don't see them to be grouped in current code. Is this concern coming from h2o-3 TE API?

Collaborator Author

The main point of my comment here is how to define H2OTargetEncoderParams in the best way so that new users will quickly and easily understand that smoothing and inflectionPoint belong to blending and seed to noise.

Personally, I don't know what's better... the current solution, naming the parameters with some prefix, or something else? Any suggestions welcome :-)
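To make the trade-off concrete, here is a purely illustrative Scala sketch of the two naming approaches discussed in this thread. All class and parameter names are hypothetical, not the actual SW API; the 10/20 blending defaults and the 0.01 noise default mirror the h2o-3 values mentioned elsewhere in this review.

```scala
// Option A: flat parameters, grouped only by a naming prefix.
case class FlatTEParams(
  blendingEnabled: Boolean = false,
  blendingInflectionPoint: Double = 10.0,
  blendingSmoothing: Double = 20.0,
  noiseAmount: Double = 0.01)

// Option B: one settings object per feature; an unset object disables the feature.
case class BlendingSettings(inflectionPoint: Double = 10.0, smoothing: Double = 20.0)
case class NoiseSettings(amount: Double = 0.01)
case class GroupedTEParams(
  blending: Option[BlendingSettings] = None,
  noise: Option[NoiseSettings] = None)

// With option B, "is the feature enabled?" is explicit in the types:
val params = GroupedTEParams(blending = Some(BlendingSettings()))
val blendingEnabled = params.blending.isDefined // true
val noiseEnabled = params.noise.isDefined       // false
```

Option B trades an extra object construction for the guarantee that smoothing and inflectionPoint can never be set while blending is disabled.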

@mn-mikke mn-mikke changed the title [SW-1207][WIP] Implement H2OTargetEncoder for Scala API [SW-1207] Implement H2OTargetEncoder for Scala API Jun 13, 2019
Contributor

@jakubhava jakubhava left a comment

PR looks good, just a few comments.

Thank you @mn-mikke for such a tremendous effort!


private def inTrainingMode: Boolean = {
  val stackTrace = Thread.currentThread().getStackTrace()
  stackTrace.exists(e => e.getMethodName == "fit" && e.getClassName == "org.apache.spark.ml.Pipeline")
}
Contributor

hehe, you found a way at the end! 👍

deil87

@mn-mikke It is even more tricky than reflection :) It introduces IO state (it makes our model stateful)... can't we just use a boolean flag to store the state instead of iterating through the stack trace?
Btw, what if fit was performed in a different thread?

Collaborator Author

The piece of code tries to distinguish between the cases when H2OTargetEncoderModel.transform is called from Pipeline.fit and from PipelineModel.transform (+ other callers). Unfortunately, we can't make any changes to Pipeline or PipelineModel.

Good point about multiple threads, but it seems to me that this is not the case here. H2OTargetEncoderModel.transform is called on the training dataset right after H2OTargetEncoder.fit, and only from one method.

Personally, I'm not happy with this solution either, so I welcome any proposals. What's the idea with a flag? Who or what would set it?
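For reference, the stack-trace technique debated in this thread can be demonstrated in isolation. This is a toy sketch: PipelineLike stands in for org.apache.spark.ml.Pipeline, whose class name the real check matches.

```scala
object StackTraceSketch {
  // True when a frame of PipelineLike.fit is on the current thread's stack.
  def calledFromFit: Boolean =
    Thread.currentThread().getStackTrace.exists { e =>
      e.getClassName.endsWith("PipelineLike$") && e.getMethodName == "fit"
    }

  object PipelineLike {
    // While this method executes, its frame is visible to calledFromFit.
    def fit(): Boolean = calledFromFit
  }
}

val insideFit = StackTraceSketch.PipelineLike.fit() // true
val outsideFit = StackTraceSketch.calledFromFit     // false
```

Note that getStackTrace only inspects the current thread, which is exactly the multi-thread caveat raised here: a fit executing on a different thread would not be detected.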

deil87

@mn-mikke No... there is no way to use a flag here as you are using Spark's Pipeline. I just remember we were discussing on Confluence that SW has its own wrapper (H2OPipeline?) for pipelines (which is probably what @jakubhava is referring to here).

Contributor

We do not have H2OPipeline anymore and we definitely need to use the provided Pipeline from Spark


jakubhava previously approved these changes Jun 14, 2019
@jakubhava
Contributor

One more note: @mn-mikke could you please also add a tutorial to our documentation on how to use target encoding in Sparkling Water? A different PR is fine for this.

@mn-mikke
Collaborator Author

Yes, will do once this gets merged :-) Thanks @jakubhava!


@deil87 deil87 left a comment


Thank you @mn-mikke! I left some comments/questions... but overall it looks good! I would suggest adding more tests, though.

def getNoise(): H2OTargetEncoderNoiseSettings = $(noise)

//
// Others
deil87

@mn-mikke should this method be here? A class that, judging by the name, is supposed to keep parameter values should not contain logic about how to transform anything.

Collaborator Author

Yep, I will change it as part of addressing the comment here. Thanks!


override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
val input = h2oContext.asH2OFrame(dataset.toDF())
changeRelevantColumnsToCategorical(input)
deil87

@mn-mikke Should we do this implicitly for the user? In the h2o-3 TargetEncoder we only check whether all the expected columns are categorical. If I got it correctly, in h2o-3 we prefer to ask the user to prepare the data.

Contributor

Sparkling Water is trying (and still has some way to go to reach that goal) to hide tasks which can be done automatically. I believe we should do these automatically, not ask the user for explicit data prep.

deil87

@kuba I agree that we should try to automate as much as possible, but what if the user specified a wrong column for TE and it was a numerical one... the algorithm will silently convert it, but performance would be awful (not to mention that it is probably not what the user was planning to do). If the user didn't have to specify the TE columns himself, I would totally agree with full automation. When a human is in the loop, I would make any transformations more explicit (log a warning or throw). Just my opinion.
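The "warn instead of silently converting" alternative could be sketched roughly as follows. This is only an illustration: the schema is modeled as a plain Map from column name to type name, whereas the real code would inspect the Spark DataFrame schema, and all names here are hypothetical.

```scala
// Columns chosen for target encoding that are not string/categorical.
def nonCategoricalTECols(schema: Map[String, String], teCols: Seq[String]): Seq[String] =
  teCols.filter(c => schema.get(c).exists(_ != "string"))

val schema = Map("zip" -> "string", "age" -> "int")
val suspicious = nonCategoricalTECols(schema, Seq("zip", "age"))
suspicious.foreach { c =>
  // Logging (or throwing) keeps the human in the loop before any conversion.
  println(s"WARNING: TE column '$c' has type ${schema(c)}; it will be converted to categorical.")
}
```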

def this() = this(Identifiable.randomUID("H2OTargetEncoder"))

override def fit(dataset: Dataset[_]): H2OTargetEncoderModel = {
val h2oContext = H2OContext.getOrCreate(SparkSession.builder().getOrCreate())
deil87

@mn-mikke what about IoC (Cake pattern) here? We don't need multiple implementations yet, but part of this pattern (moving some logic into a trait) might be good for code reuse. We could introduce an H2OContextProvider trait and add an h2oContext method there. Wdyt?

Contributor

@jakubhava jakubhava Jun 26, 2019

Good idea, but not related to the core change. Let's focus on the TE part first, please :)
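The H2OContextProvider trait suggested above could be sketched roughly like this. H2OContext is stubbed out so the snippet is self-contained; a real implementation would delegate to H2OContext.getOrCreate, and every name here is hypothetical.

```scala
// Stand-in for Sparkling Water's H2OContext.
class H2OContextStub

// The abstract provider: estimators depend on this instead of calling
// H2OContext.getOrCreate directly.
trait H2OContextProvider {
  def h2oContext: H2OContextStub
}

// Production mix-in; tests could mix in a different stub implementation.
trait DefaultH2OContextProvider extends H2OContextProvider {
  lazy val h2oContext: H2OContextStub = new H2OContextStub
}

class TargetEncoderSketch extends DefaultH2OContextProvider {
  def contextAvailable: Boolean = h2oContext != null
}

val encoderHasContext = new TargetEncoderSketch().contextAvailable // true
```

The design benefit is that the context-acquisition logic lives in one trait and test code can swap it out without touching the estimator.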

labelCol -> "label",
inputCols -> Array[String](),
holdoutStrategy -> H2OTargetEncoderHoldoutStrategy.None,
blending -> null,
deil87

@mn-mikke In Scala, using null is an antipattern... did you consider using None? Another pattern (maybe not a rule) that comes from the Spark framework is that there are no nulls in default params. And maybe most to the point... we have a default value for the blending params in h2o-3: new BlendingParams(10, 20). Maybe I can expose the default values in h2o-3 so that we can reuse them here as well. Same with the noise=0.01 default value.

Collaborator Author

These parameters are serialized by reflection and deserialized by PySpark. I'm not sure whether there is support for Options already. Maybe @jakubhava could shed some light on that.

I quite like skipping null defaults, but if we do that, we should do the same for all SW algorithms. We would eventually need to double-check the consequences.

Exposed default values in H2O-3 would be nice; at least everything would stay consistent if an H2O-3 default value changes in the future.

Contributor

@jakubhava jakubhava Jun 26, 2019

If blending has some default value, then H2O should define it and we should use it. Otherwise, use null as in the rest of the code for consistency.




override def createSparkContext = new SparkContext("local[*]", "H2OTargetEncoderTest", conf = defaultSparkConf)

private def loadDataFrameFromCsv(path: String): DataFrame = {
deil87

@mn-mikke Is it really the very first time you needed helper methods like these? Don't you have something like loadDataFrameFromCsvAsResource or assertDataFramesAreIdentical in TestUtils or somewhere?

Collaborator Author

Yep, something generic has already been created here recently.

assertDataFramesAreIdentical(expectedTestingDataset, transformedTestingDataset)
}

test("The target encoder doesn't apply noise on the testing dataset") {
deil87

@mn-mikke I'm having a hard time understanding this test. Could you please help me with it? So... we call fit on our pipeline... meaning that in H2OTargetEncoderModel we are in inTrainingMode, right? And we have the following logic: val noise = Option(getNoise()).getOrElse(H2OTargetEncoderNoiseSettings(amount = 0.0)), so it feels like we should apply noise since we set it to 0.5. And it is not H2OTargetEncoderMojoModel. What am I missing?

Collaborator Author

inTrainingMode holds only during fitting. At that time, noise is applied according to the parameters. When the testing dataset is transformed, we are no longer inTrainingMode and noise won't be applied.

Does it make sense?
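The rule explained above can be condensed into a tiny sketch (the names are illustrative; the real logic lives in H2OTargetEncoderModel.transform):

```scala
// Noise only takes effect while fitting; transforming a testing dataset
// never applies noise, regardless of the configured amount.
def effectiveNoiseAmount(configured: Option[Double], inTrainingMode: Boolean): Double =
  if (inTrainingMode) configured.getOrElse(0.0) else 0.0

val trainNoise = effectiveNoiseAmount(Some(0.5), inTrainingMode = true)  // 0.5
val testNoise  = effectiveNoiseAmount(Some(0.5), inTrainingMode = false) // 0.0
```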

deil87

@mn-mikke the question is how do we switch from fitting/inTrainingMode to not inTrainingMode? Because it feels like a StackTraceElement which satisfies the inTrainingMode conditions could still be there.

Collaborator Author

Because it feels like StackTraceElement which satisfies inTrainingMode conditions could be still there.

@deil87 Please can you give me an example?

"""If set, the target average becomes a weighted average of the group target value and the global target value of a given row.
|The weight is determined by the size of the given group that the row belongs to.
|Attributes:
| InflectionPoint - The bigger number it's, the bigger groups will consider the global target value as a component in the weighted average.
deil87

@mn-mikke better to rephrase InflectionPoint as well. We should reason from the size of the given group; it is relative. We can't just say "the bigger, the bigger". Maybe we can say something like this about the difference: the bigger the difference between the size of the given group and the inflection point, the more weight will be put onto the posterior average of the target.

Collaborator Author

@deil87 Would it make sense if we rephrased the sentence like this: "the bigger groups" -> "the groups relatively bigger to the overall data set size will ..." ?

deil87

@mn-mikke It should be relatively bigger than the inflection point: "The bigger the group is with respect to the inflection point, the more weight we will put onto its target average and the less weight onto the prior average." Take a look at the Py/R documentation for extra inspiration.

@jakubhava
Contributor

@mn-mikke I changed the Jenkins files a bit; please rebase on top of master so we can run tests on this PR. Thanks!

@mn-mikke
Collaborator Author

@jakubhava Thanks for letting me know! I will do the rebase together with the next set of changes.

@mn-mikke mn-mikke force-pushed the mn/SW-1207 branch 3 times, most recently from b43d150 to 0a09b5d on July 17, 2019
@mn-mikke mn-mikke merged commit c9aa942 into h2oai:master Jul 26, 2019
mn-mikke added a commit that referenced this pull request Jul 26, 2019
mn-mikke added a commit that referenced this pull request Jul 26, 2019
mn-mikke added a commit that referenced this pull request Jul 26, 2019
mn-mikke added a commit that referenced this pull request Jul 27, 2019
jakubhava pushed a commit that referenced this pull request Jul 29, 2019
3 participants