
[SWPRIVATE-16] NA handling for Spark algorithms #75

Merged
merged 1 commit into from Nov 3, 2016

Conversation

mdymczyk
Contributor

@mdymczyk mdymczyk commented Sep 6, 2016

  • added NA value handling for SVM: mean imputation, skip, and not allowed (the old behaviour, which raises an error)
  • means are calculated via an RDD, as H2OFrame.means() doesn't support enum columns
  • changed SVM model scoring/POJO generation to include column means when running in MeanImputation mode
  • refactored tests a bit (frame creation now supports multi-column frames)
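The three NA strategies listed above can be sketched in plain Scala. This is illustrative only, not the actual Sparkling Water implementation: a Seq stands in for the RDD, and the names (MissingValuesHandling, MeanImputation, Skip, NotAllowed) simply follow the PR's terminology.

```scala
object NaHandlingSketch {
  sealed trait MissingValuesHandling
  case object MeanImputation extends MissingValuesHandling
  case object Skip extends MissingValuesHandling
  case object NotAllowed extends MissingValuesHandling

  // Per-column means computed while ignoring NaNs, the way a distributed
  // aggregate over an RDD would (sum and count per column).
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    val ncols = rows.head.length
    (0 until ncols).map { i =>
      val vs = rows.map(_(i)).filterNot(_.isNaN)
      if (vs.isEmpty) Double.NaN else vs.sum / vs.size
    }.toArray
  }

  def handleNAs(rows: Seq[Array[Double]],
                strategy: MissingValuesHandling): Seq[Array[Double]] =
    strategy match {
      case NotAllowed =>
        // old behaviour: fail fast if any value is missing
        require(!rows.exists(_.exists(_.isNaN)),
          "Training frame cannot contain any missing values")
        rows
      case Skip =>
        // drop the whole row, not individual columns
        rows.filterNot(_.exists(_.isNaN))
      case MeanImputation =>
        val means = columnMeans(rows)
        rows.map(_.zipWithIndex.map { case (v, i) =>
          if (v.isNaN) means(i) else v
        })
    }
}
```

With two rows `[1.0, NaN]` and `[3.0, 4.0]`, Skip keeps only the second row, while MeanImputation replaces the NaN with 4.0 (the mean of the non-missing values in that column).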

@mdymczyk mdymczyk changed the title [WIP][SWPRIVATE-16] NA handling for Spark algorithms [SWPRIVATE-16] NA handling for Spark algorithms Sep 7, 2016
@vpatryshev
Contributor

I think we need some kind of coordination here.

@mdymczyk
Contributor Author

@vpatryshev are you referring to your refactoring? I'm OK with adopting your changes, since most of this commit is just moving things around in tests.

@vpatryshev
Contributor

Yes, this is a good step forward, but it partially overlaps with the data handling fixes that are already waiting for merge in another pull request. I'd rather line these things up; otherwise we will be committing the same stuff, with code chunks fighting each other. Michal, any updates?

@jakubhava
Contributor

I would rather merge @vpatryshev's branch first. I've done the review already and I'm fine with merging right away; I'm just waiting for @mmalohlava's approval, and then I would put @mdymczyk's changes on top of that.

@vpatryshev's changes seem like a good base for the rest of the code to build on. They unify things and will also play nicely with the external backend.

Would it be complicated to rebase your changes onto @vpatryshev's branch, @mdymczyk?

But let's also wait for Michal's response.

thanks!

@mdymczyk
Contributor Author

mdymczyk commented Sep 13, 2016

Sure, merge @vpatryshev's changes first and I'll rebase this branch onto them, no problem.

I'd just like to get this and the progress-bar PRs into master, as I'll be using them for the next MLlib algorithms. I also want to write a blog post about SVM in Sparkling Water with screenshots, and it would be much easier for me to have all the features in one branch.

@vpatryshev
Contributor

Yes, same here.

Also, I had a talk with Arno, and the general idea of returning 0 when an integer value is missing was considered wrong, so I committed another change taking care of these issues.

Anyway, that's why I did not review @mdymczyk's change; it may turn out that most of it is already implemented.
Sorry that we did not communicate enough here beforehand.

I actually have two more issues after talking with Arno.

First, the loop over absolute row indexes is not efficient; a nested loop over chunks, with relative indexes inside, would make more sense. This is just a performance concern, and the actual impact may be negligible, but it is documented behavior.

Second, how come UUID and Enum are not handled as such, but converted to strings? We could just as well keep them.

(And a third one: what's the fuss with UTF8String? Do we need to carry these around in Scala alongside regular strings, or should we rather keep one kind, either way, with implicit conversions where needed?)


@mdymczyk
Contributor Author

@vpatryshev no problems.

  1. True, I also saw the doc noting that looping through rows is not very efficient, but it seemed much easier, so I went with it before trying something else. If it's indeed too slow, we can change it.

  2. If you're referring to converting H2O's UUIDs to String before putting them in a DataFrame/RDD, then I guess it's because there is no UUID DataType in Spark, and the same goes for enums. We could do it for RDDs (at least for UUID; I'm not sure how you'd create enums on the fly), but I don't think it would be possible with DataFrames, and we'd lose consistency. That's my guess; we'd need to ask @mmalohlava.

  3. UTF8String should only be used in Spark's SQL module. It's what Spark uses internally for string types; we should not use it in our code outside of that.

@mdymczyk
Contributor Author

@vpatryshev @jakubhava how's the other refactoring going? Do you guys need any help/reviews there?

@jakubhava
Contributor

I'm happy with @vpatryshev's PR in its current state; just waiting for @mmalohlava's approval and merge.

@vpatryshev
Contributor

I'm still waiting for the merge of that previous PR.


@jakubhava
Contributor

@mdymczyk I've already integrated the things I wanted into master. Could you please rebase on top of master, fix the problems, and then let me know? I'll have a look and do the review afterwards.

case DataTypes.IntegerType => value.asInstanceOf[Integer].doubleValue
case DataTypes.DoubleType => value.asInstanceOf[Double]
case DataTypes.StringType => domain.indexOf(value)
case _ => throw new IllegalArgumentException("Target column has to be an enum or a number. " + fieldStruct.toString)
Contributor:

Why do you want .toString here? It will be converted automatically, and it will crash on null.
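The point being made here is a general Scala/Java one: string concatenation converts the operand itself (null becomes "null"), while an explicit .toString throws a NullPointerException on null. A minimal sketch (the method name is illustrative, not from the PR):

```scala
object ToStringNull {
  // Concatenation with + is null-safe: it renders null as the string "null".
  def describe(fieldStruct: Any): String =
    "Target column has to be an enum or a number. " + fieldStruct

  // By contrast, "..." + fieldStruct.toString would throw a
  // NullPointerException when fieldStruct is null.
}
```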

agg1
}

private[ml] def toDouble(value: Any, fieldStruct: StructField, domain: Array[String]): Double = {
Contributor:

This is not very type safe; why not check value for its type with match/case? E.g.

value match {
  case b: Byte if fieldStruct.dataType == DataTypes.ByteType => b.doubleValue
  // etc.
}

And I would make it less verbose.
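A self-contained sketch of that suggestion, matching on the runtime type of `value` rather than on the declared Spark DataType. This is plain Scala without the Spark imports; the real method also takes a StructField, and the default domain parameter here is an illustrative simplification:

```scala
object ToDoubleSketch {
  // Type-driven conversion: let pattern matching establish the type,
  // instead of casting based on an externally supplied DataType.
  def toDouble(value: Any, domain: Array[String] = Array.empty): Double =
    value match {
      case b: Byte   => b.toDouble
      case i: Int    => i.toDouble
      case l: Long   => l.toDouble
      case f: Float  => f.toDouble
      case d: Double => d
      case s: String => domain.indexOf(s).toDouble // enum: index in the domain
      case other     => throw new IllegalArgumentException(
        "Target column has to be an enum or a number, got: " + other)
    }
}
```

The compiler guarantees each branch only sees a value of the matched type, so no unchecked asInstanceOf casts are needed.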

Contributor Author:

Really like this one, thanks!

String vecName = _train.name(i);
if (vec.naCnt() > 0 && (null == _parms._ignored_columns || Arrays.binarySearch(_parms._ignored_columns, vecName) < 0)) {
error("_train", "Training frame cannot contain any missing values [" + vecName + "].");
if(MissingValuesHandling.NotAllowed.equals(_parms._missing_values_handling)) {
Contributor:

You are comparing enums, right? Won't == work here?

@vpatryshev (Contributor) left a comment:

Added some comments; nothing essential, but...

training.cache();

if(training.count() == 0 && MissingValuesHandling.Skip.equals(_parms._missing_values_handling)) {
Contributor:

You are comparing enums, right? Won't == work here?

@@ -28,6 +31,7 @@ object SVMModel {
var interceptor: Double = .0
var iterations: Int = 0
var weights: Array[Double] = null
var numMeans: Array[Double] = null
Contributor:

Using nulls in Scala is not good practice at all; if a value is optional, use Option.
Using vars... well... those may be harder to eliminate.
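The Option-over-null advice applied to the fields in the diff above might look like this. This is only a sketch of the idea, not the actual SVMModel output class (which, as discussed below, has to interoperate with H2O's Java model infrastructure):

```scala
// Hypothetical reworking of the null-initialized vars from the diff:
class SVMOutputSketch {
  var weights: Option[Array[Double]] = None
  var numMeans: Option[Array[Double]] = None

  // Absence is now explicit at the type level: callers must decide what
  // to do when the value is missing instead of risking a NullPointerException.
  def meanOrZero(i: Int): Double = numMeans.map(_(i)).getOrElse(0.0)
}
```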

Contributor Author:

True, but it has to work with the underlying core H2O model for model outputs (that's also why I had to write parts of the code in Java; the underlying framework requires several completely different constructors). Not sure if Option is an... option here :-) But I will look into it.

val pred =
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) => d * w + acc}
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) =>
if(meanImputation && lang.Double.isNaN(d)) {
Contributor:

So, if it's not meanImputation, it's okay to multiply by d?

Contributor Author:

Yes, it will be a NaN in that case; I can keep that or throw an exception. Skipping NaN values doesn't make much sense computation-wise (MissingValuesHandling.Skip during training skips the whole row if it contains NaNs, not individual columns).
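The scoring logic from the diff above can be sketched as a self-contained function: a dot product accumulated with foldRight, where a NaN feature is replaced by its column mean only when mean imputation is enabled, and otherwise propagates into the result. The names follow the snippet; the standalone method is illustrative:

```scala
object ScoreSketch {
  def predict(data: Array[Double], weights: Array[Double],
              means: Array[Double], interceptor: Double,
              meanImputation: Boolean): Double =
    data.zip(weights).zipWithIndex.foldRight(interceptor) {
      case (((d, w), i), acc) =>
        // Impute the column mean for a missing feature, or let NaN propagate.
        val x = if (meanImputation && d.isNaN) means(i) else d
        x * w + acc
    }
}
```

With features `[1.0, NaN]`, weights `[2.0, 3.0]`, means `[0.0, 4.0]` and intercept 1.0, imputation yields 1*2 + 4*3 + 1 = 15.0; without imputation the NaN propagates, which is the behavior being discussed here.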

@mdymczyk
Contributor Author

Good comments, thanks @vpatryshev !

@mdymczyk
Contributor Author

@vpatryshev @jakubhava @mmalohlava I made the appropriate changes according to Vlad's comments. Can I merge this, or is something still wrong? I'd like to use it for GM clustering as well.

@mmalohlava
Member

@mdymczyk can you please update the branch with the latest master?

@jakubhava
Contributor

jakubhava commented Oct 13, 2016

I was thinking whether it wouldn't be better to split this PR into two: the first with the updated test infrastructure and the second with the implementation itself, to keep PRs focused on one thing.

Just an idea.

@mdymczyk
Contributor Author

@mmalohlava sorry for the delay, completely missed that comment ;-)

@jakubhava I thought about it, but since the whole test change isn't that big and is only there because of the NA handling feature, I thought it would be OK. I can split it into two PRs if you guys want, but it seemed like overkill.

@mdymczyk
Contributor Author

mdymczyk commented Nov 1, 2016

Will merge it tomorrow if there are no further complaints @jakubhava @vpatryshev @mmalohlava. I need it for several other branches that are blocked by this; I could rebase them off of this branch, but then this would have to get merged too anyway...

@jakubhava
Contributor

@mdymczyk I will have a look at it today (in an hour or so), sorry for the delay.

@jakubhava (Contributor) left a comment:

Looks good to me, just pointed out a few minor changes/ideas

@@ -51,26 +52,51 @@ trait SharedSparkTestContext extends SparkTestContext { self: Suite =>
super.afterAll()
}

def buildChunks[T: ClassTag](fname: String, data: Array[T], cidx: Integer, h2oType: Array[Byte]): Chunk = {
def makeH2OFrame[T: ClassTag](fname: String, colNames: Array[String], chunkLayout: Array[Long],
Contributor:

We could use the TestFrameBuilder from H2O (h2oai/h2o-3#417) instead of all of this, once it's merged. I wouldn't do it as part of this PR (so we can finally merge), but I would create a JIRA for later. It's a minor thing, but it would be better to reuse code instead of duplicating it.

Contributor Author:

will add a TODO, cool

Contributor:

I think a JIRA would be better; it's easier to track than TODOs in the code. It could also be a more generic JIRA: replace all the code we use to create small frames with this builder.


 * numerical values.
 *
 * @param frame Input frame to be converted
 * @param _response_column Column which contains the labels

@jakubhava (Contributor), Nov 1, 2016:

Does it contain underscores on purpose? It's a really minor thing that caught my eye.

Contributor Author:

Autogenerated by IntelliJ, since that's the convention in H2O. I can change it to drop the underscore; I don't have strong opinions myself.

@jakubhava (Contributor), Nov 3, 2016:

It's good to be consistent, but if that's the convention in H2O, let's keep it like that 👍


object FrameMLUtils {
  /**
   * Converts an H2O Frame into an RDD[LabeledPoint]. Assumes that the last column is the response column, which will be used
Contributor:

I'm wondering whether we could put this code into the converters, so that when the user calls h2oContext.asRDD[LabeledPoint](h2oFrame) this would get executed. It's built on top of the internals, but it would be nice to have it as a normal conversion instead of having to call helper methods. But again, we can create a JIRA for it and work on this change later so we don't block you. What do you think @mdymczyk?

Contributor Author:

Yes, making this more robust, maybe even truly user-facing, is a good idea, since MLlib uses LabeledPoint a lot. I'll create a JIRA for it, since it will need a bit more thought.

}

(trainingRDD.map(row => {
val features = new Array[Double](nfeatures)
Contributor:

Instead of these 2 lines I like this more:

val features = (0 until nfeatures).map { i =>
  if (row.isNullAt(i)) means(i) else toDouble(row.get(i), fields(i), domains(i))
}.toArray[Double]

It seems more functional to me, since we set the features right away, but it's probably just a matter of taste.

Contributor Author:

👍

@jakubhava
Contributor

Can you please squash those 2 commits? Also, is the sparkling_water_yarn branch passing? If yes, I'm merging 👍

@mmalohlava mmalohlava merged commit 833ae04 into master Nov 3, 2016
@mmalohlava mmalohlava deleted the MD_svm_missing_vals branch November 3, 2016 23:02
mmalohlava pushed a commit that referenced this pull request Nov 4, 2016
@Tagar Tagar mentioned this pull request Mar 16, 2018