
[SWPRIVATE-16] NA handling for Spark algorithms #75

Merged
merged 1 commit into from Nov 3, 2016

Conversation

mdymczyk
Contributor

@mdymczyk mdymczyk commented Sep 6, 2016

  • added NA value handling for SVM: mean imputation, skip, and not allowed (the old behaviour, which raises an error)
  • means are calculated via an RDD, as H2OFrame.means() doesn't support enum columns
  • changed SVM model scoring/POJO generation to include column means when running in MeanImputation mode
  • refactored tests a bit (frame creation now supports multi-column frames)
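The three NA strategies listed above can be sketched in plain Scala. This is illustrative only, not the actual Sparkling Water implementation: a Seq stands in for the RDD, and the names (MissingValuesHandling, MeanImputation, Skip, NotAllowed) simply follow the PR's terminology.

```scala
object NaHandlingSketch {
  sealed trait MissingValuesHandling
  case object MeanImputation extends MissingValuesHandling
  case object Skip extends MissingValuesHandling
  case object NotAllowed extends MissingValuesHandling

  // Per-column means computed while ignoring NaNs, the way a distributed
  // aggregate over an RDD would (sum and count per column).
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    val ncols = rows.head.length
    (0 until ncols).map { i =>
      val vs = rows.map(_(i)).filterNot(_.isNaN)
      if (vs.isEmpty) Double.NaN else vs.sum / vs.size
    }.toArray
  }

  def handleNAs(rows: Seq[Array[Double]],
                strategy: MissingValuesHandling): Seq[Array[Double]] =
    strategy match {
      case NotAllowed =>
        // old behaviour: fail fast if any value is missing
        require(!rows.exists(_.exists(_.isNaN)),
          "Training frame cannot contain any missing values")
        rows
      case Skip =>
        // drop the whole row, not individual columns
        rows.filterNot(_.exists(_.isNaN))
      case MeanImputation =>
        val means = columnMeans(rows)
        rows.map(_.zipWithIndex.map { case (v, i) =>
          if (v.isNaN) means(i) else v
        })
    }
}
```

With two rows `[1.0, NaN]` and `[3.0, 4.0]`, Skip keeps only the second row, while MeanImputation replaces the NaN with 4.0 (the mean of the non-missing values in that column).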

@mdymczyk mdymczyk changed the title [WIP][SWPRIVATE-16] NA handling for Spark algorithms [SWPRIVATE-16] NA handling for Spark algorithms Sep 7, 2016
@vpatryshev
Contributor

I think we need some kind of coordination here.

@mdymczyk
Contributor Author

@vpatryshev are you referring to your refactoring? I'm OK with adopting your changes, since most of this commit is just moving things around in tests.

@vpatryshev
Contributor

Yes, this is a good step forward, but it partially overlaps with the data handling fixes that are already waiting for merge in another pull request. I'd rather line these things up; otherwise we will be committing the same stuff, with code chunks fighting each other. Michal, any updates?

@jakubhava
Contributor

I would rather merge @vpatryshev's branch first. I've done the review already and I'm fine with merging right away; I'm just waiting for @mmalohlava's approval, and then I would put @mdymczyk's changes on top of that.

@vpatryshev's changes seem like a good base for the rest of the code to build on. They unify things and will also play nicely with the external backend.

Would it be complicated to rebase your changes onto @vpatryshev's branch, @mdymczyk?

But let's also wait for Michal's response.

thanks!

@mdymczyk
Contributor Author

mdymczyk commented Sep 13, 2016

Sure, merge @vpatryshev's changes first and I'll rebase this branch onto them, no problem.

I'd just like to get this and the progress-bar PRs into master, as I'll be using them for the next MLlib algorithms. I also want to write a blog post about SVM in Sparkling Water with screenshots, and it would be much easier for me to have all the features in one branch.

@vpatryshev
Contributor

Yes, same here.

Also, I had a talk with Arno, and the general idea of returning 0 when an integer value is missing was considered wrong, so I committed another change taking care of these issues.

Anyway, that's why I did not review @mdymczyk's change; it may turn out that most of it is already implemented.
Sorry that we did not communicate enough here beforehand.

I actually have two more issues after talking with Arno.

First, the loop over absolute row indexes is not efficient; a nested loop over chunks, with relative indexes inside, would make more sense. This is just a performance concern, and the actual impact may be negligible, but it is documented behavior.

Second, how come UUID and Enum are not handled as such, but converted to strings? We could just as well keep them.

(And a third one: what's the fuss with UTF8String? Do we need to carry these around in Scala alongside regular strings, or should we rather keep one kind, either way, with implicit conversions where needed?)


@mdymczyk
Contributor Author

@vpatryshev no problems.

  1. True, I also saw the doc noting that looping through rows is not very efficient, but it seemed much easier, so I went with it before trying something else. If it's indeed too slow, we can change it.

  2. If you're referring to converting H2O's UUIDs to String before putting them in a DataFrame/RDD, then I guess it's because there is no UUID DataType in Spark, and the same goes for enums. We could do it for RDDs (at least for UUID; I'm not sure how you'd create enums on the fly), but I don't think it would be possible with DataFrames, and we'd lose consistency. That's my guess; we'd need to ask @mmalohlava.

  3. UTF8String should only be used in Spark's SQL module. It's what Spark uses internally for string types; we should not use it in our code outside of that.

@mdymczyk
Contributor Author

@vpatryshev @jakubhava how's the other refactoring going? Do you guys need any help/reviews there?

@jakubhava
Contributor

I'm happy with @vpatryshev's PR in its current state; just waiting for @mmalohlava's approval and merge.

@vpatryshev
Contributor

I'm still waiting for the merge of that previous PR.


@jakubhava
Contributor

@mdymczyk I've already integrated the things I wanted into master. Could you please rebase on top of master, fix the problems, and then let me know? I'll have a look and do the review afterwards.

case DataTypes.IntegerType => value.asInstanceOf[Integer].doubleValue
case DataTypes.DoubleType => value.asInstanceOf[Double]
case DataTypes.StringType => domain.indexOf(value)
case _ => throw new IllegalArgumentException("Target column has to be an enum or a number. " + fieldStruct.toString)
Contributor:

Why do you want .toString here? It will be converted automatically, and it will crash on null.
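The point being made here is a general Scala/Java one: string concatenation converts the operand itself (null becomes "null"), while an explicit .toString throws a NullPointerException on null. A minimal sketch (the method name is illustrative, not from the PR):

```scala
object ToStringNull {
  // Concatenation with + is null-safe: it renders null as the string "null".
  def describe(fieldStruct: Any): String =
    "Target column has to be an enum or a number. " + fieldStruct

  // By contrast, "..." + fieldStruct.toString would throw a
  // NullPointerException when fieldStruct is null.
}
```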

agg1
}

private[ml] def toDouble(value: Any, fieldStruct: StructField, domain: Array[String]): Double = {
Contributor:

This is not very type safe; why not check value for its type with match/case? E.g.

value match {
  case b: Byte if fieldStruct.dataType == DataTypes.ByteType => b.doubleValue
  // etc.
}

And I would make it less verbose.
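A self-contained sketch of that suggestion, matching on the runtime type of `value` rather than on the declared Spark DataType. This is plain Scala without the Spark imports; the real method also takes a StructField, and the default domain parameter here is an illustrative simplification:

```scala
object ToDoubleSketch {
  // Type-driven conversion: let pattern matching establish the type,
  // instead of casting based on an externally supplied DataType.
  def toDouble(value: Any, domain: Array[String] = Array.empty): Double =
    value match {
      case b: Byte   => b.toDouble
      case i: Int    => i.toDouble
      case l: Long   => l.toDouble
      case f: Float  => f.toDouble
      case d: Double => d
      case s: String => domain.indexOf(s).toDouble // enum: index in the domain
      case other     => throw new IllegalArgumentException(
        "Target column has to be an enum or a number, got: " + other)
    }
}
```

The compiler guarantees each branch only sees a value of the matched type, so no unchecked asInstanceOf casts are needed.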

Contributor Author:

Really like this one, thanks!

String vecName = _train.name(i);
if (vec.naCnt() > 0 && (null == _parms._ignored_columns || Arrays.binarySearch(_parms._ignored_columns, vecName) < 0)) {
error("_train", "Training frame cannot contain any missing values [" + vecName + "].");
if(MissingValuesHandling.NotAllowed.equals(_parms._missing_values_handling)) {
Contributor:

You are comparing enums, right? Won't == work here?

@vpatryshev (Contributor) left a comment:

Added some comments; nothing essential, but...

training.cache();

if(training.count() == 0 && MissingValuesHandling.Skip.equals(_parms._missing_values_handling)) {
Contributor:

You are comparing enums, right? Won't == work here?

@@ -28,6 +31,7 @@ object SVMModel {
var interceptor: Double = .0
var iterations: Int = 0
var weights: Array[Double] = null
var numMeans: Array[Double] = null
Contributor:

Using nulls in Scala is not good practice at all; if a value is optional, use Option.
Using vars... well... those may be harder to eliminate.
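The Option-over-null advice applied to the fields in the diff above might look like this. This is only a sketch of the idea, not the actual SVMModel output class (which, as discussed below, has to interoperate with H2O's Java model infrastructure):

```scala
// Hypothetical reworking of the null-initialized vars from the diff:
class SVMOutputSketch {
  var weights: Option[Array[Double]] = None
  var numMeans: Option[Array[Double]] = None

  // Absence is now explicit at the type level: callers must decide what
  // to do when the value is missing instead of risking a NullPointerException.
  def meanOrZero(i: Int): Double = numMeans.map(_(i)).getOrElse(0.0)
}
```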

Contributor Author:

True, but it has to work with the underlying core H2O model for model outputs (that's also why I had to write parts of the code in Java; the underlying framework requires several completely different constructors). Not sure if Option is an... option here :-) But I will look into it.

val pred =
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) => d * w + acc}
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) =>
if(meanImputation && lang.Double.isNaN(d)) {
Contributor:

So, if it's not meanImputation, it's okay to multiply by d?

Contributor Author:

Yes, it will be a NaN in that case; I can keep that or throw an exception. Skipping NaN values doesn't make much sense computation-wise (MissingValuesHandling.Skip during training skips the whole row if it contains NaNs, not individual columns).
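The scoring logic from the diff above can be sketched as a self-contained function: a dot product accumulated with foldRight, where a NaN feature is replaced by its column mean only when mean imputation is enabled, and otherwise propagates into the result. The names follow the snippet; the standalone method is illustrative:

```scala
object ScoreSketch {
  def predict(data: Array[Double], weights: Array[Double],
              means: Array[Double], interceptor: Double,
              meanImputation: Boolean): Double =
    data.zip(weights).zipWithIndex.foldRight(interceptor) {
      case (((d, w), i), acc) =>
        // Impute the column mean for a missing feature, or let NaN propagate.
        val x = if (meanImputation && d.isNaN) means(i) else d
        x * w + acc
    }
}
```

With features `[1.0, NaN]`, weights `[2.0, 3.0]`, means `[0.0, 4.0]` and intercept 1.0, imputation yields 1*2 + 4*3 + 1 = 15.0; without imputation the NaN propagates, which is the behavior being discussed here.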

@mdymczyk
Contributor Author

Good comments, thanks @vpatryshev !

@mdymczyk
Contributor Author

@vpatryshev @jakubhava @mmalohlava I made the appropriate changes according to Vlad's comments. Can I merge this, or is something still wrong? I'd like to use it for GM clustering as well.

@mmalohlava
Member

@mdymczyk can you please update the branch with the latest master?

@jakubhava
Contributor

jakubhava commented Oct 13, 2016

I was thinking whether it wouldn't be better to split this PR into two: the first with the updated test infrastructure and the second with the implementation itself, to keep PRs focused on one thing.

Just an idea.

@mdymczyk
Contributor Author

@mmalohlava sorry for the delay, completely missed that comment ;-)

@jakubhava I thought about it, but since the whole test change isn't that big and is only there because of the NA handling feature, I thought it would be OK. I can split it into two PRs if you guys want, but it seemed like overkill.

@mdymczyk
Contributor Author

mdymczyk commented Nov 1, 2016

Will merge it tomorrow if there are no further complaints @jakubhava @vpatryshev @mmalohlava. I need it for several other branches that are blocked by this; I could rebase them off of this branch, but then this would have to get merged too anyway...

@jakubhava
Contributor

@mdymczyk I will have a look at it today (in an hour or so), sorry for the delay.

@jakubhava (Contributor) left a comment:

Looks good to me, just pointed out a few minor changes/ideas

@@ -51,26 +52,51 @@ trait SharedSparkTestContext extends SparkTestContext { self: Suite =>
super.afterAll()
}

def buildChunks[T: ClassTag](fname: String, data: Array[T], cidx: Integer, h2oType: Array[Byte]): Chunk = {
def makeH2OFrame[T: ClassTag](fname: String, colNames: Array[String], chunkLayout: Array[Long],
Contributor:

We could use the TestFrameBuilder from H2O (h2oai/h2o-3#417) instead of all of this, once it's merged. I wouldn't do it as part of this PR (so we can finally merge), but I would create a JIRA for later. It's a minor thing, but it would be better to reuse code instead of duplicating it.

Contributor Author:

will add a TODO, cool

Contributor:

I think a JIRA would be better; it's easier to track than TODOs in the code. It could also be a more generic JIRA: replace all the code we use to create small frames with this builder.


 * numerical values.
 *
 * @param frame Input frame to be converted
 * @param _response_column Column which contains the labels

@jakubhava (Contributor), Nov 1, 2016:

Does it contain underscores on purpose? It's a really minor thing that caught my eye.

Contributor Author:

Autogenerated by IntelliJ, since that's the convention in H2O. I can change it to drop the underscore; I don't have strong opinions myself.

@jakubhava (Contributor), Nov 3, 2016:

It's good to be consistent, but if that's the convention in H2O, let's keep it like that 👍


object FrameMLUtils {
  /**
   * Converts an H2O Frame into an RDD[LabeledPoint]. Assumes that the last column is the response column, which will be used
Contributor:

I'm wondering whether we could put this code into the converters, so that when the user calls h2oContext.asRDD[LabeledPoint](h2oFrame) this would get executed. It's built on top of the internals, but it would be nice to have it as a normal conversion instead of having to call helper methods. But again, we can create a JIRA for it and work on this change later so we don't block you. What do you think @mdymczyk?

Contributor Author:

Yes, making this more robust, maybe even truly user-facing, is a good idea, since MLlib uses LabeledPoint a lot. I'll create a JIRA for it, since it will need a bit more thought.

}

(trainingRDD.map(row => {
val features = new Array[Double](nfeatures)
Contributor:

Instead of these 2 lines I like this more:

val features = (0 until nfeatures).map { i =>
  if (row.isNullAt(i)) means(i) else toDouble(row.get(i), fields(i), domains(i))
}.toArray[Double]

It seems more functional to me, since we set the features right away, but it's probably just a matter of taste.

Contributor Author:

👍

@jakubhava
Contributor

Can you please squash those 2 commits? Also, is the sparkling_water_yarn branch passing? If yes, I'm merging 👍

@mmalohlava mmalohlava merged commit 833ae04 into master Nov 3, 2016
@mmalohlava mmalohlava deleted the MD_svm_missing_vals branch November 3, 2016 23:02
mmalohlava pushed a commit that referenced this pull request Nov 4, 2016
@Tagar Tagar mentioned this pull request Mar 16, 2018