New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SWPRIVATE-16] NA handling for Spark algorithms #75
Conversation
mdymczyk
commented
Sep 6, 2016
•
edited
edited
- added NA value handling for SVM: mean imputation, skip, not allowed (old variant with an error message)
- means are calculated via RDD as H2OFrame.means() doesn't support enum columns
- changed the SVM model scoring/pojo generation to include column means when running in MeanImputation mode
- refactored tests a bit (frame creation now supports multi column frames)
488648b
to
2f252ba
Compare
I think we need some kind of coordination here. |
@vpatryshev you are referring to your refactoring? I'm ok with adopting your changes since most of the stuff in this commit is just moving things around in tests. |
Yes, this is a good step forward, but it partially overlaps with the data handling fixes that are waiting for the merge, there's a pull request. I'd rather line up these things, I don't know, otherwise we would be committing the same stuff, code chunks fighting with each other. Michal, any updates? |
I would rather merge first @vpatryshev branch. I've done the review already and I'm ok with merging right away, just waiting for @mmalohlava approval and then I would put @mdymczyk behaviour on top of that. @vpatryshev's changes seems like good base for the rest of the code build on top of that, It unifies stuff and also will play nicely with the external backend. Would it be complicated to put your changes on @vpatryshev's branch @mdymczyk ? But let's also wait for Michal response. thanks! |
Sure, first merge @vpatryshev changes and I'll rebase them onto this branch, no problem. I'd just like to get this and the progress bar PRs into master as I'll be using them for the next MLlib algorithms + I want to write a blog post about SVM in SW with screenshots and it would be much easier for me to get all the features in one branch. |
Yes, same here. Also, I had a talk with Arno, and so the general idea of returning 0 if an Anyway, that's why I did not review @mdymczyk's change; it can happen that I actually have two more issues after talking with Arno. First, the loop via absolute row indexes is not efficient; a nested loop, Second, how come UUID and Enum are not handled as such, but converted to (Or, and #3 - what's the fuss with UTF8String? Do we need, in Scala, to Thanks, On Tue, Sep 13, 2016 at 1:45 AM, Jakub Háva notifications@github.com
|
@vpatryshev no problems.
|
@vpatryshev @jakubhava how's the other refactoring going? Do you guys need any help/reviews there? |
I'm happy with @vpatryshev's PR in the current state, just waiting for @mmalohlava approval and merge |
I'm still waiting for the merge of that previous PR. Thanks, On Mon, Sep 19, 2016 at 4:01 AM, Jakub Háva notifications@github.com
|
2f252ba
to
10b6b64
Compare
@mdymczyk I integrated the things I wanted into the master already. Could you please rebase on top of master, fix the problems and then let me know ? I will have a look and do review afterwards. |
case DataTypes.IntegerType => value.asInstanceOf[Integer].doubleValue | ||
case DataTypes.DoubleType => value.asInstanceOf[Double] | ||
case DataTypes.StringType => domain.indexOf(value) | ||
case _ => throw new IllegalArgumentException("Target column has to be an enum or a number. " + fieldStruct.toString) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you want to have .toString here? It'll convert automatically. And it will crash on null.
agg1 | ||
} | ||
|
||
private[ml] def toDouble(value: Any, fieldStruct: StructField, domain: Array[String]): Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not very type safe; why not check value for its type, in match/case?
E.g. value match {
case b; Byte if fieldStruct.dataType == DataTypes.ByteType => b.doubleValue
... etc ...
And I would make it less verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really like this one, thanks!
String vecName = _train.name(i); | ||
if (vec.naCnt() > 0 && (null == _parms._ignored_columns || Arrays.binarySearch(_parms._ignored_columns, vecName) < 0)) { | ||
error("_train", "Training frame cannot contain any missing values [" + vecName + "]."); | ||
if(MissingValuesHandling.NotAllowed.equals(_parms._missing_values_handling)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are comparing enums, right? Will == not work here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments; not very essential, but...
training.cache(); | ||
|
||
if(training.count() == 0 && MissingValuesHandling.Skip.equals(_parms._missing_values_handling)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are comparing enums, right? Will == not work here?
@@ -28,6 +31,7 @@ object SVMModel { | |||
var interceptor: Double = .0 | |||
var iterations: Int = 0 | |||
var weights: Array[Double] = null | |||
var numMeans: Array[Double] = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using nulls in Scala is not a good practice at all. If a value is optional, use Option.
Using vars... well... it may be harder to eliminate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True but it has to work with the underlying core H2O model for Model Outputs (that's also why I had to write parts of code in Java as the underlying framework requires several completely different constructors) - not sure if Option is an... option here :-) But will look into it.
val pred = | ||
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) => d * w + acc} | ||
data.zip(_output.weights).foldRight(_output.interceptor){ case ((d, w), acc) => | ||
if(meanImputation && lang.Double.isNaN(d)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if it's not meanImputation, it's okay to multiply by d?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it will be a NaN in that case, I can do that or throw an exception. Skipping NaN values doesn't make much sense computation wise (the MissingValuesHandling.Skip during training is skipping the whole row if it has NaNs, not singular columns).
Good comments, thanks @vpatryshev ! |
10b6b64
to
ec2039d
Compare
@vpatryshev @jakubhava @mmalohlava made appropriate changes according to Vlad's comments, can I merge this in or is something still wrong? Would like to use it also for GM clustering. |
@mdymczyk can you please update the branch with the latest master? |
I was thinking if it wouldn't be better to separate this PR to 2. The first one with updated test infrastructure and the second one with the implementation itself. To keep PRs really focused on one thing. Just an idea. |
ec2039d
to
2ecc73f
Compare
@mmalohlava sorry for the delay, completely missed that comment ;-) @jakubhava I was thinking about it but since the whole test change isn't that big and is there only because of the NA handling feature I thought it would be ok. I can separate it into 2 PRs if you guys want but just thought it would be an overkill. |
Will merge it tomorrow if there are no further complaints @jakubhava @vpatryshev @mmalohlava Need it for several other branches which are blocked by this. Could rebase off of this but then this would've to get merged too anyway... |
@mdymczyk I will have a look on it today ( in hour or so), sorry for the delay |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, just pointed out a few minor changes/ideas
@@ -51,26 +52,51 @@ trait SharedSparkTestContext extends SparkTestContext { self: Suite => | |||
super.afterAll() | |||
} | |||
|
|||
def buildChunks[T: ClassTag](fname: String, data: Array[T], cidx: Integer, h2oType: Array[Byte]): Chunk = { | |||
def makeH2OFrame[T: ClassTag](fname: String, colNames: Array[String], chunkLayout: Array[Long], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use this TestFrameBuilder h2oai/h2o-3#417 from H2O once it's merged instead all of this. I wouldn't it as part of this PR ( so we can finally merge ), but I would just create a JIRA for later. It's minor thing, but it would be better to reuse code instead of having duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add a TODO, cool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think JIRA would be better, it's easier to track than to track TODOs in the code. It's also more generic JIRA - replace all code which we use to create small frames with this builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* numerical values. | ||
* | ||
* @param frame Input frame to be converted | ||
* @param _response_column Column which contains the labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it contain underscores on purpose ? It's really minor thing which caught my eye
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autgenerated with intellij since that's the convention in H2O, can change it to without - don't have any strong opinions myself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's good to be consistent, but if that's convention in H2O, let's keep it like that 👍
|
||
object FrameMLUtils { | ||
/** | ||
* Converts a H2O Frame into an RDD[LabeledPoint]. Assumes that the last column is the response column which will be used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking whether we could put this code into the converters. So when the user calls h2oContext.asRDD[LabeledPoint](h2oFrame)
this would get executed. It's build on top of the internals but it would be nice to have this as normal conversion instead of having to call helper methods. But again, we can do JIRA for it and work on this change later so we don't block you. What do you think @mdymczyk ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes making this more robust, maybe even truly user facing, since MLlib uses LabeledPoint a lot, is a good idea - will make a JIRA for it since in that case it will need a bit more thought
} | ||
|
||
(trainingRDD.map(row => { | ||
val features = new Array[Double](nfeatures) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of these 2 lines I like this more
val features = (0 until nfeatures).map{ i => if (row.isNullAt(i)) means(i) else toDouble(row.get(i), fields(i), domains(i)) }.toArray[Double]
It seems more functional to me since we set the features right away, but it's probably just matter of taste
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
2ecc73f
to
880e2cb
Compare
Can you pls squash those 2 commits ? Also is sparkling_water_yarn branch passing ? If yes, I'm merging 👍 |
880e2cb
to
47fe4c4
Compare
47fe4c4
to
037f48c
Compare
037f48c
to
0ebb709
Compare
(cherry picked from commit 833ae04)