-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PUBDEV-4266: Adds support for converting word2vec model to a Frame #935
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Michale, I like the idea - do you want to have it fully automatic via a dedicate JVM SPIs?
Or should AbstractRegister
do that?
But i really like it, we were thinking about Rapids extensibility for long time, nobody tried that, you are the first one!
|
||
public class RegisterAlgos extends water.api.AbstractRegister { | ||
// Register the algorithms and their builder handlers: | ||
@Override public void register(String relativeResourcePath) throws ClassNotFoundException { | ||
// Register algorithm-specific Rapids | ||
RapidsInit.registerAlgoRapids(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do that via SPI automatically? Or do you have reason to register them explicitly?
public class RapidsInit { | ||
|
||
public static void registerAlgoRapids() { | ||
Env.init(new AstWord2VecToFrame()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more:
If we go through SPI way, we need some interface/abstract class here, and something like EnvContext, which would permit registration:
public class W2VRapidsInit implements RegisterRapids {
@Override public void registerAlgoRapids(EnvContext envContext) {
envContext.init(new AstWord2VecToFrame());
}
}
where EnvContext is just dummy class:
class DummyEnvCOntext implements EnvContext {
@Override public void init(AstOp astOp) {
Env.init(astOp); // static call
}
}```
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback. I made this just to have something quickly.
We can use SPI however we need to keep in mind that the Rapids depend on availability of the algo (which means the corresponding ModelBuilder needs to be registered).
I don't really like the way I implemented this. I think a little cleaner way is to let ModelBuilder to return the rapids (or factory for them) for the particular algorithm (and register them in the loop in RegisterAlgos).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, agree - we need to connect registration of rapids operations with registration of the algo.
5c5427c
to
63b5495
Compare
} | ||
|
||
private static class ConvertToFrameTask extends MRTask<ConvertToFrameTask> { | ||
private Word2VecModel _model; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The model key
should be enough no? and then pre-fetch the model in setupLocal
.
(Note: i am not sure now, if there is optimization in serialization layer which replaces models by its key).
Furthermore, if you decide to go with Model then we can replace _model
reference by null
at map call (to avoid transfer of model back).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Silly mistake, thanks for catching that. Fixed.
@@ -288,6 +290,9 @@ static void init(AstPrimitive ast, String name) { | |||
init(new AstSeq()); | |||
init(new AstSeqLen()); | |||
|
|||
// Custom (eg. algo-specific) | |||
for (AstPrimitive prim : PrimsService.INSTANCE.getAllPrims()) | |||
init(prim); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks Michal for that! btw: later i will need to document all extension points we are having (tracked here; https://0xdata.atlassian.net/browse/PUBDEV-4272)
* PrimService manages access to non-core Rapid primitives. | ||
* This includes algorithm specific rapids & 3rd party rapids. | ||
*/ | ||
class PrimsService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
:returns: a frame representing learned word embeddings. | ||
""" | ||
return h2o.H2OFrame._expr(expr=ExprNode("word2vec.to.frame", self)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happened if the backend rapids is not available? Should be check for that at client level and report it? I am planning to expose capability API as part of modularization (https://0xdata.atlassian.net/browse/PUBDEV-4280)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capability API makes sense, it should be applied on a larger unit, eg. for the whole word2vec algo. Rapid "word2vec.to.frame" should always be available if word2vec itself is available. I think we don't need to check in individual methods like in this one.
We could probably check just in the constructor of H2OWord2vecEstimator, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, agree, cap are high-level units.
We could probably check just in the constructor of H2OWord2vecEstimator, what do you think?
yes
@h2o-ops please test! |
@h2o-ops please test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@h2o-ops please test |
@h2o-ops please test! |
This PR introduces support for converting word2vec model into a Frame.
It uses algo-specific rapids and introduces a new rapids discovery mechanism. Rapids now support working with models
eg.:
would convert a word2vec model to an H2O frame