Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUBDEV-4266: Adds support for converting word2vec model to a Frame #935

Merged
merged 2 commits into from
Apr 14, 2017

Conversation

michalkurka
Copy link
Contributor

@michalkurka michalkurka commented Mar 21, 2017

This PR introduces support for converting word2vec model into a Frame.

It uses algo-specific rapids and introduces a new rapids discovery mechanism. Rapids now support working with models

eg.:

fr <- .newExpr("word2vec.to.frame", w2v.model@model_id)

would convert a word2vec model to an H2O frame

Copy link
Member

@mmalohlava mmalohlava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michale, I like the idea - do you want to have it fully automatic via a dedicate JVM SPIs?

Or should AbstractRegister do that?

But i really like it, we were thinking about Rapids extensibility for long time, nobody tried that, you are the first one!


public class RegisterAlgos extends water.api.AbstractRegister {
// Register the algorithms and their builder handlers:
@Override public void register(String relativeResourcePath) throws ClassNotFoundException {
// Register algorithm-specific Rapids
RapidsInit.registerAlgoRapids();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do that via SPI automatically? Or do you have reason to register them explicitly?

public class RapidsInit {

public static void registerAlgoRapids() {
Env.init(new AstWord2VecToFrame());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more:

If we go through SPI way, we need some interface/abstract class here, and something like EnvContext, which would permit registration:

public class W2VRapidsInit implements RegisterRapids {
  @Override public  void registerAlgoRapids(EnvContext envContext) {
    envContext.init(new AstWord2VecToFrame());
  }
}

where EnvContext is just dummy class:

class DummyEnvCOntext implements EnvContext {
  @Override public void init(AstOp astOp) {
     Env.init(astOp); // static call
  }
}```

WDYT?

Copy link
Contributor Author

@michalkurka michalkurka Mar 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback. I made this just to have something quickly.

We can use SPI however we need to keep in mind that the Rapids depend on availability of the algo (which means the corresponding ModelBuilder needs to be registered).

I don't really like the way I implemented this. I think a little cleaner way is to let ModelBuilder to return the rapids (or factory for them) for the particular algorithm (and register them in the loop in RegisterAlgos).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, agree - we need to connect registration of rapids operations with registration of the algo.

@michalkurka michalkurka changed the title WIP/Proposal, a mechanism for supporting model-specific Rapids PUBDEV-4266: Adds support for converting word2vec model to a Frame Apr 10, 2017
}

private static class ConvertToFrameTask extends MRTask<ConvertToFrameTask> {
private Word2VecModel _model;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model key should be enough no? and then pre-fetch the model in setupLocal.
(Note: i am not sure now, if there is optimization in serialization layer which replaces models by its key).

Furthermore, if you decide to go with Model then we can replace _model reference by null at map call (to avoid transfer of model back).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silly mistake, thanks for catching that. Fixed.

@@ -288,6 +290,9 @@ static void init(AstPrimitive ast, String name) {
init(new AstSeq());
init(new AstSeqLen());

// Custom (eg. algo-specific)
for (AstPrimitive prim : PrimsService.INSTANCE.getAllPrims())
init(prim);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks Michal for that! btw: later i will need to document all extension points we are having (tracked here; https://0xdata.atlassian.net/browse/PUBDEV-4272)

* PrimService manages access to non-core Rapid primitives.
* This includes algorithm specific rapids & 3rd party rapids.
*/
class PrimsService {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏

:returns: a frame representing learned word embeddings.
"""
return h2o.H2OFrame._expr(expr=ExprNode("word2vec.to.frame", self))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will happened if the backend rapids is not available? Should be check for that at client level and report it? I am planning to expose capability API as part of modularization (https://0xdata.atlassian.net/browse/PUBDEV-4280)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capability API makes sense, it should be applied on a larger unit, eg. for the whole word2vec algo. Rapid "word2vec.to.frame" should always be available if word2vec itself is available. I think we don't need to check in individual methods like in this one.

We could probably check just in the constructor of H2OWord2vecEstimator, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, agree, cap are high-level units.

We could probably check just in the constructor of H2OWord2vecEstimator, what do you think?
yes

@michalkurka
Copy link
Contributor Author

@h2o-ops please test!

@michalkurka
Copy link
Contributor Author

@h2o-ops please test

Copy link
Member

@mmalohlava mmalohlava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@michalkurka
Copy link
Contributor Author

@h2o-ops please test

@michalkurka
Copy link
Contributor Author

@h2o-ops please test!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants