Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to specify algo parameters (e.g. monotonicity constraints) in H2O AutoML #4005

Merged
merged 16 commits into from
Oct 29, 2019

Conversation

sebhrusen
Copy link
Contributor

@sebhrusen sebhrusen commented Oct 22, 2019

https://0xdata.atlassian.net/browse/PUBDEV-6737

Current implementation restricts the parameters that can be assigned by the user to:

  • monotone_constraints as required by initial customer request.

Added a System property to allow overriding of any parameter: the plan is to use this for development/benchmarking.
Tuning algorithms, and running benchmarks against versions with different parameters is currently a hassle (recompile, rename built jar, upload it, keep reference...).
Having those parameters configurable from client will allow us to try more combinations faster (may also support hyperparameters override in the future for benchmarking as well).

stopping_criteria = new AutoMLStoppingCriteria();

// reasonable defaults:
stopping_criteria.set_max_models(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is -1 or 0 more common as a replacement for None/NULL for an integer-valued parameter on the backend? I have seen both... (which I guess means we can choose whichever one is "better"?).

Copy link
Contributor Author

@sebhrusen sebhrusen Oct 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ledell I didn't change that code, just moved it down: usually in Java, attributes are declared before constructor, I was just applying this "norm" here.

Regarding the -1/0 debate... there's no rule written in stone. -1 is commonly used for "unset" (although Java assigns 0 by default to int variables), and when there's no ambiguity like here, I tend to use 0 for "unlimited". Maybe it deserves a tiny comment next to it.

@ledell
Copy link
Contributor

ledell commented Oct 23, 2019

@seb-h2o Can you update the description to remove ntrees and to add algo_parameters instead?

My concern with algo_parameters is that if someone only wants to limit ntrees in GBMs, for example, there is no way to limit that to just GBMs. Since DRF and XGBoost also has an ntrees parameter, it will also limit it there too (from my understanding of your current implementation). I guess the only way to give further precision is to have it be a nested dict/list:

algo_parameters = {"GBM": {"ntrees": 1000, "learn_rate": 0.01}, "DRF": {"ntrees": 200}}

In general, GBMs require more (usually more shallow) trees than Random Forest, so I think it makes sense to allow the user to be precise in their configurations.

Copy link
Contributor

@deil87 deil87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Seb, great feature!

if (!addParameter(param.algo, param.name, param.value))
throw new H2OIllegalValueException(param.name, ROOT_PARAM, param.value);
}
return this;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seb-h2o usually build pattern returns different type so that we can't do chain calls anymore( SmthBuilder.add().add().build() -> Smth). With current approach we maybe don't need to return this at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I hesitated into turning this into a proper "builder" (that's also why I renamed the build method).
Mainly trying to keep it simple: not sure builders are justified most of the time.
When I see

 AutoMLCustomParameters algo_parameters;

my reflex is:

algo_parameters = new AutoMLCustomParameters();

not: "damn, constructor is private, any factory method? AutoMLCustomParameters.newInstance(...)? damn, what then? Oh found it new AutoMLCustomParametersBuilder()....build()"

So trying to combine the best of both:

algo_parameters = new AutoMLCustomParameters().add(...)
                                                                                         .add(...)
                                                                                         .end();

It's not without flaws though, conscious about this. Will reconsider and try to find sth more satisfying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ended up with AutoMLCustomParameters.create().add(...).build()
create returns an internal Builder instance`

return algo_parameter_names.get(algo.name());
}

public Model.Parameters getCustomParameters(Algo algo) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we actually get both default and custom parameters here.... naming is slightly confusing. Maybe getCustomisedDefaults?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this should be public anymore actually, which would make the naming less important.
getCustomDefaults has something oxymoronic in it that makes me like it somehow :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed, and lowered visibility (used in tests)

@sebhrusen
Copy link
Contributor Author

sebhrusen commented Oct 24, 2019

Can you update the description to remove ntrees and to add algo_parameters instead?

@ledell you mean in the JIRA ticket?

regarding the implementation, it is actually possible to set params for a specific algo using the following syntax (works both in R and Py):

algo_parameters = dict(
    ntrees=200,    # applies to all algos supporting ntrees
    GBM__ntrees=1000 .   # applies only to GBMs
)

Anyway, as soon as we're not sure about exposing ntrees, waiting for discussion with customer, I'll keep this feature hidden.
Remains the question of monotone_constraints: do we want to have the param at top level as Michal suggests?

@ledell
Copy link
Contributor

ledell commented Oct 24, 2019

@ledell you mean in the JIRA ticket?

@seb-h2o No I mean in this PR description (it still says ntrees).

Thanks for clarifying about how to apply the parameter to a particular algorithm. I remember seeing this algoname__paramname format before (was that here or do we use this format in another part of H2O?).

Can you list the strings that you accept for each algorithm? I want to make sure we are using consistent naming of the algorithms... I filed this JIRA recently because I noticed that H2O has several sets of algorithm names. It would be great to unify this across H2O moving forward.

Regarding algo_parameters in general -- I know that we had planned to allow users to fully customize the AutoML "plan" at some point (we had this on a roadmap, but it's not a JIRA yet). Would the algo_parameters parameter become redundant or conflicting at that point, or do you think that it will still provide value? I wonder how often people will want to fix a particular hyperparameter value across all the models, since the goal in AutoML is usually to try many values for a single hyperparameter...instead of just one. Just thinking out-loud.

@sebhrusen
Copy link
Contributor Author

sebhrusen commented Oct 24, 2019

@ledell

Thanks for clarifying about how to apply the parameter to a particular algorithm. I remember seeing this algoname__paramname format before (was that here or do we use this format in another part of H2O?).

It is the "standard" flat format used by scikit learn to specify nested parameters: I must say that I like it. When we have the possibility to avoid an additional level of depth without possibility of confusion, it brings only clarity I think.

Can you list the strings that you accept for each algorithm? I want to make sure we are using consistent naming of the algorithms...

AutoML is using the algo names listed in https://github.com/h2oai/h2o-3/blob/master/h2o-automl/src/main/java/ai/h2o/automl/Algo.java everywhere.
I also noticed that the parsing logic we use to extract params from the JSON body is case-sensitive with enums, meaning we usually can't use gbm but we have to use GBM.
I think we should be more lenient here and accept both: it's a very easy fix.

Regarding algo_parameters in general - I know that we had planned to allow users to fully customize the AutoML "plan" at some point

For me algo_parameters is useful when user wants to set 1 or 2 specific constraints on algos, not to define complete models, at least not at end API level (tbh, when exposing this feature, I've been wondering if we could not create a model designer on top of it, only client side: it would be easy to have a "custom steps" in Java that got only fed by some of those params, internally all "custom___xxx" params for example...), also thinking out loud.

But overall, I think the idea of full customization should be thought like DAI recipes, with a mini-language describing additional models, grids, feature engineering...
This needs to be designed top down.

This being said, after disabling ntrees, there will be only monotone_constraints allowed by default (as said in the description, I keep the possibility to fully enable the feature for dev/benchmark through a java / h2o.init parameter).
So the question remains if we want to keep monotone_constraints here only, or if expose it as a 1st level param in the API.

@sebhrusen
Copy link
Contributor Author

@ledell As all concerns on this PR have been resolved, I'm merging it.
See additional PR #4020 for the decision to expose monotone_constraints as a top level param in AutoML.

@sebhrusen sebhrusen merged commit 97d757c into master Oct 29, 2019
Tagar pushed a commit to Tagar/h2o-3 that referenced this pull request Nov 8, 2019
…) in H2O AutoML (h2oai#4005)

* Java API for custom params
* specs for custom params
* move some logic from schema to BuildSpec
* Py support for algo_parameters
* removed useless schema from AutoMLBuildSpec
* improved parsing logic for REST API polymorphic params
* R client support
* disabled ntrees, improved tests and addressed PR comments
@sebhrusen sebhrusen deleted the seb_pubdev-6737 branch March 16, 2020 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants