PUBDEV-7820: Add a topbasemodel attribute to AutoML #5259
Conversation
From our conversation, here's the updated, proposed API:
Python:
aml.get_best_model(algorithm="gbm", criterion="auc")
aml.get_best_model(criterion="auc")
aml.get_best_model(criterion="prediction_time")
R:
h2o.get_best_model(object = aml, algorithm = "gbm", criterion = "auc")
In R, we can restrict the allowable input to the supported algorithms in AutoML, and the criterion should be limited to the column names from the leaderboard. (I think the column names match up exactly with how they are specified in sort_metric, but let's double-check that, because there might be at least one inconsistency there; if so, we need to decide whether to use the existing column name or the argument name here.)
R & Python: We could consider adding support to a list of models, similar to how we do in h2o.explain()
, which supports: a list of H2O models, an H2OAutoML object or an H2OAutoML Leaderboard slice. If we support a list of models, then the allowable "algorithms" should include all supervised H2O models rather than the list of AutoML algorithms.
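Regarding the double-check of column names mentioned above, here is a quick way to do it (a sketch, assuming the aml object from the Python examples earlier in this comment): print the leaderboard's column names and compare them against the sort_metric options.

print(aml.leaderboard.columns)
# e.g. ['model_id', 'auc', 'logloss', 'aucpr', 'mean_per_class_error', 'rmse', 'mse'] for a binomial run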
Column names don't correspond exactly to sort_metric; e.g., column names are lower case, while in sort_metric some metrics are upper case. (I convert the input to lower case, so this isn't a problem.) When I get a criterion that isn't present in the leaderboard, I throw the following error:
If I'm not mistaken, this would also mean forcing the user to provide a "leaderboard" frame to this function in the list-of-models scenario. As for the H2OAutoML leaderboard slice - I can't think of a use case for getting a particular best model in a subset. Is there one, or did you just mention it to keep it similar to the h2o.explain() interface?
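For illustration of the criterion handling described above, a minimal sketch (the helper name and error message are hypothetical; the actual error text is not shown in this thread): the input is lower-cased before the lookup, and an unknown criterion raises an error listing the valid leaderboard columns.

def _validate_criterion(criterion, leaderboard_columns):
    # Lower-case the input so e.g. 'AUC' and 'auc' are treated the same.
    criterion = criterion.lower()
    if criterion not in leaderboard_columns:
        raise ValueError("Criterion '{}' is not present in the leaderboard. "
                         "Valid criteria: {}".format(criterion, ", ".join(leaderboard_columns)))
    return criterion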
h2o-r/h2o-package/R/automl.R (outdated)

leaderboard <- do.call(h2o.arrange, list(
    leaderboard,
    if (ascending) criterion else bquote(desc(.(criterion)))))
I am not happy with this: h2o.arrange(frame, ...) delays the evaluation of the ... arguments to mimic functions like dplyr::arrange, but it does so in the base-R way (not the rlang/tidyeval way that dplyr uses), which makes it hard to use non-interactively, e.g., using a variable to pick the column name to sort by.

One reason for quoting/delaying the evaluation is desc, which isn't defined.

AFAIK, using functions that behave like this (e.g., base::subset) in non-interactive code is usually frowned upon, so if there is another way I think it would be preferable, but I haven't found any yet.
One more question about the API: I think @ledell mentioned that using "top" instead of "best" would be better, but it might have been forgotten during the call. Should I change it to use "top"?
@tomasfryda I think the current name is fine. The suggestion to use "top" raises another question, actually: well, there could be ties... what do we do in this case?
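(One deterministic convention, sketched here as an assumption rather than the PR's decision: Python's sort is stable, so tied models keep their original leaderboard order, and taking the first row after sorting favors the originally higher-ranked model.)

rows = [("GBM_1", 0.98), ("DRF_1", 0.98), ("GLM_1", 0.95)]  # (model_id, auc)
best = sorted(rows, key=lambda r: r[1], reverse=True)[0]    # stable sort keeps GBM_1 ahead of DRF_1
print(best[0])  # GBM_1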
Please see my suggestions to avoid bringing backend logic into the client.
Sounds good, let's use the current name, get_best_model.
Hey @tomasfryda, one last comment on this: I noticed that in R, the default for …
@ledell Yes, the reason was that some would be irrelevant, but if we have that precedent I will modify it.
@tomasfryda Sounds good. See here for an example.
@ledell Thank you. I already did it, but with lowercase letters[1] (since that is how it is shown in the leaderboard); there is no problem making it uppercase, since it is case insensitive... I think the names should not be from … Should I use the names from the leaderboard (either lower case, or abbreviations in upper case), or should I use the names from …?

[1] 531eee4#diff-281765758396a67fb111de92d8f6845e4eaf27317e7849f16f35f078021ec149R619-R621
h2o-3/h2o-r/h2o-package/R/automl.R Line 112 in 1ed7747
@tomasfryda Yeah, this is another case of very old code which has weird (and annoying!) idiosyncrasies. I think it's best to just copy over the existing convention (upper case in R in places; see …)
Overall, looks great!
Just not a huge fan of redundancies in documentation, as it creates a risk of variations between method docs (lowercase here, camelcase there...).
Making a suggestion for Python; not sure about the best practices for R.
Those suggestions are not a blocker though; if agreed, they could be done in a follow-up ticket.
if algorithm is not None:
    if algorithm.lower() not in ("basemodel", "deeplearning", "drf", "gbm",
                                 "glm", "stackedensemble", "xgboost"):
Feel free to externalize the list of supported algos into a constant. We could then reuse it in the docstring as well, e.g.:

:param algorithm: One of "basemodel", or a member of :const:`H2OAutoMLBaseMixin.supported_algos`.

as well as in various places in the documentation, and avoid hard-to-maintain duplications.
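A minimal sketch of that suggestion (the names SUPPORTED_ALGOS and _check_algorithm are illustrative, not the PR's actual identifiers): one shared tuple backs both the validation and the documentation.

SUPPORTED_ALGOS = ("deeplearning", "drf", "gbm", "glm", "stackedensemble", "xgboost")

def _check_algorithm(algorithm):
    # Validate against the shared constant instead of a hard-coded literal.
    if algorithm is not None and algorithm.lower() not in ("basemodel",) + SUPPORTED_ALGOS:
        raise ValueError("algorithm must be 'basemodel' or one of: " + ", ".join(SUPPORTED_ALGOS))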
I completely agree with you - I hate duplications - but I don't like the solution you proposed. I modified the docstring and added the supported_algos constant, and then I tried to use it in Jupyter (shift+tab to show the documentation): the result was not expanded to the actual value of supported_algos but ended up exactly as it was written in the docstring, which is IMHO not very user friendly.
In Python, I could do something like:

def doc_format(**kwargs):
    def _(fun):
        fun.__doc__ = fun.__doc__.format(**kwargs)
        return fun
    return _

class H2OAutoMLBaseMixin:
    ...
    @doc_format(algos='", "'.join(supported_algos))
    def get_best_model(self, algorithm=None, criterion=None):
        """
        Get best model of a given family/algorithm for a given criterion from an AutoML object.

        :param algorithm: One of "basemodel", "{algos}".
        ...
        """
This way the user gets the information in a tooltip and doesn't have to look for some property somewhere in an internal class. Also, I have no idea whether something like this is possible in R, or whether there is a better, more idiomatic solution, so I will create a JIRA later today so I can explore it more.
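To see the effect, a short sketch using the doc_format decorator defined above (supported_algos is assumed here to be a module-level tuple): the placeholder is expanded when the function is defined, so tooltips and help() show the concrete list.

supported_algos = ("deeplearning", "drf", "gbm", "glm", "stackedensemble", "xgboost")

@doc_format(algos='", "'.join(supported_algos))
def get_best_model(algorithm=None, criterion=None):
    """:param algorithm: One of "basemodel", "{algos}"."""

print(get_best_model.__doc__)
# :param algorithm: One of "basemodel", "deeplearning", "drf", "gbm", "glm", "stackedensemble", "xgboost".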
doc_format looks good to me. You could add it to the metaclass utility module. (I also wanted to add a decorator there to extend docstrings in subclasses.)
For R: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html#do-repeat-yourself - in spite of the paragraph title, it looks like you can avoid repetition with the @eval tag.
Thanks for the tips! I added a reference to this discussion to the new JIRA.
h2o-py/h2o/automl/_base.py (outdated)

Available criteria:
  - Regression metrics: deviance, RMSE, MSE, MAE, RMSLE
  - Binomial metrics: AUC, logloss, AUCPR, mean_per_class_error, RMSE, MSE
  - Multinomial metrics: mean_per_class_error, logloss, RMSE, MSE, AUC, AUCPR
The following additional leaderboard information can also be used as a criterion:
  - 'training_time_ms': column providing the training time of each model in milliseconds (doesn't include the training of cross-validation models).
  - 'predict_time_per_row_ms': column providing the average prediction time by the model for a single row.
I don't like to duplicate information. Could we just mention that criterion can be any column name from the leaderboard with numerical values? This includes the metrics columns as well as the columns of the extended leaderboard, as described in :func:`h2o.automl.get_leaderboard`.
if criterion in ("training_time_ms", "predict_time_per_row_ms"):
    extra_cols.append(criterion)

leaderboard = h2o.automl.get_leaderboard(self, extra_columns=extra_cols)
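For context, a usage sketch of the extended leaderboard (assuming the aml object from the earlier examples): the extra columns can also be requested explicitly through the public helper.

lb = h2o.automl.get_leaderboard(aml, extra_columns=["training_time_ms", "predict_time_per_row_ms"])
print(lb.columns)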
btw, the circular dependency between the _base module and h2o.automl.get_leaderboard is not super clean, but we can improve this later. Technically, all static methods on H2OAutoML could be just _private functions in a separate module. Same for get_automl and get_leaderboard.
algorithm = c("any", "basemodel", "deeplearning", "drf", "gbm",
              "glm", "stackedensemble", "xgboost"),
criterion = c("AUTO", "AUC", "AUCPR", "logloss", "MAE", "mean_per_class_error",
              "deviance", "MSE", "predict_time_per_row_ms",
              "RMSE", "RMSLE", "training_time_ms")) {
Is it also possible to externalize some of those collections in R so that we can reuse them?

.h2o.automl_supported_algos = c("deeplearning", "drf", "gbm", "glm", "stackedensemble", "xgboost")
...
algorithm = c("any", "basemodel", .h2o.automl_supported_algos)

and something similar with the metrics/leaderboard columns. This would allow reuse in the h2o.automl logic as well.
Thank you!
h2o-py/h2o/automl/_base.py (outdated)

- Regression metrics: mean_residual_deviance, rmse, mse, mae, rmsle
- Binomial metrics: auc, logloss, aucpr, mean_per_class_error, rmse, mse
- Multinomial metrics: mean_per_class_error, logloss, rmse, mse, auc, aucpr
- Regression metrics: deviance, RMSE, MSE, MAE, RMSLE
I think in Python we have always been using lowercase everywhere (in examples, etc.). You should double-check that, but if that's the case, you can change this docstring to be lowercase.
@tomasfryda I didn't want to hold this up, so I approved, but please see my comment here.
Thanks @ledell, you are right. I checked the automl module, and at least there we use lower case, so I changed it to lower case.
* Initial add a topbasemodel attribute to automl implementation
* Make python best_models a function and in R preload the models during init
* Incorporate suggestions from meeting 2021/02/01
* Fix docstrings and match.arg in R for algorithm
* Incorporate Seb's suggestions
* Make algos and criterions case insensitive, improve tests
* Automatically infer what extra cols should be retrieved from criterion
* Incorporate more Seb's suggestions
* Update docstrings
* Replace grep+regex with conditions
* Add list of options to R's criterion parameter
* Change abbrev. metrics to upper case and add deprecation warning when criterion = 'deviance'
* Remove warning about deprecating deviance and prioritize it over mean_residual_deviance
* Make lower-case metrics in python docstring

(cherry picked from commit 393ef32)
https://h2oai.atlassian.net/browse/PUBDEV-7820