PUBDEV-7820: Add a topbasemodel attribute to AutoML #5259

Merged

Conversation

@tomasfryda (Contributor) commented Jan 27, 2021:

https://h2oai.atlassian.net/browse/PUBDEV-7820

Python:

aml.get_best_model()
aml.get_best_model(algorithm="gbm")
aml.get_best_model(criterion="auc")
aml.get_best_model(algorithm="gbm", criterion="auc")
aml.get_best_model(criterion="predict_time_per_row_ms")

R:

h2o.get_best_model(object = aml)
h2o.get_best_model(object = aml, algorithm = "gbm")
h2o.get_best_model(object = aml, criterion = "auc")
h2o.get_best_model(object = aml, algorithm = "gbm", criterion = "auc")
h2o.get_best_model(object = aml, criterion = "predict_time_per_row_ms")
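
For context, here is a minimal end-to-end sketch of how the proposed Python API could be exercised (an illustrative sketch only, not part of the PR description; it assumes a local H2O cluster can be started and uses the public prostate dataset):

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Small public binomial dataset commonly used in H2O examples.
train = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
train["CAPSULE"] = train["CAPSULE"].asfactor()

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="CAPSULE", training_frame=train)

# The API added in this PR: best GBM by AUC, and overall best by per-row prediction time.
best_gbm = aml.get_best_model(algorithm="gbm", criterion="auc")
fastest = aml.get_best_model(criterion="predict_time_per_row_ms")
print(best_gbm.model_id, fastest.model_id)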

@ledell (Contributor) left a comment:

From our conversation, here's the updated, proposed API:

Python:

aml.get_best_model(algorithm="gbm", criterion="auc")
aml.get_best_model(criterion="auc")
aml.get_best_model(criterion="prediction_time")

R:

h2o.get_best_model(object = aml, algorithm = "gbm", criterion = "auc")

In R, we can restrict the allowable input to the supported algorithms in AutoML, and the criterion should be limited to the column names from the leaderboard (I think the column names match up exactly with how they are specified in sort_metric but let's double check that because there might be at least one inconsistency there, and if so we need to decide whether to use the existing colname or arg name here).

R & Python: We could consider adding support for a list of models, similar to what we do in h2o.explain(), which supports a list of H2O models, an H2OAutoML object, or an H2OAutoML leaderboard slice. If we support a list of models, then the allowable "algorithms" should include all supervised H2O models rather than the list of AutoML algorithms.
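
If it helps to picture the list-of-models variant, here is a purely hypothetical client-side sketch (not part of this PR; the helper name and signature are invented for illustration) that scores each model on a test frame and returns the best one:

# Hypothetical helper illustrating the "list of models" idea; not PR code.
def best_of(models, test_frame, metric="auc", larger_is_better=True):
    # model_performance() returns an H2O metrics object whose metric accessors
    # (e.g. .auc(), .logloss(), .rmse()) are looked up here by name.
    scored = [(m, getattr(m.model_performance(test_frame), metric)()) for m in models]
    scored.sort(key=lambda pair: pair[1], reverse=larger_is_better)
    return scored[0][0]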

@tomasfryda force-pushed the tomf_PUBDEV-7820_add_a_topbasemodel_attribute_to_automl branch from a768ebd to 2b6f2fb on February 2, 2021 at 14:17
@tomasfryda force-pushed the tomf_PUBDEV-7820_add_a_topbasemodel_attribute_to_automl branch from a47165f to 3609544 on February 2, 2021 at 15:07
@tomasfryda (Contributor, Author) commented:

In R, we can restrict the allowable input to the supported algorithms in AutoML, and the criterion should be limited to the column names from the leaderboard (I think the column names match up exactly with how they are specified in sort_metric but let's double check that because there might be at least one inconsistency there, and if so we need to decide whether to use the existing colname or arg name here).

Column names don't correspond exactly to sort_metric; e.g., column names are lower case, while in sort_metric some metrics are upper case. (I convert the input to lower case, so this isn't a problem.)

When I get a criterion that isn't present in the leaderboard, I throw the following error: "Criterion \"", criterion, "\" is not present in the leaderboard!", which I think is good enough for the user to know where to look for the allowed values. I did this to also allow picking the best model based on prediction speed etc. (as long as a lower value means better, it wouldn't need any change), but now I think we cannot get prediction speed in the default leaderboard, so it would be unavailable...
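
A rough sketch of that validation idea (the helper name is an assumption; the actual code lives in this PR's diff), matching the criterion case-insensitively against the leaderboard columns:

from h2o.exceptions import H2OValueError

def _validate_criterion(leaderboard, criterion):
    # Leaderboard column names are lower case, so match case-insensitively.
    criterion = criterion.lower()
    if criterion not in leaderboard.columns:
        raise H2OValueError("Criterion \"%s\" is not present in the leaderboard!" % criterion)
    return criterion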

R & Python: We could consider adding support for a list of models, similar to what we do in h2o.explain(), which supports a list of H2O models, an H2OAutoML object, or an H2OAutoML leaderboard slice. If we support a list of models, then the allowable "algorithms" should include all supervised H2O models rather than the list of AutoML algorithms.

If I'm not mistaken, this would also mean forcing the user to provide a "leaderboard" frame to this function in the list-of-models scenario. As for the H2OAutoML leaderboard slice, I can't think of a use case for getting a particular best model in a subset. Is there one, or did you just mention it to keep it similar to h2o.explain(...)?

Comment on lines 611 to 613
leaderboard <- do.call(h2o.arrange, list(
  leaderboard,
  if (ascending) criterion else bquote(desc(.(criterion)))))
@tomasfryda (Contributor, Author) commented:

I am not happy with this: h2o.arrange(frame, ...) delays the evaluation of the ... arguments to mimic functions like dplyr::arrange, but it does so in the base-R way (not the rlang/tidyeval way that dplyr uses), which makes it hard to use non-interactively, e.g., when using a variable to pick the column name to sort by.

One reason for quoting/delaying the evaluation is desc, which isn't defined.

AFAIK, using functions that behave like this (e.g. base::subset) in non-interactive code is usually frowned upon, so if there is another way I think it would be preferable, but I haven't found one yet.

@tomasfryda (Contributor, Author) commented:

One more question about the API: I think @ledell mentioned that using "top" instead of "best" would be better, but it might have been forgotten during the call. Should I change it to get_top_model?

@sebhrusen (Contributor) commented Feb 2, 2021:

@tomasfryda I think top sounded better when we proposed something like get_top_model(..., sort_by='auc') (where the correct descending/ascending order had to be inferred by the user).

The suggestion to use get_best_model(..., criterion='auc') is to remove this idea of sorting models: there's only one best model for a given criterion.

Which raises another question actually.

there's only one best model for a given criterion.

Well, there could actually be ties... what do we do in that case?
I guess it's not that common, so we can take the best-ranked model in the default leaderboard.
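
A rough sketch of that tie-breaking rule (an illustration with assumed names, operating on a pandas copy of the leaderboard; the PR's actual implementation may differ): sort by the criterion and, on ties, keep the model that ranks higher in the default leaderboard order.

def pick_best(leaderboard_df, criterion, ascending=True):
    # leaderboard_df: a pandas DataFrame, e.g. aml.leaderboard.as_data_frame().
    # Remember the default leaderboard order so it can break ties.
    df = leaderboard_df.reset_index().rename(columns={"index": "default_rank"})
    df = df.sort_values([criterion, "default_rank"], ascending=[ascending, True])
    return df.iloc[0]["model_id"]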

@tomasfryda tomasfryda marked this pull request as ready for review February 9, 2021 13:08
@sebhrusen (Contributor) left a comment:

Please see my suggestions to avoid bringing backend logic into the client.

@ledell (Contributor) commented Feb 22, 2021:

Sounds good, let's use the current name, get_best_model().

@ledell (Contributor) commented Feb 23, 2021:

Hey @tomasfryda, one last comment on this: I noticed that in R, the default for criterion is NULL. Is there a reason you went with NULL instead of the character vector of allowable names? I know that, depending on the task, some metrics won't be relevant in a particular case, but in other H2O functions in R we have listed all the metrics, so there's some precedent for doing that. See stopping_metric, for example.

@tomasfryda (Contributor, Author) commented:
Hey @tomasfryda, one last comment on this: I noticed that in R, the default for criterion is NULL. Is there a reason you went with NULL instead of the character vector of allowable names? I know that, depending on the task, some metrics won't be relevant in a particular case, but in other H2O functions in R we have listed all the metrics, so there's some precedent for doing that. See stopping_metric, for example.

@ledell Yes, the reason was that some metrics would be irrelevant, but if we have that precedent, I will modify it.

@ledell (Contributor) commented Feb 23, 2021:

@tomasfryda Sounds good. See here for an example.

@tomasfryda (Contributor, Author) commented:
@ledell Thank you. I have already done it, but with lowercase letters [1] (since that is how it is shown in the leaderboard); there is no problem making it uppercase, since it is case-insensitive...

I think the names should come not from stopping_metric but rather from sort_metric [2]. One thing I find confusing is that in the leaderboard there is no deviance, but there is mean_residual_deviance.

Should I use the names from the leaderboard (either lower case, or the abbreviations in upper case), or should I use the names from sort_metric plus the training/predict time columns?

[1] 531eee4#diff-281765758396a67fb111de92d8f6845e4eaf27317e7849f16f35f078021ec149R619-R621
[2]

sort_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "mean_per_class_error"),

@ledell (Contributor) commented Feb 25, 2021:

@tomasfryda Yeah, this is another case of very old code which has weird (and annoying!) idiosyncrasies. I think it's best to just copy over the existing convention in R (upper case in places; see sort_metric). In Python, lowercase is ok. At some point, I think we should change all the metrics to lowercase across all the R functions that use them. I don't have a good solution to the deviance vs mean_residual_deviance issue (if I am remembering correctly, that was also something we copied over into AutoML from other places in H2O; metrics tables use mean_residual_deviance). I think breaking a column name by changing it is not going to work, so the only thing we can do to sync them is to deprecate "deviance" as an input and replace it with mean_residual_deviance (which would require some custom mapping code in all the R functions that use it).
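
A small sketch of the kind of mapping code this would need (illustrative only; the names are assumptions, and the commit history below shows the final PR ultimately handled 'deviance' differently):

import warnings

# Hypothetical alias table: user-facing input -> leaderboard column name.
_CRITERION_ALIASES = {"deviance": "mean_residual_deviance"}

def _normalize_criterion(criterion):
    criterion = criterion.lower()
    if criterion in _CRITERION_ALIASES:
        warnings.warn("criterion '%s' is deprecated, use '%s' instead"
                      % (criterion, _CRITERION_ALIASES[criterion]), DeprecationWarning)
        criterion = _CRITERION_ALIASES[criterion]
    return criterion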

@sebhrusen (Contributor) left a comment:

Overall, looks great!
Just not a huge fan of redundancies in documentation, as they create a risk of variations between method docs (lowercase here, camelcase there...).
I'm making a suggestion for Python; not sure about the best practices for R.

These suggestions are not a blocker though; if agreed, they could be done in a follow-up ticket.


if algorithm is not None:
    if algorithm.lower() not in ("basemodel", "deeplearning", "drf", "gbm",
                                 "glm", "stackedensemble", "xgboost"):
Contributor review comment:

Feel free to externalize the list of supported algos into a constant.
We could then reuse it in the docstring as well, e.g.:

:param algorithm: One of "basemodel", or member of :const:`H2OAutoMLBaseMixin.supported_algos`.

as well as in various places in the documentation, and avoid hard-to-maintain duplication.

@tomasfryda (Contributor, Author) replied:

I completely agree with you - I hate duplication, but I don't like the solution you proposed. I modified the docstring and added supported_algos, and then I tried it in Jupyter (Shift+Tab to show the documentation): the result was not expanded to the actual value of supported_algos but ended up exactly as written in the docstring, which is IMHO not very user friendly.

In Python, I could do something like:

def doc_format(**kwargs):
    # Substitute {placeholders} in the decorated function's docstring, so that
    # constants such as the supported algos are written down only once.
    def _(fun):
        fun.__doc__ = fun.__doc__.format(**kwargs)
        return fun
    return _


class H2OAutoMLBaseMixin:
    ...
    @doc_format(algos='", "'.join(supported_algos))
    def get_best_model(self, algorithm=None, criterion=None):
        """
        Get the best model of a given family/algorithm for a given criterion from an AutoML object.

        :param algorithm: One of "basemodel", "{algos}".
        ...
        """

This way the user gets the information in a tooltip and doesn't have to look for some property somewhere in an internal class. Also, I have no idea whether something like this is possible in R, or whether there is a better, more idiomatic solution, so I will create a JIRA later today so I can explore it more.

Contributor review comment:

doc_format looks good to me.
You could add it to the metaclass utility module.
(I also wanted to add a decorator there to extend docstring in subclasses)

Contributor review comment:

For R: https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html#do-repeat-yourself
In spite of that paragraph's title, it looks like you can avoid repetition with the @eval tag.

@tomasfryda (Contributor, Author) replied:

Thanks for the tips! I added a reference to this discussion to the new JIRA.

Comment on lines 138 to 144
Available criteria:
- Regression metrics: deviance, RMSE, MSE, MAE, RMSLE
- Binomial metrics: AUC, logloss, AUCPR, mean_per_class_error, RMSE, MSE
- Multinomial metrics: mean_per_class_error, logloss, RMSE, MSE, AUC, AUCPR
The following additional leaderboard information can also be used as a criterion:
- 'training_time_ms': column providing the training time of each model in milliseconds (doesn't include the training of cross-validation models).
- 'predict_time_per_row_ms': column providing the average prediction time by the model for a single row.
Contributor review comment:

I don't like to duplicate information.
Could we just mention that criterion can be any column name from the leaderboard with numerical values? This includes the metric columns and the columns of the extended leaderboard, as described in :func:`h2o.automl.get_leaderboard`.
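
For reference, the column names that would then be valid criteria can already be listed with the existing API (a small usage sketch; it assumes aml is a trained H2OAutoML object):

from h2o.automl import get_leaderboard

# Extended leaderboard with the extra timing columns included.
lb = get_leaderboard(aml, extra_columns="ALL")
print([col for col in lb.columns if col != "model_id"])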

if criterion in ("training_time_ms", "predict_time_per_row_ms"):
    extra_cols.append(criterion)

leaderboard = h2o.automl.get_leaderboard(self, extra_columns=extra_cols)
Contributor review comment:

By the way, the circular dependency between the _base module and h2o.automl.get_leaderboard is not super clean, but we can improve this later.
Technically, all static methods on H2OAutoML could just be _private functions in a separate module.
The same goes for get_automl and get_leaderboard.

Comment on lines +617 to +621
algorithm = c("any", "basemodel", "deeplearning", "drf", "gbm",
"glm", "stackedensemble", "xgboost"),
criterion = c("AUTO", "AUC", "AUCPR", "logloss", "MAE", "mean_per_class_error",
"deviance", "MSE", "predict_time_per_row_ms",
"RMSE", "RMSLE", "training_time_ms")) {
Contributor review comment:

Is it also possible to externalize some of those collections in R so that we can reuse them?

.h2o.automl_supported_algos = c("deeplearning", "drf", "gbm", "glm", "stackedensemble", "xgboost")
...
     algorithm = c("any", "basemodel", .h2o.automl_supported_algos)

and something similar with the metrics/leaderboard columns.
This would allow reuse in h2o.automl logic as well.

@ledell (Contributor) left a comment:

Thank you!

- Regression metrics: mean_residual_deviance, rmse, mse, mae, rmsle
- Binomial metrics: auc, logloss, aucpr, mean_per_class_error, rmse, mse
- Multinomial metrics: mean_per_class_error, logloss, rmse, mse, auc, aucpr
- Regression metrics: deviance, RMSE, MSE, MAE, RMSLE
@ledell commented:

I think in Python we have always been using lowercase everywhere (in examples, etc.). You should double-check that, but if that's the case, you can change this docstring to be lowercase.

@ledell commented:

@tomasfryda I didn't want to hold this up, so I approved, but please see my comment here.

@tomasfryda (Contributor, Author) replied:

Thanks @ledell, you are right. I checked the automl module and at least there we use lower case, so I changed it to lower case.

@tomasfryda merged commit 393ef32 into rel-zermelo on Mar 12, 2021
flaviusburca pushed a commit to mware-solutions/h2o-3 that referenced this pull request Apr 21, 2021
* Initial add a topbasemodel attribute to automl implementation

* Make python best_models a function and in R preload the models during init

* Incorporate suggestions from meeting 2021/02/01

* Fix docstrings and match.arg in R for algorithm

* Incorporate Seb's suggestions

* Make algos and criterions case insensitive, improve tests

* Automatically infer what extra cols should be retrieved from criterion

* Incorporate more Seb's suggestions

* Update docstrings

* Replace grep+regex with conditions

* Add list of options to R's criterion parameter

* Change abbrev. metrics to upper case and add deprecation warning when criterion = 'deviance'

* Remove warning about deprecating deviance and prioritize it over mean_residual_deviance

* Make lower-case metrics in python docstring

(cherry picked from commit 393ef32)