PUBDEV-8175: improve AutoML behaviour when multiple instances are created in parallel #5532

Merged
merged 8 commits into from
Jun 22, 2021


sebhrusen
Contributor

@tomasfryda
Contributor

I will probably get to a proper review on Monday but I realized this might mess up explanations a bit (models could have the whole name including timestamp instead of being shortened which is not practical in the plots)...

If this PR changes the format of default model names, would you mind changing a regex in these two functions in this PR:

@sebhrusen
Contributor Author

sebhrusen commented Jun 17, 2021

@tomasfryda do you remember why you have a second capturing group in this regexp?

gsub("(.*)_AutoML_\\d{8}_\\d{6}(.*)", "\\1\\2", model_ids)

The (previous) model names should always end with AutoML_DATE_TIME, so what is the final .* group for? CV models, maybe? Are they considered in the explanation module?

Btw, I hadn't noticed that we have AutoML-specific logic in generic functions like h2o.varimp_heatmap. Maybe it should have been dispatched to h2o.varimp_heatmap.H2OAutoML for AutoML objects...

@tomasfryda
Contributor

The final one is needed when the model comes from a grid, I think, e.g., GBM_grid__1_AutoML_20210617_172619_model_3 -> GBM_grid__1_model_3.
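For illustration, here is a minimal Python sketch of what the R `gsub` call quoted above does to the grid example. The pattern is transcribed from the thread; the helper name `shorten` and the second, non-grid id are assumptions made for the example, not names from the H2O code base.

```python
import re

# Pattern transcribed from the R gsub call in the thread
# (old naming scheme: ..._AutoML_<8 digits>_<6 digits><optional suffix>).
OLD_PATTERN = re.compile(r"(.*)_AutoML_\d{8}_\d{6}(.*)")


def shorten(model_id):
    """Drop the '_AutoML_<date>_<time>' part and keep whatever follows it.

    Ids that do not contain the AutoML marker are returned unchanged.
    """
    return OLD_PATTERN.sub(r"\1\2", model_id)


# Grid model from the thread: the second group captures '_model_3'.
grid_short = shorten("GBM_grid__1_AutoML_20210617_172619_model_3")

# Hypothetical non-grid id: the second group is just the empty string.
plain_short = shorten("GBM_1_AutoML_20210617_172619")
```

Without the second group, the `_model_3` suffix would be lost and all models of one grid would collapse to the same short name.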

I wouldn't say it's specific to the H2OAutoML object; it's specific to models trained using AutoML, so it should work even if you just decide to use the last n models, like so: h2o.varimp_heatmap(lapply(tail(aml@leaderboard$model_id), h2o.getModel)). This holds as long as the transformation is one-to-one (i.e., it would only break if two AutoML models had model_ids that differ only in the timestamp).
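The one-to-one condition can be checked mechanically. The sketch below is only an illustration in Python: `find_collisions` is a hypothetical helper, the ids are made up, and the pattern is the old-format one quoted earlier in the thread.

```python
import re
from collections import defaultdict

# Old-format pattern quoted earlier in the thread.
PATTERN = re.compile(r"(.*)_AutoML_\d{8}_\d{6}(.*)")


def find_collisions(model_ids):
    """Map each shortened name to the full ids that produce it,
    keeping only shortened names shared by more than one model."""
    by_short = defaultdict(list)
    for model_id in model_ids:
        by_short[PATTERN.sub(r"\1\2", model_id)].append(model_id)
    return {short: ids for short, ids in by_short.items() if len(ids) > 1}


# Made-up ids: the first two differ only in the timestamp.
collisions = find_collisions([
    "GBM_1_AutoML_20210617_172619",
    "GBM_1_AutoML_20210618_101010",
    "DRF_1_AutoML_20210617_172619",
])
```

An empty result means the shortened names can be used safely as plot labels; a non-empty one lists the models that would become indistinguishable.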

@sebhrusen
Contributor Author

@tomasfryda the regexp should be fixed, please have a look.
I used the Perl regex syntax in R because, for this particular pattern, it is significantly simpler than the R extended regexp version (for which I would need to add another non-capturing group behaving like an atomic group by default with that syntax). Perl syntax is also usually faster.

@tomasfryda
Contributor

I'd use r"(.*)_AutoML_[\d_]+\d(.*)$" in Python. In Python 2.7.16 I got the error sre_constants.error: unmatched group, so my modification makes sure both groups always participate in the match, even if the second one is just an empty string. Python 3 didn't mind the unmatched group.
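A short sketch of how the proposed pattern behaves. The pattern is the one quoted above; the helper name `shorten` and the sample ids are assumptions, and the third id in particular only guesses at what an id with an extra numeric component might look like, since the new naming format is not spelled out in this thread. The error itself is consistent with the documented behaviour of `re.sub`: since Python 3.5, a backreference to a group that did not participate in the match is replaced with an empty string, whereas earlier versions raised an error.

```python
import re

# Pattern proposed in the thread for the Python client.
PATTERN = re.compile(r"(.*)_AutoML_[\d_]+\d(.*)$")


def shorten(model_id):
    """Drop '_AutoML_' plus the run of digits/underscores after it, keep any suffix."""
    return PATTERN.sub(r"\1\2", model_id)


# Illustrative ids only.
ids = [
    "GBM_1_AutoML_20210617_172619",                # old format, no suffix
    "GBM_grid__1_AutoML_20210617_172619_model_3",  # grid model from the thread
    "GBM_1_AutoML_1_20210617_172619",              # assumed: extra numeric component
]
short = [shorten(i) for i in ids]

# Because the second group is (.*) and not an optional group, it always
# participates in the match: group(2) is '' rather than None when there is no suffix.
second_group = PATTERN.match(ids[0]).group(2)
```

Keeping the second group as `(.*)` rather than something optional like `(...)?` is what makes the replacement string `\1\2` safe on older interpreters.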

Otherwise everything looks ok. Thank you!

Contributor

@tomasfryda tomasfryda left a comment

Looks good to me. Thank you @sebhrusen!
