PUBDEV-5975: Proposal for a consistent behaviour of AutoML reruns #3907

Merged · 17 commits · Oct 16, 2019

Conversation

sebhrusen (Contributor) commented on Sep 20, 2019

https://0xdata.atlassian.net/browse/PUBDEV-5975

Due to inherent issues with the current Python API for AutoML, users can obtain surprising results when looking at the leaderboard after AutoML reruns, especially in the following cases (sketched below):

  • when calling aml.train multiple times on a single aml (AutoML) instance;
  • when creating multiple AutoML instances without any project_name.
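
To make the two patterns concrete, here is a minimal sketch against the h2o Python API; the dataset path train.csv and the response column name y are placeholder assumptions, not from this PR:

```python
# Minimal sketch of the two problematic rerun patterns (pre-fix behaviour).
# Assumptions: a local "train.csv" with a response column "y" (both hypothetical).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")

# Case 1: multiple train() calls on a single AutoML instance.
aml = H2OAutoML(max_models=3, seed=1)
aml.train(y="y", training_frame=train)
aml.train(y="y", training_frame=train)  # unclear whether the leaderboard grows or resets

# Case 2: several instances created without any project_name.
aml1 = H2OAutoML(max_models=3, seed=1)
aml2 = H2OAutoML(max_models=3, seed=1)
aml1.train(y="y", training_frame=train)
aml2.train(y="y", training_frame=train)  # may surprisingly interact with aml1's leaderboard
```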

On the backend side, it also seems to have been designed with a feature in mind that was never completed: the notion of a project distinct from the automl_id. But it is very difficult to guess what the designers had in mind, as there is no true support for the concept of a project that would, for example, include multiple AutoML instances.

The contract for this proposal is detailed in the pyunit_automl_reruns.py test suite, so I encourage reviewers to look at it first.
To sum it up, the idea is (see the sketch after this list):

  • to always create a new unique project (and therefore a new leaderboard) each time the user creates an AutoML instance without specifying a project_name;
  • to allow cumulative reruns (new models keep being added to the leaderboard) if the user calls train multiple times on the same project name with compatible data (same training_frame, same response column);
  • to allow reruns with a new leaderboard if the user changes the response column or the training_frame (the previous leaderboard stays accessible by id).
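
A sketch of the contracted behaviour, reusing the same hypothetical train.csv / y setup as above; the second response column y2 is likewise a placeholder:

```python
# Sketch of the proposed rerun contract (hypothetical data and names).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")

# 1. No project_name: every new instance gets its own project and leaderboard.
aml_a = H2OAutoML(max_models=3, seed=1)
aml_a.train(y="y", training_frame=train)            # fresh leaderboard

# 2. Same project_name, compatible data: reruns accumulate on one leaderboard.
aml_b = H2OAutoML(max_models=3, seed=1, project_name="reruns_demo")
aml_b.train(y="y", training_frame=train)
n_before = aml_b.leaderboard.nrows
aml_b.train(y="y", training_frame=train)
assert aml_b.leaderboard.nrows > n_before           # new models were appended

# 3. Changed response column (or training_frame): a new leaderboard starts,
#    while the previous one remains accessible by id on the backend.
aml_b.train(y="y2", training_frame=train)           # "y2" is a hypothetical column
```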

More issues

https://0xdata.atlassian.net/browse/PUBDEV-6708

This issue is also due to reruns.
The current behaviour allows changing the leaderboard_frame, yet the new models are still appended to the existing leaderboard (ignoring the new frame); this is just plain wrong (see the sketch below).

The leaderboard should be identified uniquely by project_name + leaderboard_frame, which makes the rerun logic still more complicated.
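
A sketch of the PUBDEV-6708 symptom; both leaderboard frames and their paths are hypothetical:

```python
# Sketch of the PUBDEV-6708 symptom (hypothetical frames and paths).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")
lb1 = h2o.import_file("leaderboard1.csv")
lb2 = h2o.import_file("leaderboard2.csv")

aml = H2OAutoML(max_models=3, seed=1, project_name="lb_demo")
aml.train(y="y", training_frame=train, leaderboard_frame=lb1)
# Rerun with a different leaderboard frame: the current behaviour still appends
# the new models to the existing leaderboard and silently ignores lb2.
aml.train(y="y", training_frame=train, leaderboard_frame=lb2)
```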

Rerun behaviour contract

see https://github.com/h2oai/h2o-3/pull/3907/files#diff-1281d5db9141adc08c3047885255f970

sebhrusen (Contributor, Author) commented:

@ledell: will probably loosen the constraint on training_frame now that I fixed PUBDEV-6494 (#3910), and will create a new leaderboard iff the response_column has changed, allowing the user to add new models to the leaderboard even when passing a different training_frame to the train method.
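
A sketch of that loosened behaviour, under the same hypothetical setup (train2.csv and other_y are placeholders):

```python
# Sketch of the loosened constraint described above (hypothetical data).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")
train2 = h2o.import_file("train2.csv")

aml = H2OAutoML(max_models=3, seed=1, project_name="loosened_demo")
aml.train(y="y", training_frame=train)
aml.train(y="y", training_frame=train2)        # same response: models keep accumulating
aml.train(y="other_y", training_frame=train2)  # new response: a fresh leaderboard starts
```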

ledell (Contributor) left a review comment:
Great improvement. LGTM.
