PUBDEV-5975: Proposal for a consistent behaviour of AutoML reruns #3907

Merged · 17 commits · Oct 16, 2019

Conversation

sebhrusen (Contributor) commented on Sep 20, 2019

https://0xdata.atlassian.net/browse/PUBDEV-5975

Due to inherent issues with the current Python API for AutoML, users can obtain surprising results when looking at the leaderboard after AutoML reruns, especially in the following cases (sketched below):

  • when calling aml.train multiple times on a single aml (AutoML) instance;
  • when creating multiple AutoML instances without any project_name.
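
To make the two patterns concrete, here is a minimal sketch against the h2o Python API; the dataset path train.csv and the response column name y are placeholder assumptions, not from this PR:

```python
# Minimal sketch of the two problematic rerun patterns (pre-fix behaviour).
# Assumptions: a local "train.csv" with a response column "y" (both hypothetical).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")

# Case 1: multiple train() calls on a single AutoML instance.
aml = H2OAutoML(max_models=3, seed=1)
aml.train(y="y", training_frame=train)
aml.train(y="y", training_frame=train)  # unclear whether the leaderboard grows or resets

# Case 2: several instances created without any project_name.
aml1 = H2OAutoML(max_models=3, seed=1)
aml2 = H2OAutoML(max_models=3, seed=1)
aml1.train(y="y", training_frame=train)
aml2.train(y="y", training_frame=train)  # may surprisingly interact with aml1's leaderboard
```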

On the backend side, it also seems to have been designed with a feature in mind that was never completed: the notion of a project distinct from the automl_id. But it is very difficult to guess what the designers had in mind, as there is no true support for the concept of a project that would, for example, include multiple AutoML instances.

The contract for this proposal is detailed in the pyunit_automl_reruns.py test suite, so I encourage reviewers to look at it first.
To sum it up, the idea is (see the sketch after this list):

  • to always create a new unique project (and therefore a new leaderboard) each time the user creates an AutoML instance without specifying a project_name;
  • to allow cumulative reruns (new models keep being added to the leaderboard) if the user calls train multiple times on the same project name with compatible data (same training_frame, same response column);
  • to allow reruns with a new leaderboard if the user changes the response column or the training_frame (the previous leaderboard stays accessible by id).
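
A sketch of the contracted behaviour, reusing the same hypothetical train.csv / y setup as above; the second response column y2 is likewise a placeholder:

```python
# Sketch of the proposed rerun contract (hypothetical data and names).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")

# 1. No project_name: every new instance gets its own project and leaderboard.
aml_a = H2OAutoML(max_models=3, seed=1)
aml_a.train(y="y", training_frame=train)            # fresh leaderboard

# 2. Same project_name, compatible data: reruns accumulate on one leaderboard.
aml_b = H2OAutoML(max_models=3, seed=1, project_name="reruns_demo")
aml_b.train(y="y", training_frame=train)
n_before = aml_b.leaderboard.nrows
aml_b.train(y="y", training_frame=train)
assert aml_b.leaderboard.nrows > n_before           # new models were appended

# 3. Changed response column (or training_frame): a new leaderboard starts,
#    while the previous one remains accessible by id on the backend.
aml_b.train(y="y2", training_frame=train)           # "y2" is a hypothetical column
```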

More issues

https://0xdata.atlassian.net/browse/PUBDEV-6708

This issue is also due to reruns.
The current behaviour allows changing the leaderboard_frame, yet the new models are still appended to the existing leaderboard (ignoring the new frame); this is just plain wrong (see the sketch below).

The leaderboard should be identified uniquely by project_name + leaderboard_frame, which makes the rerun logic still more complicated.
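
A sketch of the PUBDEV-6708 symptom; both leaderboard frames and their paths are hypothetical:

```python
# Sketch of the PUBDEV-6708 symptom (hypothetical frames and paths).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")
lb1 = h2o.import_file("leaderboard1.csv")
lb2 = h2o.import_file("leaderboard2.csv")

aml = H2OAutoML(max_models=3, seed=1, project_name="lb_demo")
aml.train(y="y", training_frame=train, leaderboard_frame=lb1)
# Rerun with a different leaderboard frame: the current behaviour still appends
# the new models to the existing leaderboard and silently ignores lb2.
aml.train(y="y", training_frame=train, leaderboard_frame=lb2)
```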

Rerun behaviour contract

see https://github.com/h2oai/h2o-3/pull/3907/files#diff-1281d5db9141adc08c3047885255f970

sebhrusen (Contributor, Author) commented:

@ledell: will probably loosen the constraint on training_frame now that I fixed PUBDEV-6494 (#3910), and will create a new leaderboard iff the response_column has changed, allowing the user to add new models to the leaderboard even when passing a different training_frame to the train method.
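
A sketch of that loosened behaviour, under the same hypothetical setup (train2.csv and other_y are placeholders):

```python
# Sketch of the loosened constraint described above (hypothetical data).
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")
train2 = h2o.import_file("train2.csv")

aml = H2OAutoML(max_models=3, seed=1, project_name="loosened_demo")
aml.train(y="y", training_frame=train)
aml.train(y="y", training_frame=train2)        # same response: models keep accumulating
aml.train(y="other_y", training_frame=train2)  # new response: a fresh leaderboard starts
```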

ledell (Contributor) left a review comment:
Great improvement. LGTM.
