
AutoML reruns (same project name, no project name...) #12823

Closed
exalate-issue-sync bot opened this issue May 13, 2023 · 5 comments

Comments

@exalate-issue-sync

Multiple issues currently:

  • If not provided by the user, the project name is generated on the client side: duplicated logic that exists only in the Python and R clients, with nothing for the Java "API" -> the project name should instead be created on the server side when creating the AutoML instance and returned to the client.
  • The client-side logic relies only on the training frame. However, if the user runs AutoML with the same training frame but a different target and/or predictors, we should not reuse the same AutoML instance. The project name should be derived from the entire training setup (see the sketch after this list), for example as
    {{project_name = "{training_frame_id}_{hash(y)}_{hash(x)}"}}
  • The project-name caching logic in {{AutoMLBuildSpec}} is confusing, apparently useless, if not plain wrong.
  • The {{AutoML}} class uses a combination of {{buildSpec.build_control.project_name}} and the {{projectName()}} method; for consistency we must ensure there is a single entry point when consuming this name.
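A minimal sketch of what such a server-side default could look like, assuming hypothetical helper names (illustrative only, not the actual H2O code):

{code:Python}
import hashlib

def default_project_name(training_frame_id, x, y):
    # hash the predictor list and the target so that a different x or y yields a different project
    x_hash = hashlib.md5(",".join(sorted(x)).encode()).hexdigest()[:8]
    y_hash = hashlib.md5(y.encode()).hexdigest()[:8]
    return "%s_%s_%s" % (training_frame_id, y_hash, x_hash)
{code}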

For an example of an issue related to this project_name auto-generation, cf. https://groups.google.com/forum/#!topic/h2ostream/3KQSY4BNdvY

@exalate-issue-sync
Author

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] do we want this one fixed for 3.22.0.x?
There have been a few people complaining about this: cf. PUBDEV-5791 + the Google Groups link in the description.

@exalate-issue-sync
Author

Erin LeDell commented: [~accountid:5b153fb1b0d76456f36daced] Maybe instead of
{{project_name = "{training_frame_id}_{hash(y)}_{hash(x)}"}}
we could use something a bit shorter, rather than adding two extra hashes? What do you think of just using a timestamp of the start time?

Since we already use the date + (seconds?) for auto-naming the models in the leaderboard, it seems like it would make sense to use the same timestamp, e.g.

{{StackedEnsemble_AllModels_AutoML_20181127_075221}}

Currently the project name looks like this:
{{"automl_RTMP_sid_b6b2_4"}}

Also, since we use {{"AutoML_{date}_{seconds}"}} with a capital "AutoML", maybe we should also change from the lower-case "automl".
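For instance, a hypothetical default along those lines (illustrative only, not the current implementation):

{code:Python}
import time

def default_project_name():
    # reuse the same date + seconds format as the leaderboard model ids, e.g. AutoML_20181127_075221
    return "AutoML_%s" % time.strftime("%Y%m%d_%H%M%S")
{code}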

Let me know your thoughts.

@exalate-issue-sync
Author

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c] after having a second look at this, we could use a simple default project_name (based on a timestamp as you suggest) or a more complex one (based on hashes), depending on which constraints we want to impose on the user... but I don't think that relying on project_name is enough anyway.

I'll try to sum up the current behaviour and propose alternatives for how it should work.

Today:
{code:R}
# no project name given
aml <- h2o.automl(x=x, y=y1, training_frame=train, max_models=3) # 3+2 models visible in leaderboard
h2o.automl(x=x, y=y1, training_frame=train, max_models=3) # user reruns: 3 additional models visible in aml leaderboard
h2o.automl(x=x, y=y2, training_frame=train, max_models=3) # user reruns and changes y: still 3 additional models visible in the leaderboard!!!
{code}
As you can guess, the same issue occurs if the user passes their own (identical) project_name in each run.
Worse: in this case, if the user changes the training_frame (as long as it is compatible, for example passing the test frame instead), it still works: the new models built with the new frame are added to the same project/leaderboard...
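A rough Python illustration of that worse case (hypothetical frame and project names; in the Python API the project_name stays on the instance):

{code:Python}
from h2o.automl import H2OAutoML

aml = H2OAutoML(project_name='foo', max_models=3)
aml.train(y=y, training_frame=train)
aml.train(y=y, training_frame=test)  # different (but compatible) frame, same project: models are still appended to the same leaderboard
{code}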

If we add a timestamp when generating the project_name (when it is not provided by the user, or even by appending it to the user-provided name), then it will fix the issue, but at the cost of preventing the user from doing reruns on the same AutoML instance...
We could also consider that users are adults, and if they provide a project name and change the target in a rerun, then they're just doing something silly: this would be reasonable with the R API, but much less so with the Python one, where users do
{code:Python}
aml = H2OAutoML(project_name='foo', max_models=3)
aml.train(y=y1, training_frame=train)
aml.train(y=y2, training_frame=train)
{code}
The API not only offers them the possibility to change predictors, target and frames after having created the AutoML instance; it almost encourages them to do so, apparently without any consequence...

We have multiple alternatives to fix this:

  • Systematically add a timestamp to the project name for each run. Simple solution, but this will prevent reruns.
  • In addition to the solution above, add an explicit 'rerun' flag, sort of delegating the responsibility for strange behaviours to the end user... besides, this would look weird in the Python API.
  • Let the user do reruns but detect early when a critical param changes (training_frame, x, y), and either throw an exception or automatically generate a new project: I would personally favour the exception, to follow the principle of least surprise. This has a drawback however: it should probably be the expected behaviour when the user passes their own project_name (in which case no hash is needed, a comparison is enough), but when the project_name is auto-generated it looks weird; the user would not understand why it sometimes works and sometimes doesn't.
  • Combine both solutions: allow reruns only for projects that were given an explicit name (systematically add timestamps to autogenerated ones), and add a safety check during reruns, ensuring that the user didn't change x/y/train (see the sketch below).

The last one looks the most complete and predictable imo.
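A rough sketch of what that combined safety check could look like (hypothetical helper and field names, not the actual implementation):

{code:Python}
import time

def resolve_project_name(user_project_name=None):
    # autogenerated names get a timestamp, so every unnamed run starts a fresh project
    if user_project_name is None:
        return "AutoML_%s" % time.strftime("%Y%m%d_%H%M%S"), True
    return user_project_name, False

def check_rerun(previous_spec, x, y, training_frame):
    # reruns are only allowed on explicitly named projects, and only if x/y/train are unchanged
    if previous_spec.name_was_generated:
        raise ValueError("Rerun not allowed on an auto-named AutoML project; set project_name explicitly.")
    if (x, y, training_frame) != (previous_spec.x, previous_spec.y, previous_spec.training_frame):
        raise ValueError("x, y or training_frame changed: start a new AutoML project instead.")
{code}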
One last thing however: what is allowed to change during reruns? We have mentioned x/y/train until now, but what if the user only changes the validation or leaderboard frame? Or fold_column? Or weights? Should that be considered a valid rerun?

The whole API looks broken for reruns in my opinion... the only parameters I would allow for reruns are things like max_models, max_runtime_secs, exclude_algos... basically those parameters that have no or minimal impact on how the individual models are built.
This would look like:
{code:Python}
aml = H2OAutoML(project_name='foo', max_models=3)
aml.train(y=y1, training_frame=train) #even this looks broken imo
#user inspects leaderboard
aml.continue(max_models=10, exclude_algos=['DRF'])
{code}
{code:R}
aml <- h2o.automl(x=x, y=y1, training_frame=train, max_models=3)
aml$continue(max_models=10)
{code}

If reruns are popular, then I think the API should be properly repaired. That's a bigger task however.
I would suggest starting with the combined fix mentioned earlier (we would still need to define in which cases we should throw an exception on rerun if the user changed some parameter). The API extension could then be added on top later.

Sorry for being so long :)

@exalate-issue-sync
Author

Sebastien Poirier commented: [~accountid:557058:afd6e9a4-1891-4845-98ea-b5d34a2bc42c], I'm having third thoughts on this:
after all, it seems reasonable to let the user change the set of predictors x, as long as they are still using the same training set and target.
If we look at the question from https://groups.google.com/forum/#!topic/h2ostream/3KQSY4BNdvY, it could make sense to rerun AutoML, just changing some predictors, and have the last AutoML run stack all the previously computed models.
Which means that I don't know how we can fix this without potentially breaking existing usage:

  • Some users expect to get a fresh new AutoML instance when changing predictors.
  • Other users may notice that it only adds to the previous instance, and rely on this current behaviour in their scripts to make better predictions.

That's why, for now, I think we should just fix the automatically generated project name to
{{project_name = "{training_frame_id}_{y}"}}
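A minimal sketch of that default (hypothetical helper name, assuming the frame id and target name are at hand):

{code:Python}
def default_project_name(training_frame_id, y):
    # the same AutoML instance is reused only when both the training frame and the target match
    return "automl_%s_%s" % (training_frame_id, y)
{code}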

This would fix issue PUBDEV-5791, as it's really a bug in that case: the AutoML instance was reused in spite of the target changing.
But for the Google Groups question above, where the user is only changing the predictors, I would leave it as it is, but add more explanation for the project_name param in the docs, maybe even adding a section like "Recommended parameters" between mandatory and optional to make this more visible.
wdyt?

@hasithjp
Member

JIRA Issue Migration Info

Jira Issue: PUBDEV-5975
Assignee: Sebastien Poirier
Reporter: Sebastien Poirier
State: Resolved
Fix Version: 3.28.0.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#3907
