Fault tolerant grid search #7784

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 1 comment

Comments

@exalate-issue-sync

Grid search:
save data (detect, per algo, which frames need to be saved)
save params
enable model checkpointing

On crash:
reload data
reload trained models
restart gridsearch with same params (grid will auto continue where we left off)
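
The save / reload / auto-continue flow above could be sketched roughly as follows. This is a minimal, standalone illustration and not H2O-3's implementation or API: `run_grid`, `recovery_dir`, `train_fn`, `crash_after` and the JSON file names are all hypothetical, "models" are plain JSON-serializable values returned by a callback, and saving the training frames is left out.

```python
import itertools
import json
import os
from pathlib import Path


def expand_grid(hyper_params):
    """Return every hyperparameter combination as a dict, in a stable order."""
    keys = sorted(hyper_params)
    return [dict(zip(keys, values))
            for values in itertools.product(*(hyper_params[k] for k in keys))]


def _write_json(path, obj):
    """Write to a temp file first, then swap it in, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(obj))
    os.replace(tmp, path)


def run_grid(recovery_dir, hyper_params, train_fn, crash_after=None):
    """Train all combinations, persisting the params and every finished model.

    crash_after is a test hook (hypothetical): raise once that many models
    have been trained in this call, to mimic a cluster failure.
    """
    recovery_dir = Path(recovery_dir)
    recovery_dir.mkdir(parents=True, exist_ok=True)
    params_file = recovery_dir / "grid_params.json"
    models_file = recovery_dir / "models.json"

    # Save params on the first run; on a restart, check we continue the same grid.
    wanted = json.loads(json.dumps(hyper_params))  # normalize tuples to lists
    if params_file.exists():
        if json.loads(params_file.read_text()) != wanted:
            raise ValueError("recovery_dir holds a different grid")
    else:
        _write_json(params_file, wanted)

    # Reload models trained before the crash, if any.
    models = json.loads(models_file.read_text()) if models_file.exists() else {}

    trained_now = 0
    for params in expand_grid(hyper_params):
        key = json.dumps(params, sort_keys=True)
        if key in models:
            continue  # finished before the crash, skip it
        if crash_after is not None and trained_now >= crash_after:
            raise RuntimeError("simulated cluster failure")
        models[key] = train_fn(params)
        trained_now += 1
        # Checkpoint after every model so at most one model's work is lost.
        _write_json(models_file, models)
    return models
```

Calling `run_grid` a second time with the same `recovery_dir` and the same params only trains the combinations that are missing from `models.json`, which is the "grid will auto continue" behaviour described in the list.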

Proposed roadmap:
Stage 1 (this jira, end of January 2021):
• Introduce a generic API for automatic checkpointing and resuming from a checkpoint in H2O-3 - this would utilize existing building blocks in h2o
• SW will need to be able to detect an H2O cluster failure, dispose of the cluster, start a new one, and ask H2O to resume from a checkpoint
• This solution will work for Grid Search and for algos that currently support checkpointing; for algos that do not support checkpointing (GLM), the work will be seamlessly restarted from scratch
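
A rough sketch of the Stage 1 control loop, under stated assumptions: `supervise`, `FakeCluster`, `ClusterFailure`, and the `store` dict are hypothetical stand-ins and not part of H2O-3. A real supervisor would start and stop actual clusters and read checkpoints from persistent storage.

```python
class ClusterFailure(Exception):
    """Signals that the (simulated) cluster died while a job was running."""


class FakeCluster:
    """Stand-in for a real cluster handle; only tracks whether it was disposed."""

    def __init__(self):
        self.alive = True

    def shutdown(self):
        self.alive = False


def supervise(run_job, store, supports_checkpointing, max_restarts=3):
    """Run run_job on fresh clusters until it finishes.

    run_job(cluster, resume_from, store) is expected to record its progress
    in store["checkpoint"] and to raise ClusterFailure if the cluster dies.
    Returns (result, number_of_restarts).
    """
    restarts = 0
    while True:
        cluster = FakeCluster()  # start a new cluster
        # Only algos with checkpoint support resume; the rest start from scratch.
        resume_from = store.get("checkpoint") if supports_checkpointing else None
        try:
            return run_job(cluster, resume_from, store), restarts
        except ClusterFailure:
            if restarts >= max_restarts:
                raise
            restarts += 1
        finally:
            cluster.shutdown()  # dispose of the cluster, failed or not
```

With `supports_checkpointing=False` the same loop still recovers, but every restart repeats the work from the beginning, matching the fallback described for GLM.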

Stage 2 (Q1 2021):
• Add support for AutoML
• Add support for checkpointing to algos that do not currently support it (GLM, CoxPH, …) - based on booking preference

@h2o-ops
Collaborator

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-7859
Assignee: Jan Sterba
Reporter: Jan Sterba
State: Closed
Fix Version: 3.32.1.1
Attachments: N/A
Development PRs: Available

Linked PRs from JIRA

#5234
#5244
#5089
#5127
#5129
