Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PUBDEV-7859] grid.resume() #5234

Merged
merged 5 commits into from
Jan 22, 2021
Merged

Conversation

honzasterba
Copy link
Contributor

  • make grid resumable without providing any extra params
  • this means all necessary config and params must be stored with the grid
  • another piece to the puzzle of fault tolerance

*/
public class HyperParameters extends Iced<HyperParameters> {

private volatile Map<String, Object[]> values;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

volatile? Maybe transient instead? to make it clear it is not serialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, fixed

@@ -74,6 +72,27 @@ public void test_SequentialWalker() {
Scope.exit();
}
}

@Test
public void test_SequentialWalker_getHyperParams() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the resume work also with parallel grid search?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modified the test to test this as well, actually discovered a bug, in parallel grid search resume code

@@ -354,6 +354,22 @@ def train(self, x=None, y=None, training_frame=None, offset_column=None, fold_co
parms["x"] = x
self.build_model(parms)

def resume(self, recovery_dir=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

Copy link
Contributor

@Pscheidl Pscheidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I expected a different design. Expected H2O to have a default "dump folder" where everything is saved automatically and that dump folder is checked on startup for any checkpoints and other configuration. Such a folder could be configurable of course. And it's usage. For starters, it could be enabled on K8S only via a dedicated envvar. I'm just curious why didn't we go this way ?

  • Resumes (not autorestart) are testable via JUnit - the deserialized objects can be compared easily with the ones before the algorithm was interrupted. This would test the correctness of the serialization part.

  • Should work with parallel grid search nicely, as each model has a separate file. Only tests are missing - would be nice to have one as well.

h2o-py/h2o/grid/grid_search.py Show resolved Hide resolved
@honzasterba
Copy link
Contributor Author

  • I expected a different design. Expected H2O to have a default "dump folder" where everything is saved automatically

... yes, this is the next stel in the process, I have the code mostly ready but did not want to make this PR even larger

  • Should work with parallel grid search nicely, as each model has a separate file. Only tests are missing - would be nice to have one as well.

... will add

@Pscheidl
Copy link
Contributor

Pscheidl commented Jan 18, 2021

GS saves models as they're built. If a new folder is specified in resume function, shouldn't the content of the old folder be copied to the new one ? ( What if there will be another interruption, will H2O be able to load all the models from the new folder ?

print("models after first run:")
for x in sorted(loaded.model_ids):
print(x)
loaded.hyper_params = hyper_parameters
loaded.train(x=list(range(4)), y=4, training_frame=loaded_train)
loaded.resume()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test with the new target folder as well ?

@honzasterba
Copy link
Contributor Author

GS saves models as they're built. If a new folder is specified in resume function, shouldn't the content of the old folder be copied to the new one ? ( What if there will be another interruption, will H2O be able to load all the models from the new folder ?

very good catch, will work on that

- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- modified to test to use parallelism too
@honzasterba honzasterba force-pushed the honza/PUBDEV-7859/grid_resume_2 branch from a3c88aa to 6167312 Compare January 20, 2021 14:21
@honzasterba honzasterba force-pushed the honza/PUBDEV-7859/grid_resume_2 branch from 1281fdf to 83a4c0f Compare January 21, 2021 17:13
@honzasterba honzasterba merged commit abff884 into master Jan 22, 2021
@honzasterba honzasterba deleted the honza/PUBDEV-7859/grid_resume_2 branch January 22, 2021 12:03
flaviusburca pushed a commit to mware-solutions/h2o-3 that referenced this pull request Apr 21, 2021
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- increase r cmd check timeout as it keeps timing out
- fixed resume with parallel grid search
- modified to test to use parallelism too
- make sure we load also saved models on grid recovery and save the previously trained models again

(cherry picked from commit abff884)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants