[PUBDEV-7859] grid.resume() #5234
Conversation
honzasterba commented Jan 14, 2021
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
313651a to c5d1c46
 */
public class HyperParameters extends Iced<HyperParameters> {

    private volatile Map<String, Object[]> values;
volatile? Maybe transient instead, to make it clear it is not serialized?
good catch, fixed
@@ -74,6 +72,27 @@ public void test_SequentialWalker() {
      Scope.exit();
    }
  }

  @Test
  public void test_SequentialWalker_getHyperParams() {
does the resume work also with parallel grid search?
modified the test to cover this as well, and actually discovered a bug in the parallel grid search resume code
@@ -354,6 +354,22 @@ def train(self, x=None, y=None, training_frame=None, offset_column=None, fold_co
            parms["x"] = x
        self.build_model(parms)

    def resume(self, recovery_dir=None):
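A rough illustration of what a `resume(recovery_dir=None)` entry point could do: fall back to the directory stored with the grid when none is given, remember any override for the next interruption, and continue training only the hyperparameter combinations that have not been completed. This is a hypothetical sketch under assumed state-dict names, not the PR's actual implementation:

```python
def resume(state, recovery_dir=None):
    """Hypothetical sketch: choose the recovery directory, defaulting to the
    one recorded when the grid was first saved, then continue from progress."""
    # fall back to the directory stored with the grid
    target = recovery_dir or state["recovery_dir"]
    state["recovery_dir"] = target            # remember it for the next interruption
    done = set(state["completed"])
    for params in state["hyper_params"]:
        if params in done:
            continue                          # skip models already trained
        state["completed"].append(params)     # "train" the remaining ones
    return target
```

The default of `None` keeps the call signature of `grid.resume()` argument-free in the common case, matching the PR description.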
Very cool!
- I expected a different design: H2O having a default "dump folder" where everything is saved automatically, and which is checked on startup for checkpoints and other configuration. Such a folder could be configurable, of course, and so could its usage; for starters, it could be enabled on K8S only via a dedicated envvar. I'm just curious why we didn't go this way?
- Resumes (not autorestart) are testable via JUnit: the deserialized objects can easily be compared with the ones from before the algorithm was interrupted. This would test the correctness of the serialization part.
- Should work nicely with parallel grid search, as each model has a separate file. Only tests are missing; it would be nice to have one as well.
... yes, this is the next step in the process; I have the code mostly ready but did not want to make this PR even larger
... will add
GS saves models as they're built. If a new folder is specified in the resume function, shouldn't the content of the old folder be copied to the new one? And if there is another interruption, will H2O be able to load all the models from the new folder?
print("models after first run:")
for x in sorted(loaded.model_ids):
    print(x)
loaded.hyper_params = hyper_parameters
loaded.train(x=list(range(4)), y=4, training_frame=loaded_train)
loaded.resume()
Test with the new target folder as well?
very good catch, will work on that
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- modified the test to use parallelism too
…viously trained models again
a3c88aa to 6167312
1281fdf to 83a4c0f
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- increase r cmd check timeout as it keeps timing out
- fixed resume with parallel grid search
- modified the test to use parallelism too
- make sure we load also saved models on grid recovery and save the previously trained models again

(cherry picked from commit abff884)