[PUBDEV-7859] grid.resume() #5234
Conversation
honzasterba commented Jan 14, 2021
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
313651a to c5d1c46
 */
public class HyperParameters extends Iced<HyperParameters> {

    private volatile Map<String, Object[]> values;
volatile? Maybe transient instead, to make it clear it is not serialized?
good catch, fixed
@@ -74,6 +72,27 @@ public void test_SequentialWalker() {
      Scope.exit();
    }
  }

  @Test
  public void test_SequentialWalker_getHyperParams() {
does the resume work also with parallel grid search?
modified the test to cover this as well, and actually discovered a bug in the parallel grid search resume code
@@ -354,6 +354,22 @@ def train(self, x=None, y=None, training_frame=None, offset_column=None, fold_co
            parms["x"] = x
        self.build_model(parms)

    def resume(self, recovery_dir=None):
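A rough illustration of what a `resume(recovery_dir=None)` entry point could do: fall back to the directory stored with the grid when none is given, remember any override for the next interruption, and continue training only the hyperparameter combinations that have not been completed. This is a hypothetical sketch under assumed state-dict names, not the PR's actual implementation:

```python
def resume(state, recovery_dir=None):
    """Hypothetical sketch: choose the recovery directory, defaulting to the
    one recorded when the grid was first saved, then continue from progress."""
    # fall back to the directory stored with the grid
    target = recovery_dir or state["recovery_dir"]
    state["recovery_dir"] = target            # remember it for the next interruption
    done = set(state["completed"])
    for params in state["hyper_params"]:
        if params in done:
            continue                          # skip models already trained
        state["completed"].append(params)     # "train" the remaining ones
    return target
```

The default of `None` keeps the call signature of `grid.resume()` argument-free in the common case, matching the PR description.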
Very cool!
- I expected a different design: H2O having a default "dump folder" where everything is saved automatically, and which is checked on startup for checkpoints and other configuration. Such a folder could be configurable, of course, and so could its usage; for starters, it could be enabled on K8S only via a dedicated envvar. I'm just curious why we didn't go this way?
- Resumes (not autorestart) are testable via JUnit: the deserialized objects can easily be compared with the ones from before the algorithm was interrupted. This would test the correctness of the serialization part.
- Should work nicely with parallel grid search, as each model has a separate file. Only tests are missing; it would be nice to have one as well.
... yes, this is the next step in the process; I have the code mostly ready but did not want to make this PR even larger
... will add
GS saves models as they're built. If a new folder is specified in the resume function, shouldn't the content of the old folder be copied to the new one? And if there is another interruption, will H2O be able to load all the models from the new folder?
print("models after first run:")
for x in sorted(loaded.model_ids):
    print(x)
loaded.hyper_params = hyper_parameters
loaded.train(x=list(range(4)), y=4, training_frame=loaded_train)
loaded.resume()
Test with the new target folder as well?
very good catch, will work on that
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- modified the test to use parallelism too
…viously trained models again
a3c88aa to 6167312
1281fdf to 83a4c0f
- make grid resumable without providing any extra params
- this means all necessary config and params must be stored with the grid
- another piece to the puzzle of fault tolerance
- increase r cmd check timeout as it keeps timing out
- fixed resume with parallel grid search
- modified the test to use parallelism too
- make sure we load also saved models on grid recovery and save the previously trained models again

(cherry picked from commit abff884)