
Changes for enabling checkpoint syncing for hyperopt #2115

Merged: 9 commits merged into master on Jun 23, 2022

Conversation

ShreyaR (Contributor) commented on Jun 8, 2022

Changes necessary to use a KubernetesSyncer to sync checkpoints across pods.

The proposed changes use the storage on the node running the main process as the default location to which all checkpoints are synced.

I've left some comments in the code, mostly for debugging, which I plan to remove before merging this PR.
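For context, here is a minimal sketch of how a Kubernetes-aware syncer was typically wired into Ray Tune in the Ray 1.x era. It is not the code introduced by this PR: `NamespacedKubernetesSyncer`, the `"ray"` namespace, the `SyncConfig` field names, and the placeholder trainable are assumptions based on Ray's documentation of that period, and the exact field names varied across Ray releases.

```python
# Sketch only: the general Ray 1.x pattern for syncing trial checkpoints
# between Kubernetes pods. Names and parameters are assumptions, not the
# exact changes made in this PR.
from ray import tune
from ray.tune.integration.kubernetes import NamespacedKubernetesSyncer


def my_trainable(config):
    # Placeholder trainable; the real hyperopt trials train a Ludwig model.
    tune.report(metric=0.0)


# Build a syncer that copies checkpoints between pods in the given
# Kubernetes namespace ("ray" is a placeholder).
sync_config = tune.SyncConfig(
    syncer=NamespacedKubernetesSyncer("ray"),
    # On some older Ray releases this field was called sync_to_driver.
)

# Pass the sync config to tune.run so each trial's checkpoints are synced
# back to the node running the main (driver) process.
analysis = tune.run(
    my_trainable,
    sync_config=sync_config,
)
```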

@ShreyaR ShreyaR requested a review from tgaddair June 8, 2022 08:07
github-actions bot commented on Jun 8, 2022

Unit Test Results

|       | Total      | Passed ✔️  | Skipped 💤 | Failed   |
|-------|------------|------------|------------|----------|
| Tests | 2 832 (±0) | 2 798 (±0) | 34 (±0)    | 0 (±0)   |
| Runs  | 8 496 (±0) | 8 390 (±0) | 106 (±0)   | 0 (±0)   |

6 files (±0), 6 suites (±0), total time 2h 16m 6s ⏱️ (+9m 37s)

Results for commit 7d448b4. Comparison against base commit 6cc49ae.

♻️ This comment has been updated with latest results.

@ShreyaR ShreyaR force-pushed the hyperopt-checkpointing branch 7 times, most recently from 300bb16 to 1315130 on June 15, 2022 22:29
Review thread on ludwig/hyperopt/execution.py (outdated, resolved):

```python
return (node_name, trial_dir)

@lru_cache(maxsize=1)
```
Collaborator:

Can the node address change? I imagine that if the Ray Train job can be made fault tolerant to head failure, it could.

ShreyaR (Contributor, Author):

This is an interesting question. I expect that fault tolerance to head node failure for Tune jobs is the only scenario in which the head node IP would change.

However, in that case I would expect the HyperoptExecutor class to be initialized again, and so it would pick up the correct head node address.

Re: caching here -- the returned value of this function doesn't depend on the head node IP at all, so even if the head node were to change and we used the cached return value, we'd still get the correct behavior.

Let me know if this makes sense or if there's something I can do to make this more robust.
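To make the caching argument concrete, here is a minimal sketch of a memoized lookup whose return value depends only on local state. The helper name and body are hypothetical, not the code in this PR; it only illustrates why a result cached with `@lru_cache` stays valid even if the head node address changes.

```python
import platform
from functools import lru_cache


@lru_cache(maxsize=1)
def _node_and_trial_dir(trial_dir: str):
    # Hypothetical helper mirroring the snippet in the review thread: the
    # result depends only on the current node and the trial directory, never
    # on the head node IP. Reusing the cached value after a head node change
    # therefore still yields correct behavior, which is what makes the
    # @lru_cache decorator safe here.
    node_name = platform.node()
    return node_name, trial_dir
```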

Collaborator:

I think that sounds reasonable.

@ShreyaR ShreyaR force-pushed the hyperopt-checkpointing branch 2 times, most recently from 178fc8d to 69dd174 on June 21, 2022 23:11
@ShreyaR ShreyaR merged commit 038dbc5 into master Jun 23, 2022
@ShreyaR ShreyaR deleted the hyperopt-checkpointing branch June 23, 2022 07:18
Labels: none · Projects: none · Linked issues: none · 2 participants