
Changes for enabling checkpoint syncing for hyperopt #2115

Merged: 9 commits merged into master on Jun 23, 2022

Conversation

ShreyaR (Contributor) commented on Jun 8, 2022

Changes necessary to use a KubernetesSyncer to sync checkpoints across pods.

The proposed changes use the storage on the node running the main process as the default location to which all checkpoints are synced.

I've left some comments in the code, mostly for debugging, which I plan to remove before merging this PR.
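For context, here is a minimal sketch of how a Kubernetes-aware syncer was typically wired into Ray Tune in the Ray 1.x era. It is not the code introduced by this PR: `NamespacedKubernetesSyncer`, the `"ray"` namespace, the `SyncConfig` field names, and the placeholder trainable are assumptions based on Ray's documentation of that period, and the exact field names varied across Ray releases.

```python
# Sketch only: the general Ray 1.x pattern for syncing trial checkpoints
# between Kubernetes pods. Names and parameters are assumptions, not the
# exact changes made in this PR.
from ray import tune
from ray.tune.integration.kubernetes import NamespacedKubernetesSyncer


def my_trainable(config):
    # Placeholder trainable; the real hyperopt trials train a Ludwig model.
    tune.report(metric=0.0)


# Build a syncer that copies checkpoints between pods in the given
# Kubernetes namespace ("ray" is a placeholder).
sync_config = tune.SyncConfig(
    syncer=NamespacedKubernetesSyncer("ray"),
    # On some older Ray releases this field was called sync_to_driver.
)

# Pass the sync config to tune.run so each trial's checkpoints are synced
# back to the node running the main (driver) process.
analysis = tune.run(
    my_trainable,
    sync_config=sync_config,
)
```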

@ShreyaR ShreyaR requested a review from tgaddair June 8, 2022 08:07
github-actions bot commented on Jun 8, 2022

Unit Test Results

|       | Total      | Passed ✔️  | Skipped 💤 | Failed   |
|-------|------------|------------|------------|----------|
| Tests | 2 832 (±0) | 2 798 (±0) | 34 (±0)    | 0 (±0)   |
| Runs  | 8 496 (±0) | 8 390 (±0) | 106 (±0)   | 0 (±0)   |

6 files (±0), 6 suites (±0), total time 2h 16m 6s ⏱️ (+9m 37s)

Results for commit 7d448b4. Comparison against base commit 6cc49ae.

♻️ This comment has been updated with latest results.

@ShreyaR ShreyaR force-pushed the hyperopt-checkpointing branch 7 times, most recently from 300bb16 to 1315130 on June 15, 2022 22:29
Review thread on ludwig/hyperopt/execution.py (outdated, resolved):

```python
return (node_name, trial_dir)

@lru_cache(maxsize=1)
```
Collaborator:

Can the node address change? I imagine that if the Ray Train job can be made fault tolerant to head failure, it could.

ShreyaR (Contributor, Author):

This is an interesting question. I expect that fault tolerance to head node failure for Tune jobs is the only scenario in which the head node IP would change.

However, in that case I would expect the HyperoptExecutor class to be initialized again, and so it would pick up the correct head node address.

Re: caching here -- the returned value of this function doesn't depend on the head node IP at all, so even if the head node were to change and we used the cached return value, we'd still get the correct behavior.

Let me know if this makes sense or if there's something I can do to make this more robust.
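To make the caching argument concrete, here is a minimal sketch of a memoized lookup whose return value depends only on local state. The helper name and body are hypothetical, not the code in this PR; it only illustrates why a result cached with `@lru_cache` stays valid even if the head node address changes.

```python
import platform
from functools import lru_cache


@lru_cache(maxsize=1)
def _node_and_trial_dir(trial_dir: str):
    # Hypothetical helper mirroring the snippet in the review thread: the
    # result depends only on the current node and the trial directory, never
    # on the head node IP. Reusing the cached value after a head node change
    # therefore still yields correct behavior, which is what makes the
    # @lru_cache decorator safe here.
    node_name = platform.node()
    return node_name, trial_dir
```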

Collaborator:

I think that sounds reasonable.

@ShreyaR ShreyaR force-pushed the hyperopt-checkpointing branch 2 times, most recently from 178fc8d to 69dd174 on June 21, 2022 23:11
@ShreyaR ShreyaR merged commit 038dbc5 into master Jun 23, 2022
@ShreyaR ShreyaR deleted the hyperopt-checkpointing branch June 23, 2022 07:18
Labels: none · Projects: none · Linked issues: none · 2 participants