-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes for enabling checkpoint syncing for hyperopt #2115
Conversation
300bb16
to
1315130
Compare
|
||
return (node_name, trial_dir) | ||
|
||
@lru_cache(maxsize=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the node address change? I imagine if the ray train job can be made to be fault tolerance to head failure, it could.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting question. I expect that fault tolerance to head node failure for tune jobs is the only scenario in which head node IP will change.
However, in that case I would expect the HyperoptExecutor
class to be initialized again, and so would have the correct head node address?
Re: caching here -- the returned value of this function doesn't depend on the head node IP at all, and so even if the head node were to change and we use the cached return val of this function, we'd still have the correct behavior.
Let me know if this makes sense or if there's something I can do to make this more robust.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds reasonable.
d587f92
to
f60e0ab
Compare
178fc8d
to
69dd174
Compare
4d93657
to
7d448b4
Compare
Changes necessary in order to use a KubernetesSyncer in order to sync checkpoints across pods.
The proposed changes use the storage on the node running the main process as the default storage location where all checkpoints are synced to.
I've left some comments in there right now mostly for debugging, which I plan to remove before merging this PR.