-
Notifications
You must be signed in to change notification settings - Fork 419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[checkpoint v2] add remote uploader class #3303
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid race conditions with symlink and checkpoints, we do num_workers=1
. Should we just remove multi worker functionality (remove public arg)?
Otherwise, @eracah does the redesign have a way to fix this?
it will fix this because we will upload the symlink synchronously after the file(s) in the upload_worker |
78e90f5
to
99bd9b7
Compare
This version just does exact same thing as existing remote_uploader_downloader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good start, @bigning ! I think it could be a lot simpler if we move to a design where on each upload_file_async
call a new executor.submit call is called. This would make it even simpler because we wouldn't need a file queue or an is_closed_event
. See the design doc implementation section for more details
Right, but in the design doc, I designed a feature where the symlink gets uploaded synchronously after the file. Is that intended for a follow-up PR? |
@eracah , that's actually my initial implementation (look at this), but it doesn't work unless we create a new object_store everytime we call upload_file. This is because, with mutli-process, it needs to pickle the input to the function, the object_store is not pickable. Does that make sense ? |
I'll leave the reviewing to others but adding myself to cc |
ffbc0b6
to
fab8278
Compare
discussed offline, this is a separate PR to add the new checkpoint saver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What does this PR do?
Add a RemoteUploader class
Compared to RemoteUploaderDownloader, it
test
unit test
pytest tests/utils/test_remote_uploader.py
e2e test
see PR #3320