[WIP] [checkpoint v2] Use remote uploader v2 for checkpointing #3320
Why do we default to v1 still?
composer/checkpoint/state_dict.py
Outdated
def get_metadata_state_dict(
    model: Optional[Union[ComposerModel, nn.Module]] = None,
    sharded_state_dict: Optional[bool] = None,
    precision: Optional[Union[str, torch.dtype]] = None,
    device: Optional[Device] = None,
    device_train_microbatch_size: Optional[int] = None,
Is this copied from Evan's PR?
oh it got messed up. will clean up
if self.use_remote_uploader_v2 and self.remote_uploader is not None:
    self.remote_uploader.upload_file_async(
        remote_file_name=symlink_name,
        file_path=symlink_filename,
        overwrite=True,
OK, this makes sense for a PR where we just simplify the RemoteUploader and don't change any functionality, but eventually we want the RemoteUploader to manage symlinks. That way we can actually avoid race conditions.
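To make the race-condition concern concrete, here is a minimal sketch of one way an uploader could own symlink ordering: the symlink upload is only started once every checkpoint file upload it depends on has finished. Everything here is hypothetical and for illustration only. `ToyUploader`, `upload_symlink_after`, the `payload` and `delay` parameters, and the in-memory `remote` dict are assumptions, not Composer's RemoteUploader API.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import Dict, List


class ToyUploader:
    """Hypothetical stand-in for a remote uploader (not Composer's class)."""

    def __init__(self, num_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=num_workers)
        # Separate pool so a waiting symlink task never blocks file uploads.
        self._symlink_pool = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self.remote: Dict[str, str] = {}  # in-memory stand-in for object storage
        self.order: List[str] = []  # order in which uploads completed

    def _put(self, remote_file_name: str, payload: str, delay: float) -> str:
        time.sleep(delay)  # simulate network latency
        with self._lock:
            self.remote[remote_file_name] = payload
            self.order.append(remote_file_name)
        return remote_file_name

    def upload_file_async(self, remote_file_name: str, payload: str, delay: float = 0.0) -> Future:
        return self._pool.submit(self._put, remote_file_name, payload, delay)

    def upload_symlink_after(self, symlink_name: str, target: str, deps: List[Future]) -> Future:
        """Upload the symlink only after all dependent uploads have succeeded."""

        def _task() -> str:
            wait(deps)
            for dep in deps:
                dep.result()  # re-raise any upload failure instead of linking
            return self._put(symlink_name, target, 0.0)

        return self._symlink_pool.submit(_task)

    def close(self) -> None:
        self._pool.shutdown(wait=True)
        self._symlink_pool.shutdown(wait=True)


def demo() -> List[str]:
    uploader = ToyUploader()
    shards = [
        uploader.upload_file_async(f"ep0/rank{i}.pt", "data", delay=0.05)
        for i in range(3)
    ]
    link = uploader.upload_symlink_after("latest.symlink", "ep0", shards)
    link.result()
    uploader.close()
    return uploader.order
```

In this sketch the caller never has to decide when the symlink is safe to upload. It hands over the futures of the files the symlink points at, and the uploader enforces the ordering, so a reader of `latest.symlink` should not see it before the checkpoint files exist.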
Discussed with @eracah: we're going to add a new checkpoint saver instead of changing the current one.
what?
Composer change to use the new RemoteUploader for saving checkpoints.
Test
llm-foundry PR: mosaicml/llm-foundry#1237
1-node small model OCI test: test-uploader-wfA8Vz
1-node small model S3 test:
  save: test-uploader-10udLt (e2e time: 3 min 40s)
  resume: test-uploader-TDGgTs
  use old remote_uploader_downloader: test-uploader-racGuC (e2e time: 3 min 42s)