[WIP] [checkpoint v2] Use remote uploader v2 for checkpointing #3320

bigning
Contributor

@bigning bigning commented May 24, 2024

what?

Composer change to use the new RemoteUploader for saving checkpoints

Test

llm-foundry PR: mosaicml/llm-foundry#1237

1-node small model OCI test

test-uploader-wfA8Vz

1-node small model S3 test

save: test-uploader-10udLt e2e time: 3 min 40s
resume: test-uploader-TDGgTs

baseline using the old remote_uploader_downloader: test-uploader-racGuC e2e time: 3 min 42s

@bigning bigning changed the title [WIP] Use remote uploader v2 for checkpointing [checkpoint v2] Use remote uploader v2 for checkpointing May 29, 2024
@bigning bigning force-pushed the use_remote_uploader_v2_for_checkpointing branch from 6e9612e to deaec50 Compare May 30, 2024 04:04
@bigning bigning marked this pull request as ready for review May 30, 2024 04:05
Contributor

@mvpatel2000 mvpatel2000 left a comment

Why do we default to v1 still?

Comment on lines 159 to 164
def get_metadata_state_dict(
    model: Optional[Union[ComposerModel, nn.Module]] = None,
    sharded_state_dict: Optional[bool] = None,
    precision: Optional[Union[str, torch.dtype]] = None,
    device: Optional[Device] = None,
    device_train_microbatch_size: Optional[int] = None,
Contributor

Is this copied from Evan's PR?

Contributor Author

oh it got messed up. will clean up

@bigning bigning changed the title [checkpoint v2] Use remote uploader v2 for checkpointing [WIP] [checkpoint v2] Use remote uploader v2 for checkpointing May 30, 2024
Comment on lines +552 to +556
if self.use_remote_uploader_v2 and self.remote_uploader is not None:
    self.remote_uploader.upload_file_async(
        remote_file_name=symlink_name,
        file_path=symlink_filename,
        overwrite=True,
Contributor

OK, this makes sense for a PR where we just simplify the RemoteUploader and don't change any functionality, but eventually we want the RemoteUploader itself to manage symlinks. That way we can actually avoid race conditions, such as a symlink landing in the bucket before the checkpoint files it points to.
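
A minimal sketch of an uploader that owns symlink ordering, to make the race concrete. This is not Composer's RemoteUploader API: the class `SymlinkAwareUploader`, the method `upload_symlink_async`, and the use of a local directory as a stand-in for the object store are all assumptions made for illustration. The idea is that the uploader snapshots the checkpoint uploads already queued and only writes the symlink file once they have all succeeded.

```python
import shutil
from concurrent.futures import Future, ThreadPoolExecutor, wait
from pathlib import Path
from typing import List


class SymlinkAwareUploader:
    """Hypothetical uploader; a local directory stands in for the remote store."""

    def __init__(self, remote_dir: Path, num_workers: int = 2) -> None:
        self.remote_dir = remote_dir
        self.file_executor = ThreadPoolExecutor(max_workers=num_workers)
        # Symlinks get their own worker so waiting on file uploads can never
        # starve the pool that performs those uploads.
        self.symlink_executor = ThreadPoolExecutor(max_workers=1)
        self.pending: List[Future] = []

    def _copy(self, file_path: Path, remote_file_name: str) -> str:
        dest = self.remote_dir / remote_file_name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(file_path, dest)
        return remote_file_name

    def upload_file_async(self, remote_file_name: str, file_path: Path) -> Future:
        future = self.file_executor.submit(self._copy, file_path, remote_file_name)
        self.pending.append(future)
        return future

    def upload_symlink_async(self, symlink_name: str, target: str) -> Future:
        # Snapshot the uploads queued so far; the symlink must not land first.
        deps = list(self.pending)

        def _upload_after_deps() -> str:
            done, _ = wait(deps)
            for dep in done:
                dep.result()  # re-raises if any checkpoint upload failed
            dest = self.remote_dir / symlink_name
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_text(target)  # symlink file holds the target's remote name
            return symlink_name

        return self.symlink_executor.submit(_upload_after_deps)

    def close(self) -> None:
        self.file_executor.shutdown(wait=True)
        self.symlink_executor.shutdown(wait=True)
```

With this shape the caller no longer has to sequence the symlink upload by hand: it queues the checkpoint files, then calls `upload_symlink_async`, and a failed checkpoint upload surfaces as an exception on the symlink future instead of leaving a dangling symlink.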

@bigning
Contributor Author

bigning commented May 30, 2024

Discussed with @eracah: we're going to add a new checkpoint saver instead of changing the current one.

@bigning bigning closed this May 30, 2024