
Fix fsdp checkpoint strategy #1734

Merged: 13 commits merged into mosaicml:dev on Nov 18, 2022
Conversation

@bcui19 (Contributor) commented Nov 16, 2022

What does this PR do?

This PR allows us to use method 2 for activation checkpointing. FSDP recursively descends into the module tree to find submodules that should have activation checkpointing applied. However, we should not wrap `FullyShardedDataParallel` or `FlattenedParamsWrapper` modules with the `CheckpointWrapper`: attempting to wrap these two module types triggers a confusing assertion failure.

Ran with 7B-parameter GPT-2-style models:
run with method 1: https://wandb.ai/mosaic-ml/fsdp-activation-checkpointing/runs/17oacy01
run with method 2: https://wandb.ai/mosaic-ml/fsdp-activation-checkpointing/runs/1f1ij64d
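The skip logic described above can be sketched as a predicate that the activation-checkpointing wrapper consults before wrapping each module. This is a minimal illustration, not the Composer implementation: the classes below are stand-ins for the real PyTorch/FairScale wrapper types, and `should_activation_checkpoint` is a hypothetical name for the `check_fn`-style callable.

```python
# Sketch of the fix: never apply CheckpointWrapper to FSDP's own wrapper
# modules. The classes here are hypothetical stand-ins for the real
# torch/fairscale types of the same names.

class FullyShardedDataParallel:   # stand-in for the FSDP wrapper module
    pass

class FlattenedParamsWrapper:     # stand-in for the flat-param wrapper module
    pass

class Linear:                     # stand-in for an ordinary checkpointable module
    pass

def should_activation_checkpoint(module) -> bool:
    """Return True only for modules safe to wrap in CheckpointWrapper."""
    # Wrapping these two wrapper types is what triggered the assert,
    # so they are explicitly excluded.
    if isinstance(module, (FullyShardedDataParallel, FlattenedParamsWrapper)):
        return False
    return True

print(should_activation_checkpoint(Linear()))                    # True
print(should_activation_checkpoint(FullyShardedDataParallel()))  # False
```

In PyTorch's activation-checkpointing utilities, a predicate like this is passed as the `check_fn` argument so that the recursive walk wraps ordinary submodules but skips FSDP's internal wrappers.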

What issue(s) does this change relate to?

Fixing CO-1377

Before submitting

  • Have you read the contributor guidelines?
  • Is this change a documentation change or typo fix? If so, skip the rest of this checklist.
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@bcui19 bcui19 marked this pull request as ready for review November 17, 2022 00:00
@abhi-mosaic (Member) left a comment

Thank you for the quick fix! I think there is some duplicate code to cleanup, but approving now to unblock.

Review threads on composer/trainer/dist_strategy.py (resolved)
@bcui19 enabled auto-merge (squash) November 18, 2022 00:34
@bcui19 merged commit a571bf9 into mosaicml:dev Nov 18, 2022
bmosaicml pushed a commit to bmosaicml/composer that referenced this pull request Dec 13, 2022
* Adding in performance registration

* clearing optimizer param groups before FSDP wrapping

* fixing comments

* Adding a conditional to ignore checkpointing FSDP and FlattenedWrappers

* removing print comments

* Adding in print for my own sanity

* removing extraneous print statement

* removing duplicate code due to a merge

Co-authored-by: Brandon Cui <bcui@Brandons-MBP.hsd1.ca.comcast.net>
@bcui19 deleted the fix_FSDP_checkpoint_strategy branch March 10, 2023 18:02