Fix for invalid sharding and add self-contained example for Llama3 model from TorchTitan #22
Merged
Technically, we can support uneven sharding, can't we? But for now I am ok with this. Also, @XilunWu / @zpcore are going to put this logic into DTensor, so we can remove it from here once that's ready.
Right now in DTensor sharding prop, we only check `tensor_dim_size >= mesh_size` in some rules, as in https://github.com/pytorch/pytorch/blob/0decd966af9cdcb7ab4410cf475d2fc09f2dea0c/torch/distributed/tensor/_ops/_matrix_ops.py#L78-L80. Francisco's case is a bit different: what he's doing is filtering out uneven sharding. I would like you guys to clarify this a bit.
I think there are 3 concepts:
1. uneven sharding (a tensor dimension that does not split evenly across the mesh);
2. `is_tensor_shardable`, which means `tensor_dim_size >= mesh_size`;
3. a stricter evenly-shardable check.
And there are some TODOs:
1. Unify (2) and (3). The consequence is that we won't be able to support uneven sharding. Wanchao has a TODO on unifying (2) and (3) (see https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/_ops/utils.py#L147-L148), but I think this requires us to enable "always-on padding" (sorry, I forgot the exact term we've been using, but the meaning is to always pad DTensor's local tensor rather than only pad/unpad around collectives); otherwise this unification rejects all uneven DTensors in sharding prop.
2. Enforce the `is_tensor_shardable` check in every op's sharding prop.
My question to @Francescaaa is: why does AutoParallel want to avoid (1)? If this is a hard blocker, then we can consider upstreaming a property to OpStrategy that outputs all even sharding strategies, because filtering out all uneven sharding strategies directly in sharding prop may be problematic for the current DTensor status (i.e., we would need to support always-on padding).
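To make the distinction concrete, here is a minimal standalone sketch of the two checks being discussed. The function names and signatures are simplified illustrations, not the actual helpers in `torch/distributed/tensor/_ops/utils.py`:

```python
# Illustrative only: simplified, standalone versions of the two checks
# discussed above (the real helpers live in
# torch/distributed/tensor/_ops/utils.py and operate on DTensor specs).

def dim_is_shardable(tensor_dim_size: int, mesh_size: int) -> bool:
    # The check DTensor sharding prop currently applies in some rules:
    # every rank along the mesh dimension must receive at least one element.
    return tensor_dim_size >= mesh_size


def dim_is_evenly_shardable(tensor_dim_size: int, mesh_size: int) -> bool:
    # The stricter condition being filtered on here: the dimension must
    # split evenly across the mesh, so no rank needs padding.
    return tensor_dim_size % mesh_size == 0


# Example: a dimension of size 10 on a mesh dimension of size 4 is
# shardable (10 >= 4) but not evenly shardable (10 % 4 == 2).
assert dim_is_shardable(10, 4)
assert not dim_is_evenly_shardable(10, 4)
```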
Yes, we can ultimately support uneven sharding. But it might be better to focus on the "simple case" first and have everything be evenly shardable.
In the end, we should only filter out what is impossible, so something like `tensor_dim_size >= mesh_size` indeed. But to make things simpler to reason about, I'm filtering out a bit more. For now I believe keeping only even sharding will probably be best to ensure the rest of the stack is working as expected, as it will be easier to reason about (i.e., we don't need to check the outcome of many different traces to understand what changed between them).
But when we support dynamic shapes (which will be important for Ads models), we will indeed need to re-enable uneven sharding.
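As a rough illustration of the "filter a bit more" approach described above, here is a hedged sketch of keeping only the candidate shardings that divide a tensor dimension evenly across a mesh dimension. The `(tensor_dim, mesh_dim_size)` pair representation is an assumption made for the example, not the PR's actual data structures:

```python
# Hypothetical sketch: prune candidate shardings down to the even ones.
# A candidate is modeled here as a (tensor_dim, mesh_dim_size) pair; the
# real code works on DTensor placements / OpStrategy objects instead.

def filter_even_shardings(candidates, tensor_shape):
    kept = []
    for tensor_dim, mesh_dim_size in candidates:
        # Keep the candidate only if the sharded dimension splits evenly
        # across the mesh dimension (no padding required on any rank).
        if tensor_shape[tensor_dim] % mesh_dim_size == 0:
            kept.append((tensor_dim, mesh_dim_size))
    return kept


# Example: a tensor of shape (8, 6) on a mesh dimension of size 4 can be
# evenly sharded along dim 0 (8 % 4 == 0) but not along dim 1 (6 % 4 != 0).
print(filter_even_shardings([(0, 4), (1, 4)], (8, 6)))  # -> [(0, 4)]
```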