
share inflight registry between PartitionedParameterCoordinators #3462

Merged · 7 commits · May 15, 2023

Conversation

@HeyangQin (Contributor) commented May 5, 2023

This is a collaborative effort with the Lightning team to solve #3068 and #3156. More discussion at Lightning-AI/pytorch-lightning#17523

There can be multiple PartitionedParameterCoordinator instances, yet they currently manage parameters in a standalone manner. Say we have PartitionedParameterCoordinator A and B: when A puts some parameters inflight, B is not aware of that, and when B tries to use those parameters it errors out. This PR addresses the issue by making the __InflightParamRegistry shared among all PartitionedParameterCoordinator instances. Unlike #3380, this PR binds the registry to the model, so it does not break multi-model training.
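
For illustration only, here is a minimal sketch of the shared-registry idea described above. It is not DeepSpeed's actual implementation; aside from the PartitionedParameterCoordinator and registry names taken from the PR description, all class and method names are hypothetical. The point it shows is that the registry is created once per model and injected into every coordinator, so a parameter marked inflight by one coordinator is visible to the others.

```python
# Sketch: a single inflight-parameter registry bound to the model and shared
# by every coordinator, rather than one private registry per coordinator.

class InflightParamRegistry(dict):
    """Maps a parameter (here just a name) to its in-progress gather handle."""


class PartitionedParameterCoordinator:
    def __init__(self, inflight_param_registry: InflightParamRegistry) -> None:
        # The registry is injected, not created here, so all coordinators for
        # the same model operate on the same shared state.
        self._inflight = inflight_param_registry

    def mark_inflight(self, param: str, handle: object) -> None:
        # Record that a gather for `param` has been launched.
        self._inflight[param] = handle

    def is_inflight(self, param: str) -> bool:
        # Any coordinator can see gathers launched by any other coordinator.
        return param in self._inflight


# One registry per model; both coordinators share it.
registry = InflightParamRegistry()
coordinator_a = PartitionedParameterCoordinator(registry)
coordinator_b = PartitionedParameterCoordinator(registry)

coordinator_a.mark_inflight("layer1.weight", handle=object())
assert coordinator_b.is_inflight("layer1.weight")  # B sees A's inflight param
```

Before this change, each coordinator held its own registry, so the assertion above would fail: B would have no record of the gather A started and would error when it tried to use the parameter.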

@HeyangQin HeyangQin marked this pull request as ready for review May 12, 2023 18:14
@awaelchli (Contributor) commented:
I verified this works great together with our patch on the Lightning side (ready to be merged soon) 🎉
