[BUG]: TypeError: __init__() got an unexpected keyword argument 'strict_ddp_mode' #2590
Comments
Can you try upgrading ColossalAI to 0.2.0 and pulling the latest changes to the GPT example?
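For anyone checking whether their environment already meets the suggested 0.2.0 minimum, here is a minimal sketch using only the standard library. It assumes the package is installed under the distribution name `colossalai`; the helper names `parse_version`, `meets_minimum`, and `installed_version` are made up for illustration and are not part of ColossalAI.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn a leading dotted-number prefix such as '0.2.0' into (0, 2, 0).

    Anything after the numeric prefix (e.g. 'rc1' or '+local') is ignored,
    so this is only a rough comparison, not full PEP 440 handling.
    """
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(piece) for piece in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.2' and '0.2.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package="colossalai"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("colossalai is not installed")
    elif meets_minimum(found, "0.2.0"):
        print(f"colossalai {found} is new enough")
    else:
        print(f"colossalai {found} is older than 0.2.0; consider upgrading")
```

Running this before launching the example gives a quick hint as to whether the example code and the installed library are likely out of step.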
Thanks @JThh. It works after upgrading to 0.2.0. Can you please update the README.md here accordingly? Also, can you help me with the definitions and recommendations for each of the distplans ["CAI_ZeRO1", "CAI_ZeRO2", "CAI_Gemini", "Pytorch_DDP", "Pytorch_ZeRO"]? I could not find this information compiled anywhere. Thank you!
Thank you for pointing this out. I will update it soon. "CAI_ZeRO1" and "CAI_ZeRO2" refer to two types of zero redundancy optimizers; their differences are detailed in this paper (on page 2). For "CAI_Gemini" (or "zero3"), please see our tutorial. "Pytorch_DDP" and "Pytorch_ZeRO" refer to PyTorch's own implementations of data parallelism and ZeroRedundancyOptimizer. You may trace the function calls here and see how they differ. You are advised to try out our
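To make the difference between the plans more concrete, here is a rough sketch of the per-device memory accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 optimizer states. The mapping from distplan names to ZeRO stages in `PLAN_TO_STAGE` is an assumption based on the description above, not something confirmed by the example's code, and the estimate ignores activations, buffers, and any CPU offloading that Gemini may perform.

```python
# Assumed mapping from distplan name to ZeRO stage (0 = plain data parallelism).
# This is an illustration, not taken from the ColossalAI source.
PLAN_TO_STAGE = {
    "Pytorch_DDP": 0,
    "Pytorch_ZeRO": 1,   # PyTorch's ZeroRedundancyOptimizer shards optimizer states
    "CAI_ZeRO1": 1,
    "CAI_ZeRO2": 2,
    "CAI_Gemini": 3,     # described in the thread as "zero3"
}


def model_state_bytes(num_params, num_devices, stage, optimizer_multiplier=12):
    """Estimate per-device bytes for model states under mixed-precision Adam.

    stage 0: nothing is partitioned
    stage 1: optimizer states are partitioned across devices
    stage 2: gradients are partitioned as well
    stage 3: parameters are partitioned as well
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params                       # fp16 weights
    grads = 2 * num_params                        # fp16 gradients
    optim = optimizer_multiplier * num_params     # fp32 weights, momentum, variance
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + optim


if __name__ == "__main__":
    # The 7.5B-parameter, 64-device setting used as an example in the ZeRO paper.
    for plan, stage in PLAN_TO_STAGE.items():
        gb = model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"{plan:13s} stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, higher stages trade extra communication for a smaller per-device footprint, so a plan with a lower stage is usually preferable whenever the model states already fit in device memory.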
Thank you @JThh for the references. I am experimenting with the distplans listed and putting together a recommendation list that suggests which plan to use for a given dataset and hardware capacity, since this information is not available in one place. If you already have such a list, sharing it would save me a lot of time. Thank you once again!
Thank you. If you are willing to share your findings, feel free to post here!
🐛 Describe the bug
I am able to run the training job with zero1 and zero2, but with distplan 'colossalai' I am facing the error in the title.
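An "unexpected keyword argument" TypeError like the one in the title usually means the calling script passes an option that the installed version of a class does not accept yet. As a hedged illustration, the sketch below uses `inspect.signature` to check which keyword arguments a constructor accepts before calling it. `OldWrapper` and `NewWrapper` are hypothetical stand-ins written for this example; they are not ColossalAI classes, and only the parameter name `strict_ddp_mode` is taken from the error message.

```python
import inspect


def supported_kwargs(callable_obj, candidate_kwargs):
    """Return only the entries of candidate_kwargs that callable_obj accepts."""
    params = inspect.signature(callable_obj).parameters.values()
    # A **kwargs parameter means every keyword is accepted.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return dict(candidate_kwargs)
    accepted = {
        p.name
        for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD,
                      inspect.Parameter.KEYWORD_ONLY)
    }
    return {k: v for k, v in candidate_kwargs.items() if k in accepted}


class OldWrapper:
    """Hypothetical older API without the strict_ddp_mode option."""

    def __init__(self, module, device=None):
        self.module = module
        self.device = device


class NewWrapper:
    """Hypothetical newer API that accepts strict_ddp_mode."""

    def __init__(self, module, device=None, strict_ddp_mode=False):
        self.module = module
        self.device = device
        self.strict_ddp_mode = strict_ddp_mode


if __name__ == "__main__":
    wanted = {"device": "cpu", "strict_ddp_mode": True}
    for cls in (OldWrapper, NewWrapper):
        usable = supported_kwargs(cls, wanted)
        dropped = sorted(set(wanted) - set(usable))
        cls("model", **usable)  # no TypeError, unsupported options were filtered
        print(cls.__name__, "accepts", sorted(usable), "dropped", dropped)
```

Silently dropping an option can hide a real version mismatch, so a check like this is better suited to producing a clear diagnostic than to working around the problem; in this thread, upgrading the library is what resolved the error.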