[BUG]: ZeRO without using shard_param #1082
Comments
If I modify the code as follows, it actually works.
|
I do not know how to save the ZeRO model params. When I use the save_checkpoint API, the saved file is pretty small. |
From the ZeRO paper
If ZeRO is used without parameter partitioning, it has the same communication volume as DP, but the training speed is lower than with DP. The specific results are as follows:
|
Yes, you are right. |
ZeRO saves fp16 parameters now, so the saved checkpoint is half the size of a normal one. |
Thanks. But I checked the saved .pt file, and it contains only the bn params; the other params are []. Maybe I made a mistake; I will check further.
|
In theory, you are right. However, we don't optimize ZeRO without sharded parameters now. If you don't need sharded parameters, you can just use DDP instead of ZeRO. We are implementing a new ZeRO, which is faster than the current implementation. You can also try it when our work is done. |
Thanks. I have another question. As the table above shows, even when I use sharded params, the GPU memory only drops from 9141M to |
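For readers following the communication-volume argument, here is a minimal sketch of the per-step accounting given in the ZeRO paper. It counts elements sent per GPU, not bytes, and it ignores implementation overheads, which is why it cannot explain the slowdown reported in this thread. The function name `comm_volume_per_step` is hypothetical and is not part of ColossalAI.

```python
def comm_volume_per_step(num_params: int, shard_params: bool) -> int:
    """Elements each GPU communicates per training step (ZeRO paper accounting).

    This is a theoretical count only; real throughput also depends on how
    the collectives are bucketed and overlapped with compute.
    """
    if shard_params:
        # Parameter partitioning: all-gather params in the forward pass,
        # all-gather them again in the backward pass, then reduce-scatter
        # the gradients. Three passes over the parameters.
        passes = 3
    else:
        # Plain data parallelism all-reduces gradients, which costs a
        # reduce-scatter plus an all-gather. ZeRO without parameter
        # partitioning reduce-scatters gradients and all-gathers the
        # updated parameters. Both are two passes.
        passes = 2
    return passes * num_params


# Element count of the ir18 model quoted in this thread.
ir18_elements = 45_783_552
dp_or_zero_unsharded = comm_volume_per_step(ir18_elements, shard_params=False)
zero_sharded = comm_volume_per_step(ir18_elements, shard_params=True)
print(zero_sharded / dp_or_zero_unsharded)  # 1.5
```

Under this accounting, ZeRO without parameter sharding matches DP in volume, so any gap in speed comes from the implementation rather than from the amount of data moved.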
Could you tell the size of |
The ir18 model is as follows:
|
Hi, the ir18 model has only 45783552 elements, which is 87MB in fp16. I think that's normal, since it's a small model: sharding parameters won't save much GPU memory. The activations take most of the GPU memory. You can try activation offload in this case. |
Thanks. Will your new ZeRO optimize ZeRO without sharded parameters? |
Yes, we will. |
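The 87MB figure above can be checked with a short calculation. The sketch below assumes 2 bytes per fp16 element and an even split of parameters across GPUs. The helper names and the world size of 8 are hypothetical and chosen only for illustration.

```python
BYTES_PER_MIB = 1024 ** 2
FP16_BYTES = 2  # one fp16 element occupies 2 bytes


def fp16_param_mib(num_elements: int) -> float:
    """Memory taken by the fp16 parameters alone, in MiB."""
    return num_elements * FP16_BYTES / BYTES_PER_MIB


def sharding_saving_mib(num_elements: int, world_size: int) -> float:
    """fp16 parameter memory freed per GPU if parameters are split evenly.

    Assumes an even split across `world_size` GPUs; activations, gradients
    and optimizer states are deliberately not counted here.
    """
    full = fp16_param_mib(num_elements)
    return full - full / world_size


ir18_elements = 45_783_552  # element count quoted in this thread
print(f"{fp16_param_mib(ir18_elements):.1f} MiB")  # 87.3 MiB
print(f"{sharding_saving_mib(ir18_elements, world_size=8):.1f} MiB saved per GPU")
```

Even in the best case the saving is bounded by the 87 MiB of fp16 parameters, which is small next to the roughly 9 GB of total usage reported earlier in the thread.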
Can't wait to see your new version of ZeRO. |
Hello, I ran into the same problem: I don't know how to save the model's params properly when using ZeRO. I used the save_checkpoint API, but when I checked the .pt file manually, I found that all params have the value '[]' except some biases. How can I fix this? |
🐛 Describe the bug
When I use ZeRO without shard_param, the following problem occurs:
My init code is:
My config is:
Environment
pip install colossalai==0.1.5+torch1.10cu11.1 -f https://release.colossalai.org
ubuntu 18.04