[BUG]: Memory consumption by fp16 is not normal #1083
Comments
The difference between my original PyTorch implementation and ColossalAI is the convert_to_amp API, which uses TorchAMPModel to wrap the original model. I compared three ways of enabling AMP:
1. using torch.cuda.amp.autocast(True) inside the model's forward function;
2. using the @torch.cuda.amp.autocast() decorator;
3. using TorchAMPModel.
The first two behave normally and need only 5769 MB of GPU memory, but the third one needs 8703 MB. A sketch of the three variants follows below.
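For reference, here is a minimal sketch of the three variants using a toy model rather than the actual train_debug.py (the TorchAMPModel import path is an assumption based on the ColossalAI version under discussion):

```python
import torch
import torch.nn as nn
from colossalai.amp.torch_amp import TorchAMPModel  # import path assumed

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 1024)

    def forward(self, x):
        # variant 1: autocast as a context manager inside forward
        with torch.cuda.amp.autocast(True):
            return self.fc(x)

class NetDecorated(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1024, 1024)

    @torch.cuda.amp.autocast()  # variant 2: autocast as a decorator on forward
    def forward(self, x):
        return self.fc(x)

# variant 3: wrap an unmodified module with ColossalAI's TorchAMPModel
model = TorchAMPModel(nn.Linear(1024, 1024).cuda())
```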
TorchAMPModel in ColossalAI is the same as @torch.cuda.amp.autocast(cache_enabled=True).
The problem now is that, with the same model, PyTorch + fp16 takes less memory than ColossalAI + fp16 (5769 MB vs. 8703 MB).
I tried to reproduce your problem but failed. Could you provide your code for reference?
My test code is attached as train_debug.py; run it with `colossalai run --nproc_per_node 1 train_debug.py --config_dir PATH_TO_YOUR_CONFIG`. The config file is attached as well.
With use_colossai_engine = True, the GPU memory during training is 8200 MB.
See this line. So you are training the torch version without AMP. Is that what you want?
I have verified that using TorchAMPModel and directly applying @torch.cuda.amp.autocast(cache_enabled=True) behave the same, so the conclusion in #1083 (comment) is wrong. I do want to train the torch version with AMP, as you can see in this line.
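For context, this equivalence follows from what TorchAMPModel does; a paraphrase of the idea (not the verbatim ColossalAI source):

```python
import torch
import torch.nn as nn

class TorchAMPModel(nn.Module):
    """Paraphrased: simply run the wrapped model under torch.cuda.amp.autocast."""
    def __init__(self, model: nn.Module):
        super().__init__()
        self.model = model

    @torch.cuda.amp.autocast()
    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```

This is why the two approaches show identical memory behavior.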
That makes sense. Thanks for helping us check the functionality of AMP. I'll close this issue.
That was not what I meant. What I wanted to express is that defining fp16 in the config and then initializing the model with colossalai.initialize consumes about 8 GB of memory, whereas training directly with TorchAMPModel, without colossalai.initialize, consumes only 5.7 GB. I would like to know the reason for this extra memory consumption. Is it a problem with using the Engine? You can test with the code above.
As we have checked that AMP is correct, I'll close the issue.
Issue #1082 is related to ZeRO and has been resolved.
I cannot run your code. What is your config.py? How did you inspect the CUDA memory?
config.py
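(The attached file is not preserved; a hypothetical reconstruction of the kind of config implied by the thread, assuming the standard ColossalAI fp16 convention, would be:)

```python
# config.py -- hypothetical reconstruction, not the author's actual file
from colossalai.amp import AMP_TYPE

fp16 = dict(mode=AMP_TYPE.TORCH)
```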
I inspected the CUDA memory using the command attached to the issue.
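(That attachment is not preserved. Purely as an illustration, and not the author's actual command, peak CUDA memory can be checked from Python like this:)

```python
import torch

# hypothetical example of inspecting memory; the original command was attached to the issue
print(f"current allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
print(f"peak allocated:    {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
```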
Hi, did you apply the AMP wrapper to your optimizer as well?
@FrankLeeeee is right. When not using the ColossalAI engine, you should also wrap the optimizer; that will align the memory usage.
colossalai.initialize will automatically use colossalai.amp.convert_to_torch_amp to wrap the model, optimizer, and criterion, as you can see in colossalai.initialize.
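Schematically, the wrapping step inside colossalai.initialize amounts to the following (a paraphrase of the behavior described above, not the actual source):

```python
from colossalai.amp import convert_to_torch_amp  # name taken from the discussion above

# when the config contains an fp16 section with AMP_TYPE.TORCH, initialize
# wraps all three components before building the engine
model, optimizer, criterion = convert_to_torch_amp(model, optimizer, criterion)
```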
Yes, I mean in your experiment without using colossalai.initialize.
Of course. Without colossalai.initialize, the training process consumes only 5700 MB of GPU memory; with colossalai.initialize, it takes 8000 MB.
I inserted a print statement into colossalai.initialize and it printed what I expected.
Console output:
This means that the conversion by colossalai.amp.convert_to_torch_amp is indeed performed through colossalai.initialize, yet it consumes more GPU memory.
I think there is some misunderstanding. What I mean is that you should wrap not only your model with torch amp but also your optimizer. In AMP, gradients need to be handled in mixed precision as well; wrapping only the model and not the optimizer loses the gradient-handling part. That's why you see lower memory usage when you wrap only the model.
I do not think so; I have wrapped the criterion as well. And the code below is the plain PyTorch way to wrap the optimizer. Even this way, it still consumes less GPU memory, and it is the same as TorchAMPOptimizer(ColossalaiOptimizer).
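The attached snippet is not preserved; the standard PyTorch GradScaler pattern presumably meant here looks like this (a self-contained sketch with a stand-in model, not the author's script):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 10).cuda()        # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                        # stand-in for the real dataloader
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()          # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                 # unscales gradients, then steps the optimizer
    scaler.update()                        # adjust the scale factor for the next iteration
```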
I may not have expressed myself clearly. As you said above, using colossalai.amp.convert_to_torch_amp to wrap the model, optimizer, and criterion and then training normally also consumes only 4700 MB of memory. But if you use colossalai.initialize, it consumes 7700 MB, even though we did see that the fp16 parameter is read from the config inside the initialization code of colossalai.initialize.
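Put side by side, the two paths under comparison are roughly the following (a sketch assuming model, optimizer, criterion, and train_loader are already built; the real script was attached to the issue):

```python
import colossalai
from colossalai.amp import convert_to_torch_amp  # import path assumed

# Path A: manual AMP wrapping, no engine -- reported ~4700 MB
model, optimizer, criterion = convert_to_torch_amp(model, optimizer, criterion)

# Path B: engine initialization reading fp16 from the config -- reported ~7700 MB
engine, train_loader, _, _ = colossalai.initialize(
    model, optimizer, criterion, train_dataloader=train_loader)
```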
I see. I am writing a script to reproduce this problem. Can you open a new issue so that we can move our discussion there?
OK, I will open a new issue: #1095.
🐛 Describe the bug
When I use the original PyTorch AMP, the GPU memory usage is much smaller than with ColossalAI. Why?
The config is:
Environment
No response