Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version #31
@ZhengRui when I ran the demo script (latest version of master, pytorch 0.3), it consumes the same amount of memory as the original implementation (within ~100MB). Do you notice the memory increase when you run the demo script? Can you give me specific steps to replicate?
@gpleiss I just ran some more tests. The memory difference is indeed similar in the single-GPU case. But when I compare on my desktop with 4 GPUs, the multi-GPU memory usage is quite different. The old densenet201 model can support a batch size of 440 per GPU (even in training mode with backprop):

```python
import torch
from torch.autograd import Variable
from densenet_efficient_multi_gpu import DenseNetEfficientMulti

net = torch.nn.DataParallel(DenseNetEfficientMulti(
    growth_rate=32, block_config=[6, 12, 48, 32],
    num_init_features=64, num_classes=128, cifar=False)).cuda()
net.eval()

o = net(Variable(torch.rand(1760, 3, 224, 224), volatile=True).cuda())
```

Memory usage:

The new densenet201 model cannot support that batch size (even for inference); tested in both pytorch0.2 and pytorch0.3, with similar memory usage:

```python
import torch
from torch.autograd import Variable
from densenet_efficient_pth3 import DenseNetEfficient

net = torch.nn.DataParallel(DenseNetEfficient(
    growth_rate=32, block_config=[6, 12, 48, 32],
    num_init_features=64, num_classes=128, small_inputs=False)).cuda()
net.eval()

o = net(Variable(torch.rand(1700, 3, 224, 224), volatile=True).cuda())
```

For the 2nd issue I had above, I think it is because of my own changes; I added some conditions in
Thanks @ZhengRui for the profiling! I'll look into this later this week.
This will be fixed with the PyTorch 0.4 updates in #35
Great, will test it soon 😀
Closed by #35 |
@gpleiss, can you share some memory comparison results between the new pytorch0.4 model and the old models? I found the pytorch0.4 model consumes way more memory, even on a single GPU. Here is a simple test of densenet201 on a 1080Ti (11G): it can only afford a batch size of around 100, while the old model can support around 600:

```python
import torch
from densenet import DenseNet

net = DenseNet(growth_rate=32, block_config=[6, 12, 48, 32],
               num_init_features=64, num_classes=128,
               small_inputs=False, efficient=True).cuda()
net.eval()

o = net(torch.rand(100, 3, 224, 224).cuda())
```
My bad, I should have read the 0.4 migration guide more carefully. The proper inference code should be:

```python
import torch
from densenet import DenseNet

net = DenseNet(growth_rate=32, block_config=[6, 12, 48, 32],
               num_init_features=64, num_classes=128,
               small_inputs=False, efficient=True).cuda()
net.eval()

with torch.no_grad():
    o = net(torch.rand(600, 3, 224, 224).cuda())
```

Now the batch size can also reach 600 with the pytorch0.4 model, and memory usage is more stable than before. Nice update.
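For anyone comparing the two models' footprints, peak GPU memory per batch size can be measured with `torch.cuda.max_memory_allocated()` (available since PyTorch 0.4; the reset call requires a newer release). A minimal sketch — the helper name `peak_mem_mb` and the CPU fallback are my own additions, not part of this repo:

```python
import torch

def peak_mem_mb(model, batch):
    """Run one inference pass under no_grad and report peak GPU memory in MB.

    Returns 0.0 on a CPU-only machine, where there is nothing to measure.
    """
    if torch.cuda.is_available():
        # Clear the running peak so we measure only this forward pass.
        torch.cuda.reset_max_memory_allocated()
    with torch.no_grad():
        model(batch)
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / 1024 ** 2
```

Sweeping this over increasing batch sizes makes the old-vs-new comparison concrete instead of waiting for an OOM.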
Just tried the new implementation in pytorch0.3, but it consumes much more memory than the old implementation. Some issues:

1. When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the `for device_idx in range(torch.cuda.device_count())` part in `_SharedAllocation()` requires some modification and optimization.
2. When the model runs on multiple GPUs, the batch size it can afford is much less than the single-GPU batch size times the number of GPUs. From my test it can only afford the same size as the single-GPU version.
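One possible direction for the single-GPU issue, as a hypothetical sketch only (I'm assuming `_SharedAllocation` essentially wraps a resizable buffer; this is not the repo's actual API): allocate the shared storage lazily on one caller-chosen device, instead of eagerly on every visible GPU.

```python
import torch

class SharedAllocation:
    """Hypothetical per-device shared buffer: storage is created only on the
    device that is actually used, never by looping over all visible GPUs."""

    def __init__(self, size, device="cpu"):
        # One allocation, on a single device chosen by the caller.
        self.storage = torch.empty(size, device=device)

    def resize_(self, size):
        # Grow in place only when the requested size exceeds capacity;
        # never shrink, so repeated forward passes reuse the buffer.
        if self.storage.numel() < size:
            self.storage.resize_(size)
        return self
```

With something like this, a single-GPU run never touches the other GPUs, and each DataParallel replica would hold its own buffer on its own device.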