Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version #31

Closed
ZhengRui opened this issue Mar 20, 2018 · 8 comments

@ZhengRui
Contributor

ZhengRui commented Mar 20, 2018

Just tried the new implementation in pytorch0.3, but it consumes much more memory than the old implementation. Some issues:

  1. When the model runs on a single GPU, it still allocates shared storage on all the GPUs. I think the for device_idx in range(torch.cuda.device_count()) loop in _SharedAllocation() needs some modification and optimization (a rough sketch of what I mean is below the list).

  2. When the model runs on multiple GPUs, the batch size it can afford is much smaller than the single-GPU batch size times the number of GPUs. From my test it can only afford the same batch size as the single-GPU version.
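
For illustration, here is a rough sketch of the kind of change I mean for point 1 (the class name and signature are hypothetical, not the repo's actual _SharedAllocation): allocate the shared storage only on an explicit list of devices instead of looping over torch.cuda.device_count().

import torch

# Hypothetical sketch only -- not the repo's actual _SharedAllocation class.
class SharedAllocation(object):
    def __init__(self, size, device_ids=None):
        # Default to the current device instead of every visible GPU.
        if device_ids is None:
            device_ids = [torch.cuda.current_device()]
        self.storages = {}
        for device_idx in device_ids:
            with torch.cuda.device(device_idx):
                # One shared buffer per requested device only.
                self.storages[device_idx] = torch.cuda.FloatTensor(size).storage()

    def resize_(self, size):
        # Grow every per-device buffer to at least `size` elements.
        for storage in self.storages.values():
            if storage.size() < size:
                storage.resize_(size)
        return self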

@gpleiss
Owner

gpleiss commented Mar 20, 2018

@ZhengRui when I ran the demo script (latest version of master, pytorch 0.3), it consumed the same amount of memory as the original implementation (within ~100MB). Do you notice the memory increase when you run the demo script? Can you give me specific steps to reproduce?

  1. "when the model runs on a single gpu, it still allocates shared storage on all the gpus" - if you don't want memory allocated on all GPUs, run the script with CUDA_VISIBLE_DEVICES=___.

  2. "when the model runs on multi gpu, the batch size it can afford is much less than the batch size of single gpu times number of gpu" - again, can you provide specific numbers, and an experiment to reproduce? From my experiments, the multi-GPU efficient model had about the same overhead as a normal multi-GPU model.

@ZhengRui
Contributor Author

ZhengRui commented Mar 20, 2018

@gpleiss I just ran some more tests. The memory usage is indeed similar in the single-GPU case. When I compare the two on my desktop with 4 GPUs in the multi-GPU case, however, the memory usage is quite different:

The old densenet201 model can support a batch size of 440 per GPU (even in training mode with backprop):

import torch
from torch.autograd import Variable
from densenet_efficient_multi_gpu import DenseNetEfficientMulti

net = torch.nn.DataParallel(DenseNetEfficientMulti(growth_rate=32, block_config=[6, 12, 48, 32],
                                                   num_init_features=64, num_classes=128,
                                                   cifar=False)).cuda()
net.eval()
o = net(Variable(torch.rand(1760, 3, 224, 224), volatile=True).cuda())

memory usage:

Mon Mar 19 23:53:19 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 28%   58C    P8    19W / 250W |   7668MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   56C    P8    19W / 250W |   6532MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:B5:00.0  On |                  N/A |
| 16%   56C    P2    79W / 250W |   7225MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 21%   55C    P8    16W / 250W |   6868MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     37915      C   python                                      7657MiB |
|    1     37915      C   python                                      6521MiB |
|    2      1145      G   /usr/lib/xorg/Xorg                           256MiB |
|    2      1991      G   compiz                                        59MiB |
|    2      2326      G   ...-token=C676692CF525BEB157863C635C1C3915    47MiB |
|    2     37915      C   python                                      6857MiB |
|    3     37915      C   python                                      6857MiB |
+-----------------------------------------------------------------------------+

The new densenet201 model cannot support that batch size (even for inference). Tested in both pytorch0.2 and pytorch0.3, with similar memory usage:

import torch
from torch.autograd import Variable
from densenet_efficient_pth3 import DenseNetEfficient

net = torch.nn.DataParallel(DenseNetEfficient(growth_rate=32, block_config=[6, 12, 48, 32],
                                              num_init_features=64, num_classes=128,
                                              small_inputs=False)).cuda()
net.eval()
o = net(Variable(torch.rand(1700, 3, 224, 224), volatile=True).cuda())
Mon Mar 19 23:59:33 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 37%   61C    P8    20W / 250W |  10686MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 35%   60C    P8    20W / 250W |   9912MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:B5:00.0  On |                  N/A |
| 14%   57C    P2    79W / 250W |  10482MiB / 11170MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B6:00.0 Off |                  N/A |
| 31%   58C    P8    17W / 250W |  10238MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     39547      C   python                                     10675MiB |
|    1     39547      C   python                                      9901MiB |
|    2      1145      G   /usr/lib/xorg/Xorg                           178MiB |
|    2      2326      G   ...-token=C676692CF525BEB157863C635C1C3915    71MiB |
|    2     39547      C   python                                     10225MiB |
|    2     40103      G   compiz                                         3MiB |
|    3     39547      C   python                                     10227MiB |
+-----------------------------------------------------------------------------+

As for the 2nd issue I had above, I think it was my own fault: I had added some conditions in _SharedAllocation() to differentiate the single-GPU and multi-GPU cases, and that change likely caused it.

@gpleiss
Owner

gpleiss commented Mar 21, 2018

Thanks @ZhengRui for the profiling! I'll look into this later this week.

@gpleiss gpleiss changed the title New model in pytorch0.3 consumes much more memory and seems buggy Multi-GPU model in pytorch0.3 consumes much more memory than pytorch0.1 version Mar 21, 2018
@gpleiss gpleiss added the bug label Mar 21, 2018
@gpleiss
Owner

gpleiss commented Apr 26, 2018

This will be fixed with the PyTorch 0.4 updates in #35

@ZhengRui
Contributor Author

Great, will test it soon 😀

@gpleiss
Owner

gpleiss commented Apr 27, 2018

Closed by #35

@gpleiss gpleiss closed this as completed Apr 27, 2018
@ZhengRui
Contributor Author

@gpleiss, can you share some memory comparison results between the new pytorch0.4 model and the old models? I found that the pytorch0.4 model consumes far more memory, even on a single GPU. Here is a simple test of densenet201 on a 1080Ti (11G): it can only afford a batch size of around 100, while the old model could support around 600:

import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
o = net(torch.rand(100,3,224,224).cuda())

@ZhengRui
Contributor Author

ZhengRui commented May 13, 2018

My bad, I should have read the 0.4 migration guide more carefully. The proper inference code should look like this:

import torch
from densenet import DenseNet
net = DenseNet(growth_rate=32, block_config=[6,12,48,32], num_init_features=64, num_classes=128, small_inputs=False, efficient=True).cuda()
net.eval()
with torch.no_grad():
    o = net(torch.rand(600,3,224,224).cuda())

Now the batch size can also reach 600 with the pytorch0.4 model, and the memory usage is more stable than before. Nice update!
