
Out of memory errors no matter what parameters with deep speed #145

Closed · afiaka87 opened this issue Mar 31, 2021 · 11 comments

afiaka87 commented Mar 31, 2021

Using these fairly lightweight parameters:

BATCH_SIZE = 8
LEARNING_RATE = 3e-4

MODEL_DIM = 512
TEXT_SEQ_LEN = 128
DEPTH = 4
HEADS = 4
DIM_HEAD = 64
REVERSIBLE = True
LOSS_IMG_WEIGHT = 7

On a single V100 GPU, training needs only 6356 MB of GPU memory:

[0] Tesla V100-SXM2-16GB | 57'C, 81 % | 6356 / 16160 MB |
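
For context, here is a minimal sketch of how these hyperparameters presumably map onto the dalle_pytorch DALLE constructor (the DiscreteVAE settings and num_text_tokens are placeholders; the exact train_dalle.py wiring may differ):

from dalle_pytorch import DiscreteVAE, DALLE

# Placeholder VAE; the real run uses a pretrained/trained VAE.
vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,
    num_tokens = 8192,
    codebook_dim = 512,
)

dalle = DALLE(
    vae = vae,
    dim = 512,              # MODEL_DIM
    num_text_tokens = 10000,  # placeholder vocab size
    text_seq_len = 128,     # TEXT_SEQ_LEN
    depth = 4,              # DEPTH
    heads = 4,              # HEADS
    dim_head = 64,          # DIM_HEAD
    reversible = True,      # REVERSIBLE
    loss_img_weight = 7,    # LOSS_IMG_WEIGHT
).cuda()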

When run with DeepSpeed, however, memory usage immediately balloons, filling each GPU's 16 GiB until it runs out of memory before a single iteration completes.

Aside: please don't take these personally, ha. We have pinned versions and whatnot; I'm just trying to be thorough so I can come back and try to fix them myself.

Traceback (most recent call last):
  File "train_dalle.py", line 271, in <module>
    loss = distr_dalle(text, images, mask = mask, return_loss = True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 495, in forward
    loss_img = F.cross_entropy(logits[:, :, self.text_seq_len:], labels[:, self.text_seq_len:], ignore_index=0)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1591, in log_softmax
    ret = input.log_softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 1.80 GiB already allocated; 178.75 MiB free; 1.90 GiB reserved in total by PyTorch)

afiaka87 commented Mar 31, 2021

[0] Tesla V100-SXM2-16GB | 49'C,  39 % | 10539 / 16160 MB |
[1] Tesla V100-SXM2-16GB | 44'C,   0 % |     3 / 16160 MB |
[2] Tesla V100-SXM2-16GB | 41'C,   0 % |     3 / 16160 MB |
[3] Tesla V100-SXM2-16GB | 42'C,   0 % |     3 / 16160 MB |
[4] Tesla V100-SXM2-16GB | 41'C,   0 % |     3 / 16160 MB |
[5] Tesla V100-SXM2-16GB | 40'C,   0 % |     3 / 16160 MB |
[6] Tesla V100-SXM2-16GB | 41'C,   0 % |     3 / 16160 MB |

[0] Tesla V100-SXM2-16GB | 48'C,   0 % | 14533 / 16160 MB |
[1] Tesla V100-SXM2-16GB | 45'C,   8 % |  1098 / 16160 MB |
[2] Tesla V100-SXM2-16GB | 43'C,   2 % |   894 / 16160 MB |
[3] Tesla V100-SXM2-16GB | 43'C,   2 % |  1088 / 16160 MB |
[4] Tesla V100-SXM2-16GB | 42'C,   2 % |  1112 / 16160 MB |
[5] Tesla V100-SXM2-16GB | 41'C,   2 % |   998 / 16160 MB |
[6] Tesla V100-SXM2-16GB | 42'C,   2 % |  1002 / 16160 MB |
[7] Tesla V100-SXM2-16GB | 48'C,   2 % |  1338 / 16160 MB |

[0] Tesla V100-SXM2-16GB | 49'C,   0 % | 15825 / 16160 MB |
[1] Tesla V100-SXM2-16GB | 45'C,   0 % |  3534 / 16160 MB |
[2] Tesla V100-SXM2-16GB | 43'C,   0 % |  3534 / 16160 MB |
[3] Tesla V100-SXM2-16GB | 44'C,   0 % |  3534 / 16160 MB |
[4] Tesla V100-SXM2-16GB | 42'C,   0 % |  3534 / 16160 MB |
[5] Tesla V100-SXM2-16GB | 42'C,   0 % |  3534 / 16160 MB |
[6] Tesla V100-SXM2-16GB | 43'C,   0 % |  3534 / 16160 MB |
[7] Tesla V100-SXM2-16GB | 48'C,   0 % |  3534 / 16160 MB |
Traceback (most recent call last):
  File "train_dalle.py", line 271, in <module>
    loss = distr_dalle(text, images, mask = mask, return_loss = True)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 495, in forward
    loss_img = F.cross_entropy(logits[:, :, self.text_seq_len:], labels[:, self.text_seq_len:], ignore_index=0)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1591, in log_softmax
    ret = input.log_softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 1.80 GiB already allocated; 178.75 MiB free; 1.90 GiB reserved in total by PyTorch)
[0] Tesla V100-SXM2-16GB | 49'C,   0 % |     0 / 16160 MB |
[1] Tesla V100-SXM2-16GB | 47'C,   0 % |     0 / 16160 MB |
[2] Tesla V100-SXM2-16GB | 44'C,   0 % |     0 / 16160 MB |
[3] Tesla V100-SXM2-16GB | 45'C,   0 % |     0 / 16160 MB |
[4] Tesla V100-SXM2-16GB | 44'C,   0 % |     0 / 16160 MB |
[5] Tesla V100-SXM2-16GB | 43'C,   0 % |     0 / 16160 MB |
[6] Tesla V100-SXM2-16GB | 44'C,   0 % |     0 / 16160 MB |
[7] Tesla V100-SXM2-16GB | 50'C,   0 % |     0 / 16160 MB |

@afiaka87

I'm trying to find out why this is happening in the code but can't; everything looks normal. Each of the "distributed dalles" should be getting the same parameters...

afiaka87 commented Mar 31, 2021

Hey @janEbert, have you had any luck with this? @lucidrains, do we have another contributor who can help with this one? I'm quite uneducated with regard to multi-GPU at the moment, but this is going to essentially unlock options we didn't have before. Even with DeepSpeed's "naive" implementation, we've estimated the CUB training run at something like ~$200 for 8x V100s running for 24 hours on vast.ai (assuming higher-than-average prices, i.e. roughly $1 per GPU-hour for 192 GPU-hours).

Still a lot, but it's definitely more accessible than the "FEED ME INTO 512 GPUS" situation I was afraid it might require.

janEbert commented Apr 1, 2021

Hey @afiaka87, sadly I wasn't able to get to this yet, and I probably won't be able to until the middle of next week. My only suggestion would be to turn some knobs in the deepspeed_config dictionary. Sadly, their two documentation sites (https://www.deepspeed.ai/ and https://deepspeed.readthedocs.io/) are really sparse, with a lot of magic going on in the background that they don't explain, even in their guides.
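
To make that concrete, here's a sketch of the kind of knobs meant. The keys are standard DeepSpeed config options, but the right values for DALLE-pytorch are not established and would need experimenting:

deepspeed_config = {
    'train_micro_batch_size_per_gpu': BATCH_SIZE,  # per-GPU batch size
    'gradient_accumulation_steps': 1,
    'gradient_clipping': 0.5,
    'fp16': {
        'enabled': False,  # half precision currently doesn't work out of the box for us
    },
    'zero_optimization': {
        'stage': 1,  # partition optimizer states across GPUs; stage 2 also shards gradients
    },
}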

It's also the first time I'm using DeepSpeed, so I still have some figuring out and diving into the source code left to do before I can fix these kinds of issues.

Currently, I see that DeepSpeed is not able to handle some things inside the models, which is why half-precision training won't work out of the box (according to their documentation, it should). Again, I need more time to get into it, which I sadly only have in a week or so. As I'm still pretty clueless about DeepSpeed, maybe someone else with experience will be faster. ;)

On a side note, HuggingFace Transformers also has DeepSpeed support (see this PR). Maybe they have had similar issues.

Some other nice links:

afiaka87 commented Apr 1, 2021

I'm gonna have a look into Horovod. I'll let you know if that works any better. @janEbert

janEbert commented Apr 1, 2021

It absolutely will, I'm sure! :)
I actually thought about supporting both; we could rename deepspeed_utils to distributed_utils or something and wrap both Horovod and DeepSpeed (and whatever else).
We have a lot more experience with Horovod, too, but DeepSpeed just offers more possibilities. Please hit me up if you encounter difficulties.
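
A rough sketch of the kind of wrapper meant (names and structure are hypothetical, not a finished design):

import torch

def distribute(model, optimizer, args, deepspeed_config=None):
    # Wrap model and optimizer for the chosen backend; fall back to plain single-GPU.
    if args.distributed_backend == 'deepspeed':
        import deepspeed
        model_engine, optimizer, _, _ = deepspeed.initialize(
            args=args,
            model=model,
            optimizer=optimizer,
            config_params=deepspeed_config,
        )
        return model_engine, optimizer
    elif args.distributed_backend == 'horovod':
        import horovod.torch as hvd
        hvd.init()
        torch.cuda.set_device(hvd.local_rank())
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        optimizer = hvd.DistributedOptimizer(
            optimizer, named_parameters=model.named_parameters())
        return model, optimizer
    return model, optimizer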

afiaka87 commented Apr 1, 2021

This was a dependency issue. I'll have (very specific, unfortunately) instructions for getting the correct environment installed shortly, but as it stands the code works fine.

Good work @janEbert - you nailed it first try.

afiaka87 commented Apr 1, 2021

@janEbert I also wouldn't mind eventually switching over to Horovod. The way I got this code working was actually by following Horovod's documentation, ha. They have very good documentation and a much simpler deployment story.

For now, I'm just happy this is working. There are a few optimizations to be made, however, that I'll keep looking at.

afiaka87 closed this as completed Apr 1, 2021
@2hq53hs8tur

Wait, @afiaka87, what was the fix?

@janEbert

The discussion continues in #161. @afiaka87 managed to work around this problem by adding torch.cuda.empty_cache() calls in the training loop, but those negatively affected performance (see #158 and #174).
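
For reference, the workaround looked roughly like this (a sketch with assumed loop structure and variable names, using the DeepSpeed engine API):

import torch

for epoch in range(EPOCHS):
    for text, images, mask in dataloader:
        loss = distr_dalle(text, images, mask=mask, return_loss=True)
        distr_dalle.backward(loss)   # DeepSpeed engine handles the backward pass
        distr_dalle.step()           # and the optimizer step
        torch.cuda.empty_cache()     # frees cached blocks, but slows training (see #158, #174)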

If you also have this issue, please submit an issue to the DeepSpeed repo and/or provide some information to help us figure out what causes the error.

@janEbert

@ex4sperans found out what caused the issue: #161 (comment)
