DeepSpeed GPU 0 parameter server OOM with 8xV100 #154

Closed
CalogeroZarbo opened this issue Mar 19, 2020 · 9 comments
@CalogeroZarbo
Contributor

Good morning,
I'm opening this issue because I have a small doubt.

I trained my 24-layer Reformer, with 768 nodes per layer, using DeepSpeed on 2xP100 successfully.
I tried the same setup on 8xV100, and GPU 0 crashes due to OOM.
My understanding is that GPU 0 has to handle the results computed by all the other GPUs (acting as a parameter server, I guess), which leads GPU 0 to hold in memory, in addition to the model and the batches to compute, the parameter data to aggregate.

Is that correct? Is it possible to mitigate this effect somehow?

The only way I found to make the 8xV100 setup work was to reduce the nodes from 768 to 512, which is counterintuitive: with 8xV100 I should be able to train the same architecture as on 2xP100, just four times faster.

Can you please shed some light on this?

Thank you very much in advance for your time!
Cal

@tjruwase
Contributor

tjruwase commented Mar 19, 2020

A couple of issues here:

  1. Your understanding is not exactly correct. Using your parameter-server analogy, with ZeRO, GPU 0 acts as a parameter server for only 1/N of the parameters in an N-degree data parallelism configuration.

  2. ZeRO (stage 1) deduplicates the optimizer states across the data-parallel nodes, and therefore should always reduce memory usage when enabled compared to when disabled. Can you verify this property, perhaps using nvidia-smi?

  3. I don't have a P100 handy, so can you report how much RAM it has? I know that the V100 has 16GB RAM. It could be that 8X16GB is still not enough aggregated RAM to support your model, and that you need to use more GPUs.

Actually, after my above comments, I just noticed that your comparison is 2xP100 vs 8xV100, and it is unlikely that 2xP100 has more aggregated RAM than 8xV100. This is indeed strange. But please address the questions above to help us investigate this issue. Thanks!
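
For reference, point 2 above refers to ZeRO being enabled through the `zero_optimization` field of the DeepSpeed JSON config. A minimal sketch of a config with ZeRO stage 1 turned on, assuming a DeepSpeed release that accepts the nested form (early releases used a plain boolean, as in the config posted below):

```json
{
  "train_batch_size": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 1
  }
}
```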

@CalogeroZarbo
Contributor Author

Sure thing @tjruwase, I'm glad to help.

I feel I should explain the situation better:

  • I'm not training a standard Transformer; I'm using a Reformer (https://github.com/lucidrains/reformer-pytorch), which has been shown to work with DeepSpeed and to reduce the complexity of the architecture.
  • This is the configuration file I used:
```json
{
  "train_batch_size": 4,
  "steps_per_print": 2000,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": false,
  "wall_clock_breakdown": false
}
```

Since the ZeRO optimization option is set to false for both hardware configurations, 2xP100 (16GB RAM) and 8xV100 (16GB RAM), do points 1 and 2 that you explained still hold?

To be more specific about what happens during training, with nvidia-smi I see the following:
[Screenshot: nvidia-smi output on 2xP100]
Here I can see 2 processes on GPU:0 and 1 process on GPU:1.

When I do the same thing on 8xV100, I see 8 processes on GPU:0 and 1 process on each of GPU:1, GPU:2, ..., GPU:7. In the latter case, 7 of the 8 processes on GPU:0 use about 1.3GB of RAM each, while the processes distributed across the GPUs each use about 10-13GB of RAM.

This led me to think that GPU:0 is handling some workload coming from the other GPUs as well, and I assumed it was due to the aggregation of the parameters; however, as you say, this is not the case.
Could you please explain what those processes are and how I can distribute them across the GPUs so that GPU:0 does not run out of memory? Maybe just by activating ZeRO?
I will try that, and I will also reply to my other issue about RangerLars #153, since it refers to the same training.

As always, thank you for your time and help. It's much appreciated!

Cheers,
Cal

@CalogeroZarbo
Contributor Author

Hello @tjruwase
To complete the explanation in my previous message, I'm attaching a screenshot of the same situation described above, but on 8xV100:

[Screenshot: nvidia-smi output on 8xV100]

Thank you and have a nice weekend!

Cheers,
Cal

@NoOneUST

> To complete the explanation in my previous message, I'm attaching a screenshot of the same situation described above, but on 8xV100: […]

I also have this problem, on 8x2080Ti:
[Screenshot showing the same issue on 8x2080Ti]

@tjruwase
Contributor

tjruwase commented Apr 1, 2020

@CalogeroZarbo I have some cycles to look into this. Can you please share your DeepSpeed port of the Reformer with me? Thanks!

@CalogeroZarbo
Contributor Author

@tjruwase Sure! It's all published on GitHub: https://github.com/CalogeroZarbo/bioshield
The file I run is train_seq2seq.py.

Let me know if I can help.

Cheers,
Cal

@tjruwase
Contributor

tjruwase commented Apr 1, 2020

@CalogeroZarbo Thanks so much. That is great!

@tjruwase
Contributor

tjruwase commented Apr 8, 2020

@CalogeroZarbo I found the source of the problem of GPU0 using more memory than the other GPUs. It is because this line sets device to the current device, i.e. cuda:0, since the device ordinal is not specified. So all the subsequent .to(device) calls, such as this one, were allocating GPU0 memory regardless of the rank.

In reality, with DeepSpeed you don't need to manipulate device in this way, as deepspeed.initialize() will move the model and parameters to the correct device. So the fix is to remove the definition and all uses of device. Now, on 4xV100-16GB I get the following:
[Screenshot: nvidia-smi output on 4xV100-16GB after the fix]

Can you please verify the fix, so we can close this issue? Thanks!
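
For readers hitting the same symptom, here is a minimal sketch of the pattern described above (hypothetical code, not the actual train_seq2seq.py): the buggy version creates a bare torch.device("cuda"), which resolves to cuda:0 on every rank, while the fixed version lets deepspeed.initialize() place the model and moves inputs to the engine's device.

```python
# Run under the DeepSpeed launcher, e.g.:  deepspeed this_script.py
import torch
import deepspeed

# Buggy pattern: "cuda" with no ordinal resolves to cuda:0 on every rank,
# so all ranks allocate model and batch memory on GPU 0.
#   device = torch.device("cuda")
#   model = model.to(device)
#   batch = batch.to(device)

# Fixed pattern: no manual device handling at all.
model = torch.nn.Linear(768, 768)          # stand-in for the Reformer model

ds_config = {                              # inline config dict; older DeepSpeed
    "train_batch_size": 4,                 # versions read a JSON file path instead
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize() moves the model to the correct per-rank device.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Inputs follow the engine's device instead of a hard-coded "cuda".
batch = torch.randn(4, 768).to(engine.device)
loss = engine(batch).sum()
engine.backward(loss)
engine.step()
```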

@CalogeroZarbo
Contributor Author

Hi @tjruwase! Thank you for your help, this is a great fix! Sorry for the bother; it was hard for me to see that the problem was on my side.

Cheers,
Cal

samyam pushed a commit that referenced this issue Mar 8, 2021
samyam added a commit that referenced this issue Mar 8, 2021
* Squash stage3 v1 (#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>