DeepSpeed GPU 0 parameter server OOM with 8xV100 #154

Closed
CalogeroZarbo opened this issue Mar 19, 2020 · 9 comments
@CalogeroZarbo
Contributor

Good morning,
I'm opening this issue because I have a small doubt.

I trained my 24-layer Reformer, with 768 nodes per layer, using DeepSpeed on 2xP100 successfully.
I tried the same setup on 8xV100, and GPU 0 crashes due to OOM.
My understanding is that GPU 0 has to handle the results computed by all the other GPUs (acting as a parameter server, I guess), which leads GPU 0 to hold in memory, in addition to the model and the batches to compute, the parameter data to aggregate.

Is that correct? Is it possible to mitigate this effect somehow?

The only way I found to make the 8xV100 setup work was to reduce the nodes from 768 to 512, which is counterintuitive: with 8xV100 I should be able to train the same architecture as on 2xP100, just four times faster.

Can you please shed some light on this?

Thank you very much in advance for your time!
Cal

@tjruwase
Contributor

tjruwase commented Mar 19, 2020

A couple of issues here:

  1. Your understanding is not exactly correct. Using your parameter-server analogy, with ZeRO, GPU 0 acts as a parameter server for only 1/N of the parameters in an N-degree data parallelism configuration.

  2. ZeRO (stage 1) deduplicates the optimizer states across the data-parallel nodes, and therefore should always reduce memory usage when enabled compared to when disabled. Can you verify this property, perhaps using nvidia-smi?

  3. I don't have a P100 handy, so can you report how much RAM it has? I know that the V100 has 16GB RAM. It could be that 8X16GB is still not enough aggregated RAM to support your model, and that you need to use more GPUs.

Actually, after my above comments, I just noticed that your comparison is 2xP100 vs 8xV100, and it is unlikely that 2xP100 has more aggregated RAM than 8xV100. This is indeed strange. But please address the questions above to help us investigate this issue. Thanks!
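
For reference, point 2 above refers to ZeRO being enabled through the `zero_optimization` field of the DeepSpeed JSON config. A minimal sketch of a config with ZeRO stage 1 turned on, assuming a DeepSpeed release that accepts the nested form (early releases used a plain boolean, as in the config posted below):

```json
{
  "train_batch_size": 4,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 1
  }
}
```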

@CalogeroZarbo
Contributor Author

Sure thing @tjruwase, I'm glad to help.

I feel I should explain the situation better:

  • I'm not training a standard Transformer; I'm using a Reformer (https://github.com/lucidrains/reformer-pytorch), which has been shown to work with DeepSpeed and to reduce the complexity of the architecture.
  • This is the configuration file I used:
```json
{
  "train_batch_size": 4,
  "steps_per_print": 2000,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": false,
  "wall_clock_breakdown": false
}
```

Since the ZeRO optimization option is set to false for both hardware configurations, 2xP100 (16GB RAM) and 8xV100 (16GB RAM), do points 1 and 2 that you explained still hold?

To be more specific about what happens during training, with nvidia-smi I see the following:
[Screenshot: nvidia-smi output on 2xP100]
Here I can see 2 processes on GPU:0 and 1 process on GPU:1.

When I do the same thing on 8xV100, I see 8 processes on GPU:0 and 1 process on each of GPU:1, GPU:2, ..., GPU:7. In the latter case, 7 of the 8 processes on GPU:0 use about 1.3GB of RAM each, while the processes distributed across the GPUs each use about 10-13GB of RAM.

This led me to think that GPU:0 is handling some workload coming from the other GPUs as well, and I assumed it was due to the aggregation of the parameters; however, as you say, this is not the case.
Could you please explain what those processes are and how I can distribute them across the GPUs so that GPU:0 does not run out of memory? Maybe just by activating ZeRO?
I will try that, and I will also reply to my other issue about RangerLars #153, since it refers to the same training.

As always, thank you for your time and help. It's much appreciated!

Cheers,
Cal

@CalogeroZarbo
Contributor Author

Hello @tjruwase
To complete the explanation in my previous message, I'm attaching a screenshot of the same situation described above, but on 8xV100:

[Screenshot: nvidia-smi output on 8xV100]

Thank you and have a nice weekend!

Cheers,
Cal

@NoOneUST

> To complete the explanation in my previous message, I'm attaching a screenshot of the same situation described above, but on 8xV100: […]

I also have this problem, on 8x2080Ti:
[Screenshot showing the same issue on 8x2080Ti]

@tjruwase
Contributor

tjruwase commented Apr 1, 2020

@CalogeroZarbo I have some cycles to look into this. Can you please share your DeepSpeed port of the Reformer with me? Thanks!

@CalogeroZarbo
Contributor Author

@tjruwase Sure! It's all published on GitHub: https://github.com/CalogeroZarbo/bioshield
The file I run is train_seq2seq.py.

Let me know if I can help.

Cheers,
Cal

@tjruwase
Contributor

tjruwase commented Apr 1, 2020

@CalogeroZarbo Thanks so much. That is great!

@tjruwase
Contributor

tjruwase commented Apr 8, 2020

@CalogeroZarbo I found the source of the problem of GPU0 using more memory than the other GPUs. It is because this line sets device to the current device, i.e. cuda:0, since the device ordinal is not specified. So all the subsequent .to(device) calls, such as this one, were allocating GPU0 memory regardless of the rank.

In reality, with DeepSpeed you don't need to manipulate device in this way, as deepspeed.initialize() will move the model and parameters to the correct device. So the fix is to remove the definition and all uses of device. Now, on 4xV100-16GB I get the following:
[Screenshot: nvidia-smi output on 4xV100-16GB after the fix]

Can you please verify the fix, so we can close this issue? Thanks!
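
For readers hitting the same symptom, here is a minimal sketch of the pattern described above (hypothetical code, not the actual train_seq2seq.py): the buggy version creates a bare torch.device("cuda"), which resolves to cuda:0 on every rank, while the fixed version lets deepspeed.initialize() place the model and moves inputs to the engine's device.

```python
# Run under the DeepSpeed launcher, e.g.:  deepspeed this_script.py
import torch
import deepspeed

# Buggy pattern: "cuda" with no ordinal resolves to cuda:0 on every rank,
# so all ranks allocate model and batch memory on GPU 0.
#   device = torch.device("cuda")
#   model = model.to(device)
#   batch = batch.to(device)

# Fixed pattern: no manual device handling at all.
model = torch.nn.Linear(768, 768)          # stand-in for the Reformer model

ds_config = {                              # inline config dict; older DeepSpeed
    "train_batch_size": 4,                 # versions read a JSON file path instead
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize() moves the model to the correct per-rank device.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Inputs follow the engine's device instead of a hard-coded "cuda".
batch = torch.randn(4, 768).to(engine.device)
loss = engine(batch).sum()
engine.backward(loss)
engine.step()
```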

@CalogeroZarbo
Contributor Author

Hi @tjruwase! Thank you for your help, this is a great fix! Sorry for the bother; it was hard for me to see that the problem was on my side.

Cheers,
Cal

samyam pushed a commit that referenced this issue Mar 8, 2021
samyam added a commit that referenced this issue Mar 8, 2021
* Squash stage3 v1 (#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>