
[BUG] Training hangs when each process is trained on different number of batches #2223

Closed
iamsimha opened this issue Aug 16, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@iamsimha

Describe the bug
I am trying to reproduce the work "Training BERT on an academic budget" using the codebase provided by the authors: https://github.com/IntelLabs/academic-budget-bert.

However, during training the model gets stuck after a few epochs. The root cause is that each process trains on a different number of batches, and training gets stuck during the ".step()" call on the DeepSpeed engine object. Specifically, this line: https://github.com/IntelLabs/academic-budget-bert/blob/04f6da685acf4dfc47b85b42307e17340e87fde3/run_pretraining.py#L219

Similar behaviour is also reported in Lightning-AI/pytorch-lightning#13498.

The issue disappears completely if the same number of batches is provided to all processes.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the steps detailed in https://github.com/IntelLabs/academic-budget-bert.
  2. Reproducing is a lengthy process; I can create a minimal example if you are unable to reproduce it.
  3. The issue happens when one process has more batches than the others. In this case, the rank 2 process gets stuck (a minimal sketch is given after this list):
    Rank = 0, Number of batches 4
    Rank = 1, Number of batches 4
    Rank = 2, Number of batches 8
    Rank = 3, Number of batches 4
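
For reference, a minimal sketch (not part of the original report) of the uneven-batch pattern described above; `ToyModel`, the DeepSpeed config, and the batch counts are illustrative, and the script is assumed to be launched with `deepspeed --num_gpus=4 repro.py`:

```python
import torch
import deepspeed

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = ToyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Ranks 0, 1 and 3 run 4 batches; rank 2 runs 8. Once the other ranks exit
# the loop, rank 2 blocks on collectives inside engine.backward()/engine.step()
# that are never matched by its peers, so training hangs.
num_batches = 8 if engine.global_rank == 2 else 4
for _ in range(num_batches):
    x = torch.randn(4, 10, device=engine.device)
    loss = engine(x).mean()
    engine.backward(loss)
    engine.step()
```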

Expected behavior

Training should not hang

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3


System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count: 8
  • Python version: 3.9

Launcher context
deepspeed run_pretraining.py

@iamsimha added the bug label on Aug 16, 2022
@tjruwase
Contributor

The common practice in data-parallel training is for all GPUs to process the same batch size. The hang you are experiencing is unsurprising, since an imbalance in the amount of work done by the GPUs breaks the tightly coordinated synchronizations. Also, the computation itself would break, since gradient reduction assumes each GPU processes the same amount of data.

Can you please explain the motivation for wanting the gpus to process different amounts of data?
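
If uneven per-rank data is unavoidable, one common workaround (sketched below; this is not a DeepSpeed API, and `min_batches_across_ranks` is a hypothetical helper name) is to truncate every rank's loop to the global minimum number of batches, so all ranks issue the same collectives:

```python
import torch
import torch.distributed as dist

def min_batches_across_ranks(local_num_batches: int, device: torch.device) -> int:
    """All-reduce with MIN so every rank agrees on the smallest batch count."""
    count = torch.tensor([local_num_batches], dtype=torch.long, device=device)
    dist.all_reduce(count, op=dist.ReduceOp.MIN)
    return int(count.item())

# Sketch of usage inside the training loop (dataloader/engine assumed to exist):
# shared_steps = min_batches_across_ranks(len(dataloader), engine.device)
# for step, batch in enumerate(dataloader):
#     if step >= shared_steps:
#         break
#     ...  # forward/backward/step as usual
```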

@iamsimha
Author

Thanks for your quick response. I (and others) ran into this while trying to reproduce a published result; I don't see a strong motivation for each GPU doing a different amount of work.

Thanks again for the clarification.
