
[BUG] Training hangs when each process is trained on different number of batches #2223

Closed
iamsimha opened this issue Aug 16, 2022 · 2 comments
Labels
bug Something isn't working

Comments

@iamsimha

Describe the bug
I am trying to reproduce the work "Training BERT on an academic budget" using the codebase provided by the authors: https://github.com/IntelLabs/academic-budget-bert.

However, during training the model gets stuck after a few epochs. The root cause is that each process trains on a different number of batches, and training gets stuck during the ".step()" call on the DeepSpeed engine object. Specifically, this line: https://github.com/IntelLabs/academic-budget-bert/blob/04f6da685acf4dfc47b85b42307e17340e87fde3/run_pretraining.py#L219

Similar behaviour is also reported in Lightning-AI/pytorch-lightning#13498.

The issue disappears completely if the same number of batches is provided to all processes.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the steps detailed in https://github.com/IntelLabs/academic-budget-bert.
  2. Reproducing is a lengthy process; I can create a minimal example if you are unable to reproduce it.
  3. The issue happens when one process has more batches than the others. In this case, the rank 2 process gets stuck (a minimal sketch is given after this list):
    Rank = 0, Number of batches 4
    Rank = 1, Number of batches 4
    Rank = 2, Number of batches 8
    Rank = 3, Number of batches 4
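
For reference, a minimal sketch (not part of the original report) of the uneven-batch pattern described above; `ToyModel`, the DeepSpeed config, and the batch counts are illustrative, and the script is assumed to be launched with `deepspeed --num_gpus=4 repro.py`:

```python
import torch
import deepspeed

class ToyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model = ToyModel()
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Ranks 0, 1 and 3 run 4 batches; rank 2 runs 8. Once the other ranks exit
# the loop, rank 2 blocks on collectives inside engine.backward()/engine.step()
# that are never matched by its peers, so training hangs.
num_batches = 8 if engine.global_rank == 2 else 4
for _ in range(num_batches):
    x = torch.randn(4, 10, device=engine.device)
    loss = engine(x).mean()
    engine.backward(loss)
    engine.step()
```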

Expected behavior

Training should not hang

ds_report output


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3


System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count: 8
  • Python version: 3.9

Launcher context
deepspeed run_pretraining.py

@iamsimha added the bug label on Aug 16, 2022
@tjruwase
Contributor

The common practice in data-parallel training is for all GPUs to process the same batch size. The hang you are experiencing is unsurprising, since an imbalance in the amount of work done by the GPUs breaks the tightly coordinated synchronizations. Also, the computation itself would break, since gradient reduction assumes each GPU processes the same amount of data.

Can you please explain the motivation for wanting the gpus to process different amounts of data?
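
If uneven per-rank data is unavoidable, one common workaround (sketched below; this is not a DeepSpeed API, and `min_batches_across_ranks` is a hypothetical helper name) is to truncate every rank's loop to the global minimum number of batches, so all ranks issue the same collectives:

```python
import torch
import torch.distributed as dist

def min_batches_across_ranks(local_num_batches: int, device: torch.device) -> int:
    """All-reduce with MIN so every rank agrees on the smallest batch count."""
    count = torch.tensor([local_num_batches], dtype=torch.long, device=device)
    dist.all_reduce(count, op=dist.ReduceOp.MIN)
    return int(count.item())

# Sketch of usage inside the training loop (dataloader/engine assumed to exist):
# shared_steps = min_batches_across_ranks(len(dataloader), engine.device)
# for step, batch in enumerate(dataloader):
#     if step >= shared_steps:
#         break
#     ...  # forward/backward/step as usual
```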

@iamsimha
Author

Thanks for your quick response. I (and others) ran into this while trying to reproduce a published result; I don't see a strong motivation for each GPU doing a different amount of work.

Thanks again for the clarification.
