However, reproducing is a lengthy process. I can create a minimal example if you are unable to reproduce.
The issue happens when one process has a larger number of batches than the others. In this case, the Rank 2 process gets stuck.
Rank = 0, Number of batches 4
Rank = 1, Number of batches 4
Rank = 2, Number of batches 8
Rank = 3, Number of batches 4
System info:
OS: Ubuntu 18.04
GPU count: 8
Python version: 3.9
The common practice in data-parallel training is for all GPUs to process the same batch size. The hang you are experiencing is unsurprising, since an imbalance in the amount of work done by the GPUs will break the tightly coordinated synchronizations. The computation would also be incorrect, since gradient reduction assumes each GPU processes the same amount of data.
Can you please explain the motivation for wanting the GPUs to process different amounts of data?
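The deadlock mechanism described above can be illustrated without GPUs: a blocking collective such as all-reduce only completes once every rank has entered it, so a rank with extra batches ends up issuing collectives that its peers never join. A minimal sketch that simulates the per-rank call counts (batch counts taken from the reported run; no torch.distributed involved):

```python
# Simulate the synchronization pattern of data-parallel training:
# each optimizer step issues one blocking all-reduce, and a round
# completes only when every rank participates.
batches_per_rank = {0: 4, 1: 4, 2: 8, 3: 4}  # counts from the reported run

# Rounds complete only while all ranks still have batches left,
# so the number of completed all-reduces is the minimum batch count.
completed_rounds = min(batches_per_rank.values())

# Ranks with more batches block forever inside .step(), waiting for
# peers that have already exhausted their data.
stuck_ranks = [r for r, n in batches_per_rank.items() if n > completed_rounds]

print("completed all-reduce rounds:", completed_rounds)  # 4
print("ranks hung in .step():", stuck_ranks)             # [2]
```

This matches the observation that rank 2, the rank with 8 batches, is the one that hangs.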
Thanks for your quick response. I (and other people) encountered this issue when trying to reproduce a published result. I don't see a strong motivation for each GPU doing a different amount of work.
Describe the bug
I am trying to reproduce the work "Training BERT on an academic budget" using the codebase provided by the authors: https://github.com/IntelLabs/academic-budget-bert.
However, during training the model gets stuck after a few epochs. The root cause is that each process trains on a different number of batches, and training hangs during the ".step()" call on the DeepSpeed engine object. Specifically, this line: https://github.com/IntelLabs/academic-budget-bert/blob/04f6da685acf4dfc47b85b42307e17340e87fde3/run_pretraining.py#L219
Similar behaviour is also observed in Lightning-AI/pytorch-lightning#13498.
The issue completely disappears if the same number of batches is provided to all processes.
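One common remedy is to make every rank run the same number of optimizer steps, e.g. by truncating each rank's dataloader to the global minimum batch count (in a real job each rank would all-reduce its local count with a MIN op; `torch.utils.data.DistributedSampler` sidesteps the problem by sharding evenly in the first place). A hypothetical sketch of the truncation idea, simulated here without torch.distributed:

```python
import itertools

def truncate_to_global_min(per_rank_loaders):
    """Hypothetical helper: a real job would all-reduce len(loader) across
    ranks with op=MIN; here the minimum is taken directly for illustration."""
    n_steps = min(len(loader) for loader in per_rank_loaders)
    # Every rank now iterates exactly n_steps batches, so each blocking
    # all-reduce inside engine.step() has all ranks participating.
    return [list(itertools.islice(iter(loader), n_steps))
            for loader in per_rank_loaders]

# Batch counts from the reported run: rank 2 has 8 batches, the rest have 4.
loaders = [list(range(4)), list(range(4)), list(range(8)), list(range(4))]
balanced = truncate_to_global_min(loaders)
print([len(b) for b in balanced])  # [4, 4, 4, 4]
```

The trade-off is that the surplus batches on the over-provisioned rank are simply dropped for that epoch rather than trained on.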
To Reproduce
Steps to reproduce the behavior:
Rank = 0, Number of batches 4
Rank = 1, Number of batches 4
Rank = 2, Number of batches 8
Rank = 3, Number of batches 4
Expected behavior
Training should not hang
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (1.1.1), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/torch']
torch version .................... 1.12.1
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/home/xx/anaconda3/envs/bert_apex/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
Launcher context
deepspeed run_pretraining.py