BERT training model fails when adding --deepspeed_transformer_kernel #1155
Hi @garvct, can you please share the config that you are running this with so that I can repro on my side?
python3 train.py --grad_accum_dtype float16 2>&1 | tee stdouterr_$$

{ "wall_clock_breakdown": false, "fp16": {
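The config is cut off above after the "fp16" key. For reference only, a typical DeepSpeed "fp16" section looks roughly like the sketch below, written here as a Python dict; the values shown are commonly used defaults, not the reporter's actual settings.

# Illustrative sketch of a DeepSpeed config with an fp16 section; the real
# values are not visible in this thread, so treat these as placeholders.
ds_config = {
    "wall_clock_breakdown": False,
    "fp16": {
        "enabled": True,           # run training in mixed precision
        "loss_scale": 0,           # 0 selects dynamic loss scaling
        "initial_scale_power": 16,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
}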
Thanks for sharing the script and config. Can I ask what hidden_size, number of heads, and number of layers you are using here? It seems you are using a hidden size of 2048, but I don't understand where the m=6144 is coming from. Is it a multiple of the sequence length (512) or of the number of heads?
Also, this is a 1-GPU run, right?
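For reference, m = 6144 is consistent with 3 * hidden_size for the fused QKV projection GEMM, with n = batch_size * seq_len and k = hidden_size. A small sketch of that reading, assuming the dimensions reported below (batch_size 4, seq_len 512, hidden_size 2048, heads 32, intermediate_size 8192); the variable names are purely illustrative.

# Hypothetical decoding of the (m, n, k) triples in the kernel errors reported later in this thread.
batch_size, seq_len, hidden, heads, intermediate = 4, 512, 2048, 32, 8192
head_dim = hidden // heads                   # 64
tokens = batch_size * seq_len                # 2048

qkv_gemm   = (3 * hidden, tokens, hidden)    # (6144, 2048, 2048)  fused QKV projection
ffn2_gemm  = (hidden, tokens, intermediate)  # (2048, 2048, 8192)  second FFN matmul
score_gemm = (seq_len, seq_len, head_dim)    # (512, 512, 64)      attention-score GEMM
ctx_gemm   = (head_dim, seq_len, seq_len)    # (64, 512, 512)      attention-context GEMM
print(qkv_gemm, ffn2_gemm, score_gemm, ctx_gemm)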
DeepSpeed Transformer config is {'layer_id': 23, 'batch_size': 4, 'hidden_size': 2048, 'intermediate_size': 8192, 'heads': 32, 'attn_dropout_ratio': 0.1, 'hidden_dropout_ratio': 0.1, 'num_hidden_layers': 24, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': 42, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'layer_norm_eps': 1e-12, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False, 'huggingface': False}
This was an attempt to run on 16 A100 GPUs.
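For context, the dict above is the per-layer config that the DeepSpeed transformer kernel prints at construction time. A minimal sketch of how such a layer is usually built, with keyword names taken from that dict (the exact constructor signature may differ between DeepSpeed versions):

# Sketch only: keyword arguments mirror the printed config above; verify the
# signature against the installed DeepSpeed version before relying on it.
import torch
from deepspeed import DeepSpeedTransformerConfig, DeepSpeedTransformerLayer

ds_tf_config = DeepSpeedTransformerConfig(
    batch_size=4,
    hidden_size=2048,
    intermediate_size=8192,
    heads=32,
    attn_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    num_hidden_layers=24,
    initializer_range=0.02,
    fp16=True,
    pre_layer_norm=True,
    local_rank=-1,
    seed=42,
)
encoder_layers = torch.nn.ModuleList(
    [DeepSpeedTransformerLayer(ds_tf_config) for _ in range(ds_tf_config.num_hidden_layers)]
)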
Hi @garvct I am still not able to repro this issue that you are seeing.
Thanks,
pytest tests/unit/test_cuda_backward.py::test_backward[4-2048-512-32-24-True-True-0.05]
========================================= no tests ran in 3.32s =========================================

pytest tests/unit/test_cuda_backward.py
tests/unit/test_cuda_backward.py ..FF.                                   [100%]
=================================== FAILURES ===================================
batch_size = 8, hidden_size = 1600, seq_len = 128, heads = 2, num_layers = 3
tests/unit/test_cuda_backward.py:297:
tests/unit/test_cuda_backward.py:253: in run_backward
first = [[tensor([[[-1.8193, 0.4900, -0.9331, ..., 0.5176, -0.6211, -0.6309],
E AssertionError: tests/unit/test_cuda_backward.py:73: AssertionError

batch_size = 3, hidden_size = 1024, seq_len = 119, heads = 16, num_layers = 24
tests/unit/test_cuda_backward.py:297:
tests/unit/test_cuda_backward.py:253: in run_backward
first = [[tensor([[[-0.4278, 0.1847, 0.0466, ..., 0.1023, -0.1683, 0.0696],
E AssertionError: tests/unit/test_cuda_backward.py:73: AssertionError
-- Docs: https://docs.pytest.org/en/stable/warnings.html
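The assertion at tests/unit/test_cuda_backward.py:73 is an element-wise tolerance check between the gradients produced by the DeepSpeed kernel and those of a baseline PyTorch layer; the trailing 0.05 in the test id above appears to be the absolute tolerance. A rough, hypothetical rendering of that kind of check (the real helper in the repo may be structured differently):

# Illustrative tolerance check, not the repo's actual helper.
import torch

def check_grads_close(kernel_grads, baseline_grads, atol=0.05):
    for ds_g, base_g in zip(kernel_grads, baseline_grads):
        diff = (ds_g.float() - base_g.float()).abs().max().item()
        assert torch.allclose(ds_g.float(), base_g.float(), atol=atol), \
            f"max abs grad diff {diff:.4f} exceeds atol={atol}"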
Hi @garvct, sorry, the reason the test could not run is that it was not among the unit tests. I have added it in this branch; let's use this branch to work through some of these issues.
I am wondering if this issue is related to the Torch version. I am using the following environment:
Thanks,
I see you are using torch 1.9 + CUDA 11.3. Is this a nightly version of Torch?
pytest tests/unit/test_cuda_backward.py::test_backward[4-2048-512-32-24-True-True-0.05]
tests/unit/test_cuda_backward.py F                                       [100%]
=================================== FAILURES ===================================
batch_size = 4, hidden_size = 2048, seq_len = 512, heads = 32, num_layers = 24
tests/unit/test_cuda_backward.py:298:
tests/unit/test_cuda_backward.py:253: in run_backward
first = [[tensor([[[ 0.2712, 0.0586, -0.1754, ..., -0.0591, -0.2847, -0.0381],
E AssertionError: tests/unit/test_cuda_backward.py:73: AssertionError
-- Docs: https://docs.pytest.org/en/stable/warnings.html

I am using a modified nvidia pytorch 21.05-py3 container.
Environment
8x A100 GPUs
Using container nvcr.io#nvidia/pytorch:21.05-py3
apt update
pip3 install nvidia-pyindex
pip3 install nvidia-tensorflow
pip3 install numpy --upgrade
export TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0 7.5 8.0 8.6+PTX"
DS_BUILD_OPS=1 pip3 install deepspeed
pip3 install mpi4py
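After an install like the one above, a quick sanity check that the runtime torch/CUDA build matches what the DeepSpeed ops were compiled against, and that the A100s show up as sm_80 devices (TORCH_CUDA_ARCH_LIST above includes 8.0), can be done with a few lines of PyTorch; this is a generic check, not something from the original report.

# Quick environment sanity check; compare the output with ds_report below.
import torch
import deepspeed

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("deepspeed:", deepspeed.__version__)
for i in range(torch.cuda.device_count()):
    # A100 should report compute capability (8, 0)
    print(torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))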
root@x8a100-0000:/workspace# ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+2ecb2c7
torch cuda version ............... 11.3
nvcc version ..................... 11.3
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.4.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3
root@x8a100-0000:/workspace#
Without --deepspeed_transformer_kernel the training job runs fine on multiple A100 GPUs, but when I add --deepspeed_transformer_kernel I get:
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 2048, k: 8192, error: 13)
!!!! kernel execution error. (m: 6144, n: 2048, k: 2048, error: 13)
!!!! kernel execution error. (m: 512, n: 512, k: 64, error: 13)
!!!! kernel execution error. (m: 64, n: 512, k: 512, error: 13)
Traceback (most recent call last):
File "train.py", line 519, in
main()
File "train.py", line 511, in main
run(args, model, optimizer)
File "train.py", line 482, in run
train(args, model, optimizer)
File "train.py", line 180, in train
validation(args, global_data_samples, model)
File "train.py", line 102, in validation
_, (tmp_mlm_loss, tmp_nsp_loss) = model.network(batch, log=False)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1086, in forward
loss = self.module(*inputs, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 1156, in forward
sequence_output, pooled_output = self.bert(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 981, in forward
encoded_layers = self.encoder(
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/nfs2/pndall/bert/src/bert/pytorch/nvidia/modelingpreln.py", line 602, in forward
hidden_states = layer_module(hidden_states, attention_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 592, in forward
return DeepSpeedTransformerFunction.apply(hidden_states,
File "/opt/conda/lib/python3.8/site-packages/deepspeed/ops/transformer/transformer.py", line 208, in forward
layer_norm_mean) = forward_func(config.layer_id,
RuntimeError: /home/scratch.efomenko_sw/ml/wip/cask.wip/xmma/cask_plugin/src/gemm/runner.cu:107: cudaFuncSetAttribute(kernel_entry, cudaFuncAttributeMaxDynamicSharedMemorySize, integer_cast<int32_t>(launch_configs[0].smemSizeInBytes)): an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Do you have any suggestions on how I can fix this?
Thank you.
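One generic way to narrow down asynchronous failures like this illegal memory access (a debugging aid under the assumption that the faulting kernel is not the one shown in the traceback) is to force synchronous kernel launches, so the Python traceback points at the kernel that actually faulted rather than a later cuBLAS call:

# Set before any CUDA work happens (safest: before importing torch) so every
# kernel launch is synchronized and errors surface at their real call site.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var on purpose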