We use DeepSpeed-Chat to train step 3 (RLHF), with a BLOOM model instead of the OPT model as the actor model, and with hybrid engine and ZeRO-3 enabled. Then we got this error:
Traceback (most recent call last):
File "main.py", line 518, in <module>
main()
File "main.py", line 427, in main
out = trainer.generate_experience(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 109, in generate_experience
seq = self._generate_sequence(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 76, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
generate_ret_vals = self._generate(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1563, in generate
return self.sample(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2610, in sample
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 340, in run_forward
with GatheredParameters(non_active_params):
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1649, in __enter__
self.params[0].all_gather(param_list=self.params)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 873, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1074, in _all_gather
ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1286, in _allgather_params_coalesced
flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=self.local_device).view(-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
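As the error message itself suggests, illegal-memory-access errors are reported asynchronously, so the traceback above may not point at the real failing kernel. A sketch of how the run could be repeated with synchronous launches to get an accurate stack (the launch command is a placeholder for our usual step-3 invocation):

```shell
# Make CUDA kernel launches synchronous so the Python traceback
# points at the call that actually faulted (much slower; debug only).
export CUDA_LAUNCH_BLOCKING=1
deepspeed main.py ...  # placeholder: the usual step3_rlhf_finetuning launch arguments
```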
We have also tried hybrid engine + ZeRO-2, and hybrid engine disabled + ZeRO-3; in both of those configurations training runs normally.
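For reference, the two working configurations can be selected with the standard DeepSpeed-Chat step-3 launch flags (flag names as exposed by training/step3_rlhf_finetuning/main.py; all other arguments omitted):

```shell
# Works: hybrid engine with ZeRO stage 2 for the actor model
deepspeed main.py --actor_zero_stage 2 --enable_hybrid_engine ...

# Also works: ZeRO stage 3 with the hybrid engine left disabled
deepspeed main.py --actor_zero_stage 3 ...
```

Only the combination `--actor_zero_stage 3 --enable_hybrid_engine` hits the illegal memory access.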
@null-test-7 I have also tried to run step 3 for BLOOM with the hybrid engine, but it failed. I found that the DeepSpeed BLOOM container doesn't support the hybrid-engine feature. Could you tell me how you got it to run? Thanks in advance.