We use DeepSpeed-Chat to train step 3 (RLHF), with a BLOOM model instead of the OPT model as the actor model, and with hybrid engine and ZeRO-3 enabled. Then we got this error:
Traceback (most recent call last):
File "main.py", line 518, in <module>
main()
File "main.py", line 427, in main
out = trainer.generate_experience(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 109, in generate_experience
seq = self._generate_sequence(prompts)
File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 76, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
generate_ret_vals = self._generate(*inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1563, in generate
return self.sample(
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2610, in sample
outputs = self(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
outputs = block(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
result = forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 340, in run_forward
with GatheredParameters(non_active_params):
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1649, in __enter__
self.params[0].all_gather(param_list=self.params)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 873, in all_gather
return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1074, in _all_gather
ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1286, in _allgather_params_coalesced
flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=self.local_device).view(-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
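As the error message itself suggests, illegal-memory-access errors are reported asynchronously, so the traceback above may not point at the real failing kernel. A sketch of how the run could be repeated with synchronous launches to get an accurate stack (the launch command is a placeholder for our usual step-3 invocation):

```shell
# Make CUDA kernel launches synchronous so the Python traceback
# points at the call that actually faulted (much slower; debug only).
export CUDA_LAUNCH_BLOCKING=1
deepspeed main.py ...  # placeholder: the usual step3_rlhf_finetuning launch arguments
```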
We have also tried hybrid engine + ZeRO-2, and hybrid engine disabled + ZeRO-3; in both of those configurations training runs normally.
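For reference, the two working configurations can be selected with the standard DeepSpeed-Chat step-3 launch flags (flag names as exposed by training/step3_rlhf_finetuning/main.py; all other arguments omitted):

```shell
# Works: hybrid engine with ZeRO stage 2 for the actor model
deepspeed main.py --actor_zero_stage 2 --enable_hybrid_engine ...

# Also works: ZeRO stage 3 with the hybrid engine left disabled
deepspeed main.py --actor_zero_stage 3 ...
```

Only the combination `--actor_zero_stage 3 --enable_hybrid_engine` hits the illegal memory access.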
@null-test-7 I have also tried to run step 3 for BLOOM with the hybrid engine, but it failed. I found that the DeepSpeed BLOOM container doesn't support the hybrid-engine feature. Could you tell me how you got it to run? Thanks in advance.