
Does deepspeed hybrid-engine support the bloom model with zero3? #497

Open
null-test-7 opened this issue May 8, 2023 · 3 comments

Comments


null-test-7 commented May 8, 2023

We used deepspeed-chat to train step3 RLHF with a bloom model (instead of the opt model) as the actor model, with hybrid-engine and zero3 enabled. Then we got this error:

Traceback (most recent call last):
  File "main.py", line 518, in <module>
    main()
  File "main.py", line 427, in main
    out = trainer.generate_experience(prompts)
  File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 109, in generate_experience
    seq = self._generate_sequence(prompts)
  File "/data/nlp/public_data/deepspeed-chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 76, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
    generate_ret_vals = self._generate(*inputs, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1563, in generate
    return self.sample(
  File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 2610, in sample
    outputs = self(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/bloom/modeling_bloom.py", line 786, in forward
    outputs = block(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1570, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/hybrid_engine.py", line 340, in run_forward
    with GatheredParameters(non_active_params):
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1649, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 873, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1074, in _all_gather
    ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1286, in _allgather_params_coalesced
    flat_tensor = torch.empty(tensor_size, dtype=param_list[0].dtype, device=self.local_device).view(-1)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

We have also tried hybrid-engine + zero2, and zero3 with hybrid-engine disabled; in both cases the training runs normally.
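
For anyone trying to reproduce or work around this, here is a minimal sketch of the three configurations, assuming the standard DeepSpeed-Chat step3 launcher flags; the bloom checkpoint and critic path are illustrative placeholders, not the exact ones we used:

# Failing combination reported above: ZeRO-3 actor + hybrid engine
deepspeed main.py \
    --actor_model_name_or_path bigscience/bloom-1b7 \
    --critic_model_name_or_path <your-critic-checkpoint> \
    --actor_zero_stage 3 \
    --enable_hybrid_engine \
    ...   # remaining step3 arguments unchanged

# Working: ZeRO-2 actor + hybrid engine
#     --actor_zero_stage 2 --enable_hybrid_engine

# Working: ZeRO-3 actor without the hybrid engine
#     --actor_zero_stage 3   (omit --enable_hybrid_engine)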

@jomayeri added the deespeed chat and new-config labels on May 8, 2023
@wang990099

@null-test-7 I have tried to run step3 for bloom with the hybrid-engine, but it failed. I found that the deepspeed bloom container doesn't support the hybrid feature. Could you tell me how you got it to run? Thanks in advance.

@zhangzhenyu13

I also encountered these errors. Can anyone help?

@qinzhiliang

Me too.
