Error when setting num-experts > 1 while running generate_text.sh #337

Open
jrt-20 opened this issue Jan 22, 2024 · 0 comments

jrt-20 commented Jan 22, 2024

I am running examples_deepspeed/generate_text.sh.
At the moment I can run this script successfully on 1 node with 8 GPUs when num-experts is 1.
But when I set num-experts to 8, the run fails; the only difference between the two runs is the flag change sketched below.
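A hedged sketch of that change, assuming the standard Megatron-DeepSpeed --num-experts argument that examples_deepspeed/generate_text.sh forwards to tools/generate_samples_gpt.py (all other flags, paths, and the launcher invocation are unchanged and elided):

```python
# The only difference between the working and failing runs (assumption:
# standard Megatron-DeepSpeed argument names; everything else is unchanged).
working_moe_args = ["--num-experts", "1"]  # dense MLP in every layer; generation succeeds
failing_moe_args = ["--num-experts", "8"]  # MoE MLP in every layer; crashes with the log below
```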
The complete error output is as follows:

using world size: 8, data-parallel-size: 8, sequence-parallel size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
[2024-01-22 08:37:16,393] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,399] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,399] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-22 08:37:16,648] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,668] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,669] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,687] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,852] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-22 08:37:16,856] [INFO] [comm.py:637:init_distributed] cdb=None
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/home/ai/jrtPain/Megatron-DeepSpeed/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/ai/jrtPain/Megatron-DeepSpeed/megatron/data'
>>> done with dataset index builder. Compilation time: 0.061 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ai/jrtPain/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_softmax_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 2.871 seconds
building GPT model ...
[2024-01-22 08:37:20,393] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,430] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,463] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,488] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,518] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,547] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,577] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,620] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,654] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,686] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,719] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1
[2024-01-22 08:37:20,749] [INFO] [logging.py:96:log_dist] [Rank 0] Creating MoE layer with num_experts: 8 | num_local_experts: 8 | expert_parallel_size: 1

Emitting ninja build file /home/ai/.cache/torch_extensions/py310_cu121/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.11441206932067871 seconds
Traceback (most recent call last):
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 178, in <module>
    main()
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 141, in main
    model = ds_inference(model, args)
  File "/home/ai/jrtPain/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 164, in ds_inference
    engine = deepspeed.init_inference(model=model,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    self._apply_injection_policy(config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 342, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 586, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  [Previous line repeated 1 more time]
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 622, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 298, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 250, in replace_with_policy
    _container.transpose()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/megatron.py", line 28, in transpose
    super().transpose()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 286, in transpose
    self.transpose_mlp()
  File "/home/ai/miniconda3/envs/jrt-singlegpu-success/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 295, in transpose_mlp
    self._h4h_w = self.transpose_impl(self._h4h_w.data)
AttributeError: 'list' object has no attribute 'data'

The versions in my environment are:

deepspeed 0.12.6
torch 2.1.1
transformers 4.25.0
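
From the traceback, the crash happens in DeepSpeed's kernel-injection path (replace_transformer_layer -> transpose_mlp), where self._h4h_w.data is accessed; with num-experts > 1 that attribute is apparently a Python list (presumably one weight tensor per expert) rather than a single tensor, so .data raises the AttributeError. Below is an untested workaround sketch that edits the ds_inference helper in tools/generate_samples_gpt.py to turn off kernel injection so the injection policy is never applied. replace_with_kernel_inject is a standard deepspeed.init_inference option; the other arguments of the original helper are elided here, and whether the MoE model then generates correctly without injection is an assumption on my part.

```python
import torch
import deepspeed

def ds_inference_no_kernel_inject(model):
    # Untested sketch: build the DeepSpeed inference engine without kernel
    # injection, so replace_transformer_layer (where the "'list' object has
    # no attribute 'data'" error is raised for the MoE MLP weights) is skipped.
    # Any other init_inference arguments used by the original helper are elided.
    engine = deepspeed.init_inference(
        model=model,
        dtype=torch.half,
        replace_with_kernel_inject=False,
    )
    return engine.module
```

If the script also offers a way to skip the DeepSpeed inference engine entirely, running the MoE model through the plain Megatron generation path would be another way to check whether the problem is limited to the injection code.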