Describe the bug
After converting Qwen3-Next-80B-A3B-Instruct from HF to MCore format, I tried to run SFT training with `megatron sft`, but it fails with "ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends." How can this be resolved?
The command used is:
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=6 \
CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 \
NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 \
megatron sft \
    --load /model/Qwen3-Next-80B-A3B-Instruct-mcore \
    --dataset '/high_quality_data_0101_1022_265w_converted.json' \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --expert_model_parallel_size 2 \
    --moe_permute_fusion true \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 1e-3 \
    --micro_batch_size 2 \
    --global_batch_size 12 \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --max_epochs 3 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-4 \
    --lr_warmup_fraction 0.05 \
    --min_lr 1e-5 \
    --save /megatron_output/Qwen3-Next-80B-A3B-Instruct \
    --save_interval 82976 \
    --max_length 2048 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --attention_backend flash \
    --model_author swift \
    --model_name swift-robot
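One sanity check before the error output below: the TE DEBUG output reports 'flash_attn_version': 'not installed' even though pip lists flash_attn 2.8.1, so it is worth confirming that the interpreter `megatron sft` actually launches can see flash-attn and Transformer Engine. A minimal check (my assumption: the ms_swift conda env shown in the paths is the one used by the launcher):

import importlib.metadata as md

# Versions pip has registered in this environment.
for dist in ("flash_attn", "torch"):
    try:
        print(dist, md.version(dist))
    except md.PackageNotFoundError:
        print(dist, "not found in this environment")

try:
    import transformer_engine
    print("transformer_engine", transformer_engine.__version__)
except Exception as exc:
    print("transformer_engine import failed:", exc)

# Transformer Engine decides FlashAttention availability by probing the flash_attn package;
# if this import fails here, TE will report flash-attn as "not installed".
try:
    import flash_attn
    print("flash_attn import OK, version", flash_attn.__version__)
except Exception as exc:
    print("flash_attn import failed:", exc)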
Error output:
DEBUG:DotProductAttention:Disabling UnfusedDotProductAttention for qkv_format = thd
DEBUG:DotProductAttention:Disabling FusedAttention as no backend supports the provided input
WARNING:DotProductAttention:flash-attn v3 may provide important feature support or performance improvement. Please install flash-attn v3 by
(1) git clone https://github.com/Dao-AILab/flash-attention.git
(2) cd flash-attention/ && git checkout 3ba6f82 && git submodule update --init && cd hopper/ && python setup.py install
(3) python_path=`python -c "import site; print(site.getsitepackages()[0])"`
(4) mkdir -p $python_path/flash_attn_3
(5) cp flash_attn_interface.py $python_path/flash_attn_3/flash_attn_interface.py
DEBUG:DotProductAttention:Available backends = {FlashAttention=False, FusedAttention=False, UnfusedDotProductAttention=False}
DEBUG:DotProductAttention:Selected backend = NoBackend
WARNING:megatron.core.utils:No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
['Traceback (most recent call last):\n', ' File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/trainer.py", line 149, in forward_step\n output_tensor = model(**data)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward\n return self.module(*inputs, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/module.py", line 237, in forward\n outputs = self.module(*inputs, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/model/gpt_model.py", line 228, in forward\n hidden_states = self.decoder(\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 563, in forward\n hidden_states = self._checkpointed_forward(\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 433, in _checkpointed_forward\n hidden_states, context = checkpoint_handler(\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 417, in checkpoint_handler\n return tensor_parallel.checkpoint(\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 477, in checkpoint\n return CheckpointFunction.apply(function, distribute_saved_activations, *args)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply\n return super().apply(*args, **kwargs) # type: ignore[misc]\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 423, in forward\n outputs = run_function(*args)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 388, in custom_forward\n hidden_states, context = layer(\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 875, in call\n return 
super(MegatronModule, self).call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(args, **kwargs)\n', ' File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/init.py", line 435, in forward\n hidden_states, context = self._forward_attention(_args, **kwargs)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 501, in _forward_attention\n attention_output_with_bias = self.self_attention(\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/model/gpt/qwen3_next.py", line 282, in forward\n core_attn_out = self.core_attention(\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl\n return forward_call(*args, **kwargs)\n', ' File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 931, in forward\n core_attn_out = super().forward(\n', ' File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py", line 1356, in forward\n raise ValueError(\n', 'ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.\n']
[INFO:swift] images_dir: /ssd4/nietianyu/workspace/ms-swift/megatron_output/Qwen3-Next-80B-A3B-Instruct/v17-20251030-203029/images
DEBUG:DotProductAttention:Running with config={'transformer_engine_version': '2.8.0', 'compute_capability': 'sm90', 'flash_attn_version': 'not installed', 'flash_attn_3_version': 'not installed', 'cudnn_version': '9.1.0', 'qkv_type': <class 'torch.Tensor'>, 'qkv_dtype': torch.bfloat16, 'qkv_layout': 'thd_thd_thd', 'batch_size': 2, 'num_heads': 16, 'num_gqa_groups': 2, 'max_seqlen_q': tensor(87, device='cuda:3', dtype=torch.int32), 'max_seqlen_kv': tensor(87, device='cuda:3', dtype=torch.int32), 'head_dim_qk': 256, 'head_dim_v': 256, 'attn_mask_type': 'padding_causal', 'window_size': (-1, 0), 'alibi_slopes_shape': None, 'core_attention_bias_type': 'no_bias', 'core_attention_bias_shape': None, 'core_attention_bias_requires_grad': False, 'pad_between_seqs': False, 'attention_dropout': 0.0, 'context_parallel': False, 'cp_comm_type': 'p2p', 'deterministic': False, 'is_training': True, 'fp8': False, 'fp8_meta': {'fp8_checkpoint': False, 'fp8_group': None}, 'inference_params': None, 'softmax_type': 'vanilla'}
[rank3]: Traceback (most recent call last):
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/cli/_megatron/sft.py", line 5, in <module>
[rank3]: megatron_sft_main()
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/train/sft.py", line 79, in megatron_sft_main
[rank3]: return MegatronSft(args).main()
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/llm/base.py", line 49, in main
[rank3]: result = self.run()
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/train/sft.py", line 69, in run
[rank3]: self.trainer.train(train_dataset, val_dataset, data_collator)
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/base.py", line 774, in train
[rank3]: pretrain(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 864, in pretrain
[rank3]: iteration, num_floating_point_operations_so_far = train(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 2279, in train
[rank3]: ) = train_step(
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/base.py", line 327, in train_step
[rank3]: return self._origin_train_step(forward_step_func, new_data_iterator, model, optimizer, opt_param_scheduler,
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/training/training.py", line 1395, in train_step
[rank3]: losses_reduced = forward_backward_func(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 518, in forward_backward_no_pipelining
[rank3]: output_tensor, num_tokens = forward_step(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/pipeline_parallel/schedules.py", line 289, in forward_step
[rank3]: output_tensor, loss_func = forward_step_func(data_iterator, model)
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/trainers/trainer.py", line 149, in forward_step
[rank3]: output_tensor = model(**data)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/distributed/data_parallel_base.py", line 22, in forward
[rank3]: return self.module(*inputs, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/module.py", line 237, in forward
[rank3]: outputs = self.module(*inputs, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/model/gpt_model.py", line 228, in forward
[rank3]: hidden_states = self.decoder(
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 563, in forward
[rank3]: hidden_states = self._checkpointed_forward(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 433, in _checkpointed_forward
[rank3]: hidden_states, context = checkpoint_handler(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 417, in checkpoint_handler
[rank3]: return tensor_parallel.checkpoint(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 477, in checkpoint
[rank3]: return CheckpointFunction.apply(function, distribute_saved_activations, *args)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank3]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/tensor_parallel/random.py", line 423, in forward
[rank3]: outputs = run_function(*args)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_block.py", line 388, in custom_forward
[rank3]: hidden_states, context = layer(
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 875, in __call__
[rank3]: return super(MegatronModule, self).__call__(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/init.py", line 435, in forward
[rank3]: hidden_states, context = self._forward_attention(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/transformer/transformer_layer.py", line 501, in _forward_attention
[rank3]: attention_output_with_bias = self.self_attention(
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/workspace/ms-swift/swift/megatron/model/gpt/qwen3_next.py", line 282, in forward
[rank3]: core_attn_out = self.core_attention(
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank3]: return self._call_impl(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank3]: return forward_call(*args, **kwargs)
[rank3]: File "/ssd4/nietianyu/.cache/modelscope/hub/_github/Megatron-LM/megatron/core/extensions/transformer_engine.py", line 931, in forward
[rank3]: core_attn_out = super().forward(
[rank3]: File "/ssd4/nietianyu/.conda/envs/ms_swift/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py", line 1356, in forward
[rank3]: raise ValueError(
[rank3]: ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
(rank2 and rank0 printed the same DotProductAttention DEBUG output and raised the identical ValueError with the same traceback as rank3.)
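To narrow the problem down outside of Megatron/ms-swift, the failing call can probably be reproduced in isolation with the configuration the DEBUG line reports (bf16, thd layout, padding_causal mask, 16 query heads, 2 GQA groups, head_dim 256). The sketch below is only an approximation: the argument names follow the transformer_engine.pytorch.DotProductAttention API as I understand it for TE 2.x, and the packed sequence lengths are made up:

import torch
from transformer_engine.pytorch import DotProductAttention

# Two packed sequences of 87 tokens each, mimicking the thd layout from the log.
total_tokens, max_seqlen = 174, 87
n_heads, n_kv_heads, head_dim = 16, 2, 256
cu_seqlens = torch.tensor([0, 87, 174], dtype=torch.int32, device="cuda")

attn = DotProductAttention(
    num_attention_heads=n_heads,
    kv_channels=head_dim,
    num_gqa_groups=n_kv_heads,
    qkv_format="thd",
    attn_mask_type="padding_causal",
).cuda()

q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.bfloat16, device="cuda")
k = torch.randn(total_tokens, n_kv_heads, head_dim, dtype=torch.bfloat16, device="cuda")
v = torch.randn(total_tokens, n_kv_heads, head_dim, dtype=torch.bfloat16, device="cuda")

# If no TE backend supports this combination, this raises the same ValueError as the
# training run; if flash-attn becomes visible to TE, it should go through.
out = attn(q, k, v,
           cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           max_seqlen_q=max_seqlen, max_seqlen_kv=max_seqlen)
print(out.shape)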
Your hardware and system info
CUDA: 12.2
accelerate 1.10.1
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiosignal 1.4.0
airportsdata 20250811
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.10.0
astor 0.8.1
asttokens 3.0.0
async-timeout 5.0.1
attrs 25.3.0
audioread 3.0.1
autopep8 2.3.2
av 14.2.0
bitsandbytes 0.48.0.dev0
blake3 1.0.5
cachetools 6.1.0
cbor2 5.7.0
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
cloudpickle 3.1.1
comm 0.2.3
compressed-tensors 0.10.2
contourpy 1.3.2
cupy-cuda12x 13.5.1
cut-cross-entropy 25.1.1
cycler 0.12.1
datasets 3.6.0
debugpy 1.8.17
decorator 5.2.1
deepspeed 0.16.4
Deprecated 1.2.18
depyf 0.19.0
device-smi 0.4.1
diffusers 0.35.1
dill 0.3.8
diskcache 5.6.3
distro 1.9.0
dnspython 2.7.0
docstring_parser 0.17.0
einops 0.8.1
email_validator 2.2.0
et_xmlfile 2.0.0
exceptiongroup 1.3.0
executing 2.2.1
fastapi 0.116.1
fastapi-cli 0.0.8
fastapi-cloud-cli 0.1.5
fastrlock 0.8.3
ffmpeg 1.4
ffmpy 0.6.1
filelock 3.18.0
fire 0.7.0
flash_attn 2.8.1
fonttools 4.59.0
frozenlist 1.7.0
fsspec 2025.3.0
gekko 1.3.0
gguf 0.17.1
googleapis-common-protos 1.70.0
gptqmodel 4.1.0.dev0
gradio 5.31.0
gradio_client 1.10.1
groovy 0.1.2
grpcio 1.74.0
h11 0.16.0
hf_transfer 0.1.9
hf-xet 1.1.7
hjson 3.1.0
httpcore 1.0.9
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.34.4
idna 3.10
importlib_metadata 8.0.0
iniconfig 2.1.0
interegular 0.3.3
ipykernel 7.0.1
ipython 8.37.0
jedi 0.19.2
jieba 0.42.1
Jinja2 3.1.6
jiter 0.10.0
joblib 1.5.1
jsonschema 4.25.0
jsonschema-specifications 2025.4.1
jupyter_client 8.6.3
jupyter_core 5.9.1
kiwisolver 1.4.9
lark 1.2.2
lazy_loader 0.4
librosa 0.11.0
llamafactory 0.9.4.dev0 /ssd4/nietianyu/workspace/LLaMA-Factory
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.12
logbar 0.0.4
markdown-it-py 3.0.0
MarkupSafe 3.0.2
matplotlib 3.10.5
matplotlib-inline 0.1.7
maturin 1.9.4
mdurl 0.1.2
mistral_common 1.8.3
modelscope 1.28.2
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.3
multiprocess 0.70.16
nest_asyncio 1.6.0
networkx 3.4.2
ninja 1.13.0
nltk 3.9.1
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.3.14
nvidia-cuda-cupti-cu12 12.8.57
nvidia-cuda-nvrtc-cu12 12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12 9.7.1.26
nvidia-cufft-cu12 11.3.3.41
nvidia-cufile-cu12 1.13.0.11
nvidia-curand-cu12 10.3.9.55
nvidia-cusolver-cu12 11.7.2.55
nvidia-cusparse-cu12 12.5.7.53
nvidia-cusparselt-cu12 0.6.3
nvidia-ml-py 13.580.65
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.8.61
nvidia-nvtx-cu12 12.8.55
omegaconf 2.3.0
openai 1.99.9
openai-harmony 0.0.4
opencv-python-headless 4.12.0.88
openpyxl 3.1.5
opentelemetry-api 1.26.0
opentelemetry-exporter-otlp 1.26.0
opentelemetry-exporter-otlp-proto-common 1.26.0
opentelemetry-exporter-otlp-proto-grpc 1.26.0
opentelemetry-exporter-otlp-proto-http 1.26.0
opentelemetry-proto 1.26.0
opentelemetry-sdk 1.26.0
opentelemetry-semantic-conventions 0.47b0
opentelemetry-semantic-conventions-ai 0.4.12
optimum 1.27.0
orjson 3.11.1
outlines 0.1.11
outlines_core 0.2.10
packaging 25.0
pandas 2.3.1
parso 0.8.5
partial-json-parser 0.2.1.1.post6
peft 0.15.2
pexpect 4.9.0
pickleshare 0.7.5
pillow 11.3.0
pip 25.1
platformdirs 4.3.8
pluggy 1.6.0
pooch 1.8.2
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.52
propcache 0.3.2
protobuf 6.32.0
psutil 7.0.0
ptyprocess 0.7.0
pure_eval 0.2.3
py-cpuinfo 9.0.0
py-spy 0.4.1
pyarrow 21.0.0
pybase64 1.4.2
pycodestyle 2.14.0
pycountry 24.6.1
pycparser 2.22
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
pydub 0.25.1
Pygments 2.19.2
pyparsing 3.2.3
pytest 8.4.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-json-logger 3.3.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
pyzmq 27.0.2
random_word 1.0.13
ray 2.48.0
referencing 0.36.2
regex 2025.7.34
requests 2.32.4
rich 14.1.0
rich-toolkit 0.15.0
rignore 0.6.4
rouge 1.0.1
rouge-chinese 1.0.3
rpds-py 0.27.0
ruff 0.12.8
safehttpx 0.1.6
safetensors 0.6.2
scikit-learn 1.7.1
scipy 1.15.3
semantic-version 2.10.0
sentence-transformers 5.1.0
sentencepiece 0.2.0
sentry-sdk 2.34.1
setproctitle 1.3.6
setuptools 78.1.1
shellingham 1.5.4
shtab 1.7.2
six 1.17.0
sniffio 1.3.1
some-package 0.1
soundfile 0.13.1
soxr 0.5.0.post1
sse-starlette 3.0.2
stack_data 0.6.3
starlette 0.47.2
sympy 1.14.0
termcolor 3.1.0
threadpoolctl 3.6.0
tiktoken 0.11.0
tokenicer 0.0.5
tokenizers 0.22.0
tomli 2.2.1
tomlkit 0.13.3
torch 2.7.1+cu128
torchao 0.12.0
torchaudio 2.7.1
torchvision 0.22.1+cu128
tornado 6.5.2
tqdm 4.67.1
traitlets 5.14.3
transformers 4.57.0
transformers-v4.55.0-GLM-4.5V-preview 4.56.0.dev0
triton 3.3.1
trl 0.9.6
typer 0.16.0
typing_extensions 4.14.1
typing-inspection 0.4.1
tyro 0.8.14
tzdata 2025.2
unsloth 2025.8.9
unsloth_zoo 2025.8.8
urllib3 2.5.0
uvicorn 0.35.0
uvloop 0.21.0
vllm 0.10.1
watchfiles 1.1.0
wcwidth 0.2.14
websockets 15.0.1
wheel 0.45.1
wrapt 1.17.3
xformers 0.0.31
xgrammar 0.1.21
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
zstandard 0.25.0
Additional context
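One more detail that may be relevant: pip lists flash_attn 2.8.1, yet the TE debug config reports flash-attn as not installed. A possible cause (just a guess) is that the flash-attn wheel was built against a different torch/CUDA than the installed torch 2.7.1+cu128, in which case its compiled extension fails to import even though the package is present. A quick check:

# Importing the compiled extension directly surfaces ABI/build mismatches that pip list
# does not show (e.g. "undefined symbol" when flash-attn was built for another torch).
try:
    import flash_attn_2_cuda  # C++/CUDA extension shipped with flash-attn 2.x
    print("flash-attn compiled extension loads fine")
except Exception as exc:
    print("flash-attn extension failed to load:", exc)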