Describe the bug
Opening a GitHub issue for public tracking purposes. (internal #4489257)
megatron_gpt_eval.py throws the following error:
...
core_attn_out = super().forward(
File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 2099, in forward
qkv_layout, query_layer, key_layer, value_layer = _get_qkv_layout(
File "/usr/local/lib/python3.10/dist-packages/transformer_engine/pytorch/attention.py", line 1209, in _get_qkv_layout
raise Exception("The provided qkv memory layout is not supported!")
Exception: The provided qkv memory layout is not supported!
This only happens when all of the following are true:
- micro batch size = 1 (mbs = 2 works; see the sketch below)
- apply_rope_fusion = True
- RoPE fusion is available (i.e. the container has the latest Apex)
- the workload is inference: megatron_gpt_eval.py, megatron_gpt_generate.py, or the validation loop during training (training itself works with both mbs = 1 and mbs = 2)
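For context on the exception: `_get_qkv_layout` in Transformer Engine infers the qkv layout from tensor shapes and strides. The sketch below is a generic illustration of why a batch dimension of size 1 is fragile for stride-based layout detection; whether this is the exact mechanism behind this failure is an assumption based on the mbs = 1 symptom, not a reproduction of TE's actual code.

```python
import torch

s, b, h, d = 8, 1, 4, 16  # sequence, batch (mbs = 1), heads, head_dim

# Query produced directly in sbhd layout.
q_direct = torch.randn(s, b, h, d)

# The same logical values produced in bshd layout, then transposed to
# sbhd -- the kind of view a fused kernel can hand back.
q_view = torch.randn(b, s, h, d).transpose(0, 1)

assert q_direct.shape == q_view.shape  # identical shapes
assert q_direct.is_contiguous()        # True
assert q_view.is_contiguous()          # also True: PyTorch ignores the
                                       # strides of size-1 dimensions

# But the raw strides differ in the size-1 batch dimension:
print(q_direct.stride())  # (64, 64, 16, 1)
print(q_view.stride())    # (64, 512, 16, 1)

# A layout detector that compares raw stride tuples classifies these two
# tensors differently, even though both are valid contiguous sbhd tensors.
print(q_direct.stride() == q_view.stride())  # False
```

At mbs = 2 the batch stride is constrained by the data, so the two layouts cannot collide this way, which is consistent with mbs = 2 working.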
Steps/Code to reproduce bug
Run inference with any GPT-style model (e.g. Llama 2); a sketch of a typical invocation follows.
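A minimal sketch of an invocation that should hit the failing path. Option names follow the megatron_gpt_inference.yaml example config in NeMo and may differ between releases; passing a single prompt so that the effective micro batch size is 1 is an assumption about how the script batches prompts.

```bash
python examples/nlp/language_modeling/megatron_gpt_eval.py \
    gpt_model_file=/path/to/model.nemo \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    inference.greedy=True \
    inference.tokens_to_generate=32 \
    prompts='["How are you?"]'
```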
Environment overview
nvcr.io/nvidia/nemo:24.01.framework