[webgpu] Add NTC layout support for CausalConvWithState #28504

Open
xiaofeihan1 wants to merge 1 commit into main from xfh/causal-conv-with-state-ntc

xiaofeihan1 commented May 14, 2026

Adds an optional data_format attribute (default "NCT") to com.microsoft.CausalConvWithState. When set to "NTC", the input/output tensor layout is channels-last [B, T, C] instead of channels-first [B, C, T].

Motivation:

  • WebGPU coalesced reads favor channel as the innermost (contiguous) dim.
  • For Qwen3.5 / Mamba-style models, the conv is wrapped between two Transposes (NTC->NCT before, NCT->NTC after) in the HuggingFace reference. Supporting NTC natively lets the model builder skip both Transposes (48 nodes removed for Qwen3.5-4B).
  • Measured +7.4% gen TPS / +5.8% e2e on Qwen3.5-4B int4 (RTX 5080, prefill-1000, max_tokens=100).
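
To make the first motivation point concrete, the following sketch compares flat element offsets in the two layouts. It assumes dense row-major tensors; the helper names (`OffsetNCT`, `OffsetNTC`, `ChannelStrideNCT`, `ChannelStrideNTC`) are illustrative and do not come from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration; not part of ONNX Runtime.

// Flat offset of element (b, c, t) in a dense row-major [B, C, T] tensor (NCT).
inline std::size_t OffsetNCT(std::size_t b, std::size_t c, std::size_t t,
                             std::size_t C, std::size_t T) {
  assert(c < C && t < T);
  return (b * C + c) * T + t;
}

// Flat offset of the same element in a dense row-major [B, T, C] tensor (NTC).
inline std::size_t OffsetNTC(std::size_t b, std::size_t c, std::size_t t,
                             std::size_t C, std::size_t T) {
  assert(c < C && t < T);
  return (b * T + t) * C + c;
}

// Distance in elements between channels c and c + 1 at a fixed (b, t).
// Both helpers require C >= 2 and T >= 1.
inline std::size_t ChannelStrideNCT(std::size_t C, std::size_t T) {
  return OffsetNCT(0, 1, 0, C, T) - OffsetNCT(0, 0, 0, C, T);
}

inline std::size_t ChannelStrideNTC(std::size_t C, std::size_t T) {
  return OffsetNTC(0, 1, 0, C, T) - OffsetNTC(0, 0, 0, C, T);
}
```

In NTC the channel stride is 1, so threads assigned to adjacent channels touch adjacent addresses; in NCT the same threads are T elements apart.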

Scope:

  • WebGPU EP: supports both NCT and NTC. Layout is part of CacheHint so the two paths get separate compiled shaders.
  • CPU / CUDA EP: NCT only. Both explicitly reject NTC at kernel construction, so an unsupported layout fails loudly instead of silently miscomputing.
  • Default value is "NCT" - existing models without the attribute behave unchanged.
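
The scope rules above could be sketched roughly as follows. This is a standalone illustration, not the PR's code: `DataFormat`, `ParseDataFormat`, `RequireNCT`, and `ShaderCacheKey` are hypothetical names, the `kernel_size` key component is an assumption, and plain exceptions stand in for `ORT_ENFORCE`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical enum and helpers; names are assumptions, not ONNX Runtime APIs.
enum class DataFormat { NCT, NTC };

// Maps the attribute string to an enum and rejects anything unknown.
inline DataFormat ParseDataFormat(const std::string& s) {
  if (s == "NCT") return DataFormat::NCT;
  if (s == "NTC") return DataFormat::NTC;
  throw std::invalid_argument("unsupported data_format: " + s);
}

// Stand-in for an NCT-only kernel constructor (CPU or CUDA): it throws at
// construction time instead of running with the wrong layout.
inline void RequireNCT(const std::string& ep_name, const std::string& data_format) {
  if (ParseDataFormat(data_format) != DataFormat::NCT) {
    throw std::invalid_argument(ep_name +
                                " CausalConvWithState only supports data_format='NCT' "
                                "currently. Got: " + data_format);
  }
}

// Illustrative cache key: the layout is part of the key, so NCT and NTC
// never share a compiled shader. kernel_size is a made-up extra component.
inline std::string ShaderCacheKey(DataFormat fmt, int kernel_size) {
  assert(kernel_size > 0);
  return std::string(fmt == DataFormat::NCT ? "NCT" : "NTC") +
         "|k=" + std::to_string(kernel_size);
}
```

A model that omits the attribute would go through `ParseDataFormat("NCT")`, matching the stated default.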

github-actions bot left a comment

You can commit the suggested changes from lintrunner.

std::string data_format = info.GetAttrOrDefault<std::string>("data_format", "NCT");
ORT_ENFORCE(data_format == "NCT",
            "CPU CausalConvWithState only supports data_format='NCT' currently. "
            "Got: ", data_format);

Suggested change
-             "Got: ", data_format);
+             "Got: ",
+             data_format);

std::string data_format = info.GetAttrOrDefault<std::string>("data_format", "NCT");
ORT_ENFORCE(data_format == "NCT",
            "CUDA CausalConvWithState only supports data_format='NCT' currently. "
            "Got: ", data_format);

Suggested change
-             "Got: ", data_format);
+             "Got: ",
+             data_format);

guschmue added the ep:WebGPU (ort-web webgpu provider) label May 14, 2026