WebGPU fragment shader optimization #8733

Merged: mvaligursky merged 2 commits into playcanvas:main from cabanier:webgpu_fragment_shader_optimization on May 18, 2026

@cabanier
Contributor

Emit WebGPU fragment builtins only when the processed source references the corresponding pc* globals. This avoids carrying unused front-facing, primitive-index, position, and sample-index inputs through material fragment shaders; in particular sample_index is no longer requested unless a shader actually needs pcSampleIndex.

Refactor the WGSL clustered-light hot path to avoid mutating a large ClusterLightData through ptr<function> helper calls. Core light decode now returns value data, and optional spot, area, shadow, cookie, and omni-atlas data is decoded into smaller values at the point of use to reduce register pressure and potential function-memory spills.

@cabanier
Contributor Author

The fragment shader overhead was mostly from generated WGSL asking the compiler to carry data the shader did not actually use.

Before the fix, the WGSL processor always emitted these fragment inputs/globals for WebGPU:

@builtin(position) position : vec4f,
@builtin(front_facing) frontFacing : bool,
@builtin(sample_index) sampleIndex : u32,
@builtin(primitive_index) primitiveIndex : u32,

and copied them into private globals:

pcPosition = input.position;
pcFrontFacing = input.frontFacing;
pcSampleIndex = input.sampleIndex;
pcPrimitiveIndex = input.primitiveIndex;

For the material shaders we compared, the only built-in actually used was pcPosition, for fog. pcFrontFacing, pcPrimitiveIndex, and usually pcSampleIndex were dead plumbing. In particular, sample_index can be expensive because requesting it may force sample-rate fragment shading on MSAA targets, which is much more work than pixel-rate shading.

The clustered-light path also had WGSL-specific overhead: it decoded light data into a large ClusterLightData local and passed it through helpers as ptr<function, ClusterLightData>. That makes the hot per-light loop look like mutable function-memory traffic to the compiler. It can increase register pressure or cause spills, especially because the struct included fields only needed by optional spot/shadow/cookie/area paths.

The fix was:

  • emit position, front_facing, sample_index, and primitive_index only when the final fragment source references pcPosition, pcFrontFacing, pcSampleIndex,
    or pcPrimitiveIndex;
  • change clustered-light helpers to return smaller value structs/vectors instead of mutating a large pointer-passed ClusterLightData;
  • remove the half-precision conversion churn, which was adding lots of half(...), half3(...), and f32(...) casts around ordinary lighting math.
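
The first bullet can be sketched roughly as follows. This is a minimal illustration and not the engine's code: the `OPTIONAL_BUILTINS` table and the `buildFragmentBuiltins` helper are hypothetical names, and only the built-in declarations and pc* copies quoted above come from this thread. It uses a plain substring test for brevity; the follow-up commit referenced later in this conversation moves to word-boundary matching.

```javascript
// Hypothetical table: one entry per optional fragment built-in.
const OPTIONAL_BUILTINS = [
    { global: 'pcPosition', field: 'position', decl: '@builtin(position) position : vec4f' },
    { global: 'pcFrontFacing', field: 'frontFacing', decl: '@builtin(front_facing) frontFacing : bool' },
    { global: 'pcSampleIndex', field: 'sampleIndex', decl: '@builtin(sample_index) sampleIndex : u32' },
    { global: 'pcPrimitiveIndex', field: 'primitiveIndex', decl: '@builtin(primitive_index) primitiveIndex : u32' }
];

// Returns only the input-struct members and copy statements that `source` needs.
function buildFragmentBuiltins(source) {
    // Simplification: substring check, so a longer identifier containing the name also matches.
    const used = OPTIONAL_BUILTINS.filter(b => source.includes(b.global));
    return {
        members: used.map(b => `    ${b.decl},`).join('\n'),
        copies: used.map(b => `    ${b.global} = input.${b.field};`).join('\n')
    };
}

// A fog-style shader body that only reads pcPosition (made-up example source).
const source = 'let fogDepth = pcPosition.z / pcPosition.w;';
const result = buildFragmentBuiltins(source);
console.log(result.members);
console.log(result.copies);
```

With this input only the position member and its copy are generated, so sample_index is never requested for a shader that does not read pcSampleIndex.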

@cabanier changed the title from "Webgpu fragment shader optimization" to "WebGPU fragment shader optimization" on May 15, 2026
@willeastcott added the performance (Relating to load times or frame rate), area: graphics (Graphics related issue), and area: xr (XR related issue) labels on May 17, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR optimizes WebGPU fragment shader inputs and WGSL clustered lighting to reduce unnecessary built-ins and lower register/function-memory pressure in material fragment shaders.

Changes:

  • Emits WebGPU fragment built-ins only when corresponding pc* globals are referenced.
  • Refactors clustered light decoding to return smaller value structs instead of mutating one large struct through ptr<function> pointer parameters.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/platform/graphics/webgpu/webgpu-shader-processor-wgsl.js | Adds source-based detection for optional fragment built-ins and conditional global/input copy generation. |
| src/scene/shader-lib/wgsl/chunks/lit/frag/clusteredLight.js | Splits clustered light decode data into smaller structs/values and updates light evaluation call sites. |

Comment on lines +749 to +757
if (needsPosition) {
    block += ' @builtin(position) position : vec4f,\n'; // interpolated fragment position
}
if (needsFrontFacing) {
    block += ' @builtin(front_facing) frontFacing : bool,\n'; // front-facing
}
if (needsSampleIndex) {
    block += ' @builtin(sample_index) sampleIndex : u32,\n'; // sample index for MSAA
}
Contributor

addressed in a followup

Contributor

@mvaligursky left a comment

Great, thanks! I'll merge this and do some small follow-up improvements as suggested.

@mvaligursky merged commit 4bdecbf into playcanvas:main on May 18, 2026
10 of 12 checks passed
mvaligursky added a commit that referenced this pull request May 18, 2026
Follow-up to #8733. Generalizes gating of optional WGSL built-in inputs
across all stages, fixes invalid WGSL for shaders without varyings, and
restructures emission to be data-driven.

Fixes:
- Empty FragmentInput struct for fragment shaders with no varyings and
  no pc* references (e.g. gizmo-unlit). Sentinel built-in emitted only
  when struct would otherwise be empty.
- Detection now recognises both pc* globals and input.field access.
- Word-boundary regex prevents identifier-substring false positives.

Changes:
- FRAGMENT_BUILTINS / VERTEX_BUILTINS tables describe each optional
  built-in; adding one is a single-row change.
- vertex_index / instance_index now gated symmetrically with fragment
  built-ins (previously always emitted).
- Per-stage sentinel fallback (position for fragment, vertex_index for
  vertex) guarantees non-empty input structs.
- processAttributes signature extended with device, source,
  entryInputName (internal only).
- copyInputs re-runs ENTRY_FUNCTION match against the post-rename
  source to avoid stale brace positions.

Co-authored-by: Martin Valigursky <mvaligursky@snapchat.com>
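
The detection and sentinel behaviour described in this commit message might look roughly like the sketch below. The row shape, the `referencesBuiltin` and `emitInputMembers` helpers, and the sample sources are assumptions for illustration; the real `FRAGMENT_BUILTINS` table in the engine may be structured differently.

```javascript
// Hypothetical rows; the real FRAGMENT_BUILTINS table may differ.
const FRAGMENT_BUILTINS = [
    { global: 'pcPosition', field: 'position', wgsl: '@builtin(position) position : vec4f' },
    { global: 'pcFrontFacing', field: 'frontFacing', wgsl: '@builtin(front_facing) frontFacing : bool' }
];

// True when the source uses the pc* global or reads input.<field>.
// \b keeps a longer identifier such as pcPositionOffset from matching pcPosition.
function referencesBuiltin(source, row, inputName = 'input') {
    const pattern = new RegExp(`\\b${row.global}\\b|\\b${inputName}\\.${row.field}\\b`);
    return pattern.test(source);
}

// Returns the built-in members to emit into the fragment input struct.
function emitInputMembers(source, hasVaryings) {
    const rows = FRAGMENT_BUILTINS.filter(row => referencesBuiltin(source, row));
    // Sentinel: a WGSL struct needs at least one member, so fall back to position
    // only when the struct would otherwise be empty.
    if (rows.length === 0 && !hasVaryings) {
        rows.push(FRAGMENT_BUILTINS[0]);
    }
    return rows.map(row => row.wgsl);
}

console.log(referencesBuiltin('let a = pcPositionOffset;', FRAGMENT_BUILTINS[0])); // false
console.log(referencesBuiltin('let f = input.frontFacing;', FRAGMENT_BUILTINS[1])); // true
console.log(emitInputMembers('return vec4f(1.0);', false)); // only the sentinel position member
```

Keeping each built-in as a table row means the detection, emission, and sentinel logic are written once and a new built-in only adds data.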
mvaligursky added a commit that referenced this pull request May 18, 2026
Follow-up to #8733. Brings the GLSL clustered-light chunk in line with
the WGSL refactor, applies the Samsung-precision workaround to WGSL
that was previously only in GLSL, and tightens decoder helper
signatures.

WGSL changes:
- ClusterLightData slimmed: flags / anglesData / colorBFlagsData
  moved to module-scope temporaries (Samsung precision issue, #7800).
- ClusterLightData members reordered so each vec3f packs with an
  adjacent 4-byte field into a 16-byte slot.
- decodeClusterLightProjectionMatrixData returns mat4x4f instead of
  mutating a module global as a side effect.
- decode helpers tightened to take only what they need
  (lightIndex: i32 / biasesData: f32 / no args).
- One-line comments on ClusterLightSpotData / AreaData / ShadowData.
- Removed four obsolete dev comments from the original WGSL port.

GLSL changes:
- Same struct slim-down, member reorder, and three new sub-structs.
- decode helpers changed from inout-mutating to value-returning,
  taking int lightIndex / float biasesData / no args.
- decodeClusterLightCore returns ClusterLightData by value.
- sampleLightTextureF takes (int lightIndex, int index) for symmetry
  with WGSL.

No public API changes. The two backends now have near line-for-line
equivalent structures, easing cross-backend maintenance.

Co-authored-by: Martin Valigursky <mvaligursky@snapchat.com>
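
The member reordering mentioned for ClusterLightData can be checked against WGSL's alignment rules, under which vec3f has a 16-byte alignment and a 12-byte size, so a 4-byte scalar placed directly after it fills the rest of the 16-byte slot. The sketch below computes offsets for two orderings. The field names are hypothetical and are not the real ClusterLightData members, `layoutStruct` is an illustrative helper, and how a GPU compiler actually holds a function-local struct in registers is not guaranteed to follow this memory layout.

```javascript
// WGSL alignment and size (in bytes) for the two types used here.
const TYPES = {
    f32: { align: 4, size: 4 },
    vec3f: { align: 16, size: 12 }
};

const roundUp = (multiple, value) => Math.ceil(value / multiple) * multiple;

// Computes member offsets and total size following WGSL struct layout rules:
// each member starts at its alignment, and the struct size is rounded up to
// the largest member alignment.
function layoutStruct(members) {
    let offset = 0;
    let structAlign = 1;
    const offsets = {};
    for (const [name, type] of members) {
        const { align, size } = TYPES[type];
        offset = roundUp(align, offset);
        offsets[name] = offset;
        offset += size;
        structAlign = Math.max(structAlign, align);
    }
    return { offsets, size: roundUp(structAlign, offset) };
}

// Hypothetical field names, for illustration only.
const interleaved = layoutStruct([
    ['position', 'vec3f'], ['range', 'f32'],
    ['color', 'vec3f'], ['falloff', 'f32']
]);
const grouped = layoutStruct([
    ['position', 'vec3f'], ['color', 'vec3f'],
    ['range', 'f32'], ['falloff', 'f32']
]);

console.log(interleaved.size, grouped.size); // prints "32 48"
```

Pairing each vec3f with a scalar leaves no padding, while grouping the two vec3f members first wastes 4 bytes after the first vector and pads the tail of the struct.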