WebGPU fragment shader optimization #8733
Emit WebGPU fragment builtins only when the processed source references the corresponding pc* globals. This avoids carrying unused front-facing, primitive-index, position, and sample-index inputs through material fragment shaders; in particular sample_index is no longer requested unless a shader actually needs pcSampleIndex.

Refactor the WGSL clustered-light hot path to avoid mutating a large ClusterLightData through ptr<function> helper calls. Core light decode now returns value data, and optional spot, area, shadow, cookie, and omni-atlas data is decoded into smaller values at the point of use to reduce register pressure and potential function-memory spills.
The fragment shader overhead was mostly from generated WGSL asking the compiler to carry data the shader did not actually use.

Before the fix, the WGSL processor always emitted these fragment inputs/globals for WebGPU:
@builtin(position) position : vec4f,
and copied them into private globals:
pcPosition = input.position;

For the material shaders we compared, only pcPosition was used for fog. pcFrontFacing, pcPrimitiveIndex, and usually pcSampleIndex were dead plumbing. In particular, sample_index can be expensive because requesting it may force sample-rate fragment shading on MSAA targets, which is much more work than pixel-rate shading.

The clustered-light path also had WGSL-specific overhead: it decoded light data into a large ClusterLightData local and passed it through helpers as ptr<function, ClusterLightData>. That makes the hot per-light loop look like mutable function-memory traffic to the compiler. It can increase register pressure or cause spills, especially because the struct included fields only needed by optional spot/shadow/cookie/area paths.

The fix was:
Pull request overview
This PR optimizes WebGPU fragment shader inputs and WGSL clustered lighting to reduce unnecessary built-ins and lower register/function-memory pressure in material fragment shaders.
Changes:
- Emits WebGPU fragment built-ins only when corresponding pc* globals are referenced.
- Refactors clustered light decoding to return smaller value structs instead of mutating one large struct through function pointers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/platform/graphics/webgpu/webgpu-shader-processor-wgsl.js | Adds source-based detection for optional fragment built-ins and conditional global/input copy generation. |
| src/scene/shader-lib/wgsl/chunks/lit/frag/clusteredLight.js | Splits clustered light decode data into smaller structs/values and updates light evaluation call sites. |
if (needsPosition) {
    block += ' @builtin(position) position : vec4f,\n'; // interpolated fragment position
}
if (needsFrontFacing) {
    block += ' @builtin(front_facing) frontFacing : bool,\n'; // front-facing
}
if (needsSampleIndex) {
    block += ' @builtin(sample_index) sampleIndex : u32,\n'; // sample index for MSAA
}
addressed in a followup
mvaligursky left a comment:
Great, thanks! I'll merge this and make some small follow-up improvements as suggested.
Follow-up to #8733. Generalizes gating of optional WGSL built-in inputs across all stages, fixes invalid WGSL for shaders without varyings, and restructures emission to be data-driven.

Fixes:
- Empty FragmentInput struct for fragment shaders with no varyings and no pc* references (e.g. gizmo-unlit). Sentinel built-in emitted only when struct would otherwise be empty.
- Detection now recognises both pc* globals and input.field access.
- Word-boundary regex prevents identifier-substring false positives.

Changes:
- FRAGMENT_BUILTINS / VERTEX_BUILTINS tables describe each optional built-in; adding one is a single-row change.
- vertex_index / instance_index now gated symmetrically with fragment built-ins (previously always emitted).
- Per-stage sentinel fallback (position for fragment, vertex_index for vertex) guarantees non-empty input structs.
- processAttributes signature extended with device, source, entryInputName (internal only).
- copyInputs re-runs ENTRY_FUNCTION match against the post-rename source to avoid stale brace positions.

Co-authored-by: Martin Valigursky <mvaligursky@snapchat.com>
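The gating described above can be illustrated with a minimal sketch. It is not the engine's actual code: the row shape of FRAGMENT_BUILTINS, the helper names usesIdentifier / usesInputField / buildFragmentInput, and the default entry input name `input` are all assumptions made for illustration. It shows whole-identifier matching with a word-boundary regex, detection of both pc* globals and input.field access, and the sentinel fallback that keeps the struct non-empty (WGSL does not allow a struct with no members).

```javascript
// Hypothetical table: one row per optional fragment built-in.
// The row shape is an assumption for illustration, not the engine's real table.
const FRAGMENT_BUILTINS = [
    { global: 'pcPosition', field: 'position', decl: '@builtin(position) position : vec4f' },
    { global: 'pcFrontFacing', field: 'frontFacing', decl: '@builtin(front_facing) frontFacing : bool' },
    { global: 'pcSampleIndex', field: 'sampleIndex', decl: '@builtin(sample_index) sampleIndex : u32' }
];

// True only when `name` appears as a whole identifier,
// so 'pcFrontFacing' does not match inside 'pcFrontFacingExtra'.
const usesIdentifier = (source, name) => new RegExp(`\\b${name}\\b`).test(source);

// True when the source reads the field directly from the entry input, e.g. `input.position`.
const usesInputField = (source, inputName, field) =>
    new RegExp(`\\b${inputName}\\s*\\.\\s*${field}\\b`).test(source);

// Builds the FragmentInput struct text from the varyings plus any referenced built-ins.
function buildFragmentInput(source, varyings = [], inputName = 'input') {
    const members = [...varyings];
    for (const b of FRAGMENT_BUILTINS) {
        if (usesIdentifier(source, b.global) || usesInputField(source, inputName, b.field)) {
            members.push(b.decl);
        }
    }
    // Sentinel: an empty struct is invalid WGSL, so fall back to position.
    if (members.length === 0) {
        members.push(FRAGMENT_BUILTINS[0].decl);
    }
    return `struct FragmentInput {\n${members.map(m => `    ${m},\n`).join('')}};`;
}
```

With this shape, adding another optional built-in is one more row in the table, and a shader that never mentions pcSampleIndex or input.sampleIndex never requests sample_index.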
Follow-up to #8733. Brings the GLSL clustered-light chunk in line with the WGSL refactor, applies the Samsung-precision workaround to WGSL that was previously only in GLSL, and tightens decoder helper signatures.

WGSL changes:
- ClusterLightData slimmed: flags / anglesData / colorBFlagsData moved to module-scope temporaries (Samsung precision issue, #7800).
- ClusterLightData members reordered so each vec3f packs with an adjacent 4-byte field into a 16-byte slot.
- decodeClusterLightProjectionMatrixData returns mat4x4f instead of mutating a module global as a side effect.
- decode helpers tightened to take only what they need (lightIndex: i32 / biasesData: f32 / no args).
- One-line comments on ClusterLightSpotData / AreaData / ShadowData.
- Removed four obsolete dev comments from the original WGSL port.

GLSL changes:
- Same struct slim-down, member reorder, and three new sub-structs.
- decode helpers changed from inout-mutating to value-returning, taking int lightIndex / float biasesData / no args.
- decodeClusterLightCore returns ClusterLightData by value.
- sampleLightTextureF takes (int lightIndex, int index) for symmetry with WGSL.

No public API changes. The two backends now have near line-for-line equivalent structures, easing cross-backend maintenance.

Co-authored-by: Martin Valigursky <mvaligursky@snapchat.com>
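The effect of the member reorder can be checked with a small layout calculator. This is a sketch under stated assumptions: it applies WGSL's size and alignment rules for host-shareable structs (vec3f is 12 bytes with 16-byte alignment, f32 is 4 bytes with 4-byte alignment), which a compiler is not required to use for function-scope values, and the member names below are hypothetical rather than the real ClusterLightData fields.

```javascript
// Size and alignment in bytes for the two WGSL types used in this sketch.
const TYPES = {
    f32: { size: 4, align: 4 },
    vec3f: { size: 12, align: 16 }
};

// Rounds n up to the next multiple of align.
const roundUp = (align, n) => Math.ceil(n / align) * align;

// Computes member offsets and total size: each member starts at the next
// offset that satisfies its alignment, and the struct size is rounded up
// to the largest member alignment.
function structLayout(members) {
    let offset = 0;
    let structAlign = 1;
    const offsets = {};
    for (const [name, type] of members) {
        const { size, align } = TYPES[type];
        offset = roundUp(align, offset);
        offsets[name] = offset;
        offset += size;
        structAlign = Math.max(structAlign, align);
    }
    return { offsets, size: roundUp(structAlign, offset) };
}

// Hypothetical member names, for illustration only.
const scattered = structLayout([
    ['range', 'f32'], ['position', 'vec3f'], ['falloff', 'f32'], ['color', 'vec3f']
]);
const packed = structLayout([
    ['position', 'vec3f'], ['range', 'f32'], ['color', 'vec3f'], ['falloff', 'f32']
]);

console.log(scattered.size); // 48
console.log(packed.size);    // 32
```

In the packed order each f32 fills the 4 bytes left after a vec3f, so two vec3f / f32 pairs occupy two 16-byte slots instead of three.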