

GPU Web F2F 2023-02-16/17

Note that unless stated otherwise this is a GPU for the Web community group meeting and not a working group meeting.

Chair: CW+KG

Scribe: ds, KR, KN (wgsl), DN (api)

Location: WebEx: https://appleinc.webex.com/appleinc/j.php?MTID=ma6ec2c88dd34dd937064d3ad2e15630a

Tentative schedule (all times are in PT)

All blocks are aligned with hours, but roughly each hour is planned to include a 10-minute break. Ideally breaks fall at [10 minutes before the hour, the hour], but can be moved so as not to cut topics off too abruptly.

This schedule is a tentative, blocked-out version: the details of the topics to be discussed will be filled in during the first hour of Thursday, but the overall schedule is expected to be fairly dynamic and might change depending on the group's interests or the speed of treating some topics.

Thursday morning (9AM - 1PM)

9AM - 10AM

Meta topics, fill out the block out schedule.

10AM - 11AM

Various demos:

  • Debuggers (API: GT, model viewer: BJ, wgsl: JP)
  • Coverage (KN)
  • Earth (LK)
  • WebGPU on mobile (BJ)
  • WebGPU on GLES/D3D11 (SW)
  • yours here?

11AM - 1PM

WGSL v1 items.

Thursday afternoon (2PM - 6PM)

2PM - 3PM

Cross-topic future tech (WebGPU-compat, WebXR)

3PM - 5PM

API V1 items - mainly just burndown

5PM - 6PM

Shanghai topics: WebNN, DP4A, push constants, WebCodecs (optional)

Friday morning (9AM - 1PM)

9AM - 1PM

Remaining WGSL v1 items, then

Future WGSL tech items

Friday afternoon (2PM - 6PM)

2PM - 3PM

Guest presentations / demos:

  • Vello, compute-based path rasterization on WebGPU [Raph]
  • Bindless overview and implementation experience in wgpu/Naga [Connor]
  • yours here?

3PM - 5/6PM

Future API tech items (assuming no more v1 items)

5PM - 6PM

?

Topic haystack

Meta topics:

  • Administrivia [chairs]
    • Cadence of meetings?
  • Agenda for next meetings (and whether to skip the next ones) [chairs]

Future tech / post-V1:

  • API
    • Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388 [Corentin]
    • WebGPU + WebCodec #1380 #2498 [Shaobo]
  • mapSync - Motivating use cases and experiment results #2217 [Brandon]

Attendance

  • Apple
    • Brandel Zachernuk
    • Dan Glastonbury
    • Mike Wyrzykowski
    • Myles C. Maxfield
  • Google
    • Alan Baker
    • Austin Eng
    • Ben Clayton
    • Brandon Jones
    • Corentin Wallez
    • Dan Sinclair
    • David Neto
    • Gregg Tavares
    • James Price
    • Kai Ninomiya
    • Ken Russell
    • Loko Kung
    • Shrek Shao
    • Stephen White
    • Ryan Harrison
  • Intel
    • Enrico Galli
    • Hao Li
    • Jiajia Qin
    • Jiawei Shao
    • Jie Chen
    • Yunchao He
  • Microsoft
    • Chai Chaoweeraprasit
    • Rafael Cintron
  • Mozilla
    • Erich Gubler
    • Jim Blandy
    • Kelsey Gilbert
    • Teodor Tanasoaia
  • Unity
    • Brendan Duncan
  • W3C
    • Francois Daoust
    • Giorgio
  • Henk Heidstra
  • James Darpinian
  • Jeremy Sachs
  • Mehmet Oguz Derin
  • Rob Conde
  • Tadeu Zagallo Da Silva

NOTES

Thursday

  • CW: Thanks Myles. First f2f meeting in 3 years, exciting to be here. Going to have a bunch of blocks. This one will be about meta things and schedule, round table until done, some administrivia. This morning will have a block on various demos. Rest of the morning will be WGSL v1 items, at least that's the plan. Will discuss agenda next. Start with a small round table. I'm Corentin, working for Google in Paris. Part of the group since it was created. Leading the Chromium implementation.
  • BC: Ben, the Tint technical lead and manager. Been working on this for 2 years. Tint is the WGSL compiler for Chrome.
  • DN: Work for google, WGSL editor
  • GT: Programmer at Google
  • SW: Stephen, work on Dawn and Tint and Angle
  • MR: Work for Apple on the pipeline side of things
  • KN: Kai, work for google, spec editor, do mostly spec and test stuff for webgpu
  • MM: Myles, also an editor, working on WebGPU since inception and basically technical
  • DS: work on WGSL at Google. :)
  • EG: Mozilla, newcomer working on webgpu
  • JB: Jim, mozilla team lead for webgpu. Maintainer of Naga and wgpu github projects which make up firefox implementation.
  • KG: Kelsey, co-chair for webgpu specializing in wgsl side. Mostly work on standardization side.
  • RC: Rafael, works on low-level graphics for Microsoft. Also represents colour on the web and the machine learning group.
  • AE: work on WebGPU impl in Chrome, and Dawn. Started work on it slightly after inception.
  • BJ: Brandon, sharing spec editing duties. Also worked on WebXR and hopefully they will meet up. Like to work on samples and documents.
  • JS: Jeremy - enthusiast. :) Independent. I fight for the users. :) I make little JS things that use WebGL, WebGPU, etc. People look at them and go "huh". Want people to enjoy WebGPU. Online I'm rezmason. I ask questions on Matrix.
  • KR: Ken, mostly overhead. Mostly management and webgl work.
  • AM: Antonio, on the Tint team.
  • TE: Teo, working for Mozilla, working on webgpu implementation
  • MOD: Oguz, editor for WGSL spec, developer for computer graphics.
  • Rob Conde:
  • LK: Loko, at Google working on the WebGPU implementation.
  • AB: Alan, from Google, WGSL spec editor. Also other shading standards
  • SS: Shrek, work at Google on Dawn/WebGPU
  • JP: James Price, one of core Tint developers, working on it for couple years now.
  • RH: Ryan, works on Tint team, recently been working on WGSL CTS.
  • FD: Francois Daoust, W3C, team contact for WebGPU WG. I try to leave you alone and let you do the technical work. :) Immediate question is transition of spec to Candidate Recommendation. Can answer questions about that.
  • YH: Yunchao, Intel, working on WebGPU in Chrome / Dawn project. Previously worked on WebGL. Relationship with team in China. They'll join the F2F remotely later.

Administrivia

  • CW: Start with administrivia.
  • KG: Not a lot, mostly the f2f coming up. Good to have everyone here, lots of new faces and lots of faces from 4 years ago. As FD mentioned, the main goal for me will be to clean up and solidify our v1 on the spec side and try to get finalization of direction on the pieces we haven't decided yet. After doing that we're done, ship it. That's the goal: mostly a burndown of existing issues, ideally having time to discuss the trickier issues, which I find easier to do in person, and also hopefully time for things to do in the future and rough priorities. That last part is really a stretch goal; the main one is to get v1 out the door in a way we're happy with.
  • CW: There is CR coming up; after this we'll hopefully have direction for v1 on most things. Means after the f2f we might all be heads-down on implementation, which is a good place to be in. At the end of the f2f we'll probably come back to discuss the cadence of meetings. Might not need weekly meetings for WGSL and API, and a reduced cadence saves a bit of time. Maybe a CTS update briefly.

CTS Updates

  • KN: Not much new to add, continuing to work on stuff. Later today will show the code coverage stuff we have so folks can see and maybe be inspired. Been useful for us; been going through results and finding things that need to be covered. Members of our team have been hammering away at filling testing gaps. Gregg has been testing limit behaviour. Intel working on tricky colour space details in video tests.
  • BC: Shout out to RH for work on the WGSL builtin tests. Lots of fiddly arithmetic precision, mountain of work
  • KN: Shrek recently wrote great tests for sample masking, alpha-to-coverage interaction. Very detailed tests. In some areas we have some of the strictest and best tests in the industry, esp. with Ryan's work on floating-point and builtins.
  • KG: Working on integration.
  • EG: need to figure out how to shard the tests for our CI. Then theoretically will be integrated, work on failures (most tests are failing right now).
  • JB: Have some early api compatibility things.
  • EG: Some very fundamental things; a big wave to get things passing, then most of the time will be the last 20%.
  • KG: Is Apple doing integration?
  • MM: We use web platform tests for the integration. Have imported the tests. running them to some extent. Most are failing. Initial goal - make something people can play with, rather than "pass the CTS". We'll add in stuff later that makes the tests pass.

Scheduling for the remaining blocks

  • CW: Thanks everyone. Scheduling for the f2f: rough blocked-out schedule at the top of the document. Any concerns? A couple of frozen items for guest presentations/demos - they're people not in the meeting who come at a fixed time. Topics for Intel Shanghai at the last hour to make it easier for them to join. Any concerns with scheduling? For the WGSL blocks, will do v1 items first and then future tech. Gave guesstimates of times.
  • CW: Hearing no concerns. For WGSL, which priority/order to talk about items?
  • DN: Added a deduplication list; have non-editorial v1 issues. A lot are the same issue, so there is a little table above that maps out the independent issues. Can do that, and then a bunch of things landed under template disambiguation which should be treated as a group. Otherwise freeform.
  • KG: Just burndown.
  • CW: API v1 will be burndown. Not that much which is good. For Thursday afternoon start with cross topic future tech, cross either api /wgsl or w3c working group. WebNN has been moved.
  • KR: Raphael wanted to talk about that
  • CW: Want Intel Shanghai there, move to 5. There is not that much exciting agenda stuff to talk about. People can comment if the agenda isn't OK.
  • MM: Lunch at 1
  • KG: Breaks every 60-90 minutes.
  • CW: First break now, back at 10, start with demos.

Demos

  • CW: Demos, starting with the debugger. If you have demos of implementations or anything, add them to the minutes doc or ping CW. Hopefully demos are quick; will have questions after each demo.

WebGPU Debugger

  • GT: WebGPU debugger demo (cwallez@, gman@, bajones@). In October did an exploration sprint doing random WebGPU things; CW, BJ and GM worked on a debugger prototype. Set up to take a screenshot, shows commands coming through, shows binds and draw commands. Captured a trace, can play it back. Goes through draw calls and puts triangles up, shows command details, can show a texture view and see various things. See the data from the buffers, things like that. It's "working" as a first step, an MVP. Lots of missing features. Tried to debug the CTS with it, added about 6 features, took too long. It works. Some visualizers BJ will show.
  • JB: Is this integrated …
  • GM: Javascript library
  • JB: Bring in to page and it adds it?
  • GM: Will probably make an extension version. Capture part, library to capture frame or start/stop capturing. Saves as JSON, it's enormous.
  • JB: Hacks the prototype?
  • GM: Yea. Then a replay library to replay traces. Then a debugger UI that uses both of those if you want. Can give it a trace, but right now it will capture and trace. Can send the trace to someone else to load into their debugger. Lots of work to do, lots of cleanup. Get everything working and debugged. Not sure what to do with multi-threaded WebGPU. Possible from JS, but maybe at a lower level that spits out the same data. Would like to add everything out there.
  • JB: What about debug wgsl?
    • GM: That's next. JP has a demo.
  • GM: BJ has made a viewer.
  • MM: Is the goal a JS library, or put in devtools
  • CW: Moved the JS library to github.com/webgpu/webgpu-debugger. Was full of nothing. Tomorrow morning Austin's samples will probably move there too, as they're everyone's samples. The group can discuss this, but community stuff can be there.
  • MM: Not sure if that's an answer. In Chrome devtools?
  • CW: Unknown
  • GM: Open source library, maybe do something
  • CW: Likely at least an extension
  • KR: Devtools are a flavour of Chrome extensions. Idea is to make this for the web ecosystem instead of browser-specific
  • GM: Most platforms agree on extensions so part we need could be everywhere, maybe more in dev tools but that would be an extension
  • KG: Have a version of this for canvas. Would like WebGL and canvas. Making it work for canvas was fiddly but doable. Totally possible.
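
A minimal sketch of the prototype-hooking capture GM describes. All names and shapes here are illustrative assumptions, not the actual webgpu-debugger API; the WebGPU globals are assumed to come from @webgpu/types.

    // Hypothetical sketch (not the real webgpu-debugger): wrap WebGPU entry
    // points on the prototypes to record a JSON-able trace.
    type TraceEntry = { method: string; args: unknown[] };
    const trace: TraceEntry[] = [];

    function hook(proto: object, method: string): void {
      const original = (proto as any)[method] as (...a: unknown[]) => unknown;
      (proto as any)[method] = function (this: unknown, ...args: unknown[]) {
        trace.push({ method, args }); // a real capture also serializes resource contents
        return original.apply(this, args);
      };
    }

    // A real library hooks the whole API surface; three examples:
    hook(GPUDevice.prototype, "createBuffer");
    hook(GPUDevice.prototype, "createRenderPipeline");
    hook(GPUQueue.prototype, "submit");

    // Later, JSON.stringify(trace) yields the (enormous) JSON trace GM mentions.
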
  • BJ: <Presenting demo> A few more features worth going over: visualizations of things like textures. Mouse over to get what all of the individual values are; can look at what all the textures are, all the buffers; if multisampled, can get each of the samples.
  • Group: Ohhhhh
  • BJ: Appreciate that. If you have a cube map, can view it as a cube map. See it as a skybox and rotate around. Even for non-cubemaps, can scrub through mip levels. Can see actual size. If you pick out a draw call, or with pipelines (a bit more difficult), there's no good way to infer semantically what a buffer binding is. Have to come in and wire up what each of the bindings means. Can say location 1 is texcoords. Can get a preview of the geometry for the draw call, rotate it, shift meanings, get previews of data. In the future maybe pulling out for debugging purposes. Lots of interesting ways to visualize the data in order to help understand the pipeline.
  • KR: What's the perf impact of instrumenting but not capturing?
  • CW: Zero clue. Fairly lightweight, but still overhead.
  • BJ: The library is hooked but not capturing and it's showing OK. When capturing it takes a long while to generate a single frame.
  • RC: Is this as the site is running
  • BJ: Yea, pulling from the site's buffer data (creates a copy), and if you give it some semantic info it can give more on what happens
  • RC: Curious if that was the debugger moving things
  • BJ: Yes. Will have camera controls
  • MM: Had list of texcoord, position, are those hard coded?
  • BJ: Yes. Common things we can visualize easily.
  • KG: General model viewer shader
  • BJ: Generic plug something in and it will try to visualize using given streams as semantic.
  • CW: Any questions? Moving on to Coverage by KN.

Code Coverage

  • KN: <Showing coverage reports> Showing stuff BC did. Ben has tooling from SwiftShader for capturing not only code coverage but which test hits each line of code. Can look at the D3D12 backend and see the coverage, then looking at a line of code can see which tests hit that line. It basically runs every test individually, resets coverage in between, and compresses into a single coverage report that records the test prefixes that hit each line. Using this to go through each backend/frontend to see what's missing, what tests aren't hitting. Been great for filing needed tests.
  • CW: This is running through dawn node?
  • KN: Yes, that's not chromium, but dawn.node. Doesn't test blink code but tests backend and frontend validation
  • MM: in-process?
  • KN: yes. Also added our tests to the code coverage builder for Chrome. Can see coverage for all of Chrome, including the Dawn and Blink code. This one does not capture which test hits each line, but can see what's missing coverage. Similarly for Blink we have coverage for all the frontend JavaScript. Like GPUCompilationInfo, not hit. :) Provides coverage by line, and by branch. Since lots of branches can be in one line, can see where coverage is missing more easily. We don't know which tests hit these, but we know how often each was hit. Failure cases, for example, are missing test coverage in many places.
  • KN: Right now we're running Win and Mac Dawn code coverage. Mostly targeted at CTS tests, implementing them to cover all of the code. Mac stuff not working for Blink right now. Ben's coverage dashboard is great - wish we could have that for all of Chrome.
  • MM: relationship between these numbers and when you're planning to ship?
  • KN: only that it's telling us how to prioritize tests we need to and want to write. Not likely to impact shipping schedule.
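
A toy sketch of the aggregation KN describes: run each test with coverage reset in between, then merge into one report mapping each line to the test prefixes that hit it. The data shapes are assumptions, not the real SwiftShader-derived tooling.

    // Toy sketch: merge per-test line coverage into line -> tests that hit it.
    type PerTestCoverage = Map<string, Set<number>>; // test prefix -> lines hit

    function mergeCoverage(runs: PerTestCoverage): Map<number, string[]> {
      const report = new Map<number, string[]>();
      for (const [testPrefix, lines] of runs) {
        for (const line of lines) {
          const hits = report.get(line) ?? [];
          hits.push(testPrefix);
          report.set(line, hits);
        }
      }
      return report; // any line absent from the report was never executed
    }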

WGSL Debugger

  • JP: <presenting slides> Wrote a WGSL interpreter for fun. Turned into a useful thing. It's in Tint, makes use of Tint's core libraries. Parses to the Tint AST and then walks the AST to emulate execution of the shader. One of the cool things is it re-uses all of the existing Tint logic for const-eval; don't have to do implementations for all the math and binary ops. There is a new Dawn backend called "emulator" which is pretty lightweight: it instantiates the interpreter and executes the shader for dispatchWorkgroups. This is a CTS demo: a WGSL shader and a bit of API stuff to run it and check results. Shader takes an input array; each invocation loops over and sums values and writes back out. In Dawn can run the CTS from the command line, and have dawn.node bindings to run CTS tests outside the browser. Run test, see passes. Changing the shader to go out of bounds, running on the Metal GPU, it fails and you just get output values. Switching to the debugger, run the test again and get useful information showing the out-of-bounds memory access. Out-of-bounds load/store shows the line of WGSL, the invocation, the workgroup and which allocation. A 1k allocation accessed at a 1k offset: out of bounds. That's the first and simplest thing. Since it's an interpreter, can make it interactive with an lldb-style shader debugger. Step through statements and see things like in a command-line debugger. Can print values, set breakpoints. Switch to arbitrary workgroups. When it hits the out-of-bounds access, stops where it hit that code and can debug with context. A few other things: same shader with some silly math. Running this one, shows it tried to call sqrt with a value that was invalid, so may have got an indeterminate value. Will do that for all the rules in the WGSL spec for const-eval. Right-shift of an integer causing a sign change. Demo 3: same shader but this time optimized to use workgroup storage. Doing the same loop, but a tile at a time: pulling in a tile of data to workgroup storage, looping over the tile from workgroup storage. Running on Metal it failed; sometimes it passes. Non-deterministic failure, failing differently each time. Typical cause is a data race. Wrote a data race detector…
  • JB: As one does …
  • JP: Takes longer, but identified the data race; shows the 2 parts involved: one invocation trying to store, and a different invocation on a different line trying to read from the same scratch location. Reason: it's happening in a loop; there's a barrier between the write and the read, but not between the read and the write on the next iteration. Need a second barrier before the write.
  • JP: summarizing. If we didn't have uniformity analysis in WGSL - have dynamic uniformity analysis, and it has zero false positives because it’s actually executing the code.
  • JP: limitations: interpreter, so it's slow. Could make it faster. Still several orders of magnitude slower than hardware. Compute shaders only. Graphics would involve writing a software rasterizer. Supports all CS primitives aside from textures and samplers. User flow - command line now, using dawn.node. Could integrate with trace+replay tool instead. Could also add DAP support to integrate with other debugger UIs.
  • KG: so cool!
  • <general agreement>
  • JB: data race detection similar to TSAN's? Data for each word about the last sync?
  • JP: sort of. Collects every memory access. Sorts them later, looks for conflicts. Very intensive.
  • MM: do you plan to release this and how?
  • JP: it's in Dawn/Tint, those are already open-sourced. Aiming to open-source this.
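
A sketch of the post-hoc detection JP outlines: log every access, then look for pairs touching the same location from different invocations with at least one write and no barrier in between. The barrier-epoch bookkeeping below is an assumption about how that ordering could be tracked; the real detector lives in Tint.

    // Sketch only. Conflicts = same address, different invocations, at least
    // one write, and the same barrier interval (so nothing orders the pair).
    interface Access {
      address: number;    // byte offset within a workgroup-storage allocation
      invocation: number; // local invocation index
      epoch: number;      // number of workgroupBarrier()s this invocation has passed
      isWrite: boolean;
    }

    function findRaces(accesses: Access[]): [Access, Access][] {
      const races: [Access, Access][] = [];
      const byAddress = new Map<number, Access[]>();
      for (const a of accesses) {
        const seen = byAddress.get(a.address) ?? [];
        byAddress.set(a.address, seen);
        for (const b of seen) {
          if (b.invocation !== a.invocation && b.epoch === a.epoch &&
              (a.isWrite || b.isWrite)) {
            races.push([b, a]); // unordered conflicting pair: a data race
          }
        }
        seen.push(a);
      }
      return races; // quadratic per address, hence "very intensive"
    }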

Earth

  • LK: Similar to other debugger from sprint, bringing back up Earth
  • LK: wanted to get Earth running on a WebGPU backend. Had it working for a couple of years, but wanted it "ready": if we can land WebGPU V1 we wanted something larger to show what WebGPU can do.
  • <slight issue presenting this>
  • LK: during exploration sprint, got a multithreaded version working to some degree!
  • LK: shortly after V1 we hope to show off what WebGPU can do esp. compared to WebGL!
  • RC: when it runs on WebGPU is it better than WebGL?
  • LK: can't directly compare metrics. It's a port - still needs to convert shaders in backend. Uses SPIR-V - not native WGSL. ATM maybe not faster. Eventually should be. Also some caching drawbacks - WebGL backend has better caching right now.
  • KN: the "large application" I talked about in the multithreading presentation a few months ago was Earth. Substantial perf improvements from WebGPU multithreading. Earth's architecture can take advantage of it.
  • LK: demo working! Can use 3D view, see buildings, etc. Not too many FPS drops on average. (~59fps) Single-threaded version, no optimizations.
  • MM: did someone have to rewrite Earth's backend to use WebGPU?
  • LK: yes, rewritten backend.
  • KR: Loko also got nightly builds running so it won't break.
  • LK: fun fact, it is in fact broken. :) It's using the SPIRV ingestion path, and that was just removed.

WebGPU on Mobile

  • BJ: If you have an Android device - install Chrome Canary, go to about:flags, turn on WebGPU, try it! Let us know if it blows up. Passing around demo phone too. Showing Austin's samples, Metaballs demo.
  • CW: it's what you'd expect. Same sample, just on a phone.
  • MM: Interesting because it’s running over Vulkan, but your initial ship targets are Mac and Windows?
  • BJ: not going to ship this at the same time as we ship on Mac and Windows and ChromeOS. Little more work needed on this. Works pretty well for a first light. Some frame pacing issues and other tuning.
  • KN: That’s running on an Arm GPU.
  • KR: Metaballs demo is generating geometry from compute shader. Any issues related to the GPU being a tiler?
  • BJ: Didn’t make any changes. The demo is more written assuming desktop; not tailored to tiling. Didn’t see any problems related to being a tiler. Used atomics for synchronizing output buffer access, which avoids tile-boundary artifacts.

WebGPU/Compat

  • SW: subset of WebGPU. Targeting D3D11, OpenGL ES. Prototype works in Chrome Canary: --use-webgpu-adapter=compat
  • SW: Uses Dawn's OpenGL ES backend. Using ANGLE to translate that to D3D11.
  • SW: many demos work, some don't. Will cover this more later today. Rendering works. Particles doesn't work. Game of Life works.
  • SW: it uses ES 3.1 Compute. Compute+Graphics working together.

WGSL V1 Items

  • (See above agenda items and merge these together)
  • KG: FYI:
  • DN: agreed to land template list disambiguation. FYI, we did. Queued up ~4 follow-ons. PR out for uniformity, been fixed a few times.
  • DN: The range diagnostics PR. Last detail is the analysis being sensitive to the pointer vs the memory the pointer refers to. Minor technical fixup. 4 things about uniformity, ready to land, just need non-Googler review.
  • KG: Stamp from someone for range diagnostics
  • JB: want to look over a bit more. Sounded like things we noticed early on have been addressed.
  • DN: Gregg's keywords / identifiers: similar to template disambiguation. Previously tried to make types not keywords; but what about things that become keywords when adding template lists. Make all types keywords again? Want to have the discussion again. Slide deck later.
  • OD: provided opinion on 3520, attribute syntax
  • KG: DN, ready to talk 3819?
  • DN: prefer to talk about keywords / types.

David's slides on Template Syntax Followups

slides

  • #3803 Lookahead disambiguation V2
  • (Question about "type generators" vs. "type constructors", the latter of which should really be "value constructors")
  • KR: In the example where you redefined rgba8unorm. Can/do we have negative tests that say this must not compile in any implementation?
    • DS: No. If we do this then probably won’t add any
    • JB: Used to be ambiguous to have an “expression or type” grammar production. This eliminates this and so we can go to town on taking advantage of it.
    • JB: Identifier and type are still ambiguous (DN: yes).
  • DN: can already make a structure-type value, with parens. Ambiguous with user-defined functions. Problem in Tint: different kinds of AST nodes; the parser had to do symbol lookup to generate the right AST node. Disappeared now.
  • BC: this removed thousands of lines of code.
  • JB: nice. Concern was having extra pass. But if we get payoff for it lower in the impl, makes me feel better.
  • KG: desired next steps?
  • DN: Ben prototyped the Treesitter and grammar changes. Having second pair of eyes on these. Types / vec2 array type generators, enums having the properties I said, etc. 4 steps.
  • DN: 1, 2, 3, 4 - un-keywordize a bunch of things.
  • JB: Treesitter thing - some "horizon" - 1024 tokens - because it's trying to be incremental?
  • DN: Treesitter's context-free. You inject a magic token into Treesitter. Goes into a custom scanner. If it's my disambiguation, you can save 1024 bytes of state. Can fit these in bit vectors. The state has to persist for when you see these later. When you see a token with one of these, might need disambiguation. Artifact of how the Treesitter API works.
  • JB: interacts with incremental aspect of Treesitter? Will this mis-compile large programs?
  • DN: could do this the way in the spec - second parser, go over whole program. Would make it fine.
  • BC: Compiler using the fixed [horizon] might fail the parse of a complex program.
  • KG: any reason to keep things keywords? I like it when things are just identifiers.
  • JB: think we'll be happy to have done this.
  • BC: bitcast would be first builtin with template type. Opens the door to explicitly specify more things. min/max, inferring T type - can pass in explicit i32, for example. Post-V1, but nice thing.
  • MM: think we should get rid of the keywords.
  • JP: on enums - access mode etc. - use them in type contexts, but also in var declarations. Will we handle this the same way? Shadowing cases? Use it in a var, in Tint, it's different - should we unify this?
  • JB: think it should be consistent. Think things should work the same.
  • BC: thought "var alias" - could be used for more than just types. Maybe for texture formats, too. Useful.
  • MM: Could alias w = workgroup; then declare something using var<w>
  • BC: Yes. Have not discussed this with anyone.
  • JB: Would be many “kinds” of aliases.
  • BC: current spec doesn't rule it out. Tint has validation to prevent this. This week, noticed a mismatch. Think we should rule it out. Probably a cause of bugs.
  • MM: context: right now illegal to write variable name + semicolon. Can write ident + semicolon - function call. If ident is type name - that has no side effects, so should be illegal.
  • BC: discussing this AM. Unification we're talking about - type initializers becoming builtins. Why should these be special? Propose a new attribute like C++'s nodiscard. Tag functions, don't evaluate this thing and throw it away. min/max, tag with this attribute, same for type initializers and converters. Used to have rule - every function call, have to use the return value. People complained. Atomic add, don't want to have to use the return value.
  • KG: do we have a method for discarding something?
  • BC: phony assignment.
  • AB: feels like we're going back on a decision we made. Don't think we're getting enough out of this to bother.
  • KG: think this'd be good warning to have. Attribute still useful - you should not discard this. On the fence about making it a hard error.
  • AB: in that case - leave it to the impl to add a diagnostic. Then can turn on diagnostics as error.
  • MM: aren't there 3 options here? 1) if function returns something, you don't consume it - no error. 2) opposite - you call it, don't use the result - hard error. 3) some functions will cause errors for unused return values, some won't.
  • AB: think the default should be no error. Up to impl to implement a diagnostic to turn it into an error if the user wants.
  • BC: difficulty - no easy translation for the backend. Type constructor that's an array - have to work to pull out things from array constructor. Then throw away the array constructor.
  • AB: Or you can introduce a fake name in the target language
  • BC: you could.
  • MM: thought we were discussing whether some constructions in WGSL cause errors or not. Why talking about code generation?
  • DS: If you allow calling i32() then when we codegen we have a problem (cannot do array construction without assigning to something)
  • KG: the spec's "as-if". As if you declared an array and didn't do anything.
  • MM: get rid of the whole call.
  • KN: but keep side effects that were in them.
  • KG: do we have rough consensus to add nodiscard attribute. Should hinge the diagnostic off that. atomicAdd shouldn't warn if you don't use the result. Contention over hard errors. Inclined to say, leave it as a diagnostic.
  • MM: do we have consensus for nodiscard?
  • MM: would it make it in V1? It’s a significant feature.
  • KG: we can add attributes and do nothing with them.
  • MM: "free" what does that mean?
  • KG: would we accept a PR that added @nodiscard to all of our builtins? Spec as builtin might not say it does anything with that yet. Could say “implementations should produce a warning”
  • MM: want whatever is done here to be systems-driven. Not just constructors because they’re special. Want either no restriction or a nodiscard attribute.
  • BC: ask if we add a diagnostic that we not do it for V1.
  • KG: concern is implementation work? that you’re shipping
  • BC: bunch of code in tint that assumes type constructors are not statements.
  • JB: for our implementation it doesn’t matter. But would save you a lot of implementation work for v1.
  • EG: something that can start out as an error and turn into a diagnostic later?
  • KG: don’t love it but it’s a shippable thing
  • AB: feel we’re reversing the decision that we had to use the phony _ = assignment everywhere.
  • BC: not same, this affects things for which there’s no point at all in calling without saving the result
  • BC: that decision was made because ……
  • JB: nothing wrong with reversing decision as long as you don't forget considerations from last time.
  • KG: what about making it an error for now, for V1? Tolerable?
  • JB: attribute later to soften it? Seems fine.
  • MM: so "discard" instead of "nodiscard"?
  • KG: mark things nodiscard. We can make it an error at first, warnings later.
  • MM: what about user functions?
  • KG: think we should be able to mark them nodiscard.
  • BC: we would annotate the builtins, and add the language attribute for users
  • KG: language attribute, use it for our builtins.
  • BC: can user type "nodiscard"? Only an hour's work. Fine.
  • MM: what's the default?
  • BC: default's as is right now. Function returns value, ignore it, that's fine.
  • MM: can we reverse the default?
  • BC: would break a lot of shaders
  • KG: even ignoring that I would prefer discardable to be the default
  • (...)
  • DN: wrt naming, I like must_use more than nodiscard.
  • BC/MM: we can’t use “discard” because we already have something called that.
  • EG: rust community has interesting history and usage reports about #[must_use]
  • JP: does it go on the function or on the return type?
  • KG: I think it's function-level, not return-type thing.
  • MM: goes next to the workgroup size declaration?
  • JB: yes, or "vertex" or "fragment".
  • DN: and we'll add this to builtin functions that don't have side effects?
  • KG: yes.
  • AB: texture sampling? Tied up with type constructors too. Right now they're not functions.
  • KG: think an easy way to figure out what should and shouldn't be [must_use]: part of this decision is so Tint can ship, so what does Tint effectively treat as must_use now?
  • AB: if you define the attribute as only applying to functions - where do we put that annotation? Don't think you should introduce this without moving. (?)
  • KG: to me type constructors are builtin functions.
  • MM: agree. Just named the same way.
  • JB: you saying we have to apply this attribute to types too? Type constructors always side-effect-free and meaningless.
  • AB: for now yes. Spec is set up so - we should do issue #3823 first as approved. Do it at the same time.
  • KN: in a future where type constructors aren’t side effect free, it’s because you’re able to write constructors for them. So can add the attribute to the constructor function if desired.
  • KN: in Rust, mustuse can go on functions and types. On types - if you return that type from a function, the function implicitly becomes mustuse (must use its return value). Return a Result<> for error handling - Result<> has mustuse. So the caller can't accidentally throw away an error.
    • JB: it's a crucial move that Rust made.
    • KN: should reserve the idea of putting mustuse on types.
  • MM: still have Q's about the default. Lot of code that doesn't have annotations. We haven't shipped so not concerned about that. Why not flip the default?
  • BC: reason we got rid of the error was users complained that calling functions and discarding results was an error, and was annoying.
  • KG: think there's momentum for discardable functions by default, and adding nodiscard yourself. If I were designing the language from scratch, I would not make the default "mustuse". Tolerable to go in this direction?
  • MM: we're satisfied with that.
  • <consensus>
  • KG: as Alan mentioned - consensus toward 3823. Type constructors as builtins, can tag them with mustuse.
  • KG: will come back to this post-V1. Consider warning instead of hard error.
  • JB: is it nicer in the spec to treat type constructors as builtin functions?
  • BC: maybe not in spec. But the future of these becomes clearer if we want to add user overloads. Consider everything that can take templates, easier than splitting spec into distinct things.
  • JB: so, spec'ed as - when you define a type, this introduces a builtin function like so?
  • AB: almost same stuff we have for type constructor expressions. Move into builtin functions chapter.
  • JB: so, builtin functions i32, etc.
  • MM: that'd be great, just like bitcast.
  • AB: already have… no worse than structs & type aliases. Had issues filed about this in the past. Why no const on this? etc.
  • JB: so when you define them as builtin functions you just make them const in the usual way. I see.
  • KG: with type aliases to preserve that - type alias = i32, and want to construct it?
  • BC: another topic to discuss.
  • Resolution:
    • Turn everything into builtins
    • Add mustuse annotations. They'll match what's in Tint.
    • mustuse is a requirement. Hard error if you don't consume the result.
  • That'll be V1.
  • JB: easier to remove mustuse in the future. Be liberal in what you add it to. Judgment call though.
  • AM: when C++ committee added nodiscard - they felt they made the default wrong. Would have wanted all functions that return something to be nodiscard. And have added "discard" to things you want.
  • MM: that's our preference too.
  • KR: but there was user feedback saying that that was too annoying.
  • <discussion>
  • KG: I think the Rust solution of putting it on types is nice. But do I want a compiler function so I have to annotate every single thing?
  • AM: can you share the user research on this in Rust?
  • EG: I can compile some user reports and present that.
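
To make the resolution concrete, a sketch of how the attribute might read in WGSL, using DN's preferred must_use spelling. Illustrative only; the exact syntax was not pinned down in this discussion.

    // Illustrative WGSL embedded in a TypeScript string; spelling not final.
    const shader = /* wgsl */ `
      @must_use
      fn length_squared(v: vec3<f32>) -> f32 {
        return dot(v, v);
      }

      fn caller() {
        let ok = length_squared(vec3<f32>(1.0, 2.0, 3.0)); // result consumed: fine
        _ = length_squared(vec3<f32>(1.0, 2.0, 3.0));      // phony assignment: fine
        // length_squared(vec3<f32>(1.0, 2.0, 3.0));       // per the resolution: hard error
      }
    `;
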
  • OD: right now - attributes defined in a table. Syntax - no general definition. Simply specialize to every attribute in the table.
    • Likely to get more attributes in the future.
    • Prone to bloating syntax, and combinatorial explosion.
  • OD: think we can generalize attribute definition. One of the last few places where spec syntax will have to change. Others: extension names, identifiers. Think we should remove them from the syntax; it's an impl detail how these are handled. (are notes correct?)
  • KG: suggesting removing attributes from the grammar?
  • AB: having one general rule.
  • AB: think this is OK to do, now that everything's an identifier. Was complex when things were context-dependent names.
  • <discussion>
  • DN: 3 cases we didn't discuss before. Probably necessary to make this change.
  • MM: is this a behavior change?
  • EG: it's editorial.
  • AB: not a behavior change.
  • DN: except if they become identifiers, then you can shadow them.
  • KG: does anyone not want to make them identifiers?
  • JB: kind of nice to have them namespaced. Don't have to worry about interpolation styles interfering with variable names.
  • KG: kind of a problem if we made "size" an identifier.
  • KG: "size", "id" would be problems.
  • AB: these can only be problematic if in structure declaration.
  • MM: if we add push constants, adding a global called "size" will be relatively common.
  • KG: what's that for?
  • AB / MM: size of a member in a struct.
  • JB: don't think this proposal will simplify the spec. Have to declare the rules somehow.
  • DS: we have this already. Separate table. Already duplicated.
  • KN: attributes as we've seen them so far - don't know how you'd make a user-defined one. Can alias, I guess. Weirder than an alias for a builtin type.
  • KN: Think - put something here not called "identifier"? And make one grammar for attributes. Put "attribute name" after attribute.
  • MM: exactly what I was thinking.
  • DS: @compute with brackets? Can't do that now.
  • MM: reminds me of CSS variables - just token sequence, brackets have to match, that's all. Could say attribute is "@" + string of stuff with matching brackets.
  • DN: I'm more on board with parameters being identifier-like things. Don't think there's value in making the attribute names overridable.
  • BC/KN: +1
  • AB: … doesn't have to resolve to the same thing in the grammar rule.
  • JP: usually identifiers can't be the same as keywords.
  • AB: using "identifier" to get the point across - a name-like thing.
  • MM: do we want to collapse all these rules together? Collapse all of these together?
  • KN: I'm suggesting ‘@’ attribute_identifier.
  • AB: attribute identifier can look just like an identifier.
  • DS: makes grammar consistent between all attributes.
  • MM: does that solve the original intention?
  • OD: one general grammar rule, yes, and define it in the table.
  • AB: summary:
    Attribute = '@' attribute_name attribute_params?
    attribute_params = '(' (expression ','?)+ ')'
  • JP: attribute parameters are shadowable?
  • DN: yes.
  • MM: can't make a global variable "centroid" without shadowing the builtin “centroid”
  • JB: this seems fine. I had thought the attribute table at the start of the spec was leaning more on the structure provided by the grammar.
  • KN: depends on the other change to make more of these enums identifiers.
  • KG: we're resolved on that now.
  • KG: consensus: go ahead with the PR.
  • OD: thanks. Think this is important, allows syntactic sugar.
  • OD: similarly - extension names are specialized now, but they'll become general.
  • DN: that's not part of it. Right now they're not resolved as identifiers. Think that's important going forward.
  • AB: didn't discuss named attribute parameters. That'd be an extension to the grammar.
  • MM: interesting proposal - think none of the other shading languages have named parameters.
  • AB: GLSL almost has it in its layout specificers.
  • BC: MSL has it. texture_size (?)
  • MM: in Metal - you have a shared header between CPU and GPU code. #define MyThing 7. Identified by an integer, but a workaround outside the language.
  • MM: named parameters are interesting, but not right now.
  • OD: didn’t mean to propose it [just a point in favor of the general grammar for future proofing]
  • KG: can be in whatever repo we want.
  • OD: I'll set up a repo and automation to push it.
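
An illustration of the generalized rule: every attribute parses as '@', a name, and an optional expression list. The WGSL below uses only attributes already in the table; nothing here is new syntax.

    // Illustrative WGSL: one rule covers parameterless attributes, expression
    // parameters, and name-like (shadowable) enum parameters such as 'centroid'.
    const shader = /* wgsl */ `
      struct Params {
        @size(16) scale : f32,    // expression parameter
        @align(8) offset : vec2<f32>,
      }

      @fragment                    // no parameter list
      fn fs(@location(0) @interpolate(perspective, centroid) uv : vec2<f32>)
          -> @location(0) vec4<f32> {
        return vec4<f32>(uv, 0.0, 1.0);
      }
    `;
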
  • KG: from the question about language evolution. What if you don't have restrictions on pointers any more? How do you test that?
  • KG: Ben had a proposal last office hours.
  • BC: talked with colleagues - not all unanimous. Recap: we originally pitched versions. WGSL 1.0 is what we agreed to. 1.1 - unrestricted pointer parameters. We'd proposed browsers have to implement everything in a version to advertise that version. Earlier versions - do not implement features from future versions. Bit contentious. To find middle ground, proposed feature names. "unrestricted_pointer_parameters". Browser can report it supports it. Discussed idea of WGSL versions, but a version's a collection of features bundled together. 1.1 to be unrestricted pointer parameters + a bunch of other things.
  • BC: there's a danger in this - a browser could sit at 1.0 and cherry-pick forever. Many of us feel that's a disservice to our users. Good for browsers to go up in versions, don't need as many feature flags.
  • BC: idea of letting browsers cherry-pick things if they want to, but have emphasis that browsers implement highest version they can, is the best compromise so far.
  • KG: it's also something we can test. If you have this bucket of features, we can say you have all these available but only advertise the old language version - that's a bug.
  • BC: builtin could be replaced by "requires" statement with version or list of feature flags. If browser doesn't support that, generates an error. "requires" doesn't change compiler behavior. Compilers don't have the 2^n permutation issue. They just report what they implement.
  • MM: implementing features one-by-one isn't desirable either. Reason I wrote this - not to subvert the group. We want to move forward with everybody. Proposal of naming all features satisfies my list of desires. Will authors list 27 different things in their source code? If standards group is diligent about gathering up features and putting a version number on them, that'll be OK.
  • MM: if we list feature names one-by-one - this is not in the rest of the web platform. We might as well do that for the web author. If you want to feature detect this, here's how to do it. In sum, generally positive.
  • DS: other side - API can advertise it supports certain features.
  • BC: Alan had concerns about spec being littered with features & feature dependencies. I don't have a great answer.
  • KG: this form being proposed is almost exactly the same form as extensions in OpenGL. Doesn't get out of hand - there's a 2^n risk, but the features are mostly orthogonal and don't interact with each other that much. The CTS will have the worst of the challenges. E.g. I have while loops but not do loops.
  • BC: few ways we'd try to prevent browsers having low base version but lots of features. A feature can require a base version.
  • MM: thumbs down.
  • KG: I understand the fear. But don't think we should worry about it. Take the normal approach of standardization. Talk together, say: these are the group of things we want to put in the next version. Think we're fine there.
  • MM: let's tackle that problem if it occurs.
  • BC: Yes. Was mostly for one feature overlapping another feature. If this feature depends on this set of other features -
  • MM: I'm OK with technical reasons for dependencies. Ray tracing depends on bindless. Does it make sense that we force browser X to implement feature Y?
  • KG: easier to define an extension when assuming presence of certain other extensions. Think ray tracing can require bindless, for example. If you need a subset of WGSL 1.1, you could say you require WGSL 1.1.
  • MM: testable, too. Can't support ray tracing if you don't have bindless.
  • RC: do we need an increasing version, or a skill tree?
  • KG: a numbered version …
  • MM: don't want to add a string type - function taking string returning bool.
  • BC: it'd rather be a requires directive.
  • MM: OK, sounds better.
  • BC: we were saying requires, then comma-separated list of things.
  • MM: similar to enable directive. Work the same way.
  • KG: different from "features" list?
  • DS: do you need a different directive? Just use "enables"?
  • DN: it's different. You shouldn't be able to use f16 if you haven't enabled the feature.
  • BC: requires is a tiny option.
  • CW: then we have navigator.doesSupportWGSLFeature?
  • BC: think you should just list the features.
  • MM: think this is just a FrozenArray on the device.
  • KN: think it should be global on navigator.gpu.
  • MM: / KN: these are all sugar.
  • CW: maybe tuples are harder to implement on one backend or another. ;)
  • KG: consensus: requires, something on the API side, enumeration of features.
  • RC: should authors have to add "requires" to their shaders?
  • KG: no. It's just if they want to make their shaders more robust.
  • KG: one tradeoff, you don't get the WebGL style where you have to enable everything.
  • BC: "requires" has a nice offline tooling story. Can ensure requires list matches the shader.
  • KG: WebGL had a nice story for this - group should understand they're choosing a different direction and there will be more porting pains.
  • RC: there are tradeoffs.
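
A sketch of the resolved shape: a requires directive in WGSL plus an API-side enumeration of supported language features. The directive spelling, the feature name, and the navigator.gpu property below are all placeholders; only the overall direction was agreed here.

    // Placeholder names throughout; only the shape was resolved above.
    const shader = /* wgsl */ `
      requires unrestricted_pointer_parameters;  // error if the implementation lacks it
      // ... shader code passing pointers to user functions freely ...
    `;

    function compileIfSupported(device: GPUDevice): GPUShaderModule | null {
      // Hypothetical enumeration of supported WGSL features (KN suggested navigator.gpu).
      const features = (navigator.gpu as any).wgslLanguageFeatures as
        ReadonlySet<string> | undefined;
      if (!features?.has("unrestricted_pointer_parameters")) return null;
      return device.createShaderModule({ code: shader });
    }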

Cross-API Topics

WebXR

  • https://github.com/immersive-web/WebXR-WebGPU-Binding
  • BJ: WebXR only handles collecting pose data of the hardware. Feed it to you along with parameters for how you should render. You use WebGL for your rendering. Performance sensitive. People want all the triangles they can get. Eager anticipation for WebGPU in Immersive Web circles.
  • BJ: can't just ship WebGPU and call it done. Interaction point between the APIs. Modeled after WebGL integration - "layers".
  • BJ: basics - go out and get an adapter. Ask up front the best adapter for WebXR / the headset. Multi-adapter system - device will be plugged into one of them. Want the fast path.
  • BJ: then get a device as usual. Create XR binding. Replaces CanvasRenderingContext for WebXR. Gives you textures you render into.
  • BJ: several different layer types. ProjectionLayer - covers entire vision. Render 2 different viewports, or more depending on HW configuration.
  • <see notes doc above for more>
  • BJ: lot of WebGL content rendered side-by-side in a single texture. Other way - have an array texture, each layer's associated with a viewport. WebGPU - want to get rid of the viewport. Go with layers. Hardware prefers this nowadays.
  • BJ: how do we tell developers to render into that efficiently? Today: bind one layer, do rendering, start new render pass, bind next layer, render again. RenderBundles make this not too terrible. Also likely a step down in efficiency from modern native techniques - multi-view rendering. Even from WebGL techniques - do all the binding, render once, switch viewports, render again - for each object in the scene.
  • BJ: not asking for decisions to be made. Worth thinking about multi-view extensions - they exist for Vulkan, D3D. Not sure about Metal. These can be invasive to the API.
  • BJ: worth investigating sooner rather than later.
  • BJ: in the doc we have an older variant of the CanvasRenderingContext replacement. Give a full list of the rendering formats we support, pick the best one. Now we have the guaranteed formats, we'll give you one value - this is the one we want you to pick. Maybe do that again. See if that runs afoul of any platform restrictions.
  • BJ: also in WebXR group - in WebGL, any time the API gives you a texture, it'll always be cleared. In context of swapchain - efficiency gains by letting user say, you don't need to clear every time (must do it the first time, obviously). Would that be efficient? What to do in WebGPU's context? ClearOp has overheads, so does LoadOp. Have SurfaceProvider not clear if it's already made that content safe previously? Or, update the WebGPU API to have more of what Vulkan has - DontCare? (Don't think we want to do that.) If we had something where texture came out of provider, not cleared every cycle - useful for CanvasRenderingContext as well.
  • KN: think we can do this already in the API - we do it for canvas. We can mark a texture as "containing unknown stuff". Use as renderpass with LoadOp=Clear, we don't clear it - we let the renderpass do it for us.
  • MM: doesn't this not work if order of recording != order of submission?
  • CW: no - submit is just "replay the commands".
  • KN: not critical here. Executing a command buffer: if the texture is in the DontCare state and the command buffer starts with a clear, don't need to clear before executing it. If it starts with a load or a copy, then clear before submitting the command buffer. Don't need to modify the command buffer; just slip in an extra command before it.
  • CW: only thing Dawn can do that your impl can't - transform the Load into a Clear. Probably have to issue a separate clear in WebKit's impl. We never added a warning about this.
  • MM: you're trying to get the perf of DontCare without having to rewrite.
  • CW: not exactly. Either we say "preserveDrawingBuffer" on XR APIs (portability issue, thumbs down), or textures always need clearing. User performs clear, unnecessary because they write the entire frame anyway.
  • MM: we'd see first command submitted, look at its LoadOp. If not clear, we'll insert a command buffer to do a clear.
  • CW: yes. Nonzero amount of work though. You're on a tiler, so better than clearing the texture.
  • KN: should be fast-ish.
  • MM: don't see this as a problem. User said clear, so we clear.
  • KN: if user said Load, and needs to be cleared lazily, you have to clear.
  • MM: clearing for us is cheaper than loading.
  • CW: even cheaper to leave tile cache uninitialized and assume you'll overwrite the whole thing. Say, don't load anything. Assume overwritten by skybox.
  • MM: violent agreement.
  • CW: what to do about this?
  • BJ: on Meta's WebGL impl, they're experimenting with explicit "don't care". WebGL doesn't force you to make the clear/load choice, so they can leave it uninitialized. No path for this in WebGPU right now. Ensuring clearing only once is good. Not clearing at all because of your app would be a meaningful perf improvement for some use cases, esp. on mobile hardware.
  • MM: so goal here is, I'm a dev and will write to every pixel on the screen, what load op should I use?
  • BJ: not necessarily tiler vs. discrete. Semantically, Clear assumes some overhead. Will be true on some percentage of devices. Don't want to invoke it unless benefitting my application.
  • CW: on mobile, not a problem - explicit clear's basically free. On dGPUs - they have higher performance. Also assume framebuffer compression, can assume lower mem traffic. Minor enough problem. Don't worry about it for now, prototype DontCare op later, see if it matters.
  • SW: in Skia/Dawn prototype, I prototyped a DontCare, and it didn't affect any benchmarks on desktop.
  • CW: what's our recommendation for rendering into multiple textures?
  • BJ: don't want to get too far in the weeds. Recommendation for efficiently rendering scenes - two slightly offset views of the same thing - probably something where we can wait and see where the problems are. Need to prototype things.
  • MM: reason we're discussing this in WebXR content because it cares more about performance?
  • BJ: that topic's brought up because Meta Quest 2 platform - their engineer was able to profile in their WebGL impl and see that clear was taking a significant portion of their frame time.
  • BC: I was on the Daydream team - clearing was a major bottleneck for studios deploying to mobile hardware.
  • MM: clearing's a perf bottleneck, why WebXR specifically?
  • BJ: WebXR is certainly perf sensitive, but they'd benefit all applications.
  • CW: feedback from this group on shape of WebXR integration?
  • BZ: Meta's using multiview, as well as fixed foveated rendering. I implemented this in WebGL. Can WebGPU do this?
  • BJ: great question. WebGL impl, TextureProducer, gets 0-1 level of foveation. Doing that here - fixed foveation, probably the same 0-1 scale in WebGPU. Whatever hardware can produce, let them do it. Hopefully when we get to this for WebGPU, can investigate true variable-rate shading. Fixed foveation was a half step. Would impact people beyond WebXR.
  • MM: does PSVR2 do fixed foveation or variable-rate shading?
  • BJ: they have foveated rendering + eye tracking.
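
A sketch of the per-layer rendering pattern BJ describes: one render pass per array layer, replaying a pre-recorded render bundle per view. The XR binding that would produce the texture comes from the proposal above; this function and its parameters are illustrative, not a proposed API.

    // Illustrative: render once per array layer (one per eye/view).
    function renderViews(
      device: GPUDevice,
      colorTexture: GPUTexture,     // array texture, one layer per view
      bundles: GPURenderBundle[],   // one bundle per view (per-view matrices inside)
    ) {
      const encoder = device.createCommandEncoder();
      for (let layer = 0; layer < bundles.length; ++layer) {
        const pass = encoder.beginRenderPass({
          colorAttachments: [{
            view: colorTexture.createView({
              dimension: "2d",
              baseArrayLayer: layer,
              arrayLayerCount: 1,
            }),
            loadOp: "clear",        // cf. the clear-cost discussion above
            clearValue: { r: 0, g: 0, b: 0, a: 1 },
            storeOp: "store",
          }],
        });
        pass.executeBundles([bundles[layer]]);
        pass.end();
      }
      device.queue.submit([encoder.finish()]);
    }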

WebGPU/Compat

  • Slides: https://docs.google.com/presentation/d/1-c-zyvqPm_g3NnFjhW_cxt1vJxYHYS5Gd6PfdqRkYZA/edit?usp=sharing
  • SW: why do we want this? Investigated how many devices can't support WebGPU in the field. Most of these numbers are public.
  • SW: how to reach those users?
  • SW: design goals: run as much WebGPU content as possible, including compute. Run with minimal changes. If a browser doesn't implement Compat, your app should still run. Need to target the intersection of D3D 11.0 and ES 3.1. There are nice things in 11.1 - storage textures in vertex shaders - but requiring it reduces the reach by something like half. Need to stick with 11.0. ES 3.2 cuts your users in half again.
  • SW: proposed IDL: a GPURequestAdapterOptions addition, and a readonly boolean on the GPUAdapter (see the sketch at the end of this section).
  • SW: what can't we do? Biggest - lack of texture views in GLES. Several examples.
  • SW: some unavailable functionality. Discussed.
  • SW: resource limits are lower. Size of uniform arrays is much lower than WebGPU's default.
  • SW: implemented prototype in Dawn/Chrome/Tint. Tint: native GLSL output. Got rid of SPIRV-cross dependency. Don't have native D3D11 Dawn backend yet. For now - Windows prototype uses ANGLE. Dawn calls GLES, Tint produces GLSL. Runs on Linux. 3 platforms we care about most are Windows, Android and ChromeOS.
  • SW: table of samples. Particles sample probably the hardest.
  • SW: interop with other stuff. Video uploading not working yet.
  • SW: Can run the CTS! See slides for details. On Windows - 23K passes on ANGLE/D3D11, and 28K on D3D12!
  • SW: See slides for Tint test results. Have bugs logged for the Tint tests. Lack of cube arrays, for example. Remarkable how much stuff works.
  • SW: Dawn end2end tests - ANGLE's passing as many tests as the native GLES drivers. Probably still work to be done on ANGLE side, though.
  • SW: proposal is - the IDL, and discussion of the various items.
  • BJ: Stephen reached out to me about one of my demos - shrinking an array that was too large. Anything you saw where we might need to advertise additional limits over what we have now, because it goes down in compat mode?
  • SW: yes. Would like another additional box in the spec. In compat mode, this is unavailable. Or, these resource limits change.
  • BJ: are all the limits that would drop in compat mode, are they expressed in the current limits?
  • SW: good question - don't know.
  • KN: if we have new limits I think we don't need to add them to the core API. Return a new supported limits structure containing new fields, for Compat?
  • KG: got some similar numbers on Mozilla's side to the ones you presented. ~65% of Windows users have D3D12. Your numbers sound right.
  • SW: sounds good. Our 34% also has D3D9 etc users, so addressable users are probably ½ what I presented.
  • KG: 95-96% of Firefox users have D3D 11.0. 11.1 is under 80%, like 78%.
  • SW: our numbers came from Chrome's internal reporting stats. People having blocklisted drivers, for example.
  • KG: our source doesn't include blocklist. Their supported GPU generation is of this type. Yes, actual results will be lower.
  • CW: explicit request: we really want this because it will make WebGPU reach all the machines WebGL 2.0 runs on. Want compat mode to be optional to implement, and some meetings to discuss compat mode constraints.
  • AB: on WGSL side, if we can have callouts in the spec for extra restrictions - and ideally an agreed-upon diagnostic - would be useful. Give me an error if I'm going out of compat mode. If not everyone implemented it, at least the people who wrote compatibility mode.
  • BC: if feature thing we discussed earlier was an agreement - could have a core thing and WGSL 1 thing for things that aren't in Compat.
  • CW: if we get Compat mode boxes in the spec - probably questions about how this is surfaced. Probably for another meeting.
  • BZ: expected perf characteristics of these devices? If WebGPU has certain QoS, some back-of-envelope testing for how this will work? And user expectation. What are these devices for?
  • RC: why does this have to be a mode?
  • CW: there's additional validation we don't want in the main spec.
  • MM: if we were more willing to recompile late, some of these restrictions would go away. How many hoops to implement WebGPU on GL?
  • CW: if you recompile all the time it's a disservice to the user.
  • KR: Looking at details, there are some WebGPU features that can’t reasonably be implemented on GL. E.g. if you need to insert expensive copies all over the place, that’s unacceptable. Want to keep developers on the straight and narrow path. Many of the GPUs that run GL but not the full WebGPU feature set are quite capable in many respects. And we could bring in a large number of users who have those.
  • CW: there are GPUs that support WebGPU/Compat that are faster than those that support full WebGPU, too.
  • SW: even old Mac that runs Metal might not run WebGPU well.
  • RC: D3D12 also supports feature level 11.0. Newest of the new API on older machines. In practice, will people be better off using WebGL? It’s beneficial to developers to not have to support two codepaths. Let them use WebGPU, and get lower limits.
  • SW: there are things like texture views - performance cliff for these users. All they need is a hint - that's one IDL thing I didn't talk about, up for group discussion. If we know how the texture will be used - would avoid expensive workarounds.
  • MW: impact of setting this on D3D12?
  • SW: we expect browser vendors will implement the validation on top of D3D12. Doesn't force you to use D3D11.
  • MM: you're proposing two WebGPUs then? We have to do the validation on devices that don't need it?
  • KG: yes, but - we do this for WebGL 1.0 and 2.0. It's harder than this, and it's fine. It's not a 2^n issue.
  • CW: browsers can choose to not implement this. But if they do, you'll get the API subset.
  • Discussion about this.
  • MM: sounds like not much of an impl burden. Potential fragmentation burden, worrying. Also - these numbers are 0 users on Apple hardware. Makes us not interested in this. We would like to not resolve on a decision today, but have more discussions internally. Maybe we can come up with a compromise.
  • KN: don't think we'll use it on Metal, but there are some really broken Metal drivers. 🙂
  • KG: also we support much older hardware than Apple does. Makes it easier for us.
  • CW: understood we wouldn't agree on a specific thing this meeting. Can come back to this. At least on our side - we have a huge interest in this. Many millions of users.
  • MM: rate at which that number's changing?
  • SW: not as fast as we'd like. :)
  • CW: really sad how slowly it's going down.
  • KG: we have a gap of 34-35%, going down at 1-2%.
  • KR: would love any initial feedback from Apple. Topic near and dear to Firefox, Microsoft, Google.
  • KG: not really near and dear to Firefox yet. :) Just provided some numbers as input.
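
A minimal sketch of how a page would opt into the proposed mode, assuming the `compatibilityMode` adapter option from Stephen's proposal; nothing here is in the spec:

// Hedged sketch: 'compatibilityMode' is the option name from the compat
// proposal under discussion, not a standardized field.
async function getCompatAdapter(): Promise<GPUAdapter | null> {
  return navigator.gpu.requestAdapter({
    compatibilityMode: true, // hypothetical opt-in to the validation subset
  } as any);
}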

API V1 Items

Tacit Resolution

  • Add feature for filterable *32float textures #3828
    • Not always filterable even though renderable.
    • Bikeshed on the name? float32filterable?
    • MM: any "filterable" in existing extensions?
    • KN: no, but have "renderable" etc.
    • KG: putting a warning at the top? "This isn't blendable?"
    • KN: just added a "blendable" column at the top.
  • Add maxFragmentCombinedOutputResources limit #3829
    • CW: implementing something we agreed upon earlier.
    • KN: this name's copied from Vulkan. Bikeshedding on the name. The issue arose because of PowerVR devices supporting Vulkan, where this is the key limit. Total number of things you output from a fragment shader. Includes storage buffers, storage textures, render attachments.
    • CW: depth/stencil?
    • MM: don't we have a limit for this? Per pixel?
    • CW: this is different - also the resources.
    • KN: this includes storage buffers & textures.
    • MM: what's the baseline?
    • CW: 8.
    • KN: pretty small. It’s things that can come out of a fragment shader.
    • KN: current values for these limits: 8 storage buffers, 4 storage textures, 8 render attachments.
    • CW: seems fine. Cases where'd you use storage buffers or storage textures in fragment shaders are pretty rare.
    • KG: seems fine.
    • KN: think it does include depth/stencil attachment. Will add text for how to check this limit.
    • MM: for Metal it'd be infinity. How to handle?
    • KN: PR would say, don't expose a value for this greater than maxStorageBuffers + maxStorageTextures + … + 1.
    • MM: OK.
    • KN: no effect on most devices.
    • KR: what was PowerVR's limit?
    • KN: 8.
    • Tacitly resolved.
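
For illustration, a sketch of the accounting the resolved limit implies, under the assumption that it surfaces as a `maxFragmentCombinedOutputResources` entry on the limits object (the name was still being bikeshedded; 8 is the baseline mentioned above):

// Hedged sketch: counts color attachments + fragment-stage storage buffers +
// storage textures against the proposed limit. The limit name may not exist
// on GPUSupportedLimits in a given implementation.
function fitsCombinedOutputLimit(
  device: GPUDevice,
  colorAttachments: number,
  fragStorageBuffers: number,
  fragStorageTextures: number,
): boolean {
  const limit =
    (device.limits as any).maxFragmentCombinedOutputResources ?? 8; // baseline 8
  return colorAttachments + fragStorageBuffers + fragStorageTextures <= limit;
}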

Who actually wants multiple timestampWrites with the same TimestampWrite.location? [#3808](https://github.com/gpuweb/gpuweb/issues/3808) [Myles]
  • CW: we could postpone this post-V1.
  • MM: is a change in shape next week too late?
  • CW: Chrome branches March 23, so not too late, but onerous. We'd probably ship without the feature and turn it on later. If we want to change the shape, probably OK. If you want to investigate implementing it differently…
  • MM: small investigation I need to do. Couple of impl options. Think this group's resolved on one of the existing options.
  • CW: OK, we won't ship this and wait for group to rework it.

Defaults for depthWriteEnabled and depthCompare may be surprising [#3798](https://github.com/gpuweb/gpuweb/issues/3798) [Brandon]
  • Also #3777
  • BJ: defaults, and whether we want to do anything at all.
  • BJ: what should be the drivers of these defaults? If all the native APIs have a particular default, we shouldn't question it. Make porting easier. Developers won't trip over it. Even if just historical momentum for why it's in place.
  • Spreadsheet
  • BJ: which face is front face? Didn't realize this was controversial.
  • BJ: think not worth trying to change things up in various situations. No default we can adhere to. frontFace, cullMode. D3D12 default is cullMode=back. Don't think we should adopt that. (If cull mode's wrong, you notice this quickly.)
  • BJ: pick reasonable default - user will notice right away.
  • BJ: now we come to the ones Myles brought up. depthWriteEnabled, and depthCompare. depthWriteEnabled is False on all APIs except D3D12 and OpenGL.
  • BJ: any interaction with depth buffer - have to include whole stanza about depth/stencil. You'll see visually whether you're getting what you want.
  • BJ: depthCompare - think "less" is probably what we want to do.
  • CW: concerned we're reconsidering defaults at this stage, for reasons I pointed out in the issue. My preferred solution - stop changing, whatever's in the API now is good. If we start talking about cullFace, cullMode - more changes that didn't trip anybody.
  • KG: not sure that's true. Maybe nobody's bashing down our door. But maybe the gap there isn't important.
  • JS: is this user-facing API expectations?
  • BJ / KN: entirely.
  • BJ: Myles did complain about it.
  • MM: I don't count.
  • MW: this is the first thing I ran into, depth writes getting set to 0.
  • KG: esp. if people are coming from something, like WebGL. Most people don't read the WebGPU spec, and won't know why things silently didn't work.
  • MM: 1) changing these in the impl is super cheap. 2) Mike found this just walking in the door. 3) once we set defaults, we're stuck with them.
  • CW: changing these in the impl isn't cheap, and the CTS is this group's responsibility. Even the part that's purely Chrome's responsibility isn't cheap to change - we'd have to update hundreds of tests because a default changed or was removed. Same is true for this group: we have to maintain the CTS and update hundreds of CTS tests. There's no uninitialized value in JS - "undefined" - so browsers' IDL validation would explode. Has a cost. It's valuable feedback - but we thought about these defaults a long time ago, and they haven't changed in years.
  • KG: disagreements about whether we thought about this deeply.
  • CW: we did. Understand Mike's feedback is coming now, but 0 user feedback in the past 2 years.
  • KG: frustration - trying to put together a spec. At some point, have to make the decision to cut the changes.
  • <discussion>
  • MW: are we proposing changing some of the other defaults?
  • BJ: think that's what we want to get to the bottom of. Part of the reason I did this spreadsheet - a lot of defaults we have right now that match up with the graphics ecosystem so don't try to relitigate. A few where we don't match up. Requested we talk about them. A few of them - required - some of these were discussed extensively among Kai, Dzmitry and I. Don't want to revisit them. And a couple we've had requests to discuss.
  • BJ: I think I sit somewhere between the two endpoints. Think it's a discussion coming in fairly late in the process. But I've seen that these kinds of issues that can be deferred again and again until it's too late to change them.
  • JB: emergent processes that ensure that when we're under the most pressure, the largest issues come up. :)
  • KG: anything in the "required" list, I'm not worried about authors messing up. Ignore all of those.
  • Reason we're re-discussing - Chrome doesn't want to do the previously discussed strategy of making it required, then making it optional again with the other default.
  • Resolved:
    • Make depthStencilAttachment.depthClearValue 1.
  • GT: I've run into this. Why not make the other ones required?
  • KG: we know when something's passed to us vs. not. We can generate warnings if you don't pass something to us. We can choose to have that warning on by default, or when DevTools is open. No breaking change needed. I hear you, super frustrating.
  • KR: what about the verbosity of having to specify these?
  • GT: you already have to specify all three for the state to be useful.
  • <break>

Bikeshedding names [#3804](https://github.com/gpuweb/gpuweb/pull/3804) [#3805](https://github.com/gpuweb/gpuweb/pull/3805) [Myles/Brandon/Kai?]
  • #3804
  • MM: we have a function, it has "create" semantics, but it's called "get".
  • KN: counterpoint: that's not what we intended "create" to mean. "create" creates a WebGPU object, not a JS object. Yes, it creates a JS object, nothing else matches it.
  • JS: don't we have another createBindGroupLayout?
  • KN: yes, on the device.
  • JS: other examples of similar names on different objects?
  • KN: yes, destroy().
  • MM: you can create it with a BGL or a set of BGLs. This thing will either return the BGL specified, or an autogenerated one. Doesn't actually create anything.
  • KG: leave it then?
  • MM: OK, let's close it.
  • #3805
  • MM: similar, creates new object from JS's perspective.
  • KG: does this IDL say "new object"?
  • BJ: don't think so, but we could say that.
  • DJ: why does the other one say "get"?
  • BJ: this is a function call rather than a getter. Add'l work done, getting back new result every time. If it was overhead-free - I'd probably drop the "get" from the other one.
  • CW: getBGL(), you pass it the BGL index. Has to be a function for this reason.
  • MM: if you wanted getBGL to be a property, would be a Promise<FrozenArray>. Not proposing that.
  • DJ: previous one, call it multiple times, you get back different objects?
  • BJ: Chrome's impl - multiple JS objects pointing to the same Dawn object.
  • MM: if you use an expando property, different on different return values. Deduplicating, you mess up the debug labels. Programming model is - each one has a label. From JS perspective, it's exactly create semantics.
  • DN: does Wasm have different creation semantics?
  • KG: everything's const through JS.
  • KN: in Dawn without wire, we give you the same pointer each time. With the wire, we might not know - give you a handle.
  • CW: back to compilationInfo().
  • KG: think it should be compilationInfo() if it's the same object, createCompilationInfo() if different objects.
  • BJ: if same object, I'd want it to be a getter.
  • CW: it's a Promise.
  • CW: what about "request"?
  • Discussion.
  • KN: there's a real reason these have different names. "get" is correct in my mental model. This one, there's no GPU object it represents. I don't think there's anything wrong with these names.
  • KG: think they all feel tolerable to me.
  • BJ: that's the key for me as well. Maybe don't line up the same way, but nothing unambiguously better. Nobody will get their depth buffer inverted because these are incorrectly named.
  • Could make this "get".
  • KN: can't put SameObject on something that returns a Promise. You could call it again, may need to throw an Exception, can't throw, have to reject, but can't. Can't be the same Promise.
  • CW: call it getCompilationInfo?
  • KN: it's concretely different from getBGL. Don't see why it should have to be the same.
  • KN: nothing against "get", just don't think it's concretely better.
  • CW: strong preference for anything here?
  • RC: going back, why can't we return the same JS object from getBGL?
  • KN: we can.
  • CW: difficulty is - we have a cross-process model. If you expect the same JS wrapper: say you gave a BGL wrapper to createPipelineLayout, then get the BGL back - you'd expect the explicit one you passed much earlier. Difficulty: what if your pipeline is an error on the service side? It doesn't give you a BGL - it gives you an error BGL. Suddenly, you call getBGL - error BGL.
  • BJ: concrete problem WebGL ran into.
  • KR: yes. We had to retain entire graphs of objects from the WebGLRenderingContextBase. Don't want to make the same mistake again. Very complex in multiple browser impls. In Chrome we had to wait for a brand new primitive to show up in the garbage collector (Member<T>). Multithreading issues in WebKit w.r.t. the garbage collector.
  • Discussion.
  • CW: resolved to "getCompilationInfo".
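
With that resolution, usage looks roughly like the following (a sketch; assumes `module` is a GPUShaderModule and that the message fields match the existing GPUCompilationMessage shape):

// getCompilationInfo(): a method returning a Promise of a new info object
// on each call, per the resolution above.
async function logDiagnostics(module: GPUShaderModule): Promise<void> {
  const info = await module.getCompilationInfo();
  for (const msg of info.messages) {
    console.log(`${msg.type} at ${msg.lineNum}:${msg.linePos}: ${msg.message}`);
  }
}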

Passive Fingerprinting Surface [#3101](https://github.com/gpuweb/gpuweb/issues/3101) [Myles]
  • CW: We thought about it.
  • KG: Mozilla and Google chatted. Think there was room for consensus where implementations “should” restrict the uniquely identifiable configurations to at most 32. Act in good faith.
  • MM: why "should" not "must"?
  • KG: think "must" is tolerable too.
  • KN: didn't want "must" because it's difficult to ensure you're conformant.
  • KG: hard part about "must" - is it OK to be wrong? Write "must" in a spec and be confident it's true? Backward compatibility isn't an absolute. Think we should say we "must" keep it under 32 buckets and try our best.
  • MM: think the "out" is - bring it back to the group.
  • DJ: what if you have 32 buckets and a 33rd one comes along? Merge buckets, add a new one?
  • CW: another one - this is non-testable. E.g. Chrome right now has 48 buckets for the limits. But, a bunch of these are probably not used by any device.
  • KG: that would be a good time to come back to the group, e.g. if there are 60 configs we want to support. Consensus for "must be 32 or less"?
  • CW: likely yes, but would like to see internally.
  • MM: OK.
  • DJ: are you aiming for 48 buckets and some are empty because devices are similar enough, or you've defined the bucket and don't see any devices in the wild?
  • AE: we have 4 * 4 * 3. We don't have data.
  • KG: we have ~27 unique renderer strings possible from our renderer string sanitization. Recent NVIDIA cards for example.
  • MM: realistically we'll have a few of them.
  • Resolution: Google will discuss internally. Expected resolution: "must be 32 buckets or less".
  • MM: when you normalize renderer strings - does that feed into other limits?
  • KG: de facto, but not in code. Max texture size won't change in a GPU generation.
  • CW: may be able to resolve this tomorrow.

maxBuffersPlusVertexBuffersForVertexStage limit [#2749](https://github.com/gpuweb/gpuweb/issues/2749) [Myles, Brandon]
  • MM: maxBindGroupsPlusVertexBuffersForVertexStage is what I'm proposing.
  • MM: if we can't agree on "bind groups" + "vertex buffers" then maybe not worth discussing?
  • CW: counterpoint: when you have a pipeline, you know where that boundary between the two kinds of resources is. Can remove the unused vertex buffers.
  • MM: problem with that - would like to set a bind group before you set a pipeline.
  • KG: did we consider save/restore?
  • MM: think that'd fit the bill. Seems reasonable.
  • KG: I want to use unbinding for save/restore.
  • MM: we have this implicitly - bundles.
  • BJ: doesn't do that. Causes a state clear.
  • MM: state clear would also fit the bill.
  • KG: can you enqueue a render bundle?
  • CW: executing 0 render bundles probably does this. :)
  • MM: can we make a shorthand function? clearEncoderState?
  • KG: think it'd be nicer to have a save/restore.
  • BJ: the way you'd get there - a way to reflect back what the state is. Not too much attached to encoders. Current bind groups, buffers, pipeline. Then you can build that save/restore yourself more easily than in say OpenGL. Probably the desired direction. I still question the utility a little bit. You mentioned libraries that don't mess with your state, and I appreciate that.
  • CW: had some ideas about save/restore, but not the right time to discuss in too much detail.
  • KG: idea of clearing encoder state.
  • CW: there's still the same thing. More work for the impl to do this thing itself. Very tractable amount of work to track the number of things you need to apply, and do so at draw time. Yes, setBindGroup should have proportionate cost. It's across the IPC boundary. Draw call is close by. Drivers probably have dirty bits for this.
  • CW: if you don't do this, then when you do a draw, have to recheck BGLs match the pipeline. Don't want to revalidate BGL.
  • MM: don't need to defer anything, just need a bit.
  • CW: exactly. Here you'd have a bit as well.
  • MM: what's the meaning of the bit?
  • CW: the pipeline has changed, so you need to recompute where the boundary between vertex buffers and bind groups is.
  • MM: then, have to rebind everything. That's what I'm trying to avoid.
  • KN: why?
  • MM: don't know how to make progress.
  • CW: saying this would be more than a Metal setVertexStageBuffer?
  • KN: BGs come from one end, VBs from the other.
  • CW: what's the concern? setVertexBuffer can result in more than one call to setVertexBuffer in Metal?
  • MM: yes, exactly. Trying to avoid - multiple binds, then they set the pipeline, then they draw.
  • CW: then they set another pipeline and draw.
  • <discussion>
  • BJ: this seems like a problem that'll vanish once we have bindless, for example.
  • MM: for some definition of "vanish". :)
  • BJ: don't know how far out that is. Seems like something that can be addressed via some feature or extension. Less problematic once bindless is there. Is it problematic to start off with the stricter limits, and allow it to become more flexible down the road?
  • KN: if we need this limit, need to add it for V1. Need to check it. If every impl ships it, and it's the sum of the other two, would still have the problem.
  • CW: would like to keep the full hour to talk with Intel Shanghai. We can re-discuss this issue.

Shanghai topics: WebNN, DP4A, push constants, WebCodecs (optional)

WebNN

  • NH: Web ML WG - developing WebNN API for a while. Sent Intent to Prototype - implementing in Chromium upstream.
  • Web Neural Network API Explained
  • NH: WebNN exposes functionality in native ML frameworks. Any modern deep learning model is a neural network. It is a computational graph over mathematical operations. WebNN allows app to create the graph; describe your workload in that graph; WebNN implementation can compile the underlying graph via the native ML framework, then run them. Explainer lists the major targets; Windows DirectML; macOS …(?); ChromeOS originally targeted Android NNAPI, and more recently the ML service in ChromeOS.
  • NH: It’s a cross-process unit of work. Could be implemented by CPU, GPU, or special purpose neural processor.
  • NH: use case we want to discuss - WebGPU interop.
  • JB: are you thinking WebGPU could be a backend for WebNN too?
  • NH: possible. In WebNN polyfill, we implemented using TF.js's WebGPU impl. We position that as a polyfill. Developer can access WebGPU directly. Not much reason to host WebNN on top of that. No perf gain for that solution.
  • CW: can use WebGPU backend to polyfill instructions not implemented in the core WebNN impl, and use WebNN for ML as part of a WebGPU app.
  • NH: collaborating with WebRTC group. InsertableStreams, MediaCapture Transforms.
  • https://github.com/webmachinelearning/webnn/issues/226
  • NH: WebEx for example. AI-based enhancement. Web app sets up real-time video processing pipeline, ideally on GPU. Composite into a few stages. Can do with WebCodecs and other APIs.
  • NH: needs to be done on client side for privacy reasons. Good use case for these APIs interoperating. Opportunity to offload the ML processing stage for segmentation?
  • NH: doing this, we see a perf gain and power reduction using the optimized impl on the native platform.
  • NH: concept - hardware specific optimized kernel impl. Operators through DirectML, Metal Performance Shaders. AI accelerator can work with the GPU.
  • NH: 2 requirements we want to discuss and get feedback:
      1. For AI accelerator offloading, developer needs to bind GPU resources as input to the WebNN graph.
      2. Developer may need fine control of the WebNN graph submission, along with other GPU commands like post-processing in WebGPU, or custom kernel execution that WebNN doesn't support.
  • NH: have a proposal - ML command interface.
  • Web Neural Network API - The MLCommandEncoder interface
  • NH: want WebNN graph as a workload into a command buffer. Submit to WebGPU command queue, fulfill the scenario we described. A proposal; want your feedback. Or do it another way? Consume the GPUBuffer more directly? Want to start the coordination. You're close to your V1 release.
  • https://www.w3.org/TR/webnn/#api-mlcommandencoder
  • CC: hello everyone - I run the Windows AI team, including DirectML. Late today - apologies. :) Briefly - we have a couple design issues that would benefit from your help. Maybe not in this meeting. One - about WebGPU/WebNN interop. Think this is very important. Many scenarios require 0 copies to make them efficient. That part of the spec is something we've put out, but not making much progress understanding whether it can be implemented. If you can help us understand that, that would be very helpful. WebNN in general - meant to be implemented by the platform. On Windows, have done initial work to map to DirectML. Going well. Very curious about other platforms like Apple CoreML, Android, others - from POV of web standard. Are we on the right track? Defining something realistic? WebNN has been under development for ~3 years - close to Candidate Recommendation status. Need to understand real world implementability.
  • MM: some feedback: we think the most valuable part of WebNN is running on the neural engine. Thumbs up. Maybe unique about it - our system framework chops up models and runs part on CPU, GPU, Apple Neural Engine. That's valuable. We think any impl of WebNN on Apple platforms should use the same chopping facilities.
  • MM: we think WebNN by itself fits that model.
  • MM: WebGPU integration - more challenging. MLCommandEncoder - challenging. ANE doesn't execute on the GPU's timeline. Wouldn't want to attempt synchronizing the two.
  • MM: example of image frames, passing back and forth to WebGPU - a use case we want to support. That's the shape of our underlying platform cross-process object (IOSurface). Using textures there - good. Using buffers there - bad. Not implementable.
  • MM: MLCommandEncoder - would only be able to use the GPU. That would be unfortunate.
  • RC: does this slicer let you wait for the ML part, then let the GPU take over the IOSurface?
  • MM: yes.
  • KG: you have different adapters in a system - lack of associated adapter in the IDL was a concern.
  • RC: one PR we have up for review is passing in the GPUAdapter if you want WebGPU interop. For non-GPU workloads, impl will have to handle the fences for you.
  • CW: for DirectML seems the API would work very well. Think these are implemented on the GPU? Could work there. Passing adapter beforehand? Whole graph passed to WebGPU? Not sure. On Mac, different systems on different timelines - more difficult. Think similar on Android. Physical neural / image processing engines - more challenging. Sync overhead. Have to explicitly transfer things. Handoff work from one to the other. Solution - hand off objects explicitly between WebGPU / WebNN - seems more portable. Looks nice though. Easy from developer standpoint. Difficulty in synchronization is there.
  • CC: multi-device/adapter issue Myles raised is complex in practice. Would like to understand better how to make it easier to implement. GPU-centric timeline - want to make sure it's synchronized properly. Talking about single graph - want to chop into pieces - some running on the neural engine - sync across these timelines is important. Should do more deep dives into this to understand it better.
  • MM: WebNN explainer, near bottom, has a table - WebNN and model-loader APIs. Are these siblings?
  • RC: model-loader is an alternate API we can expose. We prefer to stick with WebNN.
  • MM: these are competing?
  • RC: you could say yes. Each has pros and cons. To avoid a format war, we'd prefer the model/graph expressed with JS commands, and do model parsing with JS libraries.
  • MM: Personally not super interested in implementing model-loader.
  • NH: A model can be loaded by a WASM implementation, or by JS. WebNN could be a backend for these.
  • CW: to wrap up for now: what should interested members of WebGPU CG do?
  • CC: I'd say - look at the links we shared on MLCommandEncoder, and provide feedback. Also PR # Rafael mentioned earlier, passing GPUDevice to instantiate WebNN. Third, GPU/neural engine timelines and synchronization: Need to understand how WebNN supports implementation that involves multiple devices and timelines · Issue #350 · webmachinelearning/webnn · GitHub.
  • NH: those, plus resource sharing APIs. These are the top priorities.
  • https://github.com/webmachinelearning/webnn/pull/322
  • RC: also - there are several operators in the spec - will these operators work well on Apple hardware? Would be great to get feedback from domain experts.
  • KG: think most imp't thing is to do investigations into backend implementations on Android, Linux, Mac, ChromeOS, etc. at this time.
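
For context, a rough sketch of the interop flow under discussion, using the names from the MLCommandEncoder draft linked above; every type and method here comes from that in-flux proposal and may change:

// Stand-in types for the draft WebNN interfaces (not shipped anywhere).
type MLContext = any;
type MLGraph = any;

// Hedged sketch of recording a WebNN graph for submission alongside WebGPU
// work, per the draft MLCommandEncoder interface.
function recordGraph(
  context: MLContext,                // assumed created from a GPUDevice
  graph: MLGraph,
  inputs: Record<string, GPUBuffer>, // GPU resources bound as graph inputs
  outputs: Record<string, GPUBuffer>,
) {
  const encoder = context.createCommandEncoder(); // draft MLCommandEncoder
  encoder.initializeGraph(graph);                 // draft: upload constant weights
  encoder.dispatch(graph, inputs, outputs);       // draft: record graph execution
  return encoder.finish();                        // a command buffer to submit
  // How this interleaves with the GPU queue (and non-GPU timelines like a
  // neural engine) is exactly the open question raised above.
}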

DP4A

  • Issue: Proposal: Support DP4a as WGSL built-in functions · Issue #2677
  • PR: WGSL: Add dot4U8Packed and dot4I8Packed as new built-in functions by Jiawei-Shao · Pull Request #2738
  • JS: though this is post-V1, we want to propose it again. Useful for executing int8 ML models on the web. Can run much more quickly.
  • JS: thanks David for reviewing my patch. Couple new primitives that map to SPIRV, HLSL. Can't expose the saturating primitives because not supported in HLSL.
  • AB: Questions. The way the SPIR-V extension works in Vulkan, you have to expose all the functionality and add a flag to say whether it's actually in the HW. In WebGPU, should these be polyfilled in software?
  • AB: In SPIR-V, the dot products can take any vector supported in SPIR-V. When we add types like i16, then do we want to expand these builtins to that type. I’m not a great fan of using such specific names. Instead should we overload the ‘dot’ builtin function?
  • JS: I think - these have been supported as core feature of Vulkan and HLSL.
  • AB: do you want these polyfilled? Then there's more impetus to add them to the spec as a core feature. If you want them HW accelerated, we need an enable flag. Is there concern that there would be a noticeable performance cliff if it has to be polyfilled in software?
  • MM: today, for apps using DP4A - they exist - do they not work on devices that don't have that instruction? Or what do they do? App has 2 code paths?
  • JS: not sure. For Intel's XESS, only requires SM 6.4. Supported on many other devices.
  • MM: so apps themselves don't check for HW support, they just run the instruction.
  • AB: on the Vulkan side you can see if it's HW accelerated.
  • JS: similar on D3D. Can query device ID.
  • MM / KG: sounds like we should polyfill it.
  • KG: Agree should polyfill it.
  • CW: then people won't know whether it's HW accelerated.
  • MM: they're saying people don't care.
  • KG: agree the name's unfortunate, if we expand to 16-bit pieces. But if stuck with this weird builtin and go in a different direction, that's acceptable.
  • MM: if we picked a different name, would that disappoint Intel?
  • AB: the "A" - accumulations - the proposed feature here don’t do accumulations. Would also want to think about a future where we have unpacked i8 support. Understand packed dot situation, and integer performance.
  • MM: how much does Intel care about the name?
  • JS: not sure. DP4 is not a name only used by Intel.
  • MM: OK for us to rename it?
  • JS: sure. Also, at compilation time we can fuse the dot and accumulate together into a DP4A instruction.
  • CW: the "A" - accumulation - peephole optimization in compiler.
  • AB: Vulkan only adds accumulation if you add saturation. We don't expose that.
  • MM: sounds like everyone's pretty positive. Questions about the name, etc. But general thumbs up.
  • KG: also, pretty narrow use case, so not too worried about the name.
  • CW: since thinking of polyfilling - it's not an enable. It's in WebGPU V1?
  • MM: you OK with that?
  • BC: there are other things we have hardware caps for. Why polyfill?
  • MM: existence proof - there are apps running on D3D12 and they run everywhere, just hardware accelerated in some places. Tried polyfilling FP16, was unsuccessful.
  • BC: V1 vs. 1.1?
  • AB: thought we agreed, post-V1.
  • CW: consensus - yes for this - name is fine.
  • AB: long ago, we agreed post-V1 on this.
  • KG: technically, post-V1 polish - agreeing now that we should just do this, soon after V1.
  • DS: thought this could be the first post-V1 sugar feature.
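
For reference, the builtins proposed in PR #2738 in a minimal shader (a sketch; the names were agreed to be fine, and the timing is "soon after V1"):

// Minimal compute shader using the proposed packed dot product builtins.
const dp4aWgsl = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> packed : array<u32>;
  @group(0) @binding(1) var<storage, read_write> result : array<i32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
    // Each u32 packs four i8 lanes; dot4I8Packed multiplies lane-wise and sums.
    result[gid.x] = dot4I8Packed(packed[2u * gid.x], packed[2u * gid.x + 1u]);
  }
`;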

Push constants #75

  • D3D12: Root constants
  • Metal: setVertex/Fragment/ComputeBytes
  • Vulkan: push constants
  • Potential proposal: passEncoder.setPushConstants(offset, arraybuffer/arraybufferview)
  • Very limited space: 16 bytes? (root signature is constrained, push constants used by the WebGPU implementation itself).

WGSL:

var<push_constant> size : u32;
@fragment fn scalar() {
  _ = size;
}

struct TensorSizes {
  input1 : u32,
  input2 : u32,
}
var<push_constant> tensor_sizes : TensorSizes;
@fragment fn tensor() {
  _ = tensor_sizes;
}

  • Jiajia: I work on the TF.js WebGPU backend at Intel. The requirement for push constants: they're a performance improvement, and we'd need 64 bytes - otherwise many ops can't use them.
  • JQ: unary op - need at least 32 bytes. Most ops - have two inputs, one output. From TF.js POV - may need 64 bytes for push constants if you want this feature.
  • MM: are push constants for performance or convenience? We can already attach data to shaders. What do these give that buffers don't?
  • KG: performance.
  • MM: recall I ran experiments. Performance wasn't compelling. Would need similar microbenchmark showing better performance.
  • DN: Google folks have been asking for this for perf reasons, should be able to produce that.
  • MM: if you show rigorous perf improvements - that's sufficient.
  • CW: if we do this for perf and/or convenience reasons - there'll be discussions about when they're cleared, etc., and how much space is required in the spec. That's the scary part. Very limited space on Vulkan - competes with our impl's use of push constants. On D3D12 - the root signature max size is limited. Scary too.
  • RC: and on D3D12, Dawn uses root signature to implement its own features.
  • KG: somewhere it said, we could only give you 128 bytes. But if TF.js only wants 64 bytes - still valuable.
  • MM: just want a compelling benchmark.
  • CW: other part - what does this look like in WGSL? Bunch of restrictions. wgpu - has push constants today. Users can use in native. It looks like in the minutes (see above). Tint also has this for our internal use - not exposed to users.
  • MM: don't like designing based on impls - but if the design's well informed and happens to match impls, sounds fine. Believe it can work, not a question.
  • KG: no preference about syntax.
  • CW: we'll wait for the data.
  • KG: thanks for bringing up these topics!
  • MM: excited for these additions!
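
If the benchmark data pans out, usage under the potential proposal above might look like this (a sketch; setPushConstants is the hypothetical method from the notes, nothing is specced):

// Hypothetical: the Uint32Array matches the TensorSizes struct in the WGSL above.
function drawWithSizes(pass: GPURenderPassEncoder, input1: number, input2: number) {
  const sizes = new Uint32Array([input1, input2]);
  (pass as any).setPushConstants(0, sizes); // hypothetical (offset, view) shape
  pass.draw(3);
}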

Friday

Future Topics

Real Pointers / Unrestricted Pointer Parameters

  • MM: Our highest priority is “real pointers”.
  • BC: Meaning exactly what?
  • MM: “The Clayton Transform”
  • BC: I call that unrestricted pointer parameters.
  • MM: That’s a-ok.
  • BC: Ok, wanted to confirm we’re talking about the same thing.
  • DN: So it would be a named feature like we discussed yesterday?
  • Group: Yes

Standard WGSL namespace:

  • BC: We’ve made so many things shadowable. E.g. you can shadow i32 with something else. That’s not advisable for authors to do, but we’ve chosen it for extensibility etc.
  • BC: There's a larger discussion to have about user-declared namespaces, but that's a big discussion. Before that, we'd like to create a 'wgsl' namespace to hold all the predeclared things. My mental model is that the language-scope things form an enclosing scope outside the user declarations. The wgsl namespace would be a way to fully qualify the outermost scope of predeclared things. Syntax-wise, we've talked about options; suggest using "name.inside" dot syntax.
  • KG: I’m for it.
  • JB: Need to make sure the dot is feasible, syntactically. Would we have global variables of struct type and do member references? Would we accidentally shadow them? E.g. in A.B.C, today A might be a struct, and later A might be a namespace. Do you know how to parse it without knowing if A is a struct vs. a namespace?
  • DS: you wouldn't access the member off the struct.
  • JB: if wgsl.myGlobal were a struct type, you could write wgsl.myGlobal.x. Wouldn't know namespace vs. struct dereference until later.
  • BC: Given out-of-order declarations, I’m not convinced that changing the grammar gives you more.
  • JB: So you parse as a chain of names, and only at identifier resolution time would you figure out what that meant. I like having smaller bits of ink for this syntax. Don't want to pay for it in spec complexity. Like C++'s ::.
  • JB: In principle agreed, and will look at details.
  • KG: what about wgsl.max()? wgsl.bitcast<f32>?
  • (MM catching up on discussion)
  • MM: what happens when we add structs/classes that can have member functions?
  • KG: In my mind the parser just parses it as a chain of name references, and resolve it later.
  • MM: user-defined namespaces?
  • JB: not yet but want it in the future.
  • MM: wgsl namespace is magical?
  • JB: yes.
  • MM: meaning, wgsl namespace members can be accessed without dereferencing the wgsl namespace explicitly.
  • JB: the way rust structures this, might be good for wgsl, is they say everything is in std::. Every shader module in our case, implicitly has a prelude that does ‘use wgsl::*’
  • KG: (agreement)
  • JB: Rust, Standard ML, Haskell, I think use this
  • EG: can think of the prelude as reimporting things from another module. [Can think of] other languages like Python, JavaScript this way too
  • MM: can I not have the prelude if I want?
  • (no)
  • MM: can I write wgsl.i32?
  • (yes)
  • KG: kind of like if you do “using namespace std”. some programs do that, others do not. Can write programs in a way where they only get standard definitions.
  • MM: how do I know wgsl.i32 is fully qualified?
  • JB: need leading "." syntax.
  • BC: ‘wgsl’ is reserved right now, and we can keep it that way.
  • CW: like in C++, can do ::std
  • DN: … reserve ‘wgsl’ …
  • JB: reserving ‘wgsl’ totally works. but inconsistent with our past direction (syntax is fixed, identifiers are not)
  • BC: Not keen on that. For the initial ‘.’ you want that to mean module scope. This is not module scope.
  • EG: in the root scope? would you have a name for that?
  • BC: currently have many things above global scope
  • DS: can just have a symbol representing the root? $.i32?
  • that’s what reserved ‘wgsl’ is
  • KG: would be my preference.
  • MM: in the future when we have user-defined namespaces, I would not be allowed to make a sub-namespace called ‘wgsl’
  • BC: would likely be a keyword
  • CW: we've been trying to remove reserved keywords
  • KG: "reserved identifier". Don't have to worry about removing keywords.
  • CW: nice property of Rust, C++ - lot of builtin things are just a function implemented by this. Might be implemented in hardware. Could be less magic in the WGSL language. Doesn't have to be a goal, but nice.
  • BC: this is the carveout to allow you to avoid the shadowing. A mechanism to let you definitely get the builtin thing.
  • CW: .wgsl.something does that.
  • KG: could also do it that way. I don't like ".bitcast".
  • BC: seems ambiguous to have a ".".
  • KG: is this not tolerable?
  • KR: given that spaces aren’t significant, don’t you have a problem that the previous identifier might interfere or be ambiguous?
  • BC: yeah, that’s the possible problem with dot as prefix.
  • JB: not sure this discussion is the right way to design this. would like to have a proposal
  • EG: question about keyword proposal. feel like this carveout is forward compatible …
  • JB: keyword would be forward compatible
  • MM: one way to look at C++ analogy - they have a reserved namespace, too. It's named "empty string". User can't make their own namespace "empty string". Instead of wgsl being special, empty string is.
  • BC: point was that we will want a mechanism to let you get to the module scope. So if you shadow f32 but you want to refer to the builtin f32. :: as you’re proposing is the wgsl namespace. Would not have a way to get to the root [user] namespace, so would need another syntax for that.
  • MM: the goal isn't to make a final proposal - this is WGSL V2, probably a long time for now. Taking temperature of the room, broad brushstrokes. I think this is a good idea. Like it to be systems-driven. Single feature that can do the goals of your proposal, and do user-defined namespaces.
  • CW: seems that the decision we need to make today is whether we reserve the wgsl keyword.
  • DN: already is reserved (not in the keyword section).
  • JP: the sort of things in this wgsl namespace would be: enum values, builtin types, builtin functions. All one namespace, so since we have a center enum, we wouldn’t be able to add a center() builtin function
  • DN: yes.
  • JP: making type constructors builtins: then have f32 type in wgsl namespace, and f32 constructor function there. Do we need magic for type constructors?
  • DN: no because we can redefine the object that f32 points to. We can imbue any properties we want to it.
  • MM: btw "systems-driven" == generalizable.
  • EG: but not necessarily user-accessible at this time?
  • MM: right, those are 2 different requests. 1) user interactable - want user to make their own namespaces. 2) not too much magic.
  • EG: in this case, we put something in the wgsl namespace - type constructor and value constructor.
  • MM: we should say, all types get a type constructor / value constructor (?)
  • DN: we do. "texture" is special though.
  • EG: exposing in wgsl namespace - expose the same functionality to users at the same time?
  • MM: user creates type -> create type constructors for them? yes.
  • EG: builtins expose this. If what's built-in exposes this, user should get the same functionality.
  • MM: it's that types get a constructor spelled the same way as the type name.
  • EG: same rule for builtins lets user do the same thing?
  • MM: yes.
  • KG: want to understand the gap between my understanding and the …
  • KG: concept of adding a wgsl namespace. If people want their stuff in a namespace they write a namespace. A way to think about how builtins are implemented is that they’re in a namespace that’s in scope.
  • KN: can you shadow things at the same scope?
  • BC: no.
  • MM: just don't want to treat wgsl namespace differently than other namespaces.
  • KG: I propose all of our builtins are in the wgsl namespace, and our preamble contains "using namespace".
  • MM: as long as author doesn't have to "use wgsl", and can opt-in to using "wgsl." everywhere.
  • KG: an un-using?
  • (... yes)
  • KG: I’m ok with stopping there and having that [be special]
  • MM: think this is a detail. Shouldn't try to design in detail in this meeting.
  • JB: everything essential that was said sounds great. As long as people can generate code unambiguously referring to the builtins, ok. I think the rules will generalize nicely. Think we can just go see a proposal.
  • MM: in KG's proposal can people "use namespace myles"?
  • KG: yes.
  • JB: this proposal's just for this one thing. Starting to think about user-defined namespaces. Don't have to do that now.
  • MM: not at this moment but I think they should be the same feature.
  • BC: I think we'll want what I proposed before user-defined namespaces. User code can shadow builtins, types. If you translate from another language to wgsl, you have to rename all their stuff to enable you to use wgsl. People jump through hoops for this right now.
  • KG: I proposed this in a comment in the past.
  • https://github.com/gpuweb/gpuweb/issues/2941#issuecomment-1164927025
  • KG: goes into more detail of how you'd do nested, versioned namespaces. I called it std, we can call it wgsl. I will shape this up and get it spec-ready.
  • EG: interesting to develop snippets that show what we're proposing?
  • KG: you're welcome to, but don't think we need to - think we have consensus.
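
To make the consensus concrete, a hypothetical WGSL snippet; neither the 'wgsl' qualifier nor the shadow-then-qualify pattern is specced, this is just the shape discussed above:

// Hypothetical WGSL: fully qualifying a predeclared name through the
// proposed 'wgsl' namespace, even when a user declaration shadows it.
const namespacedWgsl = /* wgsl */ `
  alias f32 = i32;              // user shadows the predeclared f32

  fn half_of(x : wgsl.f32) -> wgsl.f32 {
    return x / wgsl.f32(2.0);   // 'wgsl.' reaches the builtin despite the shadow
  }
`;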

WGSL Process And Norms

  • Process and norms [David]
    • Assume WGSL is decoupled from WebGPU.
    • Intended release cadence?
    • Mechanics of updating spec with in-development features. (Branching?)
      • How would we develop a feature that takes 50 commits to finalize.
    • Cherry-picking features from “future” versions.
  • DN: we have 2 specs. How do they co-evolve?
  • DN: maybe answered this yesterday. I assume - want to confirm - that WGSL will evolve without needing to sync with WebGPU, aside from surfacing feature names.
  • JB: alternative to this?
  • DN: you have WebGPU say, we accept createShaderModule, these languages at certain versions. I assume they'll evolve independently. Join point - whatever names show up in your GPU device features.
  • MM: think independently's too strong. Bindless - can't do that independently. Changing anything - push constants - can't be independent. For sugar features we can do what we want.
  • KG: technically we have 2 specs but one meta-spec for WebGPU, and WGSL spec's separate for convenience.
  • DN: developer will only pay attention to one version of the spec - the latest published spec from the W3C. No updates. Publication's monotonic for both WebGPU and WGSL specs.
  • (agreement)
  • DN: release cadence. Follow-on from yesterday. Version 2 of WGSL - pay attention to this, developer community! Don't want to have more than one of these per year. Want them to be a point in time.
  • JB: agree on 1 per year at most.
  • MM: should be less often than that.
  • DN: should get all browsers up to same level before considering V2.
  • KG: agree with this browser cadence.
  • MM: here we're not signing up to ship something 2 years from now. The farther ahead one browser gets, the larger that gap is, and synchronization between releases more difficult.
  • DN: if namespaces take 50 commits to land, how does that process work in general? If the feature's not ready and we're trying to develop it in discussions.
  • KG: we use one mainline and merge things into it. We mark things as needing more work. I'd say we should use a feature branch for that.
  • MM: think it depends. Each work item can make that decision.
  • CW: can't editors say, this is too much / taking too long to land?
  • KG: maybe. If it'll take 50 commits, put it in a feature branch.
  • CW: DP4A will be one commit, so can land directly.
  • MM: other W3C specs - it's common for tip-of-tree to have unimplemented things. People wanting to know what's out there typically look at the TR version.
  • JP: When we want to cut a release, do we take all the features that are in the spec? Or do we talk about it?
  • MM: Decide when we get to that first release?
  • KG+JB: We will make these decisions anyway.
  • MM: there must be spec text somewhere saying V2 corresponds to [this list of feature names]. So we have to do that anyway, should do the normal standardization process to figure out what’s in that list like we would with any other change. And it’s ok if a feature is in the spec already but doesn’t get rolled into a release.
  • KG: and everything we have now is in the V1 release.

Passive Fingerprinting Surface #3101 [Myles]

  • CW: Google discussed internally about “must” 32 bucket limit. Seems fine to us.
  • MM: OK, ship it.
  • KG: I can write up the spec language.

Defaults for depthWriteEnabled and depthCompare may be surprising #3798 [Brandon]

  • CW: we suggest making depthWriteEnabled, depthCompare, required. depthClearValue - doesn't need to be required since it's a validation error to leave it undefined.
  • MM: Thought we were going to make them required, then add defaults again before v1?
  • CW: that’s only necessary for implementation strategy. Spec can just choose now to either change to required, or change the defaults.
  • KN: I still oppose the new defaults for V1, I want them required in V1 [if we make a change].
  • KG: making them required is slightly better. Even better to have good defaults. Can litigate that later. Comfortable with having no defaults and making them required.
  • CW: Gregg had good reasoning for keeping them required indefinitely. Better to make people explicitly choose what they want.
  • GT: my preference - all required. Defaults impossible to litigate - will be a vote. Required = fewer changes, more explicit.
  • KG: you can add these kinds of defaults by modifying the JS prototype, too. Consensus to make them required?
  • MM: "the future's longer than the past". Can make changes in the future.
  • CW: resolved. Editors to take it.
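
With that resolution, the depth/stencil stanza must be spelled out in full; a minimal sketch (the format and compare function are just example values):

// depthWriteEnabled and depthCompare are now required: no defaults.
const depthStencil: GPUDepthStencilState = {
  format: 'depth24plus',
  depthWriteEnabled: true, // must be stated explicitly
  depthCompare: 'less',    // must be stated explicitly
};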

maxVertexBuffersPlusBindGroupsForVertexStage

  • CW: imagine the limit's 32. (MM: Probably 29 actually) Or, say 4 for an example. Can be 4 VBs, 4 BGs, or a mix of the two adding up to 4. If no add'l limit - 4 VBs plus 4 BGs. If limit created only at pipeline time - 4 VBs, or 4 BGs. In encoder, can set 4 VBs and 4 BGs. Set pipeline using 4 VBs, draw. Set pipeline using 4 BGs, draw. Impl has to remove stuff behind the scenes. Draw implies lots of rebinding.
  • CW: suggestion: we make the limit happen on the things currently bound. Add a way for users to unbind stuff. Can stay under the limit.
  • CW: if user has 4 VBs and 4 BGs, draw pipeline 1, draw pipeline 2 - what's the work they'll save compared to what the impl could do by itself?
  • CW: work to unbind/rebind BGs - it's the same work the impl would need to do. Have to do it themselves, have to do validation.
  • MM: central thesis of WebGPU - your costs are explicit. When the binding happens, there's a call, change the binding. If user doesn't want that they don't call it. Expose operations that have to occur.
  • MM: for program you just stated, the costs would be the same. It wouldn't be for other programs though. Could bind everything at draw call time, then we're back to OpenGL world.
  • KG: part of concern - perf impact of making these calls from JS would be higher than doing it internally.
  • CW: that's one. Another - adding complexity, runtime/draw time checks. Then need a way to unbind things. User has to do it. It's slower. Opinion: all for something (subjectively) corner case. Tradeoff is imbalanced.
  • MM: understand. I disagree. I'm asking for one limit that can be enforced with one machine instruction. Cost's low. I understand you think the benefit's small. I think it's medium. For unbinding - some appetite for doing it even without bind group reshuffling.
  • BJ: even if we had unbinding would we be able to implement the behind the scenes scheme you mentioned, without tracking the bound buffers + bind groups intermediately, and rebinding them just-in-time for the pipeline?
  • MM: there's a trick. You have to keep track of what was bound before. Where you have to swap, in Corentin's example - the place where the swap occurs isn't at draw time. It's at unbind time. It rebinds the things that would have been bound. Author pays for the costs explicitly.
  • CW: wouldn't you have to check this at each setVB / setBG? You draw, start unbinding/rebinding - if we validate only at draw, you can start binding bind groups, unbinding VBs - it's not straightforward in your impl.
  • MM: I made a long post about how this works. Can link it.
  • MM: https://github.com/gpuweb/gpuweb/issues/3787#issuecomment-1415445208
  • CW: remember reading about how the max is computed. But you can't call Metal setVertexBuffer … at that time.
  • MM: you need to know - for a particular slot, what are the two most recent candidates. VB or BG. Then when you unbind one you rebind the other. Still pay-as-you-go. When you unbind, we still know what to rebind.
  • BJ: understand where you're coming from. I'll perceive that unbinding something's less costly than binding something. In this scheme it's the same cost. I see how that works out.
  • KR: Intuitively don't see how having it per slot lines up with what users would do. Wouldn't you need full history?
  • MM: Don’t need history. Only need the last item. Author says set a vertex buffer to one slot, and another to a different slot. Only need the last one (-- I didn’t get that). Only need an array of tuples.
  • CW: for impl, what does this entail? In our impl, probably a bitset of VBs and BGs that are set. Count leading zeros, do sum of the count for the validation.
  • MM: validation - of all the BGs, which one has the max index. Same for VBs. Add together. For you, you can set limit statically; defer binds; etc. Kai enumerated them.
  • CW: we'll just not implement the validation and have it be the sum of the two.
  • MM: if user asked for 5,000 VBs you'd just say no.
  • KN: they could raise VBs by 1. Then combined limit would stay the same.
  • CW: so we need to continue to implement validation. If you raise the number of VBs - they wouldn't raise the combined limit, so we still need to implement the validation.
  • BJ: are we allowed to increase one of the limits by default?
  • CW: not conformant.
  • KN: not currently conformant.
  • MM: is this a spec feature request?
  • CW: we can implement the validation.
  • GT: you could warn. If you raised this limit, we could go faster.
  • MM: the validation's so simple I think the if-test skipping it would be slower.
  • CW: from our perspective we think this is imbalanced. Need to add unbinding - add'l complexity. Needs add'l testing. One day, we'll find we have bindless and don't need this feature any more.
  • MM: yes. When bindless is enabled we can bypass this.
  • CW: or if Safari has stats that people never run into this we could stop exposing the limit?
  • JB: Mozilla has no opinion.
  • CW: ok, so validated at draw time. Can pass null to setVB and setBG.
  • MM: there are multiple ways to do unbinding. Think, start conversation about unbinding.
  • KG: I have some preferences; main goal is to save/restore state. I'm just asking for save/restore option for the bindings.
  • MM: just bindings? Not VBs, not pipeline states?
  • KG: this is easier for us because we have render bundles and they have a state reset. RB doesn't restore state afterward. That's the state it'd be nice to be able to restore.
  • MM: RenderBundleDescriptor, another "save" flag? "Restore" function? Won't work.
  • CW: multiple ways to do this. saveState, new object. pushState on encoder, popState. Making it opaque - robust to future spec changes. Add'l piece of state? Add it to it.
  • GT: now I want to create those outside the encoding.
  • KN: has come up before. Right now we require a bunch of JS calls in the hot path.
  • BJ: sort of what RenderBundles were supposed to solve.
  • MM: if there's a state object and it persists outside the encoder - the whole idea of PSOs. This is a PSO plus the state. Not sure how much I want to innovate here.
  • CW: these are possible. Rather not go into so much detail now. Don't want save/restore, push/pop for V1.
  • MM: is your opinion about save/restore different than setVB/BG passing null?
  • CW: yes. Annoying to add tests for the latter, but fairly natural and self-contained.
  • MM: if we start with that now, and eventually - maybe for V2/V3 - consider push/pop, save/restore - KG how do you feel?
  • KG: sounds ok. RenderBundles are enough for V1.
  • CW: OK. Resolved: New limit, draw time validation, setVB(null) resets, can pass garbage for offset/size (no validation), setBG(null) too.
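
The resolved reset semantics, sketched (per the resolution above; the offset/size arguments of a null setVertexBuffer are deliberately unvalidated):

// Passing null resets a slot; draw-time validation then counts only what
// remains bound against the new combined limit.
function resetSlots(pass: GPURenderPassEncoder) {
  pass.setVertexBuffer(0, null); // unbind vertex buffer slot 0
  pass.setBindGroup(1, null);    // unbind bind group index 1
}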

[Post-V1] Dealing with holes in the pipeline layout. #2043

  • CW: now you can reset BGs to null / uninitialized. Right now have to fake up empty BGLs. Have to create empty BG.
  • MM: instead of doing that, can do a null BG, same as creating an empty one.
  • CW: even better if BGL sequence can be sparse.
  • MM: careful about terminology. 3 levels. 1) all things need indices which increase by 1 every time (dense). 2) use any number. 3) Choose a number between 0 and 8, but can have a 3 without a 2.
  • CW: here, talking about (3).
  • MM: sounds great.
  • CW: to be clear, this is post-V1.
  • MM: also great.
  • KG: V1 is dense?
  • CW: right now, yes. If you create an empty BGL, unbinding the BG doesn't do it.
  • MM: that's fine for V1. Make it sparse later.
  • CW: point of process. This is the first agreed upon post-V1 feature. When do we put it in the spec?
  • MM: spec will have a bunch of features. Not all are in V1. Not all in V7. Some list saying "V2 is this".
  • CW: not a hardware feature. Won't be an optional feature.
  • KN: needs to be feature detectable.
  • KG: everything is, by checking errors. :)
  • MM: would like to see the code for that. If more than ~15, that's unfortunate.
  • CW: push/pop error scope, await a thing.
  • KG: need dummy resources to bind?
  • CW: no, just create empty BGL. Empty pipeline layout with nothing in 0 and that in 1. push/pop error scope, await. ~6 lines (see the sketch below).
  • BJ: make it a new limit. maxGapBetweenBindGroupLayoutIndices. :)
  • KG: just land it. Then canIUse has a breakdown. Who supports sparse BGLs?
  • MM: worth having a discussion - e.g. in CSS, feature detection's @supports. For us, granularity's non-obvious. Spec'ing several different entry points for a new feature - if I test one of those, does it mean all of them are there? Yesterday, wgsl resolution - each feature gets a name, written down. Have deliberate discussion about feature detection?
  • BJ: check feature? Feels heavyweight. If dividing things into post-V1, we have a way to name features.
  • MM: not suggesting giving a name. One particular solution. Features feature - if device doesn't request one of them, have to act as if it's disabled. Not what we want here.
  • BJ: I foresee a lot of contentious future debates. Which bucket does a future feature fall into? Easy to look at HW caps, definitely a feature, might not be universal. Stuff like this - more palatable to say, we can implement it as we go. Making it feature detectable is important. Don't rely just on canIUse. Want to check at runtime.
  • MM: every option here has feature detectability. The question isn't whether things should be feature detectable - it's what the developer experience should be. Maybe we don't define any granularity: try stuff, push/pop error scopes. That's one extreme. Or, do it like hardware features. Possibility.
  • CW: we'll have helper libraries. Here's something that does feature detection. Something we could produce.
  • MM: reasonable. Once we talk about creating new community group documents, that's where charter gets involved. We could write a JS function that does the feature detection.
  • CW: procedurally - members "happen" to maintain another repo.
  • MM: that'd be fine.
  • KR: comment about Modernizr and not wanting to create random WebGPU devices just for feature detection, like it did for WebGL. Also, github.com/webgpu org doesn't have the same charter as the CG, so helper libraries can go there.
  • AE: right now, setVB will throw an exception due to the Web IDL.
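
The ~6-line detection CW describes might look like this (a sketch: it assumes the post-V1 shape where a hole in bindGroupLayouts is expressed as null, which is exactly the feature being detected):

// Probe whether a pipeline layout with a hole at index 0 validates.
async function supportsSparseBGLs(device: GPUDevice): Promise<boolean> {
  const bgl = device.createBindGroupLayout({ entries: [] });
  device.pushErrorScope('validation');
  device.createPipelineLayout({ bindGroupLayouts: [null as any, bgl] });
  return (await device.popErrorScope()) === null; // no error => holes allowed
}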

[Post-V1] read-write storage textures #3838

  • Teo posted an investigation about tiers
  • Austin looked at Chrome’s stats about tiers
  • TT: But you can’t even query tiers until macOS 10.13. We support macOS 10.12.
  • SW: For Compat, ES 3.1 and D3D11.0 can do Tier 1 only (R32*) but you can bind typeless RGBA8 to them (see here), non-coherent only. Coherent version can only be done in D3D11.3 IIRC.
  • CW: 3 questions: WGSL (access mode on storage texture), and a new barrier), API, and which texture formats.
  • KN: API easy. For texture formats, Teo’s investigation looks solid to me. WGSL, defer to Alan.
  • AB: For the shader side, we didn’t have it because it needed a newer version of macOS / Metal. Would need an enable because we want to support those older versions as well. Metal2 on iOS and Metal2 on …
  • CW: We added a stat for this and we saw many devices that …
  • AE: I checked Chrome’s metrics recently. Chrome doesn’t support 10.12 anymore. (number?) 7-8% report Tier “unknown”, which is when Metal’s copyAllDevices returns an empty array. Otherwise all support Tier1.
  • KN: So we can support it?
  • CW: So read_write storage textures could be soon?
  • MM: Might not work the way you think it works. Read cache hierarchy is not the same as write.
  • AB: So yes, on Metal, you need an image fence between a read and a write in the same invocation, and a barrier between invocations. That’s what we need in the solution. There’s a possibility of eliding the fence within an invocation. Would take work and would be a heuristic.
  • MM: Worth thinking about.
  • AB: Within an invocation it’s a mem image fence. In MSL, on the image itself, it’s the .fence() method. The conservative codegen strategy is to insert fences before and after each read and write, then elide them if you can prove they aren’t needed. Would be super slow by default. (See the sketch at the end of this section.)
  • AB: I don’t know about the buckets of the formats, and how to group them as exposed features. Which formats can be write, and which can be read-write.
  • MM: High level bit is, the answer is “yes, the group is interested in this”. (Proceed to detailed proposal).
  • SW: one point - on the older APIs you can bind an rgba8unorm texture to e.g. r32uint and do pack/unpack in your shader so you can implement custom blending modes even without rgba8unorm storage textures.
  • MM: tier 1 - can be deployed everywhere. Later ones need explicit querying.
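  • Sketch: what the shader side could look like, assuming a hypothetical read_write access mode in WGSL (not in v1); the comment notes the conservative Metal codegen strategy AB describes:

      const module = device.createShaderModule({
        code: `
          @group(0) @binding(0)
          var tex : texture_storage_2d<r32float, read_write>; // hypothetical access mode

          @compute @workgroup_size(8, 8)
          fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
            let v = textureLoad(tex, vec2<i32>(gid.xy));
            // Conservative MSL codegen would emit tex.fence() between this load
            // and the store below, then elide fences it can prove unnecessary.
            textureStore(tex, vec2<i32>(gid.xy), v * 2.0);
          }
        `,
      });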

[Post-V1] Drop support for macOS 10.12? #3238

  • KN: About dropping support for macOS 10.12. 10.13 also introduces argument buffers. We said a few months ago we didn’t want to drop support for 10.12 because arg buffers alone weren’t enough to motivate it. But together it’s probably enough motivation to do so.
  • KG: Firefox 10.12 activation's so low I'm not worried about it for WebGPU.

[Post-V1] Add rgb10a2uint texture format #3841

  • CW: seems to be available everywhere. Didn't check Vulkan 1.0 spec though.
  • JB: Kai, you have a correspondence doc. Is it listed there?
  • KN: Yes, it should, but that data isn’t there yet.
  • KG: put it in core, and if it doesn't work, take it out? Try it.
  • CW: in Milestone 2?
  • KG: yes.

[V1] Validation for resolveQuerySet() doesn't require the query to have already been written to #3812

  • CW: accidentally made non-editorial?
  • CW: Dawn does a compute shader pass for timestamp unit conversion, and if a query hasn’t been hit/isn’t ready, we just write zeros.
  • MM: Why not error?
  • CW: If error, your frame is gone.
  • MM: Ok
  • Resolved: Write zeros, don’t error.
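  • Sketch of the resolved behavior: resolving query indices that were never written yields zeros rather than a validation error (assumes the "timestamp-query" feature is enabled):

      const querySet = device.createQuerySet({ type: "timestamp", count: 2 });
      const resolveBuffer = device.createBuffer({
        size: 2 * 8, // one 64-bit value per query
        usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
      });
      const encoder = device.createCommandEncoder();
      // Neither query was written; both slots resolve to 0 instead of erroring.
      encoder.resolveQuerySet(querySet, 0, 2, resolveBuffer, 0);
      device.queue.submit([encoder.finish()]);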

Support for arrays of textures #822

  • MM: probably a good idea. Array of 1024 textures probably won't fit (first example in the issue). Feature seems useful.
  • CW: in terms of API size - in the BGL entry, in addition to the binding number, we have an array size, defaulting to 1. If specified, it means the entry takes that many consecutive binding numbers: N, N+1, …, N + arraySize - 1. (See the sketch at the end of this section.)
  • CW: WGSL side - probably have a type. "binding array".
  • MM: we have texture array already. Want to make this "array of texture".
  • AB: That’s what he’s saying. It’s a new kind of binding_array where template parameter is the texture type.
  • CW: a texture array is one memory layout holding multiple layers that you can sample. This is different: multiple textures that are independent.
  • MM: why can't this be a normal WGSL array? Today we write "var foo : texture_2d<f32>". Can't we just say ": array<texture_2d<f32>>"?
  • AB: Presumably we want the same solution for bindless buffers, and we already have a top-level array there. And then it would be ambiguous about whether it’s an array of bindings or a single binding of an array of data.
  • MM: fair.
  • CW: bindless array can be a new type, if we decide to go this way.
  • MM: possible to do this with no API impact, just shading language feature.
  • KN: was going to suggest that.
  • KR: Do all WebGPU GPUs support dynamic indexing of textures at the bottommost level?
  • CW: All support dynamically uniform indexing; Vulkan enabling feature has 97% coverage. The one that doesn’t is lavapipe.
  • CW: Reason we need an API change is that in Vulkan you need to pass the array size, and that needs to match the shader. And if you say binding N, array size N, it’s not bindings N through N+k, it’s got a single binding slot with an extra array size all fitting in that one binding slot.
  • MM: Why not have the two-dimensional design?
  • AB: GLSL does the 2d indexing, SPIR-V needs a one-dimensional array. We could do it as an optional.
  • MM: Understand it does the flattening but not the motivation.
  • AB: Vulkan requires it one-dimensional, but it’s flattened by the compiler to map to the 1d model.
  • MM: Trying to understand the pros and cons, if we could understand the motivation.
  • AB: One issue in SPIR-V is there is no distinct array type for this. And so there are a bunch of odd-looking rules. But it does have the Block-decorated struct. And then verbiage about what arrayness level you need to peel before rules start kicking in.
  • MM: Sounds like the group likes the idea generally, but it interacts with bindless.
  • CW: Don’t need to necessarily know how the two co-exist. Can decide how to make bindless not interact.
  • MM: Not sure that’s good. May have regrets later.
  • AB: Agree you want a sketch of what we want bindless to look like.
  • CW: Connor will talk about one way it could be done.
  • CW: Sounds like the group is happy with direction.
  • MM: Don’t need hardware support for this. It can go everywhere. That’s good.
  • CW: Yes, it can. Into milestone 2.
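  • Sketch of the direction discussed, with a hypothetical arraySize field on the BGL entry and a hypothetical WGSL binding_array type (neither is in the v1 spec; the names are placeholders from the discussion):

      const bgl = device.createBindGroupLayout({
        entries: [{
          binding: 0,
          visibility: GPUShaderStage.FRAGMENT,
          texture: {},
          arraySize: 4, // hypothetical: occupies binding numbers 0..3
        }],
      });
      const wgsl = `
        @group(0) @binding(0) var textures : binding_array<texture_2d<f32>, 4>;
        @group(0) @binding(4) var samp : sampler; // next free binding number
      `;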

Guest block

Vello, compute-based path rasterization on WebGPU [Raph]

Slides: https://docs.google.com/presentation/d/1YVtmNdlG72jbkLFvV-Xnnla6oNF4VAno9ZLvSAOYElg/edit?usp=sharing

  • Raph Levien (RL):
    • Vello
      • github.com/linebender/vello
      • Research project for next-generation 2D renderer. Forward looking.
      • High quality high perf.
      • Offload to GPU as much as possible.
    • Main pipeline is 14 compute shaders
      • 3300 lines of WGSL
      • Bevy-style templating
      • Runs natively on wgpu, including Android
      • Runs on Chrome Canary
      • Prev iteration GLSL → SPIR-V → SPIRV-Cross
  • JB: Q: What’s a “conflation artifact”.
  • RL: A: complicated answer. Inaccurate rendering when winding numbers add up just kinda wrong. (?)
  • RL: Good interactive response even for challenging scenes.
    • Paris image: 50K paths. Million path segments. M1Max rendering at 120fps. <8ms of GPU.
    • Pixel 6 shows it around 20fps. No optimizations for Android.
  • KR: Q: What happens when paths are packed so tightly you get overlap. E.g. when shrinking text. I kind of saw twinkling. Multisampling with GPU doesn’t solve the problem?
  • RL: Two questions are there: 1. are you seeing conflation artefacts? For text, you generally don’t see that. Don’t think it was in Vello ; think it was the video call. 2. If you have dozens of paths intersecting it computes the right analytical answer. Literature talks about this. Slug does have (?) this problem. But this is a different class of renderer.
  • SW: Is there any caching going on from frame to frame? Any CPU transfers per frame?
  • RL: The Paris map is parsed from XML and encoded in the Vello packed representation: dense and efficient for parallel operation. That’s done once. Thereafter you can apply transforms almost trivially; the scene is not re-encoded. About 12MB for Paris map, down from 14?(40?)MB from original. Thereafter it’s completely dynamic. We don’t have animations in this demo.
  • SW: If you had a path where the control points varying by animation, you’d have to re-encode to the Vello representation, then upload that again.
  • RL: That would be like Lottie animations. We have a PR to do the Lottie → Vello packed form, to upload. Research area to do animations directly on the GPU.
  • SW: Interactive animations means you have to upload again.
  • CW: The point is the encoding is usually once only. But common frame rate things are all cheap and on the GPU: transform and clip, etc.
  • RL: Right, otherwise you wouldn’t have interactivity of complex scenes on the phone.
  • RL: The “cheapness” is relative. The encoding is cheap, but the GPU is so fast at rendering that the dominant part of drawing time is encoding time. Big research area is to get the encoding “right” for max performance.
  • RL: Did switch from GLSL to WGSL over end of 22/start of 23.
  • RL: WebGPU good things
    • Higher development velocity, lower friction
      • Much easier to write shader code.
      • Tools are getting better. Using wgsl-analyzer.
    • Performance is quite good.
    • Working in safe Rust is much nicer.
  • RL: WebGPU mixed
    • Rough edges still
    • Get to fix bugs for the entire ecosystem
    • Lack of “advanced” features is a creative constraint
      • Adopted WebGPU minimum as a baseline for compatibility
      • More than once, creative solution to work around lack of subgroups / device-scoped barrier.
  • RL: WebGPU bad
    • Lack of advanced features limits research
      • The earlier GLSL prototype let us explore subgroups, memory model, etc.
    • Runtime is expensive
      • Slow startup on Android (still exploring)
      • MM: Is that app startup or WebGPU startup?
      • RL: … (didn’t get the answer)
  • RL: Wishlist
    • Precompiled shaders (not on the web)
    • Descriptor indexing
      • So compute-based render can access N images
    • Subgroups
      • More important for sparse rendering techniques
    • Device-scoped barrier
      • Allows single-pass prefix sum; not available on Metal <= 3

Q&A

  • AB: What subgroup functionality would you want? If we had to break it up.
  • RL: Main thing is “shuffle”, if they are in different buckets.
  • RL: Really want to have subgroup size control. Exists in Vulkan. Limited feature in DX12 SM6.6. Metal does not have it. I want to write up a more detailed request to describe it. Related to prefix sum.
  • MM: Can you show the “things we don’t do well” slide.
  • KG: What do you want for precompiled shaders?
    • More seriously, I’d really want issue numbers. :)
  • CW: Precompiled shaders is a question about runtimes to handle. Not a web platform thing.
  • KG: Think it’s more complicated than that. This is why I want to capture it in an issue. Think there are ways to get you to a better place, that are not a security nightmare.
  • RL: There are some open issues against wgpu for this. Yeah, not totally out of line to talk about this. Something Dawn could do.
  • RL: To be clear, I’m not talking about precompiled shaders on the web. I’m talking about the heaviness of an application that has to have an online compiler to the native platform.
  • MM: To be clear, this is not a concern for a web browser.
  • RL: Right.
  • KG: When we triage what we spend this W3C group’s time on, it’s for the web.
  • RL: Absolutely.
  • RL: the other three issues would be very suitable and valuable on the web. Think the barrier is the simplest.
  • KR: Have you done quality comparisons of text rendering vs. the Saffron Type System, especially at small point sizes. Used to test against it. E.g. multiple small fragments.
  • RL: Quality is one of the key goals. We have analytic algorithm. Doesn’t have problems of HW multisampling. In future, work on several areas. Will apply stem thickening in a compute shader. That’s challenging. Excited about doing accurate offset calculation.
      1. Many algorithms do the alpha mask in the wrong colour space. You get artefacts. If you look at (?) you get spindly results, not as visually appealing as the ones calculated in the wrong gamma space. Between the two, think we can get the text really good. But not necessarily as good as TrueType. Will need hinting to get there.
    • This is research. Q is how far can you get text quality and performance.
  • JP: Do you have much work to get the WGSL portable between wgpu and Dawn/Tint?
  • RL: Biggest problem is ‘const’ / ‘let’ handling. Have a fudge right now.
  • RL: Under other rough edges, I would add handling of constants. Wgpu does not handle pipeline overrides yet.
  • RL: Overall, the amount of work to get it running on the web is pretty modest. Surprisingly pleasant experience. Will be smoother as Naga approaches the spec more closely.
  • RC: will this make its way to Skia and elsewhere in Chromium?
  • RL: would hope so, but no promises

Bindless in WGPU

slides

  • CF:
    • Prototype in wgsl
    • Supported on underlying platforms that support it.
    • var blah: binding_array<texture_2d<f32>>;
    • Bind group layout entry has a count parameter. Don’t write it in the shader.
    • Benefit is you don’t have to rebind things on the CPU as you go. Bind once at the beginning. Then later decide what to use.
    • As usage scope, behaves identically to N different bindings.
    • Uniformity of index matters a lot.
      • Most vendors emit a scalarization loop. Only some (NV and QCOM?) natively support non-uniform indexing.
      • In GLSL/HLSL/MSL must annotate nonuniform index with something. Otherwise scalarization loop won’t be emitted.
        • array[nonuniformEXT(index)]
      • Our WGSL extension does not have that requirement. Leverage the uniformity information we’ve already generated.
    • Shaders don’t need to know the length of the array.
      • Naga generates the right code/constant in the shader by looking at the binding array length.
    • Partial binding is really useful. Don’t bind all the slots to the pipeline. (See the sketch after this list.)
      • Has wide support. VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT; dx12 resource tier 3; all Metal (think so; haven’t seen documentation saying otherwise)
        • Intel hangs windows/vk
      • Think this should be a separate feature bit since limited reach.
    • Bounds / residency checking currently unimplemented in wgpu/naga
      • Without partial binding, trivial bounds check vs. array length
      • With partial binding, will send occupancy info to check against. Bit vector is cheap enough compared to what it’s tracking.
      • Wgpu will likely do the latter in all cases.
    • Limits
      • Devices with bindless support ~500K bindings
      • intel/windows/vk supports 1800
        • Want to ignore this combination. Hope to use intel/windows over dx12 instead.
      • Metal supports 80 without argument buffers.
    • Wgpu doesn’t use argument buffers currently
      • Current plan is to use one arg buffer per bind group because of macOS limitations.
      • KG: Good news: we decided to only support 10.13 and later.
      • CF: Yay
    • intel/linux/vk supports 65K
    • intel/dx12 is tier 3, and supports 1 million+
    • How to bucket the features: Have thought about this.
    • Really looking forward to having this. It’s really useful.
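  • Sketch of the partial-binding pattern from the slides, written as hypothetical WebGPU-flavored JS (wgpu exposes this natively; arraySize and sparsely populated bind groups are not in the WebGPU spec, and myTextures is a placeholder):

      // Size the layout near the device limit, capped (CF mentions 160K).
      const hugeBGL = device.createBindGroupLayout({
        entries: [{
          binding: 0,
          visibility: GPUShaderStage.FRAGMENT,
          texture: {},
          arraySize: 160000, // hypothetical
        }],
      });
      // Populate only a prefix of the slots; reads from unbound slots return
      // zero after the occupancy/bounds check described above.
      const bindGroup = device.createBindGroup({
        layout: hugeBGL,
        entries: myTextures.map((tex, i) => ({ binding: i, resource: tex.createView() })),
      });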
  • MM: Q: How about residency? One of the benefits of bindless is you can have more resources in an arg buffer than can fit on the GPU. Metal Command encoder has to specify which resources are resident. DX12 has something like that. But how does that fit. Does it map conceptually to your partial binding concept?
  • CF: (?) Partial binding with holes already has bitmap of current state. Can update that. Would need more than one buffer: which ones are bound, and which ones are resident. Using the same bind group across multiple queues might get racy. That would need a whole proposal and careful consideration. Haven’t seen much need for it, particularly right now. I know the need exists. Haven’t been asked to do that, since a workaround is available: make a new bind group.
  • RC: Residency is not just about when your app runs low on memory. Also an issue when other apps need to run on the GPU. When the other app needs to run, the OS can pick something to evict. May not be the thing you wanted to evict. Experience/experiments shows that things run much better if there is cooperation with the OS.
  • CF: Don’t know how to proceed.
  • RC: May get developer to supply info about which are evictable.
  • MM: Think we should only proceed with bindless if we have a good story for that.
  • RL: Vulkan has 3 mechanisms. Descriptor indexing, buffer device address, and (?). Which of those do you use. If that’s not available, how do you downgrade.
  • CF: We use descriptor indexing. We could use buffer-device-address. It’s flexible but horrendously unsafe (risky). By the time we implement this API on BDA, everything should be able to support descriptor indexing. BDA is a potential solution, but doesn’t give you textures. From what I can tell, textures are what you want. For the buffer use case, there are many more ways to work around it.
  • CF: Caveat this: I’m not a gamedev. Am self-taught in this space.
  • MM: Q: Regarding the feature buckets you’ve investigated: What about apps that use wgpu. Do they have 5 different code paths to adapt? How do they approach?
  • CF: That depends on the app. For my own renderer it’s binary: either has all bindless features the app may ever want, or fall back to none. And it happens it doesn’t care about buffer_binding_array.
  • CF: For Naga, we don’t want to make that decision for developers. We want to illustrate the underlying constraints. Give an indication of what kinds of hardware is out there.
  • CW: Thanks, this helps a lot. Q: A binding_array binding has one type of descriptor in it. Have you had developers who wanted mixed kinds of descriptors in the same binding array?
  • CF: Not with an actual use case. People have complained about it, but not with a thing they couldn’t do by just having two binding arrays.
  • CW: I was thinking the availability bit was expanded: not available, which kind is available.
  • CW: Good to know it’s not super-useful.
  • CW: Q: If I try to access a texture that’s not bound in a partial scenario. What happens.
  • CF: We would give you a zero. The bounds check guards the texture read itself. So the read result is as if it’s all zeros.
  • JB: How does that work when you pass the texture to a helper function?
  • CF: Not handled. Wasn’t sure about whether it was supported in SPIR-V to pass a pointer to array element to a helper function. So didn’t address it.
  • CW: Alternate, could redirect to the 0th texture.
  • MM: Q: Partially bound arrays of resources is something I'm not familiar with. Assume the API sets up a big array of 10k elements. When you bind it, on the API side, you say bind indices in range 34-500. How does it get used in the shader?
  • CF: In the underlying platform shader, you have the max size encoded. Then if you don’t dynamically use it, then it’s fine. Junk in those slots but you don’t care. In the API, we let you bind a prefix of the array.
  • MM: So setBindGroup takes more parameters.
  • CF: no, create another bind group. Create a bind group layout that is huge. When you create a bind group, you say which parts are populated, and it’s potentially sparse. I set the bind group layout size to the device limit, capped to 160K. Then the bind group is however many textures I have, and it populates the initial part of that bind group layout.
  • KN: You said partial binding requires hardware support. Is that something we could polyfill. E.g. fill it with some null texture?
  • CF: You can. Users can do that. Trouble is if someone makes a bind group layout with 1 million descriptors, then having to walk that takes forever. And driver can take a long time.
  • MM: It’s also hard for us. Have rules about not aliasing read-write and read-only.
  • CF: Read-only read-writable property is only set at the top level.
  • KN: So browser polyfilling it can get the aliasing wrong.
  • MM: But the browser can know that it has done it and skip that check.
  • KN: Right. Seems a lot more tractable.
  • KR: What does the data type look like?
  • CF: var<storage, read_write> blah: binding_array<Foo>;
  • KR: I see the bounds check problem. Different elements of that binding array can have different sizes, so you need extra bounds checks.
  • KR: In days past I thought you could solve this using fat pointers: the real physical address, plus the size.
  • CW: Many ways to solve it.
  • MM: var<storage, read_write> blah: binding_array<Foo> suggests that the binding_array itself is writable. Can do that on Metal. But maybe that isn’t what you meant.
  • CF: Can take 1ms to create a group of 500K. Some complain about it, saying that it’s really slow; most have not.
  • MM: Need a story for it to be fast, usable to create bind group resources throughout the lifetime of the app.
  • KR: Have you done cool stuff with it?
  • CF: Yes, I gave a talk recently: https://www.youtube.com/live/63dnzjw4azI?feature=share&t=3700
  • RC: Question for Raph. You work at Google. Is your research going to land elsewhere in Chromium? E.g. Skia?
  • RL: Strategic question. I hope so.

Other future features/topics

  • Raytracing #535 [Kelsey]
  • Cannot upload/download to UMA storage buffer without an unnecessary copy and unnecessary memory use #2388 [Corentin]
  • WebGPU + WebCodec #1380 #2498 [Shaobo]
  • Variable rasterization rate [Myles]
  • Mesh shaders [Myles]
  • Upscalers [Myles]
  • mapSync - Motivating use cases and experiment results #2217 [Brandon]

Upscalers

  • MM: How does the group feel about an “upscaling feature”? Three are interesting: DLSS, MetalFX, and the AMD one (FSR).
  • CW: And XeSS from Intel.
  • MM: The way these work is, they are often closed source, which is interesting. They are work-scheduled on the GPU timeline. Could be an encoder or a command on the command buffer. You annotate the stuff you already rendered with “this is a color buffer”, “this is a motion texture”, “this is a depth texture”, then you say “go do the upscaling”, and the result is an upscaled texture. (See the sketch at the end of this section.)
  • MM: This is important for two reasons. 1. The cost of performing the upscaling is cheaper than the time it takes to render the normal scene at high resolution, and results are good (sometimes comparable, sometimes better). 2. For features like raytracing, it lets you get good quality for the time budget.
  • KG: Think it’s worth investigating this. The APIs we need have been fleshed out some amount. Folks aren’t immediately asking us for it, but it could be an obvious layup, a nice gift.
  • MM: The author’s app renders to an offscreen set of textures, annotate those, then say “please blow them up”.
  • AB: Feels a bit different. Every vendor has their own solution. Do you have a sketch of how that would be unified. AMD and NVIDIA’s work quite differently.
  • KG: That’s an output of the investigation.
  • CW: Semantics are not the same. We’re trying to be an API with high reproducibility. These things are inherently different from each other. FSR, FSR2 are open source. It could be interesting to have that ported to WebGPU and WGSL, as a user-space library.
  • MM: Some of those 4 implementations are implemented in terms of ML primitives. So may want an optimized ML path. Maybe WebNN integration? May not want to pay the cost of downloading the model.
  • KN: could we put this outside of WebGPU and say “make the canvas pretty”. Why as part of WebGPU?
  • MM: The reason it’s in the middle is it’s in the middle of the native processing flow. The HUD is rendered in full resolution.
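  • Sketch of the usage pattern MM describes, as purely hypothetical WebGPU-flavored JS (no such API exists or has been proposed; every name here is a placeholder):

      // Annotate already-rendered resources, then ask for the upscale.
      const upscaled = encoder.upscale({
        color: colorTexture.createView(),   // "this is a color buffer"
        motion: motionTexture.createView(), // "this is a motion texture"
        depth: depthTexture.createView(),   // "this is a depth texture"
        outputSize: [2560, 1440],
      });
      // The full-resolution HUD would then be composited on top of `upscaled`.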

Variable rasterization rate

  • MM: VRR. The fragment shader can be run with fewer invocations than there are pixels on the screen. The interesting part is “how much fewer”. D3D12 has 3 ways, Metal has 1 way, Vulkan has ? ways. Goal is performance. Do less work. Reason this is web-exposed: if you do it wrong it looks awful.
  • RC: Is the intersection of these things tractable?
  • MM: No.
  • BJ: Has an interaction with headset XR. Once we get to that integration, would be a nice complementary feature.
  • MM: In D3D the rate is either set in the draw call, or in the shader, or in a texture.
  • MM: Of the analysis I did, I think there is a common subset.
  • KG: Worth investigating.

Mesh shaders

  • KG: Two new shader types. Replaces the geometry part of the pipeline.
  • MM: In mesh shaders, no vertex attributes. Based on compute. Every workgroup creates a “meshlet”. Another shader tells you how many mesh shader workgroups to run. The combination is useful because you can have variable-sized work. Dynamic data can affect how much geometry to generate to process in the 2nd stage. Would not need tessellation shaders. The other benefit over vertex shaders is more influence between triangles. In vertex shaders each vertex is independent. But in meshlets they can influence each other. Works on modern GPUs.
  • CW: cool tech. In Vulkan - mobile GPUs (are Apple GPUs considered mobile?), not sure if they can do mesh shading at all. In Vk it's an EXT_ extension. Multi-vendor support? Concerned, it's cool tech, not coalesced yet, hard to make WebGPU optional feature for it while allowing other mesh shading formulations in the future.
  • JB: WebGPU contributor's played with mesh shaders.
  • KG: what'll it look like on Android? That's the main question. Not too worried about the impl details.
  • BC: can you polyfill this?
  • CW: yes, at huge cost. Extract all the dispatchMeshlet into prepass. Run the pass. Then a draw which references the generated buffers. Would be much slower than vertex/fragment shaders
  • MM: so, slight thumbs down for WebGPU V2 or V3?
  • KG: if tomorrow, mesh shaders worked perfectly on Android, we'd probably do them instantly. Not thumbs down on the idea, but we have a blocker.

Sparse textures

  • MM: apps don't use most of their textures' texels. Nice if you can split up textures, make only some sub-tiles resident.
  • KG: concerned about efficient impl. Something that prevents a shader from accessing a part not filled in?
  • MM: doesn't prevent it. Some HW, you do that, get undefined behavior. Some, you get well-defined behavior.
  • CW: you get query in shader, is this texel resident? You can ask that.
  • MM: software bounds checking.
  • KG: probably efficient enough?
  • MM: remains to be seen.
  • CW: with bindless, what's the concrete use case for this? Do games do this?
  • BJ: fairly important part of Unreal Engine's Nanite system?
  • CW: don't think so
  • BJ: virtualized geometry, but I thought also virtualized textures.
  • KG: can ask them.
  • Question marks.

Ray tracing

  • KG: depends on bindless. Discuss later, when we figure out bindless.
  • CW: different levels of ray tracing. wgpu gained ray tracing recently - "traceRay". This is level 0. You want to dispatch rays from the shader, and call continuation shaders depending on what it intersects. Fun. Like "bindless shaders". Almost a virtual call in shaders.

mapSync - Motivating use cases and experiment results #2217

  • BJ: presented the slides above
  • KR: in addition to this experiment, we have partners doing ML who want to use WebGPU and have to get the results back on CPU. Only solution for them is mapSync on worker. Don’t know how else to do this. Strongly advocate for this in V2.
  • BJ: V2 then?
  • CW: there's user desire for this. What happens to other callbacks that executed while you were waiting? What happens to the rest of the timeline?
  • BJ: doesn’t need to exist in dawn native necessarily. Nice to have. Plenty of examples on the web of cases that block against guidance. All promises and callbacks queue up and wait. Not the best way forward, but worth making an informed decision to block for it. In a web worker or worklet, for example. Would not push for exposing this on the main thread.
  • KG: you want to block for buffers, not fences?
  • BJ: specifically buffers because what everyone is interested in is getting the result of an operation back to the CPU. If I could block on a fence - that's nice - but I need to get the data out of a buffer. Wait on a fence after a mapAsync, and once that returns, the data'll be in a buffer I passed to mapAsync? Seems more complicated than "let me map the buffer synchronously".
  • CW: what happens to timeline? Was problem - buffer mapping callback supposed to be onSubmittedWorkDone callback. (Was dependency in the other direction?)
  • BJ: shouldn't affect the device or queue timeline. Only content.
  • CW: what happens to rest of Promises? Semantically, like waiting for the mapAsync promises. Decorrelated now? Maybe fine?
  • CW: what about Wasm JSPI? Can we optimize mapAsync to be as fast as mapSync if we use Wasm JSPI?
  • KR: not sure we can. Brandon's test cases were pure JS. Shows that "await mapAsync" is not nearly as fast as "mapSync".
  • <discussion>
  • KG: can we not do better as UAs?
  • BJ: if you want to produce results synchronously, with async version, have to wait for next microtask boundary. Still have missed opportunity to return that buffer. Have to wait for callback to come back around. Baseline added latency of 30 / 60+ ms depending on how big the buffer is. Agree we should push to resolve Promises as quickly as possible, but you often add X latency inevitably.
  • KG: not convinced all of that follows. Not saying you're wrong, but I'm not convinced.
  • CW: how do we get to something convincing? How do we convince ourselves that the UA can do better? Obviously - we try to optimize things, we profile. How do we get to the point where we've investigated enough?
  • KG: need to sit down with an example and take it apart.
  • GT: Brandon's example, you'll have to redesign the audio API. You have to produce the data synchronously.
  • KG: that's your design constraint. You have to preflight, have to have your data ahead of time.
  • CW / BJ: you can't.
  • BJ: it's audio coming in from a stream, microphone, etc. Playback of music file - you could pull off some tricks, do readahead. Know your system latency. But coming from mic, etc. - you don't have that. Talking with WebAudio engineers, like Hongchan, for a system doing audio processing for musicians, they start getting annoyed if latency is > 10 ms.
  • KG: understood. You get some callback with 30 ms of data from a mic.
  • BJ: yes. There's an implication that some buffering's going on. Don't want to add 30 ms on top of that. If we can avoid it.
  • KG: I'm most curious about - what is it that causes async option to be worse than sync option? Know it's vague.
  • BJ: just looking at - without the callbacks in the way - I'll do a mapAsync & mapSync at the same time (can't do that - pretend) - they don't resolve that much differently from each other. Some latency from mapAsync - it waits for other tasks to process. Interrupted by keyboard, message, Fetch, etc. Could be preempted. In my experiments, mapSync resolves in 2 ms, mapAsync resolves in 6. Not a huge gap. Problem is - in a lot of cases, if you return control to the browser at all, you've lost control of when you can finish the work you're trying to do. Maybe fine in some or most cases. But we have some cases where partners say we can't return to the browser and get back to us at some unknown point. (See the sketch at the end of this section.)
  • KG: better color, thanks.
  • BC: no way you can resume execution quickly on some sort of event?
  • KG: imagining: if you did mapAsync inside a Promise callback marked "realtime" - like, audio callback's marked realtime - and we await something in that?
  • BJ: there's a world where we could restructure JS to allow for that. Going out on a limb, I think that's a harder problem than adding a function to our API.
  • AE: is there another option instead where Web Audio's callback lets you return a Promise? It seems like the latency between mapAsync and mapSync is close - or they can be made very close. The issue is that Web Audio's callback doesn't let you await something since it demands the data immediately. But, past the JavaScript level, the browser probably doesn't care. It’s going to wait for the data either synchronously or asynchronously. So, if instead the callback lets you return a Promise… that could solve this particular problem?
  • BJ: maybe. Not sure why Web Audio team designed it the way it did.
  • KG: they understand their constraints. Maybe we found a use case compelling to them.
  • BJ: certainly worth going to Web Audio team, this might be an interesting use case, maybe you could do an async version. Can follow up with them. Also doesn't encompass the totality of the use cases. This is just the one I tested because the latency requirements are high.
  • KG: these experiments are good, but showing mapSync is 10x better than mapAsync is disingenuous.
  • BJ: tried hard to not say that specifically, but understand where you're coming from.
  • MM: normally the way mapping works is you have premapped things and you don't map something synchronously. Why doesn't that work here?
  • KR: it's for readbacks - that doesn't work. At least, you'll add many sample-frames of latency.
  • SW: your scenario works for uploads. Here we're waiting for readbacks.
  • MM: well, don't do that. :) We have a mechanism for that, it's called await.
  • BJ: showing that that doesn't work well.
  • MM: well, we're showing that we're already adding latency.
  • BJ: issue is, you miss your edge, the next opportunity you get is too late.
  • KG: I think the amount due to "async" - yes, you can miss that edge - shouldn't happen, basically. Something else is happening in the way the API is structured. Wait for WebGPU's result, hand it back to something else. That takes some amount of time to send the data to the GPU, work on it, get it back. ~the same time for mapSync. Doing it with mapAsync adds overhead, we try to drive it to zero. How bad is just that latency, really? Sounds like it's pretty low, but we might have friction with other APIs making it disproportionately likely to miss frames.
  • BJ: the API, or application use cases.
  • KG: if built-in browser asyncify (JSPI) removes some of the runtime hits you get, and the difference is reduced to the gap between doing it sync and doing it async, and that gap's only 2 ms - becomes less important to fix this.
  • BJ: less important for people using Wasm, yes.
  • KG: even for JS folks. If await only adds 2 ms…
  • BJ: depends on requirements for the data. Also - pointed this out on an issue some time ago - worth exhuming - we're dealing with a reality where there's a synchronous readback on the web today in OpenGL. You can construct a Rube Goldberg machine of terrible copies. Buffer -> texture -> WebGL canvas -> ReadPixels. I wrote a library for this. Said, I hope nobody uses it. Couple weeks ago - had an internal partner, how do we deal with this problem? Had a Google Doc, and one of their proposals was “use my terrible library”. I said, “no!” But the inclination is there. The only viable way to do this.
  • KG: want more details of use cases
  • KR: we can't go into detail but we have internal partners running ML workloads on the GPU using WebGPU and they simply must get the data back on the CPU.
  • CW: we tried to convince them to move all their data to the GPU
  • KG: these terrible escape hatches are sometimes terrible on purpose. Asking people, can you come up with a different way to do this? Generally better behavior.
  • KR: talked about internal use case more. We could do the entire prototype, but only if the CG agrees that if it solves the customer's problem, we'll put it in. Don't want to throw away all the engineering.
  • BC: you can't poll?
  • BJ: pollable, no. Can't get the buffer back until the Promise resolves.
  • KG: that could be before your await() / callback is hit.
  • BJ: you're saying, if I put a "while true" loop, and check GPUBuffer.isMapped without returning to the browser, that bit'll flip?
  • MM: no.
  • KG: I think no.
  • BJ: then no.
  • BC: can you roll your own sync function?
  • KG: can't do that, but - if you're enqueueing other microtasks continuously.
  • KN: no. State change is when it resolves.
  • <discussion, including about Promise consistency>
  • RC: I don't want mapSync on the main thread, but OK with it on a worker.
  • KG: people don't exhaust every possibility before coming to us.
  • KR: we pushed back hard against the internal partner but they convinced us with the complexity of their use case. They migrate data back and forth between CPU and GPU. Works OK on native platforms. Won't work with async and JSPI. We got them to implement async readbacks in their WebGL 2.0 code path and they don't use it. The latency's too high, waiting for the next frame or two for the data to come back.
  • <discussion>
  • MM: only on web workers?
  • KR: yes, we would only push to expose this on web workers or on worklets. Definitely not on the main thread.
  • <discussion about worklets>
  • BJ: worklets are intended to do as little work as possible. Very short. Have access to other parts of the browser. Smaller scale. Part of the worklet - you have to do a lot of small tasks. They're off the main thread - if you do need to do longer tasks, they don't block. Here you'd block the audio thread.
  • JB: thought they were off main thread so main thread wouldn't affect them.
  • BJ: yes, works both ways. Not giving up, myself, and just putting in mapSync. I'm happy to put a best practices doc together, why you shouldn't use mapSync.
  • <discussion about internal partner>
  • KG: not vociferously opposed to having this in a worker. But want it clear why we're doing it, and what the benefits are. Don't want slides saying mapSync is 20-30 ms better than mapAsync. Doesn't match what we agreed to.
  • MM: what would it take to change others' minds?
  • BJ: on our side - having, at least, a couple partners find and successfully implement alternate paths, and showing it wasn't an adverse thing for their app. Large request, somewhat out of our control.
  • MM: only you can do that; somewhat unsatisfying answer.
  • BJ: understood.
  • CW: what if using JSPI you can get the same latency as mapSync?
  • BJ: think that'd go a long way. Doesn't satisfy JS-only script cases like script node. Maybe multi-prong approach.
  • CW: what would satisfy other folks that this is needed?
  • KG: other concerns with ReadPixels-based solutions. Different behavior in different browsers because we have so much built up around them. Can be frustrating to have to ensure we have good enough sync readback performance, even with different browser architectures. Maybe latency's better in Chrome when I use ReadPixels. Not compelling to say the spec needs ReadPixels here. Sometimes people add these things for Browser A and Browser B gets a lot slower. In the past, we were doing CPU-side texture data upload. Sync CPU upload point, sending to GPU, sync download point - slower than having pipelining on both sides. Maybe latency gets better in one browser, but framerate gets worse in another browser.
  • KG: another - showing more precisely what it is that the solution's required for. If API combo makes async completely intolerable - then, yes, maybe use the escape hatch of mapSync.
  • KG: if everyone else is comfortable with it - mapSync in workers is probably tolerable. Want people to feel bad about it, and know that the gap between it and a well-working async solution should be small. Sad that that nuance will be lost.
  • MM: I'd like to see - some sample content that runs in a worker or worklet that mapAsync (would have to implement mapSync to do this), same work happening, but timeline's more compressed in one case or the other. See in Chromium for example.
  • BJ: could probably do that. Revive my patch, make something more reproducible. Maybe in Audio Worklet.
  • MM: would that sort of thing help?
  • KG: it would. For workers, I don't need much convincing.
  • MM: GPU does same work either way. Latency of posting a task, vs. not. It's a state of nature. Can be measured. If true that posting a task's significantly slower - then I think your argument makes sense.
  • KR: does this make sense to prototype doing this in AudioWorklet?
  • BJ: need to do more research. Think it was feasible, just more time than I wanted.
  • KR: we could plausibly do the example with a few sample-frames buffering.
  • MM: yes, producer/consumer problem.
  • KG: yes, this is what I was talking about with the Web Audio group. Why do we need to give back the data immediately?
  • MM: maybe they didn't consider GPU processing of audio samples.
  • JB: GPUs aren't good at being preempted. Audio processing where quality's sensitive to latency, where at any moment the shader might take 20 ms to finish what it was doing - seems inherently flawed. Even if all userspace work was fine. Don't know when GPU will be ready to start your work.
  • BJ: had several conversations about this in last W3C meeting in Vancouver. Talked with Web Audio folks. Idea of processing audio on the GPU - expressed the same skepticism. There are several very large very expensive pieces of well-known industry audio processing software that are implemented primarily in Cg. :)
  • JB: so they own the GPU. In the web browser they don't. It competes with FF's own animation.
  • BJ: valid points. I want to stress that this happens to be the environment I tried my experiment in. Similar-ish things coming over. AnimationWorklet. Seems more appropriate for the GPU. Already willing to accept inherent nature of GPU while drawing things. Maybe an area where we need more research? AnimationWorklet based off Canvas 2D, correct? Can we let it work with WebGL/WebGPU too?
  • MM: we don't need to wait for animation to be done like we do with audio.
  • CW: I think we can agree to do more investigation on our side.
  • MM: AI to talk with Web Audio WG and talk about Promises?
  • KG: I'll talk with Paul at Mozilla.
  • BJ: I can talk with Hongchan.
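  • Sketch of the two readback shapes under discussion; mapAsync is the real API, mapSync is the hypothetical worker-only addition being debated:

      async function readbackAsync(buffer, size) {
        await buffer.mapAsync(GPUMapMode.READ); // yields to the event loop first
        const data = new Float32Array(buffer.getMappedRange(0, size)).slice();
        buffer.unmap();
        return data;
      }
      function readbackSync(buffer, size) {
        buffer.mapSync(GPUMapMode.READ); // hypothetical: blocks the worker thread
        const data = new Float32Array(buffer.getMappedRange(0, size)).slice();
        buffer.unmap();
        return data; // usable immediately, e.g. inside an audio callback
      }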

WebCodecs and WebGPU

  • CW: we believe the API's straightforward. Not many choices about what the API looks like. Take in a VideoFrame. (Sketch below.) This is what we'll put in the Origin Trial. Would like the CG to discuss this in coming months.
  • KG: For importExternalTexture?
  • CW: Yes
  • MM: this is strictly additive?
  • CW: yes.
  • MM: why are you doing an OT? Will every WebGPU feature have one?
  • CW: we expect big WebGPU features will have an OT. E.g., ray tracing. WebCodecs is a feature. If Chrome ships something, we're making a decision on what it looks like - interaction with WebCodecs isn't in the WebGPU spec right now. Not nice. So we're doing an OT.
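  • Sketch of the additive shape described: importExternalTexture accepting a WebCodecs VideoFrame (Origin Trial material at the time, not yet in the spec; someCanvas is a placeholder source):

      const frame = new VideoFrame(someCanvas, { timestamp: 0 });
      const externalTexture = device.importExternalTexture({ source: frame });
      // externalTexture can then be bound like any other external texture.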

Buffer Mapping and UMA

  • MM: nothing to say.
  • KG: I want a solution too.
  • MM: no proposals on the table now. "I think it should work this way" - not productive.
  • CW: as soon as Chromium wraps up V1, we're happy to make a proposal if one isn't already written. We think this is very important.
  • Collaboration! Yay!

Agenda for next meeting

  • Skip next week's meeting.
  • DN: reviews please on outstanding things to land. E.g., review approval from a non-Googler on the diagnostic one.
  • JB: you'll get it.
  • CW: we're out of V1 items now. Sure there will be bug reports. Is this the time when we reduce the meeting cadence?
  • MM: would like to suggest no - how about doing that when multiple meetings end early because we run out of stuff.
  • KG: think will be soon, but happy to wait for that.
  • CW: agree.
  • KN: bit concerned that meetings fill up the entire time.
  • KG: chairs will push to end meetings early.
  • EG: champagne time? 🙂
  • Publish the explainer as a companion Note to the specs?
  • More repos in the github.com/webgpu org?
  • WebGL+WebGPU meetup at GDC

Items for next online WGSL meeting:

  • FYI, notable offline merges:
    • #3859 types, type-generators are ordinary predeclared identifiers, no longer keywords or context-dependent names.
    • #3874 Define "software extensions", and the "requires" directive
    • #3862 Add @must_use attribute
    • #3853 [editorial] Move value constructors and bitcasts to built-in functions
    • #3851 [editorial] Reorganize memory sections

Proposals:

  • #3885 wgsl: enable directive should support a list of enable-extensions