-
Notifications
You must be signed in to change notification settings - Fork 315
Minutes 2022 03 30
Corentin Wallez edited this page Apr 5, 2022
·
1 revision
Chair: Corentin
Scribe: Ken / others
Location: Google Meet
- Administrivia
- CTS Update
- “Tacit Resolution” Queue
- mapSync on Workers - and possibly on the main thread #2217
- Are duplicate timestamp locations allowed on the same pass? #2662
- Streaming implementations and indirect draws/dispatches #2189
- GPURenderBundleEncoderDescriptor needs maxCommandCount if it is to be implementable on MTLIndirectCommandBuffer #2612
- Agenda for next meeting
- Apple
- Myles C. Maxfield
- Google
- Austin Eng
- Ben Clayton
- Brandon Jones
- Corentin Wallez
- Gregg Tavares
- Kai Ninomiya
- Ken Russell
- Mozilla
- Jim Blandy
- Kelsey Gilbert
- Snap
- Corvyn Kusuma
- Dhritiman Sagar
- Sumant Hanumante
- Unity
- Brendan Duncan
- Mehmet Oguz Derin
- Michael Shannon
- CW: we never added a LICENSE to gpuweb/types. Should be BSD 3-clause, W3C version. Just did that.
- CW: Oguz started the internationalization horizontal review for the spec. I started the security review. Still waiting for: TAG, Privacy.
- KN: hackathon a couple weeks ago. Fair amount of work done.
- KN: we will probably make a considered focus getting the validation in the spec, and associated CTS tests, ironed out.
“Tacit Resolution” Queue
- 3 issues this week.
- Should be non-controversial.
- 1st one: dispatch -> dispatchWorkgroups https://github.com/gpuweb/gpuweb/pull/2689
- 2: pushDebugGroup/popDebugGroup on queue. https://github.com/gpuweb/gpuweb/issues/1672 Doesn't seem very crucial. Punt post-V1.
- 3: bug in spec, says you can create 2D array views on 2D multisampled texture. https://github.com/gpuweb/gpuweb/pull/2716 Not allowed in Metal, so disallowing this. The only kind of view you can make in this scenario is a 2D view.
mapSync on Workers - and possibly on the main thread #2217
- KN: 3 possible resolutions:
- Don't add it
- Add mapSync on workers
- Add it everywhere
- Can always be done incrementally.
- A lot of people finding this to be a critical functional regression from WebGL.
- May be sufficient motivation for putting it in 1.0.
- SH: representing Snap here.
- SH: voicing favor for the proposal. Have small presentation discussing why we need it.
- (Sumant's presentation) Link
- Note that the code base in question is ~7 years old
- KR: What if you kept the CPU-side data from an earlier buffer mapping resolution? Keep the data from a previous frame and update when you get the new frame?
- SH: A bit of a hack, you'd get some artifacts / latency. It changes the guarantee inside the engine.
- MS: Just to confirm this is WASM?
- SH: Correct.
- CW: could this pipeline run in a worker?
- SH: yes. We actually want to run this pipeline on a worker, using OffscreenCanvas. We'd like to see broader browser support for OffscreenCanvas.
- SH: our constraints: we have to work on the current frame; and we have to get the data back synchronously. TF.js' dataSync() is the only way we can use WebGPU in our current pipeline.
- CW: if I'm reading the room correctly it sounds like mapSync is palatable on workers. OK with folks if it's there but not on the main thread?
- MM: yes.
- KN: at this point I'm pushing mainly for the worker thread.
- KG: I think we wouldn't want to give you a blocking operation that waits for operations to complete on the main thread. Blocking on GPU execution finishing is something we don't want to do on the main thread.
- MM:
- Point Kelsey just made is a good one, one which I recently realized. If on the main thread you can know your operation won't block, that's great. No problem. With WebGPU's current design , no way to guarantee that.
- Forbidding mapSync on main thread's a good idea for reasons Kai described
- We've looked moderately hard, see no problems with mapSync on a worker.
- GT: brought this up in the issue - understand conceptual level, calling across processes, you don't want to block. Working on readPixels right now, can do that, do some rendering, all in 3 ms. It doesn't block the frame. Example given - processing being done on the CPU - that also blocks the frame - we don't ban that. I can block things in JS just as easily as on the GPU. I understand why the rule's there - but think it's ignoring actual data.
- CW: wasn't OffscreenCanvas created in part to free the main thread from CPU-heavy rendering code? And you wouldn't cause blocking of the rest of the UI?
- GT: not saying it wasn't. Do as much as you can without blocking the main thread. You can block the main thread in JS; can optimize by saying, I can run it 20x faster on the GPU. But you can't do that. I tested averaging an array of floats; at about 60K floats it's faster to do it on the GPU.
- KG: on one browser, one impl.
- GT: I tried several browsers.
- KG: it's concerns about portability, shape of API, architecture that concern me. Not entirely technical discussion; also a pressure discussion about where we lead people towards. How we ask them to structure their things.
- GT: if map()'s slow in one JS impl vs a for loop - what do people do then? I don't follow the argument.
- MM: thinking about this claim since last week - fundamental difference, GPU is a different processor. You can write a long for loop in JS. Distinction: can't say in JS, this loop won't touch the DOM, so browser can process the DOM asynchronously. Contrast with running code on the GPU, guarantee no DOM functions being called. GPU code is run async; CPU is free. Idling. It's a waste.
- MS: only concern I have with limiting this to workers - right now OffscreenCanvas isn't well supported outside of Chrome.
- KG: FF also doesn't support WebGPU right now. Talking about future directions. WOuldn't worry about it - we'll have it.
- CW: impl progress in WebKit?
- MM: we're working on it.
- MS: if we can count on OffscreenCanvas coming along with WebGPU - I don't see an issue with it being only on workers.
- KN: what Myles/Gregg said are partly at odds, but I agree partly with both of them. Myles' description of differences makes sense; Gregg says we're preventing you from doing an optimization with the GPU. Also valid. Resolution to tension: want to evolve web platform so main thread's only for DOM manipulation. Everything not involving DOM should be possible on a worker. Having mapSync() on the worker's probably sufficient, at least in the long term - as the web platform makes workers more capable over time.
- CW: OK, so mapSync in workers - certainly doable. Worry I have - we're trying to finish a V1 spec. WebGPU will be a living spec - there will be future extensions, etc. mapSync can be implemented in all browsers - once implemented, de facto part of WebGPU from day 1. Can go on caniuse.com. mapSync is easy to add afterwards. Apps can say, now I have universal support for it. As trying to get to V1, concerned about its specification.
- MM: I have an open issue (#2646) about mapping - mapping's pessimistic right now. If buffer's not in use but GPU's doing unrelated work, you still have to block on that work. Making mapAsync less pessimistic might mean mapSync gets more complicated in terms of specification.
- MM: determining whether mapAsync's less pessimistic must be done in V1 because it'd be a breaking change.
- CW: it's complicated. Do we want to do this for V1, or right after?
- MM: usually there's a champion, if they do the work it gets in.
- SH: from our POV - without mapSync we wouldn't be able to use WebGPU. Not sure if the same is true for other customers.
- MM: question struggling with - should mapSync delay initial implementations, or what if it's implemented later?
- KN: think Intel'd be willing to champion this. They've been working on TF.js. We could probably do a prototype impl without working through all details of making sure everything's spec'd strictly. Maybe worth having available for testing in prototype impl. Would have experience for spec'ing post 1.0.
- JB: having mapSync on workers only doesn't help you, does it?
- SH: it does.
- CW: overall people seem to want mapSync soon. Question is V1 or not. And details of it.
- CW: do we have a list of things we want to get to soon? If not in V1, soon after? I see FP16, WebXR.
- MM: this is procedural. We can make new milestones in Github.
- CW: if we say we don't block V1 on mapSync, what's confidence from people here on how long it'll take in browsers?
- KN: this'd go pretty high on my queue personally.
- MM: does Snap have a WebGL impl?
- SH: yes. We have WebGL, just want to use WebGPU.
- MM: motivation for using WebGPU?
- SH: we ran some ML inference benchmarks; 15% speedup already, more when fp16 comes. WebGL also prevents doing some things WebGPU can do - rendering tech would be improved. Want to use it for ML inference & rendering.
- MM: you could feature detect mapSync on workers.
- SH: yes, it's fine for us to do that to decide WebGPU vs. WebGL usage.
- RC: do you plan to move all inference to the GPU?
- SH: some inference algorithms like face detection are faster on CPU. Also for example gathering rectangles. Algorithm runs faster on the CPU.
- RC: so mapSync is not a stop gap.
- SH: no, it's a proper required feature.
- CW: OK, we need a champion in the group. Maybe Intel. Think we should focus on mapAsync issues, and UMA issues - important for V1. We can work on mapSync in the meantime. We have a process for working on stuff without landing it in the spec.
- CW: maybe try to timebox it so we can make consistent progress toward V1. No other more pressing thing the group wants to do post-V1 except maybe WebXR.
- CW: in other words - don't block V1 on this, but have a roadmap so it comes soon after.
- JB: can we get Sumanth's slides in the minutes?
- SH: yes, can share it out.
- KN: thanks for presentation, very useful.
Are duplicate timestamp locations allowed on the same pass? #2662
- CW: issue where multiple of these are written in the same render pass (?)
- CW: only allow locations to be written once?
- MM: should postpone this - in the middle of figuring this out.
- CW: table until next meeting then.
Streaming implementations and indirect draws/dispatches #2189
- MM: I thought seriously about the rules you made. Super close to working. One particular part that I don't think will work.
- MM: very complicated. Wrote huge reply.
- MM: one specific part I think is a deal-breaker.
- MM: best way forward - give people time to read this & digest it? Figure out if good path forward? Maybe solution I haven't thought of.
- MM: so close though. 7 problems, 6 have solutions.
- CW: Austin & I are interested in figuring a way to make this work.
- MM: hoping Austin can work with me and figure out a path forward.
- AE: I did read your reply. Would need to think more about it; was just posted. I think the # of dispatches would be bound by # of buffers modified, not draws.
- MM: eah draw can have its own region of the index buffer it's using. Can't union them. May be an index in the middle, not referenced, which would cause the draw calls to be rejected.
- CW: maybe you can meet offline.
GPURenderBundleEncoderDescriptor needs maxCommandCount if it is to be implementable on MTLIndirectCommandBuffer #2612
- CW: similar to C++ queue, when you run out of space allocate 2x the space.
- MM: previously this group strongly rejected idea of single WebGPU object corresponding to list of platform objects; that's what this describes. A bundle -> a list of indirect command encoders.
- CW: right. Why would this be OK? Jim, what does wgpu do?
- JB: RenderBundles are a hack. Doesn't use underlying platform's facilities.
- MM: indirect command encoders.
- JB: we don't use that. We implement RenderBundles in the wgpu-core level; no HAL layer for that.
- CW: record/replay then.
- JB: yes. complying with API, not using underlying feature.
- AE: MM in your other post you said indirect command buffers weren't supported on macOS?
- MM: it's compute. We split them - ComputeIndirectCommandEncoders, and RenderIndirect. Compute not supported on macOS; Render only on some Macs. The ones that support Render are those we're targeting for WebGPU.
- AE: thanks, that clarifies.
- CW: not sure. Command count is awkward.
- MM: command count is to make it more concrete for people. Just the number of draw calls, direct or indirect.
- AE: just an unfortunate thing to add. Feels like this command count makes sense where you're making an indirect command buffer and populating it on the GPU; but with CPU-side generation we can keep growing and merge them into one buffer later.
- MM: can you explain "merge"?
- AE: BlitCommandEncoder lets you copy out of indirect command buffer. So if multiple indirect command buffers backing RenderBundle, you could blit them into a single one (don't have to, but could).
- KN: or, not encode them into Metal object until finish().
- MM: no, don't want to do that.
- KN: this is for RenderBundles, not REnderPasses. Tradeoff's different. Understand not wanting to do it for RenderPasses. It's different for RenderBundles.
- MM: compromise soln: a hint? Rather than allocating a stack of indirect command encoders - don't know how big each one should be - author could say, probably less than 50 draw calls. I can use more than that - then browser will make multiple objects. Goal, reduce # of allocations.
- CW: I think we'd use such a hint in Chrome. Even though we do record/replay, we linearly allocate on CPU side.
- MM: would you get any signal about # of draw calls? Or do you need # of API calls? Or say, every draw call probably has ~3 API calls?
- CW: something like that. We'd guess 1 SetVertexBuffer, ½ SetBindGroups, so forth, per draw call.
- MM: happy to make a PR.
- AE: think the hint's fine. How bad is it to make multiple of these MtlCommandEncoders? GPU-side allocation each time? CPU-side anyway?
- MM: I think they're full MtlResources. Could be wrong.
- CW: think that's right because you can encode from the GPU as well.
- MM: sounds like we have a path forward, I'll make a PR and the group can discuss.