
Minutes 2018 04 24


GPU Web 2018-04-24 Montreal F2F

Chair: Corentin

Scribe: Ken

Location: Google Hangout

TL;DR

  • Canvas integration https://github.com/gpuweb/gpuweb/pull/55
    • Discussion about WebGPUDevice being created out of thin air then used to create canvas swapchains.
    • Status update about the discussions around OffScreenCanvas blocking commit.
    • Discussion on what “present / commit / swapbuffers” really is and where it goes.
    • Consensus: have canvas.getContext(“webgpu-swapchain”) return a WebGPUSwapChain and rename present() to commit()
  • Device discovery and creation https://github.com/gpuweb/gpuweb/pull/56
    • Listing gives the most flexibility to the application but also raises fingerprinting concerns.
    • Application’s requests for extensions and limits can be arbitrary.
    • Looking at all the devices in a system might fire up GPUs but their capabilities can be cached.
    • Application could provide hints like highPerformance, or lowPower to query adapters.
    • Need a “do the right thing” easy path.
    • Defer ability to go above core limits to after the MVP.
  • Partial pipeline state.
    • Pipeline layout has to be separate from the pipeline descriptor.
    • Proceed with one big render pipeline descriptor; change it if it gets too big.
  • Memory heaps
    • For fast placement of resources, not aliasing.
    • Problematic for Chrome because it requires a round-trip to GPU process.
    • D3D12 and Vulkan have explicit placement of resources at an offset, Metal heaps work like Arena allocators.
    • ResourceSets could be an extension.
    • No WebGPUHeap for now.
    • Need to keep resource residency in mind.
  • Descriptor pools / heaps
    • Consensus to not expose them.
  • Error handling https://github.com/gpuweb/gpuweb/pull/53/files
    • Resolution of the PR
  • What next once we have the CLA?
    • The spec should be written in Bikeshed.
    • Discussion of how the test-suite should be written.
    • Discussion of libraries that will implement WebGPU: gfx-rs and NXT
  • Memory barriers:
    • Desktop IHV feedback is that the gains explicit barriers provide are tiny compared to the amount of work and expertise required.
    • For mobile IHVs, GPUs have less parallelism, so barriers are less performance-sensitive.
    • Tentatively do implicit barriers.
    • Discussion whether we should have a readonly concept in the shading language.

Agenda items:

  • ✓Error handling
  • ✓Memory barriers
  • ✓Collaboration on artifacts (spec, tests, maybe ANGLE for WebGPU)
  • ✓Partial pipeline state
  • ✓Memory heaps
  • ✓Device discovery and creation
  • ✓Canvas integration
  • ✓Descriptor pools

Attendance

  • Apple
    • Dean Jackson
    • Myles C. Maxfield
  • Google
    • Corentin Wallez
    • Fernando Serboncini
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Stephen White
    • Victor Miura
  • Intel
    • Yunchao He
  • Microsoft
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Joshua Groves

Agenda

  • CW: Do the integration discussion first, then graphics API things, then talk about collaboration

Canvas integration

  • DM: Swap chains
  • DM: Easiest to have something which derives from SwapChain
    • WebGPUSwapChainContext
    • It’s there for the user to obtain it from the canvas
    • getContext(‘webgpu-experimental’)
    • SwapChain is there for you to get textures from it and present them
  • CW: don’t get to query the textures beforehand, unlike Vulkan and D3D12
    • In those APIs, you get all of the textures from the SwapChain up front, and then index into that array
    • In browsers, probably want to just give you the correct one to render into, because things might change.
    • #1 goal: have one WebGPU device render to multiple canvases.
  • DM: need to pass the WebGPUDevice.
    • Obtain a singleton instance of WebGPU.
    • Then can create devices.
    • Then can talk to multiple canvases.
  • JG: why do we need to grab an instance off the static WebGPU class?
  • DM: idea was, you need something to store extensions, features and limits in.
    • But those were moved to Device. Now we can decide whether we need the singleton instance.
  • CW: let’s say you have the device. How do you work with canvas?
    • To create SwapChain from canvas, need to provide the Device you want to work with.
    • Want to say how you want to work with the texture. (i.e., render target, format, …)
    • Pass that to canvas.getContext(). Returns a SwapChain, giving you textures of the canvas size. Then render and present.
  • RC: why do we have the format on the SwapChainDescriptor?
    • DM: because it’s what you pass to canvas.getContext().
  • FS: what’s the reason for the tradeoff? Understand you want a singleton across the page for WebGPU. Why not: every time you create a context, you get a device.
  • CW: it’s so you can share your rendering results. Say you are doing WebVR and want to show both on-screen and in-HMD.
  • DM: also, you may not want any canvases at all. Do computations, do readback. Helps with mining
  • MM: definitely want multiple devices on the same page.
  • JG: could have one context using the integrated GPU.
  • VM: in a 3D context you have explicit texture sharing that you can do.
  • CW: tell us about canvas and commit() from workers. Can we present from workers?
  • FS: the way this works now: when you create a canvas on the document, there is logic for transferring control of it to an OffscreenCanvas. Can transfer it to a worker. It’s how we decide which texture is associated with the thing.
  • CW: SwapChain would not be transferable, but you could get the OffscreenCanvas in the worker and then get the SwapChain from it.
  • FS: it’s about transferring the control to the OffscreenCanvas.
  • DJ: thought you got an ImageBitmapRenderingContext.
  • FS: yes, can transfer ImageBitmaps. But now OffscreenCanvas objects are transferable.
  • KR: This is still being discussed by the TAG; we are trying to convince them to allow a commit() on workers that applies GPU backpressure.
  • FS: still under discussion. 1) implementing requestAnimationFrame on workers. 2) commit() function on OffscreenCanvas, blocking if needed. Still need to push the W3C TAG to let in commit().
  • KR: Think the TAG will be reasonable and let this in, or it can be put behind a flag to gather data. Could be in WHATWG but that is still very close to the TAG.
  • DJ: what display link is the OffscreenCanvas connected to?
  • FS: the document where the worker was created from.
  • DJ: why is this better than ImageBitmap and transferable?
  • BJ: you can choose the synchronization you want. For example pushing things to the page as fast as you can, without caring about sync with other HTML. On the other hand, with Google Maps, you need sync with the surrounding HTML. Transfer the ImageBitmap to the main thread, then put it in the canvas yourself.
  • FS: With OffscreenCanvas, the animation on the OC is decoupled from the UI thread, so the UI thread can jank without the animation janking.
  • DJ: because you’re just sending frames directly to the compositor?
  • FS: yes.
  • RC: specific scenarios for OffscreenCanvas vs. commit()?
  • DJ: don’t understand why you need to transfer the canvas.
  • KR: It is the OffscreenCanvas that is transferred; the canvas is tied to the page, but you detach control of the canvas to an OffscreenCanvas that is transferable.
  • DJ: OK, understand transferToOffscreenCanvas().
  • VM: can still commit on the main thread.
  • KR: Yes. It is really difficult to figure out in the compositor if you want to have a fast path for a canvas. OffscreenCanvas helps even for regular canvas because it tells the compositor you can use the fast path.
  • JG: if you can do an early commit(), you can fence and synchronize on it. With implicit commit(), even if you finish your work, there’s a delay in the presentation.
  • FS: everything the canvas does, it needs to check surrounding things in the document. With OffscreenCanvas you bypass all that.
  • CW: how does this work with WebGPU? Want to be able to get a SwapChain from an OffscreenCanvas.
  • BJ: is OffscreenCanvas applicable to WebGPU?
  • KR: yes. Same basic idea.
  • FS: problem: have a lot of stuff related to workers. Now can get a device on the main thread, but can’t transfer textures between workers…
  • CW: it’s an explicit use case to be able to use WebGPU among multiple workers. We will be evolving the API to make that possible.
  • FS: does WebGPU have any init step?
  • CW: assume you have a WebGPUDevice and it’s shareable.
  • FS: any discovery mechanism for device descriptor?
  • CW: yes, that’s the next agenda item.
  • DM: aspect I’m not happy about: distinction between SwapChain and canvas context. When you implement it, on present() you need access to the canvas. Has to notify the canvas there’s new stuff to show. Maybe should make a single class.
  • BJ: when we want to integrate with VR, how do we want to make that work? Right now, you make a Layer on the VR device for WebGL. Provides a framebuffer, you write into it, and it gets pushed to the headset. Writing to the canvas lets you draw to the document. Could get separate SwapChains for the canvas and the headset.
  • CW: how to deal with 2 eyes?
  • BJ: right now: allocate a canvas, do side-by-side rendering.
  • CW: app needs to render two eyes separately.
  • BJ: yes, API returns two (or more) viewports.
  • BJ: would be great to say: we have this device, want to work with WebGPU contexts. Give me a SwapChain compatible with this headset. That’s a reason to have a distinction between SwapChain and SwapChainContext.
  • CW: so headset gives you a TEXTURE_2D_ARRAY?
  • BJ: yes.
  • CW: may want to put more stuff in the canvas context like display timing information. And a backpointer to the canvas.
  • DM: when you present() you need that backpointer. And present() is on the SwapChain, not the canvas context. Felt wrong. But given this use case, can keep it separate.
  • FS: present() is incompatible with current DOM update mechanisms.
  • JG: not really. One issue in WebGL is we implicitly present. Enqueue a lot of work, then “wait” for the browser to pick it up. Browser hits end of rAF, presents to screen. It’s only then that we can fence, send the buffers out. Slows us down because work isn’t necessarily done even if you flushed it properly.
  • FS: if it’s the last thing you do in the frame it doesn’t slow you down.
  • JG: it does because the compositor’s blocked waiting for the canvas’s work to finish executing. Best to commit() as close to when you’re done as possible.
  • CW: also work can happen outside rAF.
  • FS: this is something the canvas element handles.
  • DM: assume two canvases. One is done earlier in the frame, then you start working on the second canvas. Need to do some work upon present().
  • FS: ok, I see. Needs to be phrased correctly. present() is a misnomer.
  • CW: it’s sort of a swap.
  • FS: just naming. It’s a different set of things.
  • FS: it’s like commit() but it won’t show up on the screen. This commit() doesn’t present.
  • KR: The present() call is on the WebGPUSwapChain; commit() on WebGLRenderingContext is similar: flush work to the GPU and flag it for presentation.
  • FS: The flush is implicit, and the showing on the screen is explicit.
  • JG: Think explicit is better than implicit?
  • FS: could you have a solution with flush() semantics, so if you don’t do it, it happens implicitly?
  • VM: in WebGL the browser might flush. But the commit() is like “flush and enqueue the frame”.
  • JG: also lets you draw to a frame and not commit it yet.
  • CW: not sure if you can have two SwapChain textures in flight at the same time.
  • JG: more like, you are drawing but not displaying to the page yet.
  • JG: commit early is something I wrote a WebGL extension for. The idea is you can do it early. Once done you can’t write to it any more.
  • RC: need a way to apply back pressure logic. Only one commit() per cycle, have to return to browser.
  • MM: that’s something we feel strongly about too. Don’t want a game spinning on a worker thread, submitting tons of work that’s never rendered.
  • FS: this is a problem independent of WebGPU.
  • FS: OffscreenCanvas animation proposal: https://github.com/junov/OffscreenCanvasAnimation/blob/master/OffscreenCanvasAnimation.md
  • CW: So on the WebGPUSwapchain we have commit and getNextTexture, same for VR?
  • Discussion about where commit() goes.
  • BJ: if you can return an actual SwapChain from canvas.getContext() and it doesn’t have to inherit from anything, then on the VR side the layer can just contain a SwapChain object.
  • RC: what happens if you call getContext on the main thread and then transfer?
  • FS: explodes. (Throws exception, etc.)
  • CW: want to be able to use same interface for VR SwapChain and Canvas SwapChain.
  • CW: How do we change the PR?
  • KR: Make WebGPUSwapchain a partial interface?
  • CW: Consensus: have canvas.getContext(“webgpu-swapchain”) return a WebGPUSwapChain and rename present() to commit() (see the sketch at the end of this section)
  • RC: the only reason I can think of to have a separate context is if we want a back-pointer to the canvas.
  • FS: we turned those pointers into OffscreenCanvas when using WebGL with that.
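
A minimal sketch of the consensus above, assuming the names mentioned in this session (the “webgpu-swapchain” context string, getNextTexture(), commit()). The descriptor keys, format/usage values, and the existence of `canvas` and `device` are illustrative assumptions, not settled IDL:

```js
// Hypothetical sketch, not final API: the canvas (or an OffscreenCanvas
// transferred to a worker) hands back a WebGPUSwapChain directly. Per the
// discussion, the descriptor names the device that renders into the canvas
// and how its textures will be used; all keys and values here are illustrative.
const swapChain = canvas.getContext('webgpu-swapchain', {
  device,                       // an existing WebGPUDevice, possibly shared across canvases
  format: 'rgba8unorm',         // illustrative texture format
  usage: 'output-attachment',   // illustrative usage
});

function frame() {
  // The browser hands out one correctly sized texture at a time, rather than
  // exposing the whole swap chain texture array the way Vulkan/D3D12 do.
  const texture = swapChain.getNextTexture();

  // ... record and submit rendering work targeting `texture` on `device` ...

  // Renamed from present(): flush the finished frame toward the compositor
  // (this is where GPU back-pressure could be applied, including in workers).
  swapChain.commit();
  requestAnimationFrame(frame);
}
requestAnimationFrame(frame);
```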

Device Discovery and Creation

  • CW: How do you get a WebGPUDevice?
  • DM: PR #55: Tell the instance what device you need. It’s not a discovery mechanism though - need to be able to e.g. get a low power device. How does WebGL work?
  • CW: getContext takes {alpha, antialias, powerPreference, failIfMajorPerformanceCaveat, …}
  • JG: Like the concept of D3D12’s Adapters - can iterate over all of the adapters on the system.
  • MM: Don’t like being able to see every adapter.
  • JG: Ultimately you can search, to find every adapter. If you can do that, why not just provide the whole list of adapters?
  • DJ: In WebGL, if an application requests one high power and one low power context, then the browser ought to be able to give you the high power adapter for both. If the high power context ends, the low power context might end up on the low power adapter again.
  • CW: What if your application works best with an extension (e.g. tessellation) but also works without it. If it’s not available, you want a context without tessellation.
  • DJ: Return a context that doesn’t have tessellation even if it’s requested.
  • DM: why can’t we return null instead?
  • FS: you want things to work.
  • VM: you want to query and know that things will work unless you’re out of memory.
  • DJ: similar to saying that I requested high power and didn’t get it.
  • VM: do we care about supporting 4 GPUs for your binding?
  • JD: does Metal switch automatically between GPUs?
  • DJ: yes, similarly to OpenGL driver.
  • BJ: this is something we’re arguing on the WebVR side as well. When you ask for certain capabilities, is that a hint or a demand? Not sure we’ve come up with the right path in VR. Sympathize with idea that we don’t want to enumerate the devices. We do see that people write code looking for very specific hardware and properties, and fail if no exact match, even though the demo probably would have worked.
  • CW: we can prevent that.
  • FS: we can make it harder. APIs try to disincentivize people from doing what Brandon was saying. While you could do the bad thing, make it harder to do so. Sort of like user agent strings.
  • JG: in WebGL we have occasionally made breaking changes. Could do the same here if we find it’s necessary. As long as we want quick turnaround, we need to expose the adapter you’re running on.
  • FS: default mode of the API should make you do the right thing.
  • CW: what about: browser tells you what you’d get if you were to create the device.
  • KR: sounds complicated. Two-phase handshake.
  • BJ: we have the ability to say, I have a descriptor for the hardware I want. There’s a supports() check. In a lot of cases, you want to be able to put a VR button on the page. Want to be able to check support in a lightweight way before turning on the headset. That’s why VR thought we needed a supported() check.
  • CW: same thing here. Vulkan/D3D12 device creation is fairly expensive -- 0.x seconds.
  • VM: currently there are libraries which get a WebGL context to get the GPU renderer string to see if you’re a bot.
  • DJ: we don’t create the WebGL context until somebody starts doing something serious with the context.
  • DM: if an app has a code path for extensions A and B, preferring A. Passes both extensions to the descriptor. Would be browser’s choice which one to return. Instead, would like app to be able to request extension A, and if it fails, fall back and request extension B.
  • KR: I like this idea - it’s simple.
  • FS: seems the decision of which device to pick is a conversation between what the user wants to do, the browser, and the OS.
  • VM: seems like a good point. Maybe you want to let the user pick. Keeping it simple sounds good, because most of the time we want to use the low-power GPU. The discrete GPU can be optional.
  • CW: consider: have the WebGPU entry point. I want high-perf or low-perf. WebGPU gives you back a device descriptor. Can query both and see which you like best.
  • JG: then you’ve just made a list of adapters with more steps.
  • KR: we have the option of rethinking how canvas.getContext() works.
  • CW: want extensions to be explicitly enabled at device creation. How do you order them?
  • KR: would say make them all required.
  • KN: not going to test all the combinatorial options.
  • DM: when you call createDevice() the browser knows whether it can create something that satisfies it. No overhead.
  • JG: it’ll always be created if it supports all of the extensions.
  • DM: it doesn’t prevent you from going through and seeing which extensions are supported.
  • JG: need to explicitly turn them on.
  • DM: got it.
  • VM: if you could just get a list of all the devices, it would be easier for you to query which one I can use.
  • KR: ok, can see that most libraries want to optimistically turn on certain extensions, and it’s infeasible to query all possibilities up front.
  • FS: maybe we can decouple low-power/high-power. Get an object which has a list of extensions. Device creation is not based on the extensions you want.
  • JG: can we call it “adapter”?
  • FS: maybe the device / adapter you get back has a list of extensions.
  • CW: people may say : give me a device for low-power, give me a device for high-power.
  • VM: if we return a list and return the low-power one first, may guide the developers to do the right thing.
  • DJ: don’t want to expose a list of devices.
  • DM: what about required and optional extensions during device creation?
  • CW: not feasible. In Chrome we have >2000 lines of code for extension selection.
  • KN: everything in that file would be optional though.
  • JG: what about getAdapter(‘high-power’) / getAdapter(‘low-power’)?
  • MM: seems unwise to just use an enum.
  • JG: would actually be a dictionary.
  • CW: so maybe window.getWebGPU().getAdapter({ /* dictionary */})?
    • I have this feature level, these extensions.
    • Then: adapter.getDevice().
  • MM: it’s two-step.
  • JG: sounds OK.
  • CW: sounds OK.
  • VM: making it slightly more difficult to fingerprint.
  • RC: in D3D, while you can enumerate the adapter, you don’t know which extensions each one supports without creating a device. You get back “Intel xyz” but have to create a device to know which capabilities it supports.
  • JG: it’s similar to how OpenGL works.
  • CW: when you create the D3D12 device for the discrete GPU, it’ll spin it up.
  • JG: maybe you do it the first time, and cache the result.
  • MM: how do native applications do this?
  • RC: all you get back is a string of what it is.
  • DJ: in Metal you can get back a reference to all the devices on the system, and can ask each one whether it’s high/low power and what extensions it supports. Since there are only 3 feature levels it’s pretty simple. Nothing has created a context yet.
  • CW: D3D12 should know, but it doesn’t.
  • VM: Vulkan has a registry of physical devices.
  • DJ: depends on how many extensions we have. Think an application developer might say, I absolutely require this extension, and write their code around it. Or, I’m not relying on it, just give me a device.
  • CW: sounds fine. If app is a tessellation demo, it can require the tess extension.
  • DJ: have to request extensions at context creation time because Vulkan requires it.
  • RC: you can find out whether it’s discrete or not. Windows Mixed Reality supports giving you the one you “should” use. OS have a control panel to say which one is used by default. Differently set on shared grid machines. But we can’t know extensions, texture formats, etc. up front.
  • FS: getExtension call might be more expensive but it’s not mandatory. You’re hinting whether you want low power or not.
  • RC: so there won’t be a two-step thing?
  • FS: it is two step. Get the adapter, then create a device from it.
  • DJ: Metal has two steps, sort of. It just exposes them as one function. “GetDefaultDevice()”.
  • FS: separation between device you get and extensions you get.
  • VM: because Vulkan requires you to specify the extensions you want (and to have to know what they all are) it’s problematic.
  • DJ: how do you know in Vulkan which extensions are available?
  • CW: get root object. Query physical devices, which are descriptor objects. Then create a device from a physical device with a list of enabled features, extra limits, etc.
  • DJ: so Vulkan didn’t require firing up the GPUs?
  • CW: no.
  • CW: there’s a Vulkan loader on the system; its behaviors aren’t in the Vulkan spec but are spec’ed by the loader project. And maybe it has to fire up the GPU to make the query. Not spec’ed.
  • MM: 2 of the 3 APIs have two steps. What does Edge do if WebGL asks for high/low power?
  • RC: we don’t implement that feature yet. Probably have to strstr(“Intel”).
  • FS: but you have machines with both discrete and integrated GPUs.
  • DJ: we should create one per device and run a little perf benchmark.
  • CB: D3D12 lets you enumerate over integrated vs. discrete memory. Very possible Intel could ship a part with higher perf than the discrete part.
  • CW: can you query that before creating ID3D12Device?
  • CB: I think so.
  • RC: not sure about that. DXGI adapter does not give you the feature level, supported texture formats, etc.
  • JD: even if you have to create a device, the browser could do this and cache the result.
  • VM: on low end devices the browser does get killed very often. Startup time matters.
  • CW: sounds like: Vulkan’s 2-step. Metal sort of has two steps. Similar for D3D12.
  • MM: if we’re making something that maps natively to the APIs, sounds like we need some bit like high/low, integrated/discrete.
  • DJ: three states in WebGL: default, low-power, high-performance.
  • CW: what about WebGPU.getAdapter({ /* power preference */}). Then create a device from it. Or, can create a device passing required extensions, etc.
  • MM: are there extensions that are mutually exclusive on a Vulkan device?
  • BJ: why would they do such a thing?
  • FS: can see why that might happen.
  • RC: D3D just has feature level, and then you can get information about various texture formats.
  • CW: don’t think extensions are mutually exclusive in Vulkan, but they do have a cost - don’t want to turn all of them on all the time. Using bindless extension for example might increase memory consumption in descriptor sets and other places.
  • FS: list of extensions doesn’t have to be more complicated.
  • CW: developers will probably start telling us how well things are working after the MVP.
  • MM: this sounds acceptable.
  • DJ: think we should just go ahead with this now. We don’t have any extensions yet because we don’t have a spec yet.
  • CW: think we should have at least one extension. Maybe anisotropic extensions?
  • BJ: not a good example.
  • DJ: maybe “lose context” extension.
  • DJ: createDefaultDevice() convenience function. Maps to getAdapter({}).createDevice().
    • getAdapters() method taking dictionary having powerPreference key.
    • Will return object which has list of extensions, limits, name, etc. and createDevice() function taking dictionary of extensions, etc. (see the sketch at the end of this section)
  • DJ: do most people turn on some extensions?
  • CW: on Vulkan you do.
  • DJ: will this createDefaultDevice() have to turn on some extensions, because that’s what people want anyway?
  • VM: browser will have some default things that people need.
  • MM: months ago we were asking about extensions in Vulkan. The conclusion was that there are some extensions so widely supported that they are basically everywhere.
  • CW: those are “features” – different from extensions.
  • DJ: part of the core spec, but not supported?
  • CW: yes.
  • KR: facepalm
  • MM: so the functionality that a Vulkan device actually has to support is fairly small?
  • CW: yes, basically ES 3.1.
  • DJ: so we would need to turn on some extensions by default?
  • CW: yes, potentially.
  • VM: there’s stuff we need for the WebGPU core feature set. Blacklist devices not supporting it.
  • JG: just made a pull request. https://github.com/gpuweb/gpuweb/pull/56
  • DJ: when you create the device you don’t need to specify limits or extensions.
  • CW: there are some limits the device supports. Should be opt-in.
  • VM: should have one standard way of supporting e.g. limits.
  • CW: things like GL_MAX_UNIFORMS.
  • KN: there are numbers, flags, maximum values of things, etc.
  • CW: features is part of extensions.
  • CW: you should get the minimum limits by default, and request higher limits.
  • JG: would be pretty cool…
  • MM: agree with Corentin that we shouldn’t incentivize people to request features they don’t need.
  • KN: if you want the extended features of your NVIDIA GPU you should have to explicitly request them.
  • CW: on NVIDIA hardware you can have shared (?) vertex buffers. On Metal, only 32. Don’t want apps to rely on getting 33 and only test on their own NVIDIA hardware. Spec will say you can use at least e.g. 30 vertex buffers. App has to opt-in to using more on hardware that supports it, by passing limits in the device descriptor.
  • MM: if you don’t request more, then if you use too many, browser should reject it.
  • VM: does this make it more portable? Query the maximum, and if you say “1000”, I’ll request 1000.
  • BJ: think there will be two modes. People who do the default, and people who push the limits. Will do what Victor says, query the limit and immediately request the maximum supported values.
  • VM: don’t get the difference between context creation failure, and failing later, at draw call time.
  • MM: it’s about the UI of the failure mode.
  • JG: this is a min-capability spec. Prevent people from asking for too much.
  • CW: incentivize people to only request increased limits for what they need.
  • BJ: you’re assuming that developers will know exactly the number of buffers they need. More likely they will just develop with the maximums turned on, and will never go back to running on the core configuration.
  • VM: then app has to query and take optional paths.
  • CW: that should be the case in WebGL.
  • VM: there’s no portability guarantee based on what I returned, because it’ll differ from the hardware in the field.
  • BJ: think it’s great to have just the core spec be the default. But it feels like the distinction between OpenGL core and non-core. Feels like developers will often say: I’m capable of running with the core spec, or no, I’m not.
  • CW: it’s things like max texture size, number of UBOs per stage, etc.
  • MM: for MVP can we just have a core profile and figure this out later?
  • CW: we really want, in the MVP, to have validation that you’re up to spec with the lowest limits that WebGPU supports. Otherwise WebGPU will break on some random Android device.
  • MM: limits should be high enough that most applications don’t need to query this stuff.
  • FS: how about: expose the capabilities on the adapter, and remove this for now.
  • CW: if you do that then you have to have the returned device turned up to the high bar for these limits. It’s clear where this is going, but not how to expose it. For MVP, let’s say there is one set of limits which are the WebGPU limits, and you can’t increase them.
  • RC: have been distracted by looking up memory limits (discrete vs. not). In DXGI adapter can query video memory, local vs. non-local.
  • CW: when you have an adapter and create a device you’ll pass a lot of info. There’s a question about limits. We want a way for apps to go above WebGPU’s min-spec. Want to incentivize apps to do the right thing. For MVP, do the minimal thing for WebGPU, and just expose the core spec. We don’t have consensus yet on increasing those limits but we’ll figure it out later.
    • VM: also features will be folded into extensions.
  • JG: we are going to have a concept of maximum texture size.
  • BJ: we can add them in later.
  • JG: don’t like that the MVP will have an implicit max texture size, and the later version will have it explicit.
  • FS: it’ll be that the limits in the spec will be “maximum minimums”. Later, the ones returned by the API will be strictly greater than them.
  • DM: how about adapter returns minimum limits?
  • CW: no, because people will query for core WebGPU limits from the adapter and get used to that.
  • DM: so you’ll expose two adapters in the spec?
  • CW: no, the ability to increase limits will be deferred, and added later.
  • CW: what Jeff wants is: something somewhere that tells you the spec’s limits. Suggesting that a WebGPUDevice tells you which limits it was created with.
  • DM: don’t like the idea that the device’s limits are different from the adapter.
  • Discussion about whether we return the min-spec limits from the adapter.
  • CW: no, people will get used to using those as the core spec limits rather than what the hardware supports.
  • KN: or you try to create an 8K texture but you didn’t request it.
  • Discussion about initially exposing the Limits only on the Device and not on the Adapter, and making them non-configurable during Device creation.
  • DM: why not just expose these on the Adapter? And/or increase the values?
  • KR: not forward compatible in the API.
  • Lots of discussion about whether the min-spec limits will be exposed on the Adapter or Device.
  • VM: most Adapters will support higher than min-spec.
  • DM: why not just expose more Adapters?
  • CW: we don’t want to expose the Adapter’s limits yet. Right now they’d correspond to portable limits, and in the future we want to raise them.
  • KN: we don’t want to expose “virtual adapters” with higher limits.
  • RC: so today from the Adapter we just expose the string and extensions?
  • CW: yes. Today we want people to programmatically query the core spec limits, and that will be returned from the WebGPUDevice’s limits query.
  • CW: we’ll choose the various required limits from Vulkan, D3D12 and Metal.
  • MM: the research project of figuring out the limits is also figuring out what devices in the field actually support.
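
A sketch of the adapter/device two-step and the MVP limits decision discussed above, using only names that appear in the minutes (window.getWebGPU(), getAdapter({ powerPreference }), createDevice({ extensions }), createDefaultDevice()). Whether calls are synchronous, where createDefaultDevice() hangs, and the extension name are assumptions:

```js
const gpu = window.getWebGPU();   // entry point as written in the discussion; exact placement is an assumption

// Step 1: ask for an adapter with a power-preference hint (a hint, not a demand).
const adapter = gpu.getAdapter({ powerPreference: 'low-power' });

// The adapter exposes its name and supported extensions, but deliberately no limits,
// so apps don't start treating adapter limits as the portable baseline.
console.log(adapter.name, adapter.extensions);

// Step 2: create the device, explicitly opting in to any extensions you need.
// Unrequested extensions stay disabled even if the adapter supports them.
const device = adapter.createDevice({
  extensions: adapter.extensions.includes('anisotropic-filtering')  // illustrative extension name
    ? ['anisotropic-filtering']
    : [],
});

// MVP decision: limits are not configurable; the device always reports the core
// WebGPU limits, so there is one portable set of numbers to program against.
console.log(device.limits);

// "Do the right thing" convenience path, equivalent to getAdapter({}).createDevice({}).
const defaultDevice = gpu.createDefaultDevice();
```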

Partial Pipeline States

  • CW: in the pipeline there’s a layout. That has to be separate. But there are depth/stencil states, blend states, etc. Do we want these to be separate objects, or respecify things on every pipeline?
  • MM: we haven’t heard that applications de-duplicate their depth/stencil states, so we’re comfortable moving them into the pipeline object.
  • CW: so on the API side, do you want many different objects or a fat pipeline state?
  • MM: think it should be fat. It’ll all get compiled later, so keeping track of them separately won’t buy you much.
  • CW: so everything is in the pipeline state except:
    • Pipeline layout. We use that in NXT to precompute stuff for D3D and Metal (and to pre-create stuff for Vulkan too)
    • Render passes.
  • KN: think that should work.
  • CW: helps factoring validation.
  • MM: Vulkan has a ton of tiny things for each stage of the pipeline.
    • Would imagine that this is just working around limitations of it being a C API.
  • CW: also for multithreading. If you pass everything every time you don’t have sharing problems. (i.e., having a separate BlendState object which you can use from multiple threads)
  • CW: with separate objects you save a little bit, e.g. on the input state. One drawback of having everything on the pipeline is that you lose the slightly earlier up-front validation you get from separate objects like InputStates.
  • MM: don’t think that matters.
  • KR: we’re not getting back to per-draw-call validation right?
  • MM: no. the object that’s input to that validation, should it be a web of JS objects, or one object with a lot of properties?
  • CW: it’s a lot of attributes actually.
  • MM: let’s proceed in this direction (one big object) and change it if it gets ridiculous later (see the sketch at the end of this section).
  • CW: Vulkan has a bunch of stuff that can be in the pipeline state but also be dynamic. We should make them all dynamic.
  • MM: why did they do this?
  • CW: some GPUs have limited scissor/viewport sizes. Can do slightly less command buffer patching. It’s minor.
  • MM: if all the APIs support doing this dynamically, sounds good.
  • VM: we should check with the hardware vendors to see if changing any of those states dynamically is particularly or pathologically slow.
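
A sketch of the “one big object” direction agreed above, assuming a createRenderPipeline() call and illustrative dictionary keys. Only the separation of the pipeline layout (and render pass compatibility) is taken from the discussion; the rest of the shape, and the existence of `device`, `pipelineLayout`, `vsModule`, and `fsModule`, are placeholders:

```js
// Hypothetical "fat" render pipeline descriptor: former state objects become
// plain nested dictionaries; the pipeline layout stays a separate, reusable object.
const pipeline = device.createRenderPipeline({
  layout: pipelineLayout,                                    // separate object, precomputed bindings

  vertexStage:   { module: vsModule, entryPoint: 'main' },   // illustrative stage shape
  fragmentStage: { module: fsModule, entryPoint: 'main' },

  depthStencilState: { depthWriteEnabled: true, depthCompare: 'less' },
  blendState:        { srcFactor: 'src-alpha', dstFactor: 'one-minus-src-alpha' },
  inputState:        { /* vertex buffer layouts and attributes */ },

  primitiveTopology: 'triangle-list',
  // Viewport and scissor are intentionally absent: per the discussion, state that
  // Vulkan allows to be dynamic would always be dynamic (set on the command encoder).
});
```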

Workers and the TAG

  • CW: discussions with TAG about WebGPUFence.wait() and whether it can change between microtask invocations
    • Different behavior between the main thread and workers?
  • MM: Apple’s position is similar to the TAG.
    • Today: have an elegant solution for apps that spin-loop and produce frames; they don’t show the frame until they return to the browser.
    • CW: was a proposal to the TAG to allow commit() from workers.
  • MM: it’s important that we can throttle applications – to have some back-pressure.
  • CW: can do back-pressure in commit().
  • JD: the main problem for the TAG was not returning to the main loop. Can’t postMessage to the worker, etc.
  • CW: that was just one of the issues. The error handling issue was the main one.

Memory Heaps

  • DM: it’s clear from developer feedback that using heaps allows them to get substantial gains. Exposing this as opposed to automatically managed heaps would be important.
  • MM: are we talking about resources that alias?
  • DM: not talking about aliases right now.
  • MM: so this is a heap that has fast sampling but slow CPU readback?
  • DM: detaching resource allocation from its physical memory. Metal does this too, but in Metal they’re more abstracted away. Can only assign one resource to another if there’s no aliasing.
  • MM: why did these developers get speedups?
  • JG: makes alloc/dealloc really fast.
  • CW: moves the point of alloc/dealloc.
  • RC: do we think web devs will be better at managing their heaps than we would be?
  • MM: heaps mean many different things. “Within this heap, place this object at offset 73.” If that’s the Q, can the app do this better than we can?
  • CW: it’s at least something more than Metal shared memory vs. private vs. managed memory.
  • CW: I think it’s more about placement heaps. That’s what they are in D3D12 and Vulkan.
  • MM: So is what we are discussing this? I allocate a heap. When I allocate, allocations come from the heap.
  • VM: allocate slab, then do sub-allocations out of it.
  • CW: Chrome has a problem with this model because you create a resource and ask its requirements / restrictions, then place it in the heap. This requires a round-trip to the driver.
  • MM: also the topic of aliasing. This CG doesn’t want to solve that problem now and this model implicitly allows aliasing.
  • VM: think we should state what we are trying to solve.
  • DM: piece of feedback from Arseny at Roblox. Allocation performance has been a problem in Metal and D3D12 and a non-issue in Vulkan, because we can do large allocations and then do cheap sub-allocations out of them manually.
  • CW: for a buffer they can allocate one large buffer. The main question is about textures.
  • DM: yes. Everywhere you work with a buffer you can specify an offset. Not so much with textures. Question is, do they need direct placement of resources. He hasn’t started using the Metal heaps yet.
  • JD: a little unclear where the benefits are coming from. If defragmentation, etc., then don’t need the ability to place resources at particular locations.
  • VM: unless you know more about your allocation patterns so you can reduce fragmentation within your blocks.
  • DM: creating resources on the fly, too.
  • DM: would be nice to also take advantage of these semantics to know whether we’re going to run out-of-memory.
  • CW: Chrome can’t query the size of the resources.
  • MM: also Metal can’t implement this. We do createBuffer, createTexture, etc. You can attach barriers to a whole heap instead of an individual object.
  • CW: also indirect argument buffers.
  • DM: in Metal, is there a way for developers to know how much memory they need for a set of resources?
  • MM: don’t think so.
  • CW: you can ask the heap for the heap texture size and round up. (Google for “Metal Heaps”.)
  • JG: it’s designed to be a single arena size that you can allocate multiple blocks within.
  • VM: similar to a descriptor pool. You query how much space it’ll take and then allocate it. But you don’t place things yourself.
  • DM: this level of functionality Metal provides maps to the other APIs, just a more restricted form.
  • VM: DescriptorPool -> you place the index.
  • CW: no, that’s a linear allocator too. D3D’s like Vulkan. You create heaps and call PlaceResource.
  • CW: fundamental problem: we have to query the driver for each resource. It’s not like you can multiply the pixel size by width and height.
  • DM: what if, when creating a heap, you say “this is the number of objects I’ll be allocating”?
  • MM: why not just tell the driver, create all these at the same time?
  • CW: like ResourceGroup. But doesn’t solve Arseny’s problem of mid-frame resource allocation.
  • DM: it does. Create a heap that can fit this number of resources.
  • CW: need to know size of resources beforehand.
  • DM: during heap creation – like creating a DescriptorPool – you tell it what you want to allocate.
  • CW: disadvantage compared to what Roblox is currently doing.
  • VM: one of the benefits mentioned on Apple’s site: transient resources that share the same backing memory, for example when they aren’t used at the same time.
  • MM: Having two resources live at the same time and pointing at the same memory is scary.
  • VM: guess the Metal API encourages sharing of resources.
  • CW: there’s no other mention of transient resources in Metal.
  • VM: sounds like there’s an API mapping we need to research.
  • CW: a “Resource Set” – would work for us, but also wondering if it’s the problem. If you’re Roblox, you’ll want this ResourceSet to hold everything you need in a frame. Say, 3 3D textures, 2 2D textures, … but you won’t actually be using all of that in a single frame.
  • DM: how is that different from now, and specifying the size of the heap?
  • DM: if you don’t know the texture size you want, say 16x16 vs. 32x32, estimate things based on the larger size and then the smaller allocation will work too.
  • CW: could also have “create this resource, and I don’t care about heaps”.
  • DM: does this sound more like an extension to you?
  • CW: like ResourceSets?
  • VM: sets are interesting for releasing a lot of objects using a single fence. But presizing / preallocation is more interesting.
  • JG: don’t think you can get the interesting things in Metal right now.
  • CW: or in Chrome.
  • VM: ResourceHeap probably can give some of the benefits. Batch allocation.
  • DM: what if buffer creation were part of the WebGPUHeap interface, but you don’t pass in the heap?
  • CW: if we introduce heaps later on, could pass in Texture instead of TextureDescriptor.
  • DM: the device implements the backing heap for you. Maybe we will add a CreateHeap method later on the device.
  • CW: maybe if we add heaps later, we can do WebGPUHeap.CreateTexture().
  • JG: want heap creation as an extension which affects things like texture allocation. Vulkan does this as two steps. Binding is immutable.
  • DM: if we want to support the specification of addresses later, can’t have a single WebGPUHeap.
  • CW: pretty sure we can’t implement address heaps in Chrome, because we need to query the driver.
  • JG: what do you need a round-trip for?
  • CW: driver tells you alignment/size. Also render targets take more memory on some GPUs. In browser, can’t tell you, “You just need this amount of memory”.
  • CW: multi-process/multi-threaded implementations in general won’t be able to support address heaps.
  • CW: in Intel Vulkan, vertex buffers don’t want to be in heaps > 4 GB in size. They have a small heap that’s for vertex buffers. Need to communicate that to the application. Or, let the WebGPU implementation paper over this.
  • JG: app shouldn’t have to deal with this but should be able to.
  • CW: Chrome just can’t do this.
  • RC: is there no way for Chrome to get back enough information about the heap to figure this out without calling to the driver?
  • CW: no. Depends on the full resource descriptor.
  • CW: works differently in D3D12 and Vulkan.
  • RC: do we think that we can’t do better on the allocator than the D3D11 allocator?
  • JG: don’t think so.
  • CW: think if you can fit the object in the heap, do it, otherwise allocate a new heap.
  • RC: have been warned about residency in D3D12. D3D11 has a lot of knowledge of what’s used where. In D3D12 there are no such heuristics. Will already be in the business of tracking usage.
  • MM: the story you told is true today in Metal; residency is handled by the system. It’s very important for iPhones to be able to evict resources.
  • RC: in D3D12 you can use as much memory as you want. Even when it’s running low on memory or the user’s browsing the web, can evict things from the discrete GPU.
    • Video memory manager will try. But if it evicts something, it’ll do what we need to. Somewhat related to bindless. Will have to do more tracking.
  • MM: in a world where bindless isn’t used, is this still a problem?
  • RC: yes. Because D3D12 driver doesn’t do any of these allocation heuristics.
  • CW: WebGPU will need heuristics for its D3D12 backend.
  • RC: it’ll work, but it’ll evict stuff where if you use it the next frame, you’ll see thrashing.
    • There’s less information given to the API about what it’s using and what’s bound.
  • CW: so basically it’s complicated and should be hidden in the impl?
  • RC: for residency, definitely yes.
  • MM: how to know what to evict?
  • RC: can kick heaps out yourself. Also MakeResident. Also a priority scheme per-resource.
  • RC: you can evict things when you want to.
  • JD: but the system can also evict things for you even if you didn’t say to.
  • RC: yes.
  • MM: so application should minimize the stuff it uses.
  • KN: can dynamically change priorities?
  • RC: yes.
  • CW: it’s resource-level residency. We are trying to do some residency in Chrome’s Vulkan backend.
  • MM: web apps should not have this level of control, right?
  • CW: app could get a callback when it should free stuff up.
  • JG: it’s generally the way it works. You get a cooperative callback first. If you ignore it we kill you.
  • CW: back to memory heaps. Right now, think the best thing is to not use a heap, just create textures, buffers, etc. (see the sketch at the end of this section).
  • JG: seems like this is all we can do for now.
  • JD: do we have usage flags like OpenGL?
  • CW: we have resources where the flags specify what it can be used for. Or, I want this shared resource in shared memory.
  • MM: that should be a hint, and not be observable.
  • CW: what if you make a mappable resource that’s also a UBO?
    • Let’s defer this question.
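
A sketch of the resolution for now (no WebGPUHeap): resources come straight from the device, with usage flags describing what they may be used for. Method names, sizes, and flag spellings are illustrative, and the commented-out heap shape only reflects the “maybe add WebGPUHeap.createTexture() later” idea floated above, not a design:

```js
// No heap object for now: allocation and placement stay inside the implementation.
const vertexBuffer = device.createBuffer({
  size: 4 * 1024 * 1024,
  usage: ['vertex', 'transfer-dst'],        // illustrative usage flags
});

const colorTexture = device.createTexture({
  width: 256, height: 256,
  format: 'rgba8unorm',
  usage: ['sampled', 'output-attachment'],  // usage describes allowed uses, not a hint
});

// Possible post-MVP extension floated in the discussion, if placement-style gains prove real:
//   const heap = device.createHeap({ /* pre-sized for the resources it will hold */ });
//   const texture = heap.createTexture({ ... });
```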

Descriptor Pools

  • CW: in Vulkan binding model: descriptor set allocation happens inside pools. That doesn’t map well to D3D. But you can’t use two heaps at the same time in D3D12. Vulkan can. Big mismatch in behavior.
  • DM: investigated the mapping from DescriptorPools to DescriptorHeaps. Have been looking at it for a while. Things get de-duplicated. By itself, doesn’t mean DescriptorPools have to be exposed to WebGPU.
  • CW: descriptor heaps have a limit of 500,000. Some apps might use more than this in one frame.
  • JG: seems unlikely
  • MM: a lot of applications use atlases. Don’t those consume resources?
  • JG: could build your own memory heaps.
  • CW: have a bunch of descriptor pools per bind group layout?
  • DM: in my book this lands on the same page as the memory heaps. The benefits are very similar.
  • VM: do they have the same property that you have to query the driver?
  • MM: no, because they’re all the same size.
  • CW: it’s a linear allocator. You can reset() and free() but that’s it.
  • RC: is the D3D problem that you can’t use descriptors from different heaps in the same draw call?
  • CW: no, that’s not a restriction.
  • Discussion about differences between D3D12 and Vulkan in this area.
  • CW: in D3D12 you can memcpy between these.
  • CW: descriptor pools in Vulkan are much more opaque, and you can have multiple pools in use (through their DescriptorSets) at any given time.
  • JG: DescriptorPools are also not multi-thread safe.
  • CW: yes. Single-threaded allocation thing, similar to command pools.
  • MM: thought the reason for DescriptorPools was for multi-thread-safe allocation of descriptors.
  • CW: TLS works in Workers. So can have a DescriptorPool hidden behind the scenes for that thread. Maybe just summon the DescriptorSet out of the device?
  • VM: need a separate pool per thread. If you just grow them, then each thread’s grows as large as the largest one.
  • CW: Chrome doesn’t have this problem because everything’s shipped cross-process, and we can reorder stuff. Think we should just not have DescriptorPool, and summon DescriptorSets out of thin air (see the sketch at the end of this section).
  • RC: so Resources and Descriptors are managed by the people in this room?
  • CW: Yes.
  • DM: just because you ship the commands to another process doesn’t mean you can’t benefit from descriptor pools. Avoiding central dispatch place.
  • CW: sure, that thread’ll be more loaded than the others, but we can do a lot of stuff on that thread even for WebGL.
  • VM: we are thinking of going multithreaded for Vulkan.
  • CW: command encoding for WebGPU is not where you would go wide. Also can batch on the client.
  • VM: hard to predict.
  • MM: how about simple approach of not exposing it? Then if you get slower performance you can patch it back in.
  • CW: OK for now. And in D3D we manage this internally.
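
A sketch of the consensus (no exposed descriptor pools): descriptor sets are created directly from the device, and any per-thread pooling is an implementation detail. All names, the bind group layout shape, and the existence of `device` and `colorTextureView` are illustrative assumptions:

```js
// "Summon the DescriptorSet out of the device": no WebGPUDescriptorPool in the API.
const bindGroupLayout = device.createBindGroupLayout({
  bindings: [{ binding: 0, visibility: 'fragment', type: 'sampled-texture' }],  // illustrative
});

const descriptorSet = device.createDescriptorSet({
  layout: bindGroupLayout,
  bindings: [{ binding: 0, textureView: colorTextureView }],
});
// The browser can keep a hidden per-worker pool (or none at all) behind this call.
```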

Error Handling and Reporting

  • https://github.com/gpuweb/gpuweb/pull/53/files
  • CW: combination of a global error callback, and the Maybe monad for objects.
    • You can opt in to error handling for certain objects, like out-of-memory if you’re streaming in objects.
  • JG: so the only way to figure out if it didn’t allocate is to check the Promise?
    • CW: yes, or: I don’t want to handle the case where it failed to allocate.
  • JG: it’s harder to write backoff code.
  • CW: it’s worth it if you’re Google Maps.
  • MM: what’s the function that puts you in a separate mode?
    • CW: it’s an attribute on the dictionary.
    • MM: but you can still use the object even if it failed allocation?
    • CW: yes.
    • MM: why do you need a switch then? Presumably you can ask for the status of a resource.
    • CW: there’s an effect upon out-of-memory; if you set the flag, you don’t get the out-of-memory error in the global log.
    • MM: then shouldn’t you be looking at the reason / type field? e.g. “recoverable out-of-memory”. Mode switches are generally a bad idea.
    • CW: when callback on error, in addition to enum telling you type of error and a string, get a WebGPUAny which is the resource it errored on. To figure out whether it’s one of your resources, or someone else’s.
    • RC: we were talking about a label where you can get more context in the string.
    • MM: resources should definitely optionally have a name.
    • CW: but you can push/pop debug info on the context, for example during submits.
    • MM: if you did this you wouldn’t need a mode switch. In general, having both use cases is a good idea. Maybe Monad is good – better than glGetError.
    • CW: you need the error callback to get a token saying which resource you got the error on.
      • If you allocate 3 buffers and one allocation fails, how do you tell which one it is?
    • MM: the handle is a JS object.
    • JG: if you had a synchronous API and you expected to do “buf = createBuffer(); buf.name = blah”, and the error callback gets called during createBuffer, you still don’t have the label.
    • RC: what will teams like Maps want to do?
    • CW: they want to add 10 tiles, then 10 tiles more, etc. They look at the Promises for them and ask, “did it work”?
    • MM: so Google Maps will go into slow mode?
    • JD: it’s not slower is it?
    • MM: you don’t use the resource until you’re sure it’s a valid resource.
    • JD: what if you tried uploading to it? Validation error, but who cares?
    • MM: the Maybe monad would work the same way. Good developer makes the resource, then does the .then() on the Promise, and doesn’t do anything until it resolves.
    • JD: it’ll increase latency.
    • CW: but you already have network latency for Google Maps.
    • Discussion about uploading vs. drawing.
    • MM: want these modes to be unified: create the resource, .then(), and before the callback is run, start the upload. And in your callback, say “OK to use this resource”.
    • JG: mapping will fail for the upload.
    • KN: doesn’t need to be a map operation; could be an async upload.
    • JG: not a big fan of Promises in general. That’s not how Gecko code is structured. Ideally would use something we could reimplement ourselves.
  • JG: are we switching to make Create() return a Promise?
    • CW: no, just querying the status returns a Promise.
  • CW: this is why mapping is asynchronous / non-blocking. Need a fence, wait for it to complete, etc. By the time we’re ready to give you a pointer, we know the status of the buffer, and return null if its allocation failed.
  • MM: ordering of these Promises?
    • CW: yes.
  • CW: can take any WebGPU object, and ask, “device.getStatusPromise()” of that object. Returns valid / invalid / etc. Not part of buffer / texture creation, but a device thing.
  • MM: think that’s fine.
  • RC: so they’ll not use the status promise then?
    • MM: a question of which things support the status promise. And which object should expose the status property. Doing it on the Device centralizes it.
  • JD: like the idea of a function you can call to optionally return the Promise. Don’t want to have to optimistically allocate a Promise in case you’ll need it.
  • RC: why can’t most of the people query the status promise all the time?
  • MM: it’s useful for streaming apps to be able to query and back off, but a lot of apps just want to allocate 10 textures and keep displaying them.
  • RC: seems to me we should define how apps should work, instead of having two ways of doing it (black rendering for you, and smarter recovery for you).
  • MM: that’s what I’m saying – no mode switch. If the app wants to back off, it can recover from the allocation failure by backing off.
  • CW: you get every OOM. And for each, you can say, I want to recover from it, or not.
  • MM: so the PR I made was to remove the pollable error.
  • KN: there’s still a question: instead of createTexture(), had createTexture and tryCreateTexture(), the latter returning Promise<WebGPUTexture>. In the meantime, you either can or can not use the WebGPU texture for stuff like uploading. Is it useful or not to be able to do that?
  • CW: sounds like the same thing we’re discussing.
  • KN: simplifies things quite a bit.
  • MM: doesn’t support use case of Maps to have the Promise to know when it’s ready, but want to start the upload now.
  • JD: if you want that API you can build it on what we have.
  • KN: OK, seems good to keep it simpler. Difference is you can’t do that upload till later. It’s surely a little bit bad, but probably a small impact.
  • RC: the upload will be done from whatever you get back from the network. Will still have to stall the pipeline in the Google Maps case.
  • KN: places where there’s a graceful fallback will use the error handling.
  • CW: we will make the changes to the pull request.
  • RC / CW: so the changes are: 1) remove the boolean. 2) add the object to the callback. 3) move the function which creates the Promise of object status to WebGPUDevice. (See the sketch at the end of this section.)
  • MM: so when you ask for status: there’s no IN_TRANSIT status? Only Yes or No?
  • CW: it’s a Promise. Only resolves once known.
  • CW: if it was destroyed while your Promise was in-flight, could figure out a short-circuit.
  • DJ: definitely like a function to return the Promise rather than a property. Should be getStatus() returning the Promise.
  • CW: sure. Similar to Fences where you want something Promiseable and Pollable.
  • KN: two remaining things. 1) this idea. 2) make sure this works for WebAssembly. Want to make it work for non-yielding command loop applications.
  • CW: the error log is a callback.
  • KN: the problem with the error log is that if you’re non-yielding it will never get called. Unless all the errors happen on the main thread.
  • CW: it’s not this group’s responsibility to figure out how non-yielding main loops process promises and callbacks.
  • CW: a lot of ported native apps out there want to run their main loop in a loop without yielding back to the top level.
  • KN: there is a yield() call in Emscripten. But is that mechanism sufficient?
  • KN: we could say, all error logging shows up on the main thread.
  • KR: could design a WebAssembly API which returns the most recently produced error.
  • MM: we shouldn’t design for something which doesn’t exist.
  • CW: let’s do the changes necessary once other people have figured out how non-yielding main loops in workers will work.
  • KN: for error log: two options. 1) every time there’s a validation error, it gets logged, regardless of whether you use the object. (try creating an object, but you gave invalid args). Is it logged immediately, or only when you try submitting a command buffer referencing that object?
    • MM: when would it be valuable to make an error object but not use it?
    • KN: maybe before you start using the invalid texture, are errors reported?
    • KN: would prefer that only queue submit and a few others can cause error log entries. And the entry should be rich enough to tell you what the first failing thing was.
    • KN: say you do an allocation which might fail, and it does fail. But you’ve tried uploading to it, etc., and you submit it to a command buffer. There probably shouldn’t be any error logs. You detected the error state, and you didn’t submit the command buffer. Would expect no errors to be shown in the log.
    • MM: from impl perspective, it’s simpler to log things immediately. That way you don’t have to record the construction order of the graph and walk it backwards.
    • KN: not sure. Might be simpler, and explains things.
    • RC: think we should keep it simple.
    • KN: if a fallible allocation happens, and you do an upload to the object, how do you know that that error was OK?
    • CW: a way to say “ignore this error”?
    • KN: seems harder to do this on the application side than the implementation.
    • RC: it will use the Promise status.
    • KN: it might use the error log for telemetry and bug reporting.
    • MM: if the enum value is “error”, it seems there should be something in the error log.
    • KN: don’t think I agree with that.
  • KN: option (2): you only produce an error on queue submit and other device-level operations.
    • You can create a chain of as many invalid objects as you want, but as long as you don’t try submitting it, it doesn’t put an error in the error log.
  • CW: if you’re uploading to something you’re doing a device-level operation anyway. Can’t work with resources, command buffers, etc. without interacting with the device.
  • CW: think it’s very niche for now.
  • KN: agree. But if we don’t do this then you can’t tell from the app that a given error shouldn’t go into the log.
  • CW: they can put something on the object. Think this is niche enough that we don’t care about it for now. You can work around it.
  • KN: OK, agree.
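
Below is a minimal sketch (TypeScript-flavoured) of the two error paths discussed above. Every name in it (device, createBuffer, getStatus, setErrorLogCallback, getQueue, submit, the status strings) is a placeholder for the ideas under discussion, not agreed WebGPU API.

```ts
// Hypothetical sketch only: all identifiers are placeholders, not agreed API.
declare const device: any;
declare function useSmallerAllocation(): void;
declare function reportToTelemetry(message: string): void;

async function errorHandlingSketch(commandBuffer: unknown) {
  // (1) Per-object status: a Promise the app can await (or poll) to learn whether a
  // fallible allocation succeeded, without anything reaching the error log.
  const bigBuffer = device.createBuffer({ size: 512 * 1024 * 1024, usage: "storage" });
  const status = await bigBuffer.getStatus(); // e.g. "valid" | "error" | "out-of-memory"
  if (status !== "valid") {
    // The app detected the failure itself and never submits work using bigBuffer,
    // so under option (2) above nothing is expected to appear in the error log.
    useSmallerAllocation();
    return;
  }

  // (2) Device-level error log: entries are produced by queue submission and other
  // device-level operations, delivered asynchronously, usable for telemetry/bug reports.
  device.setErrorLogCallback((entry: { message: string }) => {
    reportToTelemetry(entry.message); // e.g. identifies the first failing object in the chain
  });
  device.getQueue().submit([commandBuffer]); // an invalid command buffer would append an entry
}
```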

Assuming we have a CLA, what happens next?

  • CW: where does the software live? Test suite? Should we model the WebGPU test suite after the dEQP test suite for OpenGL and Vulkan?
    • Has a software rasterizer, etc.
  • KN: may want to use Bikeshed to write the spec.
  • DJ: happy to use Bikeshed.
  • BJ: showed us the WebXR Bikeshed spec.
  • DJ: for test cases, we could also do web-platform-tests.
    • Shared test suite for all browsers, hosted on GitHub; has its own test harness and runner, is hierarchical, and you just submit tests to it. Point your browser at it, and it says yes or no. (A minimal example follows this list.)
  • DJ: also can run in WebDriver. Needed for reftests.
  • Discussion about how the test suite might be written, debuggability, and the fact that intermittent failures are hard to debug no matter what language they’re written in (e.g. flaky failures in the WebGL 2.0 tests, which were ported from dEQP’s C++ code).
  • CW: One WebGPU repo that contains everything. Could pull in the dEQP harness as a submodule.
  • DJ: however the “ANGLE-like” library could be a separate repository.
  • MM: so if I want to write a test I have to write some C++ code and then run the compiler?
    • KN: yes. You’ll have ~2 different build targets.
  • Kai shows and explains dEQP WASM
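
To make the web-platform-tests option concrete: a test is just a page that loads testharness.js and calls its assertion functions; the harness reports pass/fail per test and can also be driven through WebDriver. The WebGPU calls below are purely hypothetical, since no IDL had been agreed at this point.

```ts
// This would live in an .html file that includes /resources/testharness.js and
// /resources/testharnessreport.js; the declarations stand in for the harness globals.
// All WebGPU names (getWebGPUDevice, createBuffer, getStatus) are hypothetical.
declare function promise_test(fn: () => Promise<void>, name: string): void;
declare function assert_equals(actual: unknown, expected: unknown, message?: string): void;
declare function getWebGPUDevice(): Promise<any>;

promise_test(async () => {
  const device = await getWebGPUDevice();
  const buffer = device.createBuffer({ size: 16, usage: "uniform" });
  assert_equals(await buffer.getStatus(), "valid", "a tiny allocation should succeed");
}, "creating a small uniform buffer is valid");
```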

ANGLE-like Library for WebGPU

  • JG: main reason for Mozilla to not use this library would be to have a second implementation.
    • DM: we have a library which abstracts over Metal/D3D12/Vulkan. So want to build WebGPU on top of that.
  • DJ: might be that Apple doesn’t use it either because we only need one backend. Or maybe we only use parts of the library.
  • CW: want to change NXT and make it look more like the WebGPU IDL.
    • First, rename it. fserb@ will help. :D
  • DJ: things I would like to see:
    • Doesn’t depend on Chromium build system.
    • Should be a completely standalone library.
    • CW: currently it uses CMake.
    • senorblanco@ is working on an NXT backend for Skia. So it might have two build systems.
    • Want API to be serializable. Is this built in to NXT in some sense?
    • CW: we’re going to drop code generation. There’s the frontend API which looks a lot like WebGPU. If you want, you can reach into the guts to do stuff.
  • KR: asks for opinions on keeping code generation.
  • CW: WebGPU:
    • Device level operations require handwritten code.
    • A lot of stuff can be autogenerated, but definitely not the whole API.
    • Command buffer recording could be.
    • KN: a lot of the inter-process interface could probably be autogenerated. The outer interface is probably separate.
    • SW: if the amount of code you write by hand is less than dealing with the code generator, it’s a net win.
  • CW: finally, productizing NXT.
  • DJ: so you’re suggesting starting with NXT, changed somehow?
    • CW: yes. Reason: it’s proven that it works. SW has used it to build a Skia backend, and it worked. He added texture and vertex formats, and we added a MAP_WRITE option. Also, sampler addressing modes.
    • SW: big ones missing: multisampling, sRGB, clearing stencil, mipmaps.
  • MM: before we agree to contribute a Metal backend to this; will need to review it more.
    • CW: there already is a Metal backend. It’s the most production-ready one because Metal is simpler.
  • RC: so the first step is to push NXT to the W3C including the git history?
    • CW: yes, plus relicensing and talking with lawyers.
    • DJ: and setting up processes, CI, etc.

gfx-rs

  • DM: open-source project with a four-year history. Goal was to abstract over graphics APIs.
    • Counts around 140 people in the community who have contributed.
    • Contributions fairly sporadic.
    • Has mostly been rewritten in the past year.
    • API exposed is close to Vulkan.
    • Have a C wrapping layer making it a portability library. Can run the full Vulkan conformance suite against it. That’s what’ll ensure quality of the backend.
    • Able to run quite a few LunarG samples. Quite a few shiny demos. Actively developed. Used in WebRender, as well as WebGPU implementation.
    • Have prototype of WebGPU sketch API inside Servo using gfx-rs. Can create a SwapChain and show a black screen.
  • MM: what’s the difference between this and VkPortability?
    • JG: they both expose a Vulkan API.
  • DM: we target Metal, D3D12, Vulkan, D3D11, OpenGL ES
    • D3D11 and OpenGL non-functional right now
  • DM: hope to eventually move WebRender to gfx-rs
  • RC: how does it compare to MoltenVK?
    • DM: MoltenVK is a good production-quality library that has been selling for a while. Can run Dota 2. We don’t have those demos, but we have hooked up CTS coverage. MoltenVK doesn’t expose its progress to others.
    • CW: MoltenVK is great if you don’t mind fixing stuff yourself.

Memory barriers

  • DM: feedback we (DM, CW, KN) got yesterday from Matthäus Chajdas from AMD:
    • Had strong opinions about pipeline barriers.
    • First: who are we targeting?
    • If we want to target small groups of developers, there is no chance they will handle memory barriers more efficiently than we could do automatically.
    • Large studios are paired with helper engineers from all of the IHVs.
    • So probably does not make sense to expose them in the API.
  • DM: what workloads are we aiming to do in the API?
    • 10K-20K draw calls per frame? Explicit barriers not a help.
    • Explicit barriers only help around 50K draw calls.
    • If the driver tries to automate it, it gets too much per-draw-call overhead.
    • Benefit of automatic barriers diminishes quickly after the 10K draw call mark.
  • DM: his suggestion:
    • Since we have a lot of CPU overhead for draw call submission…
    • CW: our CPU overhead per draw call might not be that bad
    • DM: ...make the pipeline barriers happen only at the start of command buffers or render passes.
  • DM: another strong point:
    • Very few resources that have to change their state.
    • So reduce the number of types of resources that have to track this.
    • RC: have heard this feedback too.
  • DM: last question: how do you provide the initial resource data?
    • Usually have to transfer the data in before you can treat it as immutable from then on.
  • MM: one way: creation function accepts initial data, optionally.
    • DM: that’s what Metal does.
    • MM: yes, but for the private storage mode that doesn’t work.
    • JG: might want to generate it on the GPU and then mark it immutable.
  • CW: two things:
    • Thought memory barriers were most important for AMD. Since an AMD engineer told us this, was surprising.
    • Providing data at resource creation time is OK, but seems like a large / complex thing: textures require multiple mip levels, etc. Thinking instead of dropping the transfer destination usage from the resource; then it never gets another barrier, so we don’t have to check it for barriers. (A sketch of both approaches follows this list.)
  • DM: split barriers:
    • DM: I was deeply convinced that they were required for efficiency. He says:
    • It’s very unlikely that the driver can interleave other work while split barriers are underway.
    • CW: he was saying that 90% of the gains can be gotten with automatic barrier generation. It’s easy to do the wrong thing and get worse performance. Was surprising to hear from AMD, where we thought they have the most expensive barriers.
  • DM: he thought the major benefit was grouping barriers.
  • MM: so creating and submitting a fixup command buffer?
    • CW: yes. Or maybe like in NXT, we re-record the command buffer. You have all the resources and know their states, so you just stitch them.
  • CW: really challenges our assumptions that explicit barriers are needed for that hardware.
  • VM: sounds like batching the barriers is the biggest win.
  • RC: why do we need to group?
  • KN: to avoid cache flushes.
  • MM: have to do as few calls as possible, so have to coalesce.
  • VM: could be very biased to AMD architecture.
  • CW: talked with Jesse Hall; mobile cares less because it has fewer parts. If you have a “waitForIdle” it’s only waiting for ~2 pixels to finish.
    • Apparently NVIDIA’s GPU can’t read compressed framebuffer formats, so had to group the decompression / compression steps?
  • VM: has NVIDIA weighed in on whether they get pipelining?
    • CW: think texture units can read the framebuffer compressed format.
    • MM: found the same thing. NVIDIA has more functionality than AMD in this area.
  • JD: this is a big win for the API complexity.
  • CW: yes. Have to define that the resource within a pass can only be either writable, or readable.
    • Want to replicate the data points we got with other GPU vendors. But if it’s true then it’s a big win for API complexity.
  • VM: could do extra batching just by doing a no-op command buffer referencing everything. (As an extension?)
  • MM: so tentatively proceed this way?
  • CW: I think so. Tentatively, have implicit barriers. Confirm this feedback with other vendors. It’s much simpler to have no API for this than to have an API.
  • JG: how do we specify what we write to or read from?
  • CW: SetBindGroup / BindBindGroup / SetVertexBuffer validate that the resource is allowed to do this. Set “resource is used as this” in this pass. Then validation later checks the usage of the resource throughout the pass.
  • MM: UAVs: think we should parse the shader to see if they’re readable/writable.
  • CW: think there should be only two usages: read-only, or read-write. The shader then has to have read-only keyword. Has to match pipeline layout.
  • MM: saying that the programmer should not say, this is readable, or this is writable.
  • KN: we are at least checking that the use matches the pipeline configuration.
  • MM: inside the source code of the shader, if you try to write to something read-only, it should be a compilation error.
  • CW: there’s a decoration in SPIR-V that it’s read-only. Could use that.
  • MM: you don’t need it though. Plus we didn’t agree we’re ingesting SPIR-V.
  • DM: would be useful to validate SPIR-V independent of the pipeline.
  • CW: that’s true. But you have your HLSL code and it uses a UAV, which is read-only. (You never write to it.) Then HLSL -> SPIR-V. It sees you never write to it, so it tags it read-only.
  • JG: don’t want to change the barriers we issue to have to match the shader parsing results.
  • MM: possible to have same resource bound twice in your shader program.
  • CW: SetBindGroup has a matching layout. UBO/SSBO/read-only SSBO. Upon SetBindGroups, figures out the usage automatically.
  • MM: so this would be detected in both systems we’re describing? You’d bind the same resource to both bind points. Two variables backed by the same thing. One’s read from, one’s written to. When command buffer is submitted, resource is marked as read-write. Yes?
    • JG: seems if you use it as both read-only and read-write that it’s still read-write
  • MM: the point I’m trying to make is that read-only vs. read-write is more than just analyzing the shader; has to do with the bindings.
  • CW: yes. When you declare the layout, you specify each thing as read-only vs. read-write. (See the sketch after this list.)
    • MM: system will do this better than the programmer.
  • CW: BindGroupLayout vs. Pipeline: when you compile the shader code in the pipeline with the bind group layout, it says “this SSBO is read-only”. Have to make sure it’s read-only in the pipeline.
  • MM: rather than the pipeline saying “this resource is allowed to be read from / written to”, why not automatically detect that instead of requiring it in the grammar?
  • CW: you have an interface to your validation.
  • Discussion about how / when we need to, or can, compile shaders and/or pipelines.
  • VM: whatever it does - read/write, etc. - can be saved on the shader module.
  • MM: it’s just a detail.
  • RC: if we’re going to do this, not sure whether the HLSL compiler has the smarts to figure out something’s only written to or read from. A shader resource view is what is used. HLSL only has shader global barriers. Need to make sure that the compiler can determine this for us.
  • CW: we can statically figure out how each piece of code is used.
  • MM: each time you access a resource, you have to state which resource you’re acting on. Easy to match operations and resources.
  • CW: fairly simple static analysis to understand whether a UAV is written to.
  • CW: also HLSL++ can have a readonly keyword too.
  • RC: so if we can do all this analysis then no explicit memory barriers in WebGPU?
    • CW: except maybe for UAV barriers. That’s a detail we should push off for later.
  • KN: it’s “NonWritable”: a SPIR-V decoration you can attach to an SSBO.
  • MM: in the original source program, people won’t be marking things readonly, right?
    • KN: the attribute’s in GLSL too. Could add it to HLSL if it’s not there.
  • MM: Q is: somebody will have to do this analysis to figure out readable/writable.
  • CW: same way as in C programs you have “const”. Additional type safety.
    • Read-only UBO + read-write SSBO for example. Conflict.
    • Use readonly keyword. Interface stays as it should be.
  • MM: if you make something writable you have to change all the readonly places to be read-write
    • CW: that’s called type safety.
  • DM: did we agree on non-inherited resources between render passes?
    • Looking at the issue. 2 comments, not closed.
  • DM: in the sense of memory barriers: if we inherit, people will forget to unbind something. Might use in another pass, etc. More complicated.
    • CW: not sure we reached consensus.
    • MM: think we’re in consensus. No inheritance.
    • Issue #24.
  • DM: so barriers are implicit?
    • CW: yes, in the sketch API, make them implicit, because less API is better than more API. Let’s reach out to other IHVs.
  • DM: to me his points sound very convincing. I know what the native hardware does during transitions and if they can’t take advantage of it then I don’t know who will.
  • KN: if we’re worrying about the last bit of performance then we shouldn’t worry too much about NVIDIA. Rather, Qualcomm and ARM.
  • VM: think this is the right decision. But if you want to do 50K draw calls? Want to be as fast as native.
  • KN: if you’re a CAD application / game, you’re not running on the same (D3D11) driver. They optimize for different things. As something slightly higher level than Vulkan/D3D12 we don’t want to be in that situation.
    • CW: not sure how much I agree with that.
  • VM: I just didn’t get it – if I want to do 10K draw calls and not have it take my whole frame time?
  • MM: the number of draw calls is a poor proxy for the amount of non-idle time.
  • DM: rather, amount of validation you have to do.
  • KN: if you submit same command buffer many times, maybe not.
  • VM: doing implicit barriers will add 20% overhead to draw calls. But only shows up if you’re within 80% of your budget. (?)
  • CW: if we put barriers at beginning of passes: generating them is similar to validating them. Optimizing is more work. Many applications have 1 renderpass anyway.
  • DM: UAV barriers?
    • CW: only inside compute passes
    • MM: they can’t be inside passes in Metal
    • DM: I propose treating the compute pass boundary as the global UAV barrier.
    • MM: you can get smart about figuring out which UAVs need barriers and issuing them.
    • CW: seems reasonable.
      • Will have to remember which resources were bound as writable UAVs and flush them later.
      • MM: you could do split barriers. Done writing -> one part. Start reading -> other part.
      • KN: hard to do across command buffers.
  • CW: this simplifies a bunch of stuff and gives us a bunch of homework.
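
A rough TypeScript-flavoured sketch of the direction discussed above; all names are hypothetical placeholders, not agreed API. The idea is that usage is declared per binding in the bind group layout, checked against the shader (e.g. a SPIR-V NonWritable decoration) and against the bound resources, so the implementation has enough information to insert barriers implicitly, grouped at pass boundaries, with no barrier API exposed.

```ts
// Hypothetical sketch; every identifier is a placeholder, not agreed WebGPU API.
declare const device: any;
declare const storageBindGroup: any;

// Each binding in the layout states whether it is read-only or read-write. The pipeline
// validates this against the shader (writing to a read-only binding would be a compile
// error), and setBindGroup validates the bound resource's allowed usages against the layout.
const layout = device.createBindGroupLayout({
  bindings: [
    { binding: 0, visibility: "fragment", type: "uniform-buffer" },           // inherently read-only
    { binding: 1, visibility: "compute",  type: "read-only-storage-buffer" }, // shader must not write it
    { binding: 2, visibility: "compute",  type: "storage-buffer" },           // read-write (UAV-like)
  ],
});

// With usage known for the whole pass (bindings plus render-pass attachments), the
// implementation can emit the required transitions itself, grouped at the start of each
// pass, and treat a compute pass boundary as a global UAV barrier.
const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setBindGroup(0, storageBindGroup); // validated against `layout`; records usages for this pass
// ... dispatches ...
pass.endPass(); // implicit UAV-barrier point in this sketch
```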
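Similarly, a small sketch of the two approaches discussed for getting initial contents into a resource so that it never needs another transition afterwards. Names are again placeholders.

```ts
// Hypothetical sketch; identifiers are placeholders for the two options discussed above.
declare const device: any;
declare const pixelData: ArrayBuffer;

// Option A (Metal-like): creation optionally accepts initial data; combined with omitting
// any transfer-destination usage, the resource can be treated as read-only for its whole
// lifetime and never needs to be considered for barriers again.
const immutableTexture = device.createTexture({
  size: { width: 256, height: 256 },
  format: "rgba8unorm",
  usage: "sampled",        // no transfer-destination usage at all
  initialData: pixelData,  // upload happens as part of creation in this sketch
});

// Option B: generate the contents on the GPU instead (e.g. render or dispatch into it once),
// then never bind it with a writable usage again; since usage is declared per pass, the
// implementation knows it stays read-only afterwards.
```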

Agenda

  • Skip next week’s meeting
  • Meeting after:
    • Let’s get some IHV and ISV feedback.