Skip to content

Minutes 2018 12 17

Corentin Wallez edited this page Jan 4, 2022 · 1 revision

GPU Web 2018-12-17

Chair: Corentin

Scribe: Ken

Location: Google Meet

TL;DR

  • Push constants/Dynamic buffer offsets
    • ByteAddressBuffer are a fast-path in D3D11/12
    • Benchmark data is surprising but clearly shows push constants underperforming
    • Dynamic buffer’s complexity depends on the how data is uploaded to the GPU
    • Engine feedback is that dynamic buffers are more important than push constants
    • Tentatively skip push constants for the first MVP?
  • Buffer mapping #138 #147
  • PR Burndown
    • Dynamic buffer PR #149 accepted with nits
    • SetVertexBuffers PR #151: unclear what null does: do nothing or clear the binding.

Tentative agenda

  • More push constants
  • More buffer mapping
  • Multi-queue synchronization
  • PR burndown
  • Agenda for next meeting

Attendance

  • Apple
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
  • Microsoft
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Yandex
    • Kirill Dmitrenko
  • Joshua Groves
  • Mateusz Kielan
  • Mehmet Oguz Derin

Administrative items

  • CW: still missing some folks from the F2F. Please confirm your registration.
  • DM: How do I register? CW: Do you want to go? DM: Yes. CW: Okay, registered.
  • CW: Details on mailing list, email me to register.

More push constants

  • CW: MM provided benchmark. Surprising results. DM did some benchmarking. Results seem to be that they are mixed, and dynamic buffer offsets perform better in general.
  • MM: commented on PR about how to implement this with argument buffers in Metal.
  • CW: sounds good. Implementable efficiently.
  • MK: I read it and don’t know anything about Vulkan. As long as you don’t have the buffer that’s dynamic, only the offset.
  • CW: yes, want this implemented on Vulkan efficiently.
  • MM: did you say we should do dynamic byte offset in shader, or make offset be a multiple of the item size inside the buffer?
  • CW: think we should do byte-addressed buffer in HLSL. That’s how ANGLE implements compute shaders on the D3D11 backend right now.
  • CB: we’ve found byte-addressed buffers to be a fast path.
  • CW: given that byte-adderssed buffers are the fast path, and that dynamic buffer offsets are faster in the benchmarks, and that it’s iplememenable on Metal with arg buffers, should we go with that?
  • DM: main q is are there uses of dynamic buffers outside the upload path?
  • CW: best to minimize number of bind groups you have to create.
  • MK: depending on what you choose to do in the uploads...push constants might be the only way to modify some of the uniforms during a render pass on Vulkan, that doesn’t require you to have all uniform data ready for the render pass. You otherwise have to stage everything you’re going to use in uniform buffers. Push constants are advertised by the manufacturers as the way to modify frequently changing data. The tendency to underperform is surprising, but that’s maybe because the cmd buf wasn’t reused in the benchmark. Vulkan, AMD, etc. all suggest using push constants for frequently changing data.
  • CW: having seen the standardization of Vk since early days that was my feeling. But we don’t have any better data than the benchmarks. Provided we don’t reuse command buffers, dynamic buffer offsets are better.
  • JG: my homework is to ask why push constants are slower but I haven’t gotten to do that.
  • CW: CB just sent this: https://github.com/sebbbi/perftest
  • CB: these are large load operations, not just simple updates to a single register. If we can get by with a simple API that provides 95% of performance, probably a win.
  • CW: are you saying push constants are the simple API?
  • CB: no: if we need a different mechanism for small, high frequency constant updates, and if push constants don’t provide them, then don’t include it.
  • CW: people will give us feedback between MVP and V1 of the API. As a group I think we have to say we’re doing dynamic buffers.
  • DM: if we do that instead of push constants you’ll want to do things like sub-buffer uploads, etc. that we don’t support at the API level. We were comparing native ways to do the uploads, but haven’t actually compared what the overhead would be if we expose push constants vs. the buffer uploads. With our complexity in buffer uploads, seems that providing push constants is a simple and efficient solution, even if it does stuff behind the scenes that the benchmarks is doing.
  • MK: correct, esp. since the benchmark uses persistently mapped buffers, which the group is allergic too. Persistently mapped buffers in OGL gave a 3X perf improvement over what was in DX11 at the time, which was map-no-overrwrite.
  • CW: this is part of the next item. I will try to argue that the proposal for buffer mapping that we have right now has roughly the same efficiency as persistently mapped buffers, and is slightly more complex for apps to reach fastest path but apps can get there. If we get close to native upload perf, I think as a group that we have to choose dynamic buffer offsets.
  • DM: I was just saying that the benchmark doesn’t target WebGPU’s use cases.
  • CW: I’d like to not talk about buffer mapping as it exists right now. Would like to reach consensus on push constants vs. dynamic buffers.
  • JG: DM’s saying that our decision is blocked on buffer mapping.
  • CW: have to make a decision before we have an implementation unfortunately. We’re not making a decision on V1; just the MVP.
  • MK: can you keep push constants at the back of your head so they aren’t unimplementable when moving from MVP to V0?
  • CW: I was going to suggest: do dynamic buffer offsets. Dawn has most support for push constants; could prototype that on the web in the MVP (behind a flag) so we can write web-based benchmarks.
  • MM: I’d like to hear the rationale for having one vs. the other.
  • CW: looking at the high frequency operations in a frame, w/o push constants or dynamic buffers, have to bind things all the time and create a lot of them for every single region of uniforms you want to use. W/o either, this will be the single most perf critical thing - creating descriptors.
  • MK: 100% correct.
  • CW: we need a way to solve this. Have to drastically reduce number of BindGroups we need.
  • MM: ok thanks.
  • CW: sound good? Use dynamic buffer offsets, and convince people we have good upload perf with this proposal?
  • MK: the thing is you need dynamic buffer offsets whether you have push constants or not. If you want to port any modern game that uses Vk or DX12 to the web, existing code bases are full of this.
  • CW: you could emulate using push constants.
  • MK: push constants and dynamic buffer offsets are supposed to complement each other. Not interchangeable.
  • MM: we’re designing a new API with its pros and cons.
  • MK: then nobody will want to port their engines to it.
  • CW: that’s feedback they can give between MVP and V1. For now, it’s best if we choose one or the other.
  • MM: you’re right that porting is an imp’t use case, but saying that two features of an API are supposed to complement each other is not sufficient argument to add it to our API.
  • JG: it’s still good feedback.
  • MK: from our standpoint as a software developer using those APIs, I can get by w/o push constants, but impossible to get by without dynamic buffer offsets. In that case, in vulkan, I would have to make lots and lots of DescriptorSets, which would be extremely inefficient.
  • CW: is that because of the 64 KB uniform buffer limitation?
  • MK: yes, and because i can not draw everything with one MultiDrawIndirect call. Have to either bind a different buffer, or same buffer at a different offset. If I had my choice, would be: only have 4 buffers for uniform data. One is a constant buffer.
  • CW: that’s the same way we have for descriptor sets.
  • MK: I would basically use one buffer only. (Also why I’d use persistently mapped buffer.) Allocator on top of it. I’d write out all of the constants I’d need for the shader, then bind that buffer at different offsets. E.g.: 50 torches w/attributes like matrix, colors, etc: array for the 50 torches I’d bind with uniform buffer I’d use in my shader. That would be fast; not changing buffer binding, only offsets.
  • CW: makes sense. Could be emulated with push constants.
  • MK: it can’t. Most hardware will limit you to 128 bytes of push constants.
  • CW: I’m saying: uniform buffer is array of torches. You have one constant which is a torch index.
  • MK: that would work if it’s just torches I’m drawing. Have lots of things, and the structured data is different for each. If I was doing what you’re saying I’d have to treat the push constant as a byte offset which wouldn’t work; alignment issues, have to cast, etc.
  • CW: makes sense. Thanks for explanation. Makes sense.
  • CW: can we reach consensus?
  • DM: are you saying drop push constants for now?
  • JG: I can easily agree to do dynamic buffer offsets for now. If we split off the push constants question it’s easy.
  • CW: good enough for me. We can collectively say: for the first MVP we don’t do push constants and we do this later.
  • DM: I’m not convinced about push constants given the issues we have with buffer uploads.
  • CW / JG: let’s talk about buffer mapping then

More buffer mapping

  • #138
  • CW: also actual proposal: #147
  • CW: comes with an explainer doc, etc
  • CW: basically: IDL is the same as buffer mapping proposal #1 and setSubData. Not mapReadAsync/WriteAsync and setSubData: instead of special WebGPUMappedMemory, returns Promise of ArrayBuffer. Removes pollable interface because this can be shimmed easily. Can figure out the rest later (like polling in between JavaScript tasks.)
  • MM: sounds great.
  • CW: would like MK’s and DM’s feedback. Want to try to convince you this will be about as performant as persistently mapped buffers.
  • JG: why don’t we have a single map call with flags?
  • CW: could even just have .map() because we don’t support readable+writeable.
  • JG: then we could have a WRITE_DISCARD flag which is the same as SetBufferSubData.
  • CW: mostly a concern of ergonomics. We could do that. Immediate upload proposal (1) was MapWriteSync. Idea here is that you don’t have to go through the Promise to get an ArrayBuffer. Would be possible to expose with a single map() call.
  • MK: not doubting that the way the memory is transferred from app to GPU buffer might be as performant as persistently mapped buffers if we’re talking about the transfer itself. The reason persistently mapped buffers are advertised as best practice is the amount of driver overhead of tracking sub-resource. There’s a famous benchmark from NVIDIA, AZDO, …
  • CW: we’re talking about D3D12 and Vulkan. I don’t think these do sub-resource tracking for you.
  • MK: i’m talking about this because WebGPU’s talking about doing the same thing these older (D3D11, OGL) resources were doing. Vulkan lets you cut yourself. Have to do lots of sync. ARB_shader_storage_buffer was predicated on not accessing things while you’re writing to them.
  • CW: we’re talking about doing something differently than OGL. D3D11 and OGL did magic (buffer versions internally, etc.). We don’t.
  • MK: that’s what OGL was doing with map(). OGL though returned an error if you used the buffer before unmapping (before persistent maps existed). What WebGPU would do is insert a massive stall when you try to use the same buffer on the GPU.
  • CW: you’re concerned about CPU/GPU sync point?
  • MK: not exactly but sort of.
  • CW: that’s not a problem - let’s discuss uploads. Allocate buffer. Not writeable. Then map it, get pointer to persistently mapped region (if browser can do that). Write into it, unmap, use GPU to read it. No bubble because you unmapped before GPU started using it. If you map again to write again, WebGPU won’t give you the pointer until GPU’s done using it. No bubble.
  • MK: latency is still a problem. On desktop there’s no latency. It’s my responsibility to make sure that the GPU is done before I start using it.
  • CW: in WebGPU we can’t have CPU/GPU races.
  • MM: this is the major benefit from our perspective. Don’t want to codify major browsers’ behavior and then have everyone else have to implement it. Spec clearly instead
  • MK: it goes really against all best practice on desktop.
  • CW: makes things more complicated, and have slightly more latency, but with browsers with WebGPU in a different process you’ll have more latency getting back results of rendering commands. But this is close to what you can get with persistently mapped buffers. DM ported DOTA 2. They have one huge buffer, write into it as a ring buffer, and read from the GPU. We disallow this in WebGPU.
  • MK: there are lots of engines that use this and it’ll be really hard to port. Take The Last Of Us.
  • CW: in WebGPU instead they would segment their large upload buffer into 10-20 buffers. Each would be either used by the GPU or written by the app. Slightly increased memory use.
  • MK: think you’re oversimplifying the issue. We’re talking about one big buffer for all the VBO data. That’s a very specialized case of using persistently mapped buffer. We allocate one large buffer and use it as a ring buffer. Use with a general purpose allocator which returns offsets. Can allocate chunks anywhere. We use this for streaming updates to VBOs, streaming texture updates (playing YouTube videos, etc.), etc.
  • CW: these are all streaming use cases. Why do you have a general purpose allocator?
  • MK: because I do one large allocation and then make sub-allocations. Otherwise extremely expensive.
  • CW: why not just do a linear allocator?
  • MK: can’t because those ranges might end up being used at different rates. For example, update to YouTube movie texture is kicked off on a separate OGL context, done on a separate thread and separate command queue. Possibly later than the rest. Fonts / buttons are put directly into the streaming buffer, only needed for one frame.
  • CW: everything needed only for one frame could be put into a linear region in a ring buffer.
  • MK: no, some of that stuff is needed for one frame and some is needed for a certain time duration more or less than a frame. Would need to use a GPU fence to figure out when done with it. I insert a fence and then free things once it’s passed.
  • CW: we really have to prevent CPU/GPU races. For these allocations, would have to copy into GPU-local memory. Don’t see how if we look at your proposal of non-coherently, persistently mapped buffers, if we prevent races it’s even worse.
  • MK: you won’t be able to prevent races. Also, most middleware targeting ARM or Qualcomm already handle the non-coherent persistently mapped buffer cases; you have heaps which aren’t coherent, so have to flush manually. When you debug the graphics application with Renderdoc, if you don’t flush your buffers you don’t see results, so you immediately have portable behavior.
  • CW: I’d be happy to talk with you about this via VC more because Github isn’t working out, but at this point we’re taking up the group’s time.
  • MK: I’m only talking about what Naughty Dog did for The Last Of Us 1&2.
  • CW: I’m sure that the proposal I made satisfies this.
  • DM: I’m concerned about the proposal you made, CW. In Valve’s case, replacing dynamic persistently mapped buffer with push constants would be easy, but the case we just talked about, it’s not that simple. Juggling multiple buffers, etc. I think that next steps are going to be, we come up with some use cases, and see how it works. I presented Valve’s use cases for this reason on Github.
  • CW: OK. I’ll provide pseudocode on how these use cases map to the proposal. Would like other proposals with roughly the same level of detail. The other proposals don’t explain how many copies happen, how they work in multi-process browsers.
  • DM: I would have a proposal if group says yes to this: can we say that upload and download buffers are used for only that?
  • CW: that’s not DOTA’s use case.
  • DM: if DOTA uses push constants it’s not a big problem. Can also upload and copy to the buffer; doesn’t change their use case much. Can still have large GPU buffer, and each time, copy into that. Doesn’t change architecture much, just adds another copy.
  • CW: happy to look at your proposal but on mobile architectures, really don’t want to incur another copy.
  • MM: I’d like to read this proposal too.
  • DM: I’m concerned that we’re running into limitations. The fact that a buffer can only be in one state, and that BindGroups/DescriptorSets can only refer to one buffer. Conflict between BindGroups and buffers, then. If we say you can’t use the buffer for anything else then it simplifies the API a lot. I’ll be happy to write this down.
  • CW: it would be nice to make progress on this if we can so that we can make a snapshot of the API by end of January.
  • AIs: everyone, please post proposals.

Multi-queue synchronization

  • CW / JG: going to skip this

PR burndown

  • CW: PR #149
    • Adds IDL for dynamic buffer offsets
    • DM: is there a cost to empty sequence, so we can allow nullable parameter?
    • CW: good question, I don’t know.
    • DM: can fix later.
    • CW: any concerns with PR?
    • JG: should probably be an optional parameter (vs. nullable).
    • CW: can you please raise this as a comment on the IDL and I’ll change it?
    • RC: is there any reason we have to make new bindings for different binding types?
    • CW: yes, on Vulkan it’s needed and on D3D12 we make them root descriptors instead.
    • CW: recorded that it’s fine with everyone, with Jeff’s nit fixed.
  • CW: PR #151
    • We decided last week to split this off
    • DM: can we settle on consistency, whether we document things in the IDL or not?
    • CW: think we said: Dean was going to make a skeleton of a spec where we can copy the IDL. At that point we can do it. Until then there/s no good place./
    • MM: I also volunteer to document things. Need more than a collection of explainers.
  • JG: when we write the spec text can we use Renderdoc /// style?
    • CW: would like to not focus too much on the specific style
    • MM: I just put a comment in there to be clear.
    • CW: I do have an objection on the PR itself. Basically same objection as last week, holes into the interfaces should be first-class citizens. Right now it forces you to allocate bigger arrays of WebGPUInputDescriptors just so you can have holes. Would be good to have input slot in Inputdescriptor.
    • KN: it’s not going to be that big right? Will only have a couple of holes?
    • CW: for JS maybe it’s not that big a deal. But if you’re using this from Emscripten or native, you’ll have to allocate a 16-element array and fill it with null pointers just to do this.
    • MM: you’ll have to make an array anyway. Difference between 14 and 16 members.
    • CW: it’s more that if you use only input slots 1 and 15, you’d have to fill the gap with null pointers. That’s gross.
    • MK: can you not just let the user call it 16 times if they need to? It’s before the frame starts anyway, so efficiency’s not that big a deal.
    • CW: SetVertexBuffers (?) done in command buffer.
    • RC: JS differs between null and other values.
    • KN: I was proposing to treat null, undefined, etc. all the same.
    • MM: would the null clear it?
    • CW: I’m in favor of passing null, etc. and that they wouldn’t reset things.
    • JG: how would you clear?
    • KR: just ran into this in WebGL, where we need to be able to reset VertexAttribPointer state.
    • MK: more work per draw call to keep things resident. Prefer to keep it clean.

Agenda for next meeting

  • Do we plan things or let the conversation go and see what happens?
  • CW: we’ll have some status update to present.
  • MM: we’ll also have some status updates.
  • MM: think it’s imp’t to make some progress on shading languages before Google and Apple ship an MVP.
  • CW: I’m pessimistic of progress happening, speaking just as someone attending this group’s meetings.
  • JG: we have a F2F coming up, conveniently.
  • CW: Jan 7 is shading languages.
  • MM: did we just declare bankruptcy?
  • JG: no. We’re just pessimistic.
  • CW: Jan 7: shading language, status updates, then free agenda.
Clone this wiki locally