
GPU Web 2020 04 27


So as to not be anonymous animals, the doc is shared for writing with the google accounts that are invited to the meeting. If you can’t edit let cwallez@google.com know. Also be sure to be in “Edit” mode and not “Suggest” mode.

Chair: Corentin

Scribe: Ken / Austin

Location: Google Meet

Tentative agenda

  • Add minimumBufferSize to BGL entry #678
  • Pipeline statistics queries #691
  • Copying of depth24plus formats #652
  • Should we have a core "stencil8" format? #306
  • PR burndown (won't get to all of them but please take a look!)
    • Specification for GPUBuffer mapping #708
    • Add GPUPipelineBase.getBindGroupLayout #543
    • Add pipeline shader module error enumeration. #646
    • Document pipelines #724
    • Add validation rules on copyTextureToTexture #725
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • David Neto
    • James Darpinian
    • Kai Ninomiya
    • Ken Russell
    • Ryan Harrisson
  • Intel
    • Yunchao He
  • Microsoft
    • Damyan Pepper
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Matijs Toonen
  • Mehmet Oguz Derin

Administrative

  • CW: since we didn't have a May F2F it would be great to send status updates via email. I'm hoping to send our status update soon.
  • DM: our status update just went to the Mozilla Hacks blog.

Add minimumBufferSize to BGL entry #678

  • CW: lot of discussion last week.
  • DM: we agreed that the group may need more time to understand why it's needed and the consequences for the API and user. We're discussing this with one of the active people on Matrix. For their use case of an engine, specifying minimumBufferSize would not be a problem. They're creating groups of 64K uniform buffers, so they'd put that into uniformBufferSize.
  • DM: it's not really a min size of the buffer; it's the size of the binding the user creates into the buffer. In most if not all cases, the code that creates a BGL and the code creating a binding for that BGL know about each other and can make sure this constant is known as well.
  • MM: presumably code creating the buffer won't be adjacent to the code which sets up the bindings.
  • CW: 2 cases: 1) something close to Andre Weissflog's: a uniform manager thing, uniforms in a ring buffer. If you do that it doesn't matter, you just use 64K and maybe waste a little memory at the end. 2) a custom engine with a uniform buffer which stays frame over frame; that becomes a problem. What if your per-frame data doesn't know about the size in the shader?
  • MM: don't quite understand - what's the fundamental difference between when we have to write bounds checks and when we're trying to avoid them?
  • CW: if you look at the shader, and specifically the structure you put in the interface, you have vectors, fixed-size arrays, and you know the size exactly. Or an unsized array at the end. In D3D12 and Vulkan, it's not a constant pointer but a pointer to structured memory. Easy to do a bounds check. What if the buffer stops 3 bytes short at the end? Then everything's valid except the last element of the last vector of the last matrix.
  • MM: last week, we wanted shader with unreferenced variable to work even if there's nothing bound to it. If there's a variable off the buffer / end of binding, but unreferenced, that should still work.
  • JG: think we have to figure something out. Most modern APIs punt on it and say, don't give me malformed data. WebGL ones say your buffer's not big enough for this data, and we fail the whole draw call. You're proposing a third consideration, if the buffer's not big enough but we don't use it, it's not an error.
  • MM: I proposed something in the thread, #678. Nobody responded.
  • JG: dneto had one response I think?
  • MM: he was responding to a different comment I made.
  • DN: I wasn't here for the prior discussion but I don't know what the use case is, a shader that references data that's not supplied. Maybe want to use the same shader without modification with multiple draw calls, and don't want to have different behavior... if it's not invoked why care, never mind.
  • CW: if it's not invoked we don't care, but then we have to say it's not invoked statically.
  • DN: right, don't want to have to change shader for cases where I don't supply the data.
  • MM: that's my proposal - the author can provide a min bounds size. Any accesses below the min don't have to be guarded. Others do have to be guarded. Pass 0 if you don't know the bounds. If you want additional performance then you provide the bounds (see the first sketch after these notes). Best of both worlds? Get peak performance, but devs don't need to know the info.
  • JG: I think a lot of the time what developers would want is an unbounded size for the dynamic part, which makes sense. Think we would have to figure out how we handle OOB reads for static data; it's more complicated there.
  • MM: there are already a few different behaviors - reads return 0, writes do nothing.
  • JG: what if you have a read of something that isn't aligned properly, partly straddling boundary of buffer view?
  • CW: what if you read a vec4, your hw has that operation, and only the first float is valid?
  • MM: then the vec isn't valid and you have to do the bounds checks.
  • JG: have to define what happens in that case. Do we define the first float is valid? Or the whole thing is valid?
  • MM: I don't think it's important since it's straddling the boundary.
  • JG: not sure about that.
  • DM: the way I see our bounds checking isn't as a feature or perf feature, it's a portability aspect of the API. It's not something we want people to use, ever. People want to access fields a, b, c - they want to access them, people would expect them to be there. Nobody would expect them to be there if the buffer is not big enough. We only have to handle array bounds checking.
  • MM: Didn’t David Neto say people may want to reuse shaders in different contexts?
  • DN: That’s why I’ve always assumed the fixed portion is always supplied.
  • JG: I'd be fine not supporting the use case where people only supply part of the fixed part of uniforms. A lot of request for not much benefit. Min buffer size gives you this at BGL time so you know about it at pipeline creation time and you can specialize your shaders at that point.
  • MM: If we're doing this I think it has to be moved earlier, to shader module creation time, so you know these sizes when you compile the shader.
  • CW: the idea would be the min buffer size covers the whole static part of the data. You can assume that'll be the case when you compile the shader module. In case you don't supply the min buffer size we clamp in the shader by whatever means necessary. Could use that at createShaderModule time. Do we want to allow people to pass buffers that aren't big enough and do the clamping in the shader? Or do we want draw time validation for this GL-style?
  • MM: I thought the whole motivation for this was to not have to do draw call validation.
  • CW: Right, so if you supply minBufferSize, you don't do draw call validation for that binding. If there are bindings where you don't provide it, is the OOB handling clamping in the shader, or is it a validation error on the draw call, or something else?
  • MM: I see. Not having it be a validation error seems beneficial because people will run the shader in multiple different contexts.
  • JG: I don't agree with that, you have to make a lot of compromises in that case.
  • DM: I'd like to see this case represented. In all my career I haven't seen the case where people want to bind partial buffers.
  • CW: it's a shader that has a binding, it wants to be able to have only part of it bound, access only these parts.
  • MM: an example would be an ubershader. The program knows only certain materials will be used. They know the resources won't be used, so they either don't bind them or bind a zero-sized buffer.
  • DN: the code path's there but dynamically not accessed, not statically.
  • JG: Don’t think we can reasonably make those guarantees
  • MM: the behavior would be, trying to access those fields would trigger clamping, or reads returning 0. Shader never hits this. If you access certain fields, you get good perf, and if you hit others, you get bad perf.
  • JG: Think the cost-benefit is too low. It’s a bunch of implementation complexity, and I don’t see the value proposition vs the complexity.
  • MM: won't we already have to do bounds checking?
  • JG: for certain things but not others. We don't have to check that a certain part of a vec4 is in bounds.
  • CW: I think you can view OOB behavior either as some operation on every memory access, or as only on array accesses, with everything else valid by construction. When you deref a structure, we know it's valid. When you index an array, we clamp the index instead of checking every access (see the second sketch after these notes). We have wins there. For every target API we generate fewer checks.
  • MM: Hard for me to see it that way because in Metal everything is just memory and there are pointers.
  • KR: Metal may require the most checking of any platform because there are no robust buffer access guarantees. It is very easy to kernel panic the machine. It'll be necessary to insert dynamic checks that will be hit on many, many shader invocations.
  • MM: If the app provides a minimumBufferSize, then those are the bounds checks that don’t have to exist.
  • KR: I understand, but if you pass zero then the performance pitfalls will be severe.
  • MM: Today we have no minimum buffer size so we have all those perf implications.
  • JG: we have a lot more functionality than validation in our impls. This sounds like a lot of extra validation for a fairly narrow use case. I think I see the use case, but I think I'd rather reject the calls.
  • DN: the ubershader case and other scenarios, are handled e.g. by a specialization constant in Vulkan. Would let you sweep away large portions of the shader. You know a priori certain image formats won't be used. so their logic is eliminated, you never have to do the checks because they aren't there statically in the code.
  • CW: that would require spec'ing constant propagation?
  • DN: yes. In the ubershader case the code melts away.
  • MM: Think it’s worth saying that the benefit of this is that no other native API has these minimumBufferSizes. Would be cool if authors did not have to specify it.
  • CW: WebGPU requires a lot of bind group compatibility already. Texture format type is something surprising to people.
  • DN: for me it's whether it's at BGL time, pretty early, or when you attach the resource. I think it's OK to delay the check until you're binding the resource. An impl which compiles the shader reflects the min size from the shader, and prevents people from binding something to that shader in a pipeline if it's smaller than the min size.
  • JG: I do think that’s the key of it. The main thing we want is to prevent out-of-bounds access to the static section. The dynamic section will need bounds checks unless we know the buffer is smaller than the minimum buffer size.
  • DN: you're going to need the checks for the dynamic part for sure.
  • JG: think min buffer size is a way to, very early and by construction, prevent out-of-range access to the static sections. Prove early that accesses to the static sections are in range.
  • CW: So assume we want some form of minimumBufferSize because it allows us to do better things. We all agree it’s better for perf. The only two dimensions where we have choices are: 1) do we allow bindings without minBufferSize, and 2) if we allow that, what should happen: the GPU cost of OOB clamping or the CPU cost of draw call validation?
  • DN: Rafael had a point - can't trust what user wrote in BGL anyway - have to reflect and check it. BGL declaration is untrusted.
  • CW: we know it because BGL compatibility means the size the pipeline requires is at most the min buffer size, which is at most what's provided in the bind group. We can make these checks at bind group creation and pipeline compilation separately, and make sure they match later.
  • MM: I like that formulation. Would like to know the tradeoff between CPU and GPU cost. If one is 10x larger than the other, would indicate which way to go.
  • DM: I'd like to understand the use case better. MM can you please provide more input on the issue?
  • MM: yes, I can.
  • CW: I think we can do a prototype of this e.g. with the Forest demo and see the perf difference. Not sure we can do the GPU comparison though.
  • DN: haven't been in that code for a long time.
  • MM: I'll compare that cost then.
  • RC: CW you're comparing the CPU cost of validating BGL against pipeline, vs. cost of checking the bound resources against what shader requires?
  • CW: yes - comparison is between shader doing all work for out-of-bounds clamping, vs. doing that check on the CPU. Responsibility for out-of-bounds checking given to the shader.
  • RC: to me there are 2 CPU possibilities: minBufferSize, and something similar to what JG spec'd, where at every draw call we do additional checking, checking the buffer sizes against what the shader needs.
  • CW: Myles wants to see, if we don't supply minBufferSize, what the additional CPU / GPU costs are. If we force minBufferSize to be there, we have the same costs as today because we already compare BGL pointers.
  • DP: What about the option where we don’t force people to specify it and instead infer it from the shader?
  • MM: sounds worth thinking about but I can't give an answer right now.
  • DP: that would resolve the observation that none of the native APIs need this. And intro to WebGPU wouldn't have this extra baroque step.
  • CW: that's what we'll try to measure with Dawn. Check at draw time to see if bindings are big enough.
  • DP: so you'll do a draw time check.
  • CW: yes.
  • DM: getBindGroupLayout would also help for intro to WebGPU. That's the PR Corentin has in flight.
  • CW: we have some homework to do.
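
For readers following along, here is a minimal sketch of the idea in #678, roughly as described above. The minimumBufferSize field is the proposal under discussion, not a shipped API; the entry layout follows the 2020-era IDL ("uniform-buffer" entry type), and device / uniformBuffer are assumed to already exist.

```ts
// Hypothetical sketch of #678 (field names are illustrative, not final API).
// Assumes `device` is an already-acquired GPUDevice and `uniformBuffer` is an
// existing GPUBuffer of at least 256 bytes.
// The BGL entry declares the smallest buffer binding it will ever be used
// with, so the implementation can validate sizes once at bind group / pipeline
// creation time instead of at every draw, and can skip shader-side clamping
// for the statically sized part of the binding.
const bgl = device.createBindGroupLayout({
  entries: [{
    binding: 0,
    visibility: GPUShaderStage.FRAGMENT,
    type: "uniform-buffer",        // 2020-era entry shape
    minimumBufferSize: 256,        // proposed field; 0 would mean "unknown"
  }],
});

// A bind group created against this layout is checked against the declared
// minimum when it is created, not when it is bound or drawn with.
const bindGroup = device.createBindGroup({
  layout: bgl,
  entries: [{
    binding: 0,
    resource: { buffer: uniformBuffer, offset: 0, size: 256 },
  }],
});
```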
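Corentin's point about clamping array indices rather than guarding every access can also be illustrated with a small sketch. This is not actual generated code, just an illustration of the strategy; the struct and names are made up, and the WGSL uses modern syntax for readability.

```ts
// Illustration of an implementation-side "clamp the index" strategy for a
// runtime-sized array. Fixed-size struct members are valid by construction
// once the binding is known to be big enough, so only the dynamic index needs
// handling. (The zero-length edge case is ignored for brevity.)
const clampedAccessWGSL = `
  struct Particles {
    count : u32,
    data  : array<vec4f>,          // runtime-sized tail
  };
  @group(0) @binding(0) var<storage, read> particles : Particles;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id : vec3u) {
    // Clamp once; the dereference below is then in bounds by construction,
    // so no per-access check is needed.
    let i = min(id.x, arrayLength(&particles.data) - 1u);
    let p = particles.data[i];
    _ = p;
  }
`;
```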

Pipeline statistics queries #691

  • CW: It does what was in the original PR, except when you create a pipeline-statistics query set, you say which set of them you’re interested in. The buffer will contain N * uint64_t where N is the number of statistics you want.
  • JG: that seems like shooting from the hip to me. What if you ask for just vertex and fragment, and then OR in clip or primitives-out? Then the place where you were looking for vertex/fragment is where clip/primitives-out would be?
  • CW: they're counters. For extensibility and to spec things, you need to know, when you resolve the query, the order of the stats. In this proposal it's in the comments: the only statistics that are returned are those in the query set, and they're in the order of their bits.
  • JG: I don't think that's a good idea. Traditionally for bitfields you don't need to know the order they're in. If I saw code that depended on the order of bitflags I'd ask for it to be rewritten. Instead, could we always have N (N=5, right now) 64-bit ints, and the ones you don't ask for have some sentinel value, perhaps 0?
  • CW: what when we have pipeline stats for raytraced shaders?
  • JG: you get back a bigger buffer. Your old code will still work. If there are mesh invocations at 0x20, and your code knows about them, that's OK.
  • MM: then you can only resolve one of these at a time. If you resolve one query heap and then another, they have to have their sizes baked in, which is fine. My question: how are we going to add a sixth field to this?
  • CW: Would be bad if you enable an extension and the number of entries resolved is always bigger than 5. Like you enable an extension and now there’s 7. Then code that doesn’t know about the extension will see the layout changed. Should be defined entirely in terms of arguments of the query set.
  • JG: idea: to define the order, you could make it explicit. Array of enums instead of a bitfield. Then it would make sense - be clear about what the order is, instead of relying on the order of these bits. That's the part that feels more foreign to me.
  • KN: would we validate the order in the array or allow arbitrary order?
  • JG: arbitrary.
  • MM: duplicates would work?
  • JG: if we can make it work, but I'd reject duplicates.
  • MM: in Metal the order is spec'd. Would have to reorder in compute shader.
  • KN: I think it'd be necessary for all APIs.
  • JG: if it's not necessary for people, don't add flexibility. But if people think it's easy, do it. Accumulation of each tiny easy thing over a while becomes difficult.
  • MM: Another possibility is to add a bunch of different types of these heaps that you can resolve all independently.
  • CW: Seems more expensive because we’ll have to coalesce pipeline statistics that appear next to each other in the command stream.
  • MM: Right. That was a bad idea.
  • JG: we'd rather have more ideas than fewer.
  • CW: a list of enums in the query set sounds good (see the sketch after these notes). Can just flip a coin on whether we allow duplicates or not. Either way's fine.
  • MM: I agree with Jeff to use array of items rather than a bitfield. Seems better design.
  • CW: no duplicates for now, see if anyone yells at us.
  • MM: is this an optional API?
  • CW: yes, it's an extension.
  • MM: if people want pipeline stats on older Vulkan their device creation would fail?
  • CW: yes, they'd notice the feature isn't available.
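
A sketch of the direction agreed on above: the query set lists the statistics it collects as an ordered array of enums, and results are laid out in that order. This is an extension feature still under discussion; the identifier names are illustrative (close to the draft spec of the time, not guaranteed final), and `device` is assumed to have been created with the pipeline-statistics feature enabled.

```ts
// Illustrative sketch of the #691 direction: the result layout is defined
// entirely by the query set's own arguments, one uint64 counter per listed
// statistic per query, in the order given.
const querySet = device.createQuerySet({
  type: "pipeline-statistics",
  count: 1,
  pipelineStatistics: [
    "vertex-shader-invocations",
    "fragment-shader-invocations",
  ],
});

// Two statistics requested, so each query resolves to 2 * 8 bytes.
const resolveBuffer = device.createBuffer({
  size: 2 * 8,
  usage: GPUBufferUsage.QUERY_RESOLVE | GPUBufferUsage.COPY_SRC,
});
// The counters are later written out with resolveQuerySet() on a command encoder.
```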

Copying of depth24plus formats #652

  • CW: anything this group needs to discuss KN?
  • KN: investigation needs to be done about what hardware supports. Also the thing I originally opened this issue about: should we do data conversions to/from depth24 during copies?
  • JG: it's depth24plus because it might be depth24s8 or depth32?
  • CW: yes.
  • JG: is the format for depth24 well defined?
  • KN: yes, I believe so. You don't copy the whole thing, just the depth. Not sure what happens. You have to say depth/stencil plane, but you can specify the depth plane of a packed format.
  • JG: I mean, if you write 0.5 into the depth output, are the bits in depth24 format standard?
  • CW: I think mobile GPUs can have crazy optimizations around these formats.
  • KN: assume they could be undefined anyway, and valid only if you copy into the same type.
  • JG: that's what I'm worried about. Potential portability hazard.
  • KN: we could implement this via sampling, which would give reliable results. Copying in is also possible...
  • CW: by drawing a bunch of points?
  • JG: could draw a quad and write frag_depth (see the sketch after these notes).
  • KN: horrible.
  • JG: it's done.
  • DM: why not run a compute shader to convert?
  • JG: if the format's specified, yes. If you can do the sampling stuff in a compute kernel, that's fine too.
  • CW: if the depth24plus format is sampleable, we can do copies as if it were depth32.
  • KN: do we want to require that polyfill?
  • JG: alternative is the user could do this if they want these copies.
  • CW: this is gated on being able to sample from this format always.
  • JG: I'd expect to be able to sample from it. This is early functionality from WebGL 1.0.
  • CW: Vulkan technically does not guarantee you can sample from any format except depth32.
  • KN: don't always need depth for a deferred renderer. Some deferred rendering techniques use it.
  • CW: We’ll take the action item to dig and make sure we can sample from depth24plus and depth24plus-stencil8. Assuming we can do so, do we want magic copies with it?
  • DM: think that would make sense. Single-pass copy, not more complexity to do that work.
  • JG: it's slower.
  • DM: slightly slower.
  • MM: Think we already made this decision when we decided to only have one texture format. If it’s going to be a texture format then it should be able to copy.
  • JG: there's a long history of depth/stencil having different behavior than color formats. Would be nice to guarantee this. If what we're doing in the general case is what the users would do anyway, then I'd prefer to let user do it so they could make better decisions about whether there's this sample/write pattern instead of a direct copy.
  • CW: In general I agree. The concern here is that in the vast majority of systems, the depth24plus will actually be depth32, and we can copy much faster if we know that.
  • JG: do we do the same thing for packed depth/stencil formats?
  • CW: no...the future is, depth24plus is really depth32-float. Pretend as much as we can that that's the case, but don't specify the precision, and make copies fast when we can. Emulation going on, a bit slower. Pushing the cost of doing depth sampling + draws to copy on to the user isn't the right tradeoff IMO.
  • JG: It does help when you’re copying into R32F, but if you had a depth24stencil8 packed format, which is still pretty common because of memory bandwidth, and you actually had a depth24 format, you could copy between those more easily.
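
The "draw a quad and write frag_depth" polyfill mentioned above can be sketched as follows. The WGSL is modern syntax for readability, `device` and a depth-aspect view `srcView` of the source texture are assumed to exist, and the render-pass setup (destination texture as the depth attachment, depth writes on, depth compare "always") is described in a comment rather than shown.

```ts
// Sketch of emulating a depth24plus-to-depth copy by sampling the source and
// re-emitting each texel through frag_depth in a fullscreen draw whose depth
// attachment is the destination texture.
const copyDepthModule = device.createShaderModule({ code: `
  @vertex fn vs(@builtin(vertex_index) i : u32) -> @builtin(position) vec4f {
    // Fullscreen triangle.
    var p = array<vec2f, 3>(vec2f(-1.0, -1.0), vec2f(3.0, -1.0), vec2f(-1.0, 3.0));
    return vec4f(p[i], 0.0, 1.0);
  }

  @group(0) @binding(0) var src : texture_depth_2d;

  @fragment fn fs(@builtin(position) pos : vec4f) -> @builtin(frag_depth) f32 {
    // Read the source depth texel under this fragment and write it out as depth.
    return textureLoad(src, vec2i(pos.xy), 0);
  }
`});
// `srcView` is bound at @group(0) @binding(0); the destination is attached as
// the render pass's depth attachment with depthWriteEnabled = true and
// depthCompare = "always", so rasterizing the triangle performs the copy.
```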

Should we have a core "stencil8" format? #306

  • CW: even if we don't have this we can emulate with d24s8.

PR burndown

  • Lots of PRs needing attention.

Agenda for next meeting

  • CW: think we'll have our writeBuffer optimization explainer.
  • CW: probably won't have done homework about minBufferSize, in particular CPU-side checks.
  • CW: should we put writeBuffer discussions up again for next week?
    • DM: sounds good.
  • CW: depth24plus formats again? minBufferSize? Other topics?
  • MM: thought we had homework to do about minBufferSize. Can't guarantee that'll be done in a week.
  • CW: cancel minBufferSize then for next week.
  • CW: sourcemaps again?
  • MM: think we're all in agreement about sourcemaps.
  • YH: resource tracking? Corner cases? For compute, per-dispatch?
  • CW: yes - could you write a list of the corner cases you've run into so we can study them beforehand?
  • YH: OK.
  • MM: I'd like to read that before the call.
  • CW: we can have discussions on the open PRs. Issues volunteered.