
Minutes 2018 10 29


GPU Web 2018-10-29

Chair: Corentin

Scribe: Ken (thanks! <3)

Location: Google Meet

TL;DR

  • Agreement on having “homework” done 48 hours (2 working days) in advance of meetings
  • Multi-queue https://github.com/gpuweb/gpuweb/pull/95
    • Concern that exposing queue families instead of just queues is too complex. A simpler mechanism might require transparently multiplexing onto multiple backing API queues, which would be difficult and slower.
    • It is possible to multiplex WebGPU queues on a single backing API queue.
    • Concern that the “implicit synchronization” doesn’t provide enough information to parallelize in a useful way, and that it is easy to fall off the fast path.
    • We could have hints for transferring resources between queues, and give warnings when falling off the fast path. Having them be errors would be too likely to break applications.
  • AttachmentState https://github.com/gpuweb/gpuweb/pull/102
    • Consensus to go forward without the MSAA bits of #102 and add them later.
  • Submission model https://github.com/gpuweb/gpuweb/pull/100
    • Consensus to merge
  • Meta discussion about when to write a spec
    • Right now we have a lot of independent markdown files; they need to be kept in sync.
    • Will start a proper spec later; right now it would be too much friction to update it all the time.

Tentative agenda

  • Multi-queue #95
  • AttachmentState #92, #102
  • Push constants #75
  • Submission model #100
  • Agenda for next meeting

Attendance

  • Apple
    • Dean Jackson
    • Justin Fan
    • Myles C. Maxfield
  • Google
    • Austin Eng
    • Corentin Wallez
    • Dan Sinclair
    • James Darpinian
    • Kai Ninomiya
  • Microsoft
    • Chas Boyd
    • Rafael Cintron
  • Mozilla
    • Dzmitry Malyshau
    • Jeff Gilbert
  • Yandex
    • Kirill Dmitrenko
  • Joshua Groves
  • Mehmet Oguz Derin

Administrative updates

  • CW: WG creation?
  • DJ: everyone agreed on 3-clause BSD license
  • CW: next meeting, we’ll talk about shading languages again.
  • CW: in https://github.com/gpuweb/gpuweb/pull/104, have text saying “please prepare stuff 24 hours before meetings, and everyone should review them”. RC proposes 48 hours. Sounds fine to me, who else?
    • KN: SGTM

Multi-queue #95

https://github.com/gpuweb/gpuweb/pull/95

  • JG: previously, would have given summary of it, because we didn’t have homework deadlines, but everyone’s reviewed this.
  • CW: queue families: feedback is that exposing 3 capability bits is more than every API exposes.
  • JG: every API but Vulkan. Level of capability here is the same as Vulkan’s. Copy is optional because of upcoming D3D12 changes. This is the superset of what the underlying APIs expose. Might be extra API noise, but best exposes the underlying API capabilities.
  • MM: what’s the reason for gathering queues into families, rather than each queue having capabilities?
  • JG: comes from Vulkan. Can have different queue families that don’t have the same capabilities. Haven’t checked to see if anyone does that.
  • MM: design rationale?
  • JG: my assumption is that Vk had a reason to do it this way so decided to expose it in the same way. Don’t know the design rationale from Vk working group though.
  • DM: (1) when you create a command buffer you only need a queue family. Not attached to particular queue. (2) within queue family, you don’t need any transfer of ownership of any resource. Synchronization between queues is very simple in the same family.
  • MM: we have to have synchronization between different queues in the same family.
  • JG: it’s easier sync. Going across queue families is more difficult than in the same family. Would have to e.g. transfer objects between queue families in Vulkan. Release / acquire semantic.
  • CW: maybe each queue has its own write cache, and even if they’re sharing the same HW they might have different caches. Might need to flush caches, etc. when transferring objects between queue families. Also: usually, graphics hardware has only one graphics queue. (True for recent NV, AMD HW). But AMD can have 16 HW compute queues. Cheap to create compute queues, expensive to create graphics queues.
  • DM: but that’s not a case for the existence of queue families. Could just as well create different queues with different capabilities.
  • MM: in this case WebGPU could say that it has a collection of compute queues, and exposure of families wouldn’t have to be done in the API.
  • JG: depends on how much API friction you’re willing to have.
  • CW: we could tell the app what kinds of bits are available on queues, and implementation does multiplexing of queues.
  • JG: that’s included in this proposal.
  • RC: are we going to force web developer to move things between families?
  • JG: in later work, probably more optimal if app can voluntarily give hint that it’s transferring something to another queue. But mandating is probably not something we’ll do.
  • CW: possible solution to queue family discussion: what if the device says, I support either compute, or graphics+compute, or both? Doesn’t expose the queue families, but the aggregation of what the queue families can do. Simpler in terms of discoverability. Doesn’t say how things are segmented. (A possible API shape is sketched at the end of this section.)
  • JG: that’s one way to implement this proposal. The queue families in this proposal do not necessarily map 1:1 to Vulkan queue families.
  • CW: understood. Looking at IDL: WebGPUAdapter has QueueFamilies right now. Thinking QueueSupportedOperation, and …
  • JG: first queue family in list is the most capable in that model.
  • DM: wouldn’t it be, whether you support graphics or not? If you have graphics queue, whether you support compute or not.
  • RC: I think CW’s suggestion will work, but down the line, we’ll have video queues that are only video queues and can’t work with anything else.
  • CW: either we spec that video queue can’t be used as anything else, or implementation can distribute work to queues behind the scenes.
  • JG: want to talk about whether we can transparently multiplex stuff. Think we want to not assume that we can do transparent multiplexing. Difficult operations. Part of reason for multiple queues is so we don’t have to multiplex under the hood. (multiplexing = moving things between queues) Avoiding transferring, etc. Useful for API syntactic sugar to be able to expose arbitrary number of queues, but a narrowing operation saying, we will reduce things onto a smaller number of queues in the backend. But we should surface when that happens. Easier to go from many:few than few:many.
  • CW: yes. But almost equivalent to Metal model, 1:many.
  • JG: Metal does the scheduling internally. Think that’s unnecessarily constraining the API.
  • CW: back to queue families: understand your point about app controlling which queue family they use. More of 1:1 mapping between app’s queue and backing API queues.
  • MM: q: is there any HW out there that exposes two different families with the same capabilities?
  • JG: not sure.
  • CW: not that I know of. AMD has a graphics-only pipe though. Compute, Graphics+Compute, and Graphics.
  • MM: two different families?
  • CW: 3 different families. Compute. Graphics+Compute, has certain number of queues behind it. And Graphics.
  • DM: can you clarify if it’s possible to have 1 queue serving multiple user queues? e.g. 2 real queues in Vk, one waits on the other; doesn’t matter which you submit first. Parallel, and supposed to be independent. But if user submits to one that’s supposed to wait for the other first, we’ll have to figure out dependencies.
  • CW: JG’s proposal says that if a submit depends on in-flight operations the correct sync happens. Happens as if submits were resolved instantly but they might be parallelized. If every submission is synchronous, we know which order to schedule things.
  • JG: in Vk, you’re not supposed to submit work that depends on work you haven’t submitted yet. Shouldn’t be possible to deadlock a queue like that.
  • DM: follow-on concern: different read access from different queues.
  • JG: similar to having a writer show up. Have to change the read access, and if changing the read access means waiting for current read op to complete, have to do that, and it janks. This is why in the follow-up, providing hints for these things would be useful. This approach lends itself a lot to having hints to barriers, etc. Know we don’t want barriers, but hints would provide apps close-to-native performance.
  • CW: concerns: (1) implicit barriers are tractable because until submit we can go back into the past and insert pipeline barriers for the current submit. Can go back and patch the beginning of the command buffer by inserting secondary command buffer, etc. In this model this patching would be very difficult; hard to figure out how to do this in Dawn. (2) with implicit sync, it’s really easy for the app to change something that’ll use a resource that was read from a diff queue just before, parallelism breaks, and they don’t know why. It’s just hints.
  • JG: that’s when we warn about excess serialization.
  • CW: but we don’t know whether it’s on purpose or not.
  • JG: your complaints hinge on this: without hints expressing the synchronization the user wants, multi-queue isn’t viable?
  • CW: not just hints, hard constraints, because otherwise you don’t know if you’re on the fast path.
  • JG: implementation can warn. If we provide hints and say you should use them, it seems fine to warn that it had to inject synchronization. If in the field something goes wrong, impl will steer app back onto the fast path. Can give hints, but correctness is more important than absolute performance.
  • CW: OK I can see that.
  • JG: there’s some trodden path here; it’s what WebGL does. If you do something that causes implementation to do some more work, we warn about it. Higher end users do something about the perf warnings.
  • CW: will try harder to figure out how to implement this in Dawn.
  • JG: there’s some stuff missing in the proposal still. Need you to propose hints. I think I can make it viable if I add hints to the proposal.
  • CW: that would help. I’ll try to figure out what kind of hints.
  • CW: problem: making it useful so that you get actual speedups and not just CPU overhead. App will assume because it called submit() on 2 queues that it’ll happen in parallel, but think it’s hard to write code in the impl to make that happen, as well as the app making the correct calls. Will be voodoo magic whether parallelism happens at all, and it would be better if that weren’t the case. E.g. alignment in WebGL. You try to use parallelism and don’t know whether it’s really working.
  • RC: setting aside impl where you put everything on one queue. You think it’s possible to use multiple queues?
  • CW: Yes. the question is what’s the CPU overhead of doing this (small patches to command buffers, etc.)
  • RC: concern: what will we do about resources used by multiple queues on multiple workers? Lots of race conditions etc. Make resources read-only?
  • JG: this is a case where you could fall off the fast path accidentally.
  • DM: those last accesses will be outdated at submission time, not recording time. Not too worried about synchronization cost.
  • JG: last_access will be set at command buffer finish time. When we submit, we use the list of last accesses to figure out which synchronization we need to do. (This bookkeeping is sketched at the end of this section.)
  • RC: if you’re on diff threads and submit one cmdbuf which reads from it to one queue, and writes to it on another queue, …
  • CW: if app races its queue submits it’s its own problem. In our impl when we update state tracking we have to be non-racy.
  • DM: make queues non-mutable?
  • JG: that might be a worker.
  • DM: you’ll create the device on the worker?
  • JG: yes. That’s what everyone wants to do in WebGL right now.
  • RC: typical scenario, download image or create terrain texture on one thread, and use it from another one. Perhaps you can only write to something from one thread, and then can only read from it on another thread.
  • CW: read/write on GPU or CPU?
  • RC: on GPU. Could be on CPU as well. Doesn’t matter?
  • JG: could be too constraining. Hard to do more sophisticated manipulations.
  • RC: there are scenarios where they don’t fall into producer/consumer scenario?
  • JG: just worker thread pools of producers would no longer work.
  • DM: we’re talking about queues, not recording. Don’t think you’ll have a queue per worker.
  • JG: sync is easier if resources are read-only.
  • DM: don’t think it makes it easier. If only on this thread you can submit cmdbufs that write this resource, still have to sync with other queues that read this resource.
  • CW: think we should move on to the next topic at this point.
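
A concrete shape for CW’s aggregated-capabilities idea above, as a minimal TypeScript sketch. All names here (`QueueOp`, `requestQueue`, `supportedCombinations`) are hypothetical illustrations, not IDL from #95: the adapter advertises which combinations of operations its queues can support, and the implementation stays free to map or multiplex the returned queues onto backing API queues however it likes.

```typescript
// Hypothetical sketch of aggregated queue capabilities; none of these
// names come from the actual #95 proposal.
type QueueOp = "graphics" | "compute" | "copy";

interface AdapterQueueInfo {
  // Each entry is one combination the adapter can provide, e.g.
  // [["graphics", "compute", "copy"], ["compute", "copy"]] on an
  // AMD-style device with a compute-only pipe.
  supportedCombinations: QueueOp[][];
}

interface CommandBuffer {} // opaque placeholder

interface Queue {
  submit(commandBuffers: CommandBuffer[]): void;
}

interface Device {
  // The implementation picks (or multiplexes onto) a backing queue that
  // covers the requested operations; the mapping to hardware queues, or
  // to Vulkan queue families, is never exposed.
  requestQueue(requiredOps: QueueOp[]): Queue;
}
```

This keeps discoverability simple while still letting an implementation back a video-only queue (RC’s example) or AMD-style compute queues with whatever native queues exist.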
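
A sketch of the submit-time bookkeeping JG and DM describe above (last_access recorded at Finish() time, consulted at submit() time), again with hypothetical TypeScript names rather than anything from the proposal. The point is the shape of the check: cross-queue reuse with a write involved forces injected synchronization plus a performance warning, while read-after-read across queues stays on the fast path.

```typescript
// Hypothetical implementation-side tracking for #95's implicit
// synchronization; illustrative only.
type Usage = "read" | "write";

interface ResourceState {
  lastQueue: QueueImpl | null; // queue of the most recent submitted access
  lastUsage: Usage | null;
}

interface RecordedAccess {
  resource: ResourceState;
  usage: Usage;
}

interface CommandBufferImpl {
  lastAccesses: RecordedAccess[]; // filled in at Finish() time
}

interface QueueImpl {}

declare function injectCrossQueueSync(from: QueueImpl, to: QueueImpl): void;

function submit(queue: QueueImpl, cb: CommandBufferImpl): void {
  for (const access of cb.lastAccesses) {
    const res = access.resource;
    const crossQueue = res.lastQueue !== null && res.lastQueue !== queue;
    const hazard = access.usage === "write" || res.lastUsage === "write";
    if (crossQueue && hazard) {
      // Fell off the fast path: insert the fence/semaphore wait, then
      // warn rather than error, per the discussion above.
      injectCrossQueueSync(res.lastQueue!, queue);
      console.warn("WebGPU perf warning: cross-queue synchronization inserted");
    }
    res.lastQueue = queue;
    res.lastUsage = access.usage;
  }
  // ... then enqueue the command buffer on the backing API queue ...
}
```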

AttachmentState #102 (previously #92)

https://github.com/gpuweb/gpuweb/pull/102

  • DM: after our discussion, removed AttachmentState as a handle, and provide all the data at BeginRenderPass and pipeline creation. Added bits of MSAA support. Now knows about resolve attachments. E.g.: create RenderPipeline set up to rasterize with sample count = 2; if one attachment has sample count = 1, can assume this is the resolve attachment. This is the reasoning for the MSAA counts to be added.
  • MM: not possible to do MSAA resolve from e.g. 8 samples to 4 samples?
  • DM: on some architectures, maybe, but in general, no.
  • MM: so, modeling resolves as “putting magic number 1 in a few places” isn’t the right design. Should mark places where resolves happen, or where they should happen.
  • CW: what about: create render pipeline, tell it how many samples, and whether it resolves at end of render pass. Descriptor: you just have extra resolve attachments to store into. First attachment is resolved into that texture at end of render pass. A StoreOpResolve kind of thing. Idea is you have sample count + will-resolve on the render pipeline, and resolve attachments on the render pass descriptor. (Both options are sketched at the end of this section.)
  • DM: isn’t that just more to go wrong? Instead of additional flag per attachment, can just have sample count in renderpipeline descriptor. In your proposal there are just more constraints.
  • JG: there are also cases where you don’t want to resolve a multisampled buffer, but rather discard it.
  • DM: good question: what if you have a resolve attachment + StoreOpDiscard?
  • CW: would resolve, but not store the multisampled data.
  • JG: I mean: render pass, have depth buffer, writing to it + resolving + testing, but don’t care about seeing it. You just discard.
  • RC: could also be a renderpass where you write to 2 things, one multisampled and one not. Maybe you have some attachments that are 4x and some 2x? What is a source? What’s resolved to where?
  • DM: every attachment has a samples field; so does descriptor. We can compare those when creating render pipeline.
  • RC: why does render pipeline descriptor need samples?
  • JG: render pass compatibility.
  • DM: think there’s something incorrect in the PR. A bit incomplete in this area.
  • JG: so we still have AttachmentsState in render …?
  • CW: we need a multisampling investigation.
  • MM: agree.
  • CW: getting back to this PR: do we have an AttachmentState describing all the attachments?
  • DM: what if I remove all the multisampling stuff from this PR and we figure that part out later?
  • CW / MM: yes sounds good.
  • DM: will do.
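
To make the two designs discussed above concrete, here is a hedged TypeScript sketch contrasting the PR’s implicit approach (a sample-count mismatch between pipeline and attachment implies a resolve) with CW’s explicit one (a StoreOpResolve-style store operation plus a dedicated resolve target). All field names are illustrative, not the IDL from #102.

```typescript
// Hypothetical descriptor shapes for the two options; neither is the
// actual #102 IDL.
interface Texture {} // opaque placeholder

// Option A (as in the PR): resolve is inferred. If the pipeline
// rasterizes at sampleCount = 2 and an attachment's texture has
// sampleCount = 1, that attachment is assumed to be a resolve target.
interface AttachmentDescriptorA {
  texture: Texture;
  sampleCount: number; // mismatch with the pipeline's count implies resolve
}

// Option B (CW's suggestion): resolve is explicit. The store operation
// says whether to resolve, and the resolve target is named separately.
type StoreOp = "store" | "discard" | "resolve";

interface AttachmentDescriptorB {
  texture: Texture;        // the multisampled texture
  storeOp: StoreOp;        // "resolve" triggers the resolve at pass end
  resolveTarget?: Texture; // required when storeOp is "resolve"
}

interface RenderPipelineDescriptor {
  sampleCount: number; // needed either way, for render pass compatibility
}
```

Option B also answers JG’s case directly: a multisampled depth attachment can take "discard" while a color attachment resolves, and RC’s “what’s resolved to where” question gets an explicit answer per attachment.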

Submission model #100

https://github.com/gpuweb/gpuweb/pull/100

  • CW: most people happy with it. Any concerns?
  • MM: meta-question: this is a markdown file that doesn’t contain any IDL changes, and does contain a bunch of description about how to use the API. Presumably this would be put into a spec if we had one? What’s the plan? Will we keep amassing lots of tiny markdown files?
  • CW: for MVP, I suggest we keep amassing these markdown files. While we need these details, we don’t need details of triangle rasterization. As soon as we have an MVP, would be great to create a spec and start doing changes like these in the spec itself.
  • RC: before we get it reviewed by the TAG, it can’t be these small files which are out of date. Need an explainer (not how mipmapping works, but putting everything together) for noobs that don’t know WebGL. Also: think we’ll need more explanation; people change the IDL in pull requests, and the PR carries the description rather than a spec-like document.
  • JG: that’s fair. I’m with Corentin that the overhead’s higher right now because of where we are in the IDL changes.
  • RC: if you don’t want to change the explanation text then maybe we should delete it. Wrong text is worse than no text.
  • JG: I treat the IDL as the source of truth, and the other stuff is explanation around it. Hard to know what to go back and invalidate. Happy to review changes to fix up IDL vs. description.
  • RC: keeping things up to date would help me; I don’t remember what’s up to date, esp. when giving documents to others.
  • MM: is it going to be one person’s job to mark broken things?
  • JG: no. If someone wants to take that on, great. But let’s treat these like normal bugs.
  • CW: there aren’t that many md files now, but they’ll grow. You want one doc that specifies the whole API?
  • MM: don’t think it’s necessary yet, but at some point ~100 md files is not workable.
  • JG / CW: agree.
  • CW: AI: go over files and figure out correct/incorrect ones, and fix them up.
  • JG: when we’re writing these md files I know line wrap is helpful, but please don’t make paragraphs all on one line. Hard to give comments on specific paragraphs. Hit return after every period?
  • CW: SGTM. I can also fix this up.
  • DM: what’s incomplete about #100?
  • CW: there were 2 comments. Command buffer reuse, question of whether there should be a loop between ready and recording. Need clarification that passes are all inside recording.
  • MM: if passes are all inside recording then is there an explicit call to go from Recording to Ready state?
  • CW: Finish().
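
A minimal sketch of the lifecycle #100 describes, with hypothetical TypeScript names: passes can only be recorded while the command buffer is in the Recording state, and Finish() is the explicit transition to Ready, with no loop back from Ready to Recording.

```typescript
// Hypothetical sketch of #100's command buffer states; names are
// illustrative, not the proposal's IDL.
type CommandBufferState = "recording" | "ready";

class CommandBufferBuilder {
  private state: CommandBufferState = "recording";

  // Passes live entirely inside the Recording state.
  beginRenderPass(): void {
    if (this.state !== "recording") {
      throw new Error("passes can only be recorded before Finish()");
    }
    // ... record pass commands ...
  }

  // Explicit one-way transition from Recording to Ready; per the
  // discussion, there is no loop back to Recording.
  finish(): CommandBuffer {
    this.state = "ready";
    return new CommandBuffer();
  }
}

class CommandBuffer {} // opaque, submittable handle
```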

Push constants #75

https://github.com/gpuweb/gpuweb/issues/75

  • CW: let’s try to work this out on Github. Next week’s meeting is booked.
  • MM: I’m trying to write a small benchmark of push constants and buffer accesses.

TPAC Summary

  • CW: people were worried about market penetration (how many devices will support WebGPU). Hardest part will probably be penetration on Vulkan devices; largely Google’s problem.
  • CW: lots of questions about, “is this supported”? Mostly “yes”. Raytracing? Not right now.
  • JG: e.g. on Android ES 3.1 is ~70%? Vulkan will be lower.
  • CW: it was mainly about telling other people about WebGPU.
  • MM: 2 people independently asked if we could replace GLSL in WebGL with WHLSL. I said no. :)
  • KR: we can keep our minds open. :)

Agenda for next meeting

  • Shading language
  • (stretch) push constants.