Minutes 2019 11 04

GPU Web 2019-11-04

Chair: Corentin

Scribe: Ken, Austin

Location: Google Meet

TL;DR

#426, clarify buffer usage validation rules
- Dawn’s restriction comes from the D3D12 heap types
- Concern was also that it could lead to performance pitfalls in non-UMA
- Could implement something akin to Metal managed buffers in WebGPU.
Tessellation #445
- Performance can go from slightly worse to much better than compute + vertex shader
- D3D12 and Vulkan have 2 stage tesselation while Metal replaces the first one with compute shaders
- Discussion about when we should look at such extension, also around mesh and task shaders.

Tentative agenda

#426, clarify buffer usage validation rules
Tessellation #445
Sparse resources #455 (maybe?)
(Your topic here)
Agenda for next meeting

Attendance

Apple
- Dean Jackson
- Justin Fan
- Myles C. Maxfield
Google
- Austin Eng
- Corentin Wallez
- James Darpinian
- Kai Ninomiya
- Ken Russell
- Shrek Shao
- Ryan Harrisson
Intel
- Yunchao He
MicrosoftDamyan Pepper
- Rafael Cintron
Mozilla
- Dzmitry Malyshau
- Jeff Gilbert
Mehmet Oguz Derin
Timo de Kort

Administrative item

F2F schedule
- OK to have WebGL on Tuesday Feb 11, and WebGPU on Wednesday and Thursday Feb 12-13?
- DM: works.
- DJ: sounds fine.
- MM: same days, just rotated?
  - KR: yes.
- RC: I assume people will go to all 3 days. Might have fewer people the first day?
Schedule discussion with the Khronos liaison (Neil Trevett) on how WebGPU and other external standards can and cannot use SPIR-V.
- Would need to be an IP-free meeting.
- Would that work for everyone?
  - DJ: great idea. Don’t mind doing it ASAP. Not sure whether we need to discuss anything before.
  - CW: just need to be clear what to discuss
  - KR: some topics: viability of forking; any way the group can collaborate with Khronos given the new simple liaison between Khronos and the W3C, etc.
  - MM: Is there an email thread about this?
  - CW: No, not yet.
  - KR: Discussions have been at F2F meetings and didn’t want to start the discussion via email.
  - MM: There are people who would like to join that aren’t in the calls.
  - CW: Yes, that’s why I suggested planning ahead of time; I’ll email about this.
CW: At a point in WebGPU where we’re investigating advanced features or further explorations. It seems the API is in a fairly good shape with small places we need improvements. The spec and test suite haven’t moved super fast recently. I think we should spend time focusing on these to build the parts we agree upon more. We should make sure we stabilize and polish what we have now.

Buffer Partitioning

CW: I don’t think this is going to work, so I’ve sort of bailed on the proposal. I will do a write-up of the issues and someone else can take it if they like.
MM: who’s going to move this forward?
CW: I expect either devsh.programming@, or DM seemed interested. Hoping someone will take it.
MM: I hope someone will take it too.
DM: can we formalize the exact problem that this solution is meant to solve? Only talking about transfers to CPU? Or multiple usages on GPU as well?
CW: don’t think we want to use part of the buffer as storage and part as uniform, for example.
DM: that’s not very convincing. But if it’s the case, then having a transfer queue or something similar could be the solution instead of the buffer views.
MM: should we have this conversation after CW writes up his description?
DM: I’m discussing the problem space here. CW doesn’t believe this is a solution. Trying to figure out what we’re solving.
CW: I think an API which allows different parts of a single buffer be used in different ways as buffers will run into a large number of problems. I’ll write down why. Being able to use buffers in different ways at different times was intended to be a solution for data upload - keep it in shared memory. Try to replicate what some people do in native, where they have a giant ring buffer used by the GPU in different ways at different times.
DM: I can try to make the queue idea more shaped up by the next meeting.
MM: problem you just described sounds like the problem that Metal heaps try to solve. One big allocation, then sub-allocate inside it.
CW: yes...the thing is that you don’t have explicit control in Metal heaps. You say, I want 50 MB, then allocate resources out of it until no more space. Sounds more like a cache. Another thing people wanted was to do dynamic offsets across the whole buffer so they don’t need to dynamically allocate bind groups per frame.
MM: maybe when you write this up, could you add a sizeable chunk describing the problem, so we make sure we aren’t narrowly identifying it?
CW: will do.

#426, clarify buffer usage validation rules

CW: Francois was writing tests against Dawn’s implementation. MAP_WRITE can only be used with COPY_SRC, MAP_READ only with COPY_DEST. Mimicking D3D’s. Can be an arbitrary limitation that doesn’t need to exist on other APIs; can be emulated.
CW: do we want to have a restriction where mapped buffer types can only be used with a certain set of usages?
RC: does this mean you can map something for writing and then scribble into it when you’re drawing into it?
CW: no. Lack of data races is still there. Usage here: usage at buffer creation. Can’t create a MAP_READ buffer that’s also a uniform buffer. Can only be COPY_DEST in Dawn’s impl today. Similar for COPY_SRC. We put this limitation in Dawn because D3D12 has upload and readback that require this.
DP: what Corentin said is correct. Readback resource has to be created in COPY_DEST. Opposite for COPY_SRC.
MM: once you’ve created it in the right copy state can you change that state?
DP: no. Have to copy it into a different heap to manipulate it in that way. That’s the sort of thing D3D11 drivers would do behind the scenes.
DP: extra complication: I don’t know all the details. On UMA systems, can have custom heaps, where you can map them and also use as shader resource. But for general upload and readback heap these restrictions apply.
DM: didn’t we say we want zero copy for vertex buffer updates? This limitation would make that impossible.
MM: sounds like it’s already impossible on one API - the question is whether we should apply it to others.
DP: it’s a hardware restriction on one GPU. Not sure about the others.
DM: is it a HW restriction? Vulkan lets you do some of these.
DP: have to get back to you. Painful restriction.
JG: you can at least flow toward the GPU if I’m reading the heap rules correctly. Can’t you still use MAP_READ as shader resource?
DP: yes but it’ll be terribly slow.
JG: OK. Vulkan, you can make maps of memory. Theoretically, CPU-distant, not mappable, but GPU usable. Not sure what presence is in the wild, but Vulkan supports impls where you can’t map GPU memory. Have to map into intermediary buffer, then copy on the GPU.
DM: Vk doesn’t support any system that doesn’t support using cpu-visible memory for whatever you want.
CW: we did talk about this a long time ago - does UMA let you do more things, or have it as a UMA extension.
- KR: I remember this discussion.
- CW: had open questions about this in the NXT memory mapping document.
- DM: boundary between UMA and non-UMA isn’t that strict. Recent discrete AMD GPUs have a small unified memory pool. Would prefer we do a workaround on D3D12 systems that don’t have UMA, because everyoen else has this workflow.
- CW: do we want to allow people to create a uniform buffer in shared memory on discrete GPUs where it’ll be slow?
- JG: if they’re not reading from it all the time.
- DM: shared memory isn’t dog slow.
- JG: In Vulkan, it’s memory that’s not device local but host visible.
- DM: But it can be both device local and host visible.
- JG: No guarantee that all systems have this. That’s what the upload buffer is. Non-device-local, host-visible memory.
- RC: concern that people will see the flags, think they can write them from the GPU?
- DP: beginner scenario to create all textures and vertex buffers in upload heaps, not realizing you’re leaving a lot of perf on the floor.
- RC: risk is that we let people do this, works fine on Intel, and have perf surprises on other system.
- CW: the new AMD memory DM talked about is uncached memory. Don’t think we want to give an arraybuffer to uncached memory to JS.
- DM: it’s not textures here, only buffers. Textures, we don’t allow mapping anyway.
DM: I’d need more time to think about this.
CW: sounds like on D3D12 side, have to figure out, can we always create heaps with the right properties?
JG: not a fan of having to do that heuristic.
MM: On Metal the driver does more under the hood than on Vk or D3D. We expose the idea that you can have infinite shmem between device and GPU, but that’s not true, and we do shuffling under the hood to give the illusion.
CW: sounds like we should investigate more from the D3D side and figure out what we want apps to do?
RC: yes.
MM: for an impl that doesn’t support mapping, there’s a map/unmap call where impl could do copying. If you didn’t want this restriction, and working on D3D too, those map/unmap calls would be copying to the staging buffer.
RC: yes. Have to preserve the illusion that it’s all shared.
MM: on a D3D app, devs already use these staging buffers naturally, right? Already using that code path. Question is, should that code path be in the browser or the website’s code?
CW: following this reasoning, anything MAP_WRITE and anything but COPY_SRC will be like Metal Managed Memory resource.
MM: that’s one way to do it.
DM: this is all within the assumption that we have map. Buffer views, whether we can map them, is still to be finalized. If there’s a transfer pass then impl would do scheduling of it. It could replace mapping. Idea: since you know where you are in the queue and know what you’re copying into, could write into resource right away. Not written down in full detail yet.

Tessellation #445

MM: bunch of benefits touted. Only some are true. Idea is: instead of tessellating on the CPU and uploading tessellated mesh, have GPU do it on-chip. Memory benefit where you store the high res mesh. Has perf benefit on some GPUs; no fetch from main memory. Amount of tessellation can change frame to frame. Have model which fluidly changes between low and high res.
MM: tried to measure perf on a few GPUs. Tested 3 different GPUs. One of the GPUs had a big perf improvement. Other, about the same. Third, slight perf regression.
MM: perf benefit not really panning out. The other benefits are real. The better memory use and correctness mean that it’s a benefit.
MM: as far as mechanically enabling it: all of the APIs have tessellation support. Not option in Metal? (Not sure.) Optional in Vk, not sure in D3D.
MM: question is what should the shader stages be?
RC: think it’s worth exploring perhaps in a WebGPU 2 or an extension, after we ship WebGPU 1 / MVP / whatever.
MM: has to be an extension because it’s optional in at least one API.
CW: Definitely an extension. I think it’s worth it because a lot of people voice they want tessellation.
MM: did you run a poll?
CW: just people coming on the issue tracker / mailing lists. It would be interesting to ask important customers to know if they want it sooner or farther.
DM: one data point: it’s been one of the most requested features for Vk Portability. Was implemented over Metal at one point. Not covered in investigation, though the investigation is very good.
CW: surprised that the Vulkan-style tessellation can be implemented on Metal.
DM: I didn’t dive deeply into this but seems it mostly works. Myles’ proposal where you prepare a tessellation factor buffer on the GPU doesn’t play well with render passes. Not only you’d have to preallocate large amount of vram for tessellation buffers, you lose locality. Producing completely separated from consuming.
MM: locality sounds like a perf argument. If locality is a problem then we should see a perf hit because of them.
CW: one more thing: NVIDIA has advertised task / mesh shaders. D3D announced they’re going to add them. Intended to replace everything from vertex input to rasterization, including tessellation
MM: totally new pipeline.
KR: I’ve been reading that @Humus has recently done task / mesh shader emulation in compute shaders. He has that in a metaballs demo and it looks quite good. Might be a nice simplification of the mesh / tessellation pipelines.
- KR: I was wrong about this; it’s just a compute shader fallback for the specific demo and not a fully general compute shader emulation of the task and mesh shader pipeline
- https://twitter.com/Humus/status/1187856097815728130?s=20
- http://www.humus.name/index.php?page=News
- http://www.humus.name/index.php?page=News&ID=385
- http://www.humus.name/index.php?page=News&ID=386
- http://www.humus.name/index.php?page=News&ID=387
MM: How did he manage the work creation? Spawning workgroups which I don’t think is possible in compute.
KR: I’m not sure. I’ll find it and link.
CW: It seems at least in Vk portability there’s appetite for tessellation and that translates a bit to WebGPU. I think people will ask for it.
MM: I think it’s probably important we only have one way of doing tessellation, for now. Simplicity and a selfish desire to not write a lot of two tests that accomplish the same thing. It would be nice if we could expose one API that can work on top of Metal tessellation and fancy mesh stuff that Metal doesn’t support.
RC: If we want to be forward looking should we only do the future one?
MM: Metal doesn’t support it. We could also punt it for five years. Don’t think we should do it if it’s only on a few cards.
JD: I agree with RC that it can wait till after 1.0.
RC: I think that’s the earliest we should take it up again.
MM: Okay fine to wait then. I would like to know if we punt it, what the criteria is for when we look at it again so it’s not indefinite?
RC: Reason I have is that we want the basic stuff to work portably everywhere, and then we can talk about other kinds of shaders not present on all hardware.
MM: I don’t think we want to version the WebGPU API. I think most of the proposals I’ve been making are extensions. I agree the “meat and potatoes” are more important, but I don’t think we want to say “version 2” of the API.
CW: I think closer to v1 we’ll have a discussion about “versioning” or not. Regardless, I think this is difficult enough it will take a lot of effort to get right. I’d like to take homework on this to see if mesh / task shaders can emulate on top of regular pipelines.
- KR: I was wrong about this; please see the links above. Was just a compute shader fallback for the specific demo.
MM: Or, if we can emulate tessellation on mesh / task shaders. I believe they are more flexible, but you do need a completely different type of pipeline.
DM: if someone wants to champion it, can look at the MoltenVk code, see how it works, enumerate the differences, etc.
CW: we’re getting into the arena of advanced features & investigations, but the current spec is bare bones.
MM: I’m happy to punt this, just want to know what the gating factor is. If it’s Chrome shipping something in stable then I’d like to know the dates.
CW: It’s more that the WebGPU API is stamped by Tim Berners-Lee.
KN: it becomes a Recommendation.
MM: Recommendations take forever. Most W3C specs that have wide adoption aren’t Recommendations.
KN: think people who have other work to do on WebGPU should focus on that, but anyone free to spend time on it if they want.
CW: agree. This is a group on a voluntary basis anyway.

Sparse resources #455 (maybe?)

[not discussed, ran out of time]

Agenda for next meeting

DM: revisit upload pass / have transfer queue?
CW: to write issues with buffer views.
- #426, clarify buffer usage validation rules
KR: Some process to break down parts of the spec and CTS.
- CW: can start with the result of the spec editor meeting next Monday

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Minutes 2019 11 04

GPU Web 2019-11-04

TL;DR

Tentative agenda

Attendance

Administrative item

#426, clarify buffer usage validation rules

Tessellation #445

Sparse resources #455 (maybe?)

Agenda for next meeting

Clone this wiki locally