
Buffer partitioning (real bad proposal but it was promised) #481

Closed

Kangz opened this issue Oct 25, 2019 · 8 comments


Kangz commented Oct 25, 2019

Edit: this proposal originates in #156, known as "buffer views".

I tried really hard to make some proposal for buffer partitioning that makes sense, but the best I could get is what's below. It has major flaws and I wouldn't want it as part of WebGPU, but I'm posting it because it was promised, and because maybe someone will find inspiration in it for a better idea.

Feel free to not read if you're not interested in my rambling.

Buffer partitioning

This is a proposal for how buffers could be split in such a way that one piece of the buffer is mapped while another is in use by the GPU.

The basic requirement was that UBOs could be ring-allocated in a mappable buffer, such that one part of the buffer is in use as a UBO by the GPU while another is being populated by the application. Doing this is important because a single bind group could then address the whole ringbuffer with a dynamic offset.

Basics

GPUBuffer gains three internal slots:

  • [[is partition]] a boolean that defaults to false.
  • [[partitions]] a list of partitions for this buffer (always empty if [[is partition]] is true).
  • [[parent buffer]] the parent of the buffer (always null if [[is partition]] is false).

Buffer partitioning is done through a method of GPUBuffer:

partial interface GPUBuffer {
    GPUBuffer partition(GPUSize64 offset, GPUSize64 size);
};

buffer.partition(offset, size) creates a new GPUBuffer representing the range [offset, offset + size) of the buffer. A validation error occurs and an error GPUBuffer is returned unless all of these conditions hold:

  • offset and size are aligned to 256.
  • buffer.[[is partition]] is false.
  • buffer is not in the destroyed or mapped state (it may already be partitioned).
  • the range does not intersect the range of any other partition of buffer that's not in the destroyed state.

Upon success, a new partition of the buffer, result, is returned. The steps are roughly:

  • buffer is put in the partitioned state.
  • a new GPUBuffer called result is created.
  • result is inserted in buffer's [[partitions]].
  • result's [[is partition]] is set to true.
  • result's [[parent buffer]] is set to buffer.
  • result is put in the unmapped state.
  • result's data store represents the corresponding range of buffer's data store (hand-waving).
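
For illustration, here's a minimal sketch of the ring-buffer use case with this API. partition() is the proposed method above; the buffer sizes, usage flags, the 2019-era mapWriteAsync() mapping call, and frameUniforms are placeholders assumed for the sake of the example.

// Hypothetical usage of the proposed partition() API (not part of WebGPU).
const ring = device.createBuffer({
    size: 3 * 256,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.MAP_WRITE,
});

// Carve the ring buffer into three 256-byte partitions.
const slots = [0, 256, 512].map((offset) => ring.partition(offset, 256));

// Frame N: map slot 0 for CPU writes while the GPU reads from slot 1.
const mapping = await slots[0].mapWriteAsync(); // era-appropriate mapping API
new Float32Array(mapping).set(frameUniforms);   // frameUniforms: app data
slots[0].unmap();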

Interactions with other WebGPU features

Interaction with destroy()

  • Calling destroy on a partitioned buffer recursively destroys all its partitions.
  • Calling destroy on a partition removes it from its parent's list of partitions, and if the list of partitions becomes empty the parent is put back in the unmapped state.

Interaction with dynamic offset

The intent was that you would be able to make a dynamic UBO on the parent buffer in the bind group, and that somehow choosing an offset that lands in a correct (unmapped) partition at submit time would be allowed.

But this requires checking, at submit time, that dynamic offsets land in a correct partition whenever the dynamic uniform buffer is partitioned. That doesn't look very tractable.

Interaction with GPUQueue.submit()

It is an error to have a partitioned buffer used in command buffers, except as a dynamic UBO / storage buffer. In that case, the dynamic offset must be such that [offset, offset + binding size) is entirely contained in a single unmapped partition.
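
As a sketch of what that rule means in practice, reusing the hypothetical ring and slots from the example above (bindGroupLayout and pass are placeholders; a 256-byte binding size is assumed):

// The bind group targets the whole parent buffer with a dynamic offset.
const bindGroup = device.createBindGroup({
    layout: bindGroupLayout, // assumed layout with hasDynamicOffset: true
    entries: [{ binding: 0, resource: { buffer: ring, offset: 0, size: 256 } }],
});

// Valid at submit time: [256, 512) lies entirely inside slots[1],
// assuming slots[1] is unmapped when the queue submit happens.
pass.setBindGroup(0, bindGroup, [256]);

// Would fail validation at submit time if slots[0] is currently mapped,
// since [0, 256) falls inside a mapped partition:
// pass.setBindGroup(0, bindGroup, [0]);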

Interaction with mapping

And this is where the proposal breaks down even more. The different partitions of a large ring buffer will be mapReadAsynced independently, but then we might want to stitch them together to make bigger partitions that are all in the mapped state. The same goes for splitting a large mapped region into two sub-regions: one that will be used as a UBO on the GPU, and one that stays mapped for the next frame.

kvark commented Oct 25, 2019

Thank you for writing down the proposal!
It's interesting how the confidence level differs from the time we were discussing this :)

Calling destroy on a partition removes it from its parent's list of partitions, and if the list of partitions becomes empty the parent is put back in the unmapped state.

I don't understand why this needs to switch the state to unmapped. Partitioning a buffer is useful even outside of the mapping semantics.

But this requires checking, at submit time, that dynamic offsets land in a correct partition whenever the dynamic uniform buffer is partitioned. That doesn't look very tractable.

I also don't feel good about doing this at submit time... Perhaps we could require that a partition exists by the time setBindGroup is called, so that we can track the partition within render pass recording like we track all the other things?

It is an error to have a partitioned buffer used in command buffers, except as a dynamic UBO / storage buffer.

Why is this limited to UBO/Storage? What if I need to have the vertex data partitioned, for example?

magcius commented Oct 25, 2019

Why is this limited to UBO/Storage? What if I need to have the vertex data partitioned, for example?

You could just call setVertexBuffers with a new buffer, right? My understanding is that this is intended to cut down on the intermediate GPUBindings objects which Vulkan requires. Vulkan has implicit partitioning since you can barrier for different subregions of a buffer.

@devshgraphicsprogramming

The intent was that you would be able to make a dynamic UBO on the parent buffer in the bind group, and that somehow choosing an offset that lands in a correct (unmapped) partition at submit time would be allowed.
But this requires checking, at submit time, that dynamic offsets land in a correct partition whenever the dynamic uniform buffer is partitioned. That doesn't look very tractable.

The original idea in #156 was that you'd pass the pair <partitioned buffer, offset in partitioned buffer> as the dynamic offset, not a raw offset into the parent, ergo no need to search in the parent.

Generally speaking, you'd never use the parent for anything other than bind-group creation (and whatever else needs a permanent immutable native API buffer in its native API state).

I actually wanted a separate API object for the partition, and to never use the buffer object for anything other than creating partitions, so it's clearer. Much like how in Vulkan you create a VkImage but you can't really use it for anything other than creating VkImageViews, which can then be used...
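
For illustration, a sketch of what that Buffer/BufferView split could look like. createView() and the bufferView binding resource are hypothetical names, mirroring GPUTexture.createView(); bindGroupLayout is a placeholder:

// The parent buffer is only a backing allocation, like a VkImage.
const backing = device.createBuffer({
    size: 64 * 1024,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.MAP_WRITE,
});

// Hypothetical: all actual use goes through views, like VkImageView.
const view = backing.createView({ offset: 0, size: 4096 });

// Views, never the parent, would be what gets mapped and bound.
const bindGroup = device.createBindGroup({
    layout: bindGroupLayout,
    entries: [{ binding: 0, resource: { bufferView: view } }],
});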

I also don't feel good about doing this at submit time... Perhaps we could require that a partition exists by the time setBindGroup is called, so that we can track the partition within render pass recording like we track all the other things?

For all I care, you could make it a requirement.

It is an error to have a partitioned buffer used in command buffers, except as a dynamic UBO / storage buffer. In that case, the dynamic offset must be such that [offset, offset + binding size) is entirely contained in a single unmapped partition.

You could just call setVertexBuffers with a new buffer, right? My understanding is that this is intended to cut down on the intermediate GPUBindings objects which Vulkan requires. Vulkan has implicit partitioning since you can barrier for different subregions of a buffer.

#156 intended it to be used for everything, @magcius.
That's why having a relationship of Buffer to BufferPartition mirroring VkImage to VkImageView would have been better (don't use the parent in the API at all, except for bind group creation).

I don't understand why this needs to switch the state to unmapped. Partitioning a buffer is useful even outside of the mapping semantics.

The idea was also that you'd only be able to map the partition, not the parent.

kvark commented Oct 28, 2019

I actually wanted a separate API object for the partition, and to never use the buffer object for anything other than creating partitions, so it's clearer. Much like how in Vulkan you create a VkImage but you can't really use it for anything other than creating VkImageViews, which can then be used...

I find this very elegant: having the duality between Texture/TextureView and Buffer/BufferView with all the strong typing. 👍

Kangz commented Nov 7, 2019

In the last meeting I mentioned that I'd like to un-assign myself from designing a buffer partitioning scheme, because I don't believe in the approach anymore. Here's a dump of my current thoughts.

Problems buffer partitioning is trying to address

Buffer partitioning (or buffer views) came up when we were discussing proposals for buffer mapping. The problem is that the "state" of the buffer, mapped or unmapped, is global to the entire buffer, so you cannot put data in the buffer for follow-up frames if you submitted commands using the buffer this frame.

For static data this doesn't matter, but applications often have dynamic data (uniform, vertex, and index data), and here are two use cases that buffer partitioning was meant to address:

  • Having the dynamic data in CPU-and-GPU-visible memory, using some of it on the GPU while updating the rest on the CPU in a data-race-free manner. The point is to have a single large allocation, and a path to update the data with a single copy (the CPU -> buffer copy).
  • Reducing GC pressure by using a single large uniform buffer in shared memory, so that all uniforms come from different dynamic offsets of the same buffer (otherwise there would need to be a bind group per uniform buffer, which is another dimension of bind groups to create). The other constraint here is to update the uniform buffer with the one-copy path, instead of a pipelined copyBufferToBuffer.

Note that allowing a mix of read-only and read-write usages of the buffer in the same renderpass was a non-goal.

Also, both use cases above are solved if we're OK with pipelining copyBufferToBuffer copies at the start of each frame.
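
For reference, that pipelined-copy alternative looks roughly like the following sketch, written with today's mapAsync() names (uboSize and frameUniforms are placeholders):

// Staging buffer the CPU writes into; the real UBO stays GPU-local.
const staging = device.createBuffer({
    size: uboSize,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
});
const ubo = device.createBuffer({
    size: uboSize,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// Each frame: write uniforms on the CPU, then pipeline a single copy.
await staging.mapAsync(GPUMapMode.WRITE);
new Float32Array(staging.getMappedRange()).set(frameUniforms);
staging.unmap();

const encoder = device.createCommandEncoder();
encoder.copyBufferToBuffer(staging, 0, ubo, 0, uboSize);
device.queue.submit([encoder.finish()]);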

Why I don't believe in this approach

State tracking complexity

Contrary to GPUTexture, which has no API state (apart from destroyed, which is one-way and acts on all subresources), a GPUBuffer can be mapped or unmapped, and for a buffer view proposal to be useful it needs to allow parts of the buffer to be mapped while others are unmapped at the same time. A lot of validation in WebGPU would have to be updated to take into account the state of a buffer and its children in non-obvious ways.

Splitting buffers isn't a complete algebra of operations

If you are able to split ranges of buffers, you want to be able to merge ranges of the same buffer as well, and this adds even more complexity. The reason you want to merge ranges is the following: imagine you have a ringbuffer of uniforms, parts of which are mapped, being mapped, or in use by the GPU. Imagine two neighboring buffer views that are mapped, but you'd like staging space for data that doesn't fit in only one of the two views. Do you split the data? That would require a lot of complexity on the application side. Instead you want to be able to join both views and have a merged staging area.

This adds a bunch more questions: is the merging operation synchronous (that would require the whole buffer to be allocated in shmem for remoting implementations)? Do you see data you previously wrote to the views? Etc.

Garbage

One of the goals of buffer partitioning was to reduce garbage for the use case with a large uniform buffer and dynamic offsets into it. But with buffer splitting and merging we are introducing garbage in a different place, so it doesn't really help. The proper solution would be to pipeline a copyBufferToBuffer.

Bad interactions with other parts of the API

The point was to be able to use dynamic offsets efficiently, but see the discussion above for what the issue is. Passing the buffer view in setBindGroup adds another level of complexity to an already overloaded method, imho.

I actually wanted a separate API object for the partition, and never use the buffer object for anything else than creating partitions, so its more clear. Much how in Vulkan you create a VkImage but you can't really use it for anything else than creating VkImageViews which then can be used...

I find this very elegant: having the duality between Texture/TextureView and Buffer/BufferView with all the strong typing. +1

It's conceptually elegant but not convenient to use, because in most cases you'll create a GPUBuffer just to call createView() on it, which adds verbosity and garbage. Why does BufferView help with strong typing?

@devshgraphicsprogramming

First of all, sorry I can't attend the meetings, they're in a bad timezone for me.

The problem is that the "state" of the buffer, mapped or unmapped, is global to the entire buffer, so you cannot put data in the buffer for follow-up frames if you submitted commands using the buffer this frame.

A GPUBuffer can be mapped or unmapped, and for a buffer view proposal to be useful it needs to allow parts of the buffer to be mapped while others are unmapped at the same time.

The whole idea was that mapped/unmapped would be per-view, not per-buffer, explicitly to ensure that only a view is being tracked, not the whole buffer.

I have absolutely no idea why that somehow got lost in translation over the multiple times I've explained the idea.

Also, both use cases above are solved if we're OK with pipelining copyBufferToBuffer copies at the start of each frame.

This hurts multi-queue and multi-threading; most developers want to step outside of the traditional "one command stream linearly executed and synced to vblank".

The reason you want to merge ranges is the following: imagine you have a ringbuffer of uniforms, parts of which are mapped, being mapped, or in use by the GPU. Imagine two neighboring buffer views that are mapped, but you'd like staging space for data that doesn't fit in only one of the two views. Do you split the data? That would require a lot of complexity on the application side.
This adds a bunch more questions: is the merging operation synchronous (that would require the whole buffer to be allocated in shmem for remoting implementations)? Do you see data you previously wrote to the views? Etc.

Could be solved by:

  • Requiring that the BufferView is not mapped, not in use by the GPU (no uses submitted), and not referenced anywhere else (command buffers) before you DELETE it.
  • "Merging" would actually be deleting several views and re-creating fewer views in their place (the above requirement would not be a problem, because whenever you'd want to merge you'd be in the middle of writing to a mapping with the CPU).
  • Naturally, two views cannot be created over the same area.
  • Deleting and creating views does not affect the data store of the underlying buffer.

My GPU buffer sub-allocator, which has been running in production, conforms to all of the above.
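
Under those rules, "merging" might look like the following sketch (createView() and destroy() on views are hypothetical names, continuing the backing buffer example above):

// Two neighboring views, both idle: not mapped, no submitted GPU uses,
// not referenced by any command buffer.
const a = backing.createView({ offset: 0, size: 4096 });
const b = backing.createView({ offset: 4096, size: 4096 });

// "Merging" is deleting both views and re-creating one spanning view in
// their place; the underlying data store is untouched.
a.destroy();
b.destroy();
const merged = backing.createView({ offset: 0, size: 8192 });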

Unfortunately, in Vulkan and OpenGL the whole buffer would need to be mapped behind the scenes (only D3D12 allows multiple mappings).

Remember that proposal #156 required the user to KNOW EXPLICITLY WHERE THEY WANT TO CREATE THE VIEW, not for WebGPU to magically allocate them an offset given a size.

One of the goals of buffer partitioning was to reduce garbage for the use case with a large uniform buffer and dynamic offsets into it. But with buffer splitting and merging we are introducing garbage in a different place, so it doesn't really help. The proper solution would be to pipeline a copyBufferToBuffer.

Recreating bind groups for every single buffer combination is far more garbage.

Think of the following simple scenario:

  • Per-Frame UBO (updates once per frame)
  • Per-Skinned-Character UBO (updates at 24 Hz)
  • Physics Object Transforms (updates at 60 Hz)

Now, to avoid stalls between the CPU, the API, and the GPU, I need:

  1. a buffer to be mapped
  2. a buffer waiting in the submission queue
  3. a buffer currently being drawn with

Then this gives me 3 buffers for each of the above resources, so 9 in total.

But because the framerate is variable, I need every single combination of the 3 buffers as a bind group, because the update frequencies are independent of each other.

Assume 100 FPS (f=frame, p=physics, s=skinning):

f0 s0 p0 
f1 s0 p0
f2 s0 p1
f0 s0 p1
f1 s1 p2
f2 s1 p2
f0 s1 p2
f0 s1 p3
....
and so on, you'll eventually see every single combo
....

So say hello to 3^3=27 bind groups!

Now think about how much fun it's going to be for an application developer to go back to number theory and work out the common divisors and modular arithmetic to put a lower bound on the number of bind groups they'll need!
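
Spelled out, the combinatorics above mean pre-creating one bind group per (frame, skinning, physics) triple; a sketch, with the layout and the three UBO arrays as placeholders:

// 3 buffers per resource × 3 resources → 3^3 = 27 bind groups up front.
const bindGroups = [];
for (const f of frameUBOs)         // 3 per-frame UBOs
    for (const s of skinningUBOs)  // 3 skinning UBOs
        for (const p of physicsUBOs) // 3 physics UBOs
            bindGroups.push(device.createBindGroup({
                layout: bindGroupLayout,
                entries: [
                    { binding: 0, resource: { buffer: f } },
                    { binding: 1, resource: { buffer: s } },
                    { binding: 2, resource: { buffer: p } },
                ],
            }));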

magcius commented Nov 7, 2019

When I've talked about garbage, I've meant required garbage -- that is, places where I cannot reuse an object, e.g. a Promise, which is create-and-forget and cannot be reset or reused.

If I can save off a BufferView, and continue to reuse it for multiple command encoders, that is not garbage in my eyes.

Kangz commented May 20, 2020

Closing, since the second round of discussions on buffer mapping has finished and resulted in #708 (plus some follow-ups).

Kangz closed this as completed May 20, 2020