Proposals for buffer operations (immediate uploads, buffer mapping) #138

Closed

Kangz opened this issue Nov 29, 2018 · 37 comments

@Kangz
Contributor

Kangz commented Nov 29, 2018

PTAL, this is basically #49 but as an investigation, and with additional alternatives.

Our thoughts on these proposals are the following:

  • Buffer mapping 1: Our preferred solution, but a bit complex
  • Buffer mapping 2: Mapping whole buffers seems too heavyweight for multi-process implementations
  • Immediate uploads 1: Goes very well with Buffer mapping 1
  • Immediate uploads 2: Strongly in favor

Buffer operations

This describes WebGPUBuffer operations that are used by applications to interact directly with the content of the buffer's memory.
The two primitives we need to support are the CPU writing data inside the buffer for use by the GPU (upload) and the CPU reading data produced by the GPU (readback).

Design constraints are:

  • For the portability of the API, prevent data races between the CPU and the GPU.
  • For performance, minimize the number of times the data is copied around.
  • To make the API non-blocking, only allow asynchronous readbacks.
  • For performance on multi-process implementations, make an asynchronous upload path.

Two alternative proposals are described for buffer mapping, WebGPUMappedMemory and whole-buffer mapping.
Two other proposals are described for immediate data uploads that aren't mutually exclusive, one based on mapWriteSync of WebGPUMappedMemory and another using setSubData.

Buffer mapping proposal 1

map[Write|Read]Async and unmap

The way to have the minimal number of copies for upload and readback is to provide a buffer mapping mechanism.
This mechanism has to be asynchronous to ensure the GPU is done using the buffer before the application can look into the ArrayBuffer.
Otherwise, on implementations where the ArrayBuffer is directly a pointer to the buffer memory, data races between the CPU and the GPU could occur.

We want the status of a map operation to act both as a promise and as something that's pollable, as there are advantages to both.
WebGPUMappedMemory is an object that is then-able, meaning that it acts like a Javascript Promise but is pollable at the same time.

The mapping operations for WebGPUBuffer are:

partial interface WebGPUBuffer {
    WebGPUMappedMemory mapWriteAsync(u32 offset, u32 size);
    WebGPUMappedMemory mapReadAsync(u32 offset, u32 size);
};

These operations return new WebGPUMappedMemory objects representing the given range of the buffer for writing or reading.
The results are initialized in the "pending" state and transition at Javascript task boundary to the "available" state when the implementation can determine the GPU is done using the buffer.
Calling mapReadAsync or mapWriteAsync puts the buffer in the mapped state.
No operations are allowed on a buffer in that state except additional calls to mapReadAsync or mapWriteAsync and calls to unmap.
In particular a mapped buffer cannot be used in a WebGPUCommandBuffer given to WebGPUQueue.submit.
The following must be true or a validation error occurs for mapWriteAsync (resp. mapReadAsync):

  • The buffer must have been created with the WebGPUBufferUsage.MAP_WRITE (resp. WebGPUBufferUsage.MAP_READ) usage.
  • offset + size must not overflow and must be at most the size of the buffer.
  • The [offset, offset + size) range must not intersect the range of another WebGPUMappedMemory on the same buffer which hasn't been previously invalidated.
  • The buffer must not have been destroyed.

Then a mapped buffer can be unmapped with:

partial interface WebGPUBuffer {
    void unmap();
};

This operation invalidates all the WebGPUMappedMemory created from the buffer and puts the buffer in the unmapped state.
The buffer must be in the mapped state otherwise a validation error occurs when unmap is called.

WebGPUMappedMemory

WebGPUMappedMemory is an object representing a mapped region of a buffer that's both pollable and promise-like.

It can be in one of three states: pending, available and invalidated.

The pollable interface is:

partial interface WebGPUMappedMemory {
    bool isPending();
    ArrayBuffer getPointer();
};

isPending returns true if the object is in the pending state, false otherwise.
getPointer returns an ArrayBuffer representing the buffer data if the object is in the available state, null otherwise.

WebGPUMappedMemory is also then-able, meaning that it acts like a Javascript Promise:

partial interface WebGPUMappedMemory {
    Promise then(WebGPUMappedMemorySuccessCallback success,
                 optional WebGPUMappedMemoryErrorCallback error);
};

This acts like a Promise<ArrayBuffer>.then that is resolved on the Javascript task boundary in which the implementation detects the GPU is done with the buffer.
On that boundary:

  • The WebGPUMappedMemory goes in the available state.
  • If the WebGPUMappedMemory was created via WebGPUBuffer.mapWriteAsync, its content is cleared to 0.
  • success is called with the content of the memory as an argument.

If success hasn't been called when the WebGPUMappedMemory gets invalidated (meaning the object is still in the pending state), error is called instead.
When WebGPUMappedMemory goes from the available state to the invalidated state, the ArrayBuffer for its content gets neutered.
The return value of then acts like the return value of Promise.then.

The ArrayBuffer of a WebGPUMappedMemory created from a mapWriteAsync is where the application should write the data, and its content is made available to the buffer when the WebGPUMappedMemory is invalidated (i.e. WebGPUBuffer.unmap is called).
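
For illustration, here is a usage sketch of this proposal (not part of the original text; buffer, vertexData and the sizes are hypothetical):

const mapped = buffer.mapWriteAsync(0, 256);

// Promise-like usage: the callback runs at the task boundary where the
// implementation knows the GPU is done with the range.
mapped.then(arrayBuffer => {
    new Float32Array(arrayBuffer).set(vertexData); // write into the mapped range
    buffer.unmap(); // invalidates the mapping, making the data visible to the GPU
}, () => {
    // the mapping was invalidated (e.g. unmapped or destroyed) before becoming available
});

// Pollable usage, e.g. checked once per frame:
if (!mapped.isPending()) {
    const arrayBuffer = mapped.getPointer(); // null if the mapping was invalidated
    if (arrayBuffer !== null) {
        new Float32Array(arrayBuffer).set(vertexData);
        buffer.unmap();
    }
}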

Buffer mapping proposal 2

In this proposal a buffer is always mapped as a whole, as an asynchronous operation.
Mapping for reading (resp. writing) is done using WebGPUBuffer.mapRead (resp. WebGPUBuffer.mapWrite).
The mapping calls put the buffer in the "mapped" state.
A Javascript error is thrown under these conditions:

  • The buffer hasn't been created with the MAP_READ (resp. MAP_WRITE) usage.
  • The buffer isn't in the unmapped state.
  • The buffer has been destroyed.

partial interface WebGPUBuffer {
    void mapRead();
    void mapWrite();
};

Mapping is an asynchronous operation, and after its resolution the buffer's mapping member will be updated to represent the content of the buffer (resp. filled with zeros and ready to receive data from the application).
Resolution can only happen at a Javascript task boundary, and after the implementation has determined it is safe to give the CPU access to the buffer.
Resolution is guaranteed to complete before (or at the same time as) all previously enqueued operations finish executing (as can be observed with WebGPUFence).

partial interface WebGPUBuffer {
    readonly attribute ArrayBuffer? mapping;
};

The buffer is unmapped with a call to unmap, which puts it in the unmapped state.
It is an error to call unmap while in the unmapped state.
In the mapped state it is an error to perform operations on the buffer (such as setSubData or enqueuing commands using the buffer).

partial interface WebGPUBuffer {
    void unmap();
};
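
A usage sketch of this proposal (not from the original text; it assumes mapping is null until the asynchronous map resolves, and consumeData is hypothetical):

buffer.mapRead(); // throws if the usage or current state is wrong

function poll() {
    if (buffer.mapping !== null) {
        consumeData(new Uint8Array(buffer.mapping)); // read the whole buffer's content
        buffer.unmap(); // back to the unmapped state; mapping becomes null again
    } else {
        requestAnimationFrame(poll); // resolution happens at a Javascript task boundary
    }
}
poll();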

Immediate data upload proposal 1

When mapping for writing, the application doesn't see GPU state since the content is cleared to 0.
This means WebGPU can expose a mapWriteSync primitive that behaves exactly like mapWriteAsync except that the returned WebGPUMappedMemory object starts in the available state.

partial interface WebGPUBuffer {
    WebGPUMappedMemory mapWriteSync(u32 offset, u32 size);
};
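
A sketch of how this differs from mapWriteAsync (buffer and smallData are hypothetical):

// No waiting: the returned WebGPUMappedMemory starts in the available state.
const mapped = buffer.mapWriteSync(0, 64);
new Float32Array(mapped.getPointer()).set(smallData);
buffer.unmap(); // hands the staging data over to the buffer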

Immediate data upload proposal 2

Buffer mapping is the path with the fewest copies, but it is often useful to upload data to a buffer right now, if only for debugging.
A WebGPUBuffer operation is provided that takes an ArrayBuffer and copies its content at an offset in the buffer.

partial interface WebGPUBuffer {
    void setSubData(ArrayBuffer data, u32 offset);
};

This operation acts as if it was done after all previous "device-level" commands and before all subsequent "device-level" commands.
"Device level" commands are all commands not buffered in a WebGPUCommandBuffer, and include WebGPUQueue.submit.
The content of data is only read during the call and can be modified by the application afterwards.
The following must be true or a validation error occurs:

  • The buffer must have been created with the WebGPUBufferUsage.TRANSFER_DST usage flag.
  • offset + data.length must not overflow and be at most the size of the buffer.
  • The buffer must not be currently mapped.
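
A usage sketch of setSubData as described above (the buffer is assumed to have been created with TRANSFER_DST):

// Upload 16 bytes (4 floats) at byte offset 16.
const data = new Float32Array([0, 1, 2, 3]);
buffer.setSubData(data.buffer, 16);

// The content is read during the call, so the source can be reused immediately.
data[0] = 42; // does not affect what was uploaded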

Unused designs

Persistently mapped buffer

Persistently mapped buffers are when the result of mapping the buffer can be kept by the application while the buffer is in use by the GPU.
We didn't find a way to have persistently mapped buffers and at the same time keep things data race free between the CPU and GPU.
Being data race free would be possible if ArrayBuffer could be unneutered, but this is not the case.

Promise readback();

This didn't have a pollable interface and forced an extra buffer-to-buffer copy to occur if the GPU execution could be resumed immediately.

Dawn's MapReadAsync(callback);

Not a pollable interface.

Issues

GC discoverability

It isn't clear yet what happens when a buffer gets garbage collected while it is mapped.
The simple answer is that the WebGPUMappedMemory objects get invalidated but that would allow the application to discover when the GC runs.

GC pressure

The WebGPUMappedMemory design makes each mapped region create two garbage collected objects. This could lead to some GC pressure.

Side effects between mapped memory regions

What happens when WebGPUMappedMemory objects' regions in the buffer overlap?
Are writes from one visible from the other?
If they are, maybe WebGPUMappedMemory.getPointer should return an ArrayBufferView instead.

Interactions with workers

Can a buffer be mapped in multiple different workers?
If that's the case, the pointer should be represented with a SharedArrayBuffer.

@devshgraphicsprogramming

Proposal 1

In particular a mapped buffer cannot be used in a WebGPUCommandBuffer given to WebGPUQueue.submit.

This is bad: you've just destroyed the main advantage of Vulkan and OpenGL with ARB_buffer_storage (core in 4.4+) over the older APIs.

You need to be able to use persistently mapped buffers as the developer.

If the WebGPUMappedMemory was created via WebGPUBuffer.mapWriteAsync, its content is cleared to 0.

Why? The buffer is the app's own buffer, what's the security concern here?
Remember that on desktop devices you will be writing across the PCIe bus, which has very very limited bandwidth and you will reduce it by half!

Proposal 2

Buffer mapping 2: Mapping whole buffers seems too heavyweight for multi-process implementations

The problem with this one is that it's unimplementable on Vulkan.

The reason is that it's not the buffers that get mapped, it's the bound memory.

Also, only one subrange in a memory object can be mapped; you cannot map two subranges (even non-intersecting ones) like you can in DirectX.

So if you do not know ahead of time which ranges you are going to be mapping and reading/writing, you'd need to unmap and map a bigger range every single time your read/write APIs request a range outside of the one mapped by Vulkan. The problem with this is that it would require an implicit synchronisation that waits for all previous uses of the mapped memory to finish, then unmaps, maps the bigger range, and only then continues with the reading/writing.

In this proposal a buffer is always mapped as a whole as an asynchronous operation.

I kind of like this approach, but dislike the design which allows for simultaneous mapping of the buffer asynchronously.
Maybe mapping the buffer from the start at creation time would be more useful.

Immediate Upload 1

Making a synchronous buffer write operation is silly: you will have to wait for 1 or more (usually 3) frames to complete before the buffer becomes available for writing (i.e. not used by the GPU), and you will starve the GPU of work, so it will idle while you're manipulating the buffer.

Immediate Upload 2

I'm not a fan of API-staged uploads in the style of glBufferSubData, simply because they are always an order of magnitude slower than persistent mapping, involve an extra copy (so you can modify the input memory argument after the call), and present an implementation challenge (you need to stage and buffer up the uploads), especially for large buffers.

This is why Vulkan only allows direct updates of 64kb or less (so the ringbuffer used for upload doesn't grow in an unwieldy manner).

Secondly, how are you going to do this outside of a command buffer?
I thought you disliked data races?

General Problems

It isn't clear yet what happens when a buffer gets garbage collected while it is mapped.

We already have an implementation of that in our engine: a fence is placed (multi queue -> many fences are placed) after the last use of the buffer, then this fence is paired with a functor that deletes the object.
All of this goes on a list when the user thinks they are "deleting" an API object; from time to time this list of event+functor pairs gets checked for event signalling, and the signalled ones get their functors executed and are removed from the list.

The key to this approach is to try and coalesce the events (use one fence for many objects) and defer the checking until you are sure you have the time to do it, i.e. check the fences just before new objects are being created, but DON'T check every frame on swap or other operations.

The WebGPUMappedMemory design makes each mapped region create two garbage collected objects. This could lead to some GC pressure.

YES

You can only place around 200k fences per second in OpenGL (my own test) on a laptop Intel i7; most of your approaches would require 1 fence per buffer update operation, so you'd quickly run out of CPU time just on the graphics driver's part of the work.

And this is another reason why you want to support large mapped ranges, persistent mapped buffers and possibly zero-copy approaches.

Technically speaking, for any buffer that does not have the INDEX_BUFFER or INDIRECT_DRAW usage hint, you could allow the user to write directly to GPU memory with no intermediate copies.
This is because it doesn't matter if the user makes a mess of the buffer or has a data race.
The special case is the two mentioned usage hints, where you actually have to validate the written data for security.

What happens when WebGPUMappedMemory objects' regions in the buffer overlap?

This seems to be a weird exception to your "no data races" rule: somehow you allow cpu2cpu core races, but when it's gpu2gpu or cpu2gpu you say "noo gpu's are scary".

While, yes, mapped ranges can be represented by any JS object you like that makes GC happy, you need to issue the native API map() operation only once, from only one thread, on a non-overlapping range, in both Vulkan and DX12.

@Kangz
Contributor Author

Kangz commented Nov 30, 2018

Thanks for your comments, before the detailed answer, here's some context:

  • Some WebGPU implementations will have the GPU driver in a different process and will use IPC and shared memory to transfer commands to the "GPU process". A strong constraint for WebGPU is that the API should behave the same (at least in terms of correctness) between single-process and multi-process implementations.
  • For security reasons we need to avoid races between the CPU and the GPU, both because the CPU could potentially see uninitialized data via racy calls, and because we should avoid creating widely available high-precision GPU timers (though developers will certainly have an option to have high-precision timestamp queries).
  • People in this group know about graphics APIs and will usually have thought about how to implement things before making a proposal.

Mapping proposal 1

You need to be able to use persistently mapped buffers as the developer.

Persistently mapped buffers are not implementable in the general case for multi-process implementations so we cannot expose them at the API level. Single-process WebGPU implementations can (and are likely to) use persistently mapped buffers internally.

Why? The buffer is the app's own buffer, whats the security concern here?

In multi-process implementations, mapWriteSync could just return a pointer to shared memory between the two processes. For security reasons this memory will be cleared to 0, and for consistency, the same should happen on single-process implementations.

Remember that on desktop devices you will be writing across the PCIe bus, which has very very limited bandwidth and you will reduce it by half!

PCIe bandwidth is huge (X GB/s) so I'm not too concerned. If this becomes a real issue, we could be clearing with compute shaders or find other solutions.

Mapping proposal 2

The problem with this one is that its unimplementable on Vulkan. [...]

When a buffer is mapped on single-process implementations using Vulkan (which might actually be none of them), the whole memory can be mapped persistently but only a partial view of it given to Javascript. The browser will have the opportunity to do data races, but will provide Javascript only with views that are data-race free.

Maybe mapping the buffer from the start at creation time would be more useful.

That's an interesting thought, we could have a fast-path saying that if the buffer has never been used, the mapping is instantaneous.

Immediate upload proposal 1

Making a synchronous buffer write operation is silly, you will have to wait for 1 or more (usually 3) frames to complete before the buffer becomes available (not used by GPU) for writing, and you will starve the GPU of work to do and it will idle while you're manipulating the buffer.

The name is misleading: the operation would be instantaneous, it just doesn't give you a pointer to the real buffer but to staging memory instead. It's just like setSubData but with one less copy. Though we could imagine optimized browser implementations that are able to detect the buffer isn't in use and give a pointer to the buffer directly when possible.

Immediate upload proposal 2

I'm not a fan of API staged uploads in style of glBufferSubData

Agreed, though I feel such an API is important for ease-of-use when starting a project (or for beginners) and when adding random debugging code to various places in your code. It's just "put this data on the GPU even if it is a bit slow".

This is why Vulkan only allows directs updates of 64kb or less

It's also because NVIDIA hardware (and probably others) has a fast path to inline the content of a buffer update inside the command stream itself.

Secondly how are you going to do this outside of a command buffer?

It is as if it were in a command buffer and submitted immediately (though in practice I expect implementations to write the commands and wait until the next application submit). There are no data races.

General Problems

We already have an implementation of that in our engine, a fence is placed [...]

This isn't what I meant, our implementation already has what you described including "fence coalescing". This is about Javascript GC and what happens on the application's side when the buffer is GCd. In general GC shouldn't be visible to Javascript and we need to make sure WebGPU doesn't provide "GC discoverability".

You can only place around 200k fences per second in OpenGL (my own test) on a laptop Intel i7 [...]

Likewise this is about Javascript GC, and implementations are expected to be optimized using "fence coalescing" or other mechanisms. Ours is already.

Technically speaking any buffer that does not have the INDEX_BUFFER or INDIRECT_DRAW usage hint, you could allow the user to write directly to gpu memory with no intermediate copies.

Even index and indirect buffers could skip validation with backend API support for robust buffer access. We know persistently mapped buffers are something native developers want, but our constraints mean we can't provide them directly. We are striving to provide the same usability and performance given our constraints, but we can't just hand out pointers to GPU-visible memory with no checks.

This seems to be a weird exception to your "no data races" rule: somehow you allow cpu2cpu core races, but when it's gpu2gpu or cpu2gpu you say "noo gpu's are scary".

While, yes, mapped ranges can be represented by any JS object you like that makes GC happy, you need to issue the native API map() operation only once, from only one thread, on a non-overlapping range, in both Vulkan and DX12.

I don't understand what your point is.

@devshgraphicsprogramming

I will address all your points soon.

But can you fill me in on:

Persistently mapped buffers are not implementable in the general case for multi-process implementations so we cannot expose them at the API level.

What exactly is the problem here?

@Kangz
Contributor Author

Kangz commented Dec 3, 2018

Persistently mapped buffers are not implementable in the general case for multi-process implementations so we cannot expose them at the API level.

What exactly is the problem here?

Imagine the GPU driver is in process A and the application in process B. In all APIs, except maybe Vulkan with difficult-to-use extensions, mapping a buffer in A will give you a pointer that's only valid in A. It isn't possible to transfer the memory region to B.

We could have a memory allocation in B that mirrors the mapped pointer. However the point of persistently mapped buffers is that CPU and GPU accesses to the data happen concurrently. This is not possible because 1) we don't know when the GPU writes data that needs to be forwarded from A to B, 2) we don't know when the CPU writes data that needs to be forwarded from B to A. Also, races.

@grorg
Contributor

grorg commented Dec 3, 2018

bool isPending();

Why is this a function rather than:

readonly attribute bool isPending;

?

@grorg
Contributor

grorg commented Dec 3, 2018

This proposal was discussed at the 3 Dec 2018 Teleconference

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 3, 2018

All APIs except maybe Vulkan with difficult to use extensions

I wouldn't call the memory import/export extensions difficult to use.

Also ekhm:
https://docs.microsoft.com/en-us/windows/desktop/direct3d12/shared-heaps
Direct3D seems to support some sharing of memory between processes.

So maybe Metal is lacking?

@devshgraphicsprogramming

Update: It does indeed appear that Apple Metal is severely limited
KhronosGroup/MoltenVK#316

However, macOS and iOS only make up <20% of the total browsing OS market share, while the other 70% is Windows, Android and Linux, all of which should support Vulkan (or D3D12) and hence sharing device/driver memory across process boundaries.

So it would make sense for persistently mapped buffers to be a ubiquitous extension to webGPU until (if at all) Metal provides the same functionality or webGPU is a single-process implementation on macOS & iOS.

You really don't want to drag the performance and usability down without a way out for developers just because of one ecosystem.

Sure the default can be "managed" buffers but developers expect a fast-path to be made available.

@Kangz
Contributor Author

Kangz commented Dec 6, 2018

Well, in this case that settles it for WebGPU core. An extension could expose persistently mapped buffers, but for browsers to accept to implement it I think it is safe to assume they'll require it to be deterministic.

Note that in both buffer mapping proposals the buffers aren't "managed" in the Metal sense of having a staging and a GPU-local copy. Instead it is possible for browsers to implement them with persistently mapped buffers but give pointer access only during specific times.

@devshgraphicsprogramming

I would like to note that RenderDoc provides just the functionality that you're claiming is impossible to provide with webGPU. RenderDoc is an intermediate library that intercepts all Vulkan/OpenGL/DirectX calls and hence hijacks any map and unmap calls, presenting a pointer to its own memory instead.
It somehow manages to work with persistently mapped buffers both with and without the COHERENT flag. Naturally, persistently mapped buffers without the COHERENT flag (the ones where you manually flush/invalidate the sub-ranges you want to be visible) work much faster (near native) with RenderDoc, as it doesn't have to examine the whole mapped buffer range for changes before every draw, dispatch or transfer command.

There are still things I do not like about the proposals, such as:

  1. Non-mapped buffer updates not being a command that is recorded in a command buffer.
  2. Excessive clearing of buffers on every map (it's not needed).
  3. You can still pretend to have a persistently-mapped-buffer-like API, as if the COHERENT flag was not available (where you need to flush and invalidate caches manually; this would actually debug really fast and nicely in RenderDoc).
  4. Just because you treat a buffer update (without mapping) like a separate command buffer containing only one command will not absolve you of data races; you need pipeline barriers and synchronisation between command buffers for that. Also it's just a bad idea: if buffer updates were buffered up normally as commands (like in all other sane GPU APIs) then at least they would have some implicitly well-defined order of execution (but not visibility or availability, obviously) against the other commands and renderpasses in the same command buffer.

In general, expecting a user to allocate 3x the memory they need and fence their buffer subranges so as not to overstep their own in-flight data would alleviate most of your performance, usability and sanity issues.
If you just allowed the app to data race itself and, like Vulkan, only allowed it to map a single range per buffer, then you'd have no portability problem of overlapping mapped ranges being modified simultaneously and undefined data being present in the actual GPU buffer.

@devshgraphicsprogramming

I also read the minutes of the Dec 3 meeting and here are my 2 cents.

That’s why there are 2 proposals: both buffer mapping and immediate data upload. (Either setSubData or mapWriteSync.)

Both approaches are valid and complementary: we should be able to map and read/write (with appropriate invalidation/flush) with an almost "immediate" result, as well as buffer up uploads in the command buffer which are then queued up.

These two flags are separate because they help avoid data races. Something I missed: we can probably limit buffers to be either both MAP_READ or MAP_WRITE but not both at the same time.

These flags are completely not for that purpose; basically the spec reserves the right for a driver implementation to hand you garbage (it's actually an error in Vulkan) if you read with the CPU from a buffer without MAP_READ, and to not actually make the CPU writes available and visible to the GPU if MAP_WRITE was not specified.

I.e. if you create a non-coherent MAP_READ|MAP_WRITE buffer in Vulkan (there actually is no such implementation yet, everything has coherent heaps only) then you can map it with any of these flags, but to get valid data:

  1. When you read with CPU you must first vkInvalidate the ranges you are going to read after the last update of the ranges by the GPU
  2. When you write with CPU you must vkFlush the ranges you wrote before the first use of the ranges by the GPU

CW: when you have a vertex buffer you want to be device local. In GPU RAM, and only GPU RAM. When you’re going to write something once from the CPU (or not at all), and read or write it many times from the GPU. Then you want that allocated in GPU RAM, and you don’t put the map flags.

There are so many other considerations that go into this; you should not assume that host-visible means host-local... there are plenty of GPU architectures (even discrete) and Vulkan drivers where host-visible mappable memory is actually device local.

WebGPU should adopt the vulkan approach of exposing numbered driver memory-heaps corresponding directly to memory type flags (device local, map for read, map for write).

Sounds like we would potentially retain that ability with this API?
CW: what does moving buffers between heaps mean?
MM: not exactly sure: basically, residency and eviction. We can’t pin pages on the GPU at any particular location for an arbitrary period of time. But not sure that this affects this API.
CW: we thought about this for Dawn a bit. All these APIs should be OK. Can do magic behind the app’s back. Biggest change for migrating buffers: always create buffers with TRANSFER_SOURCE and _DEST flags.
Other way: make sure the buffer can be used by the app no matter where it is. Still have freedom to move things around.

This is something only a middleware engine such as Unity/Unreal should do; you should absolutely not move buffers between heaps for the developer working directly with webGPU.
We'd end up with OpenGL usage-hint-like behaviour that drives developers crazy (such as the NVidia driver deciding where and how to move your buffer on a whim).

having buffers be able to move around the GPU means you have to record at command submit time and not command record time.

And now this completely destroys the point of command buffers and the next-gen APIs' lower-CPU-usage promise.

CW: assuming no multi-queue. If multiple queues, things become more complicated. In a single-queue world, all these operations become visible and are ordered the same way submits are between each other.

I do not understand why you've neutered the webGPU queue system so far; even with a single queue in Vulkan the submits can execute out-of-order.
You're going to insert pipeline barriers and even syncs between every single queue submit in your Vulkan implementation; you will kill everything from DMA async transfers, streaming and triangle visibility culling all the way through to Async Compute.

@Kangz
Contributor Author

Kangz commented Dec 6, 2018

I would like to note that RenderDoc provides just the functionality that you're claiming is impossible to provide with webGPU.

RenderDoc strongly suggests doing explicit flushes and invalidates because otherwise it has to essentially snapshot the state of mapped buffers on every queue submit, which is not acceptable for WebGPU.

  1. Non-mapped buffer updates not being a command that is recorded in a command buffer

Implementations are free to record a vkCmdUpdateBuffer in a staging command buffer to implement setSubData. Having the equivalent of vkCmdUpdateBuffer would require implementations to store data inside the WebGPUCommandBuffer at least on Metal and maybe on D3D12.

  2. Excessive clearing of buffers on every map (it's not needed)

Agreed that this is suboptimal if you only care about perf, but because we are making a Web API, we have to care about strong portability too. (Vulkan is not portable in the same sense as the Web platform is).

  3. You can still pretend to have a persistently-mapped-buffer-like API, as if the COHERENT flag was not available (where you need to flush and invalidate caches manually; this would actually debug really fast and nicely in RenderDoc)

If the application wants that semantic they can implement it on top of what proposal 1 provides. To provide this at the platform level would require it to either be like Metal's "managed" buffers, or will have data races which are not acceptable.

  4. Just because you treat a buffer update (without mapping) like a separate command buffer containing only one command will not absolve you of data races

It does because WebGPU has implicit synchronization.

In general, expecting a user to allocate 3x the memory they need, fence their buffer subranges to not overstep their own data that is in-flight would alleviate most of your performance, usability and sanity issues.

You can implement that using proposal 1 except that instead of subranges of a single large buffer you have subranges of several buffers.

Both approaches are valid and complementary: we should be able to map and read/write (with appropriate invalidation/flush) with an almost "immediate" result, as well as buffer up uploads in the command buffer which are then queued up.

The two things you are describing are not the proposals. This is what the proposals are:

  • mapWriteSync doesn't need invalidates and flushes because the WebGPU implementation takes care of them. Think of WebGPUMappedMemory as an RAII helper you would use to prevent racy accesses.
  • setSubData is not buffered.

These flags are completely not for that purpose; basically the spec reserves the right for a driver implementation to hand you garbage (it's actually an error in Vulkan) if you read with the CPU from a buffer without MAP_READ, and to not actually make the CPU writes available and visible to the GPU if MAP_WRITE was not specified.

The part you quoted was talking about WebGPU's "MAP_READ" and "MAP_WRITE" flags.

There are so many other considerations that go into this; you should not assume that host-visible means host-local... there are plenty of GPU architectures (even discrete) and Vulkan drivers where host-visible mappable memory is actually device local.

I am aware, and that's why the quoted sentences say "device-local".

WebGPU should adopt the vulkan approach of exposing numbered driver memory-heaps corresponding directly to memory type flags (device local, map for read, map for write).

Exposing memory heaps leads to non-portability (and fingerprinting). Exposing heaps also means you need to expose the equivalent of vkGetBufferMemoryRequirements/vkGetImageMemoryRequirements, which makes the API non-portable and would introduce a synchronous IPC in multi-process implementations. Instead in WebGPU the implementation will make the decision of which heap to use based on the presence of MAP_ flags.

This is something only a middleware engine such as Unity/Unreal should do; you should absolutely not move buffers between heaps for the developer working directly with webGPU.
We'd end up with OpenGL usage-hint-like behaviour that drives developers crazy (such as the NVidia driver deciding where and how to move your buffer on a whim).

Think of it more like WDDM which migrates resources to CPU memory when there is memory pressure. We are not interested in optimizing behind the application's back like OpenGL and D3D11 drivers.

And now this completely destroys the point of command buffers and the next-gen APIs' lower-CPU-usage promise.

The good thing is that this is an implementation detail and not mandated by the spec.

I do not understand why you've neutered the webGPU queue system so far; even with a single queue in Vulkan the submits can execute out-of-order.

Do you know of a single driver where that happens? I looked at all open-source drivers and they never do this. Driver engineers we talked to also confirmed that.

You're going to insert pipeline barriers and even syncs between every single queue submit in your Vulkan implementation; you will kill everything from DMA async transfers, streaming and triangle visibility culling all the way through to Async Compute.

Metal uses this model and has good performance.

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 6, 2018

RenderDoc strongly suggests doing explicit flushes and invalidates because otherwise it has to essentially snapshot the state of mapped buffers on every queue submit, which is not acceptable for WebGPU.

Yes, I know, so I was advocating providing persistently mapped buffers but without a COHERENT-like flag/behaviour.

Having the equivalent of vkCmdUpdateBuffer would require implementations to store data inside the WebGPUCommandBuffer at least on Metal and maybe on D3D12

It would require it on Vulkan as well if you placed no limits on the update size.

Agreed that this is suboptimal if you only care about perf, but because we are making a Web API, we have to care about strong portability too. (Vulkan is not portable in the same sense as the Web platform is).

Please argue the benefit of this: why can't the app just see its own recycled, previously "mapped" staging arrays whenever possible?

If the application wants that semantic they can implement it on top of what proposal 1 provides. To provide this at the platform level would require it to either be like Metal's "managed" buffers, or will have data races which are not acceptable.

The application developer wants that semantic for performance, not extra work.
Metal-like "managed buffers" are the lesser of many evils, so I'd use that.
Data races will happen anyway, and the various proposals in webGPU are full of holes that still make them possible; especially overlooked are the GPU2GPU races... all I see you protecting against are CPU2CPU and CPU2GPU data races.
The user should already be familiar with parallel programming and know how to prevent data races by synchronisation and manual flushing.

It does because WebGPU has implicit synchronization.
You can implement that using proposal 1 except that instead of subranges of a single large buffer you have subranges of several buffers.
setSubData is not buffered.

Yes, I've collated the different pieces of data as well as the minutes, and tackled that in my second/last post. The level of implicit synchronization is excessive.
Unbuffered/unqueued buffer copies will require synchronisation that will starve and stall the GPU if the same resource is being used and updated, and we're back to the messy OpenGL 2.1 and DirectX 9 techniques of buffer round-robining and orphaning.
We've had that, we hated it, and it's suboptimal, wasting especially the CPU's time.

https://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-AsynchronousBufferTransfers.pdf

http://www.java-gaming.org/index.php?PHPSESSID=lnmc92bhung9t2sb1s0d32lnk6&topic=32169.msg300554#msg300554

The difference between the different buffer update techniques can be as much as 4x!

The part you quoted was talking about WebGPU's "MAP_READ" and "MAP_WRITE" flags.

Well, they're called the same in all APIs; I knew it meant webGPU flags, but the way they were described in the meeting gave them a different meaning to Vulkan's, OpenGL's and DirectX's mapping flags.
And it's reasonable to expect the same semantics from the similarly named flags as in the other 3 or 4 graphics APIs.

You should also not make the flags mutually exclusive; it should be possible to create a read&write buffer and map it in that mode.
It would be possible to implement if you agreed to provide a manual flush and invalidate.

Instead in WebGPU the implementation will make the decision of which heap to use based on the presence of MAP_ flags.

I can accept the fingerprinting and IPC synch argument, however...
You need far more than MAP_ flags to choose the heap to use. Do you want caching, do you want device local or host local for your host-visible memory?

Do you know of a single driver where that happens? I looked at all open-source drivers and they never do this. Driver engineers we talked to also confirmed that.

The Vulkan spec explicitly leaves room for that; the fact that it's not been taken advantage of yet is slightly orthogonal.
Also as we can see from this AMD capture, the commands within a command buffer execute overlapped (not exactly out-of-order).
https://mynameismjp.files.wordpress.com/2018/01/rgp_bindlessdeferred.png?w=1024
https://mynameismjp.files.wordpress.com/2018/01/pix_timeline.png

Also, have the same engineers confirmed that render sub-passes within a single command buffer in Vulkan cannot happen out of order if no dependencies are specified?

Metal uses this model and has good performance.

I think you're introducing even more barriers, implicit sync, etc. than even Metal has.
Also, what does it have good performance in comparison to? OpenGL?
Please define "good performance", because I sincerely doubt it has the same performance as Vulkan on AMD hardware, especially when streaming and async compute are involved.

@Kangz
Contributor Author

Kangz commented Dec 7, 2018

Please argue the benefit of this: why can't the app just see its own recycled, previously "mapped" staging arrays whenever possible?

That would work if we force "MAP_WRITE" buffers to only have read-only usages. It would mean that multi-process browsers will have to keep the data in the tab's process, maybe in shared memory. This sounds pretty good.

The application developer wants that semantic for performance, not extra work.
Metal like "managed buffers" are the lesser of many evils, so I'd use that.

Having managed buffers always requires at least one copy. The whole point of approach 1 is that it can give you zero-copy when in a single-process or with Vulkan and D3D12 cross-process mapping. Am I missing something?

Data races will happen anyway, and the various proposals in webGPU are full of holes that still make them possible; especially overlooked are the GPU2GPU races... all I see you protecting against are CPU2CPU and CPU2GPU data races.

GPU2GPU races in WebGPU are only through UAVs in a single dispatch or a single render-pass (and compute shader shared memory). That's only a small number of holes imho.

The user should already be familiar with parallel programming and know how to prevent data races by synchronisation and manual flushing.

We can't assume that of developers currently using WebGL, who are an important target for WebGPU. Native developers get barriers and synchronization wrong too, and driver engineers told us they sometimes have to go to game companies and fix their code themselves.

Unbuffered/unqueued buffer copies will require synchronisation that will starve and stall the GPU if the same resource is being used and updated

I'm confused, the same would happen for Vulkan developers using vkCmdUpdateBuffer with proper synchronization so I don't see what the issue is for WebGPU.

Also as we can see from this AMD capture, the commands within a command buffer execute overlapped (not exactly out-of-order).

That's because of warp occupancy, not because of Vulkan allowing submits to overlap: a renderpass has to be fully contained in a command buffer (as its beginning and end, but not necessarily the content) so the parallelism we see in the image is not thanks to that part of the Vulkan spec.

Also, have the same engineers confirmed that render sub-passes within a single command buffer in Vulkan cannot happen out of order if no dependencies are specified?

WebGPU doesn't have multi-subpass renderpasses. I don't know if drivers take advantage of renderpasses being pre-compiled to optimize and allow parallelism.

@grorg
Contributor

grorg commented Dec 10, 2018

This was discussed in the 10 Dec 2018 meeting

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 10, 2018

I will address all of your points soon @Kangz but for now let me deal with the most important ones.

I'm confused, the same would happen for Vulkan developers using vkCmdUpdateBuffer with proper synchronization so I don't see what the issue is for WebGPU.

Absolutely not: vkCmdUpdateBuffer bakes the updates into the command buffer, which defers them until the GPU is ready to proceed, having finished all previous work. The driver handles this and (due to multiple queues and deferring the update) does not stall itself, other apps, or your application.

EXTRA NOTE: Yes, vkCmdUpdateBuffer needs a pipeline barrier, but a pipeline barrier is a far cheaper operation than a CPU-GPU sync.

This is the complete opposite of what you are proposing, which would require a wait event in user space that is not deferrable.
A Vulkan app that would submit vkCmdUpdateBuffer commands into a separate command buffer and force the splitting of other command buffers whenever a buffer has to be updated is the antithesis of good API usage.

That's because of warp occupancy, not because of Vulkan allowing submits to overlap: a renderpass has to be fully contained in a command buffer (as its beginning and end, but not necessarily the content) so the parallelism we see in the image is not thanks to that part of the Vulkan spec.

WebGPU doesn't have multi-subpass renderpasses. I don't know if drivers take advantage of renderpasses being pre-compiled to optimize and allow parallelism.

These things are called "Render Graphs"; the folks at DICE did a lot of research into them and how to overlap the different parts of the rendering process, and ConfettiFX is following in their footsteps.

This is the new hot-topic in low-level graphics programming and Vulkan with its subpasses and explicit resource dependencies actually already has all the necessary meta-data to create such "Render Graphs" should driver implementations choose to go that extra mile in the near future.

@RafaelCintron
Contributor

RafaelCintron commented Dec 11, 2018

Thank you for putting this together, @Kangz

Re: Promises and ".thenable": The problem with the proposed approach is that there are more methods on a promise object than just .then. We would need to implement all of them in order for the object to truly act like a promise to web developers. Instead, I suggest we have an attribute on WebGPUMappedMemory that returns a real promise web developers can .then. The Promise can be lazily created.

States: The document should clearly define the states of WebGPUMappedMemory and WebGPUBuffer as well as which specific actions transition each one. When I first read the proposal, it took me a while to figure out the two objects had a completely different set of states.

setSubData and mapWriteSync: We should consider punting these from MVP since using the async APIs is what we want people to do anyways, even beginners. If people want debugging, consider having buffers accept an initial ArrayBuffer at creation time.

Polling: Spec should be clear that results are delivered on Javascript task boundaries and do not change during a callback or Promise resolution. In other words, if you query isPending and receive false, continued queries are not going to change the answer until you exit Javascript.

GC discoverability: As discussed during the call, this can be solved by having the ArrayBuffer keep the WebGPUMappedMemory alive. WebGPUMappedMemory should keep the WebGPUBuffer alive. If a WebGPUBuffer is placed in a command buffer as part of a command, the command buffer will keep it alive.

isPending and getPointer: I agree with @grorg that isPending should be an attribute. getPointer should also be an attribute. Both should be readonly.

Questions I'd like the eventual PR to answer:
Is it an error for the developer to call mapReadAsync followed by mapWriteAsync without an unmap in between or vice versa?

If the web developer maps 5 bytes of data, does the array buffer returned by getPointer only contain 5 bytes or the whole buffer?

Are the calls to mapReadAsync and mapWriteAsync nestable? In other words, if I call mapReadAsync 5 times, do I need to call unmap 5 times in order for the buffer to transition from the mapped state to the unmapped state?

@kainino0x
Contributor

Thenable: FYI, the idea of making this Thenable was not to make it look exactly like a Promise; "thenable" is a concept from the spec (although it only barely mentions "thenable" - it was a more prominent concept before Promises were specced, afaik). That said, I don't think I have any issue with just returning a real Promise (although it is one more piece of garbage for the GC).

setSubData and mapWriteSync: I suspect these are important for more than hacking/debugging. First of all, even though they're synchronous, they're not "blocking" (they just do a memcpy of your data). Second (and I could be wrong about this), they seem useful for a lot of use cases. For example, with WebXR, as soon as you receive the pose data, you want to upload it (and it's a tiny amount of data). Having only async write would mean the application has to wait for both the mapWriteAsync to complete AND the pose data to arrive before it can upload it and continue rendering. This could probably be handled with Promise.all (or a polling equivalent), as sketched below, but it might be difficult to shoehorn into some application architectures, and it might have a latency impact.
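
A sketch of the Promise.all variant mentioned above (uniformBuffer, poseDataPromise and the matrix layout are hypothetical; it relies on the then-able WebGPUMappedMemory being assimilated by Promise.all):

Promise.all([uniformBuffer.mapWriteAsync(0, 64), poseDataPromise])
    .then(([arrayBuffer, pose]) => {
        new Float32Array(arrayBuffer).set(pose.viewMatrix); // upload as soon as both are ready
        uniformBuffer.unmap();
        // ... submit the rendering that reads uniformBuffer ...
    });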

GC discoverability: (nit) All that really matters is that the ArrayBuffer not suddenly become detached (as if the buffer has been unmapped). So technically the ArrayBuffer only has to point at the WebGPUBuffer, not the WebGPUMappedMemory, I think.

@grovesNL
Contributor

That said, I don't think I have any issue with just returning a real Promise (although it is one more piece of garbage for the GC).

I think it's going to be important to keep a GC-less path for most of the API currently marked with thenable, because this applies to some functions which will usually be called at high frequency. There was some mention of a spinloop use case for workers too.

I'd still prefer to use callbacks in the IDL at this point unless we have a good idea whether a special thenable would be allowed.

@Kangz
Contributor Author

Kangz commented Dec 11, 2018

setSubData and mapWriteSync: like @kainino0x said it is super important to have one of them in the MVP.

Is it an error for the developer to call mapReadAsync followed by mapWriteAsync without an unmap in between or vice versa?

I failed to mention this in the proposal, but the MAP_READ and MAP_WRITE buffer creation usage flags would be mutually exclusive, so the problem wouldn't happen.

If the web developer maps 5 bytes of data, does the array buffer returned by getPointer only contain 5 bytes or the whole buffer?

The ArrayBuffer will contain 5 bytes.

Are the calls to mapReadAsync and mapWriteAsync nestable? In other words, if I call mapReadAsync 5 times, do I need to call unmap 5 times in order for the buffer to transition from the mapped state to the unmapped state?

Calling WebGPUBuffer.unmap invalidates all WebGPUMappedMemory for this buffer.

@kainino0x
Contributor

Being data race free would be possible if ArrayBuffer could be unneutered but this is not the case.

Just was looking at this issue again and thought about this.
We don't necessarily need to "reattach" ArrayBuffers to have "persistent" mapping underneath - we can return a new ArrayBuffer each time, but it could point at the same chunk of memory as before (e.g. IPC shmem or a persistently mapped buffer).

@RafaelCintron
Contributor

GC discoverability: (nit) All that really matters is that the ArrayBuffer not suddenly become detached (as if the buffer has been unmapped). So technically the ArrayBuffer only has to point at the WebGPUBuffer, not the WebGPUMappedMemory, I think.

@kainino0x , what you suggest will also work.

@RafaelCintron
Contributor

setSubData and mapWriteSync: like @kainino0x said it is super important to have one of them in the MVP.

I am confused by the behavior of mapWriteAsync. Suppose you create a buffer and use it to upload some data in one frame. 100 frames later you call mapWriteAsync on the buffer. Will the returned WebGPUMappedMemory return false for pending and give you an ArrayBuffer right away? If so, how is this different than using mapWriteSync? I must be missing something.

Should we change the API such that we do away with WebGPUMappedMemory altogether and have mapWriteAsync return a promise that resolves to an ArrayBuffer and mapWriteSync return a nullable ArrayBuffer? You will know the buffer is pending if mapWriteSync returns null.

@Kangz
Contributor Author

Kangz commented Dec 12, 2018

Will the returned WebGPUMappedMemory return false for pending and give you an ArrayBuffer right away? If so, how is this different than using mapWriteSync?

It will always be returned with pending = true, and it is an implementation detail whether it becomes available on the next JS task boundary or in a couple of frames. It is different from the synchronous case because it gives the browser a chance to give you an ArrayBuffer that points directly into GPU-visible memory.

mapWriteSync is really just like setSubData but avoids one copy.

Should we change the API such that we do away with WebGPUMappedMemory altogether and have mapWriteAsync return a promise that resolves to an ArrayBuffer and mapWriteSync return a nullable ArrayBuffer? You will know the buffer is pending if mapWriteSync returns null.

This is basically proposal 2 but then mapWriteSync cannot be used for immediate data uploads, and by that I mean letting the application "upload data right now", even if the buffer is still in use by the GPU.

@kvark
Contributor

kvark commented Dec 14, 2018

It's hard to keep up with this discussion... Just shows that we are reaching the limit of where Github UI works, and may need to consider different spaces for hot topics.

Zeroing out the data

It would be unfortunate to zero out the data (that is, mappings for writing) on every map operation, especially since we don't have persistent mapping and so we can expect mapping to be called more often. Can this be avoided by:

  • ensuring resources are either zeroed on creation, or we track their initialization flag and error out
  • keeping the data the user put in if the GPU didn't modify it
  • only zeroing out data if we know GPU did some writes to a resource

Persistent mapping

My understanding here is that a few popular use cases would be helpful for us to find the right solution. I've recently seen one such case: in Dota2, there is a big persistently mapped UBO, and CPU writes down chunks of it and then binds it as a dynamic uniform buffer (providing the offsets) over and over to different draw calls (with advancing offsets).

In this case, since the GPU doesn't mutate the data, the user can just map/fill/unmap the buffer a few times per frame, and we as a browser implementation can turn those map/unmap calls into simple flush/invalidate on a persistently mapped buffer. So it doesn't appear that exposing it in the API would be required to get that efficiency.
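
A sketch of that pattern on top of mapping proposal 1 (uniformBuffer, chunkOffset, chunkSize and drawData are hypothetical):

// A few times per frame: map a chunk of the big UBO, fill it, unmap.
uniformBuffer.mapWriteAsync(chunkOffset, chunkSize).then(arrayBuffer => {
    new Float32Array(arrayBuffer).set(drawData);
    // An implementation could turn this unmap into a simple flush of the range
    // on a persistently mapped allocation rather than a real unmap.
    uniformBuffer.unmap();
});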

@devshgraphicsprogramming I'm sure you have more cases under your belt. Let's talk about them and see if we are missing something important.

Overlapped submissions

I don't think we are preventing much parallelism within a single queue. The most controversial point is the implicit synchronization between dispatches, but outside of that we aren't inserting any more barriers than the user would need for correctness anyway (or at least that's the idea). And for dispatches, it came down to providing strong test cases, so I guess the next step is to see how our API would match the cases provided by @devshgraphicsprogramming in #64 (comment) (thank you!)

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 14, 2018

@kvark it's not that it's impossible to implement most algorithms or engines with your buffer API, but it would be a royal pain in the ass to port existing D3D11/12, OpenGL or Vulkan engines and games to webGPU that already use persistently mapped buffers exclusively and/or manage their own staging memory in Vulkan/D3D12.

Lastly, we have a big performance issue, because all of these "compromises" multiply like compound interest. I.e. as we saw in OpenGL Insights, a non-persistently-mapped, API-backed BufferSubData path can be up to 3x slower. So let's do a back-of-the-envelope calculation of how much slower than nicely optimized persistently mapped but non-coherent buffers this can be.

Pseudo-mapping proposal:

  • 2x for zeroing out mappings
  • 1.2x to 3x for the synch (depending on contention)
    End result is at least 2.5x to 6x slower performance than with OpenGL persistently mapped buffers that have been available for the past 10 years.

Buffer-update proposal:

  • 2x for the extra copy
  • Unknown cost for extra CPU synch, dedicated command list/buffer construction and submit
    At least 2x slower, and screws up command lists/buffer building.

My main issue is that webGPU is taking on what used to be the video driver's job back in the OpenGL and pre-D3D12 era, so the performance of my app's webGPU buffer updates (which are a cornerstone of everything a GPU app does) would largely hinge on the quality of the webGPU implementation.
After all, the sizes of the buffers that stage pseudo-BufferSubData updates, or of the buffers that back the mapped ranges in a multi-process implementation, their allocation strategies, etc., are now all the responsibility of the webGPU implementation. For example, I don't think having one JS object per pseudo-BufferSubData update would work out very well performance-wise if lots of very small sub-ranges were updated; in Vulkan or D3D12 these updates would at least be buffered up in one object that uses the command buffer/list allocator with an appropriate growth strategy, as opposed to being treated as one submit per command buffer/list containing a single buffer-update range.

Past experience tells us that it was rare for every driver team to do the job well enough to satisfy all the use cases; most of the time a problem surfaced during the production of a new and popular game and only got fixed after it reared its head in production.
The now very old article in OpenGL Insights, as well as a presentation (whose name escapes me at the moment) comparing the performance of DirectX11's best attempt (DISCARD_OVERWRITE) vs OpenGL persistently mapped buffers (30 vs 70 FPS), showed massive disparities between vendors: effectively, if you wanted the best performance you had to pick two different code paths for different GPU vendors (i.e. mapping for Intel, update-subdata for AMD). Now, with all your attempts to prevent fingerprinting, a developer couldn't even resort to such a workaround of per-vendor paths.

This would likely repeat some of my past grievances with Intel's OpenGL drivers, in the form of grievances with browser XYZ's implementation of webGPU, because, after all, can you guarantee that each major browser will throw as many people, with as much expertise and as much money, at its webGPU team as a vendor would at an OpenGL driver implementation?
And on top of that I could most probably layer all the present and future issues with the underlying native-API implementations by different vendors on different OSes.

@Kangz
Contributor Author

Kangz commented Dec 14, 2018

Zeroing out the data

This would be extremely difficult to specify unless we say that MAP_WRITE can only be used in combination with read-only usages, which seems like an OK restriction. WDYT? For multi-process browsers without a mechanism to share GPU-visible memory with the renderer process, it would require either keeping a shadow copy in the renderer process and sending invalidations to the GPU process, or asynchronously reading the data back from the GPU process when writing. Both are probably OK.
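For illustration, that restriction would make buffer-creation validation look something like this (the flag names here are shorthand assumptions, not from any current proposal):

// Hypothetical rule: MAP_WRITE may only be combined with usages the GPU reads.
device.createBuffer({ size: 1024, usage: MAP_WRITE | UNIFORM });  // valid
device.createBuffer({ size: 1024, usage: MAP_WRITE | STORAGE });  // error: GPU-writable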

@devshgraphicsprogramming we agree that setSubData is sub-optimal. It exists so that developers can more easily get started with WebGPU, and it also helps with debugging because you can put data in the buffer "right now" without having to care about what the rest of the code is doing. More advanced developers would use mapWriteAsync instead of setSubData in almost all cases for production code.

Let's assume that, like you and @kvark suggested, mapWriteAsync doesn't zero-out the mapping. At this point engines can use mapWriteAsync the same way they have been using persistently mapped buffers: instead of having a single ringbuffer for data uploads and avoiding races themselves with fences, they keep a recycle queue of large mappable WebGPUBuffers and let the browser tell them when it is safe to use a WebGPUBuffer without races.
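A rough sketch of that recycle-queue pattern follows; it assumes that awaiting mapWriteAsync yields an ArrayBuffer and that the usage-flag names below exist (both are illustrative, as is the helper itself):

// Hypothetical recycle queue of large mappable staging buffers.
class UploadRecycler {
  constructor(device, size) {
    this.device = device;
    this.size = size;
    this.free = [];  // buffers not currently in flight on the GPU
  }
  // Hand out a mapped staging buffer; the resolution of mapWriteAsync is
  // the browser telling us the GPU is done with the buffer (no races).
  async acquire() {
    const buffer = this.free.pop() ||
        this.device.createBuffer({ size: this.size,
                                   usage: MAP_WRITE | TRANSFER_SRC });
    const data = await buffer.mapWriteAsync(0, this.size);
    return { buffer, data };  // fill `data`, unmap, then copy out of `buffer`
  }
  recycle(buffer) { this.free.push(buffer); }  // call after queue.submit
}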

The biggest disadvantages are:

  • increased memory utilization because you cannot map a part of the buffer while the rest is in use by the GPU so you have to overallocate a bit.
  • native applications using persistently mapped buffers would have to either emulate them on top of this abstraction (incurring an additional copy) or specialize their upload mechanisms for WebGPU.

The advantages are:

  • the guarantee of having no races.
  • on some APIs browsers will be able to give you a pointer to the underlying API mapping, which allows for the minimum number of copies.
  • an efficient implementation even if the pointer isn't the underlying mapping pointer.
  • if zero-copy is deemed a security risk, for example for some form of side-channel attack, we don't have to disable WebGPU and can just fall back to not giving out the underlying pointer.

@devshgraphicsprogramming

increased memory utilization because you cannot map a part of the buffer while the rest is in use by the GPU so you have to overallocate a bit.

That would be a serious understatement.

native applications using persistently mapped buffers would have to either emulate them on top of this abstraction (incurring an additional copy) or specialize their upload mechanisms for WebGPU.

They did that to be much faster than the usual OpenGL (ES) and DirectX11 methods; now they would end up even slower than the old methods due to emulation.

The advantages are:

You'd keep all of them except the "guarantee of having no races" if you went for a non-cached approach with explicit flushes and invalidations (which would also give you beautiful RenderDoc integration).

@Kangz Kangz closed this as completed Dec 14, 2018
@Kangz Kangz reopened this Dec 14, 2018
@kvark
Contributor

kvark commented Dec 17, 2018

We discussed this with @devshgraphicsprogramming a bit and came to the conclusion that just adding mapping ability to WebGPUBuffer is not the best way to expose the data upload/download workflows. I'll try to explain why.

On native, persistent mapping became the way of exchanging data because:

  1. the user can use some of the buffer on the CPU while the GPU works with the rest
  2. the user can explicitly tell the driver when and what to synchronize, thus avoiding copies and precautionary workarounds by the driver

This doesn't match our current policy of a resource consistently being in only a single mutable usage at a time (we can't map half of a buffer while the GPU works on the other half). Thus, providing a buffer API to the user and expecting them to map (portions of) it will never be efficient. Note that for textures the single-mutable-usage policy is still fine, since it applies to subresources individually, textures don't need mapping, and it's reasonable to assume the user will only use each subresource as a whole.

I think we should step back a bit and think of a solution in terms of what workflows need to be exposed: uploading and downloading chunks of data. Perhaps, we can design an API in such a way that the implementation would be managing a persistently mapped buffer under the hood, but its subranges are exposed to the user as individual objects? Something like this:

interface WebIDLUploadBuffer {
  attribute ArrayBufferView data;
};

partial interface WebIDLDevice {
  Promise<WebIDLUploadBuffer> createUploadBuffer(u32 size);
};

The Promise<> and naming here are non-critical. The idea is that the user would be able to copy the data from it into something else using command-buffer operations, and we define at what point the contents of the ArrayBufferView get used and the object is discarded (current candidates: command buffer encoding, or submission).
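Hypothetical usage, just to make the intent concrete; the encoder, copy, and submit calls below (createCommandEncoder, copyBufferToBuffer, getQueue) are stand-ins for whichever command-buffer operations the API ends up with:

// Fill a staging object and schedule a copy into a private buffer.
const upload = await device.createUploadBuffer(bytes.byteLength);
new Uint8Array(upload.data.buffer, upload.data.byteOffset,
               bytes.byteLength).set(bytes);      // write the staging data
const enc = device.createCommandEncoder();        // assumed encoder API
enc.copyBufferToBuffer(upload, 0, privateBuffer, 0, bytes.byteLength);
device.getQueue().submit([enc.finish()]);         // contents consumed here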

@devshgraphicsprogramming

The above design is indeed better in my opinion.

Perhaps, we can design an API in such a way that the implementation would be managing a persistently mapped buffer under the hood, but its subranges are exposed to the user as individual objects?

I would still like to see the whole mapped range of the buffer as one object whose distinct ranges and subranges I can explicitly flush and invalidate. Think of it as a git push/pull (with rather non-existent conflict resolution), except that the local side is the CPU and the remote is the GPU.

That would fit nicely into existing engines that would like to port to webGPU, as many of these already have their own allocators for CPU and GPU memory.
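A hypothetical WebIDL sketch of what's being asked for, with invented names, just to pin down the shape:

interface WebGPUPersistentMapping {
  // "git push": make CPU writes in the range visible to the GPU
  void flushRange(u32 offset, u32 size);
  // "git pull": make GPU writes in the range visible to the CPU
  void invalidateRange(u32 offset, u32 size);
};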

and we define at what point the contents of the ArrayBufferView get used and the object is discarded

I have some doubts here about whether this extra tracking wouldn't defeat the whole point of native persistently mapped buffers.

Also what would createUploadBuffer(u32 size) do?

@kainino0x
Contributor

kainino0x commented Dec 17, 2018

Perhaps, we can design an API in such a way that the implementation would be managing a persistently mapped buffer under the hood, but its subranges are exposed to the user as individual objects?

I thought about this briefly as well. It is probably already possible to some extent with the proposed API, but could be more powerful (e.g. with sub-range tracking, and a hint to use a persistently mapped buffer).

@devshgraphicsprogramming

A non-coherent Persistently Mapped Buffer is implementable and should be the default.

Since the OpenGL AZDO days, persistently mapped buffers (coherent and non-coherent) have been advertised as best practice by all 3 major desktop GPU vendors.
And this has been for a very good reason: on Nvidia, ARB_buffer_storage persistently mapped buffers beat D3D11's Map(NO_OVERWRITE) performance by a factor of 3 in a benchmark designed specifically to measure upload speed. D3D11 didn't lose for lack of trying; it was mostly the implicit "sub-range tracking" that produced the disparities in performance.
They are practically the only consistent interface available in D3D12 and Vulkan.

Modern engines are already tooled around persistently mapped buffers and have their own allocators, etc., which would have to be dumbed down and stripped of their performance for the currently proposed webGPU API.

@kvark
Contributor

kvark commented Dec 17, 2018

Exposing large mapped buffers is problematic because it conflicts with the current synchronization policy of a resource having exclusive usage (i.e. the CPU can't write to one part while the GPU uses another). Adding manual flush/invalidate on top of that would unfortunately break any portability guarantees, so we can't do this.

As for the sketch I proposed, it would only work if we say that CPU-visible buffers can't be used for anything else. That would imply an extra copy (on the GPU, or via DMA) into actual private memory for anything that native engines currently keep CPU-visible while also using it on the GPU. And as soon as we start thinking about parts of a native buffer being exposed as individual WebGPUBuffer objects on the client side (that need to be used by the GPU), we hit the problem of descriptor sets that need to be managed (since the client doesn't know it's the same native buffer!).

So, for me, the discussion is blocked on the following questions:

  1. can we consider stating that CPU-visible memory is only used for uploads/downloads (and not anything on GPU)?
  2. for the original proposal (see subject), we need good pseudo-code explaining how the existing workflows would be ported onto it.

For the latter, here is a rough description of what Valve's engine is doing (sketched in code after the list):

  • for each draw call, find out how much memory it needs for the uniforms
  • sub-allocate this memory from a giant CPU-visible uniform buffer and fill it up on CPU
  • re-bind the descriptor set containing the CPU buffer with the new offset (it's a dynamic buffer, see Proposal: Dynamic uniform and storage buffer offsets #116)
  • issue the draw
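A minimal sketch of that loop in WebGPU-flavored JavaScript; the setBindGroup/draw signatures, the 256-byte alignment, and the data layout are all illustrative assumptions (the real engine does this against native APIs):

// Hypothetical per-frame suballocation from one giant CPU-visible UBO.
// Assumes dc.uniformData is a Uint8Array and uboMemory is the mapped
// ArrayBuffer of the big buffer.
const ALIGN = 256;                        // assumed dynamic-offset alignment
const align = (n) => Math.ceil(n / ALIGN) * ALIGN;

function recordDraws(pass, drawCalls, uboMemory) {
  let offset = 0;
  for (const dc of drawCalls) {
    const size = dc.uniformData.byteLength;        // memory this draw needs
    new Uint8Array(uboMemory, offset, size).set(dc.uniformData);  // fill on CPU
    pass.setBindGroup(0, dc.bindGroup, [offset]);  // re-bind with new offset (#116)
    pass.draw(dc.vertexCount, 1, 0, 0);            // issue the draw
    offset = align(offset + size);
  }
}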

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 17, 2018

it would only work if we say that CPU-visible buffers can't be used for anything else.

That actually wouldn't be bad; it would promote good usage, since immutable, inaccessible buffers are the fastest in all benchmarks 😄

That would imply an extra copy (on the GPU, or via DMA) into actual private memory for anything that native engines currently keep CPU-visible while also using it on the GPU.

As long as relaxed synchronisation is used that actually allows DMA transfers to overlap with the execution of graphics and compute operations in command buffers, all is well.

For the latter, here is a rough description of what Valve's engine is doing:

There is a ton of manual synchronisation in between, as these steps happen on 3 different timelines (the GPU queue that the draw-call command buffer was submitted to, some arbitrary timeline of device-scope flushes and invalidates, and the CPU timeline).

@devshgraphicsprogramming

That actually wouldn't be bad; it would promote good usage, since immutable, inaccessible buffers are the fastest in all benchmarks

Note: that would only be the case if you use the same UBO data more than once or twice.

@devshgraphicsprogramming

devshgraphicsprogramming commented Dec 18, 2018

I did some more digging into vkCmdUpdateBuffer and vkCmdCopyBuffer and unfortunately both can only be called outside of a render-pass.

This would make @kvark's idea a little bit limiting.

As for the sketch I proposed, it would only work if we say that CPU-visible buffers can't be used for anything else. That would imply an extra copy (on the GPU, or via DMA) into actual private memory for anything that native engines currently keep CPU-visible while also using it on the GPU.

In light of the Vulkan spec it looks like PSMs are the only way to update/copy data into a GPU-side buffer inside a render pass (since VkEvent functions and commands are allowed in any scope, both outside and inside a render pass).
Update: Chapter 5.8 of the Vulkan spec explicitly disallows it; too bad the vkCmdWaitEvents documentation doesn't even link to that chapter.

Consequently, if we are to believe @Kangz that current drivers and Vulkan implementations don't yet take advantage of "render graphs" or the possibility of out-of-order execution within a command buffer, PSMs would be the only way to truly achieve asynchronous and fully overlapped data transfers (though the synchronisation must happen either before or after a render-pass instance).

Also, all other buffer-manipulation commands (inline update, fill, copy between buffers, etc.) can only take place outside of a render pass.

@Kangz
Contributor Author

Kangz commented Sep 2, 2021

Closing. Buffer mapping was added a while ago and is fairly stable. Further discussion can go in new issues.

@Kangz Kangz closed this as completed Sep 2, 2021