mapSync on Workers - and possibly on the main thread #2217
Some customers are very interested in this, so I'd like to put it on the meeting agenda.
Without synchronous access to data, we are running into problems with TFJS on the WebGPU backend (see tensorflow/tfjs#1595, which suggests that dataSync might never be supported for WebGPU). Our rendering strategy calls TFJS as part of the render loop (since we might want to run an augmented video feed through a neural network) and requires access to network results synchronously for further processing. Yielding the render loop to wait for TFJS results would require some significant refactoring and also introduces overhead to preserve the renderer state while waiting for the results. Otherwise, we need to build hacks like reusing the latest available results and hoping that they are applicable to the frame being processed. When rendering is happening on a worker thread, there shouldn't be an issue in synchronously waiting for the GPU to finish enqueued tasks (and thus supporting something like dataSync).
Wanted to raise our hand as someone who needs blocking operations on workers as well! ✋ (We're compute-focused, don't have concepts of frames or clearly delineated times when we could yield to the browser, and run all our code in workers.) This class of issues is currently our biggest blocker with using WebGPU in the browser - we've only been able to get things working through workarounds.

We do also need onSubmittedWorkDone, so long as it's not possible to receive a callback for a submission without first yielding to the browser - we're running our entire program in a worker thread without yielding at any point (as would practically anyone coming from WebAssembly). In our ideal situation this does not imply blocking, but rather that the callback could be issued from another thread such that we can signal a semaphore that a blocking thread may be waiting on.

We really would prefer there to be native semaphores/fences in WebGPU, but barring that we need to emulate them, and the minimum to do so would be a worker thread kept around to wait for completion events, used to signal potential waiter threads. Something like:

```cpp
std::thread event_processing_thread([&]() {
  while (true) {
    // we don't want this to be a spin-loop - we're really just telling the API we'd
    // like callbacks to be made from this thread when there are any pending
    wgpuInstanceProcessEvents(/*blocking=*/true);
  }
});

FenceFutex fence;
wgpuQueueOnSubmittedWorkDone(..., [](WGPUQueueWorkDoneStatus status, void* userdata) {
  // the callback is a plain function pointer, so the fence arrives via userdata
  static_cast<FenceFutex*>(userdata)->signal();
}, &fence);
wgpuQueueSubmit(....);

// can do other work here, non-blocking

// maybe wait - if already fired then this won't block and otherwise it'll wait for the event
// to be processed on the thread and signal the fence
fence.wait();
```
An interesting discussion in the Dawn matrix room with @austinEng, where he had a really nice idea I wanted to document here: there may be a single primitive that solves this for javascript/native and workers: allow async operations to take a pointer (an element in a SharedArrayBuffer) and a value, and have the WebGPU implementation write that value to the pointer when the operation completes. Browser environments could then use Atomics to observe it. Something like:

```cpp
WGPU_EXPORT void wgpuBufferMapAsync2(WGPUBuffer buffer, WGPUMapModeFlags mode,
                                     size_t offset, size_t size, void* addr,
                                     uint32_t value, WGPUBufferMapAsyncStatus* status);
```

A synchronous version can then be written easily by users:

```cpp
bool wgpuBufferMapSync(WGPUBuffer buffer, WGPUMapModeFlags mode, size_t offset, size_t size) {
  WGPUBufferMapAsyncStatus status;
  uint32_t futex = 0;
  wgpuBufferMapAsync2(buffer, mode, offset, size, &futex, 1, &status);
  // could do other work here, switch to another thread/worker, etc -
  // here just blocking to emulate sync behavior
  futexWait(&futex, 1);  // wakes once the implementation writes 1 to the futex
  return checkStatusSuccess(status); // could be device loss, etc
}
```

If all the async methods supported this, the same pattern would work everywhere. The other advantage of this approach is that it solves the multiple-codebase issue: if two frameworks are used in the same application - even if one is written in javascript and the other compiled via webassembly - there's no need for complex scheduling coordination, cross-language interop goo, or global WGPUInstance/WGPUDevice waits, as everything is scoped. If we had this, I think all our concerns around synchronization would be solvable in user code (we'd be able to perform async compilation, async and overlapped submission, and sync mapping where required) and there'd be little additional needed in the WebGPU API.

This could be emulated with a spin-loop as in my previous example (callbacks take userdata of the addr/value and perform the write themselves), but not needing to have user threads/workers spun up to do this - especially if there was one such thread/worker per framework in an application - would be a real win for both ease of use and resource consumption.
@kainino0x helped clarify that this could be seen as the way to do async when you don't have a top-level event loop - the callbacks would work for main-thread javascript or a worker that was yielding, and this would cover everything else.
An unfortunate thing about using atomics is that they require SharedArrayBuffer, which requires COOP+COEP: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements If SAB were always available (and COOP+COEP just enabled sharing it between threads), this concern would mostly go away.
Darn, good point :/ This signal style would be most useful in native or worker contexts with SharedArrayBuffer available (as you're not compiling pthreads code to wasm unless you have it).

I think in the case of workers with private memory, a blocking tick method that issued the callbacks on the worker that issues the tick would cover the same use cases, though I think the signal style, even if not waitable, could be useful for ergonomics. I'm imagining the case of dispatching a lot of work and then mapping several of the results: that becomes nice straight-line code of submit -> map -> map -> map -> loop on tick until all/any ready -> use mapped resources. With the pure callback approach that'd need the user to do more juggling (and it scales with the number of mapped resources, with each part of the code mapping the resources, etc). Not as efficient as a futex for low-latency wakes, but at the point where you have a single thread in isolation you probably don't care much about that.

I think even the SharedArrayBuffer method would need some kind of flush method unless the implementations flush automatically, so this may all require the same primitives regardless of approach - it's just that if you don't have Atomics.wait, you can't block on the signals and deadlock yourself.
A possible direction on this issue is that we just give up and start allowing blocking readbacks in general, even on the main thread. I know we're operating under strong architectural guidance that we don't do this. But in practice it's just forcing people to do terrible workarounds with canvases (reshaping/re-encoding data into 2D rgba8unorm data, writing it into a canvas, and then using existing synchronous readback APIs like toDataURL to read them). Maybe we should get out of the business of trying to force people to do the "right thing" when the wrong thing is already possible, and just provide the primitives. The performance consequences of synchronous readback are not that bad (compared with e.g. synchronous XHR).
As someone who participated in the WebGL 1 discussions about the same kind of functionality when trying to do compute/media work there, I'd agree with most of that - the abominations we had to create still make me sad, but no lack of functionality eliminated our need to create them. Your example is actually one I had to build, as "render this canvas to an animated gif" was a product requirement and I couldn't just say "well, reading pixels back is really hard in WebGL" - I just had to build a convincing progress spinner because it took so long :P

What we learned then is that there's definite implementation complexity that comes from this kind of stuff, but it's really a tradeoff of what's difficult to do in an implementation vs. what's impossible to do on top of the implementation. Needing users to do some extra work to get the precise behavior they want - knowing each user will want a slightly different behavior - works best when the user is actually able to do that work. IME, as long as there is a primitive that allows for asynchronously observable multi-threaded semaphores, everything else can be built on top of it - blocking sync, callback-style async, pure polling, or mixed futex-based sync/async. If there are only async callbacks with event-loop-driven flushes, or only synchronous blocking operations, then the others can't be (practically) implemented by users and there will always be tension (or whining from people like me :).

Requiring the performance-destroying use of Asyncify when targeting WebGPU, or spinning up multiple workers whose only job is to sit and block on waits, both fall into that abomination category (this also relates to discussions around efficient timeline-ordered readback via a hypothetical GPUQueue.readBuffer) - people will still need to do those kinds of things regardless of whether the implementations make it easy or efficient, and it'd be in the end users' (web users, etc.) best interest for them to be able to be efficient (fewer workers, fewer large wired allocations, etc.).
This issue is focused on mapSync so I've retitled the issue, and pasted my comment in the summary about allowing it on main thread too. |
I would also like to raise my hand ✋ in support of this. Due to a large existing codebase (that I cannot change) we have to do the following (all in a single synchronous method inside WebAssembly) to implement buffer readback:
I realize this is pretty inefficient and I would definitely prefer to submit a group of these at the same time and wait on the first to complete, but I can't break the existing codebase. It also occurs to me that, due to internal synchronization, the explicit wait step should not be needed, because the map should internally wait until the copy operation is complete. Because of internal synchronization I am not sure there are too many instances where one would need to explicitly wait on onSubmittedWorkDone.
@mrshannon if mapSync were limited to workers, would you be able to get all of this to run on a background thread (assuming support for transferControlToOffscreen)?
That's correct.
Good point - it might be more useful to query onSubmittedWorkDone than to synchronously wait on it. (We would define it so it can never change during a task, so it's impossible to spin-wait on it.)
@kainino0x Yes, my preference is to use a background worker, though until offscreen canvas is universally supported I may have to proxy calls to the main thread.
That would be a problem in a worker, as a program (especially WebAssembly) may not yield to the event loop very often.
I think that restriction is important to discourage spin-looping behavior, so if it's needed then a synchronous wait-for-work-done would be preferred.
So the problem is when porting code originally written with desktop APIs in mind (so really only an issue in WebAssembly).
I just ran into a case where it cannot be known ahead of time whether a synchronous wait will be needed.

The obvious solution is to make it possible to start the operation first and decide later whether to block on its completion. Without this, we must do a busy-wait loop polling for completion.
Maybe we can have something like:

```
interface GPUEvent {
  readonly attribute Promise<undefined> promise;
  [Exposed=(Worker)] // idk if this is allowed here
  undefined wait();
};
```
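In native terms, a GPUEvent like this is roughly a one-shot event exposing both an awaitable handle and a blocking wait, so callers can choose per call site rather than per API. A hypothetical sketch with std::promise (names are illustrative, not from any spec):

```cpp
#include <future>

// Native analogue of the GPUEvent idea: one object exposing both a
// promise-like handle (compose with other async work) and a blocking
// wait() (worker-only on the web). Hypothetical helper.
class GpuEvent {
 public:
  GpuEvent() : future_(promise_.get_future().share()) {}

  // "promise" side: can be stored, copied, and awaited elsewhere.
  std::shared_future<void> future() const { return future_; }

  // "wait()" side: block until the event is signaled.
  void wait() const { future_.wait(); }

  // Called by the implementation when the underlying work completes.
  void signal() { promise_.set_value(); }

 private:
  std::promise<void> promise_;
  std::shared_future<void> future_;
};
```

Using a shared_future means any number of consumers can observe completion, mirroring a Promise's multiple-subscriber semantics.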
This is a good point to consider. Similar to #2085: in general you may want to start the work as early as possible, but then block on it later.
We also need this for Ruffle. We emulate Flash and, even though it's not at all recommended by modern standards, we need to be able to read back GPU buffers synchronously whenever the script calls for it. WebGPU requiring async because it's best modern practice means we can't port anything that predates that advice.
IIUC, the intended solution for simulating sync reads for wasm is JSPI.
See also Blink's decision to raise the synchronous WebAssembly compilation size limit.

I really don't understand how this didn't make it into the MVP. Synchronous readback is hugely important for basically any GPGPU task (as many devs above have pointed out), many common features in games (e.g. knowing which object the user is looking at / pointing their crosshair at), and basically every GPU-accelerated 2D render engine (for collision detection).

There are plenty of ways for JS users to shoot themselves in the foot with performance already. Even beyond the contrived "really slow function" example measured above, I've measured 500ms GC pauses and >1s React re-renders on real-world websites. Disallowing an extremely useful feature, or just going "welp, just rewrite your entire renderer in WASM so you can use JSPI", is extremely frustrating. I guarantee you that the types of developers who use the raw WebGPU API will be just as concerned about performance and minimizing jank as the people writing the spec for it. Optimization is every render engine hacker's breakfast, lunch, and dinner. If a junior coder is already allowed to turn a phone into a space heater with plain JavaScript, I don't see what this restriction protects.
+1 for the comment above.
I happened to be chatting with the author of that intent-to-ship today; he added some color: if you look at the thread, there is significant pushback - making WebAssembly modules block for 1 second for compilation is terrible for the Web platform. After some negotiation they settled on the 8MB limit, which blocks for much less time (basically a heavy function call, but at the limit of what's considered "blocking").

It's clear that there is convenience in being able to map synchronously, but JS has a ton of primitives to handle asynchrony, and synchronous readback in the GPU-picking cases you mentioned would easily stall for a couple of frames due to the depth of the browser rendering pipeline. For GPU picking it is still better to do async readbacks. As an additional data point: playing Baldur's Gate 3 I noticed an extremely slight latency in item picking, which I assume is due to async GPU picking, but there is no stall when clicking on something.
That is just one example. And even that is not always acceptable.
It would still be preferable to leave this decision up to the developers using the spec, rather than simply taking away one of the choices. While it may not be ideal, in some cases it's a perfectly valid tradeoff. See for example: #2217 (comment)
An argument I've heard against sync, which I think I agree with, is that there should be no performance difference between sync and async. One way or another, the GPU process has to signal the process waiting for the result that the result is ready. After that, you can imagine that in a perfect implementation the difference between sync and async is just a few instructions. So, if async is slow in some browser, that's an implementation bug, not a reason to add sync. An exception would be wasm, but as mentioned above, JSPI is supposed to be the solution for wasm, so ports will be easy.

Picking was brought up as an example, and I worried about that too. It's possible it is a problem, but I wrote this example and it doesn't seem to be. The example makes 1000 black circles in SVG that turn red on hover. It also makes 1000 black circles with WebGPU and uses GPU picking to mark one as red. I thought I might see a noticeable difference between an SVG circle being highlighted vs a WebGPU circle being highlighted, but I don't. Maybe my eyes just aren't sensitive enough, or maybe you need a much heavier scene for it to stick out.
According to https://github.com/WebAssembly/js-promise-integration/blob/main/proposals/js-promise-integration/Overview.md#supporting-responsive-applications-with-reentrancy, JSPI comes with caveats around reentrancy.
No, it does not. It has callbacks, syntactic sugar over callbacks (promises), and syntactic sugar over syntactic sugar over callbacks (async/await). Promises can be implemented entirely as a library, and async/await is a very simple syntactic transformation that's existed since ES6's generators. You are one of maybe a few dozen people in the world who actually knows what a microtask is and exactly which operations queue them up. Every other developer just sees the sugar.

If JS included real async primitives - and by primitives I mean language features that are fundamental enough that they cannot be polyfilled - I suspect that fewer people would react the way they do when some extremely useful function is made async because the Web Platform™ gods have deemed it too janky for the uneducated masses to use without potentially causing a 1-2 frame slowdown (the horror!)
"Would" or "could"? WebGL's |
Debating what qualifies as a stall aside: it is clearly an unpopular opinion among folks that are used to WebGL's readPixels.
Would. It requires all previous operations on the GPU to be completed, and because most GPUs have a single graphics queue, it forces all the compositing of the browser to be finished, etc. If you block every frame, then the amount of queued work stays low all the time, but you still create lots of bubbles in the execution. If you look at the performance profile of your page, you should be able to see that each readPixels still takes a large amount of time (~1ms at least, because of the IPC roundtrip and submit + immediate wait on GPU work). (Unless you use non-blocking async readback, but that's basically like mapAsync.)
(Apologies if this idea has been made already) What if the native implementations, but not the web implementations, add support for synchronous readback on the main thread? Pros:
Cons:
"a couple frames" to "~1ms" is quite a big pivot! Even on a high-end phone with a 120hz display, that's still only an 8th of a frame. I've benchmarked readPixels, and it indeed takes about 1ms per call. Here are some points of comparison (that you can try for yourself) for what else takes 1ms:
Does this mean we should make number-crunching code, SVG DOM operations, or all reflow-forcing operations asynchronous? 1ms is, IMO, absolutely not heavy enough to justify forcing async. |
This sounds particularly un-nice.
Still agreeing hard.
If you keep flushing the GPU pipeline, yes, you get ~1ms, but then the GPU spends the majority of its time idle because there is no pipelining.
The GPU also spends the majority of its time idle if you hit your frame budget. It seems like your objection here is not actually that synchronous readback causes any jank or inevitably results in a worse user experience, but that it feels icky for the main thread to ever be idle.

I get that it feels wrong to "block" the main thread, but the definition of "blocking" is not as objective as it seems. Is writing to a file descriptor a "blocking" action? It's IO-bound. What about writing to stdout? The Web platform's reluctance to perform any "blocking" operations whatsoever on the main thread seems to have led to no small number of gripes over the years, and for good reason. In particular, the decision to forbid Atomics.wait on the main thread comes to mind.
"A development cost" is underselling things a fair bit. As pointed out in the But OK, let's assume the worst1 and consider the impact of the main thread being blocked for several frames. Let's say some dev forgot to consider an edge case and allowed a certain operation to take way too long, or it's running on lower-end hardware than was envisioned. Which would you rather have:
This is basically the current plan - we're adding a synchronous-wait primitive to webgpu.h. In Wasm, it will be supported when JSPI (or Asyncify or similar emulation) is available.
The 1ms isn't the justification for async (it probably could be even faster than that - I think non-Chrome browsers do better than 1ms already). That's just the overhead - the actual justification is that there's an arbitrary amount of work already queued to the GPU that must complete before the mapSync/readPixels completes. It could be trivial, sure, but it could also take a long time (especially on a low-end phone).

That said, while it's technically possible for the webpage (most importantly touch-scrolling) to remain responsive while a lot of WebGPU work is queued, it's often going to be the case that the webpage, or even the entire browser, stops displaying frames while it completes. In which case the mapSync is probably not making anything worse.

Anyway, TBH, I think we're rehashing all the same arguments. I still personally am on the side of adding mapSync in all contexts, partly because of historical precedent with WebGL and 2D canvas, and partly because of the thing in the last paragraph. But regardless of whether mapSync happens, JSPI is on its way, and we'll make the best we can with that.
I'll tentatively triage this into Milestone 1 as it's clearly still a hot topic. Can't guarantee it will see any real progress though.
Speaking purely as an independent observer here: if you don't add mapSync on the main thread, the drawbacks are that some applications will use mega-janky toDataURL workarounds, while others just won't be ported to WebGPU (or possibly even the entire web platform) at all. If you do add mapSync on the main thread, the drawbacks are that some apps which don't need mapSync may elect to use it anyway, thereby resulting in unnecessarily-janky apps. If I had to choose between those two outcomes, I'd personally choose to have those apps supported on WebGPU, and choose a scary name for mapSync to try to make it clear that authors should avoid using it if possible. Personally, I'd care way more about seeing Jet Set Radio or whatever running in a browser on WebGPU than worrying about how janky some one-off app someone wrote for who-knows-what reason is.
Here are some notes that I can pick up from the discussion so far:

a) In the previous comments, Emscripten's Asyncify was mentioned as the current way to emulate synchronous mapping, with its performance cost acknowledged.

b) JSPI is the more robust incarnation/evolution of Asyncify (obsoleting Emscripten's original Asyncify model), and fixes many of its problems. However, even then, problems remain to be resolved with JSPI:
c) There were some mentions above of having waitable map() support rather than synchronous mapping support. In a hypothetical waitable map() scheme, the user could initiate an asynchronous map(), but then at some point later (while still computing in that same event handler) decide either to poll whether completion has occurred, or to synchronously pause to wait for completion. This kind of waitable model could be useful, since it would allow renderers to initiate several map requests at once, and then wait for those to resolve in first(), any() or all() fashion. That would allow renderers opportunities to sink waits and perform other computations meanwhile. (Maybe some renderers might be able to issue maps at the beginning of a frame, and then resolve them at the late stages of the frame, or similar?) If there were a sequential completion guarantee for multiple mappings, then waiting for any() would be trivially easy (just wait in order of submission, i.e. wait for first()). Though if the completion order might be arbitrary, this would result in the select problem. (This could be solvable via a SAB futex-wait style method.)

d) As a followup to the above: at this point the conversation has revolved around the async->good, sync->bad dogma. It is easy to slip from this into thinking async->fast, sync->slow by construction. But there is no precedent to show that this would actually be the case. It is very possible that such a waitable mapping function, or an outright mapSync() function, could actually result in higher throughput/shorter CPU time spent in a renderer. Just because the maps resolve asynchronously via the JS event queue does not mean that these mappings would be fast - on the contrary, they could be pretty slow in comparison to a synchronous/waitable map, due to JS event queue processing latency and computing context management overhead.

But since support for sync/waitable maps doesn't exist in browsers, there are no current benchmarks to verify this behavior. To this effect, I would recommend that browsers add mapSync() support behind a flag, e.g. in Nightly/Canary builds, so that WebGPU renderer experimenters could measure the cost of the async mapping model, to validate that there is no overhead to performing asynchronous mapping. It is known that async mapping requires more GPU memory for round-robin dynamic buffers, compared to being able to fence+map when unused. How much more? Well, that is harder to analyze. A behind-a-flag mapSync() would allow measuring this overhead in renderers as well (in codebases that are able to do both async and sync maps). I would say that if, for example, waitable/sync mapping gave +x% more FPS or -y% less GPU memory used by a renderer compared to async mapping, that would give a new angle to look at the dogmatic/pragmatic axis of sync vs async computation.

e) Strong +1 to adding mapSync() support permanently to Workers.

f) Has there been consideration of adding a GPUQueue.readBuffer() style API that reads results directly into caller-provided memory? This kind of API might be faster than a JSPI'd mapAsync()+read+unmap, e.g. by virtue of avoiding JSPI, and it could even be faster than a mapSync()+read+unmap by avoiding an extra JS->Wasm copy.
GPU Web CG 2024-03-13 Atlantic-time
There has been a historical precedent towards synchronous APIs being allowed in Workers (so I expect TAG would be OK with this, unless there has been a change in direction somehow). Examples: FileReaderSync, synchronous XMLHttpRequest, and Atomics.wait() are all available in Workers but not on the main thread.
Btw, the discussion of synchronous work in Workers reminded me of a priority inversion problem. Maybe a bit tangential, but worth mentioning here: an important detail about browser APIs in Workers *plus* SharedArrayBuffer is that SAB makes it possible to observe in user JS code whether such operations require assistance/forward progress from the main browser thread in order to advance. Such work requests can result in priority-inversion deadlocks. Some examples:

(One might argue that these are examples of browser vendors writing the very type of poor-practices code into browser codebases that they are trying to keep bad JS developers from writing, but best not go there...)

It is good to note that before the advent of SharedArrayBuffer, none of this was a problem, since there was no synchronous shared state between the Worker and the main thread. So even if an API in a Worker might synchronously halt to wait for a lock/event from the main thread, it would never be functionally observable. There was little traction in attempting to update the wording of already-shipped W3C specs about this; rather, these types of issues have been handled as individual bugs against browsers. For example, we used to proxy all console.logs in Emscripten manually over to the main thread to log, until later when Firefox no longer had the problem.

The point of this is that when specifying the behavior of WebGPU in Workers, it would be good to have a paragraph mentioning that implementations are required to be able to make forward progress without synchronous help from the main browser thread. Otherwise there is a possibility of priority inversions occurring in Workers when SAB is in town. Of course this is not a potential challenge only for synchronous operations like mapSync().
For everyone trying to do the same thing with native WebGPU (macOS M1): I tried many variants with wgpu-rs and found that mapAsync is terribly slow there - it's just a wgpu-rs implementation issue. The only good variant was to take Google's Dawn implementation, make ring buffers for reading, and do this every frame (C++ code with GLFW and Dawn):

```cpp
cpu_buffer = buffers_for_reading.next_buffer();
encoder.copyBufferToBuffer(gpu_buffer, cpu_buffer);
queue.submit( ...encoder.finish()... );

// map and wait
bool done = false;
wgpuBufferMapAsync(cpu_buffer, mode, offset, size,
    [](WGPUBufferMapAsyncStatus status, void* user_data) {
      *((bool*)user_data) = true;
    },
    &done);
while (!done) {
  glfwPollEvents();
  wgpuInstanceProcessEvents(instance);
}
// getConstMappedRange and do smth with the data
cpu_buffer.unmap();
```

Also, for everyone fighting with the wgpu-rs -> Dawn migration: Google decided not to have "undefined" flags be 0, and has more validation on them, so you may get strange errors; solve them by specifying all descriptor fields.
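The ring-buffer part of the approach above is API-agnostic. Here is a minimal sketch of such a round-robin staging pool (a hypothetical helper - `Buffer` stands in for whatever handle type the API uses, e.g. WGPUBuffer):

```cpp
#include <cstddef>
#include <vector>

// Fixed pool of staging-buffer slots handed out round-robin, so that a new
// readback never has to wait on the buffer still mapped from a recent frame:
// with N slots, the slot returned was last used N frames ago.
template <typename Buffer>
class ReadbackRing {
 public:
  explicit ReadbackRing(std::vector<Buffer> buffers)
      : buffers_(std::move(buffers)) {}

  // Returns the next slot in round-robin order.
  Buffer& next_buffer() {
    Buffer& b = buffers_[next_];
    next_ = (next_ + 1) % buffers_.size();
    return b;
  }

  std::size_t size() const { return buffers_.size(); }

 private:
  std::vector<Buffer> buffers_;
  std::size_t next_ = 0;
};
```

Sizing the ring to the maximum number of frames in flight (plus one) is the usual rule of thumb; too few slots reintroduces the very stall the ring is meant to avoid.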
In principle, AFAIU, synchronous blocking APIs are only problematic on the main thread. Web workers can block on operations without really causing any problems - especially since our operations should never block for very long (unlike network operations for example, although synchronous XHR is not a good example because it's old).
We could mirror some APIs into synchronous versions with [Exposed=DedicatedWorker].

Definitely useful - I'd start with just this one: GPUBuffer.mapSync(). (EDIT: There are other possible entry points, but let's ignore them for now.)

Maybe useful but possibly not worth the complexity:

Most likely not needed:

Synchronous map (mapSync) is known to be particularly handy. For example, TensorFlow.js can implement its dataSync() method to synchronously read data back from a GPU-backed tensor (even if only available on workers). Finally, there's no avoiding the fact that poorer forms of synchronous readback are always going to be available (e.g. synchronous canvas readback), so we should at least consider supporting mapSync on workers, where it's OK in principle.
Originally posted by @kainino0x in #2217 (comment)