mapSync on Workers - and possibly on the main thread #2217

Open · kainino0x opened this issue Oct 26, 2021 · 53 comments
Labels: api (WebGPU API)

@kainino0x (Contributor) commented Oct 26, 2021:

In principle, AFAIU, synchronous blocking APIs are only problematic on the main thread. Web workers can block on operations without really causing any problems - especially since our operations should never block for very long (unlike network operations for example, although synchronous XHR is not a good example because it's old).

We could mirror some APIs into synchronous versions with [Exposed=DedicatedWorker].

Definitely useful - I'd start with just this one:

  • mapAsync
EDIT: There are other possible entry points but let's ignore them for now

Maybe useful but possibly not worth the complexity:

  • compilationInfo
  • onSubmittedWorkDone
  • popErrorScope

Most likely not needed:

  • requestAdapter
  • requestDevice
  • device.lost

Synchronous map (mapSync) is known to be particularly handy. For example, TensorFlow.js can implement its dataSync() method to synchronously read data back from a GPU-backed tensor (even if only available on workers).

Finally, there's no avoiding the fact that poorer forms of synchronous readback are always going to be available (e.g. synchronous canvas readback) so we should at least consider supporting mapSync on workers where it's OK in principle.
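For illustration, here's a minimal sketch of what worker-only usage could look like - mapSync is hypothetical here; the buffer setup uses the existing API:

// Inside a DedicatedWorker, with `device` already obtained.
const staging = device.createBuffer({
  size: 256,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
// ... encode and submit a copy into `staging` ...
staging.mapSync(GPUMapMode.READ);  // hypothetical: blocks this worker until mappable
const data = new Float32Array(staging.getMappedRange().slice(0));
staging.unmap();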


EDIT:

A possible direction on this issue is that we just give up and start allowing blocking readbacks in general, even on the main thread. I know we're operating under strong architectural guidance that we don't do this. But in practice it's just forcing people into terrible workarounds with canvases (reshaping/re-encoding data into 2D rgba8unorm data, writing it into a canvas, and then using existing synchronous readback APIs like toDataURL to read it back). Maybe we should get out of the business of trying to force people to do the "right thing" when the wrong thing is already possible, and just provide the primitives. The performance consequences of synchronous readback are not that bad (compared with e.g. synchronous XHR).

Originally posted by @kainino0x in #2217 (comment)

@kainino0x kainino0x modified the milestone: post-V1 Oct 26, 2021
@kainino0x kainino0x added this to Needs Discussion in Main Nov 10, 2021
@kainino0x kainino0x added this to the V1.0 milestone Nov 10, 2021
@kainino0x (Contributor Author) commented:

Some customers are very interested in this, so I'd like to put it on the meeting agenda.

@shanumante-sc commented:

Without synchronous access to data, we are running into problems with TFJS on WebGPU backend (See tensorflow/tfjs#1595 which suggests that dataSync might never be supported for WebGPU).

Our rendering strategy calls TFJS as a part of the render loop (since we might want to run an augmented video feed through a neural network) and requires access to network results synchronously for further processing. Yielding the render loop to wait for TFJS results would require some significant refactoring and also introduces an overhead to preserve the renderer state while waiting for the results. Otherwise, we need to build some hacks like trying to reuse latest available results and hope that they are applicable to the frame being processed.

When rendering is happening on a worker thread, there shouldn't be an issue in synchronously waiting for GPU to finish enqueued tasks (and thus support dataSync in TFJS) since the main thread can continue processing its event loop without stalls.

@kainino0x (Contributor Author) commented:

Related to #1629 since we need to make sure (e.g.) map failures are exposed in a consistent and non-destructive way in both versions (sync and async).

meeting: action item @litherum to discuss internally

@benvanik commented Dec 10, 2021:

Wanted to raise our hand as someone who needs blocking operations on workers as well! ✋ (we're compute-focused and don't have concepts of frames or clearly delineated times when we could yield to the browser, and run all our code in workers)

This class of issues is currently our biggest blocker with using WebGPU in the browser - we've only been able to get things working by using wgpuDevicePoll(..., /*force_wait=*/true) available in wgpu-native to simulate this behavior - but even that's not great.

We do also need onSubmittedWorkDone, so long as it's not possible to receive a callback for a submission without first yielding to the browser - we're running our entire program in a worker thread without yielding at any point (as would practically anyone coming from WebAssembly). Ideally this wouldn't imply blocking; rather, the callback could be issued from another thread so that we can signal a semaphore that a blocking thread may be waiting on. We'd really prefer native semaphores/fences in WebGPU, but barring that we need to emulate them, and the minimum to do so is a worker thread kept around to wait for completion events, which we'd use to signal potential waiter threads. Something like:

std::thread event_processing_thread([&]() {
  while (true) {
    // we don't want this to be a spin-loop - we're really just telling the API we'd
    // like callbacks to be made from this thread when there are any pending
    wgpuInstanceProcessEvents(/*blocking=*/true);
  }
});

FenceFutex fence;  // app-provided futex-style fence
wgpuQueueOnSubmittedWorkDone(..., [](WGPUQueueWorkDoneStatus status, void *userdata) {
  // captureless lambda: the fence arrives via userdata
  static_cast<FenceFutex *>(userdata)->signal();
}, &fence);

wgpuQueueSubmit(....);
// can do other work here, non-blocking

// maybe wait - if the callback has already fired this won't block; otherwise it'll
// wait for the event to be processed on the thread and signal the fence
fence.wait();

@benvanik commented Dec 10, 2021:

From an interesting discussion in the Dawn Matrix room with @austinEng, where he had a really nice idea that I want to document here:

There may be a single primitive that solves this for javascript/native and workers: allow async operations to take a pointer (element in a SharedArrayBuffer) and a value to have the WebGPU implementation write to that pointer when the callback completes. Browser environments could then use Atomics.wait (directly in javascript or via futex emulation in webassembly from emscripten) and native environments could use their system futex/WaitOnAddress/etc. Since browsers all have to implement Atomics.wait anyway (if they support workers) there shouldn't be much additional infra required here, and users can easily interact with this mechanism themselves via Atomics.notify.

Something like:

WGPU_EXPORT void wgpuBufferMapAsync2(WGPUBuffer buffer, WGPUMapModeFlags mode, size_t offset, size_t size,
                                     void* addr, uint32_t value, WGPUBufferMapAsyncStatus* status);

A synchronous version can then be written easily by users:

bool wgpuBufferMapSync(WGPUBuffer buffer, WGPUMapModeFlags mode, size_t offset, size_t size) {
  WGPUBufferMapAsyncStatus status;
  uint32_t futex = 0;
  wgpuBufferMapAsync2(buffer, mode, offset, size, &futex, 1, &status);
  // could do other work here, switch to another thread/worker, etc - here just blocking to emulate sync behavior
  futexWait(&futex, 1);  // assumed semantics: block until *addr == value (i.e. until the implementation writes 1)
  return checkStatusSuccess(status);  // could be device loss, etc
}
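On the JS side, the same idea might look like this - mapAsyncSignal is hypothetical, while Atomics.wait is real but only allowed off the main thread:

// Hypothetical variant of mapAsync that writes `value` into signal[index]
// (and Atomics.notify's it) when the map completes.
const signal = new Int32Array(new SharedArrayBuffer(4));
buffer.mapAsyncSignal(GPUMapMode.READ, 0, size, signal, 0, /*value=*/1);  // hypothetical
Atomics.wait(signal, 0, 0);  // worker-only: block while signal[0] is still 0
const data = buffer.getMappedRange();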

If all the async methods (wgpuDeviceCreateComputePipelineAsync and wgpuQueueSubmit) did this then synchronous versions could be written, or more importantly semi-synchronous versions: an application or framework is free to use multiple threads/workers to coordinate work, watch for signals to perform fencing (always be running one frame ahead, etc) without needing spin-loops. In the browser the main thread could submit work and workers could wait on it without dealing with the transfer of promises or other higher-level constructs while still allowing the main thread to poll for completion.

The other advantage of this approach is that it solves the multiple codebase issue: if two frameworks are used in the same application - even if one is written in javascript and the other compiled via webassembly - there's no need for complex scheduling coordination or cross-language interop goo or global WGPUInstance/WGPUDevice waits as everything is scoped.

If we had this I think all our concerns around synchronization would be solvable in user code (we'd be able to perform async compilation, async and overlapped submission, and sync mapping where required) and there'd be little additional needed in the WebGPU API.

This could be emulated with a spin-loop as in my previous example (callbacks take userdata of the addr/value and perform the write themselves), but not needing to have user threads/workers spun up to do this - especially if there was one such thread/worker per framework in an application - would be a real win for both ease of use and resource consumption.

@benvanik commented:

@kainino0x helped clarify that this could be seen as the way to do async when you don't have a top-level event loop: the callbacks would work for main-thread JavaScript or a worker that was yielding, and this would cover everything else.

@kainino0x (Contributor Author) commented Dec 14, 2021:

An unfortunate thing about using atomics is that they require SharedArrayBuffer, which requires COOP+COEP: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements

If SAB were always available (and COOP+COEP just enabled sharing it between threads), or if atomics were allowed on non-shared arraybuffers, then this would work better.

@benvanik commented:

Darn, good point :/

This signal style would be most useful in native or worker contexts with SharedArrayBuffer available (as you're not compiling pthreads code to wasm unless you have them). I think in the case of workers with private memory, a blocking tick method that issued the callbacks on the worker that issues the tick would cover the same use cases, though I think the signal style, even if not waitable, could be useful for ergonomics.

I'm imagining the case of dispatching a lot of work and then mapping several of the results: that becomes nice straight-line code of submit->map->map->map->loop on tick until all/any ready->use mapped resources. With the pure callback approach that'd need the user to do more juggling (and it scales with the number of mapped resources, and with each part of the code mapping the resources, etc). Not as efficient as a futex for low-latency wakes, but at the point where you have a single thread in isolation you probably don't care much about that. I think even the SharedArrayBuffer method would need some kind of flush method unless implementations flush automatically, so this may all require the same primitives regardless of approach - it's just that if you don't have Atomics.wait you can't block on the signals and deadlock yourself.

@kainino0x (Contributor Author) commented Dec 15, 2021:

A possible direction on this issue is that we just give up and start allowing blocking readbacks in general, even on the main thread. I know we're operating under strong architectural guidance that we don't do this. But in practice it's just forcing people into terrible workarounds with canvases (reshaping/re-encoding data into 2D rgba8unorm data, writing it into a canvas, and then using existing synchronous readback APIs like toDataURL to read it back). Maybe we should get out of the business of trying to force people to do the "right thing" when the wrong thing is already possible, and just provide the primitives. The performance consequences of synchronous readback are not that bad (compared with e.g. synchronous XHR).

@benvanik commented:

As someone who participated in the WebGL 1 discussions about the same kind of functionality when trying to do compute/media work there, I'd agree with most of that - the abominations we had to create still make me sad, but the lack of functionality never eliminated our need to create them. Your example is actually one I had to do: "render this canvas to an animated gif" was a product requirement, and I couldn't just say "well, reading pixels back is really hard in WebGL" - I just had to build a convincing progress spinner because it took so long :P

What we learned then is that there's definite implementation complexity that comes from this kind of stuff, but it's really a tradeoff between what's difficult to do in an implementation and what's impossible to do on top of the implementation. Needing users to do some extra work to get the precise behavior they want - knowing each user will want a slightly different behavior - works best when the user is actually able to do that work.

IME, as long as there is a primitive that allows for asynchronously observable multi-threaded semaphores, everything else can be built on top of it - blocking sync, callback-style async, pure polling, or mixed futex-based sync/async. If there are only async callbacks with event-loop-driven flushes, or only synchronous blocking operations, then the others can't be (practically) implemented by users and there will always be tension (or whining from people like me :). Requiring the performance-destroying use of Asyncify when targeting WebGPU, or spinning up multiple workers whose only job is to sit and block on waits, both fall into that abomination category (this also relates to discussions around efficient timeline-ordered readback via a hypothetical GPUQueue.readBuffer) - people will still need to do those kinds of things regardless of whether the implementations make it easy or efficient, and it'd be in the end-user's (web users', etc.) best interest for them to be able to be efficient (fewer workers, fewer large wired allocations, etc.).

@Kangz Kangz modified the milestones: V1.0, post-V1 Jan 12, 2022
@kainino0x kainino0x modified the milestones: post-V1, V1.0 Jan 12, 2022
@kainino0x kainino0x changed the title Expose blocking operations on workers mapSync on Workers - and possibly on the main thread Jan 12, 2022
@kainino0x (Contributor Author) commented:

This issue is focused on mapSync so I've retitled the issue, and pasted my comment in the summary about allowing it on main thread too.

@mrshannon (Contributor) commented Jan 24, 2022:

I would also like to raise my hand ✋ in support of this. Due to a large existing codebase (that I cannot change) we have to do the following (all in a single synchronous method inside WebAssembly) to implement buffer readback:

  1. Get/create a staging buffer
  2. Get/create a command encoder
  3. Issue a buffer to buffer copy: from the buffer to read to the staging buffer
  4. Submit the command encoder to the default queue
  5. Wait for the copy to complete (synchronous wait on onSubmittedWorkDone which is not currently possible)
  6. Map the staging buffer
  7. Wait for mapping to complete (synchronous wait, which is also not possible at the moment)
  8. Get and return the mapped range

I realize this is pretty inefficient and I would definitely prefer to submit a group of these at the same time and wait on the first to complete, but I can't break the existing codebase.

It also occurs to me that, due to internal synchronization, step 5 should not be needed, because the map should internally wait until the copy operation is complete. Because of internal synchronization I am not sure there are many instances where one would need to explicitly wait on onSubmittedWorkDone, but one case where it would be useful is querying its status to determine whether a buffer mapping is likely to stall, and if so, doing other work instead.
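For reference, a sketch of the same sequence against today's async-only API - the synchronous waits of steps 5 and 7 are exactly what's missing, and (per the note above) mapAsync already subsumes step 5:

// Steps 1-4: staging buffer, encoder, copy, submit; steps 6-8: map and read.
async function readback(device, srcBuffer, size) {
  const staging = device.createBuffer({
    size,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(srcBuffer, 0, staging, 0, size);
  device.queue.submit([encoder.finish()]);
  await staging.mapAsync(GPUMapMode.READ);  // internally waits for the copy
  return staging.getMappedRange();
}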

@kainino0x (Contributor Author) commented Jan 25, 2022:

@mrshannon if mapSync were limited to workers, would you be able to get all of this to run on a background thread (assuming support for transferControlToOffscreen)?

It also occurs to me that due to internal synchronization, step 5 should not be needed because the map should internally wait until the copy operation is complete.

That's correct.

Because of internal synchronization I am not sure there are too many instances where one would need to explicitly wait on onSubmittedWorkDone, but I can think of one where it would be useful is to query it's status in order to determine if the buffer mapping is likely to stall and instead do other work.

Good point - it might be more useful to query onSubmittedWorkDone than to synchronously wait on it. (We would define it so it can never change during a task, making it impossible to spin-wait on.)

@mrshannon (Contributor) commented:

@kainino0x Yes, my preference is to use a background worker, though until offscreen canvas is universally supported I may have to proxy calls to the main thread.

We would define it so it can never change during a task so it's impossible to spin-wait on it.

That would be a problem in a worker as a program (especially WebAssembly) may not yield to the event loop very often.

@kainino0x (Contributor Author) commented:

I think that restriction is important to discourage spin looping behavior, so if it's needed then synchronous wait-for-work-done would be preferred.

@mrshannon (Contributor) commented:

I think that restriction is important to discourage spin looping behavior.

So the problem is when porting code originally written with desktop APIs in mind (so really only an issue in WebAssembly). WebGPU's onSubmittedWorkDone is the closest thing it has to fences, which in other APIs support both getting the status (vkGetFenceStatus) and synchronously waiting (vkWaitForFences). I understand wanting to discourage spin looping, but existing codebases rely on periodically polling the status of queue work.

@mrshannon (Contributor) commented Feb 18, 2022:

I just ran into a case where it cannot be known ahead of time whether to call mapAsync or mapSync. The situation is:

  1. mapAsync is called, in hopes that by the time the mapping is required that it will resolve
  2. Sometime later, the mapping is needed immediately, so a synchronous wait is required and we would like to call mapSync - but the buffer state is now "mapping pending", and thus it would be illegal to map again.

The obvious solution is to make it so mapSync can be called on a buffer which already had a mapAsync called on it as long as the range to be mapped is contained within the range requested in the mapAsync call.

Without this we must do a busy wait loop with process.nextTick(), waiting for the original mapAsync to resolve and set a flag.

Correction: setTimeout(..., 0) must be used instead of process.nextTick(), as the nextTickQueue is run before promises are resolved - which sadly makes this hack slower than before.
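A minimal sketch of that workaround, for concreteness (names are illustrative):

let mapped = false;
buffer.mapAsync(GPUMapMode.READ).then(() => { mapped = true; });

// Later, when the data is needed "now": poll via setTimeout(..., 0), since
// process.nextTick callbacks run before promise resolutions.
function whenMapped(callback) {
  if (mapped) callback(buffer.getMappedRange());
  else setTimeout(() => whenMapped(callback), 0);
}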

@austinEng (Contributor) commented:

Maybe we can have mapAsync return

interface GPUEvent {
  Promise<void> promise;

  [Exposed=(Worker)] // idk if this is allowed here
  void wait();
}
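Usage might then look like this (both the return type and wait() are hypothetical):

const event = buffer.mapAsync(GPUMapMode.READ);  // hypothetical: returns a GPUEvent
event.promise.then(readResults);                 // main thread: stay async
// ...or, on a worker, when the result turns out to be needed right now:
event.wait();                                    // hypothetical: blocks this worker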

@kainino0x (Contributor Author) commented:

I just ran into a case where it cannot be known ahead of time whether to call mapAsync or mapSync.

This is a good point to consider. Similar to #2085: in general you may want to start the work as early as possible, but then block on it later.

@Dinnerbone commented:

We also need this for Ruffle. We emulate Flash and, even though it's not at all recommended by modern standards, we need to be able to read back gpu buffers synchronously whenever the script calls for it.

WebGPU requiring async because it's best modern practice means we can't port anything that predates that advice over to WebGPU.

@greggman (Contributor) commented:

IIUC, the intended solution for simulating sync reads for Wasm is JSPI.

@valadaptive commented Feb 21, 2024:

See also Blink's decision to raise the synchronous WebAssembly.compile() module size limit from 4KB to 8MB, even if the main thread is blocked for almost a second.

I really don't understand how this didn't make it into the MVP. Synchronous readback is hugely important for basically any GPGPU task (as many devs above have pointed out), many common features in games (e.g. knowing which object the user is looking at / pointing their crosshair at), and basically every GPU-accelerated 2D render engine (for collision detection).

There are plenty of ways for JS users to shoot themselves in the foot with performance already. Even beyond the contrived "really slow function" example measured above, I've measured 500ms GC pauses and >1s React re-renders on real-world websites. Disallowing an extremely useful feature or just going "welp, just rewrite your entire renderer in WASM so you can use JSPI" is extremely frustrating.

I guarantee you that the types of developers who use the raw WebGPU API will be just as concerned about performance and minimizing jank as the people writing the spec for it. Optimization is every render engine hacker's breakfast, lunch, and dinner. If a junior coder is allowed to turn a phone into a space heater by calling items.reverse()[0] to get the last element of an array once per paragraph in their React component, why are we significantly restricting the useful range of WebGPU's functionality and applicability across domains?

@torokati44 commented:

+1 for the comment above.

@Kangz (Contributor) commented Feb 21, 2024:

See also Blink's decision to raise the synchronous WebAssembly.compile() module size limit from 4KB to 8MB, even if the main thread is blocked for almost a second.

I happened to be chatting with the author of that intent-to-ship today; he added some color: if you look at the thread, there is significant pushback, because making WebAssembly modules block for 1 second for compilation is terrible for the Web platform. After some negotiation they settled on the 8MB limit, which blocks for much less time (basically a heavy function call, but at the limit of what's considered "blocking").

It's clear that there is convenience in being able to map synchronously, but JS has a ton of primitives to handle asynchrony, and synchronous readback in the GPU-picking cases you mentioned would easily stall for a couple of frames due to the depth of the browser rendering pipeline. For GPU picking it is still better to do async readbacks. As an additional data point: playing Baldur's Gate 3, I noticed an extremely slight latency in item picking, which I assume is due to async GPU picking - but there is no stall when clicking on something.

@torokati44 commented:

For GPU picking it is still better to do async readbacks.

That is just one example. And even that is not always acceptable.

and synchronous readback in the GPU-picking cases you mentioned would easily stall for a couple frames due to the depth of the browser rendering pipeline.

It would still be preferable to leave this decision up to the developers using the spec, rather than simply taking away one of the choices.

While it may not be ideal, in some cases it's a perfectly valid tradeoff. See for example: #2217 (comment)

@greggman (Contributor) commented Feb 21, 2024:

An argument against sync that I've heard - and think I agree with - is that there should be no performance difference between sync and async. One way or another, the GPU process has to signal the process waiting for the result that the result is ready. After that, you can imagine that in a perfect implementation the difference between sync and async is just a few instructions. So if async is slow in some browser, that's an implementation bug, not a reason to add sync.

An exception would be Wasm, but as mentioned above, JSPI is supposed to be the solution there, so ports will be easy.

Picking was brought up as an example and I worried about that too. It's possible it is a problem, but I wrote this example and it doesn't seem like one. The example makes 1000 black circles in SVG that turn red on hover. It also makes 1000 black circles with WebGPU and uses GPU picking to mark one as red. I thought I might be able to see a noticeable difference between an SVG circle being highlighted and a WebGPU circle being highlighted, but I don't. Maybe my eyes just aren't sensitive enough, or maybe you need a much heavier scene for it to stick out.

@lvyitian commented Feb 21, 2024:

> [@greggman's comment above, quoted in full]

According to https://github.com/WebAssembly/js-promise-integration/blob/main/proposals/js-promise-integration/Overview.md#supporting-responsive-applications-with-reentrancy, JSPI won't deal with reentrancy automatically; existing projects' architectures will have to be restructured into an "event-loop style" to work properly.

@valadaptive commented:

JS has a ton of primitive to handle asynchrony

No, it does not. It has callbacks, syntactic sugar over callbacks (promises), and syntactic sugar over syntactic sugar over callbacks (async/await). Promises can be implemented entirely as a library, and async/await is a fairly simple syntactic transformation that has been expressible since ES6's generators.

You are one of maybe a few dozen people in the world who actually know what a microtask is and exactly which operations queue them up. Every other developer sees the async keyword in front of a function and starts dreading the prospect of rearchitecting their entire rendering engine to use async/await, including bug-filled and half-baked versions of every synchronization mechanism you've used in the WebGPU spec to prevent race conditions.

If JS included real async primitives, and by primitives I mean language features that are fundamental enough that they cannot be polyfilled, I suspect that fewer people would react the way they do when some extremely useful function is made async because the Web Platform™ gods have deemed it too janky for the uneducated masses to use without potentially causing a 1-2 frame slowdown (the horror!)

synchronous readback in the GPU-picking cases you mentioned would easily stall for a couple frames due to the depth of the browser rendering pipeline

"Would" or "could"? WebGL's readPixels certainly doesn't seem to have this problem. I can call it dozens of times per frame to read data from an offscreen buffer with no slowdown. Is there something fundamental to WebGPU that causes every pipeline to be inherently much deeper than the equivalent WebGL state configuration?

@Kangz (Contributor) commented Feb 22, 2024:

Debating how to characterize async/await and other JavaScript features isn't going to help us reach consensus. Async/await makes it possible to write code that does async GPU readbacks without dealing with a mess of callbacks. Of course, the rest of the code often needs to be adapted to handle the asynchrony, because it might try to do something in the meantime, and there's a development cost to that.

It is clearly an unpopular opinion among folks that are used to WebGL's readPixels, but adding blocking calls on the JS main thread is unbelievably difficult to motivate nowadays and extremely frowned upon by Web standards folks and browser implementers. However as discussed in this issue, the same guideline doesn't exist for workers, so there could be a mapSync on workers. There will be a lot of details to figure out still.

"Would" or "could"? WebGL's readPixels certainly doesn't seem to have this problem. I can call it dozens of times per frame to read data from an offscreen buffer with no slowdown. Is there something fundamental to WebGPU that causes every pipeline to be inherently much deeper than the equivalent WebGL state configuration?

Would. It requires all previous operations on the GPU to be completed, and because most GPUs have a single graphics queue, it forces all the compositing of the browser to be finished, etc. If you block every frame, the amount of queued work stays low all the time, but you still create lots of bubbles in the execution. If you look at the performance profile of your page, you should be able to see that each readPixels still takes a large amount of time (~1ms at least, because of the IPC roundtrip and submit + immediate wait on GPU work) - unless you use non-blocking async readback, but that's basically like mapAsync.
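For completeness, the non-blocking WebGL 2 readback alluded to at the end - a fence plus a pixel-pack buffer, structurally the same as mapAsync (assumes a pre-created PIXEL_PACK_BUFFER pbo and a destination ArrayBufferView dst):

// Kick off the readback into the PBO (non-blocking) and insert a fence.
gl.bindBuffer(gl.PIXEL_PACK_BUFFER, pbo);
gl.readPixels(0, 0, width, height, gl.RGBA, gl.UNSIGNED_BYTE, 0);
const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later (e.g. next frame): copy out only once the GPU has caught up.
function poll() {
  if (gl.clientWaitSync(sync, 0, 0) === gl.TIMEOUT_EXPIRED) {
    requestAnimationFrame(poll);  // not signaled yet; try again next frame
    return;
  }
  gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, dst);
}
poll();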

@litherum (Contributor) commented:

(Apologies if this idea has been made already)

What if the native implementations, but not the web implementations, add support for synchronous readback on the main thread?

Pros:

  • The web is where the hard requirement to not block the main thread comes from. (Native apps have a similar desire to not block the main thread, but there aren’t as many institutions built up around that as there are around the web not blocking the main thread.)
  • Some (but not all) authors get what they want

Cons:

  • Authors targeting the web don’t get what they want
  • Authors targeting both native and the web still have to pay the cost of no synchronous readback on the main thread

@valadaptive commented:

synchronous readback in the GPU-picking cases you mentioned would easily stall for a couple frames

each readPixels still take a large amount of time (~1ms at least...)

"a couple frames" to "~1ms" is quite a big pivot! Even on a high-end phone with a 120hz display, that's still only an 8th of a frame.

I've benchmarked readPixels, and it indeed takes about 1ms per call. Here are some points of comparison (that you can try for yourself) for what else takes 1ms:

  • 10000 Uint8Array#subarray calls. It's a bit of a microbenchmark: no other operations but subarray.
  • Updating the positions of 150 SVG <circle>s.
  • 10 forced reflows.

Does this mean we should make number-crunching code, SVG DOM operations, or all reflow-forcing operations asynchronous?

1ms is, IMO, absolutely not heavy enough to justify forcing async.

@torokati44 commented:

What if the native implementations, but not the web implementations, add support for synchronous readback on the main thread?

This sounds particularly un-nice.

"a couple frames" to "~1ms" is quite a big pivot!
...
1ms is, IMO, absolutely not heavy enough to justify forcing async.

Still agreeing hard.

@Kangz (Contributor) commented Feb 23, 2024:

"a couple frames" to "~1ms" is quite a big pivot! Even on a high-end phone with a 120hz display, that's still only an 8th of a frame.

If you keep flushing the GPU pipeline, yes you get ~1ms, but then the GPU spends the majority of its time idle because there is no pipelining.

@valadaptive commented:

The GPU also spends the majority of its time idle if you hit your frame budget. It seems like your objection here is not actually that synchronous readback causes any jank or inevitably results in a worse user experience, but that it feels icky for the main thread to ever be idle.

I get that it feels wrong to "block" the main thread, but the definition of "blocking" is not as objective as it seems. Is writing to a file descriptor a "blocking" action? It's IO-bound. What about writing to stdout?

The Web platform's reluctance to perform any "blocking" operations whatsoever on the main thread seems to have led to no small number of gripes over the years, and for good reason. In particular, the decision to forbid Atomics.wait on the main thread has caused a lot of developers a lot of real pain.

Of course the rest of the code often needs to be adapted to handle the asynchrony because it might try to do something in the meantime, and there's a development cost to that.

"A development cost" is underselling things a fair bit. As pointed out in the Atomics.wait GitHub thread, there is no programming model for developing applications where every function is re-entrant. The Rust standard library itself, as well as Emscripten, are forced to busy-wait because the Web platform folks have deemed it not "best practices" to yield from the main thread. If Emscripten can't figure it out, who can?

But OK, let's assume the worst (see footnote 1) and consider the impact of the main thread being blocked for several frames. Let's say some dev forgot to consider an edge case and allowed a certain operation to take way too long, or it's running on lower-end hardware than was envisioned.

Which would you rather have:

  • The webpage freezes for several frames, worsening the UX.
  • Everyone gets a window of several frames to accidentally stress-test whether your codebase is really reentrant.

Footnotes

  1. If you extend things to atomics, as referenced above, things become worse: there is the possibility of deadlocks. But browsers already have machinery in place if code spends too much time in a loop. JavaScript, being Turing-complete, can already block the main thread, and indeed I've accidentally written a few infinite loops myself. I've also seen an "a script is slowing down the page" prompt on many a production website.

@kainino0x (Contributor Author) commented:

What if the native implementations, but not the web implementations, add support for synchronous readback on the main thread?

This is basically the current plan - we're adding a synchronous-wait primitive to webgpu.h. In Wasm, it will be supported when JSPI (or Asyncify or similar emulation) is available.

1ms is, IMO, absolutely not heavy enough to justify forcing async.

The 1ms isn't the justification for async (it probably could be even faster than that - I think non-Chrome browsers do better than 1ms already). That's just the overhead - the actual justification is there's an arbitrary amount of work that's already been queued to the GPU that must complete before the mapSync/readPixels completes. It could be trivial, sure, but it could also take a long time (especially on a low end phone).

That said, while it's technically possible for the webpage (most importantly touch-scrolling) to remain responsive while a lot of WebGPU work is queued, it's often going to be the case that the webpage or even the entire browser stops displaying frames while it completes. In which case the mapSync is probably not making anything worse.

Anyway, TBH, I think we're rehashing all the same arguments. I still personally am on the side of adding mapSync in all contexts, partly because of historical precedent with WebGL and 2D canvas, and partly because of the thing in the last paragraph. But regardless of whether mapSync happens, JSPI is on its way, and we'll make the best we can with that.

@kainino0x kainino0x modified the milestones: Polish post-V1, Milestone 1 Feb 27, 2024
@kainino0x (Contributor Author) commented:

I'll tentatively triage this into Milestone 1 as it's clearly still a hot topic. Can't guarantee it will see any real progress though.

@litherum (Contributor) commented Feb 27, 2024:

Speaking purely as an independent observer here:

If you don’t add mapSync on the main thread, the drawbacks are that some applications will use mega-janky toDataURL workarounds, while others just won’t be ported to WebGPU (or possibly even the entire web platform) at all.

If you do add mapSync on the main thread, the drawbacks are that some apps which don’t need mapSync may elect to use it anyway, thereby resulting in unnecessarily-janky apps.

If I had to choose between those two outcomes, I’d personally choose to have those apps supported on WebGPU, and choose a scary name for mapSync to try to make it clear that authors should avoid using it if possible. Personally, I’d care way more about seeing Jet Set Radio or whatever running in a browser on WebGPU than worrying about how janky some one-off app someone wrote for who-knows-what reason is.

@juj commented Mar 13, 2024:

Here are some notes that I can pick up from the discussion so far:

a) In the previous comments, Emscripten's Asyncify feature has been mentioned a couple of times. While it is a feat of engineering, Asyncify never really was a production-quality feature, due to the dramatic collateral problems it brings: increased code size, longer Wasm build times, reduced runtime performance, and re-entrancy challenges. Unfortunately, Asyncify trades one set of problems for another.

b) JSPI is the more robust incarnation/evolution of Asyncify (obsoleting Emscripten's original Asyncify model), and fixes many of its problems. However, even then, problems remain to be resolved with JSPI:

  • The one problem it does not fix is the re-entrancy challenge, which was mentioned already above, picturing quite well how this creates a surface area for novel bugs, as opposed to the simple-to-reason-about known quantity that mapSync() would be. The problem here is that when the call stack is suspended to wait for the async operation to finish, the general browser/JS event loop will continue running. This means the developer will need global awareness of the whole codebase/site in question to know which JS events they can allow to keep resolving, and which they should hold in a queue to resolve only after the paused async operation has completed (and whether there are transitive ordering guarantees between such events that require holding other events). This makes composability of software libraries more difficult, as a middleware rendering library may have a hard time coordinating such global information with the embedder, or with other JS libraries that fire JS events.

  • Another problem here is that JSPI interacts with the presentation model of the DOM Canvas element. Canvas element is specced to present implicitly after finishing any event handler that performed draw operations to the front buffer. In JSPI's case there is currently no special handling to this: the event handler finishes at the moment when the JSPI pause occurs, not when the paused operation resolves. This means that sites that would utilize JSPI'd await mapAsync() would need to render into an offscreen temp render target (to-be-blitted-on-screen at the very end) so that they would not risk presenting intermediate rendered data. This could lead to requiring an extra fullscreen blit in codebases (unless a renderer can sink it into some other backbuffer draw and statically reason that their last blit will never straddle mapAsyncs in the middle of that final presentation blit at a bad time)

  • A third question is how the JSPI'd pause-resumed event handlers will interact with requestAnimationFrame() timings. It is common for renderers to need to perform several sync buffer maps within a single frame, not necessarily just one. There could be dozens or more such mappings per frame, especially at scene load time or when instantiating large dynamic objects. With JSPI, this would be akin to having a rAF() event handler first perform some WebGPU ops, and then several subsequent "setTimeout() style" events perform pieces of more rendering, before the next rAF() comes.

    The question to ponder here is whether this kind of operation will be optimally performant with the rAF()/canvas vsync timing models. For example, if the timings don't quite match up, the next rAF() might arrive while in the middle of a stream of these JSPI-paused mapAsync() calls: the codebase would likely need to detect that and skip out from handling anything in that rAF() (i.e. it is detecting re-entrancy; see the sketch after this list). But how will that affect frame presentation timing? Is there danger of performance being unduly pessimized (e.g. for renderers that are already close to their frame time budget)? Some guidance will likely be needed in this area to confirm this won't be a possible can of worms.
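One illustrative shape for that re-entrancy detection - purely a sketch, where renderWithSuspendingMaps() stands in for a hypothetical JSPI-wrapped frame function that may suspend on mapAsync() mid-frame:

let frameInFlight = false;
function onFrame() {
  requestAnimationFrame(onFrame);
  if (frameInFlight) return;  // rAF fired while the previous frame is still suspended - skip
  frameInFlight = true;
  renderWithSuspendingMaps().finally(() => { frameInFlight = false; });
}
requestAnimationFrame(onFrame);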

c) There were some mentions above of having waitable map() support rather than synchronous mapping support. In a hypothetical waitable map() scheme, the user could initiate an asynchronous map(), but at some point later (while still computing in that same event handler) decide either to poll whether completion has occurred, or to synchronously pause and wait for completion. This kind of waitable model could be useful, since it would allow renderers to initiate several map requests at once, and then wait for them to resolve in first(), any() or all() fashion. That would give renderers opportunities to sink waits and perform other computations meanwhile. (Maybe some renderers might be able to issue maps at the beginning of the frame, and then resolve them in the late stages of the frame, or similar?)

If there would be a sequential completion guarantee for multiple mappings, then waiting for any() would be trivially easy (just wait in order of submission, e.g. wait for first()). Though if the completion order might be arbitrary, then this would result in the select problem. (this could be solvable via a SAB futex wait style method)

d) As a followup to the above: at this point the conversation has revolved around the async->good, sync->bad dogma. It is easy to slip from this into thinking async->fast, sync->slow by construction. But there is no precedent to show that this would actually be the case. It is very possible that such a waitable mapping function, or an outright mapSync() function, could actually result in higher throughput / shorter CPU time spent in a renderer. Just because maps resolve asynchronously via the JS event queue does not mean that these mappings would be fast - on the contrary, they could be pretty slow compared to a synchronous/waitable map, due to JS event queue processing latency and computing-context management overhead. But since browsers don't support sync/waitable maps, there are no current benchmarks to verify this behavior.

To this effect, I would recommend browsers add mapSync() support behind a flag, e.g. in Nightly/Canary builds, so that WebGPU renderer experimenters can measure the cost of the async mapping model and validate that there is no overhead to performing asynchronous mapping. It is known that async mapping requires more GPU memory for round-robin dynamic buffers than being able to fence+map when unused. How much more? That is harder to analyze. A behind-a-flag mapSync() would allow measuring this overhead in renderers as well (in codebases that are able to do both async and sync maps).

I would say that for example, if waitable/sync mapping would give a +x% more FPS or -y% less GPU memory used by a renderer compared to async mapping, that would give a new angle to look at the dogmatic/pragmatic axis of sync vs async computation.

e) Strong +1 to adding mapSync() support permanently to Workers.

f) Has there been consideration to add a getBufferSubData(dsttypedarray, offset, size) style of readback API (i.e. synchronously read bytes from a buffer without persisting a mapping to it)?

This kind of API might be faster than a JSPI'd mapAsync()+read+unmap, e.g. by virtue of avoiding JSPI, and it could even be faster than a mapSync()+read+unmap by avoiding an extra JS->Wasm copy.
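A hypothetical shape for such an API, mirroring WebGL 2's gl.getBufferSubData - nothing like this exists in the WebGPU spec today:

const dst = new Float32Array(1024);
// Hypothetical: synchronously copy bytes from the buffer into `dst`,
// without any mapping persisting afterwards.
buffer.getBufferSubData(dst, /*offset=*/0, dst.byteLength);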

@Kangz (Contributor) commented Mar 26, 2024:

GPU Web CG 2024-03-13 Atlantic-time
  • KN: I’d like to take the temperature on this proposal. Start with taking the temperature on mapSync on workers.
    • People keep asking for it, and as far as I know we've never formally asked the TAG for input on it (only relayed opinions from folks at the browsers). If there is general interest from the CG, I would like us to ask the TAG for review on a proposal defending it (many points have been raised in favor of it, and I honestly think synchronous GPU readbacks are a pretty trivial hazard).
    • Only negative feedback we got on this was someone from Google not wanting worker-specific APIs.
    • Think this would make a lot of people happy.
      • (BTW, there's been some noise pushing for Atomics.wait on the main thread too, which I'd love to see, but it's much bigger.)
  • KG: +1 to supporting this on workers.
  • MW: catching up on the convo, but why do we need this, as opposed to worker waiting asynchronously?
  • KN: couple reasons. Pure JS - can write logic in a way that doesn't require reentrancy. Doesn't force you to yield. Can start using it in requestAnimationFrame. Don't know whether we necessarily want to encourage this. It makes it possible to do mapping from Wasm without yielding - proposal getting standardized, JSPI (JS Promise Integration), still has a bunch of reentrancy problems. Very difficult in C++ code - these problems are major. On workers, there's no harm in allowing this.
  • MW: initial input - seems fine with workers, not sure about the main thread.
  • KN: great. Do we think we need to ask the TAG about mapSync on workers?
  • KR: think we should.
  • CW: think we should ask TAG. If they take too long to answer, can push forward.
  • KG: imagine that if standards folks say OK, that's a strong enough signal.
  • BJ: WebGPU wouldn't be the first API to add blocking support on workers. I have a comment on here from 2 years ago about FileSystemAccessHandle and sync operations. Think talking with
  • KR: Would this enable WebGPU in AudioWorklet?
    • BJ/KN: Would go a long way but other problems
  • KR: Many customers that need the data back on CPU “now”.
  • GT: since I write a bunch of tools to patch the API, seems impossible in workers. Wouldn't be able to block WebGPU, disable features - extension privacy issue. Can do this in main thread. If we add a feature to workers and everyone switches over to workers, tools will stop working. Talk with extension folks?
  • KG: I'm with you. We should talk with folks who spec web-extensions. Not surprised, also not excited about this difficulty.
  • KN: have work to do to bring this to TAG.
  • CW: ok, so discussions with (1) doing this in workers (2) what happens if you do it in rAF, and block (3) etc. Think we should start a proposal doc.
  • KG: welcome to put this in proposals. I'm a bit surprised you're concerned about this since we have readPixels, getImageData on 2D canvas, etc.
  • BJ: think would be useful to present as much real-world data as possible. Hope group wouldn't object to browsers wanting to implement this behind a flag so we can gather some real data.
  • CW / KG: do whatever you want behind a flag.

@juj commented Mar 26, 2024:

Start with taking the temperature on mapSync on workers.
People keep asking for it, and as far as I know we've never formally asked the TAG for input on it
Only negative feedback we got on this was someone from Google not wanting worker-specific APIs.
KN: great. Do we think we need to ask the TAG about mapSync on workers?

There is historical precedent for synchronous APIs being allowed in Workers. (So I expect the TAG would be OK with this, unless there has been a change in direction somehow.)

Examples:

  • FileReaderSync is available in Workers, which allows synchronously reading a Blob of a File into an ArrayBuffer,
  • the synchronous Atomics.wait() operation that was already mentioned,
  • synchronous XHR in Workers (although there is some pressure to deprecate the XHR API in whole, and synchronous Fetch was never standardized)

Btw the discussion of synchronous work in Workers reminded me of a priority inversion problem. Maybe a bit tangential, but maybe worth mentioning here: An important detail about browser APIs in Workers *plus* SharedArrayBuffer is the fact that SAB makes it possible to observe in user JS code if such operations require assistance/forward progress from the main browser thread in order to advance. Such work requests can result in priority inversion deadlocks. Some examples:

  • Firefox used to have a bug that if any Worker called console.log(), then that Worker would internally synchronously pause to wait for the main browser thread to acknowledge that it has processed the logging. If the SAB code on the main JS thread was simultaneously spinwaiting to receive a result from the Worker, the main thread and the Worker would deadlock together.
  • all browsers today have a "problem" that Workers cannot spawn Workers without forward progress help from the main thread. Hence, if a Worker was spawning a Worker while the main thread was spinwaiting on SAB for something from the Worker, these would deadlock.
  • likewise, the main browser thread cannot finish spawning a Worker without returning to its own event loop, resulting in a rather nasty can of worms for SAB users (a main thread spinwaiting on a freshly spawned Worker will deadlock).
  • a test case about sync XHRs in Workers showcases how the XHR API (in some unspecified browser) has this property. Worker launches a sync XHR, which requires help from the main browser thread to progress. If the Worker was holding a user lock on a SAB and the main thread happened to spinwait on that lock, then deadlock would occur.

(one might argue that these are examples of browser vendors writing the very type of poor practices code into browser codebases, that they are trying to keep bad JS developers from writing, but best not go there...)

It is good to note that before the advent of SharedArrayBuffer, none of this was a problem, since there was no synchronous shared state between the Worker and main thread. So even if any APIs in Workers might synchronously halt to wait for a lock/event from the main thread, it would never be possible to functionally observe.

There was little traction in attempting to update the wording of already-shipped W3C specs about this; instead, these types of issues have been handled as individual bugs against browsers. For example, we used to proxy all console.logs in Emscripten manually over to the main thread to log, until Firefox no longer had the problem.

The point of this is that when specifying behavior of WebGPU in Workers, it would be good to have a paragraph to mention that implementations would be required to be able to make forward progress without synchronous help from the main browser thread. Otherwise there is a possibility of priority inversions to occur when talking about Workers, and SAB is in town.

Of course this is not something that would be a potential challenge only with synchronous operations like mapSync(), all other operations in the spec may also have this hurdle - though I was just reminded of that possibility in this context.

@Morglod commented Apr 3, 2024:

For everyone trying to do the same thing with native WebGPU (macOS, M1): I tried many variants with wgpu-rs and found that mapAsync is terribly slow there - it's just a wgpu-rs implementation issue. The only good variant was to take Google's Dawn implementation, make ring buffers for reading, and do this every frame (C++ with GLFW and Dawn):
cpu_buffer = buffers_for_reading.next_buffer();  // ring buffer of mappable staging buffers

encoder.copyBufferToBuffer(gpu_buffer, cpu_buffer);
queue.submit( ...encoder.finish()... );

// map and wait
bool done = false;
wgpuBufferMapAsync(cpu_buffer, mode, offset, size, [](WGPUBufferMapAsyncStatus status, void* user_data) {
    *((bool*)user_data) = true;
}, &done);

while (!done) {
    glfwPollEvents();
    wgpuInstanceProcessEvents(instance);  // pumps WebGPU callbacks on this thread
}

// getConstMappedRange and do something with the data

cpu_buffer.unmap();

Also, for everyone fighting with a wgpu-rs -> Dawn migration: Google decided that "undefined" flags are not 0 and added more validation of them, so you may get strange errors; solve them by specifying all descriptor fields explicitly. I was stuck on this one: WGPURenderPassColorAttachment { .depthSlice = WGPU_DEPTH_SLICE_UNDEFINED }

@kainino0x kainino0x added the api WebGPU API label Apr 30, 2024