
Create a proposal for multi-worker #354

Open
kainino0x opened this issue Jul 2, 2019 · 41 comments

@kainino0x
Contributor

In the meeting, Myles and I agreed to collaborate on a proposal for how WebGPU should work across multiple workers. Here's a tracking bug.

@kainino0x
Contributor Author

I've started working on this (or rather, I've started writing spec text for device/adapter, because I felt we needed it to explain multi-worker).

@kainino0x
Contributor Author

kainino0x commented Jul 16, 2019

That work (WIP) is here: http://kai.graphics/gpuweb/spec/

Here's my investigation/thoughts so far:

Devices

GPUDevice is a "reference" to an internal concept, "device".
The device is internally synchronized.

GPUDevice is "serializable" (with forStorage = false), so it can be sent via postMessage. It is not transferable.
When it's serialized, a pointer to its "device" is sent.
When it's deserialized, a GPUDevice(Ref?) with the same "device" is created.

Error stack state is per (device, realm) pair.
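To make the serialization semantics concrete, here is a minimal sketch. Plain objects stand in for GPUDevice and its internal "device"; none of these names come from the spec text:

```javascript
// Sketch of the "GPUDevice is a reference to an internal device" model.
// A plain object stands in for the internally-synchronized device state.
const internalDevice = { lost: false };

const gpuDevice = { device: internalDevice };   // original wrapper
// "Serializing" sends a pointer to the internal device; "deserializing"
// creates a new wrapper around that same internal device.
const deserialized = { device: gpuDevice.device };

deserialized.device.lost = true;
console.log(gpuDevice.device.lost); // true: both wrappers see one device
```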

Encoders

  • GPUCommandEncoder
  • GPUComputePassEncoder
  • GPURenderPassEncoder

These are neither transferable nor serializable.

Other objects

Mutable resources:

  • GPUBuffer
  • GPUTexture
  • GPUFence

Containers (objects referencing mutable resources):

  • GPUTextureView
  • GPUBindGroup
  • GPUCommandBuffer
  • GPURenderBundle

Immutable (immutable resources or immutable objects not referencing mutable resources):

  • GPUSampler
  • GPUShaderModule
  • GPUComputePipeline
  • GPURenderPipeline
  • GPUPipelineLayout
  • GPUBindGroupLayout

(Please mention if I've misclassified any of these.)

Alternative 1

All of these objects are transferable, but not serializable.

Problem: Say you have a GPUBuffer B and two GPUBindGroups, BG1 and BG2, that contain B. How do you transfer them? Transferring BG1 would also have to detach B.

Options when BG1 is transferred but BG2 is not:

  • Implicitly detach B. Loses the reference to B, so it can't be destroyed.
  • Make B internally held by BG1 so that B must be in the transfer list too.
  • Make B visibly referenced by BG1 so that B must be in the transfer list too.

All of these options implicitly destroy BG2.

Things get quite complex once you need to transfer a GPUCommandBuffer, all of the GPUBindGroups it uses, and all of the GPUBuffers and GPUTextures they use.
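The detach behavior in question is the same one ArrayBuffer transfer already has. A minimal sketch of the hazard using structuredClone (no WebGPU API involved; a transferable GPUBuffer would behave analogously):

```javascript
// ArrayBuffer transfer as an analogy for a transferable GPUBuffer:
// after the transfer, the original reference is detached and unusable.
const buf = new ArrayBuffer(8);
const transferred = structuredClone(buf, { transfer: [buf] });

console.log(transferred.byteLength); // 8
console.log(buf.byteLength);         // 0: detached on this side
```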

Alternative 2

All of these objects are serializable, but not transferable.

Problem: Now all objects can be accessed from multiple threads in parallel (including map, destroy, etc.). It becomes necessary to synchronize implicitly between threads in order to use them safely. Race conditions become more likely.

Alternative 3

Mutable resources are transferable.
Containers are neither transferable nor serializable.
Immutable objects are serializable.

Problem: While this allows for threaded resource initialization, and potentially for threaded submission of command buffers to queues, it doesn't allow creating command buffers on one thread and then transferring them to another thread for submission.

Conclusion

The best idea I've come up with so far is Alternative 2.

@kainino0x
Contributor Author

@litherum PTAL

@kvark
Contributor

kvark commented Jul 16, 2019

Thanks @kainino0x for the investigation! You seem to have the spec written already as well, great :D

Mutable resources:

Can we really consider them as "(Immutable Resource, Device Reference)" pairs? Methods like mapping to host, creating a texture view, or resetting a fence all internally use a device anyway in Vulkan and D3D12. Any "mutating" aspect would fall under the internal mutability (and share-ability) of the Device, which we have in place anyway.

With this in mind, I think all the objects should be serializable and not transferable (Alternative 2).

@kainino0x
Contributor Author

Is the distinction then "(immutable resource + mutable device reference)" vs "(immutable resource + immutable device reference)"?

@litherum
Contributor

This analysis is missing a characterization of patterns that existing apps use. I can provide some information about common patterns in Metal (in the next few days) but I can only surmise common patterns in D3D or Vulkan.

@litherum
Contributor

In Metal, there are quite a few designs that authors have used. Here's a list (in rough importance order)

  1. Asynchronous texture & buffer uploads
  2. Asynchronous shader compilation
  3. Asynchronous pipeline state creation
  4. Using MTLParallelRenderEncoder
  5. Each thread in a thread pool records into its own command buffer. This is less performant than MTLParallelRenderEncoder because you have to split your passes, thereby incurring barriers and flushes
  6. Using multiple queues on the same device
  7. Using multiple devices on the same machine

I hear that (5) is popular on Direct3D 12, too.

@kvark
Contributor

kvark commented Jul 31, 2019

Thanks for the list, @litherum !

Vulkan and D3D12 equally benefit from cases 1, 2, 3, 6, 7

For (4), secondary command buffers are used in Vulkan to record parts of a render pass separately. It's an open question whether our GPUBundle objects would serve the same scenario.

For primary command buffers, recording one per render target is a good and popular use case (e.g. used by Source 2 in slide 29). In a typical deferred rendering pipeline, you could have a command buffer for the g-buffer pass, the resolve + alpha pass, and each light's shadow map.

@grorg
Contributor

grorg commented Aug 5, 2019

Discussed at 5 August 2019 Teleconference

@kainino0x
Contributor Author

#399 #400 #401 #402

I can add more spec text for these later, but I need to do more work to figure out what the specification of internal slots looks like. For now I think we understand what they mean.

@kainino0x
Contributor Author

More explanation on multi-threading to help facilitate discussion:
https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-threading-javascript

@kainino0x
Contributor Author

Tentative explainer section on multithreading: https://gpuweb.github.io/gpuweb/explainer/#multithreading

@kainino0x kainino0x added this to Needs Investigation/Proposal or Revision in Main Jun 29, 2021
@kainino0x kainino0x moved this from Needs Investigation/Proposal or Revision to Needs Specification in Main Jun 29, 2021
@kainino0x kainino0x added this to the MVP milestone Jun 29, 2021
@kainino0x kainino0x moved this from Needs Specification to Needs Discussion in Main Sep 28, 2021
@kainino0x
Contributor Author

Marking needs-discussion so we can figure out whether this is V1 or not.

@greggman
Contributor

greggman commented Jan 21, 2022

Just curious, but how is pushErrorScope/popErrorScope supposed to work across threads? All the other calls seem fine since they carry no state, but pushErrorScope/popErrorScope are stateful, and synchronizing that state across threads seems problematic. Maybe some thread magic whereby there's an error scope per thread that only reports errors for commands issued on that thread?

@austinEng
Contributor

I believe error scopes are separate: they are intended to be per device, per execution context.

@kainino0x
Contributor Author

Yeah, separate error scope stacks per thread. It's kind of funny but at least avoids the cross-thread synchronization issues.

Unfortunately error scopes still don't play super nicely with async (if they span across an await they could catch random other stuff) so that's still kind of an open question I'm pondering.
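A toy illustration of the await hazard, with a hypothetical realm-global scope stack standing in for the real error scope mechanism (all names here are made up; the real API lives on GPUDevice):

```javascript
// Hypothetical realm-global error scope stack, for illustration only.
const scopeStack = [];
function pushErrorScope(label) { scopeStack.push(label); }
function popErrorScope() { return scopeStack.pop(); }

async function task(label) {
  pushErrorScope(label);
  await null;             // suspension point: the other task runs here
  return popErrorScope(); // pops whatever is on top, not necessarily ours
}

const demo = Promise.all([task("A"), task("B")]);
demo.then((results) => {
  console.log(results); // ["B", "A"]: each task popped the other's scope
});
```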

@Kangz
Contributor

Kangz commented Feb 3, 2022

I'd like to move this post-V1. We aren't going to have time to deal with multi-worker semantics for V1.

@kainino0x
Contributor Author

Is the current API future-compatible with multi-worker? For example: device.addEventListener('uncapturederror', ... on multiple workers. Do errors get delivered to all workers sharing a device? Only errors for commands issued by that worker? I guess that would work?

My proposal for this is that there's an "originating" GPUDevice (returned by createDevice) which is an EventTarget, and if you Serialize (postMessage without transfer) a GPUDevice you get a different kind of GPUDevice that is not an EventTarget.

My reasoning for this is that uncapturederror is supposed to be only for capturing errors way after they happen, like for telemetry or diagnostic reports. Hence it's just painful for errors to show up in multiple places.

  • texture/query set destroyed set (happens atomically?) and other.

Agree on making it atomic, but do we worry about races between submit() on one thread and destroy() on another? 🤔

@Kangz
Contributor

Kangz commented Feb 9, 2022

Agree on making it atomic, but do we worry about races between submit() on one thread and destroy() on another? 🤔

No, that's an application race. As long as one happens completely before the other (so it is safe), it is the application's problem that the operations aren't well ordered.

@Kangz
Contributor

Kangz commented Apr 26, 2022

We're definitely not going to be able to make this for WebGPU v1, so punting to polish-post-V1 (but maybe it should be post-V1?)

@Kangz Kangz modified the milestones: V1.0, Polish post-V1 Apr 26, 2022
@GoodLuckIce

Why can't queue.writeBuffer accept SharedArrayBuffer data?
This means data assembled in worker threads cannot be used directly. I have a lot of time-consuming data wrangling and assembly to do, and doing it all in the main thread is untenable. Is there any way to solve this?

@kainino0x
Contributor Author

kainino0x commented Jun 14, 2023

@GoodLuckIce that's an issue unrelated to multi-worker/multithreading support, as it should work even without it.

writeBuffer does support views into a SharedArrayBuffer. It may not support SharedArrayBuffer itself (I thought we fixed this, but it doesn't appear so). So to work around the issue, just use new Uint8Array(mySharedArrayBuffer) instead, for example.

EDIT: Filed #4186
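The workaround is just an extra view construction. A sketch, with the actual upload call shown commented out since device and gpuBuffer are hypothetical here:

```javascript
// Workaround sketch: writeBuffer accepts typed-array views, and a view
// over a SharedArrayBuffer works even where the SharedArrayBuffer
// itself is rejected. (device/gpuBuffer are not defined here.)
const shared = new SharedArrayBuffer(256);
const view = new Uint8Array(shared);
view.set([1, 2, 3, 4]); // data assembled on a worker, for example

// device.queue.writeBuffer(gpuBuffer, 0, view); // would upload the bytes
console.log(view.buffer === shared); // true: still backed by shared memory
```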

@GoodLuckIce

GoodLuckIce commented Jun 15, 2023

Many thanks. Until the multi-worker/multithreading support lands, I can use this method to get my work done.

@louderspace

Have there been any updates to this proposal? I'm actively working on a WebGPU/WebCodec video product at Adobe. Our requirement is to allow a WASM universe to generate textures that can be used in a different Worker that's rendering to the screen. In order to maintain high frame rates, we would like the WASM universe to transfer textures and avoid unneeded copying that could require moving memory from the GPU to the CPU and back.

@Kangz
Contributor

Kangz commented Feb 21, 2024

We didn't make any progress on this as a group, in part because it is a huge effort and we tried to ship first, but also because to make WebGPU multithreading useful on WASM we need new mechanisms to allow sharing objects without post-messaging them to the other worker. It seems there is now a WASM "shared everything" proposal that could be extended to also allow shared table<externref of shareable things>. CC @tlively (I'll reach out with more details separately).

Thank you for showing that you need this though, it helps us be sure that this complex feature is actually very needed.

@tlively

tlively commented Feb 21, 2024

You can follow along with the WebAssembly proposal that lets tables, references, and other things be shared across threads here: https://github.com/WebAssembly/shared-everything-threads/blob/main/proposals/shared-everything-threads/Overview.md

@louderspace

Thomas, does this proposal allow WebGPU resources to be shared between WASM and a JavaScript Worker? My goal is to allow the WASM realm to render into WebGPU GPUTextures with its own Device and OffscreenCanvas. I'd like these to be transferred to a JavaScript Worker and used for compositing with its own Device and Canvas. If these GPUTextures can be transferred, it would avoid contention between these realms accessing the same shared data. But if the table can be used in JavaScript to share or transfer WebGPU resources, that would be great.

We are currently in the process of updating our WASM universe from WebGL to WebGPU.

@tlively

tlively commented Feb 21, 2024

Once that Wasm proposal introduces a notion of shared JS and Wasm objects, there will be additional spec and implementation work on the WebGPU side to make sure that GPUTextures and any other necessary objects are shareable. After that, everything should be able to work the way you want it to AFAICT.

@syg

syg commented Feb 22, 2024

Once that Wasm proposal introduces a notion of shared JS and Wasm objects, there will be additional spec and implementation work on the WebGPU side to make sure that GPUTextures and any other necessary objects are shareable. After that, everything should be able to work the way you want it to AFAICT.

Unfortunately it's not that easy.

Objects like GPUTexture instances are already shipped as part of WebGPU and are unshared. Whether we can transparently upgrade them to shared objects depends on the backwards compatibility. In theory it is not backwards compatible, but the question is whether it breaks in practice.

@kainino0x's sketches above are about postMessaging unshared objects and how that can work in the current world of "shared backing store, but recreate an unshared wrapper" a la SharedArrayBuffer instances. This is a very different API than the fully shared world proposed by the Wasm proposal. The wrapper objects show up as unshared and can't be referenced from shared objects.

But! Since the "shared backing store, recreate unshared wrapper" solution isn't yet shipped, is it the case that these WebGPU objects can't be postMessaged at all today? If they can't be postMessaged at all today, there is no cross-thread observable behavior to be backwards compatible to, and we can probably transparently upgrade these objects to shared objects, analogous to what's proposed in the JS API section of the Wasm proposal.

Coming back to "in theory it is not backwards compatible" from above, these shared equivalents would have some new, exotic behaviors that are different from plain JS objects. I don't know if these differences matter in practice. Would appreciate @kainino0x chiming in:

  • Since they're now shared, you can no longer reference unshared objects from them. Primitives (e.g. strings) are okay, but unshared objects aren't.
  • The [[Prototype]] would become per-Realm. I think in practice this works as expected (e.g. a GPUTexture instance on worker 1 will get worker 1's GPUTexture.prototype and that same instance on worker 2 will get worker 2's GPUTexture.prototype), but it will also change the behavior of [[SetPrototypeOf]]. Can users change the [[Prototype]] of these WebGPU objects today?
  • Mutable own properties become data racy. This seems fine since there's no cross-thread observable behavior today anyway.

@greggman
Contributor

greggman commented Feb 22, 2024

Since they're now shared, you can no longer reference unshared objects from them. Primitives (e.g. strings) are okay, but unshared objects aren't.

I'm assuming this means this code fails?

   const t = device.createTexture(....);
   t.myDebuggingInfo = { ... };  // error!

My understanding is we want GPUTexture, GPUBuffer to be like SharedArrayBuffer. The thing in JS is a wrapper. The wrapper is different in different workers/main but the wrappers reference the same native WebGPU object, which IIUC is just an id in the render process.

@tlively

tlively commented Feb 22, 2024

Note that the shared JS objects introduced in the new Wasm proposal are intentionally not like SharedArrayBuffer, since creating thread-local wrappers around shared backing stores has overhead we want to avoid by making the JS objects shared directly.

@greggman
Contributor

Note that the shared JS objects introduced in the new Wasm proposal are intentionally not like SharedArrayBuffer, since creating thread-local wrappers around shared backing stores has overhead we want to avoid by making the JS objects shared directly.

That's great but is that what's needed for WebGPU? WebGPU objects are effectively just opaque ids. There's nothing else shared IIUC.

@kenrussell
Member

@syg :

... If they can't be postMessaged at all today, there is cross-thread observable behavior to be backwards compatible to, and we can probably transparently upgrade these objects to shared objects ...

Checking - did you mean "there is no cross-thread behavior to be backwards compatible to"?

@syg

syg commented Feb 22, 2024

Checking - did you mean "there is no cross-thread behavior to be backwards compatible to"?

Oops, quite right! Corrected in the post.

That's great but is that what's needed for WebGPU? WebGPU objects are effectively just opaque ids. There's nothing else shared IIUC.

Indeed it might not be what's needed. But let's go back to the original use case and I'll try to tie it all together for why the design is inter-related.

IIUC, the original use case is that WebGPU would like a way to set up some communication channel (e.g. like a ConcurrentQueue) that you postMessage ahead of time, and then use from worker threads to communicate the WebGPU objects, avoiding the asynchrony of postMessage.

Okay, great.

The design challenge is that if we design a ConcurrentQueue in today's world where there are no actual shared objects, the only way the ConcurrentQueue would work is something like "serialize on write, deserialize on read", and build in some magic sharing of the backing-store on serialize/deserialize like what happens for SharedArrayBuffer today across postMessage.
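The "shared backing store, recreate an unshared wrapper" model can be seen with SharedArrayBuffer itself. A sketch using structuredClone, which uses the same serialization machinery as postMessage:

```javascript
// SharedArrayBuffer across structuredClone: a new wrapper object is
// created, but it refers to the same backing store.
const sab = new SharedArrayBuffer(4);
const clone = structuredClone(sab);

new Uint8Array(clone)[0] = 42;
console.log(new Uint8Array(sab)[0]); // 42: writes visible through both
console.log(sab === clone);          // false: distinct wrapper objects
```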

But this is not an attractive design for a future world we want with actual shared objects. Since there's independent desire and already momentum to get to that future with actual shared objects via Wasm (and JS) proposals, it's worth considering how we want WebGPU objects work in that future.

In that future with actual shared memory, a ConcurrentQueue would read and write actual shared objects, so no serializing/deserializing is needed. I understand that for WebGPU the "wrapper recreation cost" is not a real problem since there just aren't that many textures or buffers or whatever, but this is not the case for general multithreaded programs, which is why we're designing actual shared objects. The problem in that future is that shared objects cannot reference unshared objects, as that wouldn't be thread-safe. (This is where the Wasm shared-everything proposal comes in: it would provide the building blocks to write such a ConcurrentQueue in userland.)

So what are the design choices here in a future with actual shared objects? I see at least two I find palatable.

  1. Try to make WebGPU objects themselves actually shared somehow. This has the back compat thorniness I raised above.

  2. Have WebGPU objects expose some kind of opaque token that is considered an actual shared object, as well as the ability to create new WebGPU objects backed by that opaque token. IOW, expose the magically-shared-backing-store concept to userland. This token is what can go into something like ConcurrentQueue; user code can have convenience wrappers that do the serializing/deserializing themselves. This avoids the back compat issue but requires rewrapping; since you don't consider that problematic, that seems fine.

@Kangz
Contributor

Kangz commented Feb 23, 2024

Note that the idea of ConcurrentQueue was to work around the fact that there was no concept of a shared object at the time. What we'd really like is to have shared objects represent WebGPU things so that they can be used concurrently with as little scaffolding as possible.

We can't make the WebGPU objects shared because we need to keep back compat. So either we do something like 2), or we add a shared: true to all object descriptors that changes the method to return a GPUObjectShared instead of GPUObject.

@syg

syg commented Feb 23, 2024

Note that the idea of ConcurrentQueue was to work around the fact that there was no concept of a shared object at the time. What we'd really like is to have shared objects represent WebGPU things so that they can be used concurrently with as little scaffolding as possible.

Well you still have to communicate these WebGPU objects to other threads somehow, right? If you don't want to postMessage each one individually, you have to communicate some kind of shared container ahead of time so you can use that as the synchronous communication channel. ConcurrentQueue is just a standin for a synchronous communication channel.

Edit: To be more explicit, even in the future we're proposing with actual shared objects, there's still no implicitly available globally shared state. For a thread to get access to an actual shared object, you'll still need to communicate it to the thread first via postMessage, after which that thread can use it synchronously.

@Kangz
Contributor

Kangz commented Feb 23, 2024

In our case I think we'd want to use shared table<shared externref> in emscripten to have per object tables, and then the WGPUObject C++ type will be an ID used to access the table. The tables would be passed to the new worker/thread in some way, like the SharedArrayBuffer used for the linear memory for example.
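The ID-indexed table pattern described above could be sketched like this. A plain array stands in for the proposed shared table, and all names are illustrative, not from any API:

```javascript
// The Wasm side holds small integer handles (the WGPUObject "ID");
// the JS side resolves them through a table of references.
const objectTable = [];

function storeHandle(obj) {      // called when JS creates a WebGPU object
  objectTable.push(obj);
  return objectTable.length - 1; // the ID handed to the Wasm side
}

function resolveHandle(id) {     // called when Wasm passes an ID back
  return objectTable[id];
}

const texture = { label: "gbuffer" }; // stand-in for a GPUTexture
const id = storeHandle(texture);
console.log(resolveHandle(id).label); // "gbuffer"
```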

@kainino0x
Contributor Author

is it the case that these WebGPU objects can't be postMessaged at all today?

Yes. But as Gregg said, breaking non-primitive-typed expando properties on shareable objects is almost certainly a no-go.

2. Have WebGPU objects expose some kind of opaque token that is considered an actual shared object, as well as the ability to create new WebGPU objects backed by that opaque token. IOW, expose the magically-shared-backing-store concept to userland. This token is what can go into something like ConcurrentQueue; user code can have convenience wrappers that do the serializing/deserializing themselves. This avoids the back compat issue but requires rewrapping, but since you don't consider that problematic that seems fine.

This is potentially appealing. We could even probably update some function signatures to accept the token types in addition to the wrapped types, so the application doesn't always have to re-wrap. Though in that case it becomes more similar to the shared: true idea.

  • Can users change the [[Prototype]] of these WebGPU objects today?

I'd have thought WebIDL would disallow this, but in Chrome I can Object.setPrototypeOf(b, {}) on a GPUBuffer and it is reflected in b.__proto__. So I guess yes. But I would be surprised if anyone is currently doing this, so it's probably OK to have a breaking change here.

@kainino0x kainino0x added the api WebGPU API label Apr 30, 2024