Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transfer queue support to RenderingDevice and enable multithreaded resource loading #87590

Closed

Conversation

DarioSamo
Copy link
Contributor

@DarioSamo DarioSamo commented Jan 25, 2024

A few major bottlenecks were identified in Godot that this PR addresses.

  • Access to RenderingDevice is pretty much entirely blocked using a single thread safety mutex. This means that any thread that wishes to create buffers or textures from different threads needs to wait for the others or even the main thread to free its access.
  • When Vsync is enabled, waiting is performed while inside the mutex itself, so any arbitrary waiting caused by the presentation rate of the screen potentially stalls all efforts of loading resources in background threads. This can result in major increases in loading times in most games that choose to draw a loading screen and have Vsync enabled.
  • While the hardware provides dedicated queues for transfer operations that are much faster for the process and can be run independently from the main application loop, Godot chooses to use the same staging buffer logic that it uses for resources uploading during each frame that it uses for resources uploaded from background threads.

The PR solves all these bottlenecks and also simplifies the separate setup command buffer logic that was in place in RenderingDevice. RenderingDevice is now able to create "Transfer Workers" on the fly that can use their own staging buffer and command buffer to record the upload of all resources queued by the thread. Once it reaches a certain saturation point, it'll submit all the work to the dedicated transfer queue. The main loop will wait if necessary for these transfer workers if it's been determined that one of the resources that was uploaded with a transfer worker was used in the frame. This design allows to effectively upload resources in the background without stalling the main thread at all until the resource needs to be used.

The PR also changes the flexibility of the thread safety of RenderingDevice: whatever is allowed to be called from different threads is now allowed with no mutex involved, and whatever is not allowed now has a thread guard in place instead. A thread guard is a much cheaper option than a mutex and it also verifies that RenderingDevice is being used with its intended behavior: drawing from the main loop and not from other threads that could potentially cause issues. This was mostly done following @reduz's advice as noted here.

So far we lack a public project that allows to showcase loading a large amount of resources, but one of W4's internal projects that uses a large amount of textures and meshes had some good results:

  • master with Vsync enabled (60 HZ Monitor): ~25 seconds
  • master with Vsync disabled: ~2.5 seconds (10x faster!)
  • transfer_queues with both Vsync enabled and disabled: ~2.1 seconds

We can see one of the major benefits of this approach is that it makes the behavior much more consistent regardless of the user's vsync settings and refresh rate.

Projects with lots of resources that rely on background resource loading during loading screens or even during gameplay are the ideal candidates for testing out this PR.

TODO:

  • Decide on how to configure the max amount of transfer workers (right now it is just derived from the CPU count since it matches the logic of WorkerThreadPool).
  • Experiment with taking out the thread-safety of RID Owner by adding the logic from @reduz's PR (Make RID_Owner lock-free for fetching. #86333).

@Saul2022
Copy link

Saul2022 commented Jan 26, 2024

I know big projects i needed but the most i could find was the lumberyard one so tested on my igpu (radeon vega 3) Results
the 4.3 is the 1 minute and 4.2 is 1.04 , not much difference but it an igpu vs gb's.

Screenshot_2024-01-26-11-27-09-34_cc0c40aae00121c8e1b1866ef91e05c7 Screenshot_2024-01-26-11-23-54-98_cc0c40aae00121c8e1b1866ef91e05c7

@DarioSamo
Copy link
Contributor Author

DarioSamo commented Jan 26, 2024

I know big projects i needed but the most i could find was the lumberyard one so tested on my igpu (radeon vega 3) Results
the 4.3 is the 1 minute and 4.2 is 1.04 , not much difference but it an igpu vs gb's.

On the editor or a game? The benefits mostly apply to using threaded resource loading (e.g. loading with an animated loading screen or during gameplay), it shouldn't really matter that much on situations where it's loading but stuck without drawing anything.

Also you can get fairly accurate measurements by just measuring the ticks using GDScript in a project and displaying them somewhere, no need to use a stopwatch.

@DarioSamo
Copy link
Contributor Author

I can confirm during testing that #86333 brings performance back up to par with master as far as command recording is concerned, as I noticed nearly a 10% regression when testing on a big scene that is currently CPU-bottlenecked. We should probably evaluate that PR and if we want to merge it if we don't want this one to cause regressions.

@Saul2022
Copy link

Saul2022 commented Feb 2, 2024

Well now added loading screen and nice results from (though vsync dissabled, it enabled was 15 secs ) And while they may not be the most accurate, but it works ( still figuring out how to make it in editor)

Screenshot_2024-02-02-15-07-00-95_1eeb8c37998c08f2a00dbbf329757a78
Screenshot_2024-02-02-15-09-32-82_1eeb8c37998c08f2a00dbbf329757a78

Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally (rebased on top of master 10e1114 + #86333), it works as expected on Linux + NVIDIA.

On https://github.com/godotengine/tps-demo, it loads a full second faster than on master with neither PR applied.

@DarioSamo
Copy link
Contributor Author

DarioSamo commented Feb 13, 2024

Now that context rewrite is merged I guess we can start discussing this PR. Right now I'm fairly happy with its current state, but some questions remain to be answered if we're fine with the current design in regards to the extra memory consumption it implies.

As threaded workers are created, that means each worker that can upload stuff in parallel can have a variable cost up to the size of the biggest resource that's been uploaded. We don't want to increase memory consumption endlessly, so the current approach I took was to take the current processor count as the max amount of workers possible. In theory this means that potentially the memory consumption can increase as much as N CPU cores multiplied by the memory consumption biggest resource (if said resources were all uploaded in parallel). On the plus side, the staging buffer could actually be much smaller now as it's no longer used for uploading large resources but for uploading per-frame data now.

I'm not sure if this is actually the approach we prefer or we should make this configurable somehow, but it does make sense that a system with more cores is likely to have more memory available (1 to 2 GB per core is usually the sweet spot or you run into paging issues very often on multi-core systems). I don't think exposing a fixed limit would make much sense either as it means it can't take advantage of the extra system resources effectively, but it might be warranted depending on the use case.

The current version also implies a CPU performance regression unless #86333 is merged first due to having to enable thread safety on the RID Owners.

In either case I'm taking this out of drafts so we can start looking into it, but I'm not sure if it should make it in 4.3 or not just yet as the testing to find the issues with it might need to be very extensive (and it also exposes existing multi-threading issues like we found out with #87721).

@DarioSamo DarioSamo marked this pull request as ready for review February 13, 2024 18:09
@DarioSamo DarioSamo requested a review from a team as a code owner February 13, 2024 18:09
Comment on lines +5617 to +5610
// TODO: How should this max size be determined?
transfer_worker_pool_max_size = OS::get_singleton()->get_processor_count();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked this very deeply yet. So far, I got the impression that this code is not spawning threads but having parallel bookkeeping for thr threads already existing that may be loading resources.

If that's true, a beginning would be WorkerThreadPool::get_thread_count(). Even if currently resource loading is only allowed on low priority threads, that can mean the whole thread pool being used because there's the potential of more low-prio tasks in flight than that number and also because the are no pre-assigned low-prio threads, so any thread from the pool can be running a low-prio task at a given moment.

That said, the main thead as well as arbitrary threads can be loading resources at a given point. So, I'd at least add one to that count. If user creates extra threads while a WTP-backed load is in progress, well, it's only fair that they have to wait for a transfer worker to be available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, it doesn't spawn threads on its own nor does it pre-create them, it just allows a maximum number of concurrent workers. Using WTP as a reference does seem like a good start.

Comment on lines +437 to +439
if (!src_buffer) {
ERR_FAIL_V_MSG(ERR_INVALID_PARAMETER, "Source buffer argument is not a valid buffer of any type.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!src_buffer) {
ERR_FAIL_V_MSG(ERR_INVALID_PARAMETER, "Source buffer argument is not a valid buffer of any type.");
}
ERR_FAIL_NULL_V_MSG(src_buffer, ERR_INVALID_PARAMETER, "Source buffer argument is not a valid buffer of any type.");

Comment on lines +442 to +444
if (!dst_buffer) {
ERR_FAIL_V_MSG(ERR_INVALID_PARAMETER, "Destination buffer argument is not a valid buffer of any type.");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (!dst_buffer) {
ERR_FAIL_V_MSG(ERR_INVALID_PARAMETER, "Destination buffer argument is not a valid buffer of any type.");
}
ERR_FAIL_NULL_V_MSG(dst_buffer, ERR_INVALID_PARAMETER, "Destination buffer argument is not a valid buffer of any type.");

@clayjohn clayjohn modified the milestones: 4.3, 4.4 Mar 23, 2024
@DarioSamo DarioSamo closed this Apr 15, 2024
@AThousandShips AThousandShips removed this from the 4.4 milestone Apr 15, 2024
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Apr 23, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 19, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 19, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 20, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 20, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 20, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 20, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 25, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 25, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 25, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 26, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Sep 26, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 1, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 1, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 1, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 1, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 2, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 2, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 2, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
DarioSamo added a commit to DarioSamo/godot that referenced this pull request Oct 2, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
laurentmackay pushed a commit to metapfhor/godot that referenced this pull request Oct 30, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
JayTropper pushed a commit to JayTropper/godot that referenced this pull request Nov 6, 2024
…ice. Add ubershaders and rework pipeline caches for Forward+ and Mobile.

- Implements asynchronous transfer queues from PR godotengine#87590.
- Adds ubershaders that can run with specialization constants specified as push constants.
- Pipelines with specialization constants can compile in the background.
- Added monitoring for pipeline compilations.
- Materials and shaders can now be created asynchronously on background threads.
- Meshes that are loaded on background threads can also compile pipelines as part of the loading process.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants