Unmap persistent buffers before uploading #5800

smoogipoo · 2023-05-18T05:51:21Z

This fixes the graphical issues in ppy/osu#23538

The conjecture here, going by documentation and testing, is that Map() blocks the GPU from accessing the resource, and Unmap() allows it to access it again.

Due to the undocumented nature of this entire endeavour, I'm going to spend the rest of this PR providing some documentation on my findings and testing.

Introduction

A lot of this is diving into seemingly uncharted territory as I've been unable to find anyone else doing what we're doing here, and all suggestions lead to worse performance than these changes achieve. It seems like D3D is more tailored for 3D applications where VBOs are fully populated ahead of time and are very rarely if ever updated, or, when they are; users are okay with adding artificial delays in their code around particular calls, but we can't have these delays.

History

Testing is done via this test scene (with the test button pressed).

1. Staging buffer pool

When we're only updating a subrange of the buffer, this is the memory management that Veldrid offers. It:

The performance profile looks like this:

This is testable by applying this diff:

diff --git a/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs b/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
index a95cfcf12b..66a41efddb 100644
--- a/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
+++ b/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
@@ -606,7 +606,6 @@ internal IStagingBuffer<T> CreateStagingBuffer<T>(uint count)
             switch (Device.BackendType)
             {
                 // D3D11 benefits from persistently mapped buffers.
-                case GraphicsBackend.Direct3D11:
                 // Persistently mapped buffers appear to work by default on Vulkan.
                 case GraphicsBackend.Vulkan:
                     return new PersistentStagingBuffer<T>(this, count);

For this particular test scene, this is about a 65% reduction in FPS versus the non-animated state, which is pretty rough and is what prompted these changes. This is the sort of performance drop that we would also expect in the game, as we have a lot of VBOs updated during animations (e.g. the entire carousel moving).

2. Staging buffer pool with fenced return

Following on from the above, one might expect that the issue is constant syncs between the CPU and GPU as a result of reusing the buffers. This can be corroborated, because if we apply the following patch to Veldrid:

diff --git a/src/Veldrid/D3D11/D3D11GraphicsDevice.cs b/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
index 87d5bcf..8c48e11 100644
--- a/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
+++ b/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
@@ -352,7 +352,7 @@ namespace Veldrid.D3D11
                                 buffer.Buffer,
                                 0,
                                 D3D11Formats.VdToD3D11MapMode((buffer.Usage & BufferUsage.Dynamic) == BufferUsage.Dynamic, mode),
-                                Vortice.Direct3D11.MapFlags.None);
+                                buffer.Usage == BufferUsage.Staging ? Vortice.Direct3D11.MapFlags.DoNotWait : Vortice.Direct3D11.MapFlags.None);

                             info.MappedResource = new MappedResource(resource, mode, msr.DataPointer, buffer.SizeInBytes);

The game immediately crashes with:

Unhandled exception. SharpGen.Runtime.SharpGenException: HRESULT: [0x887A000A], Module: [Vortice.DXGI], ApiCode: [DXGI_ERROR_WAS_STILL_DRAWING/WasStillDrawing], Message: [The GPU was busy at the moment when the call was made, and the call was neither executed nor scheduled.
]
   at SharpGen.Runtime.Result.ThrowFailureException()
   at SharpGen.Runtime.Result.CheckError()
   at Vortice.Direct3D11.ID3D11DeviceContext.Map(ID3D11Resource resource, Int32 subresource, MapMode mode, MapFlags flags)
   at Veldrid.D3D11.D3D11GraphicsDevice.MapCore(MappableResource resource, MapMode mode, UInt32 subresource) in Z:\Repos\veldrid\src\Veldrid\D3D11\D3D11GraphicsDevice.cs:line 351

To rectify this, I first made this Veldrid-side commit which attempts to use a fence to signal when the GPU is done with a buffer. This is similar to how other platforms work (note that this is upstreamed to ppy/Veldrid, and not to veldrid/Veldrid).

The performance profile looks like this:

Although the raw frametimes are a little bit better, it's extremely inconsistent to the point that it feels even worse than the prior method. This is better visualised by going into borderless fullscreen which uses the waitable swapchain, that is seemingly unable to decide whether to render at 144fps (nominal) or 72fps. This feels really bad:

Furthermore, it looks as if we've just transferred over half our processing time to this Signal() method, which I found to be an unexpected and unexplained result.

3. Staging buffer pool with static 6-frame return

Further testing showed that the GPU could keep a buffer in use for at most 6 frames. This is likely platform (and load) dependent, but I had to test using a static 6-frame return window instead of a signal.

This was done in this commit, based on from the above.

The performance profile lookps like this:

The results here look about identical to the fenced method. It's interesting that in both cases the spikes are due to SwapBuffers, which could be indicating that a sync point has been introduced which I haven't been able to discover yet.

4. The giga-buffer

One question arose in my mind - is it the frequency of Map()/Unmap()? I found a relevant-looking GitHub repo showcasing a fix for another game.

The fact that there are multiple sync points back-to-back makes this especially problematic since submitting those infividual copy commands that happen between calls to Map is fairly costly on the driver side.

Thus was born the idea of a single buffer living inside VeldridRenderer that is mapped and unmapped once every frame, and copied to every relevant VBO within that frame's BufferUpdateCommands. The relevant branch can be found here.

The performance profile looks like this:

Yes, you see that correctly. The single Map() call inside Reset() becomes the dominating hotspot in this case.

Those with keen eyes will notice that I've used 1 global buffer. Increasing that number to 12 makes the performance profile look like this:

Which is... At least as bad if not worse than any other attempts. Very interesting to me, is that the overhead of Map() and related functions is completely gone, but now exists somewhere inside SwapBuffers.

5. "Persistent" mapping

The only solution to performance that I've found is persistently mapping the staging buffer. I haven't found another project that does this but it leads to this performance profile:

Which, although it still has spikes from time to time, is still much smoother than any other solution. Furthermore, the game is able to hold a consistent framerate in borderless windowed mode:

6. This PR

This PR builds upon the above to fix ppy/osu#23538 in a bit of an unfortunate way - the issue appears to be fixed by unmapping the buffer. This leads to the following performance profile:

In this case we've lost some performance by bringing back a form of the Map()/Unmap(), but in general it is still better than any other solution.

smoogipoo · 2023-05-18T07:21:19Z

On another bare-metal machine (as opposed to KVM pass-thru)...

GTX 1660: Corroborates the results exactly (my pass-thru is 3060TI so this is expected).

Arc A380: Appears to have the same performance with any of the described methods. It varies between spending 100% of time in calls to the driver, or 50-50 in calls vs SwapBuffers(), but they all result in the same FPS (300fps ± 10) and overall frame smoothness.
Not sure which is the more preferable result in this case...

9800GT: Only master and this change work well.

I don't have any AMD GPUs to test right now D:

bdach

I didn't verify everything in the OP as I presume your testing capabilities are far more comprehensive than mine, but I did roughly look at FPS and can confirm that this change is worse than master but better than pre-persistent-buffers on my machine too.

smoogipoo · 2023-05-22T07:12:52Z

AMD 5500XT:

Suffers from Map()/Unmap() being expensive.
The "gigabuffer" with 12 buffers method performs the best. On-par with the performance prior to this change.
Every other method described here performs around the same.

Unmap buffer before uploading

f523d01

smoogipoo added the area:renderer-veldrid label May 18, 2023

pull-request-size bot added the size/S label May 18, 2023

Merge branch 'master' into d3d-unmap-flush

2a20e2f

peppy approved these changes May 18, 2023

View reviewed changes

bdach approved these changes May 20, 2023

View reviewed changes

Merge branch 'master' into d3d-unmap-flush

8f17687

bdach enabled auto-merge May 20, 2023 18:55

bdach merged commit c929900 into ppy:master May 20, 2023
11 checks passed

bdach mentioned this pull request May 24, 2023

Weird Graphical Issues on DX11 with new update ppy/osu#23645

Closed

bdach mentioned this pull request Jun 4, 2023

Weird texture bug in map selection ppy/osu#23740

Open

smoogipoo deleted the d3d-unmap-flush branch September 11, 2023 02:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unmap persistent buffers before uploading #5800

Unmap persistent buffers before uploading #5800

smoogipoo commented May 18, 2023 •

edited

smoogipoo commented May 18, 2023 •

edited

bdach left a comment •

edited

smoogipoo commented May 22, 2023 •

edited

Unmap persistent buffers before uploading #5800

Unmap persistent buffers before uploading #5800

Conversation

smoogipoo commented May 18, 2023 • edited

Introduction

History

1. Staging buffer pool

2. Staging buffer pool with fenced return

3. Staging buffer pool with static 6-frame return

4. The giga-buffer

5. "Persistent" mapping

6. This PR

smoogipoo commented May 18, 2023 • edited

bdach left a comment • edited

Choose a reason for hiding this comment

smoogipoo commented May 22, 2023 • edited

smoogipoo commented May 18, 2023 •

edited

smoogipoo commented May 18, 2023 •

edited

bdach left a comment •

edited

smoogipoo commented May 22, 2023 •

edited