Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unmap persistent buffers before uploading #5800

Merged
merged 3 commits into from May 20, 2023
Merged

Conversation

smoogipoo
Copy link
Contributor

@smoogipoo smoogipoo commented May 18, 2023

This fixes the graphical issues in ppy/osu#23538

The conjecture here, going by documentation and testing, is that Map() blocks the GPU from accessing the resource, and Unmap() allows it to access it again.

Due to the undocumented nature of this entire endeavour, I'm going to spend the rest of this PR providing some documentation on my findings and testing.

Introduction

A lot of this is diving into seemingly uncharted territory as I've been unable to find anyone else doing what we're doing here, and all suggestions lead to worse performance than these changes achieve. It seems like D3D is more tailored for 3D applications where VBOs are fully populated ahead of time and are very rarely if ever updated, or, when they are; users are okay with adding artificial delays in their code around particular calls, but we can't have these delays.

History

Testing is done via this test scene (with the test button pressed).

1. Staging buffer pool

When we're only updating a subrange of the buffer, this is the memory management that Veldrid offers. It:

The performance profile looks like this:

image
image

This is testable by applying this diff:

diff --git a/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs b/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
index a95cfcf12b..66a41efddb 100644
--- a/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
+++ b/osu.Framework/Graphics/Veldrid/VeldridRenderer.cs
@@ -606,7 +606,6 @@ internal IStagingBuffer<T> CreateStagingBuffer<T>(uint count)
             switch (Device.BackendType)
             {
                 // D3D11 benefits from persistently mapped buffers.
-                case GraphicsBackend.Direct3D11:
                 // Persistently mapped buffers appear to work by default on Vulkan.
                 case GraphicsBackend.Vulkan:
                     return new PersistentStagingBuffer<T>(this, count);

For this particular test scene, this is about a 65% reduction in FPS versus the non-animated state, which is pretty rough and is what prompted these changes. This is the sort of performance drop that we would also expect in the game, as we have a lot of VBOs updated during animations (e.g. the entire carousel moving).

2. Staging buffer pool with fenced return

Following on from the above, one might expect that the issue is constant syncs between the CPU and GPU as a result of reusing the buffers. This can be corroborated, because if we apply the following patch to Veldrid:

diff --git a/src/Veldrid/D3D11/D3D11GraphicsDevice.cs b/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
index 87d5bcf..8c48e11 100644
--- a/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
+++ b/src/Veldrid/D3D11/D3D11GraphicsDevice.cs
@@ -352,7 +352,7 @@ namespace Veldrid.D3D11
                                 buffer.Buffer,
                                 0,
                                 D3D11Formats.VdToD3D11MapMode((buffer.Usage & BufferUsage.Dynamic) == BufferUsage.Dynamic, mode),
-                                Vortice.Direct3D11.MapFlags.None);
+                                buffer.Usage == BufferUsage.Staging ? Vortice.Direct3D11.MapFlags.DoNotWait : Vortice.Direct3D11.MapFlags.None);

                             info.MappedResource = new MappedResource(resource, mode, msr.DataPointer, buffer.SizeInBytes);

The game immediately crashes with:

Unhandled exception. SharpGen.Runtime.SharpGenException: HRESULT: [0x887A000A], Module: [Vortice.DXGI], ApiCode: [DXGI_ERROR_WAS_STILL_DRAWING/WasStillDrawing], Message: [The GPU was busy at the moment when the call was made, and the call was neither executed nor scheduled.
]
   at SharpGen.Runtime.Result.ThrowFailureException()
   at SharpGen.Runtime.Result.CheckError()
   at Vortice.Direct3D11.ID3D11DeviceContext.Map(ID3D11Resource resource, Int32 subresource, MapMode mode, MapFlags flags)
   at Veldrid.D3D11.D3D11GraphicsDevice.MapCore(MappableResource resource, MapMode mode, UInt32 subresource) in Z:\Repos\veldrid\src\Veldrid\D3D11\D3D11GraphicsDevice.cs:line 351

To rectify this, I first made this Veldrid-side commit which attempts to use a fence to signal when the GPU is done with a buffer. This is similar to how other platforms work (note that this is upstreamed to ppy/Veldrid, and not to veldrid/Veldrid).

The performance profile looks like this:

image
image

Although the raw frametimes are a little bit better, it's extremely inconsistent to the point that it feels even worse than the prior method. This is better visualised by going into borderless fullscreen which uses the waitable swapchain, that is seemingly unable to decide whether to render at 144fps (nominal) or 72fps. This feels really bad:
image

Furthermore, it looks as if we've just transferred over half our processing time to this Signal() method, which I found to be an unexpected and unexplained result.

3. Staging buffer pool with static 6-frame return

Further testing showed that the GPU could keep a buffer in use for at most 6 frames. This is likely platform (and load) dependent, but I had to test using a static 6-frame return window instead of a signal.

This was done in this commit, based on from the above.

The performance profile lookps like this:
image
image

The results here look about identical to the fenced method. It's interesting that in both cases the spikes are due to SwapBuffers, which could be indicating that a sync point has been introduced which I haven't been able to discover yet.

4. The giga-buffer

One question arose in my mind - is it the frequency of Map()/Unmap()? I found a relevant-looking GitHub repo showcasing a fix for another game.

The fact that there are multiple sync points back-to-back makes this especially problematic since submitting those infividual copy commands that happen between calls to Map is fairly costly on the driver side.

Thus was born the idea of a single buffer living inside VeldridRenderer that is mapped and unmapped once every frame, and copied to every relevant VBO within that frame's BufferUpdateCommands. The relevant branch can be found here.

The performance profile looks like this:

image
image

Yes, you see that correctly. The single Map() call inside Reset() becomes the dominating hotspot in this case.

Those with keen eyes will notice that I've used 1 global buffer. Increasing that number to 12 makes the performance profile look like this:

image
image

Which is... At least as bad if not worse than any other attempts. Very interesting to me, is that the overhead of Map() and related functions is completely gone, but now exists somewhere inside SwapBuffers.

5. "Persistent" mapping

The only solution to performance that I've found is persistently mapping the staging buffer. I haven't found another project that does this but it leads to this performance profile:

image
image

Which, although it still has spikes from time to time, is still much smoother than any other solution. Furthermore, the game is able to hold a consistent framerate in borderless windowed mode:

image

6. This PR

This PR builds upon the above to fix ppy/osu#23538 in a bit of an unfortunate way - the issue appears to be fixed by unmapping the buffer. This leads to the following performance profile:

image
image

In this case we've lost some performance by bringing back a form of the Map()/Unmap(), but in general it is still better than any other solution.

@smoogipoo
Copy link
Contributor Author

smoogipoo commented May 18, 2023

On another bare-metal machine (as opposed to KVM pass-thru)...

GTX 1660: Corroborates the results exactly (my pass-thru is 3060TI so this is expected).

Arc A380: Appears to have the same performance with any of the described methods. It varies between spending 100% of time in calls to the driver, or 50-50 in calls vs SwapBuffers(), but they all result in the same FPS (300fps ± 10) and overall frame smoothness.
Not sure which is the more preferable result in this case...

9800GT: Only master and this change work well.

I don't have any AMD GPUs to test right now D:

Copy link
Collaborator

@bdach bdach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't verify everything in the OP as I presume your testing capabilities are far more comprehensive than mine, but I did roughly look at FPS and can confirm that this change is worse than master but better than pre-persistent-buffers on my machine too.

@bdach bdach enabled auto-merge May 20, 2023 18:55
@bdach bdach merged commit c929900 into ppy:master May 20, 2023
11 checks passed
@smoogipoo
Copy link
Contributor Author

smoogipoo commented May 22, 2023

AMD 5500XT:

  • Suffers from Map()/Unmap() being expensive.
  • The "gigabuffer" with 12 buffers method performs the best. On-par with the performance prior to this change.
  • Every other method described here performs around the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Intel Arc A770 artifacts
3 participants