Unmap persistent buffers before uploading #5800
Merged
This fixes the graphical issues in ppy/osu#23538
The conjecture here, going by documentation and testing, is that Map() blocks the GPU from accessing the resource, and Unmap() allows it to access it again.
Due to the undocumented nature of this entire endeavour, I'm going to spend the rest of this PR providing some documentation on my findings and testing.
Introduction
A lot of this is diving into seemingly uncharted territory, as I've been unable to find anyone else doing what we're doing here, and every suggestion I've come across leads to worse performance than these changes achieve. It seems like D3D is more tailored for 3D applications where VBOs are fully populated ahead of time and are very rarely, if ever, updated, or, when they are, users are okay with adding artificial delays in their code around particular calls. We can't have these delays.
History
Testing is done via this test scene (with the test button pressed).
1. Staging buffer pool
When we're only updating a subrange of the buffer, this is the memory management that Veldrid offers. It:
The performance profile looks like this:
This is testable by applying this diff:
For this particular test scene, this is about a 65% reduction in FPS versus the non-animated state, which is pretty rough and is what prompted these changes. This is the sort of performance drop that we would also expect in the game, as we have a lot of VBOs updated during animations (e.g. the entire carousel moving).
2. Staging buffer pool with fenced return
Following on from the above, one might expect that the issue is constant syncs between the CPU and GPU as a result of reusing the buffers. This can be corroborated, because if we apply the following patch to Veldrid:
The game immediately crashes with:
To rectify this, I first made this Veldrid-side commit which attempts to use a fence to signal when the GPU is done with a buffer. This is similar to how other platforms work (note that this is upstreamed to ppy/Veldrid, and not to veldrid/Veldrid).
The performance profile looks like this:
Although the raw frametimes are a little bit better, they're extremely inconsistent, to the point that it feels even worse than the prior method. This is better visualised by going into borderless fullscreen, which uses the waitable swapchain and seems unable to decide whether to render at 144fps (nominal) or 72fps. This feels really bad:
Furthermore, it looks as if we've just transferred over half our processing time to this Signal() method, which I found to be an unexpected and unexplained result.
3. Staging buffer pool with static 6-frame return
Further testing showed that the GPU could keep a buffer in use for at most 6 frames. This is likely platform (and load) dependent, but I had to test using a static 6-frame return window instead of a signal.
This was done in this commit, based on the above.
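To make the idea concrete, here is a minimal, language-agnostic sketch of a frame-delayed return written in Python. It is not the code from the commit: the class name, method names and the use of bytearray as a stand-in for a staging buffer are all assumptions for illustration. A buffer handed back in frame N only becomes rentable again once 6 frames have elapsed.

```python
from collections import deque

RETURN_DELAY_FRAMES = 6  # taken from the observation above; likely platform and load dependent


class StagingBufferPool:
    """Toy pool (hypothetical): a buffer returned in frame N is reusable from frame N + delay."""

    def __init__(self, delay=RETURN_DELAY_FRAMES):
        self.delay = delay
        self.frame = 0
        self.free = []          # buffers that are safe to hand out again
        self.pending = deque()  # (frame_returned, buffer), oldest first
        self.allocated = 0      # total number of buffers ever created

    def rent(self, size):
        # Reuse a free buffer that is large enough, otherwise allocate a new one.
        for i, buf in enumerate(self.free):
            if len(buf) >= size:
                return self.free.pop(i)
        self.allocated += 1
        return bytearray(size)

    def give_back(self, buf):
        # Do not reuse yet: the GPU may still be reading from this buffer.
        self.pending.append((self.frame, buf))

    def end_frame(self):
        self.frame += 1
        # Release everything that has aged past the return window.
        while self.pending and self.frame - self.pending[0][0] >= self.delay:
            self.free.append(self.pending.popleft()[1])


# One upload per frame for 20 frames.
pool = StagingBufferPool()
for _ in range(20):
    buf = pool.rent(256)
    pool.give_back(buf)
    pool.end_frame()
```

With one upload per frame, the pool in this sketch settles at 6 live buffers, one per frame of the return window, instead of syncing on a single reused buffer.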
The performance profile looks like this:
The results here look about identical to the fenced method. It's interesting that in both cases the spikes are due to SwapBuffers, which could be indicating that a sync point has been introduced which I haven't been able to discover yet.
4. The giga-buffer
One question arose in my mind - is it the frequency of Map()/Unmap()? I found a relevant-looking GitHub repo showcasing a fix for another game.
Thus was born the idea of a single buffer living inside VeldridRenderer that is mapped and unmapped once every frame, and copied to every relevant VBO within that frame's BufferUpdateCommands. The relevant branch can be found here.
The performance profile looks like this:
Yes, you see that correctly. The single Map() call inside Reset() becomes the dominating hotspot in this case.
Those with keen eyes will notice that I've used 1 global buffer. Increasing that number to 12 makes the performance profile look like this:
Which is... at least as bad as, if not worse than, any other attempt. What's very interesting to me is that the overhead of Map() and related functions is completely gone, but now exists somewhere inside SwapBuffers.
5. "Persistent" mapping
The only solution to performance that I've found is persistently mapping the staging buffer. I haven't found another project that does this but it leads to this performance profile:
Which, although it still has spikes from time to time, is still much smoother than any other solution. Furthermore, the game is able to hold a consistent framerate in borderless windowed mode:
6. This PR
This PR builds upon the above to fix ppy/osu#23538 in a bit of an unfortunate way - the issue appears to be fixed by unmapping the buffer. This leads to the following performance profile:
In this case we've lost some performance by bringing back a form of the Map()/Unmap() cycle, but in general it is still better than any other solution.
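The ordering this PR relies on can be illustrated with a toy model in Python. This is only a sketch of the conjecture that a mapped buffer is off-limits to the GPU: StagingBuffer, gpu_copy and upload are hypothetical names, not Veldrid or D3D API. The real symptom was graphical corruption rather than an exception; the error here only exists to make the ordering visible.

```python
class StagingBuffer:
    """Toy model of the conjecture: while mapped, the GPU must not touch the buffer."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.mapped = False

    def map(self):
        self.mapped = True
        return memoryview(self.data)  # CPU-writable view of the buffer

    def unmap(self):
        self.mapped = False


def gpu_copy(staging, dest, offset, length):
    # Stand-in for the GPU-side copy from the staging buffer into a VBO.
    if staging.mapped:
        raise RuntimeError("staging buffer is still mapped")
    dest[offset:offset + length] = staging.data[:length]


def upload(staging, dest, offset, payload):
    view = staging.map()
    view[:len(payload)] = payload  # CPU write through the mapping
    view.release()
    staging.unmap()                # unmap before the copy is issued
    gpu_copy(staging, dest, offset, len(payload))


vbo = bytearray(8)
upload(StagingBuffer(16), vbo, 2, b"abcd")
```

When and how the buffer is mapped again after the copy is deliberately left out of this sketch.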