Vulkan revision 2 #4933
Conversation
Gave this a try but it was slower on nvidia, and I actually have no idea why, because even just uncommenting the … Still needs testing on AMD, I suppose. |
The test feedback so far has been very positive for Windows; it also works great on nvidia and on the official AMD drivers. It also fixes some issues people were having, and makes things faster and more responsive. RADV/mesa is the only platform that doesn't seem to like it, but I don't think that will hold us back. (Maybe merging it into master will give mesa some motivation to fix the issues?) |
Rebased. I've also added an option to disable async transfer/compute, mainly for debugging purposes, but also because they seem to be unstable on some drivers (async compute in particular). |
The new Nvidia driver version is out and should include the bug fixes for this. With async compute disabled by default it also avoids crashes on several other drivers. I'd be happy to merge it in this form. @wm4 ? |
Clang just pointed out a fairly glaring issue in the sharing-mode initialization for allocated buffers. In theory, that might explain some of the problems with async transfer/compute on the devices where it led to issues. (Although probably the only affected buffers are texture upload buffers.) |
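For illustration of what the sharing-mode initialization involves (a generic Vulkan sketch, not mpv's actual helper; the function name and parameters are made up): a buffer that gets touched from more than one queue family either needs `VK_SHARING_MODE_CONCURRENT` with the queue families listed, or explicit ownership transfers, which is why getting these fields wrong only misbehaves on multi-queue setups.

```c
#include <vulkan/vulkan.h>

// Illustrative sketch (not mpv's code): creating a buffer that may be
// accessed from both the graphics and the transfer/compute queue families.
// The queue family indices are assumed to come from the caller.
static VkResult create_shared_buffer(VkDevice dev, VkDeviceSize size,
                                     VkBufferUsageFlags usage,
                                     const uint32_t *qfs, uint32_t num_qfs,
                                     VkBuffer *out_buf)
{
    VkBufferCreateInfo binfo = {
        .sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
        .size = size,
        .usage = usage,
        // With more than one queue family, the buffer must either be
        // CONCURRENT or use explicit queue family ownership transfers.
        .sharingMode = num_qfs > 1 ? VK_SHARING_MODE_CONCURRENT
                                   : VK_SHARING_MODE_EXCLUSIVE,
        .queueFamilyIndexCount = num_qfs > 1 ? num_qfs : 0,
        .pQueueFamilyIndices   = num_qfs > 1 ? qfs : NULL,
    };

    return vkCreateBuffer(dev, &binfo, NULL, out_buf);
}
```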
Actually, I'm not even sure if they're affected. I think the only affected case was uniform buffer updates, because UBO buffers are always updated from the graphics queue (for stupid reasons) even when using them with compute shaders. So that actually makes sense when you think about the async compute stuff. Hopefully this fixed it? If users could try with the above fix and --async-compute again I'd be very grateful. |
Found and fixed some bugs relating to image synchronization, thanks to a vulkan validation layers update. It's possible that this fixes the crashing on some devices. Testing desired. |
Does --vulkan-async-compute work for anyone? With Pascal and 387 drivers, I get "device timeout" errors on Windows and it semi-freezes the whole system on Linux. |
@aufkrawall Can you test with 4a19ae5? |
It's fixed now (tested on Linux). It doesn't have an influence on the reported frame timings though, GPU usage reported by the driver is also roughly the same. |
@aufkrawall I don't know of any GCN results, but I've heard reports from a user that said async-compute bumped FPS from 520 to 550 on a 1080 Ti. Apparently the latest round of fixes, plus the new nvidia driver version, means -rev2 will no longer crash or deadlock on windows nvidia. (I still have one swapchain hang on linux nvidia but I suspect a driver bug here as well, since it matches the pattern of the other swapchain-related issues) Don't know if any AMD Windows users have tested this version. Mesa still doesn't work. |
I have an RX 470 and I'm currently using Windows. The last time I tried, I tested the difference on a 540p AVC file (decodes fast) with gpu-hq + ravu-r4 + adaptive-sharpen (just to increase the GPU load). With --vulkan-async-compute and --vulkan-queue-count=3 I got around 490 fps; without --vulkan-async-compute (but still with --vulkan-queue-count=3) I got around 515 fps. I could time how long it takes to reach the 1 min mark or something in both cases; I feel that with async-compute turned on the minimums are higher, but I could be wrong. |
It is actually broken on Linux and amdgpu-pro when running color conversion. (At least yuv420.) |
@AstralStorm: What does ‘broken’ mean in this context? |
Hi. Please rebase; the patch fails to apply against master (wscript). |
@haasn Crashing on surface swap regardless of sync or async or compute mode. |
@AstralStorm: backtrace? log? |
Backtrace is useless due to lack of symbols. (Cannot trace back from ICD.) It crashes on the second real frame swap of youtube yuv420 videos (h264). I will provide the log soon. |
@sl1pkn07 rebased; somebody decided to make the same change to mpv master as one of the commits in this PR but did it slightly differently, which was the reason for the conflict... Untested (ffmpeg master still doesn't work with mpv) |
@AstralStorm can you reproduce it from inside gdb, and maybe step through the function …?
Hi. It seems the first hunk of d98c538 is already applied when patching the PR; see it. There are also conflicts in video/out/gpu/ra.h. The patch was applied to a clean clone. Greetings |
1. No more static arrays (deps / callbacks / queues / cmds)
2. Allows safely recording multiple commands at the same time
3. Uses resources optimally by never over-allocating commands
Instead of associating a single VkSemaphore with every command buffer and allowing the user to ad-hoc wait on it during submission, make the raw semaphores-to-signal array work like the raw semaphores-to-wait-on array. Doesn't really provide a clear benefit yet, but it's required for upcoming modifications.
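Roughly what that amounts to, as a hedged sketch (the struct and helper names here are invented, not the actual mpv vk_cmd code): both the semaphores to wait on and the semaphores to signal become plain dynamic arrays, and both are simply handed to VkSubmitInfo at submission time.

```c
#include <stdlib.h>
#include <vulkan/vulkan.h>

// Hypothetical command wrapper: wait and signal semaphores are both plain
// dynamic arrays, filled ad-hoc before submission.
struct cmd {
    VkCommandBuffer buf;
    VkSemaphore *waits;
    VkPipelineStageFlags *wait_stages;
    uint32_t num_waits;
    VkSemaphore *signals;
    uint32_t num_signals;
};

static void cmd_signal(struct cmd *cmd, VkSemaphore sem)
{
    cmd->signals = realloc(cmd->signals, ++cmd->num_signals * sizeof(sem));
    cmd->signals[cmd->num_signals - 1] = sem;
}

static VkResult cmd_submit(VkQueue queue, struct cmd *cmd, VkFence fence)
{
    VkSubmitInfo sinfo = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &cmd->buf,
        .waitSemaphoreCount   = cmd->num_waits,
        .pWaitSemaphores      = cmd->waits,
        .pWaitDstStageMask    = cmd->wait_stages,
        .signalSemaphoreCount = cmd->num_signals,
        .pSignalSemaphores    = cmd->signals,
    };
    return vkQueueSubmit(queue, 1, &sinfo, fence);
}
```

The point is symmetry: neither direction is special-cased, which is what later changes (multiple commands in flight, per-queue submission) build on.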
At least one of the radeon developers is looking into the issue with amdgpu and some generations of AMD GPUs; but in the meantime, disabling the use of async transfer avoids triggering the stall (which happens when attempting a buffer->image copy from the compute queue). |
@haasn Is he a radeon developer? If so, leave a comment about the weird exclusive-fullscreen issue. |
Are you using Windows 10 Creators Update 2? It should by default force every fullscreen application into "direct borderless" mode (which should be what you want), unless you disabled that in the Windows compatibility options for mpv.exe. |
@aufkrawall You mean this?: |
@Artoriuz I did poke @haasn about this a couple of days ago and it seemed like he wanted to wait for the amdgpu fixes, but I might have understood him wrong. As long as this has close to zero effect on anything else I'd just take this in since vulkan has more or less been experimental in mpv so far as well. |
For me it has worked absolutely reliably so far on Nvidia Pascal (Windows 10/Arch Linux) and Intel Skylake (Arch Linux), except for that async compute segfault which haasn fixed. :) |
@jeeb I'd take it in as-is since we've hit a stalemate in the bug fixing process. I can't find any more issues (apart from the ones I discovered while re-analysing/rewriting this code for libplacebo), so anything remaining would need investigation from driver devs or improvements to the debugging layers. One thing I'm sort of wishing we could do is make the use of … |
Here are my testing results: Intel Ivy Bridge, mesa 17.2.5, linux 4.7.10. Running … but it actually seems to mostly work in both rev1 and rev2, other than high CPU usage of about 80% of one core. The average redraw time (as indicated by stats.lua) is increased with rev2, although the fresh draw time is slightly lowered.

vulkan_rev1: 13 ms draw / 3 ms redraw (log, stats)

I also see lots of … None of this seems to be affected by the various permutations of … |
@kevmitch The timers are unreliable with rev2. Can you try using …? |
base: rev2 is 3% faster
vulkan-rev2: 370fps (log, stats)

deband: rev2 is 20% slower
vulkan-rev2: 180fps (log, stats)

Again, … |
OK. Took a general look at this whole thing and then tried to focus on the effects it could have on the other renderers, since only a limited number of generic files were touched.
Otherwise, the biggest thought I got after looking at the code was: …
Thus, I have very minimal concerns (although I'm still interested in the logic behind the parameter grouping in those functions where the new parameters were added). Additionally ran some static analysis on the current master with vulkan_r2 rebased on top. There are some warnings from vulkan-related files but not sure if those have anything to do with code changed in this PR. tl;dr Looking pretty good as far as I'm generally concerned. |
From your log, your device only supports a graphics queue, so those options do nothing on your device. Maybe we should make these options tristate (…).

The 20% performance loss in rev2+deband is probably connected to the error message spam (“Error: texture could not be created”). Can you investigate this? Which line throws the error, and where does the error value come from? Can you try using …?

P.s. Intel supports neither events nor async queues nor implicit image barriers, so I don't expect rev2 and rev1 to be any different performance-wise if everything is behaving correctly.

Edit: I compared the logs (via …) |
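For context, a minimal sketch (not the code path mpv actually uses; the helper name is made up) of the underlying capability check: whether async compute can do anything at all depends on the device reporting a compute-capable queue family that is distinct from the graphics one.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

// Returns true if the device exposes a compute-capable queue family that is
// not also the graphics family, i.e. async compute is even possible.
static bool has_async_compute(VkPhysicalDevice physd)
{
    uint32_t num = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physd, &num, NULL);
    VkQueueFamilyProperties *qfs = calloc(num, sizeof(*qfs));
    vkGetPhysicalDeviceQueueFamilyProperties(physd, &num, qfs);

    bool found = false;
    for (uint32_t i = 0; i < num; i++) {
        VkQueueFlags flags = qfs[i].queueFlags;
        if ((flags & VK_QUEUE_COMPUTE_BIT) && !(flags & VK_QUEUE_GRAPHICS_BIT))
            found = true;
    }

    free(qfs);
    return found;
}
```

On a device that only reports a single graphics queue family, this returns false, and the async options can't have any effect.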
Reposting my comments from IRC
|
ultimately because …
where …
Are you talking about this: https://gpuopen.com/using-the-vulkan-validation-layers/?
how do I measure this? |
If you have nvidia: with nvidia-smi, or nvidia-settings. |
Sorry, the option name is …. The stack trace helps. Can you paste the output of …? |
I think it's fine as one option; it's doing the same high-level thing. Does … |
Just tested and … |
@kevmitch The issue in your case is that in mpv we universally try to enable storage image support for our intermediate FBOs, but your device doesn't support storage images on the … This also explains the performance loss: your redraws are not getting cached as a result, since creating the cache FBO fails. This should not cause any issues apart from the performance loss and the spammed error message. (Maybe we should remember a failure here?)

The proper solution is to expand … The next-best half-baked work-around would be to add a flag to ra_tex_resize indicating whether we want to use it as a storage image or not. (The cache FBO is never used as a storage image.) But this would just work around your specific issue; it doesn't completely eliminate the bug, since it's still possible we might run into cases of trying to make storable FBOs with a non-storable … |
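For context, a minimal generic-Vulkan sketch (not mpv's ra code; the helper name is made up) of what a non-storable format means here: VK_IMAGE_USAGE_STORAGE_BIT can only be requested for an FBO if the format's optimal-tiling features actually advertise storage image support.

```c
#include <vulkan/vulkan.h>

// Only request storage usage for a render target if the format supports it;
// otherwise image creation (or the validation layers) will reject the usage.
static VkImageUsageFlags fbo_usage(VkPhysicalDevice physd, VkFormat fmt)
{
    VkFormatProperties prop;
    vkGetPhysicalDeviceFormatProperties(physd, fmt, &prop);

    VkImageUsageFlags usage = VK_IMAGE_USAGE_SAMPLED_BIT |
                              VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT;

    if (prop.optimalTilingFeatures & VK_FORMAT_FEATURE_STORAGE_IMAGE_BIT)
        usage |= VK_IMAGE_USAGE_STORAGE_BIT;

    return usage;
}
```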
Anyway, I wouldn't mind merging this without commit 9789bac and leaving that commit to a later date. That way this at least isn't a regression; i.e. the only bugs related to non-storable FBO formats are also bugs in master. |
If we know the limitations of the current setup I do not consider this a deal-breaker, given that my understanding is that something is planned to be done about it (which might of course be incorrect). Better not to shake things up too much compared to what has already gone through a fair amount of testing. @kevmitch, what's your opinion? |
I reverted the commit and indeed got a slight improvement over rev1.

rev2 deband without 9789bac: 225fps (log, stats)

There are also no more …
|
well shit |
Just FYI, I posted a radv patch to fix the compute bugs on it; it should be in master and the next stable release. |
It's indeed working nicely here with Linux 4.15, mesa-git and RX 560. According to render stats, async compute gives a massive performance boost (~4.5ms vs. ~14.5ms). Can that be real? With or without AC, I can run higher settings than with OGL. Kudos, haasn (and David Airlie ;) )! |
Ok, I used the more reliable benchmark method described by haasn in his blog (also added vsync off for D3D11). I don't have the concrete numbers anymore, but I checked everything several times, so I think my conclusions should be really reliable for my testcase (scaling down 4k 60fps 8 bit and 4k 50fps 10 bit to 1440p).
You btw still need mesa-git; Mesa 17.3.2 still hangs the system with RADV + mpv. |
I wonder if it'd be worth doing the hardware copy into Vulkan DR memory... |
I could also do the same test on a Windows 10 machine with a GTX 1060, also here Vulkan was faster than D3D11. |
RadV now works with Mesa 17.3.3. There is tearing in Xorg windowed, but not fullscreen. |
This improves on the major shortcomings of the ver1 design and enables full use of async transfers, async compute, multi-command queueing, event-based image barriers, safe FBO invalidation, better implicit renderpass layout transitions and more (tm).
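As an illustration of what event-based image barriers buy (a generic Vulkan sketch, not the exact code in this PR): a vkCmdSetEvent/vkCmdWaitEvents pair splits the barrier, so work recorded between the two halves isn't forced to stall on the layout transition.

```c
#include <vulkan/vulkan.h>

// Split ("event-based") image barrier: signal the event where the write
// finishes, and only wait on it right before the first read, so commands
// recorded in between can overlap with the transition.
static void split_image_barrier(VkCommandBuffer cb, VkEvent event, VkImage image,
                                VkImageLayout from, VkImageLayout to)
{
    vkCmdSetEvent(cb, event, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT);

    // ... record unrelated commands here ...

    VkImageMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .oldLayout = from,
        .newLayout = to,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .levelCount = 1,
            .layerCount = 1,
        },
    };

    vkCmdWaitEvents(cb, 1, &event,
                    VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                    0, NULL, 0, NULL, 1, &barrier);
}
```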
It turns out that the VkSemaphore deadlock issue was an nvidia driver bug. This approach works after all, and scales well - especially now that I've refactored it to avoid using VkSemaphores for intra-frame dependencies as much as possible. There is now virtually no performance drop as a result of this refactor even without the fancy async transfer stuff. (And with async transfer on top, the performance goes through the roof again)
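To show the kind of change meant by avoiding VkSemaphores for intra-frame dependencies, here is a hedged sketch (not the actual refactor): a dependency between two passes on the same queue can be expressed with a pipeline barrier inside the command buffer instead of a semaphore between two separate submissions.

```c
#include <vulkan/vulkan.h>

// Same-queue dependency as a pipeline barrier rather than a semaphore between
// two submissions: a compute pass writes an image, and fragment shading later
// in the same command buffer reads it.
static void intra_frame_dependency(VkCommandBuffer cb, VkImage image)
{
    VkImageMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_GENERAL,
        .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .levelCount = 1,
            .layerCount = 1,
        },
    };

    vkCmdPipelineBarrier(cb,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
                         0, 0, NULL, 0, NULL, 1, &barrier);
}
```

The barrier never leaves the command buffer, so nothing has to round-trip through vkQueueSubmit just to order two passes within the same frame.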
I probably won't actually merge this until the fixes for the nvidia deadlock issue make it into an actual driver release, but it would be good to get some testing (on other platforms) and code review out of the way.
NOTE: This is still not entirely optimal. Improvement avenues: …
And probably more that I've forgotten. Let the hate commence!