Implement cuda/nvdec hwdec for vulkan backend #6170
Conversation
The way you have the code written now, your intermediate buffer will be allocated in host memory, not GPU memory. So you're actually doing a texture download followed by a texture upload. You could probably introduce a new RA_BUF_TYPE for this.
Thanks for catching that. I've added a new RA_BUF_TYPE and removed exportable from the ra API.
Force-pushed from a4294e2 to d2a416f.
This is now good to merge.
Added a further change to implement device matching between Vulkan and CUDA. This requires an additional method in the nv-codec-headers.
I've implemented a VkBuffer pool to avoid problems with interpolation. Everything I can test seems to work correctly at this stage.
Further update with dynamic pool allocation.
More or less okay. Some commits don't really make sense in isolation, and the commit names don't really conform to mpv style.
The vk_buf_priv thing is sort of annoying, but we've been bikeshedding the buffer pool enough considering it's going to be replaced once nvidia fixes their driver.
I've pushed updates that address all comments except the one about extension loading. I will work on that next.
I've now addressed all the outstanding comments, including one raised out of band by @BtbN that the tree wasn't building cleanly if one or the other backend was disabled. I can squash changes in various ways if you want; just let me know what you'd like me to do there.
Updated to address comments (except the one open question).
Simplified the GL vs Vulkan #ifdefs.
I see no more major issues with this, just bikeshedding / cosmetic / peace-of-mind items. So I guess it LGTM. I would suggest squashing the two commits related to external memory extensions together (9ae36c5 and 6528f07), as well as all of the commits with …
I have squashed all the cuda changes and left the vulkan changes in three logical parts (exportable memory, buffer user data, and the device UUID getter). Thanks!
Two more minor changes I found while re-viewing the squashed commits.
The CUDA/Vulkan interop works on the basis of memory being exported from Vulkan and then imported by CUDA. To enable this, we add a way to declare a buffer as being intended for export, and then add a function to do the export.

For now, we support the fd-based and Handle-based exports on Linux and Windows respectively. There are others, which we can support when a need arises.

Also note that this is just for exporting buffers, rather than textures (VkImages). Image import on the CUDA side is supposed to work, but it is currently buggy and waiting for a new driver release.

Finally, at least with my nvidia hardware and drivers, everything seems to work even if we don't initialise the buffer with the right exportability options. Nevertheless, I'm enforcing it so that we're following the spec.
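For illustration, here is a minimal C sketch of the export/import mechanism this commit builds on, using the raw VK_KHR_external_memory_fd and CUDA driver APIs rather than mpv's ra abstraction. The helper name and its parameters are hypothetical, the matching VkBuffer creation (with VkExternalMemoryBufferCreateInfo) is omitted, and there is no error handling.

```c
#include <vulkan/vulkan.h>
#include <cuda.h>

/* Sketch: allocate Vulkan device memory marked as exportable, export it as a
 * POSIX fd, and import that fd into CUDA as external memory. "dev", "size"
 * and "mem_type_idx" are assumed to come from the caller. */
static CUdeviceptr export_to_cuda(VkDevice dev, VkDeviceSize size,
                                  uint32_t mem_type_idx)
{
    /* The export handle type must be declared at allocation time. */
    VkExportMemoryAllocateInfo export_info = {
        .sType = VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO,
        .handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    VkMemoryAllocateInfo alloc_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .pNext = &export_info,
        .allocationSize = size,
        .memoryTypeIndex = mem_type_idx,
    };
    VkDeviceMemory mem;
    vkAllocateMemory(dev, &alloc_info, NULL, &mem);

    /* Export the memory as a file descriptor (VK_KHR_external_memory_fd). */
    VkMemoryGetFdInfoKHR fd_info = {
        .sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR,
        .memory = mem,
        .handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    PFN_vkGetMemoryFdKHR get_fd =
        (PFN_vkGetMemoryFdKHR)vkGetDeviceProcAddr(dev, "vkGetMemoryFdKHR");
    int fd = -1;
    get_fd(dev, &fd_info, &fd);

    /* Import the fd on the CUDA side and map it to a device pointer. CUDA
     * takes ownership of the fd on a successful import. */
    CUDA_EXTERNAL_MEMORY_HANDLE_DESC hdesc = {
        .type = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD,
        .handle.fd = fd,
        .size = size,
    };
    CUexternalMemory ext_mem;
    cuImportExternalMemory(&ext_mem, &hdesc);

    CUDA_EXTERNAL_MEMORY_BUFFER_DESC bdesc = {
        .offset = 0,
        .size = size,
    };
    CUdeviceptr dptr = 0;
    cuExternalMemoryGetMappedBuffer(&dptr, ext_mem, &bdesc);
    return dptr;
}
```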
This is arguably a little contrived, but in the case of CUDA interop, we have to track additional state on the cuda side for each exported buffer. If we want to be able to manage buffers with an ra_buf_pool, we need some way to keep that CUDA state associated with each created buffer. The easiest way to do that is to attach it directly to the buffers.
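As a rough sketch of the idea (not mpv's actual ra/ra_buf definitions, all names illustrative), the buffer carries an opaque private pointer plus a destructor, so the interop code's per-buffer state is freed together with the buffer:

```c
#include <stdlib.h>

/* Hypothetical buffer type with owner-defined per-buffer state attached. */
struct example_buf {
    void *data;                       /* the buffer payload */
    void *user_priv;                  /* e.g. the CUDA interop state */
    void (*user_priv_free)(void *p);  /* called when the buffer is destroyed */
};

static void example_buf_destroy(struct example_buf *buf)
{
    /* Tear down the attached state before the buffer itself goes away. */
    if (buf->user_priv && buf->user_priv_free)
        buf->user_priv_free(buf->user_priv);
    free(buf);
}
```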
We need this to do device matching for the cuda interop.
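The matching itself can be done by comparing device UUIDs on both sides. A hedged sketch follows, assuming Vulkan 1.1 and direct CUDA driver API calls (the PR instead goes through the function table loaded from nv-codec-headers); cuInit() is assumed to have been called and error handling is omitted.

```c
#include <string.h>
#include <vulkan/vulkan.h>
#include <cuda.h>

/* Find the CUDA device backing the same GPU as the given Vulkan device by
 * comparing UUIDs. Returns 0 on success, -1 if nothing matches. */
static int find_matching_cuda_device(VkPhysicalDevice phys_dev, CUdevice *out)
{
    VkPhysicalDeviceIDProperties id_props = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES,
    };
    VkPhysicalDeviceProperties2 props2 = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2,
        .pNext = &id_props,
    };
    vkGetPhysicalDeviceProperties2(phys_dev, &props2);

    int count = 0;
    cuDeviceGetCount(&count);
    for (int i = 0; i < count; i++) {
        CUdevice dev;
        CUuuid uuid;
        cuDeviceGet(&dev, i);
        cuDeviceGetUuid(&uuid, dev);
        if (memcmp(uuid.bytes, id_props.deviceUUID, VK_UUID_SIZE) == 0) {
            *out = dev;
            return 0;
        }
    }
    return -1;
}
```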
Despite their place in the tree, hwdecs can be loaded and used just fine by the vulkan GPU backend. In this change we add Vulkan interop support to the cuda/nvdec hwdec.

The overall process is mostly straightforward; the main observation is that I had to implement it using an intermediate Vulkan buffer because direct VkImage usage is blocked by a bug in the nvidia driver. When that gets fixed, I will revisit this. Nevertheless, the intermediate buffer copy is very cheap as it's all device memory from start to finish. Overall CPU utilisation is pretty much the same as with the OpenGL GPU backend.

Note that we cannot use a single intermediate buffer - rather there is a pool of them. This is done because the cuda memcpys are not explicitly synchronised with the texture uploads. In the basic case, this doesn't matter because the hwdec is not asked to map and copy the next frame until after the previous one is rendered. In the interpolation case, we need extra future frames available immediately, so we'll be asked to map/copy those frames and vulkan will be asked to render them. So far, harmless right? No. All the vulkan rendering, including the upload steps, is batched together and ends up running very asynchronously from the CUDA copies. The end result is that all the copies happen one after another, and only then do the uploads happen, which means all the textures are uploaded with the same, final, frame data. Whoops. Unsurprisingly, this results in jerky motion because every 3/4 frames are identical.

The buffer pool ensures that we do not overwrite a buffer that is still waiting to be uploaded. The ra_buf_pool implementation automatically checks if existing buffers are available for use and only creates a new one if it really has to. It's hard to say for sure what the maximum number of buffers might be, but we believe it won't be so large as to make this strategy unusable. The highest I've seen is 12 when using interpolation with tscale=bicubic.

A future optimisation here is to synchronise the CUDA copies with respect to the vulkan uploads. This can be done with shared semaphores that would ensure the copy of the second frame only happens after the upload of the first frame, and so on. This isn't trivial to implement, as I'd have to first adjust the hwdec code to use asynchronous cuda; without that, there's no way to use the semaphore for synchronisation. This should result in fewer intermediate buffers being required.
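To make the pool behaviour concrete, here is an illustrative sketch (not mpv's actual ra_buf_pool): hand out the first buffer whose previous upload has finished, and only allocate a new buffer when every existing one is still in flight. The callbacks "in_use" and "create_buf" are placeholders for whatever mechanism the caller uses to track upload completion and create buffers.

```c
#include <stdbool.h>
#include <stdlib.h>

struct buf_pool {
    void **bufs;    /* stand-ins for the Vulkan buffers */
    int num_bufs;
    int index;      /* round-robin starting point */
};

static void *pool_get(struct buf_pool *pool,
                      bool (*in_use)(void *buf),
                      void *(*create_buf)(void))
{
    /* Try every existing buffer once, starting after the last one handed out. */
    for (int i = 0; i < pool->num_bufs; i++) {
        int idx = (pool->index + 1 + i) % pool->num_bufs;
        if (!in_use(pool->bufs[idx])) {
            pool->index = idx;
            return pool->bufs[idx];
        }
    }
    /* Every buffer is still pending upload: grow the pool by one. */
    pool->bufs = realloc(pool->bufs, (pool->num_bufs + 1) * sizeof(*pool->bufs));
    pool->index = pool->num_bufs;
    pool->bufs[pool->num_bufs++] = create_buf();
    return pool->bufs[pool->index];
}
```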
Fixed. Thanks.
With these patches applied, nvdec stops working for both OpenGL and Vulkan.
That log says that nvdec isn't compiled in. Did you update your nv-codec-headers?
Arch had an update for ffnvcodec-headers in the repos, works fine after installing that. |
This is a working implementation of nvdec acceleration with the vulkan backend, using the new CUDA 10 vulkan interop API. Due to sparse documentation, I haven't yet worked out how to use a VkImage directly in CUDA, so I've had to use an intermediate VkBuffer as the target for the copy from CUDA, which is then copied to the VkImage texture.
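As a rough sketch of that two-step copy (hypothetical helper and parameter names, no barriers or error handling): the CUDA side does a device-to-device 2D copy into the CUDA mapping of the intermediate buffer, and the Vulkan side then records a buffer-to-image copy into the destination texture.

```c
#include <vulkan/vulkan.h>
#include <cuda.h>

/* Copy one decoded plane via the intermediate buffer. The image is assumed
 * to already be in TRANSFER_DST_OPTIMAL layout. */
static void copy_plane(CUdeviceptr decoded_plane, size_t decoded_pitch,
                       CUdeviceptr buf_dptr, size_t buf_pitch,
                       CUstream stream,
                       VkCommandBuffer cmd, VkBuffer buf, VkImage image,
                       uint32_t width, uint32_t height, size_t bytes_per_pixel)
{
    /* Step 1: device-to-device 2D copy from the decoder's frame into the
     * CUDA mapping of the intermediate Vulkan buffer. */
    CUDA_MEMCPY2D cpy = {
        .srcMemoryType = CU_MEMORYTYPE_DEVICE,
        .srcDevice = decoded_plane,
        .srcPitch = decoded_pitch,
        .dstMemoryType = CU_MEMORYTYPE_DEVICE,
        .dstDevice = buf_dptr,
        .dstPitch = buf_pitch,
        .WidthInBytes = width * bytes_per_pixel,
        .Height = height,
    };
    cuMemcpy2DAsync(&cpy, stream);

    /* Step 2: record the buffer -> image copy on the Vulkan side. */
    VkBufferImageCopy region = {
        .bufferOffset = 0,
        .bufferRowLength = (uint32_t)(buf_pitch / bytes_per_pixel),
        .imageSubresource = {
            .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
            .layerCount = 1,
        },
        .imageExtent = { width, height, 1 },
    };
    vkCmdCopyBufferToImage(cmd, buf, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
}
```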
The implementation only works on Linux, as the Windows memory export mechanism is slightly different and I can't test it.
Finally, I did not attempt to do any semaphore-based synchronisation between cuda and vulkan. The interop API supports exporting a vulkan semaphore and using it from CUDA. One can imagine this would be necessary to ensure the CUDA copy is correctly synchronised. I can believe that the existing barrier logic on the buffer -> image copy on the vulkan side is sufficient; certainly there is no visual indication of a synchronisation problem.
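For reference, a hedged sketch of what that semaphore path could look like, using VK_KHR_external_semaphore_fd and the CUDA external semaphore API. The helper names are hypothetical, the semaphore is assumed to have been created with VkExportSemaphoreCreateInfo set to the opaque-fd handle type, the Vulkan-side wait (in the queue submission) is not shown, and error handling is omitted.

```c
#include <vulkan/vulkan.h>
#include <cuda.h>

/* Export a Vulkan semaphore as an fd and import it into CUDA. */
static CUexternalSemaphore import_vk_semaphore(VkDevice dev, VkSemaphore sem)
{
    VkSemaphoreGetFdInfoKHR fd_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
        .semaphore = sem,
        .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT,
    };
    PFN_vkGetSemaphoreFdKHR get_fd = (PFN_vkGetSemaphoreFdKHR)
        vkGetDeviceProcAddr(dev, "vkGetSemaphoreFdKHR");
    int fd = -1;
    get_fd(dev, &fd_info, &fd);

    CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC desc = {
        .type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD,
        .handle.fd = fd,
    };
    CUexternalSemaphore ext_sem;
    cuImportExternalSemaphore(&ext_sem, &desc);
    return ext_sem;
}

/* After queuing the cuMemcpy on "stream", signal the semaphore so the
 * subsequent Vulkan buffer -> image copy only starts once the data is there. */
static void signal_after_copy(CUexternalSemaphore ext_sem, CUstream stream)
{
    CUDA_EXTERNAL_SEMAPHORE_SIGNAL_PARAMS params = {0};
    cuSignalExternalSemaphoresAsync(&ext_sem, &params, 1, stream);
}
```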