
Implement cuda/nvdev hwdec for vulkan backend #6170

Merged
merged 4 commits on Oct 22, 2018
Conversation

philipl (Member) commented Sep 30, 2018

This is a working implementation of nvdec acceleration with the vulkan backend using the new CUDA 10 vulkan interop API. Due to bad documentation, I haven't yet worked out how to work with a VkImage directly in CUDA, and have needed to use an intermediate VkBuffer as the target for the copy from CUDA, which is then copied to the VkImage texture.

The implementation only works on Linux, as the Windows memory export mechanism is slightly different and I can't test it.

Finally, I did not attempt to do any semaphore based synchronisation between cuda and vulkan. The interop API supports exporting a vulkan semaphore and using it from CUDA. One can imagine this would be necessary to ensure the CUDA copy is correctly synchronised. I can believe that the existing barrier logic on the buffer -> image copy on the vulkan side is sufficient; certainly there is no visual indication of a synchronisation problem.

haasn (Member) commented Sep 30, 2018

The way you have the code written now, your intermediate buffer will be allocated in host memory, not GPU memory. So you're actually doing a texture download followed by a texture upload.

The code in vk_buf_create effectively determines what memory type to place the buffer in based on the type: for RA_BUF_TYPE_TEX_UPLOAD it requires memFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, which in practice means it's forced to be in host RAM.

You could probably introduce a new RA_BUF_TYPE for this (maybe RA_BUF_TYPE_SHARED_MEMORY?) and have it require DEVICE_LOCAL_BIT as well as exportable. (If you combine the bool exportable with this memory type you wouldn't even need the new bool - "is exportable" would be equal to "is of type SHARED_MEMORY")
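The split described above could be sketched roughly as follows. This is a standalone illustration, not mpv's actual `ra` API: the enum names and flag bits are local stand-ins for mpv's `RA_BUF_TYPE_*` values and Vulkan's `VkMemoryPropertyFlagBits`.

```c
#include <assert.h>
#include <stdint.h>

/* Local placeholder flag bits mirroring VkMemoryPropertyFlagBits. */
enum {
    HOST_VISIBLE_BIT = 1 << 0,  /* ~VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT */
    DEVICE_LOCAL_BIT = 1 << 1,  /* ~VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT */
};

/* Hypothetical mirror of the buffer types under discussion. */
enum buf_type {
    BUF_TYPE_TEX_UPLOAD,     /* staging buffer: CPU writes, GPU reads */
    BUF_TYPE_SHARED_MEMORY,  /* exportable buffer: must live in VRAM */
};

/* Buffer type decides which memory property flags allocation requires. */
static uint32_t required_mem_flags(enum buf_type type)
{
    switch (type) {
    case BUF_TYPE_TEX_UPLOAD:
        return HOST_VISIBLE_BIT;  /* in practice forced into host RAM */
    case BUF_TYPE_SHARED_MEMORY:
        return DEVICE_LOCAL_BIT;  /* stays in GPU memory, exportable */
    }
    return 0;
}

/* With a dedicated type, "is exportable" collapses into the type check. */
static int buf_is_exportable(enum buf_type type)
{
    return type == BUF_TYPE_SHARED_MEMORY;
}
```

This shows why the separate `bool exportable` becomes redundant: exportability is implied by the buffer type itself.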

philipl (Member, Author) commented Sep 30, 2018

Thanks for catching that. I've added a new RA_BUF_TYPE and removed exportable from the ra API.

philipl force-pushed the master branch 7 times, most recently from a4294e2 to d2a416f on October 7, 2018
philipl (Member, Author) commented Oct 7, 2018

This is now good to merge.

philipl (Member, Author) commented Oct 10, 2018

Added an additional change to implement device matching between Vulkan and CUDA. This requires an additional method in the nv-codec-headers.

philipl (Member, Author) commented Oct 14, 2018

I've implemented a VkBuffer pool to avoid problems with interpolation. Everything I can test seems to work correctly at this stage.

philipl (Member, Author) commented Oct 15, 2018

Further update with dynamic pool allocation.

Review thread: DOCS/man/options.rst (resolved)
haasn (Member) left a comment

More or less okay. Some commits don't really make sense in isolation, also the commit names don't really conform to mpv style.

The vk_buf_priv thing is sort of annoying, but we've been bikeshedding the buffer pool enough considering it's going to be replaced once nvidia fixes their driver.

Review threads (resolved): video/out/vulkan/malloc.c, video/out/opengl/hwdec_cuda.c, video/out/vulkan/ra_vk.c, video/out/vulkan/utils.c
philipl (Member, Author) commented Oct 16, 2018

I've pushed updates that address all comments except the one about extension loading. I will work on that next.

philipl (Member, Author) commented Oct 17, 2018

I've now addressed all the outstanding comments, including one out of band from @BtbN that it wasn't building cleanly if one or the other backend was disabled.

I can squash changes in various ways if you want; just let me know what you'd like me to do there.

Review threads on video/out/opengl/hwdec_cuda.c (resolved)
philipl (Member, Author) commented Oct 18, 2018

Updated to address comments (except the one open question).

philipl (Member, Author) commented Oct 19, 2018

Simplified the GL vs Vulkan #ifdefs.

haasn (Member) commented Oct 20, 2018

I see no more major issues with this, just bikeshedding / cosmetic / peace-of-mind. So I guess it LGTM.

I would suggest squashing the two commits related to external memory extensions together (9ae36c5 and 6528f07), as well as all of the commits with hwdec_cuda in the name. The changes make more sense as a whole than as the individual commits, and they partially undo each other so it makes the history needlessly confusing to have them separate.

philipl (Member, Author) commented Oct 20, 2018

I have squashed all the cuda changes and left the vulkan changes in three logical parts (exportable memory, buffer user data, and the device UUID getter). Thanks!

haasn (Member) left a comment

Two more minor changes I found while re-viewing the squashed commits.

Review threads (resolved): video/out/vulkan/common.h, video/out/vulkan/utils.c
The CUDA/Vulkan interop works on the basis of memory being exported
from Vulkan and then imported by CUDA. To enable this, we add a way
to declare a buffer as being intended for export, and then add a
function to do the export.

For now, we support the fd and Handle based exports on Linux and
Windows respectively. There are others, which we can support when
a need arises.

Also note that this is just for exporting buffers, rather than
textures (VkImages). Image import on the CUDA side is supposed to
work, but it is currently buggy and waiting for a new driver release.

Finally, at least with my nvidia hardware and drivers, everything
seems to work even if we don't initialise the buffer with the right
exportability options. Nevertheless I'm enforcing it so that we're
following the spec.
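The fd-vs-Handle platform split mentioned above could look something like this. A standalone sketch: the constants are local placeholders mirroring Vulkan's `VkExternalMemoryHandleTypeFlagBits` values, not the real headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bits mirroring VkExternalMemoryHandleTypeFlagBits. */
enum {
    HANDLE_TYPE_OPAQUE_FD    = 1 << 0, /* ~..._OPAQUE_FD_BIT (Linux) */
    HANDLE_TYPE_OPAQUE_WIN32 = 1 << 1, /* ~..._OPAQUE_WIN32_BIT (Windows) */
};

/* Pick the export handle type for the current platform; returning 0
 * means "not exportable", so a plain buffer skips the export chain. */
static uint32_t export_handle_type(int exportable)
{
    if (!exportable)
        return 0;
#ifdef _WIN32
    return HANDLE_TYPE_OPAQUE_WIN32;
#else
    return HANDLE_TYPE_OPAQUE_FD;
#endif
}
```

In the real implementation this value would be chained into the allocation via `VkExportMemoryAllocateInfo`, with the fd later retrieved through `vkGetMemoryFdKHR` on Linux.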
This is arguably a little contrived, but in the case of CUDA interop,
we have to track additional state on the cuda side for each exported
buffer. If we want to be able to manage buffers with an ra_buf_pool,
we need some way to keep that CUDA state associated with each created
buffer. The easiest way to do that is to attach it directly to the
buffers.
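The shape of that attachment is simple: an opaque pointer on each buffer. The names below are illustrative, not mpv's actual `ra` API; the point is only that the CUDA side can stash its per-buffer import state (e.g. an external memory handle) without the generic buffer code knowing about it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ra_buf-like buffer carrying opaque
 * user state owned by whoever attached it (here, the CUDA interop). */
struct buf {
    size_t size;
    void *user_data;
};

static void buf_set_user_data(struct buf *b, void *priv)
{
    b->user_data = priv;
}

static void *buf_get_user_data(const struct buf *b)
{
    return b->user_data;
}
```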
We need this to do device matching for the cuda interop.
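Device matching works by comparing UUIDs: Vulkan reports a 16-byte device UUID (via `VkPhysicalDeviceIDProperties`), and the CUDA side exposes a UUID per device, so the interop picks the CUDA ordinal whose UUID matches. A pure-logic sketch of that comparison, assuming the UUIDs have already been queried:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16  /* both APIs use 128-bit device UUIDs */

/* Return the index of the CUDA device whose UUID matches the Vulkan
 * device's UUID, or -1 if none matches (interop not possible). */
static int find_matching_device(const uint8_t vk_uuid[UUID_LEN],
                                const uint8_t (*cuda_uuids)[UUID_LEN],
                                int n_devices)
{
    for (int i = 0; i < n_devices; i++) {
        if (memcmp(vk_uuid, cuda_uuids[i], UUID_LEN) == 0)
            return i;
    }
    return -1;
}
```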
Despite their place in the tree, hwdecs can be loaded and used just
fine by the vulkan GPU backend.

In this change we add Vulkan interop support to the cuda/nvdec hwdec.

The overall process is mostly straightforward, so the main observation here is that I had to implement it using an intermediate Vulkan buffer because direct VkImage usage is blocked by a bug in the nvidia driver. When that gets fixed, I will revisit this.

Nevertheless, the intermediate buffer copy is very cheap as it's all device memory from start to finish. Overall CPU utilisation is pretty much the same as with the OpenGL GPU backend.

Note that we cannot use a single intermediate buffer - rather there
is a pool of them. This is done because the cuda memcpys are not
explicitly synchronised with the texture uploads.

In the basic case, this doesn't matter because the hwdec is not
asked to map and copy the next frame until after the previous one
is rendered. In the interpolation case, we need extra future frames
available immediately, so we'll be asked to map/copy those frames
and vulkan will be asked to render them. So far, harmless right? No.

All the vulkan rendering, including the upload steps, are batched
together and end up running very asynchronously from the CUDA copies.

The end result is that all the copies happen one after another, and only then do the uploads happen, which means all textures are uploaded with the same, final, frame data. Whoops. Unsurprisingly, this results in jerky motion because 3 out of every 4 frames are identical.

The buffer pool ensures that we do not overwrite a buffer that is
still waiting to be uploaded. The ra_buf_pool implementation
automatically checks if existing buffers are available for use and
only creates a new one if it really has to. It's hard to say for sure
what the maximum number of buffers might be but we believe it won't
be so large as to make this strategy unusable. The highest I've seen
is 12 when using interpolation with tscale=bicubic.
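The pool strategy described above boils down to: reuse a buffer only if it is no longer in flight, otherwise allocate a new one. A minimal model of that shape (not mpv's actual `ra_buf_pool`; `in_use` stands in for the real "has vulkan consumed this upload yet" poll):

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_MAX 32  /* illustrative cap; observed high-water mark was 12 */

struct pool {
    bool in_use[POOL_MAX];
    int size;  /* buffers allocated so far */
};

/* Return the index of a free buffer, growing the pool on demand;
 * -1 signals the (unexpected) case that the cap is exhausted. */
static int pool_acquire(struct pool *p)
{
    for (int i = 0; i < p->size; i++) {
        if (!p->in_use[i]) {      /* previous upload finished: reuse */
            p->in_use[i] = true;
            return i;
        }
    }
    if (p->size < POOL_MAX) {     /* everything still in flight: grow */
        p->in_use[p->size] = true;
        return p->size++;
    }
    return -1;
}

static void pool_release(struct pool *p, int i)
{
    p->in_use[i] = false;         /* upload consumed: safe to overwrite */
}
```

The key property is that a buffer is never handed out while still awaiting upload, which is exactly what prevents the copies from clobbering each other.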

A future optimisation here is to synchronise the CUDA copies with
respect to the vulkan uploads. This can be done with shared semaphores
that would ensure the copy of the second frames only happens after the
upload of the first frame, and so on. This isn't trivial to implement
as I'd have to first adjust the hwdec code to use asynchronous cuda;
without that, there's no way to use the semaphore for synchronisation.
This should result in fewer intermediate buffers being required.
philipl (Member, Author) commented Oct 20, 2018

Fixed. Thanks.

sfan5 (Member) commented Oct 22, 2018

With these patches applied, nvdec stops working for both OpenGL and Vulkan.
The log file doesn't really show any error: https://0x0.st/s6Hu.txt

philipl (Member, Author) commented Oct 22, 2018

That log says that nvdec isn't compiled in. Did you update your nv-codec-headers?

sfan5 (Member) commented Oct 22, 2018

Arch had an update for ffnvcodec-headers in the repos, works fine after installing that.

@sfan5 sfan5 merged commit da1073c into mpv-player:master Oct 22, 2018