
Tonemapping function relying on OpenCL filter and NVENC HEVC decoder #3442

Merged
merged 10 commits into from
Sep 4, 2020

Conversation

nyanmisaka
Member

@nyanmisaka nyanmisaka commented Jun 25, 2020

Changes

These PRs rely on the tonemap_opencl filter, which is based on the hardware acceleration of an OpenCL device and can cooperate with the NVENC HEVC decoder to perform tone mapping from HDR to SDR while maintaining a decent transcoding speed.

The current CPU-based tonemap method is far too slow for real-time transcoding.

Requirements

  • HEVC videos with HDR10 (smpte2084) or HLG (arib-std-b67) metadata.
  • An FFmpeg build with the opencl hwaccel type enabled (check with ffmpeg -hwaccels).
  • NVIDIA Pascal, Turing, or newer GPUs are recommended.
  • Keep the NVIDIA driver as up-to-date as possible on Windows 10.
  • NVIDIA proprietary drivers and the OpenCL runtime library are required on Linux.
  • Performance should be slightly better on Linux.
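For a quick sanity check of the requirements above, a probe along these lines can help (a sketch only; package names and tool availability vary by distro):

```shell
# Probe for the tools the requirements above mention. Purely illustrative;
# each check degrades gracefully when a tool is not installed.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
  fi
}
check ffmpeg       # then confirm 'opencl' appears in `ffmpeg -hwaccels`
check clinfo       # lists installed OpenCL platforms and devices
check nvidia-smi   # confirms the NVIDIA proprietary driver is loaded
```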

Shows

Issues
#415 is partially resolved for NVIDIA GPUs.

@sparky8251
Contributor

So, hopefully a dumb question, but why does this use OpenCL and GPU offloading specifically?

I see you briefly mention what I assume is the reason, but is ffmpeg and/or CPU support for tonemapping that subpar?

Also, what about AMD GPUs? They can transcode quite a bit as well and are well priced. Typically better for a media server than an Intel iGPU after all.

@nyanmisaka
Member Author

So, hopefully a dumb question, but why does this use OpenCL and GPU offloading specifically?

I see you briefly mention what I assume is the reason, but is ffmpeg and/or CPU support for tonemapping that subpar?

Also, what about AMD GPUs? They can transcode quite a bit as well and are well priced. Typically better for a media server than an Intel iGPU after all.

There are currently two efficient tonemap methods in ffmpeg: tonemap_vaapi, which uses dedicated hardware blocks, and tonemap_opencl, which uses the open parallel computing platform. Their performance is much better than the basic tonemap filter, which can only use a small number of cores and is not optimized with assembly or SIMD, so it usually averages only about 10 fps, which is far too slow for real-time transcoding.
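For illustration, a pipeline of the kind described here could be shaped like the command below. The filter options are assumptions for the sketch, not the exact arguments Jellyfin generates: frames are downloaded from CUDA, uploaded to the OpenCL device for tone mapping, then downloaded again for the encoder.

```shell
# Sketch of an NVDEC -> tonemap_opencl -> NVENC chain (option values are
# illustrative). The command is assembled but not executed here.
VF='hwdownload,format=p010,hwupload,tonemap_opencl=tonemap=reinhard:format=nv12,hwdownload,format=nv12'
set -- ffmpeg \
  -init_hw_device opencl=ocl:0.0 -filter_hw_device ocl \
  -hwaccel cuda -hwaccel_output_format cuda \
  -i input.mkv -vf "$VF" -c:v hevc_nvenc output.mkv
echo "would run: $*"
```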

To be honest, the current situation of AMD AMF is a bit awkward. Until now, ffmpeg has only had its encoder; the supporting decoder and scaler are missing. This means that some operations cannot be executed entirely on the GPU, which reduces throughput. Their team tried to submit these missing components in 2018, but somehow they were not accepted, and I am trying to push to restart these implementations.

GPUOpen-LibrariesAndSDKs/AMF#199

@barronpm barronpm added the merge conflict Merge conflicts should be resolved before a merge label Jul 23, 2020
@nyanmisaka nyanmisaka force-pushed the tonemap branch 3 times, most recently from 070a185 to 48e3f2b Compare July 24, 2020 16:26
@barolo

barolo commented Jul 24, 2020

What about VAAPI decoding on AMD? What would AMF bring?

@nyanmisaka
Member Author

What about VAAPI decoding on AMD? What would AMF bring?

vaapi + tonemap_opencl is unavailable on AMD GPUs due to the lack of the cl_intel_va_api_media_sharing extension, which allows data interop between OpenCL and VAAPI devices.

h264_amf (amdgpu-pro 20.20) + tonemap_opencl on Linux is feasible, but higher resolutions (1080p+) will freeze the system.

On Windows, the good news is that tonemap is usable, although it requires a powerful CPU to feed the AMD GPU. As I said before, AMD AMF lacks decoder components in ffmpeg, which has become a serious performance bottleneck in this scenario relative to NVDEC.

3400G + Vega 11: 720p 4M avg 70 fps; 1080p 15M avg 53 fps
3950X + Vega 64: 720p 4M avg 140 fps; 1080p 15M avg 101 fps

@barolo

barolo commented Jul 26, 2020

Understood, thanks for the clarification, hopefully the patches can be pushed through this time

@nyanmisaka nyanmisaka removed the merge conflict Merge conflicts should be resolved before a merge label Jul 26, 2020
@barronpm barronpm added the merge conflict Merge conflicts should be resolved before a merge label Jul 27, 2020
@barronpm
Member

Yay, more conflicts

@nyanmisaka
Member Author

Yay, more conflicts

Resolved again.

@nyanmisaka nyanmisaka removed the merge conflict Merge conflicts should be resolved before a merge label Jul 27, 2020
// The left side of the dot is the platform number, and the right side is the device number on the platform.
OpenclDevice = "0.0";
EnableTonemapping = false;
TonemappingAlgorithm = "reinhard";
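As the comment above notes, the value splits at the dot into a platform index and a device index. A shell sketch of the convention (the real parsing lives in the server code; variable names here are illustrative):

```shell
# Split the "platform.device" OpenclDevice value from the config above.
OPENCL_DEVICE="0.0"
PLATFORM_INDEX="${OPENCL_DEVICE%%.*}"   # left of the dot: OpenCL platform
DEVICE_INDEX="${OPENCL_DEVICE##*.}"     # right of the dot: device on it
echo "-init_hw_device opencl=ocl:${PLATFORM_INDEX}.${DEVICE_INDEX}"
```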
Member

I personally prefer the hable or mobius look to reinhard for most films.

Here were some tests: #415 (comment) (and the rest of that thread)
Also, the filters seem to really like it if the input is scaled to float first. (And I suspect the GPU doesn't really mind.)

Member Author

This hardware-accelerated tonemap_opencl filter is different from tonemap. Its average brightness is calculated in real time, which means it will dynamically adjust the brightness according to the threshold when the scene brightness changes greatly. I also made some attempts with hable, but failed to get brightness as suitable as with reinhard.

If you have an nvidia card, you can also try this PR to find better parameters.
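For intuition on why the operators produce different brightness, here is a toy numeric comparison of the classic reinhard curve x/(1+x) and the hable (Uncharted 2) curve with its usual constants but without white-point normalization. This is only a math sketch, not what tonemap_opencl computes internally.

```shell
# Compare reinhard and hable curve shapes at a few intensities. The hable
# values are unnormalized (no white-point scale), so they read lower.
CURVES=$(awk 'BEGIN {
  A=0.15; B=0.50; C=0.10; D=0.20; E=0.02; F=0.30;    # hable constants
  for (x = 0.5; x <= 4.0; x += 0.5) {
    r = x / (1 + x);                                  # reinhard
    h = ((x*(A*x+C*B)+D*E) / (x*(A*x+B)+D*F)) - E/F; # hable
    printf "x=%.1f reinhard=%.3f hable=%.3f\n", x, r, h;
  }
}')
echo "$CURVES"
```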

@FCLC

FCLC commented Sep 21, 2020

Sweet work, dude! I'm on the AMD GPU side of things as of now, trying to get real-time transcoding going. The GPU is an RX 5600 XT on Linux (kernel 5.4.4) using Pop!_OS 20.04 LTS (built on top of Ubuntu).

As of now, it looks like OpenCL is completely borked for Navi 10 based cards, and the lack of vaapi tonemap support means that using VAAPI for encode+decode helps a little, but not a lot.

In terms of pure CLI arguments, I'm running: ffmpeg -vaapi_device /dev/dri/renderD128 -hwaccel_output_format vaapi -i input.mp4 -threads 24 -vf 'zscale=transfer=linear,tonemap=hable,zscale=transfer=bt709,format=yuv420p,hwmap' -c:v hevc_vaapi output.mp4

Running without the GPU nets around 3 fps; running with it gets 5. Any ideas while we wait for proper AMF on Linux?

@nyanmisaka
Member Author

nyanmisaka commented Sep 21, 2020

As of now, it looks like OpenCL is completely borked for Navi 10 based cards
Running without the GPU nets around 3 fps; running with it gets 5. Any ideas while we wait for proper AMF on Linux?

As per https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-30, amdgpu-pro is compatible with AMD Radeon™ RX 5700/5600/5500 Series graphics and includes an OpenCL runtime for gfx10.

Running sudo apt install clinfo && clinfo will show you whether an OpenCL runtime is installed. Note that the one from the Mesa driver doesn't support tonemap image processing.

You can follow the instructions here to install amf-amdgpu-pro, and then install jellyfin-ffmpeg 4.3.1-1 from our repo to get OpenCL and AMF support.

If everything above is installed, you can run /usr/lib/jellyfin-ffmpeg/ffmpeg -v debug -init_hw_device opencl to see the OpenCL platform and device numbers.

#4171 adds tonemapping for AMD AMF. I've tried it on Windows, but it should work on Linux if AMF and OpenCL are properly installed.

@blazarr

blazarr commented Sep 27, 2020

Seeing an issue initializing the OpenCL device. I have a p2000 passed through. Not quite sure what I should look at.

[AVHWDeviceContext @ 0x561dcd631340] Failed to get number of OpenCL platforms: -1001.
Device creation failed: -19.
Failed to set value 'opencl=ocl:0.0' for option 'init_hw_device': No such device
Error parsing global options: No such device

edit: This is likely an issue with my setup; I am sure it has nothing to do with the feature, I was just seeing if anyone had any ideas.


@nyanmisaka
Member Author

Seeing an issue initializing the OpenCL device. I have a p2000 passed through. Not quite sure what I should look at.

It seems that ffmpeg cannot load any OpenCL platforms. You may need to install the OpenCL ICD loader in the container, or you can follow the guidance from NVIDIA to set up the OpenCL runtime in Docker.

I think OpenCL should work on the host once you have installed their proprietary driver and nvidia-opencl. I will give it a try in a container and add some docs then.

@blazarr

blazarr commented Sep 29, 2020

Seeing an issue initializing the OpenCL device. I have a p2000 passed through. Not quite sure what I should look at.

It seems that ffmpeg cannot load any OpenCL platforms. You may need to install the OpenCL ICD loader in the container, or you can follow the guidance from NVIDIA to set up the OpenCL runtime in Docker.

I think OpenCL should work on the host once you have installed their proprietary driver and nvidia-opencl. I will give it a try in a container and add some docs then.

It worked! In my case I am using Unraid with the linuxserver NVIDIA drivers. Their image at the moment does not support OpenCL; I had to use aptitude to install nvidia-opencl-icd, but it's working great. GOOD JOB! This is really AWESOME!!

@LP0101

LP0101 commented Sep 29, 2020

@blazarr can you give some more details on what you did? I'm running the same setup (jellyfin-unstable on unraid with nvidia drivers), I can't get the nvidia-opencl-icd package installed inside the Jellyfin container. Did you have to add the nvidia sources in the container first?

@blazarr

blazarr commented Sep 29, 2020

@blazarr can you give some more details on what you did? I'm running the same setup (jellyfin-unstable on unraid with nvidia drivers), I can't get the nvidia-opencl-icd package installed inside the Jellyfin container. Did you have to add the nvidia sources in the container first?

Run the commands below in the container, then restart. Turn off auto-updates; an update will require you to rerun these commands.

apt-get update
apt-get install nano
nano /etc/apt/sources.list
# add the line below:
deb http://deb.debian.org/debian buster-backports main contrib non-free
apt-get update
apt-get install aptitude
aptitude install nvidia-opencl-icd

@LP0101

LP0101 commented Sep 30, 2020

That worked. Thanks a ton!

Is the plan to have that baked into the image when it moves to stable?

@nyanmisaka
Member Author

That worked. Thanks a ton!

Is the plan to have that baked into the image when it moves to stable?

It's non-free, which is incompatible with our GPL license. But we can add this to our docs.

@LP0101

LP0101 commented Sep 30, 2020

This feels like an issue. The way it stands right now, on systems where the nvidia-opencl-icd loader has to be installed inside the container (i.e. unRAID), the official image won't be able to tonemap, and will fail at even transcoding HDR content with tonemapping enabled.

@nyanmisaka
Member Author

This feels like an issue. The way it stands right now, on systems where the nvidia-opencl-icd loader has to be installed inside the container (i.e. unRAID), the official image won't be able to tonemap, and will fail at even transcoding HDR content with tonemapping enabled.

The tonemap option is disabled by default until you have configured it properly.

@blazarr

blazarr commented Sep 30, 2020

It looks like on my card each stream consumes a flat 1GB of video memory. Does this vary, or is it always a static amount?

@nyanmisaka
Member Author

It looks like on my card each stream consumes a flat 1GB of video memory. Does this vary, or is it always a static amount?

It's controlled by the OpenCL driver for maximum performance.

@heyhippari
Contributor

Just for the sake of comparison, here's what the same example looks like in VLC, with tonemapping:


I think it's safe to say that our implementation is pretty accurate 😃

@TonyRL

TonyRL commented Dec 8, 2020

@blazarr can you give some more details on what you did? I'm running the same setup (jellyfin-unstable on unraid with nvidia drivers), I can't get the nvidia-opencl-icd package installed inside the Jellyfin container. Did you have to add the nvidia sources in the container first?

@LP0101 From what I have tested, if you use the official Docker image jellyfin/jellyfin:unstable, you can simply install nvidia-opencl-common instead of nvidia-opencl-icd; the latter pulls in a lot of stuff that already comes with Unraid + NVIDIA drivers. This also helps keep your docker.img minimal.

@nyanmisaka
Member Author

@TonyRL
Thanks for your advice!
But it should be noted that nvidia-opencl-common is only available in Debian, not in Ubuntu.
https://packages.debian.org/buster/nvidia-opencl-common

@blazarr

blazarr commented Dec 15, 2020

It looks like on my card each stream consumes a flat 1GB of video memory. Does this vary, or is it always a static amount?

It's controlled by the OpenCL driver for maximum performance.

OK, so I am trying to figure out how many 4K HDR transcodes different system setups can handle. It appears that you get one transcode per 1GB of memory on NVIDIA cards (I assume you also need a decent CPU so that any off-GPU work involved doesn't become a bottleneck?).

As for support on Intel hardware, how would you say this does on resources? I'm kind of wondering how many streams a 9th-gen Intel CPU could pull off; does it also just use 1GB of RAM?

I know you have been pretty busy with all this stuff, so no worries if you can't get to this; I am just trying to better understand the resources consumed.

Do you have plans to optimize around the Intel 11th-gen chips with support for hardware tonemapping?

@tobbenb

tobbenb commented Jan 1, 2021

@TonyRL
Thanks for your advice!
But it should be noted that nvidia-opencl-common is only available in Debian, not in Ubuntu.
https://packages.debian.org/buster/nvidia-opencl-common

There is also a solution that does not involve adding Debian packages. The only file missing is /etc/OpenCL/vendors/nvidia.icd, which has the following content: libnvidia-opencl.so.1
I'm not sure how we will fix this in the linuxserver container. We will probably just create the file when building the container, as it's just a small text file.
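A sketch of that workaround; ICD_DIR defaults to a temporary directory here purely so the snippet can run unprivileged, but on a real system the file belongs at /etc/OpenCL/vendors/nvidia.icd.

```shell
# Write the one-line ICD manifest that points the OpenCL loader at the
# NVIDIA runtime. ICD_DIR stands in for /etc/OpenCL/vendors so this can
# run without root.
ICD_DIR="${ICD_DIR:-$(mktemp -d)}"
mkdir -p "$ICD_DIR"
printf 'libnvidia-opencl.so.1\n' > "$ICD_DIR/nvidia.icd"
echo "wrote $ICD_DIR/nvidia.icd"
```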

@FCLC

FCLC commented Jan 1, 2021

Hey @nyanmisaka (long time no see! ;) )

Would you mind sharing the effective ffmpeg command this passes to the GPU to avoid redundant memory copies from GPU memory back to system memory and back again between decoding -> OpenCL -> encoding?

@nyanmisaka
Member Author

nyanmisaka commented Jan 2, 2021

OK, so I am trying to figure out how many 4K HDR transcodes different system setups can handle. It appears that you get one transcode per 1GB of memory on NVIDIA cards (I assume you also need a decent CPU so that any off-GPU work involved doesn't become a bottleneck?).

As for support on Intel hardware, how would you say this does on resources? I'm kind of wondering how many streams a 9th-gen Intel CPU could pull off; does it also just use 1GB of RAM?

Generally speaking, most HDR videos have a resolution of 4K or higher, and in addition to OpenCL, the pipeline for transcoding 4K video requires much more VRAM.

As for the Intel Gen 9/9.5 iGPU, it uses system memory as VRAM. Therefore, its internal bandwidth and its OpenCL performance should be considered before it reaches the VRAM bottleneck. I don't think the mainstream HD/UHD 630 iGPU can compare to a GTX 1050 in the number of simultaneous tone-mapping processes.

Do you have plans to optimize around the intel 11th gen chips with support for hardware tonemapping?

Currently Intel has only shipped Tiger Lake or newer iGPUs in some notebooks, which I don't plan to buy. It is reported that these new iGPUs will be in the 11th-gen desktop Core processors, and I will look at them then.

@nyanmisaka
Member Author

nyanmisaka commented Jan 2, 2021

There is also a solution which does not involve adding debian packages. The only file missing is /etc/OpenCL/vendors/nvidia.icd which has the following content: libnvidia-opencl.so.1
I'm not sure how we will fix this in the linuxserver container. We will probably just create the file when building the container as it's just a small text file.

It's nice to hear that the additional dependencies can be avoided by creating a small text file. I guess it contains the manifest for the NVIDIA OpenCL runtime.
I haven't dug into it that deeply, as we don't plan to ship NVIDIA's proprietary driver within our Docker image.

@nyanmisaka
Member Author

Would you mind sharing the effective ffmpeg command that this is passing to the GPU to avoid redundant memory copies for gpu memory back to system and back again between decoding-> openCL->encoding?

In the current ffmpeg, sharing P010 textures between OpenCL and other decoders/encoders is not available on NVIDIA and AMD. A memory copy will inevitably happen.

@FCLC

FCLC commented Jan 2, 2021

sharing P010 textures between OpenCL and other decoders/encoders is not available on NVIDIA and AMD.

If I understand correctly, this is because, in the case of the Intel system, it's using standard system RAM mapped to the iGPU, whereas the other two are typically discrete cards moving data over the PCIe bus. In the case of AMD APUs or NVIDIA embedded devices, would the system not avoid the transfer between the two areas, since they're using the same memory pool?

sharing P010 textures

Is this also the case for NV12? If so, could we convert to NV12 and remove one of the redundant copies?

[Trying to think of how to lower overhead and broaden usability to be as close to optimal on as many platforms as possible.]

@nyanmisaka
Member Author

If I understand correctly, this is because, in the case of the Intel system, it's using standard system RAM mapped to the iGPU, whereas the other two are typically discrete cards moving data over the PCIe bus. In the case of AMD APUs or NVIDIA embedded devices, would the system not avoid the transfer between the two areas, since they're using the same memory pool?

I think both Intel iGPUs and AMD GPUs/APUs are capable of mapping data between the HWA ASIC and OpenCL devices.

Intel has implemented this in VAAPI through the hwmap filter, and its QSV variant is coming in Q2 2021.

But for some reason ffmpeg rejected the hwcontext_amf from the AMD AMF team two years ago; this prevents us from using the interop function built into their framework.

Here's a comparison; hwdownload/hwupload causes huge overhead in the first pipeline.

(unoptimized pipeline)
decoder d3d11 -> hwdownload nv12/p010(cpu) hwupload -> opencl(scale,tonemap) -> hwdownload nv12/p010(cpu) -> amf encoder(nv12)

(optimized)
decoder d3d11 -> interop(opencl) -> opencl(scale,tonemap) -> uninterop(dx11) -> amf encoder(d3d11/nv12)

And here's a nice example showing the real performance of the AMD transcoder:
https://github.com/rigaya/VCEEnc

I'll try to dig into AMD AMF and implement the related functions in ffmpeg.

As for NVIDIA, I don't see any reason for them to put extra resources into OpenCL beyond the basic CPU/d3d11 <-> OpenCL data exchange. FYI: https://stackoverflow.com/questions/65157235/pass-ffpmeg-opencl-filter-output-to-nvenc-without-hwdownload

Is this also the case for NV12? If so, could we convert to NV12 and remove one of the redundant copies?

Nope. tonemap_opencl requires P010 as input, per ffmpeg:
https://github.com/FFmpeg/FFmpeg/blob/66deab3a2609aa9462709c82be5d4efbb6af2a08/libavfilter/vf_tonemap_opencl.c#L398

@FCLC

FCLC commented Jan 10, 2021

I think both Intel iGPU and AMD GPU/APU are capable of mapping data between HWA ASIC and OpenCL devices. […] Nope. tonemap_opencl requires P010 as input per ffmpeg.

In the meantime, I'm trying to develop a CUDA-based tonemap filter.

Do you have any experience with those filters? If so, would you mind taking a look at this issue in my repo? I'm trying to create a low-power, embedded, HW-accelerated cluster.

The issue is here: FCLC/Multi-Plexer#5 (comment)

Milestone
Release 10.7.0