
feat: Fully support hardware filters on Apple VideoToolbox #11014

Merged · 16 commits · Mar 9, 2024

Conversation

@gnattu (Member) commented Feb 15, 2024

Changes

The upstream ffmpeg 6.1 release added a new scale_vt filter, which provides scaling and tone mapping using VideoToolbox's native methods. This PR:

  • Enables the hardware scale filter if possible
  • Enables hardware tone mapping if possible

One thing to note is that VideoToolbox's tone mapping has to use Apple's hard-coded parameters and does not expose any tunables like the other filters do, but Apple's settings look good enough to me.

Currently the WebUI does not have an option for enabling tone mapping for VideoToolbox; it needs to be implemented separately, as this implementation does not have the tunable options available to the other implementations.
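For reference, a hardware filter chain using the new filter might look like the following. This is a hand-written sketch, not the exact chain the server generates; it assumes an ffmpeg 6.1+ build with VideoToolbox enabled, and the input file name is made up:

```shell
# Hypothetical invocation: decode, scale, and tone-map HDR10 -> SDR
# on the GPU via VideoToolbox's scale_vt filter (requires ffmpeg >= 6.1,
# macOS only).
ffmpeg -hwaccel videotoolbox -hwaccel_output_format videotoolbox \
  -i input_hdr10.mkv \
  -vf "scale_vt=w=1920:h=1080:color_matrix=bt709:color_primaries=bt709:color_transfer=bt709" \
  -c:v hevc_videotoolbox -b:v 6M out_sdr.mp4
```

Because the tone-mapping parameters are hard-coded by Apple, the filter only takes target size and color properties, with no strength or curve options.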

Issues

Signed-off-by: gnattu <gnattuoc@me.com>
@gnattu gnattu marked this pull request as draft February 15, 2024 16:23
@gnattu (Member Author) commented Feb 18, 2024

It now works, but with an incorrect colorspace (the main input is in YUV and the overlay is in BGRA):

[screenshot]

To handle the colorspace we have multiple approaches, and this could be very specific to videotoolbox: other implementations, like the CUDA one, assume the two layers have matching pixel formats, so they do not perform any conversion.

The most native way would be to use CoreImage, generate an intermediate image with -[CIImage initWithCVPixelBuffer:], then set kCIContextWorkingColorSpace to the CIContext, and finally render it into another CVPixelBuffer with -[CIContext render:toCVPixelBuffer:bounds:colorSpace:]. This approach could give us the most correct result, as it properly handles gamma correction, but having an intermediate image for every frame could be slower. (Note: CoreImage does have a Metal backend, but ffmpeg's own CoreImage filter only uses the old OpenGL backend. If I'm going with this approach, I will create my own CIContext with Metal.)

Another approach we could try is to bypass VideoToolbox's HWContext and use a custom matrix in Metal to generate a texture2d holding YCbCr plus an alpha channel. The problem here is that the colors could be wrong if we just use a basic matrix for the conversion. The matrix I plan to use is the following:


const float4x4 rgbaToYcbcrTransform = float4x4(
   float4(+0.2990, -0.1687, +0.5000, +0.0000),
   float4(+0.5870, -0.3313, -0.4187, +0.0000),
   float4(+0.1140, +0.5000, -0.0813, +0.0000),
   float4(+0.0000, +0.5000, +0.5000, +1.0000)
);

I'm going to try the CoreImage first to see if the performance is acceptable.

@gnattu (Member Author) commented Feb 18, 2024

The CoreImage approach had initial success with color:
[screenshot]

A big problem is that the frames appear to be out of order, and the video seems to be "vibrating" back and forth. What could be the cause of this? My assumption is that something is wrong with the frame_sync, but I'm not quite sure how to fix it.

@gnattu (Member Author) commented Feb 18, 2024

Attaching the example file for reference.
test.mp4.zip

@gnattu (Member Author) commented Feb 18, 2024

Adding usleep(1600); to the blending function body made the frames sync up, so it seems like the function is called before the previous frame is ready? Why would this happen?

@nyanmisaka (Member):

ffmpeg's frame_sync utils only guarantee that frames are synchronized by timestamp; they do not guarantee that the underlying video data is synchronized.

@nyanmisaka (Member):

Perhaps you need some calls from CoreImage to ensure the video data has been synchronized correctly, similar to glFlush()/glFinish() in OpenGL.

@gnattu (Member Author) commented Feb 18, 2024

I got it. CoreImage defers rendering until the user requests access to the result, which is why the frames were out of sync. I have to lock the rendered pixel buffer with CVPixelBufferLockBaseAddress first, and now the video is synced properly.

test_fixed.mp4.zip

@nyanmisaka (Member):

Good catch. But I can still find some vibrating frames in this clip, especially some frames at the beginning and end.

@gnattu (Member Author) commented Feb 18, 2024

At this point it seems unlikely to be a frame syncing issue in our filter. If I switch to h264_videotoolbox there are vibrating frames only at the beginning, even after I removed all CVPixelBufferLockBaseAddress calls. It looks like the frames we output overload the downstream pipeline, which causes out-of-sync frames to be generated.

If I force it to run at a very slow speed (like 30fps), then even hevc_videotoolbox will not produce any out-of-sync frames.

Probably we should not output BGRA frames directly, but convert them back to planar YUV to make the hardware happier?

Or we need to find some way to throttle the frame output so that we never overload the downstream.

test.mp4_with_throttle.zip

@nyanmisaka (Member) commented Feb 18, 2024

There is no overload in libavfilter. HW encoders can be overloaded, but that is usually mitigated by dropping frames (some encoders allow this), and it doesn't happen in such low-resolution, low-bitrate use cases.

Such uncertainty means that there is still an out-of-sync issue somewhere in the overlay filter or the videotoolbox hwcontext. It's best to start by reviewing your code.

@gnattu (Member Author) commented Feb 18, 2024

Converting the output frames back to YUV fixed this, so this is indeed how videotoolbox behaves when you feed it BGRA frames too quickly.
test.mp4_output_yuv.zip

@nyanmisaka (Member):

So the input and output of your filter was (yuv+bgra=>bgra)?
And you changed it to (yuv+bgra=>yuv).

@gnattu (Member Author) commented Feb 18, 2024

Yes. That is what I did.

@nyanmisaka (Member):

Then it’s indeed an issue in the videotoolbox encoder when it comes to handling non-YUV input.

A common practice for an overlay filter is to keep the output format the same as the main input, so the latter one is better.

@gnattu (Member Author) commented Feb 19, 2024

The initial working tree is pushed here:
gnattu/jellyfin-ffmpeg@587a97a

Before submitting this upstream, I'd like to receive some suggestions first.

@nyanmisaka (Member):

> The initial working tree is pushed here: gnattu/jellyfin-ffmpeg@587a97a
>
> Before submitting this upstream, I'd like to receive some suggestions first.

Looks good at first glance. I left some suggestions.

@gnattu (Member Author) commented Feb 20, 2024

Pushed v2 patch:
gnattu/jellyfin-ffmpeg@d2276b0

Now I'm thinking about using VTPixelTransferSessionTransferImage for every conversion by default. My initial testing suggested it was much slower than CoreImage, but after further testing I see no measurable performance difference, and VTPixelTransferSessionTransferImage handles more formats than CoreImage. CoreImage can be kept as a compatibility fallback, because VTPixelTransferSessionTransferImage is not available until iOS 16.

@gnattu (Member Author) commented Feb 20, 2024

Now we have a new problem: for certain input combinations we end up with a lot of duplicated frames in the generated output when repeatlast=1. This will not affect Jellyfin, but it would be a problem for upstream. With the same input combination and repeatlast=0, the duplicated-frame problem is gone. The duplicated frames make the video ~90x longer (a 30s video becomes 45 minutes), and the output "looks normal" when played back at 90x speed. Why would something like this happen? Is the framesync event handler called 90x more often than it should be?

@gnattu (Member Author) commented Feb 20, 2024

It seems like we have to manually set the timebase in configure_output like below:

link->time_base = avctx->inputs[0]->time_base;
ctx->fs.time_base = link->time_base;

The problem seems to be gone with this.

Signed-off-by: gnattu <gnattuoc@me.com>
@JPVenson added labels: backend (Related to the server backend), feature (Adding a new feature, or substantial improvements on existing functionality), platform (Support a new platform or improve compatibility), needs testing, release-highlight (A major new feature for the next release) — Mar 3, 2024
@nyanmisaka (Member) commented Mar 8, 2024

nyanmisaka@367f0db

I simplified some redundant parts, and the order of the deint filter has been adjusted, because it must be chained right after the decoder, not after the scale filter; otherwise the image quality will be degraded. Finally, I disabled hwSurface only for cases using VideoToolbox tone-mapping, because for common transcoding use cases it is not cost-effective to trade higher system power consumption (2x memory copy: decoder buffer → sw buffer → hw buffer) and more heat for a small amount of extra FPS, especially on battery-powered notebooks.

@gnattu (Member Author) commented Mar 8, 2024

> nyanmisaka@d031f7d
>
> I simplified some redundant parts. And the order of the deint filter has been adjusted, because it must be chained after the decoder, not after the scale filter, otherwise the image quality will be degraded. Finally, I disabled hwSurface only for cases using VideoToolBox tone-mapping. Because for common transcoding use cases, it is not cost-effective to exchange higher system power consumption (2x memory copy: decoder buffer->copy to sw buffer->copy to hw buffer) and more heat for a small amount of extra FPS, especially for battery-powered notebooks.

No, you should not use a hw surface: it not only breaks tone mapping, it also breaks HDR pass-through, and wrong colors will appear everywhere. It also cannot guarantee zero copy on VideoToolbox, because the hardware decoder, GPU, and hardware encoder can sit at different places on the bus (not always in the same chip as on other platforms). On Intel Macs, the hardware encoders/decoders may be the Intel built-in ones, or on some models those in the T2 chip, while the filter context would be created on the amdgpu if available. In that case a hw surface not only makes things slower, it does not eliminate the copy either. If we have to use it, we should only use it for 8-bit videos, and take a closer look at performance and power consumption to see whether it produces any measurable benefit.

@nyanmisaka (Member) commented Mar 8, 2024

> but it also breaks HDR pass-through.

Actually, we haven't implemented HDR passthrough yet because it requires HEVC Main10 instead of Main. Unless you want 8-bit HDR.
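For context, a 10-bit HEVC encode with the videotoolbox encoder would look something like the sketch below. The flags are my own illustration (file names are made up, and actual support depends on the hardware and ffmpeg build), not the server's generated command:

```shell
# Hypothetical: real HDR passthrough needs HEVC Main10, i.e. a 10-bit
# pixel format plus the main10 profile on the encoder (macOS only).
ffmpeg -i input_hdr10.mkv \
  -c:v hevc_videotoolbox -profile:v main10 -pix_fmt p010le \
  -b:v 10M out_main10.mp4
```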

> It also cannot guarantee zero copy on VideoToolbox, as the hardware decoder, GPU, and hardware encoder could be located at different locations off the bus (not always in the same chip like other platforms). For Intel Macs, the hardware encoders/decoders could be the Intel-built-in, or they could be the T2 chip's built-in on some models. However, the filter context would be created on the amdgpu if available. In this case, hw surface not only makes it slower, but it also does not eliminate the copy either. If we have to use it, we should only use it for 8bit videos and take closer look at the performance and power consumption to see if it produces any measurable benefits.

Well, I'm convinced. hwSurface is now only enabled for OpenCL tone-mapping. nyanmisaka@bbbdf7c

@nyanmisaka (Member):

Please re-test it when you have time, including with hardware decoders disabled. Then we can approve and 🚀 it.

Co-authored-by: nyanmisaka <nst799610810@gmail.com>
Signed-off-by: gnattu <gnattuoc@me.com>
@gnattu (Member Author) commented Mar 8, 2024

Tested a few files and it appears to be OK. I merged most of your suggested code into my branch, with only some modifications:

  • Manually check for OpenCL tone mapping to avoid a stack overflow
  • Only do the color override for OpenCL tone mapping, because it is not needed when a hw surface is not in use
  • Return an empty filter chain when nothing has to be done

Signed-off-by: gnattu <gnattuoc@me.com>
@gnattu (Member Author) commented Mar 9, 2024

What we need to do is modify the upstream scale_vt filter to make it output NV12 frames, since we are not doing P010 encoding in any case for now. Currently, the scale_vt filter does not modify the pixel format, so every 10-bit input produces a 10-bit output, even after tone-mapping. This makes our encoder complain (10-bit input with the main profile), leading to frame errors for some videos. It looks like the same problem as when we fed BGRA frames directly into it.


gnattu added 2 commits March 9, 2024 11:02
Signed-off-by: gnattu <gnattuoc@me.com>
@nyanmisaka nyanmisaka requested a review from a team March 9, 2024 14:13
@crobibero crobibero merged commit a92de9b into jellyfin:master Mar 9, 2024
12 checks passed
@gnattu gnattu deleted the vf-videotoolbox branch July 17, 2024 19:18