
Add support for hardware video decoding #4839

Merged (40 commits) on Oct 29, 2021

Conversation

@Opelkuh (Contributor) commented Oct 24, 2021

Closes #4079

Performance gains are extremely variable between different machines but in general they're smaller than I expected.

On Windows / Linux

On high-bitrate / high-resolution videos it's definitely noticeable: HW decoders have no problem with these, while the SW one sometimes can't maintain 1x speed. Here's an example of 4K 60fps video playback:

vid.mp4

The SW one is slightly laggy while NVDEC runs at 2x speed without a problem (slowed down for the comparison). This is on a GTX 1060 6GB, so nothing cutting edge.

However, with lower bitrate videos I was getting pretty much the same CPU usage, while some HW decoders used more RAM (especially NVDEC).

On Android

I was testing playback of the default test video that's in visual tests on an old Samsung Galaxy S6 Edge. Decoding with MediaCodec used ~3% less total CPU and RAM usage was pretty much the same. Not great, not terrible.


Based on this testing, I decided not to use HW decoders by default on desktop platforms, but to enable them by default on Android and iOS.

I don't have the hardware to test every HW decoder, but the list below contains the ones that I could test so far. If anyone has access to machines that support any of these, please share whether it works for you (and ideally how well or badly)!

Windows

  • NVDEC
  • Intel Quick Sync Video
  • DirectX Video Acceleration 2

Linux

  • VDPAU
  • VA-API

Android

  • MediaCodec

macOS

  • VideoToolbox

iOS

  • VideoToolbox

HW decoder selection

Automatic HW decoder selection is implemented in the simplest way possible: get the available decoders, then try them one by one until something sticks. This has the added benefit that it will automatically fall back to SW decoding if everything else fails.
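
As a rough illustration of that loop (a minimal sketch using FFmpeg.AutoGen, not the PR's actual code; the candidateDeviceTypes input and the probing-only helper are assumptions):

using System.Collections.Generic;
using FFmpeg.AutoGen;

// Try each reported hardware device type in order; if none can be
// initialised, fall back to AV_HWDEVICE_TYPE_NONE, i.e. software decoding.
private static unsafe AVHWDeviceType selectDeviceType(IEnumerable<AVHWDeviceType> candidateDeviceTypes)
{
    foreach (var type in candidateDeviceTypes)
    {
        AVBufferRef* deviceContext = null;

        // av_hwdevice_ctx_create returns a negative error code on failure.
        if (ffmpeg.av_hwdevice_ctx_create(&deviceContext, type, null, null, 0) < 0)
            continue;

        // The real decoder would keep this reference; here it was only probed.
        ffmpeg.av_buffer_unref(&deviceContext);
        return type;
    }

    return AVHWDeviceType.AV_HWDEVICE_TYPE_NONE; // software fallback
}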

Possible performance improvements

AFAIK all hardware decoders return frames in NV12 pixel format but the current rendering is set up to consume YUV420P. This means that all HW decoded frames have to go through a format conversion on the CPU. From my testing, this adds roughly 1% total CPU usage on desktop and ~5% on Android. It also adds a bit of RAM usage.

I plan to address this in a separate PR, without affecting the SW decoding path, but here's a quick and dirty patch if you'd like to try this change now:

Patch
Index: osu.Framework/Graphics/Video/VideoSpriteDrawNode.cs
===================================================================
diff --git a/osu.Framework/Graphics/Video/VideoSpriteDrawNode.cs b/osu.Framework/Graphics/Video/VideoSpriteDrawNode.cs
--- a/osu.Framework/Graphics/Video/VideoSpriteDrawNode.cs	(revision bf0a870fc35278929759d8586942bcd9ebfa67ef)
+++ b/osu.Framework/Graphics/Video/VideoSpriteDrawNode.cs	(date 1635037508888)
@@ -18,13 +18,12 @@
             video = source;
         }
 
-        private int yLoc, uLoc = 1, vLoc = 2;
+        private int yLoc, uvLoc = 1;
 
         public override void Draw(Action<TexturedVertex2D> vertexAction)
         {
             Shader.GetUniform<int>("m_SamplerY").UpdateValue(ref yLoc);
-            Shader.GetUniform<int>("m_SamplerU").UpdateValue(ref uLoc);
-            Shader.GetUniform<int>("m_SamplerV").UpdateValue(ref vLoc);
+            Shader.GetUniform<int>("m_SamplerUV").UpdateValue(ref uvLoc);
 
             var yuvCoeff = video.ConversionMatrix;
             Shader.GetUniform<Matrix3>("yuvCoeff").UpdateValue(ref yuvCoeff);
Index: osu.Framework/Resources/Shaders/sh_yuv2rgb.h
===================================================================
diff --git a/osu.Framework/Resources/Shaders/sh_yuv2rgb.h b/osu.Framework/Resources/Shaders/sh_yuv2rgb.h
--- a/osu.Framework/Resources/Shaders/sh_yuv2rgb.h	(revision bf0a870fc35278929759d8586942bcd9ebfa67ef)
+++ b/osu.Framework/Resources/Shaders/sh_yuv2rgb.h	(date 1635037495377)
@@ -1,8 +1,7 @@
 #include "sh_TextureWrapping.h"
 
 uniform sampler2D m_SamplerY;
-uniform sampler2D m_SamplerU;
-uniform sampler2D m_SamplerV;
+uniform sampler2D m_SamplerUV;
 
 uniform mediump mat3 yuvCoeff;
 
@@ -16,7 +15,7 @@
         return vec4(0.0);
 
     lowp float y = texture2D(m_SamplerY, wrappedCoord, lodBias).r;
-    lowp float u = texture2D(m_SamplerU, wrappedCoord, lodBias).r;
-    lowp float v = texture2D(m_SamplerV, wrappedCoord, lodBias).r;
+    lowp float u = texture2D(m_SamplerUV, wrappedCoord, lodBias).r;
+    lowp float v = texture2D(m_SamplerUV, wrappedCoord, lodBias).g;
     return vec4(yuvCoeff * (vec3(y, u, v) + offsets), 1.0);
 }
Index: osu.Framework/Graphics/Video/VideoDecoder.cs
===================================================================
diff --git a/osu.Framework/Graphics/Video/VideoDecoder.cs b/osu.Framework/Graphics/Video/VideoDecoder.cs
--- a/osu.Framework/Graphics/Video/VideoDecoder.cs	(revision bf0a870fc35278929759d8586942bcd9ebfa67ef)
+++ b/osu.Framework/Graphics/Video/VideoDecoder.cs	(date 1635037362441)
@@ -618,7 +618,7 @@
                 lastDecodedFrameTime = (float)frameTime;
 
                 // Note: this is the pixel format that `VideoTexture` expects internally
-                frame = ensureFramePixelFormat(frame, AVPixelFormat.AV_PIX_FMT_YUV420P);
+                frame = ensureFramePixelFormat(frame, AVPixelFormat.AV_PIX_FMT_NV12);
                 if (frame == null)
                     continue;
 
Index: osu.Framework/Graphics/Video/VideoTexture.cs
===================================================================
diff --git a/osu.Framework/Graphics/Video/VideoTexture.cs b/osu.Framework/Graphics/Video/VideoTexture.cs
--- a/osu.Framework/Graphics/Video/VideoTexture.cs	(revision bf0a870fc35278929759d8586942bcd9ebfa67ef)
+++ b/osu.Framework/Graphics/Video/VideoTexture.cs	(date 1635037660628)
@@ -78,7 +78,7 @@
                 Debug.Assert(memoryLease == null);
                 memoryLease = NativeMemoryTracker.AddMemory(this, Width * Height * 3 / 2);
 
-                textureIds = new int[3];
+                textureIds = new int[2];
                 GL.GenTextures(textureIds.Length, textureIds);
 
                 for (int i = 0; i < textureIds.Length; i++)
@@ -99,7 +99,7 @@
                         int width = (videoUpload.Frame->width + 1) / 2;
                         int height = (videoUpload.Frame->height + 1) / 2;
 
-                        GL.TexImage2D(TextureTarget2d.Texture2D, 0, TextureComponentCount.R8, width, height, 0, PixelFormat.Red, PixelType.UnsignedByte, IntPtr.Zero);
+                        GL.TexImage2D(TextureTarget2d.Texture2D, 0, TextureComponentCount.Rg8, width, height, 0, PixelFormat.Rg, PixelType.UnsignedByte, IntPtr.Zero);
 
                         textureSize += width * height;
                     }
@@ -116,12 +116,9 @@
             {
                 GLWrapper.BindTexture(textureIds[i]);
 
-                GL.PixelStore(PixelStoreParameter.UnpackRowLength, videoUpload.Frame->linesize[(uint)i]);
-                GL.TexSubImage2D(TextureTarget2d.Texture2D, 0, 0, 0, videoUpload.Frame->width / (i > 0 ? 2 : 1), videoUpload.Frame->height / (i > 0 ? 2 : 1),
-                    PixelFormat.Red, PixelType.UnsignedByte, (IntPtr)videoUpload.Frame->data[(uint)i]);
+                GL.TexSubImage2D(TextureTarget2d.Texture2D, 0, 0, 0, videoUpload.Frame->width / (i + 1), videoUpload.Frame->height / (i + 1),
+                    i == 0 ? PixelFormat.Red : PixelFormat.Rg, PixelType.UnsignedByte, (IntPtr)videoUpload.Frame->data[(uint)i]);
             }
-
-            GL.PixelStore(PixelStoreParameter.UnpackRowLength, 0);
 
             UploadComplete = true;
         }

Doesn't affect SW decoder path
Some codecs (like MediaCodec on Android) don't set `best_effort_timestamp` for some reason.
I only ran into these while using the MediaCodec decoder on Android. `avcodec_send_packet` would get stuck and always return EAGAIN because the buffered frames were never consumed.

Fixing this loop fixed video playback, but there were visual glitches: if `avcodec_send_packet` failed, the packet's data would get overwritten on the next iteration of `decodeNextFrame`, which created a gap in the video data that FFmpeg received.
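
A minimal sketch of the corrected pattern (assuming FFmpeg.AutoGen, with codecContext, packet and frame in scope; handleDecodedFrame is a hypothetical consumer, and this is not the PR's exact loop): when `avcodec_send_packet` reports EAGAIN, drain the buffered output with `avcodec_receive_frame` and resend the same packet rather than discarding it.

// Resend the same packet while the decoder reports EAGAIN; draining
// frames frees the decoder's internal buffers, so the packet's data
// is never silently dropped.
int sendResult;

do
{
    sendResult = ffmpeg.avcodec_send_packet(codecContext, packet);

    if (sendResult == ffmpeg.AVERROR(ffmpeg.EAGAIN))
    {
        // Decoder input is full; consume pending output first.
        while (ffmpeg.avcodec_receive_frame(codecContext, frame) == 0)
            handleDecodedFrame(frame);
    }
} while (sendResult == ffmpeg.AVERROR(ffmpeg.EAGAIN));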
The result is exactly the same, but the parameters used should make more sense now.
@Wieku commented Oct 26, 2021

I haven't looked in depth into this, but in my project I noticed a significant increase in color conversion speed by using Google's libyuv library over FFmpeg's swscale.

The screenshot compares 1080p60 encoding using libyuv (left) instead of libswscale (right) for RGB → I420 conversion: [screenshot]

@bdach (Collaborator) commented Oct 26, 2021

Ran this on my Linux box; it selected VDPAU and ran OK, so that one can probably be crossed off the list. It works fine, but I don't see a huge difference in the performance numbers I've looked at (mostly CPU and RAM; didn't check much else, truth be told).

@bdach (Collaborator) left a review comment

Have spent an hour and a bit reading through it, and to my untrained eye it looks solid enough. A few things raised question marks while I was reading, but by the end, after reading the whole thing, they kind of clicked into place. Interesting design.

osu.Framework/Graphics/Video/FFmpegFuncs.cs (outdated, resolved)
@@ -8,5 +8,31 @@ namespace osu.Framework.Graphics.Video
internal static class FfmpegExtensions
{
internal static double GetValue(this AVRational rational) => rational.num / (double)rational.den;

internal static bool IsHardwarePixelFormat(this AVPixelFormat pixFmt)
@bdach (Collaborator):

For anyone else reading this, I think this can be used to cross-reference: https://ffmpeg.org/doxygen/4.1/pixfmt_8h_source.html

Seems like all formats were caught except maybe AVPixelFormat.AV_PIX_FMT_XVMC, but that one is apparently pretty old/obsoleted by VDPAU/VA-API so...

@Opelkuh (Contributor, author):

Nice catch; I added it to the list. In reality most of these aren't really required, but I added them to live up to the function's name.

osu.Framework/Graphics/Video/HardwareVideoDecoder.cs (outdated, resolved)
if (targetHwDecoders.Count == 0)
break;

// Note: Intersect order here is important, order of the returned elements is determined by the first enumerable.
@bdach (Collaborator):

For what it's worth, this is documented behaviour, so it's probably safe to rely on it...?

@Opelkuh (Contributor, author):

I made sure that it was at least documented, but it's still not ideal. I'll change it to a proper sort. It should also sort the entire codecs list instead of just the devices in each codec separately.

Are you ok with sorting it by the enum value (where lowest = "best" device)?

@bdach (Collaborator):

Sorting by enum value sounds good.

@Opelkuh (Contributor, author):

I ended up with a separate Comparer, as otherwise I would have to shuffle between AVHWDeviceType and HardwareVideoDecoder.
The way it's written now should also move Quick Sync higher than DXVA, which the previous sorting didn't do correctly.
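
For illustration, a comparer along these lines would do it (a sketch only; the preference list and class name below are assumptions, not the PR's actual ranking):

using System;
using System.Collections.Generic;
using FFmpeg.AutoGen;

// Orders AVHWDeviceType values by preference; types not in the list sort last.
internal class HwDeviceTypeComparer : IComparer<AVHWDeviceType>
{
    // Illustrative preference order, placing Quick Sync above DXVA2 as discussed above.
    private static readonly AVHWDeviceType[] preference =
    {
        AVHWDeviceType.AV_HWDEVICE_TYPE_CUDA,  // NVDEC
        AVHWDeviceType.AV_HWDEVICE_TYPE_QSV,   // Intel Quick Sync Video
        AVHWDeviceType.AV_HWDEVICE_TYPE_DXVA2, // DirectX Video Acceleration 2
    };

    public int Compare(AVHWDeviceType x, AVHWDeviceType y) => rank(x).CompareTo(rank(y));

    private static int rank(AVHWDeviceType type)
    {
        int index = Array.IndexOf(preference, type);
        return index >= 0 ? index : int.MaxValue;
    }
}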

osu.Framework/Graphics/Video/VideoDecoder.cs (resolved)
@peppy (Member) commented Oct 27, 2021

Can confirm this is working on macOS. I think it's probably fine to have it on by default for desktops?

That said, I'm wondering if we want to add a framework-level configuration setting for this (a simple bool). It's quite feasible that this would be a user-facing checkbox on osu!'s side? At which point we may also want to consider the ability to toggle the decoder status without recreating the decoder.

Maybe that's over-complicating things and it should be always-on, as there are few drawbacks to having it enabled when available?

@bdach (Collaborator) commented Oct 27, 2021

wondering if we want to add a framework-level configuration setting for this (a simple bool)

There is one already in this pull. Mind you, it's not really a simple bool but a HardwareVideoDecoder flags enum value; it can work either way, though (as is, you could toggle that setting between None and Any).
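
For context, a sketch of the general shape of such a flags enum (illustrative only; the member names and values here are assumptions, not the PR's actual definition):

using System;

[Flags]
public enum HardwareVideoDecoder
{
    None = 0,

    // illustrative per-API flags
    NVDEC = 1,
    QuickSyncVideo = 1 << 1,
    VideoToolbox = 1 << 2,

    // all bits set, so any newly added decoder is included too
    Any = ~0,
}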

@peppy (Member) commented Oct 27, 2021

Right, specifically the bindable part then, which isn't currently implemented to update on the fly.

@Opelkuh (Contributor, author) commented Oct 28, 2021

I wasn't sure if it was possible to change the FFmpeg decoder mid-playback, but it turned out to be pretty simple to implement, so now it's properly bound to the config. I'll note, though, that the video lags for a short period when changing it, and it sometimes spits out some warnings until the new decoder has received enough of the video stream.
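
The binding boils down to scheduling a codec context rebuild on the decoder thread when the setting changes (a rough sketch; decoderCommands and recreateCodecContext appear in the diff bdach posts below, while the bindable name is assumed):

// When the config bindable changes, rebuild only the codec context on the
// decoder thread instead of recreating the whole VideoDecoder.
hwVideoDecoder.BindValueChanged(e =>
{
    if (formatContext == null)
        return;

    decoderCommands.Enqueue(recreateCodecContext);
});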

I added a toggle for this to the tests so you can try it from there.

@peppy (Member) commented Oct 28, 2021

That's pretty amazing, I was expecting us to have to recreate the decoder object completely.

@Opelkuh (Contributor, author) commented Oct 28, 2021

Recreating the whole VideoDecoder was the first thing I tried, but that didn't work at all, presumably because FFmpeg needs to read some headers at the start of the file.

I think it's probably fine to have it on by default for desktops?

Sorry, I missed this yesterday. I was mainly concerned with the increased RAM usage for little to no benefit on lower bitrate videos when using NVDEC (it adds ~100MB). But it seems that other APIs don't have this "problem", and as some people like to say, unused RAM is wasted RAM. So I enabled it by default on all platforms; after all, people can always turn it off.

@bdach (Collaborator) commented Oct 28, 2021

Have checked the mid-playback decoder switch and it does look to be working pretty great, aside from the mentioned hitch immediately after switching. I've also semi-experimentally tested how it would behave if the HW codec failed to initialise, by applying the following:

diff --git a/osu.Framework/Graphics/Video/VideoDecoder.cs b/osu.Framework/Graphics/Video/VideoDecoder.cs
index 65af89ec9c..1fa5ec9f69 100644
--- a/osu.Framework/Graphics/Video/VideoDecoder.cs
+++ b/osu.Framework/Graphics/Video/VideoDecoder.cs
@@ -137,6 +137,9 @@ public VideoDecoder(Stream videoStream)
                 if (formatContext == null)
                     return;
 
+                if (e.NewValue == HardwareVideoDecoder.Any)
+                    decoderCommands.Enqueue(() => throw new InvalidOperationException());
+
                 decoderCommands.Enqueue(recreateCodecContext);
             });
         }

The good part is that it doesn't crash; the bad part is that it leaves the decoder in a faulted state. But I think that's fine: being able to hot-swap in the first place is already quite something. I'd leave that be until it's deemed a problem.

bdach previously approved these changes Oct 28, 2021
@Opelkuh (Contributor, author) commented Oct 28, 2021

It should (hopefully) happen very rarely, as it should always at least fall back to SW decoding when everything else fails:

codecs.Add((firstCodec, AVHWDeviceType.AV_HWDEVICE_TYPE_NONE));
