Fix BT709 full-range CUDA color conversion #791
Conversation
test/test_decoders.py (outdated diff)
gpu_frame = decoder_gpu.get_frame_at(frame_index).data.cpu()
cpu_frame = decoder_cpu.get_frame_at(frame_index).data

torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2)
On main this fails on BT709_FULL_RANGE with:

E   AssertionError: Tensor-likes are not close!
E
E   Mismatched elements: 2012707 / 2764800 (72.8%)
E   Greatest absolute difference: 20 at index (0, 62, 206) (up to 2 allowed)
E   Greatest relative difference: 1.0 at index (0, 0, 640) (up to 0 allowed)
    torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2)
elif cuda_version_used_for_building_torch() == (12, 8):
    assert psnr(gpu_frame, cpu_frame) > 20
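For context, psnr here is a helper from the test suite. A minimal sketch of what such a PSNR computation could look like (the real helper may differ in signature and details):

```python
import torch

def psnr(a: torch.Tensor, b: torch.Tensor, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two uint8 image tensors, in dB.

    Higher is better; identical tensors give infinity. Around 20 dB (the
    threshold used in the test above) means a visibly lossy but roughly
    similar image.
    """
    a = a.to(torch.float64)
    b = b.to(torch.float64)
    mse = torch.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")
    return float(10 * torch.log10(max_val ** 2 / mse))
```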
So, I developed this PR on CUDA 12.9, and I was unconditionally using torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2), which was passing. And it's passing on the 12.9 CI job.
When I submitted the PR and the CI tested on CUDA 12.6, I realized the test wasn't passing there. I'm unable to tell by how much, and I'm unable to reproduce locally because I don't have 12.6, and I can't ssh into the runner either.
12.8 is producing OK results, with a PSNR of ~24, but it's not as good as with 12.9.
I honestly think we should treat this as bugs in NPP that were eventually fixed in 12.9. I can't imagine us having to use different code-paths depending on the current runtime CUDA version. That sounds too complicated, and I'm not even sure it's doable. I.e. this isn't about compile-time #define checks; those wouldn't be enough, because we can compile on 12.9 and run on 12.8.
Note that 12.6 is considered legacy support in torch from now on: pytorch/pytorch#159980
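As a side note on detecting the build-time (as opposed to runtime) CUDA version: torch.version.cuda reports the toolkit torch was built against. A plausible sketch of a helper like cuda_version_used_for_building_torch (the real one lives in the test utilities and may differ):

```python
import torch

def cuda_version_used_for_building_torch():
    # torch.version.cuda is the CUDA toolkit version torch was *built*
    # against, e.g. "12.6" -- which, as discussed above, can differ from
    # the CUDA runtime actually present on the machine.
    if torch.version.cuda is None:  # CPU-only build
        return None
    major, minor = torch.version.cuda.split(".")[:2]
    return (int(major), int(minor))
```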
Is there any documentation that explains "709 HDTV full color conversion" versus "RGB 709 CSC color conversion" in more detail? Whenever I search for it, I always land back at the Nvidia docs. Is it an Nvidia-only concept, or part of the standard under a different name?
You probably won't find resources directly addressing this. It mostly has to do with the color range: full vs limited. NVidia calls those "HDTV" and "CSC". But the color range (full vs limited) is a concept orthogonal to the color space (709), so it's unlikely you'll find resources explaining both concepts simultaneously for 709 specifically.
I am also really struggling to find decent resources that just talk about the color range. It seems to be the kind of concept that you just have to know about; almost all the Wikipedia pages I'm reading assume pre-existing knowledge of it. There is https://www.highresolution.tv/2024/07/02/hre-shorts-color-range/ which sounds decent, but TBH I've used LLMs to build my understanding. When working on 10-bit support, I will be extending the note to discuss limited range.
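To make the range distinction concrete, here is a small sketch (my own illustration, not from the PR): limited ("studio"/"tv") range encodes black..white as luma in [16, 235] (chroma in [16, 240]), while full ("pc") range uses the whole [0, 255]; expanding limited-range luma to full range is just a linear rescale:

```python
import numpy as np

def limited_to_full_luma(y: np.ndarray) -> np.ndarray:
    """Expand limited-range 8-bit luma ([16, 235]) to full range ([0, 255])."""
    y = y.astype(np.float64)
    # Map 16 -> 0 and 235 -> 255 linearly; clip values outside the legal range.
    full = (y - 16.0) * 255.0 / (235.0 - 16.0)
    return np.clip(full, 0, 255).round().astype(np.uint8)
```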
Yeah, I'm going through your comment (which is amazing!) and I'm coming to the understanding that these are just the names Nvidia uses for full-range and limited-range color.
// defined ours above. HOWEVER, the function itself expects Y to be in [0, 255]
// and U,V to be in [-128, 127]. So U and V need to be offset by -128 to center
// them around 0. The offsets can be applied by adding a 4th column to the
// matrix:
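The code comment above describes NPP's "color twist" mechanism: a 3x4 matrix whose left 3x3 holds the YUV-to-RGB coefficients and whose 4th column holds per-channel offsets applied after the multiply. A small numpy sketch of that mechanism (an illustration of the idea, not the PR's actual CUDA code):

```python
import numpy as np

def apply_twist(yuv: np.ndarray, m3x3: np.ndarray) -> np.ndarray:
    """Apply a 3x4 color-twist matrix to a (3, N) array of YUV pixels.

    The -128 chroma offset is folded into the 4th column: the offsets are
    what the 3x3 matrix produces for the input (0, -128, -128).
    """
    offsets = m3x3 @ np.array([0.0, -128.0, -128.0])
    m3x4 = np.hstack([m3x3, offsets[:, None]])          # (3, 4)
    ones = np.ones((1,) + yuv.shape[1:])
    yuv1 = np.concatenate([yuv, ones])                  # (4, N): homogeneous
    return m3x4 @ yuv1                                  # (3, N) RGB
```

With this folding, a mid-gray pixel (Y=U=V=128) maps to R=G=B=128, because the centered chroma terms cancel exactly.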
I get lost in this paragraph.
they expect the matrix coefficient to assume the input Y is in [0, 1], and [U, V] in [-0.5, 0.5]. We're in luck, that's how we defined ours above.
Who is "they"? The NPP functions? At this point, my understanding from the text is that our matrix is aligned with what "they" want. But:
HOWEVER, the function itself expects Y to be in [0, 255] and U,V to be in [-128, 127]. So U and V need to be offset by -128 to center them around 0. The offsets can be applied by adding a 4th column to the matrix:
Now I'm confused: the first sentence I quoted says Y is in [0, 1], but the second says it's in [0, 255]. Or are "they" and "the function itself" different entities? By "the function itself," do you mean nppiNV12ToRGB_8u_ColorTwist32f_P2C3R_Ctx() from NPP? The same confusion applies to U and V.
What I think may be the case is that by "they" you mean "the field of people who understand this stuff and defined it," and by "this function" you mean the specific NPP function that implements the functionality we need. Is that the case?
I'll try to clarify: this is all very confusing to me too, because the API really is quite messy.
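One way to square the two statements above (my own numeric check, not NPP documentation): the matrix coefficients are derived for normalized inputs (Y in [0, 1], U/V in [-0.5, 0.5]), but since the transform is linear, feeding 8-bit values through the same 3x3 matrix simply scales the outputs by 255 as well. The only step that must happen explicitly is subtracting the +128 chroma bias, since that bias is an affine shift, not a scale:

```python
import numpy as np

# Approximate BT.709 full-range YUV -> RGB coefficients, for normalized inputs.
M = np.array([[1.0,  0.0,      1.5748],
              [1.0, -0.18733, -0.46813],
              [1.0,  1.85563,  0.0]])

# A pixel in normalized form: Y in [0, 1], U/V in [-0.5, 0.5].
y, u, v = 0.5, 0.1, -0.2
rgb_norm = M @ np.array([y, u, v])

# The same pixel as 8-bit samples: Y scaled to [0, 255], chroma biased by +128.
Y, U, V = 255 * y, 255 * u + 128, 255 * v + 128
rgb_8bit = M @ np.array([Y, U - 128, V - 128])

# Linearity: after removing the bias, the 8-bit path is exactly 255x the
# normalized path -- so the same matrix serves both conventions.
assert np.allclose(rgb_8bit, 255 * rgb_norm)
```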
# ffmpeg -f lavfi -i testsrc2=duration=1:size=1920x720:rate=30 \
#   -c:v libx264 -pix_fmt yuv420p -color_primaries bt709 -color_trc bt709 \
#   -colorspace bt709 -color_range pc bt709_full_range.mp4
#
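As a sanity check that the command above really produces a full-range file, one can regenerate the asset and probe the stream metadata (assumes ffmpeg/ffprobe with libx264 are installed; color_range should read pc for full range, tv for limited):

```shell
# Regenerate the test asset with the command documented above.
ffmpeg -y -f lavfi -i testsrc2=duration=1:size=1920x720:rate=30 \
  -c:v libx264 -pix_fmt yuv420p -color_primaries bt709 -color_trc bt709 \
  -colorspace bt709 -color_range pc bt709_full_range.mp4

# Inspect the color metadata ffmpeg wrote into the stream.
ffprobe -v error -select_streams v:0 \
  -show_entries stream=color_range,color_space,color_transfer,color_primaries \
  -of default=noprint_wrappers=1 bt709_full_range.mp4
```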
Up until this PR, we've maintained the rule that all generated references can be generated from the generate_reference_resources.sh script. Is that something we want to continue to maintain? I think there is a lot of value in it, but that script is also not the cleanest artifact.
From what I can tell, this script generates the bmp / pt reference frames, but not the source videos themselves. I see similar comments there indicating how the videos were generated.
Here we're not generating or using the reference frames; we're just comparing the CPU output with the GPU output.
I agree we should also try to check against a ground-truth reference, but I'll leave that as a follow-up if that's OK.
We're actually doing a bit of both, which is messy. I'm going to create an issue about it.
@NicolasHug, thank you for this fix! This is an enormous improvement in the quality of our CUDA support, and the fix is extremely non-obvious!
This PR improves the color conversion of BT.709 full-range videos. In main, we were always using nppiNV12ToRGB_709CSC_8u_P2C3R_Ctx, which is for limited (studio) range. For full range, NPP provides nppiNV12ToRGB_709HDTV_8u_P2C3R_Ctx. But as I tried it, it doesn't seem to match our CPU results very closely; in fact, they were further off than with nppiNV12ToRGB_709CSC_8u_P2C3R_Ctx. So instead, we now rely on a custom color-conversion matrix, which produces results very, very close to the CPU output on CUDA 12.9.
I will leave this PR description here. I tried to document a lot of the color conversion process in a note, in the code, and I will be extending that note when finalizing the 10-bit support, hopefully soon.
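For reference, the BT.709 full-range YUV-to-RGB coefficients follow directly from the standard's luma weights Kr and Kb; here is a generic derivation (a sketch of the standard math, not necessarily the exact matrix this PR ships):

```python
import numpy as np

# BT.709 luma weights, from the standard.
Kr, Kb = 0.2126, 0.0722
Kg = 1.0 - Kr - Kb

# Full-range YUV -> RGB matrix, derived from the weights:
#   R = Y + 2(1-Kr)*V
#   G = Y - 2*Kb*(1-Kb)/Kg * U - 2*Kr*(1-Kr)/Kg * V
#   B = Y + 2(1-Kb)*U
# (U and V here are chroma already centered around 0.)
M = np.array([
    [1.0,  0.0,                         2.0 * (1.0 - Kr)],
    [1.0, -2.0 * Kb * (1.0 - Kb) / Kg, -2.0 * Kr * (1.0 - Kr) / Kg],
    [1.0,  2.0 * (1.0 - Kb),            0.0],
])
```

Plugging in the BT.709 weights recovers the familiar coefficients 1.5748, -0.1873, -0.4681, and 1.8556.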
[Images: comparing the CPU frame (top) with the GPU frame (bottom) on CUDA 12.9, once on main and once with this PR.]