
Conversation


@NicolasHug NicolasHug commented Jul 27, 2025

This PR improves the color conversion of BT.709 full-range videos. On main, we were always using nppiNV12ToRGB_709CSC_8u_P2C3R_Ctx, which is for limited (studio) range. For full range, NPP provides nppiNV12ToRGB_709HDTV_8u_P2C3R_Ctx. But when I tried it, its output didn't match our CPU results very closely; in fact, the results were further off than with nppiNV12ToRGB_709CSC_8u_P2C3R_Ctx. So instead, we now rely on a custom color-conversion matrix, which produces results very close to the CPU ones on CUDA 12.9.

I will leave the PR description at that. I tried to document a lot of the color-conversion process in a note in the code, and I will extend that note when finalizing 10-bit support, hopefully soon.
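For readers who want the gist before reading the note: a full-range BT.709 YUV-to-RGB matrix can be derived from the standard luma coefficients Kr and Kb. The sketch below shows the textbook derivation; these are not necessarily the exact values used in the code.

```python
# Sketch: deriving a full-range BT.709 YUV -> RGB matrix from the luma
# coefficients Kr and Kb (textbook derivation; the authoritative values
# live in the note in the C++ code).
Kr, Kb = 0.2126, 0.0722  # BT.709 luma coefficients
Kg = 1.0 - Kr - Kb

# Rows map (Y, U, V) -> (R, G, B), with Y in [0, 1] and U, V in [-0.5, 0.5].
matrix = [
    [1.0, 0.0, 2.0 * (1.0 - Kr)],                                     # R
    [1.0, -2.0 * Kb * (1.0 - Kb) / Kg, -2.0 * Kr * (1.0 - Kr) / Kg],  # G
    [1.0, 2.0 * (1.0 - Kb), 0.0],                                     # B
]

def yuv_to_rgb(y, u, v):
    return tuple(r[0] * y + r[1] * u + r[2] * v for r in matrix)
```

A neutral pixel (u = v = 0) maps to r = g = b = y, which is a quick sanity check that the Y column is correct.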

Comparing CPU frame (top) with GPU frame (bottom) on CUDA 12.9

On main

bt709_full_range_cpu_gpu_comparison_main

This PR

bt709_full_range_cpu_gpu_comparison

@NicolasHug NicolasHug marked this pull request as draft July 27, 2025 14:28
@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) Jul 27, 2025
@NicolasHug NicolasHug changed the title Support 10 bits videos on CUDA [WIP] Fix full-range CUDA decoding Jul 27, 2025
@NicolasHug NicolasHug changed the title [WIP] Fix full-range CUDA decoding Fix BT709 full-range CUDA color conversion Aug 15, 2025
@NicolasHug NicolasHug marked this pull request as ready for review August 15, 2025 09:52
gpu_frame = decoder_gpu.get_frame_at(frame_index).data.cpu()
cpu_frame = decoder_cpu.get_frame_at(frame_index).data

torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2)
NicolasHug (Contributor, Author):

On main this fails on BT709_FULL_RANGE with

E           AssertionError: Tensor-likes are not close!
E           
E           Mismatched elements: 2012707 / 2764800 (72.8%)
E           Greatest absolute difference: 20 at index (0, 62, 206) (up to 2 allowed)
E           Greatest relative difference: 1.0 at index (0, 0, 640) (up to 0 allowed)

torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2)
elif cuda_version_used_for_building_torch() == (12, 8):
    assert psnr(gpu_frame, cpu_frame) > 20

NicolasHug (Contributor, Author):

So, I developed this PR on CUDA 12.9, and I was unconditionally using torch.testing.assert_close(gpu_frame, cpu_frame, rtol=0, atol=2) which was passing. And it's passing on the 12.9 CI job.

When I submitted the PR and the CI tested on CUDA 12.6, I realized the test wasn't passing. I'm unable to tell by how much, and I'm unable to reproduce locally because I don't have 12.6, and I can't ssh into the runner either.

12.8 is producing OK results, with a psnr of ~24, but it's not as good as with 12.9.
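For reference, the PSNR metric used in that threshold can be sketched as below. This is a standalone version working on flat Python sequences of 8-bit values; the actual test helper presumably operates on uint8 tensors.

```python
import math

def psnr(a, b, max_val=255.0):
    # Peak signal-to-noise ratio, in dB, between two equal-length
    # sequences of pixel values; higher means closer.
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A uniform absolute error of 2 corresponds to roughly 42 dB, so a PSNR around 24 implies a mean squared error in the hundreds.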

I honestly think we should treat these as bugs in NPP that were eventually fixed in 12.9. I can't imagine us having to use different code paths depending on the current runtime CUDA version: that sounds too complicated, and I'm not even sure it's doable. In particular, compile-time #define checks wouldn't be enough, because we can compile on 12.9 and run on 12.8.

Note that 12.6 is considered legacy support for torch from now on: pytorch/pytorch#159980
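For context, a helper like cuda_version_used_for_building_torch() can be sketched by parsing torch.version.cuda, which is a string such as "12.8". This is an assumed implementation, shown without the torch dependency; note it reports the compile-time version, not the runtime one, which is exactly the limitation discussed above.

```python
def parse_cuda_version(version_string):
    # Turn a version string like "12.8" into a tuple that compares
    # numerically, e.g. (12, 8) < (12, 9) < (12, 10).
    return tuple(int(part) for part in version_string.split("."))

# In the test suite this would be fed torch.version.cuda, i.e. the CUDA
# version torch was *built* with, not the one installed at runtime.
```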


scotts commented Aug 18, 2025

Is there any documentation that explains "709 HDTV full color conversion" versus "RGB 709 CSC color conversion" in more detail? Whenever I search for it, I always land back at the Nvidia docs. Is it an Nvidia-only concept, or a part of the standard but using a different name?

NicolasHug (Contributor, Author) replied:

You probably won't find resources directly addressing this. It mostly has to do with the color range, full vs limited: Nvidia calls those "HDTV" and "CSC" respectively.

But the color-range (full vs limited) is an orthogonal concept to the color space (709), so it's unlikely you'll find resources explaining both concepts simultaneously for 709 specifically.

I am also really struggling to find decent resources that just explain the color range. It seems to be the kind of concept that you just have to know about. Almost all the Wikipedia pages I'm reading assume pre-existing knowledge of it. There is https://www.highresolution.tv/2024/07/02/hre-shorts-color-range/ which sounds decent, but TBH I've used LLMs to build my understanding.

When working on 10bit support, I will be extending the note to discuss limited range.
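To make the full-vs-limited distinction concrete, here is a sketch of the standard 8-bit quantization formulas: full range uses all 256 codes, while limited (studio) range reserves headroom and footroom. Function names are illustrative, not from the codebase.

```python
# Full range: normalized Y in [0, 1] uses all of [0, 255].
def quantize_full_y(y):
    return round(255 * y)

# Limited (studio) range: Y maps to [16, 235], chroma to [16, 240].
def quantize_limited_y(y):
    return round(16 + 219 * y)

def quantize_limited_c(c):  # c is centered chroma in [-0.5, 0.5]
    return round(128 + 224 * c)
```

This is why decoding limited-range content as if it were full range (or vice versa) produces washed-out or over-contrasted frames rather than a hard failure.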


scotts commented Aug 18, 2025

Yeah, I'm going through your comment (which is amazing!) and I'm coming to the understanding that these are just the names Nvidia uses for full-range and limited-range color.

// defined ours above. HOWEVER, the function itself expects Y to be in [0, 255]
// and U,V to be in [-128, 127]. So U and V need to be offset by -128 to center
// them around 0. The offsets can be applied by adding a 4th column to the
// matrix:
scotts (Contributor):

I get lost in this paragraph.

they expect the matrix coefficient to assume the input Y is in [0, 1], and [U, V] in [-0.5, 0.5]. We're in luck, that's how we defined ours above.

Who is "they"? The NPP functions? And at this point, the understanding I personally have from the text is that our matrix is aligned with what "they" want. But:

HOWEVER, the function itself expects Y to be in [0, 255] and U,V to be in [-128, 127]. So U and V need to be offset by -128 to center them around 0. The offsets can be applied by adding a 4th column to the matrix:

Now I'm confused: I feel like the first sentence I quoted said that Y is in [0, 1], but the second one says it's in [0, 255]. Or are "they" and "the function itself" different entities? By "the function itself," do you mean nppiNV12ToRGB_8u_ColorTwist32f_P2C3R_Ctx() from NPP? Same confusion applies to U and V.

What I think may be the case is that by "they" you mean "the field of people who understand this stuff and defined it" and by "this function" you mean the specific function from NPP that implements the functionality we need. Is that the case?

NicolasHug (Contributor, Author) replied:
I'll try to clarify. This is all very confusing to me too, because the API really is quite messy.

# ffmpeg -f lavfi -i testsrc2=duration=1:size=1920x720:rate=30 \
# -c:v libx264 -pix_fmt yuv420p -color_primaries bt709 -color_trc bt709 \
# -colorspace bt709 -color_range pc bt709_full_range.mp4
#
scotts (Contributor):
Up until this PR, we've maintained the rule that all reference resources can be generated from the generate_reference_resources.sh script. Is that something we want to continue to maintain? I think there is a lot of value in it, but that script is also not the cleanest artifact.

NicolasHug (Contributor, Author) commented Aug 19, 2025:
From what I can tell, this script generates the bmp / pt reference frames, but not the source videos themselves. I see similar comments there indicating how the videos were generated:

https://github.com/pytorch/torchcodec/blob/ffcb7ab2e98c204dfb103f46b2db154cbf1aa713/test/generate_reference_resources.sh#L65-L66

Here we're not generating or using reference frames; we're just comparing the CPU output with the GPU output.

I agree we should also try to check against a ground-truth reference, but I'll leave that as a follow-up if that's OK.

scotts (Contributor) replied:
We're actually doing a bit of both, which is messy. I'm going to create an issue about it.


scotts commented Aug 19, 2025

@NicolasHug, thank you for this fix! This is an enormous improvement in the quality of our CUDA support, and it is extremely non-obvious!

@NicolasHug NicolasHug merged commit 4af0bfe into meta-pytorch:main Aug 19, 2025
47 checks passed
@NicolasHug NicolasHug deleted the studiorange branch August 19, 2025 12:52