
ahmadsharif1 (Contributor) commented on Oct 29, 2024:

  1. Allocate the batch tensor on the correct device: when a CUDA device is passed in, the tensor is now allocated on that device (see the sketch after this list).
  2. Pass a view of the batch tensor into the color conversion function convertAVFrameToDecodedOutputOnCuda().
  3. Add a test that checks frame contents.
  4. Add a TODO to eventually merge preAllocatedOutputTensor into RawDecodedOutput, because passing in two output data pointers doesn't make sense.
  5. Add a device member to the VideoDecoder class.
  6. Update the sampler benchmark to take device and video arguments from the command line.
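
For illustration, a minimal Python sketch of the allocation pattern in items 1 and 2. The actual implementation is C++; the names and tensor layout below (allocate_batch, num_frames, height, width) are placeholders, not the PR's real identifiers.

    import torch

    def allocate_batch(num_frames, height, width, device):
        # Allocate the batch output tensor directly on the requested device,
        # so CUDA decoding can write frames into device memory without an
        # extra host-to-device copy.
        return torch.empty(
            (num_frames, height, width, 3), dtype=torch.uint8, device=device
        )

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    batch = allocate_batch(num_frames=10, height=480, width=640, device=device)
    frame_view = batch[0]  # a view into the batch; the decoder writes into it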

Sampler benchmark results:

CPU:
python benchmarks/samplers/benchmark_samplers.py --device=cpu
----------
num_clips = 1
clips_at_random_indices     med = 23.16ms +- 16.18  med fps = 431.8
clips_at_regular_indices    med = 5.67ms +- 0.43  med fps = 1764.3
clips_at_random_timestamps  med = 22.54ms +- 16.21  med fps = 443.7
clips_at_regular_timestamps med = 7.46ms +- 5.66  med fps = 1339.7
----------
num_clips = 50
clips_at_random_indices     med = 2400.86ms +- 803.05  med fps = 208.3
clips_at_regular_indices    med = 1343.50ms +- 288.18  med fps = 372.2
clips_at_random_timestamps  med = 1170.24ms +- 727.77  med fps = 427.3
clips_at_regular_timestamps med = 950.92ms +- 294.30  med fps = 515.3

CUDA:
python benchmarks/samplers/benchmark_samplers.py --device=cuda:0
----------
num_clips = 1
[AVHWDeviceContext @ 0x8793680] Using current CUDA context.
clips_at_random_indices     med = 245.46ms +- 116.64  med fps = 40.7
clips_at_regular_indices    med = 284.49ms +- 39.86  med fps = 35.2
clips_at_random_timestamps  med = 264.93ms +- 115.74  med fps = 37.7
clips_at_regular_timestamps med = 283.26ms +- 9.99  med fps = 35.3
----------
num_clips = 50
[AVHWDeviceContext @ 0x8d0d680] Using current CUDA context.
clips_at_random_indices     med = 308.00ms +- 104.52  med fps = 1623.4
clips_at_regular_indices    med = 286.54ms +- 12.69  med fps = 1744.9
clips_at_random_timestamps  med = 368.12ms +- 105.73  med fps = 1358.3
clips_at_regular_timestamps med = 285.32ms +- 13.19  med fps = 1717.4

CUDA is only worth it when decoding many clips (where it can win on throughput), and potentially for higher-resolution videos.

Interestingly, the run-to-run variability on CUDA is quite low.

@facebook-github-bot added the CLA Signed label on Oct 29, 2024
@ahmadsharif1 marked this pull request as ready for review on October 29, 2024, 20:19
NicolasHug (Contributor) left a comment:

Thank you @ahmadsharif1 . Only minor suggestions from me.

This is not immediately related to this PR, but now that we publicly expose CUDA, we'll want to beef up our CUDA tests. They're pretty minimal right now. The test utils that I linked to below will be useful. Let's follow up on that in a separate PR (happy to help).

rawOutput, output, preAllocatedOutputTensor);
} else if (streamInfo.options.device.type() == torch::kCUDA) {
// TODO: handle pre-allocated output tensor
// TODO: we should fold preAllocatedOutputTensor into RawDecodedOutput.
Contributor:

Nit: move this TODO outside of this if/else block (just on top of it?), because it applies to the CPU branch as well, not just to CUDA. We may also want to open an issue?

ahmadsharif1 (Author):

Done

instances of ``VideoDecoder`` in parallel. Use a higher number for multi-threaded
decoding which is best if you are running a single instance of ``VideoDecoder``.
Default: 1.
device (str or torch.device, optional): The device to use for decoding.
Contributor:

Suggested change:
- device (str or torch.device, optional): The device to use for decoding.
+ device (str or torch.device, optional): The device to use for decoding. Default: "cpu".
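
For context, a hedged usage sketch of the parameter being documented; the file name is hypothetical, and this assumes the public VideoDecoder constructor accepts device as documented:

    from torchcodec.decoders import VideoDecoder

    # device accepts a str or a torch.device; frames are decoded on that device.
    decoder = VideoDecoder("video.mp4", device="cuda:0")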

decoding which is best if you are running a single instance of ``VideoDecoder``.
Default: 1.
device (str or torch.device, optional): The device to use for decoding.
Contributor:

This wasn't introduced in this PR, but we might as well fix it here: the .. note:: below should be part of the description of the dimension_order parameter. Do you mind moving it back up?

ahmadsharif1 (Author):

done

test/utils.py (outdated)


# Asserts that at most percentage of the elements are different by more than abs_tolerance.
def assert_tensor_nearly_equal(frame1, frame2, percentage=0.3, abs_tolerance=20):
Contributor:

Nit regarding the name: we already have assert_tensor_close, which semantically conveys the same meaning as "nearly equal" to me, so the distinction between the two isn't obvious. Maybe assert_tensor_close_on_at_least(...)?

Contributor:

Agreed. Even better if we can use the same utility function; I suspect that the logic in this function is quite similar to what torch.testing.assert_close() already does.

The answer might also be that we eliminate both assert_tensor_close() and assert_tensor_nearly_equal(), and just use plain torch.testing.assert_close() with scenario-specific tolerances.

ahmadsharif1 (Author):

We are doing something different here compared to assert_close.

I could use assert_close, but the tolerances were quite high. I actually did use it in my first PR:

#242 (comment)
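
For context, a minimal sketch of what such a percentage-based check might look like; this is an illustration only (it assumes percentage is expressed in percent), not necessarily the PR's exact implementation:

    import torch

    def assert_tensor_nearly_equal(frame1, frame2, percentage=0.3, abs_tolerance=20):
        # Fail only if more than `percentage` percent of the elements
        # differ by more than `abs_tolerance`.
        diff = (frame1.float() - frame2.float()).abs()
        num_too_far = (diff > abs_tolerance).sum().item()
        max_allowed = frame1.numel() * percentage / 100.0
        assert num_too_far <= max_allowed, (
            f"{num_too_far} elements differ by more than {abs_tolerance}; "
            f"at most {max_allowed:.0f} allowed"
        )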

Comment on lines +225 to +227
# TODO: Figure out how to parameterize this test to run on both CPU and CUDA.
# The question is how to have the @needs_cuda decorator with the pytest.mark.parametrize
# decorator on the same test.
Contributor:

It's simple!

We just need to define this new util

https://github.com/pytorch/vision/blob/e9a3213524a0abd609ac7330cf170b9e19917d39/test/common_utils.py#L122-L125

and it can be used like this

https://github.com/pytorch/vision/blob/e9a3213524a0abd609ac7330cf170b9e19917d39/test/test_utils.py#L221

If you want, we can merge this PR as-is and follow up with that.
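
For reference, the linked torchvision util is essentially the following (sketched from the links above; see them for the exact code):

    import pytest

    # Yields "cpu" unconditionally, plus "cuda" marked so that it is
    # skipped on machines without a GPU.
    def cpu_and_cuda():
        return ("cpu", pytest.param("cuda", marks=pytest.mark.needs_cuda))

    # Usage:
    # @pytest.mark.parametrize("device", cpu_and_cuda())
    # def test_frame_contents(device):
    #     ...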

ahmadsharif1 (Author):

I'll do that as a follow-up

assert_tensor_equal(frames0and180[1], reference_frame180)

@needs_cuda
def test_get_frames_at_indices_with_cuda(self):
Contributor:

We'll also want to test get_frames_in_range and all the batch APIs?
I feel like we should be parametrizing a fair amount of our tests, but this can be done as a follow-up.

ahmadsharif1 (Author):

I'll do that as a follow-up

@ahmadsharif1 merged commit dc16154 into meta-pytorch:main on Oct 30, 2024 (37 of 40 checks passed)
@ahmadsharif1 deleted the cuda13 branch on October 30, 2024, 15:57
@ahmadsharif1 mentioned this pull request on Oct 30, 2024