91 changes: 56 additions & 35 deletions src/torchcodec/decoders/_core/VideoDecoder.cpp
@@ -34,6 +34,31 @@ double ptsToSeconds(int64_t pts, const AVRational& timeBase) {
return ptsToSeconds(pts, timeBase.den);
}

// Returns a [N]CHW *view* of a [N]HWC input tensor, if the options require it.
// The [N] leading batch-dimension is optional, i.e. the input tensor can be 3D
// or 4D.
// Calling permute() is guaranteed to return a view as per the docs:
// https://pytorch.org/docs/stable/generated/torch.permute.html
torch::Tensor MaybePermuteHWC2CHW(
const VideoDecoder::VideoStreamDecoderOptions& options,
torch::Tensor& hwcTensor) {
if (options.dimensionOrder == "NHWC") {
return hwcTensor;
}
auto numDimensions = hwcTensor.dim();
auto shape = hwcTensor.sizes();
if (numDimensions == 3) {
TORCH_CHECK(shape[2] == 3, "Not a HWC tensor: ", shape);
Contributor

Is this robust if the width/height is 3?

Contributor Author (@NicolasHug, Oct 18, 2024)

It will have a false positive for the extremely rare (and probably degenerate) case where a video's width is 3.

This check is the very best we can do at this stage. The alternative is to not check anything.
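
To make the false positive concrete, here is a small standalone sketch (hypothetical sizes, not code from this PR):

#include <torch/torch.h>

int main() {
  // A (degenerate) 3-pixel-wide frame in CHW layout: {C=3, H=720, W=3}.
  auto chw = torch::empty({3, 720, 3}, torch::kUInt8);
  // The heuristic sees shape[2] == 3 and would classify this as HWC
  // (a 3x720 image with 3 channels). Both readings are consistent, so
  // no shape check can tell them apart; typical widths (e.g. 1920)
  // make the check unambiguous in practice.
  TORCH_CHECK(chw.sizes()[2] == 3, "Not a HWC tensor: ", chw.sizes());
  return 0;
}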

return hwcTensor.permute({2, 0, 1});
} else if (numDimensions == 4) {
TORCH_CHECK(shape[3] == 3, "Not a NHWC tensor: ", shape);
return hwcTensor.permute({0, 3, 1, 2});
} else {
TORCH_CHECK(
false, "Expected tensor with 3 or 4 dimensions, got ", numDimensions);
}
}

struct AVInput {
UniqueAVFormatContext formatContext;
std::unique_ptr<AVIOBytesContext> ioBytesContext;
@@ -167,28 +192,13 @@ VideoDecoder::BatchDecodedOutput::BatchDecodedOutput(
const VideoStreamDecoderOptions& options,
const StreamMetadata& metadata)
: ptsSeconds(torch::empty({numFrames}, {torch::kFloat64})),
durationSeconds(torch::empty({numFrames}, {torch::kFloat64})) {
if (options.dimensionOrder == "NHWC") {
frames = torch::empty(
{numFrames,
options.height.value_or(*metadata.height),
options.width.value_or(*metadata.width),
3},
{torch::kUInt8});
} else if (options.dimensionOrder == "NCHW") {
frames = torch::empty(
{numFrames,
3,
options.height.value_or(*metadata.height),
options.width.value_or(*metadata.width)},
torch::TensorOptions()
.memory_format(torch::MemoryFormat::ChannelsLast)
Contributor

I am wondering whether using this is identical to permuting an NHWC tensor at the end. I am not 100% sure. Do you know?

Contributor

I meant: in terms of performance, is it identical? Specifically, this is a bit concerning:

Uploading image.png…

We really should have a batch benchmark because the whole point of batch is to do things in a performant way for many frames.

Contributor Author

Sorry I don't understand what you mean. What image did you mean to link to?

Note that this PR should be strictly more efficient:

Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times.
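
Schematically, the difference looks like this (a standalone sketch with made-up sizes, not the PR's exact code):

#include <torch/torch.h>
#include <vector>

int main() {
  int64_t numFrames = 4, height = 270, width = 480;
  // Batched output is now always allocated as NHWC.
  auto frames = torch::empty({numFrames, height, width, 3}, torch::kUInt8);

  // Old behaviour (schematic): one HWC -> CHW conversion per frame.
  std::vector<torch::Tensor> perFrame;
  for (int64_t i = 0; i < numFrames; ++i) {
    perFrame.push_back(frames[i].permute({2, 0, 1}));
  }

  // New behaviour: a single batch-level NHWC -> NCHW view at the end.
  auto nchw = frames.permute({0, 3, 1, 2});
  return 0;
}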

Contributor (@ahmadsharif1, Oct 18, 2024)

Sorry the image wasn't uploaded properly:

[image: screenshot from the channels-last tutorial linked below]

I am not sure about the performance implications of doing a permute instead of .to(channels_last)

From this page: https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html#:~:text=What%20is%20Channels%20Last,pixel%2Dper%2Dpixel).

Contributor Author

Sorry, I still don't really understand where you're coming from. Can you please share a link? Is this relevant to this PR?

Again, the change involved in this PR is:

Batched output tensors are now always created as NHWC. They are converted to NCHW in one single step, instead of converting HWC sub-tensors N times

Contributor Author

Thanks for the link.

We're not concerned about memory format (contiguous vs channels-last) in this PR. That is related to, but distinct from, the dimension order.

Contributor

Link in the edited comment above. It's from the channels-last page.

I am not 100% sure if creating an NHWC tensor and permuting it is the same as creating an NCHW tensor with channels-last and working with that. The code that you deleted was doing the latter.

A benchmark may show a difference -- or not. Do you know?
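
For what it's worth, the two allocations can be compared directly. A minimal standalone sketch (uint8, made-up sizes):

#include <torch/torch.h>
#include <iostream>

int main() {
  int64_t N = 2, C = 3, H = 4, W = 5;
  // This PR: allocate NHWC, then take a NCHW view via permute.
  auto viaPermute =
      torch::empty({N, H, W, C}, torch::kUInt8).permute({0, 3, 1, 2});
  // The deleted code: allocate NCHW with channels-last strides.
  auto viaChannelsLast = torch::empty(
      {N, C, H, W},
      torch::TensorOptions()
          .dtype(torch::kUInt8)
          .memory_format(torch::MemoryFormat::ChannelsLast));

  // Same logical shape and same physical strides ({60, 1, 15, 3} here).
  std::cout << viaPermute.strides() << "\n";
  std::cout << viaChannelsLast.strides() << "\n";
  // The permuted view is channels-last contiguous, like the old tensor.
  std::cout << viaPermute.is_contiguous(torch::MemoryFormat::ChannelsLast)
            << "\n";
  return 0;
}

If the strides match (they do here), both approaches describe the same memory layout, and any remaining difference would be allocation-time overhead, which a benchmark would settle.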

.dtype({torch::kUInt8}));
} else {
TORCH_CHECK(
false, "Unsupported frame dimensionOrder =" + options.dimensionOrder)
}
}
durationSeconds(torch::empty({numFrames}, {torch::kFloat64})),
frames(torch::empty(
{numFrames,
options.height.value_or(*metadata.height),
options.width.value_or(*metadata.width),
3},
{torch::kUInt8})) {}

VideoDecoder::VideoDecoder() {}

@@ -890,22 +900,27 @@ void VideoDecoder::convertAVFrameToDecodedOutputOnCPU(
if (output.streamType == AVMEDIA_TYPE_VIDEO) {
if (streamInfo.colorConversionLibrary == ColorConversionLibrary::SWSCALE) {
torch::Tensor tensor;
int width = streamInfo.options.width.value_or(frame->width);
int height = streamInfo.options.height.value_or(frame->height);
if (preAllocatedOutputTensor.has_value()) {
// TODO: check shape of preAllocatedOutputTensor?
tensor = preAllocatedOutputTensor.value();
auto shape = tensor.sizes();
TORCH_CHECK(
(shape.size() == 3) && (shape[0] == height) &&
(shape[1] == width) && (shape[2] == 3),
"Expected tensor of shape ",
height,
"x",
width,
"x3, got ",
shape);
Contributor Author

Any idea how to make this single call shorter 🤔 ?

Contributor

I think you can do:

TORCH_CHECK(
            (shape.size() == 3) && shape.equals({height, width, 3}),
            "Expected tensor of shape ",
            height,
            "x",
            width,
            "x3, got ",
            shape);

But I'm not sure. The main thing I'm not sure about is whether an array literal will auto-convert into the corresponding ArrayRef: https://pytorch.org/cppdocs/api/classc10_1_1_array_ref.html#_CPPv4NK3c108ArrayRef6equalsE8ArrayRef
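
A quick standalone check of that conversion (a sketch, assuming libtorch; the sizes are made up):

#include <torch/torch.h>
#include <iostream>

int main() {
  int64_t height = 270, width = 480;
  auto t = torch::empty({height, width, 3}, torch::kUInt8);
  // The braced list converts to IntArrayRef, so equals() compiles and
  // compares all three dimensions at once.
  std::cout << std::boolalpha
            << t.sizes().equals({height, width, 3}) << "\n";  // true
  return 0;
}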

Contributor Author

Ah, I was mainly hoping to avoid the

            height,
            "x",
            width,
            "x3, got ",
            shape);

stack :p

} else {
int width = streamInfo.options.width.value_or(frame->width);
int height = streamInfo.options.height.value_or(frame->height);
tensor = torch::empty(
{height, width, 3}, torch::TensorOptions().dtype({torch::kUInt8}));
}

rawOutput.data = tensor.data_ptr<uint8_t>();
convertFrameToBufferUsingSwsScale(rawOutput);

if (streamInfo.options.dimensionOrder == "NCHW") {
tensor = tensor.permute({2, 0, 1});
}
output.frame = tensor;
} else if (
streamInfo.colorConversionLibrary ==
@@ -916,6 +931,14 @@ void VideoDecoder::convertAVFrameToDecodedOutputOnCPU(
"Invalid color conversion library: " +
std::to_string(static_cast<int>(streamInfo.colorConversionLibrary)));
}
if (!preAllocatedOutputTensor.has_value()) {
// We only convert to CHW if a pre-allocated tensor wasn't passed. When a
// pre-allocated tensor is passed, it's up to the caller (typically a
// batch API) to do the conversion. This is more efficient as it allows
// batch NHWC tensors to be permuted only once, instead of permuting HWC
// tensors N times.
output.frame = MaybePermuteHWC2CHW(streamInfo.options, output.frame);
}
Contributor Author

I still think there's some smell to this. Whether the tensor was pre-allocated and whether it should be permuted should be orthogonal concepts.
I think it would make sense for all the low-level decoding capabilities (including this function, convertAVFrameToDecodedOutputOnCPU) to only ever accept and return HWC tensors.

And it should be up to the higher-level decoding entry-points (basically the moral equivalent of the public methods) to do the conversion. It's not trivial because getFrameAtIndex is both an entry point and a sub-function of other entry-points. Maybe that also means we should already let all entry-points do their own allocation and always pass pre-allocated tensors?

Contributor (@scotts, Oct 18, 2024)

Agreed on your reasoning and the principles.

One way to square the circle is to split the public-facing part of getFrameAtIndex from the actual work being done. We would have a public member function (getFrameAtIndex) and a private member function (getFrameAtIndexInternal, or something like that).

  • getFrameAtIndex would call getFrameAtIndexInternal for the actual work, and afterwards it would do the conversion check and the actual conversion if needed.
  • getFrameAtIndexInternal would just assume HWC tensors and do the real work.

Then all of the internal calls to getFrameAtIndex would become calls to getFrameAtIndexInternal. I've implemented variants of this pattern several times before. We would repeat this for any public entry points that also need to do work for other public entry points, as needed.
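
A minimal sketch of that split (the types and the Internal name here are simplified stand-ins, not the real API):

#include <torch/torch.h>
#include <string>
#include <unordered_map>

// Simplified stand-ins for the real types, for illustration only.
struct VideoStreamDecoderOptions { std::string dimensionOrder = "NCHW"; };
struct StreamInfo { VideoStreamDecoderOptions options; };
struct DecodedOutput { torch::Tensor frame; };

torch::Tensor MaybePermuteHWC2CHW(
    const VideoStreamDecoderOptions& options, torch::Tensor& hwcTensor) {
  if (options.dimensionOrder == "NHWC") return hwcTensor;
  return hwcTensor.permute({2, 0, 1});  // 3D case only, for brevity
}

class VideoDecoder {
 public:
  // Public entry point: delegates the real work, then applies the
  // dimension-order conversion exactly once.
  DecodedOutput getFrameAtIndex(int streamIndex, int64_t frameIndex) {
    DecodedOutput output = getFrameAtIndexInternal(streamIndex, frameIndex);
    output.frame =
        MaybePermuteHWC2CHW(streams_[streamIndex].options, output.frame);
    return output;
  }

 private:
  // Internal worker: always produces HWC. Other entry points (e.g. the
  // batch APIs) call this repeatedly and permute their NHWC batch once.
  DecodedOutput getFrameAtIndexInternal(int streamIndex, int64_t frameIndex) {
    DecodedOutput output;
    output.frame = torch::empty({270, 480, 3}, torch::kUInt8);  // stub decode
    return output;
  }

  std::unordered_map<int, StreamInfo> streams_;
};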

Contributor

I am fine with low-level functions only dealing with HWC. AFAICT, most (all?) low-level code deals with HWC because it has better performance.

Contributor Author

Yeah, that sounds good. I'll try to implement that in a follow-up PR.

Contributor Author

AFAICT, most (all?) low-level code deals with HWC

That's not quite the case. In main, convertAVFrameToDecodedOutputOnCPU accepts both HWC and CHW - this is actually what this PR is fixing.

It may still return either HWC or CHW (instead of just HWC), and this is what I want to fix as a follow-up.


} else if (output.streamType == AVMEDIA_TYPE_AUDIO) {
// TODO: https://github.com/pytorch-labs/torchcodec/issues/85 implement
@@ -1046,6 +1069,7 @@ VideoDecoder::BatchDecodedOutput VideoDecoder::getFramesAtIndices(
}
i++;
}
output.frames = MaybePermuteHWC2CHW(options, output.frames);
return output;
}

@@ -1081,7 +1105,7 @@ VideoDecoder::BatchDecodedOutput VideoDecoder::getFramesInRange(
output.ptsSeconds[f] = singleOut.ptsSeconds;
output.durationSeconds[f] = singleOut.durationSeconds;
}

output.frames = MaybePermuteHWC2CHW(options, output.frames);
return output;
}

@@ -1134,6 +1158,7 @@ VideoDecoder::getFramesDisplayedByTimestampInRange(
// need this special case below.
if (startSeconds == stopSeconds) {
BatchDecodedOutput output(0, options, streamMetadata);
output.frames = MaybePermuteHWC2CHW(options, output.frames);
return output;
}

@@ -1176,6 +1201,7 @@ VideoDecoder::getFramesDisplayedByTimestampInRange(
output.ptsSeconds[f] = singleOut.ptsSeconds;
output.durationSeconds[f] = singleOut.durationSeconds;
}
output.frames = MaybePermuteHWC2CHW(options, output.frames);

return output;
}
@@ -1302,11 +1328,6 @@ torch::Tensor VideoDecoder::convertFrameToTensorUsingFilterGraph(
torch::Tensor tensor = torch::from_blob(
filteredFramePtr->data[0], shape, strides, deleter, {torch::kUInt8});
StreamInfo& activeStream = streams_[streamIndex];
if (activeStream.options.dimensionOrder == "NCHW") {
// The docs guaranty this to return a view:
// https://pytorch.org/docs/stable/generated/torch.permute.html
tensor = tensor.permute({2, 0, 1});
}
return tensor;
}

6 changes: 6 additions & 0 deletions test/decoders/test_video_decoder_ops.py
@@ -425,6 +425,12 @@ def test_color_conversion_library_with_dimension_order(
assert frames.shape[1:] == expected_shape
assert_tensor_equal(frames[0], frame0_ref)

frames = get_frames_at_indices(
decoder, stream_index=stream_index, frame_indices=[0, 1, 3, 4]
)
assert frames.shape[1:] == expected_shape
assert_tensor_equal(frames[0], frame0_ref)

@pytest.mark.parametrize(
"width_scaling_factor,height_scaling_factor",
((1.31, 1.5), (0.71, 0.5), (1.31, 0.7), (0.71, 1.5), (1.0, 1.0)),