
Approximate mode performance #1077

@NicolasHug

Description

We previously thought seek_mode="approximate" was always faster than exact mode. We recently realized that's not always the case: it can be slower in some weird cases, which I'll try to explain below.

The TL;DR is: in approximate mode, we get incorrect keyframe information from FFmpeg's internal index when we call av_index_search_timestamp(). We then fail to avoid seeking, and we seek to a "keyframe" which is in fact not a keyframe, which causes a backwards seek. This only happens in specific (and hopefully rare) access patterns.


Illustration

Let's generate a 10s long mp4 video with 300 frames, and 2 keyframes: one at index 0, one at index 150:

ffmpeg -f lavfi -i testsrc2=duration=10:size=1280x720:rate=30 -c:v libx264 -g 150 test.mp4 -y
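
To double-check where the keyframes actually landed, something like the following should list only the keyframes and their timestamps (the exact ffprobe output fields may vary across versions):

ffprobe -v error -select_streams v:0 -skip_frame nokey -show_entries frame=pts_time,key_frame -of csv=p=0 test.mp4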

Now let's run a quick benchmark where we sample 150 frames (benchmark code below):

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 150  
FPS: approx = 163      exact = 223

Approx is clearly slower. Digging in a bit, the slowdown appears when we try to decode frame 148, but only in a particular setting:

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146]"  # OK
FPS: approx = 4      exact = 4

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[148]"   # OK
FPS: approx = 4      exact = 4

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146, 148]"  # whaaaaaat?
FPS: approx = 4      exact = 8

Decoding frame 146 alone is fine. Decoding frame 148 alone is fine. But decoding frame 146 and then frame 148 is 2X slower.
(Note that I'm not decoding 146, 147, 148 because we have an optimization for decoding consecutive frames, which would bypass the buggy behavior I want to show.)

The problem is that in approximate mode, FFmpeg tells us that frame 148 is a keyframe - but it's not!! When going from frame 146 to frame 148, we'll end up here:

  int lastKeyFrameIndex = getKeyFrameIndexForPts(lastDecodedAvFramePts_);
  int targetKeyFrameIndex = getKeyFrameIndexForPts(cursor_);
  return lastKeyFrameIndex >= 0 && targetKeyFrameIndex >= 0 &&
      lastKeyFrameIndex == targetKeyFrameIndex;
}

In there, we'll try to skip seeking if the keyframe associated with 146 (lastKeyFrameIndex) is the same as the keyframe associated with 148 (targetKeyFrameIndex). And normally we should be able to skip seeking! But when we call av_index_search_timestamp():

int SingleStreamDecoder::getKeyFrameIndexForPts(int64_t pts) const {
  const StreamInfo& streamInfo = streamInfos_.at(activeStreamIndex_);
  if (streamInfo.keyFrames.empty()) {
    return av_index_search_timestamp(
        streamInfo.stream, pts, AVSEEK_FLAG_BACKWARD);
  } else {
    // ...

it tells us that frame 146 is associated with keyframe 0, which is correct, but it tells us that frame 148 is associated with frame 148 itself, which is wrong: 148 isn't a keyframe. Only frames 0 and 150 are.

So we'll seek. And we'll seek to the timestamp of frame 148, which... is actually going to send us right back to frame 0. And we'll have to decode all frames from 0 to 148 just to decode frame 148. I can confirm all of this by instrumenting the code with a few printfs.
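
For reference, here's a small standalone probe, not torchcodec code, that mimics the lookup above: it converts a frame index to a pts assuming a constant frame rate and no start offset (which, as discussed below, may itself be where we go wrong), then asks av_index_search_timestamp() which index entry it would pick:

// Standalone sketch, not torchcodec code: reproduce the index lookup above.
extern "C" {
#include <libavformat/avformat.h>
#include <libavutil/mathematics.h>
}
#include <cstdio>

int main(int argc, char** argv) {
  AVFormatContext* fmt = nullptr;
  if (argc < 2 || avformat_open_input(&fmt, argv[1], nullptr, nullptr) < 0) {
    return 1;
  }
  avformat_find_stream_info(fmt, nullptr);
  int streamIdx =
      av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
  AVStream* stream = fmt->streams[streamIdx];

  for (int frameIndex : {0, 146, 148, 150}) {
    // Assumption: frame i has pts = i / avg_frame_rate, expressed in the
    // stream's time base, with no start offset.
    int64_t pts = av_rescale_q(
        frameIndex, av_inv_q(stream->avg_frame_rate), stream->time_base);
    int entryIndex =
        av_index_search_timestamp(stream, pts, AVSEEK_FLAG_BACKWARD);
    std::printf("frame %3d -> pts %6lld -> index entry %d\n",
                frameIndex, (long long)pts, entryIndex);
  }
  avformat_close_input(&fmt);
  return 0;
}

If the timestamps this computes don't line up with the entries in the index dump further below, that points at the offset question discussed in the next section.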

Basically when we decode frames [146, 148]:

  • in exact mode we decode all frames in [0, 146], we avoid seeking, and we then decode frames in [147, 148]. Every frame is decoded ONCE (149 decodes total).
  • in approximate mode we decode all frames in [0, 146], we fail to avoid seeking so we go right back to 0, and we decode all frames in [0, 148]. That's 147 + 149 = 296 decodes, roughly twice the work, which matches the 2X gap above.

Note: I previously thought that av_index_search_timestamp() was slow because it could potentially read from the file, etc. That's not true. It's cheap: it's a binary search on a pre-built index array, just like what we do ourselves in exact mode. av_index_search_timestamp() isn't slow, it's just not returning what we want it to return.


How do we fix this?

I'm not sure. I don't think this is an FFmpeg bug. I think av_index_search_timestamp() does what it should, and we might be using it incorrectly. It's possible that it accepts a dts as input rather than a pts. At the very least, I think it expects an offset timestamp (in this case the offset is 1024, I think). We can print the internal FFmpeg index with something like this:

  // Debug: Print FFmpeg index entries
  printf("=== FFmpeg Index Entries (after addStream) ===\n");
  int count = avformat_index_get_entries_count(streamInfo.stream);
  printf("Total entries: %d\n", count);
  for (int i = 0; i < count; i++) {
      const AVIndexEntry* entry = avformat_index_get_entry(streamInfo.stream, i);
      if (entry) {
          // pos and timestamp are int64_t; cast so the format specifier is portable.
          printf("Entry %d: pos=%lld, timestamp=%lld, size=%d, flags=%s\n",
                 i,
                 (long long)entry->pos,
                 (long long)entry->timestamp,
                 entry->size,
                 (entry->flags & AVINDEX_KEYFRAME) ? "KEYFRAME" : "");
      }
  }
  printf("===========================================\n");

and looking at it, it's clear that there's some weird timestamp offset going on here. But I don't understand it well enough yet.

I think we mainly have two options:

  • better understand how to correctly use av_index_search_timestamp(). Are we passing the wrong timestamp to it?
  • heuristic: build our own partial keyframe index on the fly in approximate mode. If the next frame we want to decode is forward of where we are, and it's closer to us than the previous keyframe, then we can skip seeking: it'll probably be faster to just decode all frames up to the target than to seek anywhere. (I'm not sure; a rough sketch follows below.)
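
Here's a very rough sketch of what that heuristic could look like. Everything here is hypothetical (the names, and where lastDecodedPts / lastKeyFramePts would come from, presumably the keyframe flag on the frames the decoder actually returns); it's one possible reading of the idea, not a worked-out design:

// Hypothetical helper, not actual torchcodec code. lastDecodedPts is the pts
// of the frame we just decoded, lastKeyFramePts the pts of the most recent
// decoded frame that was flagged as a keyframe; both in the stream's time base.
extern "C" {
#include <libavutil/avutil.h>  // AV_NOPTS_VALUE
}
#include <cstdint>

bool canSkipSeekHeuristic(
    int64_t targetPts, int64_t lastDecodedPts, int64_t lastKeyFramePts) {
  if (lastDecodedPts == AV_NOPTS_VALUE || lastKeyFramePts == AV_NOPTS_VALUE) {
    return false;  // nothing decoded yet: we have to seek
  }
  if (targetPts < lastDecodedPts) {
    return false;  // backwards target: decoding forward can't reach it
  }
  // The target is ahead of us. If it's closer to us than the last keyframe we
  // passed, a backward seek would likely send us at least that far back, so
  // just decoding forward should be cheaper (or at worst comparable).
  return (targetPts - lastDecodedPts) <= (lastDecodedPts - lastKeyFramePts);
}

For the [146, 148] case above this returns true (2 frames forward vs. 146 frames back to the last decoded keyframe, at a constant frame rate), so we'd keep decoding forward instead of seeking back to 0.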

When does the problem actually happen?

We should note that for this specific video, approx isn't always slower. It was slower above when we decoded 150 frames, but if we decode far fewer frames, or far more, there's no slowdown:

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 100
FPS: approx = 163      exact = 162
~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 250 
FPS: approx = 320      exact = 319

But it's not a matter of the number of frames. It's a matter of whether we try to go from a frame i < 148 to a frame j in [148, 149]. If we go from i < 148 to a frame j >= 150, there's no slowdown: we can't skip seeking in that case either, but frame 150 is a keyframe, so seeking to it is fine, and we're not sent back down to 0!

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146, 150]" 
FPS: approx = 8      exact = 8

Benchmark code

import argparse
from pathlib import Path
from time import perf_counter_ns

import torch
import torchcodec
from torchcodec.decoders import VideoDecoder


def bench(f, *args, num_exp=100, warmup=0, **kwargs):

    for _ in range(warmup):
        f(*args, **kwargs)

    times = []
    for _ in range(num_exp):

        start = perf_counter_ns()
        f(*args, **kwargs)
        end = perf_counter_ns()

        times.append(end - start)

    return torch.tensor(times).float()

def decode_one_video_torchcodec(video_source, seek_mode, indices):
    if isinstance(video_source, bytes):
        source = video_source
    else:
        source = str(video_source)
    decoder = VideoDecoder(source, device="cpu", seek_mode=seek_mode, num_ffmpeg_threads=1)
    return decoder.get_frames_at(indices)


parser = argparse.ArgumentParser()
parser.add_argument("path")
parser.add_argument(
    "--sampling",
    type=str,
    default="all",
    help="Sampling strategy: 'all' for all frames, an int N for N evenly spaced "
    "frames, 'firstN' for the first N frames, or an explicit index list like '[146, 148]'.",
)

args = parser.parse_args()

with open(args.path, "rb") as f:
    video_data = f.read()

dummy_dec = VideoDecoder(video_data, device="cpu")
if str(args.sampling).startswith("["):
    indices = torch.tensor(eval(args.sampling), dtype=torch.int)
elif str(args.sampling).startswith("first"):
    num_frames_to_samples = int(args.sampling[len("first") :])
    indices = torch.arange(num_frames_to_samples)
elif args.sampling == "all":
    indices = torch.arange(len(dummy_dec))
else:
    num_frames_to_samples = int(args.sampling)
    indices = torch.linspace(
        0, len(dummy_dec) - 1, num_frames_to_samples, dtype=torch.int
    )
# print(indices)

approx_times = bench(
    decode_one_video_torchcodec, video_data, seek_mode="approximate",
    indices=indices, warmup=1, num_exp=3
)

exact_times = bench(
    decode_one_video_torchcodec, video_data, seek_mode="exact",
    indices=indices, warmup=1, num_exp=3
)

approx_fps = (len(indices) / (1e-9 * approx_times)).median().item()
exact_fps = (len(indices) / (1e-9 * exact_times)).median().item()
print(f"FPS: approx = {approx_fps:.0f}      exact = {exact_fps:.0f}")
