
Approximate mode performance #1077

@NicolasHug

Description

We previously thought seek_mode="approximate" was always faster than exact mode. We recently realized that's not always the case: it can be slower in some weird cases, which I'll try to explain below.

The TL;DR is: in approximate mode, we get incorrect keyframe information from FFmpeg's internal index when we call av_index_search_timestamp(). We then fail to avoid seeking, and we seek to a "keyframe" which is in fact not a keyframe, which causes a backwards seek. This only happens in specific (and hopefully rare) access patterns.


Illustration

Let's generate a 10s long mp4 video with 300 frames, and 2 keyframes: one at index 0, one at index 150:

ffmpeg -f lavfi -i testsrc2=duration=10:size=1280x720:rate=30 -c:v libx264 -g 150 test.mp4 -y
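
To double-check where the keyframes actually landed, something like the following should list only the keyframes and their timestamps (the exact ffprobe output fields may vary across versions):

ffprobe -v error -select_streams v:0 -skip_frame nokey -show_entries frame=pts_time,key_frame -of csv=p=0 test.mp4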

Now let's run a quick benchmark where we sample 150 frames (benchmark code below):

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 150  
FPS: approx = 163      exact = 223

Approx is clearly slower. Digging in a bit, the slowdown appears when we try to decode frame 148, but only in a particular setting:

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146]"  # OK
FPS: approx = 4      exact = 4

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[148]"   # OK
FPS: approx = 4      exact = 4

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146, 148]"  # whaaaaaat?
FPS: approx = 4      exact = 8

Decoding frame 146 alone is fine. Decoding frame 148 alone is fine. But decoding frame 146 and then frame 148 is 2X slower.
(Note that I'm not decoding 146, 147, 148 because we have an optimization for decoding consecutive frames, which would bypass the buggy behavior I want to show.)

The problem is that in approximate mode, FFmpeg tells us that frame 148 is a keyframe - but it's not!! When going from frame 146 to frame 148, we'll end up here:

  int lastKeyFrameIndex = getKeyFrameIndexForPts(lastDecodedAvFramePts_);
  int targetKeyFrameIndex = getKeyFrameIndexForPts(cursor_);
  return lastKeyFrameIndex >= 0 && targetKeyFrameIndex >= 0 &&
      lastKeyFrameIndex == targetKeyFrameIndex;
}

In there, we'll try to skip seeking if the keyframe associated with 146 (lastKeyFrameIndex) is the same as the keyframe associated with 148 (targetKeyFrameIndex). And normally we should be able to skip seeking! But when we call av_index_search_timestamp():

int SingleStreamDecoder::getKeyFrameIndexForPts(int64_t pts) const {
  const StreamInfo& streamInfo = streamInfos_.at(activeStreamIndex_);
  if (streamInfo.keyFrames.empty()) {
    return av_index_search_timestamp(
        streamInfo.stream, pts, AVSEEK_FLAG_BACKWARD);
  } else {
    // ...

it tells us that frame 146 is associated with keyframe 0, which is correct, but it tells us that frame 148 is associated with frame 148 itself, which is wrong: 148 isn't a keyframe. Only frames 0 and 150 are.

So we'll seek. And we'll seek to the timestamp of frame 148, which... is actually going to send us right back to frame 0. And we'll have to decode all frames from 0 to 148 just to decode frame 148. I can confirm all of this by instrumenting the code with a few printfs.
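
For reference, here's a small standalone probe, not torchcodec code, that mimics the lookup above: it converts a frame index to a pts assuming a constant frame rate and no start offset (which, as discussed below, may itself be where we go wrong), then asks av_index_search_timestamp() which index entry it would pick:

// Standalone sketch, not torchcodec code: reproduce the index lookup above.
extern "C" {
#include <libavformat/avformat.h>
#include <libavutil/mathematics.h>
}
#include <cstdio>

int main(int argc, char** argv) {
  AVFormatContext* fmt = nullptr;
  if (argc < 2 || avformat_open_input(&fmt, argv[1], nullptr, nullptr) < 0) {
    return 1;
  }
  avformat_find_stream_info(fmt, nullptr);
  int streamIdx =
      av_find_best_stream(fmt, AVMEDIA_TYPE_VIDEO, -1, -1, nullptr, 0);
  AVStream* stream = fmt->streams[streamIdx];

  for (int frameIndex : {0, 146, 148, 150}) {
    // Assumption: frame i has pts = i / avg_frame_rate, expressed in the
    // stream's time base, with no start offset.
    int64_t pts = av_rescale_q(
        frameIndex, av_inv_q(stream->avg_frame_rate), stream->time_base);
    int entryIndex =
        av_index_search_timestamp(stream, pts, AVSEEK_FLAG_BACKWARD);
    std::printf("frame %3d -> pts %6lld -> index entry %d\n",
                frameIndex, (long long)pts, entryIndex);
  }
  avformat_close_input(&fmt);
  return 0;
}

If the timestamps this computes don't line up with the entries in the index dump further below, that points at the offset question discussed in the next section.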

Basically when we decode frames [146, 148]:

  • in exact mode we decode all frames in [0, 146], we avoid seeking, and we then decode frames in [147, 148]. Every frame is decoded ONCE (149 decodes total).
  • in approximate mode we decode all frames in [0, 146], we fail to avoid seeking so we go right back to 0, and we decode all frames in [0, 148]. That's 147 + 149 = 296 decodes, roughly twice the work, which matches the 2X gap above.

Note: I previously thought that av_index_search_timestamp() was slow because it could potentially read from the file, etc. That's not true. It's cheap: it's a binary search on a pre-built index array, just like what we do ourselves in exact mode. av_index_search_timestamp() isn't slow, it's just not returning what we want it to return.


How do we fix this?

I'm not sure. I don't think this is an FFmpeg bug. I think av_index_search_timestamp() does what it should, and we might be using it incorrectly. It's possible that it accepts a dts as input rather than a pts. At the very least, I think it expects an offset timestamp (in this case the offset is 1024, I think). We can print the internal FFmpeg index with something like this:

  // Debug: Print FFmpeg index entries
  printf("=== FFmpeg Index Entries (after addStream) ===\n");
  int count = avformat_index_get_entries_count(streamInfo.stream);
  printf("Total entries: %d\n", count);
  for (int i = 0; i < count; i++) {
      const AVIndexEntry* entry = avformat_index_get_entry(streamInfo.stream, i);
      if (entry) {
          // pos and timestamp are int64_t; cast so the format specifier is portable.
          printf("Entry %d: pos=%lld, timestamp=%lld, size=%d, flags=%s\n",
                 i,
                 (long long)entry->pos,
                 (long long)entry->timestamp,
                 entry->size,
                 (entry->flags & AVINDEX_KEYFRAME) ? "KEYFRAME" : "");
      }
  }
  printf("===========================================\n");

and looking at it, it's clear that there's some weird timestamp offset going on here. But I don't understand it well enough yet.

I think we mainly have two options:

  • better understand how to correctly use av_index_search_timestamp(). Are we passing the wrong timestamp to it?
  • heuristic: build our own partial keyframe index on the fly in approximate mode. If the next frame we want to decode is forward of where we are, and it's closer to us than the previous keyframe, then we can skip seeking: it'll probably be faster to just decode all frames up to the target than to seek anywhere. (I'm not sure; a rough sketch follows below.)
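
Here's a very rough sketch of what that heuristic could look like. Everything here is hypothetical (the names, and where lastDecodedPts / lastKeyFramePts would come from, presumably the keyframe flag on the frames the decoder actually returns); it's one possible reading of the idea, not a worked-out design:

// Hypothetical helper, not actual torchcodec code. lastDecodedPts is the pts
// of the frame we just decoded, lastKeyFramePts the pts of the most recent
// decoded frame that was flagged as a keyframe; both in the stream's time base.
extern "C" {
#include <libavutil/avutil.h>  // AV_NOPTS_VALUE
}
#include <cstdint>

bool canSkipSeekHeuristic(
    int64_t targetPts, int64_t lastDecodedPts, int64_t lastKeyFramePts) {
  if (lastDecodedPts == AV_NOPTS_VALUE || lastKeyFramePts == AV_NOPTS_VALUE) {
    return false;  // nothing decoded yet: we have to seek
  }
  if (targetPts < lastDecodedPts) {
    return false;  // backwards target: decoding forward can't reach it
  }
  // The target is ahead of us. If it's closer to us than the last keyframe we
  // passed, a backward seek would likely send us at least that far back, so
  // just decoding forward should be cheaper (or at worst comparable).
  return (targetPts - lastDecodedPts) <= (lastDecodedPts - lastKeyFramePts);
}

For the [146, 148] case above this returns true (2 frames forward vs. 146 frames back to the last decoded keyframe, at a constant frame rate), so we'd keep decoding forward instead of seeking back to 0.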

When does the problem actually happen?

We should note that for this specific video, approx isn't always slower. It was slower above when we decoded 150 frames, but if we decode far fewer frames, or far more, there's no slowdown:

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 100
FPS: approx = 163      exact = 162
~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling 250 
FPS: approx = 320      exact = 319

But it's not a matter of the number of frames. It's a matter of whether we try to go from a frame i < 148 to a frame j in [148, 149]. If we go from i < 148 to a frame j >= 150, there's no slowdown: we can't skip seeking in that case either, but frame 150 is a keyframe, so seeking to it is fine, and we're not sent back down to 0!

~/dev/torchcodec-cuda (approx-debug*) » python bench_approx.py test.mp4  --sampling "[146, 150]" 
FPS: approx = 8      exact = 8

Benchmark code

import argparse
from pathlib import Path
from time import perf_counter_ns

import torch
import torchcodec
from torchcodec.decoders import VideoDecoder


def bench(f, *args, num_exp=100, warmup=0, **kwargs):

    for _ in range(warmup):
        f(*args, **kwargs)

    times = []
    for _ in range(num_exp):

        start = perf_counter_ns()
        f(*args, **kwargs)
        end = perf_counter_ns()

        times.append(end - start)

    return torch.tensor(times).float()

def decode_one_video_torchcodec(video_source, seek_mode, indices):
    if isinstance(video_source, bytes):
        source = video_source
    else:
        source = str(video_source)
    decoder = VideoDecoder(source, device="cpu", seek_mode=seek_mode, num_ffmpeg_threads=1)
    return decoder.get_frames_at(indices)


parser = argparse.ArgumentParser()
parser.add_argument("path")
parser.add_argument(
    "--sampling",
    type=str,
    default="all",
    help="Sampling strategy: 'all' for all frames, an int N for N evenly spaced "
    "frames, 'firstN' for the first N frames, or an explicit index list like '[146, 148]'.",
)

args = parser.parse_args()

with open(args.path, "rb") as f:
    video_data = f.read()

dummy_dec = VideoDecoder(video_data, device="cpu")
if str(args.sampling).startswith("["):
    indices = torch.tensor(eval(args.sampling), dtype=torch.int)
elif str(args.sampling).startswith("first"):
    num_frames_to_samples = int(args.sampling[len("first") :])
    indices = torch.arange(num_frames_to_samples)
elif args.sampling == "all":
    indices = torch.arange(len(dummy_dec))
else:
    num_frames_to_samples = int(args.sampling)
    indices = torch.linspace(
        0, len(dummy_dec) - 1, num_frames_to_samples, dtype=torch.int
    )
# print(indices)

approx_times = bench(
    decode_one_video_torchcodec, video_data, seek_mode="approximate",
    indices=indices, warmup=1, num_exp=3
)

exact_times = bench(
    decode_one_video_torchcodec, video_data, seek_mode="exact",
    indices=indices, warmup=1, num_exp=3
)

approx_fps = (len(indices) / (1e-9 * approx_times)).median().item()
exact_fps = (len(indices) / (1e-9 * exact_times)).median().item()
print(f"FPS: approx = {approx_fps:.0f}      exact = {exact_fps:.0f}")
