Skip to content

vello_hybrid: Fix slow scenes with many changing gradients#1496

Merged
LaurenzV merged 3 commits into
mainfrom
laurenz/grad
Mar 16, 2026
Merged

vello_hybrid: Fix slow scenes with many changing gradients#1496
LaurenzV merged 3 commits into
mainfrom
laurenz/grad

Conversation

@LaurenzV
Copy link
Copy Markdown
Collaborator

@LaurenzV LaurenzV commented Mar 9, 2026

Closes #1413.

There were two problems in the old implementation:

  1. When compacting the LUTs, we had a doubly-nested loop that, for each gradient existing in the cache we iterated over ALL removed gradients to compute the new lut_start position for that gradient.
  2. When copying LUT data, we would do so pixel-by-pixel instead of copying the whole range at once.

This PR addresses both points by 1) computing a prefix sum for how much lut_start needs to be adjusted for a given cached ramp and 2) copying whole ranges at once using copy_within.

Before:
As you can see, the scene quickly gets laggy once we hit the compaction code path.

gradient_before.mp4
image

After:
Now, the bottleneck is basically only uploading the texture and computing the LUT on the CPU. Which is still not ideal, but much better than before.

gradient_after.mp4

https://github.com/user-attachments/assets/f58db4b4-2a46-4c14-a225-6fea11cd1314
image

@LaurenzV LaurenzV requested a review from grebmeg March 9, 2026 13:29
@nicoburns nicoburns added the C-hybrid Applies to the vello_hybrid crate label Mar 11, 2026
Copy link
Copy Markdown
Collaborator

@grebmeg grebmeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! From the attached videos, I concluded that you added some shortcuts in the gradient scene to create multiple gradients. I think this is a useful feature for testing purposes in vello_example_scenes, so it would be worth including it.


/// Remove LUT data for evicted entries with compacting the LUTs vector, and update remaining offsets.
fn compact_luts(&mut self, ramps_to_remove: &mut [(CacheKey<GradientCacheKey>, CachedRamp)]) {
fn compact_luts(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought on remove_lru_entries: the full sort_by_key there is doing more work than needed. You're sorting all N entries just to take the smallest count, and compact_luts immediately re-sorts ramps_to_remove by lut_start anyway, so the LRU ordering is discarded.

select_nth_unstable_by_key (quickselect) would give you the count smallest in O(N) average instead of O(N log N):

entries.select_nth_unstable_by_key(count - 1, |(_, last_used)| *last_used);

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the logic now! Feel free to take another look, but I'll merge in the meanwhile.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the compaction rework! copy_within + prefix-sum approach is a big improvement over the byte-at-a-time loop.

One thing I've been thinking about: have you considered whether the gradient cache could eventually use the same atlas infrastructure as images, rather than managing its own packed Vec<u8> + compaction?

The idea would be:

  • Allocate each gradient LUT as a small rectangular region in the MultiAtlasManager (e.g., a 1024-texel LUT becomes a 32×32 tile)
  • Eviction becomes image_cache.deallocate(slot) — O(1), no compaction at all
  • The atlas handles fragmentation internally via guillotine free-rect coalescing
  • One fewer GPU texture and bind group to manage

The main trade-off is the 1D → 2D shape mismatch: thin 4096×1 strips would fragment the guillotine allocator badly, so you'd want to wrap LUTs into square-ish tiles, which adds a small amount of shader complexity for 2D indexing within the tile (though the current shader already does something similar to unwrap the flat gradient texture).

Not suggesting this for this PR — the current approach is a solid incremental fix. But it might be worth keeping in mind as a future direction if gradient-heavy scenes remain a hot spot, since it would eliminate the compaction problem at the root rather than optimizing it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought on the eviction strategy: the current count-based LRU approach requires collecting all entries, partitioning to find the smallest, then compacting. Have you considered an age-based eviction model instead, similar to what the glyph cache does?

The idea:

  • Each entry already tracks last_used (epoch). During maintain(), entries older than some threshold get evicted directly — no sorting or partitioning needed.
  • hashbrown's extract_if would let you drain expired entries in a single pass, yielding the removed (key, CachedRamp) pairs directly — no scratch vecs, no cloning keys, no select_nth_unstable.
    Something roughly like:
fn maintain(&mut self) {
    let threshold = self.epoch.saturating_sub(self.max_age);
    let mut lru_entries = core::mem::take(&mut self.scratch.lru_entries);
    lru_entries.clear();
    lru_entries.extend(
        self.cache
            .extract_if(|_, (_, last_used)| *last_used < threshold)
            .map(|(key, (ramp, _))| (key, ramp))
    );
    if !lru_entries.is_empty() {
        let mut prefix_sum = core::mem::take(&mut self.scratch.prefix_sum);
        self.compact_luts(&mut lru_entries, &mut prefix_sum);
        self.scratch.prefix_sum = prefix_sum;
        self.has_changed = true;
    }
    self.scratch.lru_entries = lru_entries;
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of reusing the image allocator for this! Will add a TODO, happy to implement this in a follow-up.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also happy to revisit the eviction strategy (perhaps we can come up with some solution that can be shared between glifo and vello later on)? But think in the interest of keeping this PR as small as possible, it would be better to do this in a follow up.

@LaurenzV
Copy link
Copy Markdown
Collaborator Author

I think this is a useful feature for testing purposes in vello_example_scenes, so it would be worth including it.

I'll include it in my new benchmarking harness! I think it makes more sense there.

@LaurenzV LaurenzV enabled auto-merge March 16, 2026 07:45
@LaurenzV
Copy link
Copy Markdown
Collaborator Author

LaurenzV commented Mar 16, 2026

@grebmeg If you could take another look at the newest commit post-merge just to double check, that would be great.

@LaurenzV LaurenzV added this pull request to the merge queue Mar 16, 2026
Merged via the queue into main with commit baae4b9 Mar 16, 2026
17 checks passed
@LaurenzV LaurenzV deleted the laurenz/grad branch March 16, 2026 08:03
@grebmeg
Copy link
Copy Markdown
Collaborator

grebmeg commented Mar 16, 2026

@grebmeg If you could take another look at the newest commit post-merge just to double check, that would be great.

It looks good to me! Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-hybrid Applies to the vello_hybrid crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vello_hybrid: Gradients that change in each frame lead to sluggish performance

4 participants