vello_hybrid: Fix slow scenes with many changing gradients#1496
Conversation
grebmeg
left a comment
There was a problem hiding this comment.
Nice work! From the attached videos, I concluded that you added some shortcuts in the gradient scene to create multiple gradients. I think this is a useful feature for testing purposes in vello_example_scenes, so it would be worth including it.
|
|
||
| /// Remove LUT data for evicted entries with compacting the LUTs vector, and update remaining offsets. | ||
| fn compact_luts(&mut self, ramps_to_remove: &mut [(CacheKey<GradientCacheKey>, CachedRamp)]) { | ||
| fn compact_luts( |
There was a problem hiding this comment.
One thought on remove_lru_entries: the full sort_by_key there is doing more work than needed. You're sorting all N entries just to take the smallest count, and compact_luts immediately re-sorts ramps_to_remove by lut_start anyway, so the LRU ordering is discarded.
select_nth_unstable_by_key (quickselect) would give you the count smallest in O(N) average instead of O(N log N):
entries.select_nth_unstable_by_key(count - 1, |(_, last_used)| *last_used);There was a problem hiding this comment.
I've updated the logic now! Feel free to take another look, but I'll merge in the meanwhile.
There was a problem hiding this comment.
Nice work on the compaction rework! copy_within + prefix-sum approach is a big improvement over the byte-at-a-time loop.
One thing I've been thinking about: have you considered whether the gradient cache could eventually use the same atlas infrastructure as images, rather than managing its own packed Vec<u8> + compaction?
The idea would be:
- Allocate each gradient LUT as a small rectangular region in the
MultiAtlasManager(e.g., a 1024-texel LUT becomes a 32×32 tile) - Eviction becomes
image_cache.deallocate(slot)— O(1), no compaction at all - The atlas handles fragmentation internally via
guillotinefree-rect coalescing - One fewer GPU texture and bind group to manage
The main trade-off is the 1D → 2D shape mismatch: thin 4096×1 strips would fragment the guillotine allocator badly, so you'd want to wrap LUTs into square-ish tiles, which adds a small amount of shader complexity for 2D indexing within the tile (though the current shader already does something similar to unwrap the flat gradient texture).
Not suggesting this for this PR — the current approach is a solid incremental fix. But it might be worth keeping in mind as a future direction if gradient-heavy scenes remain a hot spot, since it would eliminate the compaction problem at the root rather than optimizing it.
There was a problem hiding this comment.
Another thought on the eviction strategy: the current count-based LRU approach requires collecting all entries, partitioning to find the smallest, then compacting. Have you considered an age-based eviction model instead, similar to what the glyph cache does?
The idea:
- Each entry already tracks
last_used(epoch). Duringmaintain(), entries older than some threshold get evicted directly — no sorting or partitioning needed. hashbrown'sextract_ifwould let you drain expired entries in a single pass, yielding the removed(key, CachedRamp)pairs directly — no scratch vecs, no cloning keys, noselect_nth_unstable.
Something roughly like:
fn maintain(&mut self) {
let threshold = self.epoch.saturating_sub(self.max_age);
let mut lru_entries = core::mem::take(&mut self.scratch.lru_entries);
lru_entries.clear();
lru_entries.extend(
self.cache
.extract_if(|_, (_, last_used)| *last_used < threshold)
.map(|(key, (ramp, _))| (key, ramp))
);
if !lru_entries.is_empty() {
let mut prefix_sum = core::mem::take(&mut self.scratch.prefix_sum);
self.compact_luts(&mut lru_entries, &mut prefix_sum);
self.scratch.prefix_sum = prefix_sum;
self.has_changed = true;
}
self.scratch.lru_entries = lru_entries;
}There was a problem hiding this comment.
I like the idea of reusing the image allocator for this! Will add a TODO, happy to implement this in a follow-up.
There was a problem hiding this comment.
And also happy to revisit the eviction strategy (perhaps we can come up with some solution that can be shared between glifo and vello later on)? But think in the interest of keeping this PR as small as possible, it would be better to do this in a follow up.
I'll include it in my new benchmarking harness! I think it makes more sense there. |
|
@grebmeg If you could take another look at the newest commit post-merge just to double check, that would be great. |
It looks good to me! Thank you! |
Closes #1413.
There were two problems in the old implementation:
lut_startposition for that gradient.This PR addresses both points by 1) computing a prefix sum for how much
lut_startneeds to be adjusted for a given cached ramp and 2) copying whole ranges at once usingcopy_within.Before:
As you can see, the scene quickly gets laggy once we hit the compaction code path.
gradient_before.mp4
After:
Now, the bottleneck is basically only uploading the texture and computing the LUT on the CPU. Which is still not ideal, but much better than before.
gradient_after.mp4
https://github.com/user-attachments/assets/f58db4b4-2a46-4c14-a225-6fea11cd1314
