perf(license-detection): add rkyv-based license index cache by abraemer · Pull Request #651 · mstykow/provenant

abraemer · 2026-04-14T05:36:23Z

Summary

Add rkyv-based license index cache that persists the built license index to disk, enabling ~0.14s warm starts (down from ~10s cold starts)
On first run, the index is built from the embedded artifact and saved to ~/.cache/provenant/license_index/; subsequent runs load the cached index via rkyv zero-copy deserialization
Automaton fields are stored as daachorse serialized byte blobs; Rule and License fields are stored as rmp_serde byte blobs (avoiding cascading rkyv derive requirements on those complex types)
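
The build-once, load-afterwards flow described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the file name `index.bin`, the helper names `cache_dir` and `load_or_build`, and the write-then-rename step are assumptions, and the payload is treated as opaque bytes where the real cache holds an rkyv archive.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical cache file name; the PR description only names the directory.
const CACHE_FILE_NAME: &str = "index.bin";

/// Resolve the cache directory below a given home directory, mirroring
/// the `~/.cache/provenant/license_index/` location from the summary.
pub fn cache_dir(home: &Path) -> PathBuf {
    home.join(".cache").join("provenant").join("license_index")
}

/// Return the cached bytes if a cache file exists; otherwise run `build`,
/// persist its result, and return it. `build` stands in for the expensive
/// index construction from the embedded artifact.
pub fn load_or_build<F>(dir: &Path, build: F) -> io::Result<Vec<u8>>
where
    F: FnOnce() -> Vec<u8>,
{
    let path = dir.join(CACHE_FILE_NAME);
    match fs::read(&path) {
        // Warm start: the cache file is already on disk.
        Ok(bytes) => Ok(bytes),
        // Cold start: no cache yet, so build the index and save it.
        Err(err) if err.kind() == io::ErrorKind::NotFound => {
            let bytes = build();
            fs::create_dir_all(dir)?;
            // Write to a temporary file first, then rename, so that an
            // interrupted run does not leave a truncated cache behind.
            let tmp = dir.join(format!("{}.tmp", CACHE_FILE_NAME));
            fs::write(&tmp, &bytes)?;
            fs::rename(&tmp, &path)?;
            Ok(bytes)
        }
        // Any other I/O error is reported to the caller.
        Err(err) => Err(err),
    }
}
```

Calling `load_or_build` twice with the same directory runs the builder only on the first call; the second call returns the bytes read from disk.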

Evaluated alternatives

Approach	Cache Size	Warm Start	Cold Start
No caching (baseline)	0	~10.0s	~10.0s
Skip automatons + zstd + rmp_serde	36 MB	~5.3s	~11.5s
Full automata + zstd + rmp_serde	128 MB	~1.5s	~13.3s
Full automata + rmp_serde (no zstd)	358 MB	~1.15s	~11.1s
Full automata + bincode (no zstd)	340 MB	~0.57s	~10.5s
rkyv (this PR)	340 MB	~0.14s	~10.0s
ScanCode Toolkit (Python pickle)	395 MB	—	—

rkyv is ~4x faster than bincode and ~8x faster than uncompressed rmp_serde for warm starts, at the same cache size as bincode (340 MB) and a slightly smaller one than rmp_serde (358 MB).

Scope and exclusions

Included: rkyv cache serialization/deserialization, cache directory management, CachedLicenseIndex struct with byte-blob fields for Automaton/Rule/License, rkyv Archive derives on TokenId/TokenDictionary/TokenSet/TokenMultiset/IndexedRuleMetadata
Explicit exclusions: cache invalidation beyond schema versioning (e.g., no embedded artifact hash check yet), no CLI flags for cache control
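
Schema versioning as the only invalidation mechanism can be pictured as a small header check in front of the cached payload. The sketch below is an assumption about how such a check might look, not the format used in this PR: the magic bytes `PLIC`, the constant `SCHEMA_VERSION`, and both function names are hypothetical. It also ignores buffer alignment, which a real rkyv payload would likely need preserved.

```rust
/// Hypothetical magic bytes identifying a license index cache file.
const MAGIC: &[u8; 4] = b"PLIC";

/// Hypothetical schema version; bump it whenever the cached layout changes.
pub const SCHEMA_VERSION: u32 = 1;

/// Length of the header: 4 magic bytes plus a little-endian u32 version.
const HEADER_LEN: usize = 8;

/// Prefix a serialized index with the magic bytes and schema version.
pub fn wrap_payload(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&SCHEMA_VERSION.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Return the payload only if the header matches the current schema.
/// `None` means the cache is missing a header, foreign, or stale, and
/// the caller should rebuild the index.
pub fn unwrap_payload(file: &[u8]) -> Option<&[u8]> {
    if file.len() < HEADER_LEN || !file.starts_with(MAGIC) {
        return None;
    }
    let version = u32::from_le_bytes([file[4], file[5], file[6], file[7]]);
    if version != SCHEMA_VERSION {
        return None;
    }
    Some(&file[HEADER_LEN..])
}
```

A check like this catches layout changes but not content changes: if the embedded artifact is updated without a version bump, the stale cache would still be accepted, which is the gap the exclusion above points at.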

Intentional differences from Python

Python ScanCode Toolkit uses pickle for its license index cache (395 MB). This implementation uses rkyv for significantly faster deserialization with a smaller cache footprint (340 MB).

Follow-up work

Add embedded artifact hash to cache validation (invalidate cache when the embedded artifact changes)
Consider rkyv + zstd compression for a smaller cache at the cost of slightly slower warm starts
Add --reindex and --license-cache-dir CLI flags for cache control
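
The first follow-up item, invalidating the cache when the embedded artifact changes, could be approached by storing a hash of the artifact next to the schema version and comparing both on load. The sketch below is one possible shape under stated assumptions: `CacheKey`, `for_artifact`, and the choice of 64-bit FNV-1a are illustrative and not taken from the PR. FNV-1a is used only because it is short and deterministic; a real implementation might prefer a stronger hash.

```rust
/// 64-bit FNV-1a hash over a byte slice (non-cryptographic).
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

/// Hypothetical key stored alongside the cached index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheKey {
    pub schema_version: u32,
    pub artifact_hash: u64,
}

impl CacheKey {
    /// Derive the key for the artifact embedded in the current binary.
    pub fn for_artifact(schema_version: u32, artifact: &[u8]) -> Self {
        CacheKey {
            schema_version,
            artifact_hash: fnv1a64(artifact),
        }
    }

    /// A cache is reusable only if both the schema version and the
    /// artifact hash match the key computed for the running binary.
    pub fn matches(&self, current: &CacheKey) -> bool {
        self == current
    }
}
```

With such a key, a new release that ships an updated artifact would produce a different `artifact_hash`, so the old cache would be rejected and rebuilt even if the schema version stayed the same.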

Closes #612

sschuberth · 2026-04-14T05:43:30Z

FYI @mmurto. This sounds like the feature we wanted to get ported from scannerust, right?

abraemer · 2026-04-14T05:50:35Z

@sschuberth yes, exactly. I benchmarked a couple of options and simply chose the fastest. At 340 MB, the cache file is still smaller than ScanCode's, so I think that should be fine, but feel free to give your opinion on the tradeoff between cache file size and startup speed.

perf(license-detection): use rkyv for license index cache serialization

8cd74b8

abraemer force-pushed the perf/license-index-cache-rkyv branch from 2c235cf to 8cd74b8 on April 14, 2026 05:43

abraemer merged commit 71ba613 into main Apr 14, 2026
14 checks passed

abraemer deleted the perf/license-index-cache-rkyv branch April 14, 2026 06:19

abraemer merged 1 commit into main from perf/license-index-cache-rkyv