perf(license-detection): add rkyv-based license index cache #651

Merged
abraemer merged 1 commit into main from perf/license-index-cache-rkyv
Apr 14, 2026

@abraemer
Collaborator

@abraemer abraemer commented Apr 14, 2026

Summary

  • Add rkyv-based license index cache that persists the built license index to disk, enabling ~0.14s warm starts (down from ~10s cold starts)
  • On first run, the index is built from the embedded artifact and saved to ~/.cache/provenant/license_index/; subsequent runs load the cached index via rkyv zero-copy deserialization
  • Automaton fields are stored as daachorse serialized byte blobs; Rule and License fields are stored as rmp_serde byte blobs (avoiding cascading rkyv derive requirements on those complex types)
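The build-once, load-afterwards flow described above can be sketched with the standard library alone. This is not the PR's code: it swaps rkyv for a simple length-prefixed framing so the example stays dependency-free, and `CachedBlobs`, `encode`, `decode`, `take_blob`, and `load_or_build` are hypothetical names. It only illustrates the idea of keeping complex fields as opaque byte blobs and falling back to a rebuild when no usable cache file exists.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the cached index: complex fields are opaque byte blobs.
#[derive(Debug, Clone, PartialEq)]
pub struct CachedBlobs {
    pub automaton_bytes: Vec<u8>, // would hold a daachorse-serialized automaton
    pub rules_bytes: Vec<u8>,     // would hold rmp_serde-encoded rules
}

/// Writes each blob as an 8-byte little-endian length followed by its bytes.
pub fn encode(blobs: &CachedBlobs) -> Vec<u8> {
    let mut out = Vec::new();
    for blob in [&blobs.automaton_bytes, &blobs.rules_bytes] {
        out.extend_from_slice(&(blob.len() as u64).to_le_bytes());
        out.extend_from_slice(blob);
    }
    out
}

/// Reads one length-prefixed blob and advances `rest` past it.
pub fn take_blob<'a>(rest: &mut &'a [u8]) -> Option<Vec<u8>> {
    let current: &'a [u8] = *rest;
    if current.len() < 8 {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&current[..8]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    if current.len() - 8 < len {
        return None; // truncated file
    }
    *rest = &current[8 + len..];
    Some(current[8..8 + len].to_vec())
}

pub fn decode(bytes: &[u8]) -> Option<CachedBlobs> {
    let mut rest = bytes;
    let automaton_bytes = take_blob(&mut rest)?;
    let rules_bytes = take_blob(&mut rest)?;
    Some(CachedBlobs { automaton_bytes, rules_bytes })
}

/// Returns the index plus `true` if it came from the cache (warm start).
pub fn load_or_build<F>(cache_file: &Path, build: F) -> io::Result<(CachedBlobs, bool)>
where
    F: FnOnce() -> CachedBlobs,
{
    if let Ok(bytes) = fs::read(cache_file) {
        if let Some(blobs) = decode(&bytes) {
            return Ok((blobs, true));
        }
    }
    // Cold start: build, then persist for the next run.
    let blobs = build();
    if let Some(dir) = cache_file.parent() {
        fs::create_dir_all(dir)?;
    }
    fs::write(cache_file, encode(&blobs))?;
    Ok((blobs, false))
}
```

In this sketch an unreadable or truncated cache file is treated the same as a missing one, so a damaged cache costs one cold start rather than an error.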

Evaluated alternatives

| Approach | Cache Size | Warm Start | Cold Start |
| --- | --- | --- | --- |
| No caching (baseline) | 0 | ~10.0s | ~10.0s |
| Skip automatons + zstd + rmp_serde | 36 MB | ~5.3s | ~11.5s |
| Full automata + zstd + rmp_serde | 128 MB | ~1.5s | ~13.3s |
| Full automata + rmp_serde (no zstd) | 358 MB | ~1.15s | ~11.1s |
| Full automata + bincode (no zstd) | 340 MB | ~0.57s | ~10.5s |
| rkyv (this PR) | 340 MB | ~0.14s | ~10.0s |
| ScanCode Toolkit (Python pickle) | 395 MB | - | - |

rkyv is ~4x faster than bincode and ~8x faster than uncompressed rmp_serde for warm starts, at the same cache size as bincode (340 MB) and slightly below uncompressed rmp_serde (358 MB).
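The speedup figures follow directly from the warm-start column. A small sketch of the arithmetic, with the constants copied from the table and a hypothetical `speedup` helper:

```rust
/// Warm-start times in seconds, copied from the comparison table.
pub const RKYV_WARM: f64 = 0.14;
pub const BINCODE_WARM: f64 = 0.57;
pub const RMP_SERDE_WARM: f64 = 1.15; // uncompressed rmp_serde row
pub const NO_CACHE: f64 = 10.0;

/// How many times faster `candidate_secs` is than `baseline_secs`.
pub fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}
```

With these numbers, `speedup(BINCODE_WARM, RKYV_WARM)` is about 4.1, `speedup(RMP_SERDE_WARM, RKYV_WARM)` is about 8.2, and `speedup(NO_CACHE, RKYV_WARM)` is about 71.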

Scope and exclusions

  • Included: rkyv cache serialization/deserialization, cache directory management, CachedLicenseIndex struct with byte-blob fields for Automaton/Rule/License, rkyv Archive derives on TokenId/TokenDictionary/TokenSet/TokenMultiset/IndexedRuleMetadata
  • Explicit exclusions: cache invalidation beyond schema versioning (e.g., no embedded artifact hash check yet), no CLI flags for cache control
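Schema versioning as the only invalidation mechanism could look roughly like the following. The PR text does not describe the on-disk header, so the magic bytes, the field layout, and the names `MAGIC`, `write_header`, and `read_payload` are all assumptions made for illustration.

```rust
/// Hypothetical file magic; the real cache format is not specified in the PR text.
pub const MAGIC: [u8; 4] = *b"LIDX";

/// Prepends an 8-byte header: 4 magic bytes, then the schema version (little-endian).
pub fn write_header(schema_version: u32, payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&schema_version.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns the payload only if the magic and schema version both match.
/// A `None` result means the caller should rebuild the index and rewrite the cache.
pub fn read_payload(file: &[u8], expected_version: u32) -> Option<&[u8]> {
    if file.len() < 8 || file[..4] != MAGIC {
        return None;
    }
    let mut version_bytes = [0u8; 4];
    version_bytes.copy_from_slice(&file[4..8]);
    if u32::from_le_bytes(version_bytes) != expected_version {
        return None;
    }
    Some(&file[8..])
}
```

A check like this catches layout changes in the cached struct (when the version constant is bumped), but not a changed embedded artifact, which is the gap listed under the exclusions.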

Intentional differences from Python

  • Python ScanCode Toolkit uses pickle for its license index cache (395 MB). This implementation uses rkyv instead, which gives a smaller cache (340 MB) and ~0.14s warm starts.

Follow-up work

  • Add embedded artifact hash to cache validation (invalidate cache when the embedded artifact changes)
  • Consider rkyv + zstd compression for a smaller cache at the cost of slightly slower warm starts
  • Add --reindex and --license-cache-dir CLI flags for cache control
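The first follow-up item, invalidating the cache when the embedded artifact changes, could be sketched as a fingerprint comparison. The PR does not say which hash would be used; FNV-1a below is a stand-in chosen only because it fits in a few lines of standard-library Rust, and `fnv1a_64` and `cache_matches_artifact` are hypothetical names. FNV-1a is not cryptographic, so a real implementation might prefer a stronger digest.

```rust
/// 64-bit FNV-1a over the artifact bytes (non-cryptographic, illustration only).
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// True if the fingerprint stored alongside the cache matches the current artifact.
/// On `false`, the cached index would be discarded and rebuilt.
pub fn cache_matches_artifact(stored_fingerprint: u64, embedded_artifact: &[u8]) -> bool {
    stored_fingerprint == fnv1a_64(embedded_artifact)
}
```

The fingerprint would be computed once when the cache is written and stored next to the schema version, so the warm-start path only pays for hashing the embedded artifact.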

Closes #612

@sschuberth

FYI @mmurto. This sounds like the feature we wanted to get ported from scannerust, right?

@abraemer force-pushed the perf/license-index-cache-rkyv branch from 2c235cf to 8cd74b8 on April 14, 2026 05:43
@abraemer
Collaborator Author

@sschuberth Yes, exactly. I benchmarked a couple of options and simply chose the fastest. At 340 MB the cache file is still smaller than ScanCode's, so I think that should be fine, but feel free to share your opinion on the tradeoff between cache file size and startup speed.

@abraemer abraemer merged commit 71ba613 into main Apr 14, 2026
14 checks passed
@abraemer abraemer deleted the perf/license-index-cache-rkyv branch April 14, 2026 06:19

Development

Successfully merging this pull request may close these issues.

Joining forces with "scannerust" / index caching