Releases: o24s/vibrato-rkyv
v0.7.0
This release overhauls the dictionary loading and caching mechanisms, introducing a more flexible and robust API with improved cross-platform compatibility and safety.
Breaking Changes
- Dictionary::from_path signature has changed.
  - Before: Dictionary::from_path(path) (Unsafe, no validation in release builds)
  - After: Dictionary::from_path(path, mode: LoadMode)
  - Migration: The previous unsafe behavior is now available via the new unsafe fn from_path_unchecked. For safe loading, choose between LoadMode::Validate (always validates) and LoadMode::TrustCache (validates on first run).

        // New recommended usage
        use vibrato_rkyv::{Dictionary, LoadMode};
        let dict = Dictionary::from_path("path/to/dict.dic", LoadMode::TrustCache)?;
- Dictionary::from_zstd signature has changed.
  - Before: Dictionary::from_zstd(path)
  - After: Dictionary::from_zstd(path, strategy: CacheStrategy)
  - Migration: To replicate the old behavior of caching locally, use CacheStrategy::Local. For most applications, CacheStrategy::GlobalCache is now the recommended default.

        // New usage
        use vibrato_rkyv::{Dictionary, CacheStrategy};
        let dict = Dictionary::from_zstd("path/to/dict.zst", CacheStrategy::GlobalCache)?;
- Cache format and location have changed. Caches created with older versions (e.g., decompressed subdirectories, adjacent .sha256 files) will no longer be used and can be safely deleted.
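To make the difference between the two load modes in the migration notes concrete, here is a minimal std-only sketch. It does not use the crate: the LoadMode enum below is a local stand-in that only mirrors the names above, and Outcome, load, and the in-memory proofs set (standing in for proof files on disk) are hypothetical.

```rust
use std::collections::HashSet;

/// Local stand-in mirroring the two modes named in the release notes.
/// This is NOT the crate's type, only an illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LoadMode {
    Validate,
    TrustCache,
}

/// What a simulated load did (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// Full validation ran and nothing was written.
    Validated,
    /// Full validation ran and a proof was recorded for later loads.
    ValidatedAndRecorded,
    /// A matching proof already existed, so validation was skipped.
    SkippedValidation,
}

/// Simulates one load. `proofs` is an in-memory stand-in for proof files,
/// keyed by some fingerprint of the dictionary file.
pub fn load(mode: LoadMode, fingerprint: u64, proofs: &mut HashSet<u64>) -> Outcome {
    match mode {
        // Always validate, never touch the proof store.
        LoadMode::Validate => Outcome::Validated,
        // A proof exists: trust it and skip the expensive check.
        LoadMode::TrustCache if proofs.contains(&fingerprint) => Outcome::SkippedValidation,
        // First run: validate once, then remember that it passed.
        LoadMode::TrustCache => {
            proofs.insert(fingerprint);
            Outcome::ValidatedAndRecorded
        }
    }
}
```

Under these assumptions, repeated Validate loads cost the same every time, while TrustCache pays for validation once per distinct file fingerprint.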
New Features & Enhancements
Unified and Flexible Caching System
The caching logic has been redesigned for both compressed (.zst) and uncompressed (.dic) files to be more maintainable and efficient.
- CacheStrategy for Compressed Dictionaries: from_zstd now accepts a CacheStrategy to explicitly control cache location, making it usable in read-only environments.
  - CacheStrategy::Local: Caches in a local .cache subdirectory.
  - CacheStrategy::GlobalCache: Caches in a system-wide user cache directory (e.g., ~/.cache/vibrato-rkyv).
  - CacheStrategy::GlobalData: Caches in a system-wide user data directory.
- LoadMode for Uncompressed Dictionaries: from_path now requires a LoadMode for explicit control over the safety/performance trade-off.
  - LoadMode::Validate: Safest option. Performs full validation on every load and never writes cache files.
  - LoadMode::TrustCache: Fastest for reloads. Skips validation if a valid "proof file" exists, creating one in the global cache on the first run.
- Hierarchical & Cross-Platform Cache Validation: Caching now uses a two-tiered validation system. It first performs a near-instant check on file metadata, falling back to a full file hash only when necessary. The metadata hashing is now deterministic across platforms.
- New unsafe fn from_path_unchecked: For advanced users who can guarantee file integrity, this function provides the absolute fastest loading path by skipping all validation.
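The two-tiered validation described in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: CacheRecord, Check, meta_fingerprint, and validate are hypothetical names, and FNV-1a is used only because it is simple and platform-independent; the release notes do not say which hash the crate uses.

```rust
/// Hypothetical record stored alongside a cached dictionary.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CacheRecord {
    /// Fingerprint of cheap metadata (file size and modification time).
    pub meta_fingerprint: u64,
    /// Hash of the full file contents.
    pub content_hash: u64,
}

/// 64-bit FNV-1a. An assumption for this sketch; it gives the same result
/// on every platform because it works on plain bytes.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Fingerprints metadata using fixed-width little-endian encoding, so the
/// value does not depend on host endianness or pointer width.
pub fn meta_fingerprint(len: u64, mtime_secs: u64) -> u64 {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&len.to_le_bytes());
    buf[8..].copy_from_slice(&mtime_secs.to_le_bytes());
    fnv1a(&buf)
}

/// Result of checking a file against its cache record.
#[derive(Debug, PartialEq, Eq)]
pub enum Check {
    /// Tier 1: metadata matched, file contents were never read.
    MetadataHit,
    /// Tier 2: metadata changed, but the full content hash still matched.
    ContentHit,
    /// Neither tier matched; the cache must be rebuilt.
    Miss,
}

/// Checks metadata first and only calls `read_contents` when that fails.
pub fn validate(
    record: &CacheRecord,
    len: u64,
    mtime_secs: u64,
    read_contents: impl FnOnce() -> Vec<u8>,
) -> Check {
    if record.meta_fingerprint == meta_fingerprint(len, mtime_secs) {
        return Check::MetadataHit;
    }
    if record.content_hash == fnv1a(&read_contents()) {
        Check::ContentHit
    } else {
        Check::Miss
    }
}
```

Passing the file read as a closure keeps the expensive step lazy: when the metadata tier matches, the closure is never invoked.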
Legacy Dictionary Support (legacy feature)
When the legacy feature is enabled, vibrato-rkyv can now transparently handle older bincode-based dictionaries.
- Automatic Conversion & Caching: When a legacy dictionary is loaded via from_zstd, it is available for use while a background thread converts and caches it to the rkyv format for faster subsequent loads.
- Safe Background Thread Management: The background caching thread is managed using an RAII pattern, which makes a best-effort attempt to complete caching on exit and helps reduce the risk of orphaned temporary files, without blocking the main application.
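As a rough illustration of the RAII pattern mentioned above, the sketch below wraps a background thread in a guard whose Drop implementation joins it. CacheTask and its methods are hypothetical names chosen for this example; how the crate actually manages its thread is not described in these notes.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical guard that owns a background caching thread.
pub struct CacheTask {
    handle: Option<JoinHandle<()>>,
}

impl CacheTask {
    /// Spawns `work` on a background thread and returns immediately,
    /// so the caller can keep using the already-loaded dictionary.
    pub fn spawn<F>(work: F) -> Self
    where
        F: FnOnce() + Send + 'static,
    {
        Self {
            handle: Some(thread::spawn(work)),
        }
    }
}

impl Drop for CacheTask {
    fn drop(&mut self) {
        // Best effort: wait for the worker to finish when the guard goes
        // out of scope. A panic inside the worker is deliberately ignored.
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

In this sketch the caller only waits at the point where the guard is dropped, for example at program exit; while the guard is alive, the main thread runs freely.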
Other Improvements
- New Public API Dictionary::decompress_zstd: A utility function is now available for manually decompressing and validating dictionary files.
- Safer Dictionary Downloads: The download and extraction process now uses a temporary directory, preventing accidental data loss.
- Enhanced Testing and CI: A comprehensive test suite for the new caching logic has been added, and the CI process has been stabilized for Windows environments.
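One common way to get the "temporary directory" safety described above is to populate a staging directory and only swap it into place once everything has succeeded. The following is a sketch of that general technique, not the crate's code: install_dir, the .staging suffix, and the populate callback are all assumptions made for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Populates a sibling staging directory via `populate`, then moves it to
/// `dest`. If `populate` fails, `dest` is left exactly as it was.
pub fn install_dir<F>(dest: &Path, populate: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let mut name = dest
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "dest has no file name"))?
        .to_os_string();
    name.push(".staging");
    // A sibling directory stays on the same filesystem, so the final
    // rename does not need to copy data.
    let staging = dest.with_file_name(name);

    // Clear leftovers from an earlier interrupted run.
    if staging.exists() {
        fs::remove_dir_all(&staging)?;
    }
    fs::create_dir_all(&staging)?;

    // Download and extract into staging only; clean up on failure.
    if let Err(e) = populate(&staging) {
        let _ = fs::remove_dir_all(&staging);
        return Err(e);
    }

    // Only now replace any previous installation.
    if dest.exists() {
        fs::remove_dir_all(dest)?;
    }
    fs::rename(&staging, dest)
}
```

The key property is that a failed or interrupted download never touches an existing, working dictionary directory.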
Pre-compiled Dictionaries
Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2
v0.6.7
This release introduces a major restructuring of the project's tooling with a new compiler crate, adds significant new features such as a dictionary downloader, and includes several API improvements and performance fixes.
Breaking Changes
- Dictionary is now an enum: The Dictionary struct has been changed to an enum with Archived and Owned variants to support both memory-mapped and in-memory dictionaries. New constructors Dictionary::from_inner and Tokenizer::from_inner are provided.
- Tokenizer is now Clone, Worker no longer has a lifetime: To improve flexibility and resolve a design limitation (daac-tools/vibrato#99), Tokenizer is now cloneable (cheaply, via Arc<Dictionary>). Consequently, Worker no longer borrows a Tokenizer but owns a clone, removing its lifetime parameter (Worker<'t> is now Worker).
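The ownership change can be illustrated with simplified stand-in types. The Dictionary, Tokenizer, and Worker below are not the crate's definitions; they only model the relationship described above (a Tokenizer holding an Arc<Dictionary>, and a Worker owning a Tokenizer clone), and dict_name and handle_count are hypothetical helpers for the demonstration.

```rust
use std::sync::Arc;

/// Stand-in for the real dictionary type (an assumption for this sketch).
pub struct Dictionary {
    pub name: String,
}

/// Cloning copies only the `Arc` pointer, never the dictionary itself.
#[derive(Clone)]
pub struct Tokenizer {
    dict: Arc<Dictionary>,
}

/// Owns its tokenizer, so it carries no lifetime parameter.
pub struct Worker {
    tokenizer: Tokenizer,
}

impl Tokenizer {
    pub fn new(dict: Dictionary) -> Self {
        Self { dict: Arc::new(dict) }
    }

    /// Creates a worker that holds its own cheap clone of the tokenizer.
    pub fn new_worker(&self) -> Worker {
        Worker { tokenizer: self.clone() }
    }

    /// Number of live handles sharing the same dictionary.
    pub fn handle_count(&self) -> usize {
        Arc::strong_count(&self.dict)
    }
}

impl Worker {
    pub fn dict_name(&self) -> &str {
        &self.tokenizer.dict.name
    }
}
```

Because such a Worker borrows nothing, it can outlive the Tokenizer that created it and be moved into a spawned thread, which a Worker<'t> tied to a borrowed Tokenizer could not do without scoped threads.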
Major Changes: New compiler Crate
The train, dictgen, compile, and benchmark crates have been removed and their functionalities consolidated into a new, unified compiler crate. This provides a single command-line interface for all dictionary management tasks.
- Subcommands: The new crate offers the following subcommands:
  - train: Trains a model from a corpus.
  - dictgen: Generates dictionary source files from a model.
  - build: Builds a binary dictionary from source files.
  - full-build: A new command to execute the train, dictgen, and build pipeline in a single step.
  - transmute: A new command to convert legacy bincode-formatted dictionaries to the rkyv format.
- New Benchmarks: The old benchmark suite has been replaced with criterion benchmarks integrated into the vibrato library, including tokenization speed comparisons and detailed dictionary loading tests.
New Features
- Dictionary Download and Caching (download feature):
  - A new public API, Dictionary::from_preset_with_download, is available to download, cache, and load preset dictionaries (e.g., IPADIC, UNIDIC).
  - Downloads are verified using SHA256 checksums.
  - Decompressed dictionaries are cached locally to avoid repeated decompression.
- TokenBuf for Owned Token Data:
  - A new TokenBuf struct is introduced to hold owned token data, making it suitable for storage or cross-thread communication.
  - Token<'w> remains as a lightweight, borrowed view of a token.
  - A Token::to_buf() method is provided for conversion, following the Path -> PathBuf pattern.
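The borrowed/owned split follows the same shape as Path and PathBuf in the standard library. The sketch below models it with simplified stand-ins: the surface and feature fields are assumptions for illustration and may not match the actual layout of Token or TokenBuf in the crate.

```rust
/// Lightweight borrowed view; valid only while the underlying data lives.
/// (Simplified stand-in; the field names are assumptions.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token<'w> {
    pub surface: &'w str,
    pub feature: &'w str,
}

/// Owned counterpart that can be stored or sent to another thread.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenBuf {
    pub surface: String,
    pub feature: String,
}

impl<'w> Token<'w> {
    /// Copies the borrowed data into an owned `TokenBuf`,
    /// analogous to `Path::to_path_buf`.
    pub fn to_buf(&self) -> TokenBuf {
        TokenBuf {
            surface: self.surface.to_owned(),
            feature: self.feature.to_owned(),
        }
    }
}
```

The borrowed view stays cheap for the common case of inspecting tokens in place, and the copy is paid only when a token has to outlive the buffer it points into.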
Improvements and Fixes
- AVX2 acceleration is now functional with rkyv: The previously broken AVX2 implementation in the Scorer has been fixed and is now compatible with rkyv serialization. SIMD acceleration can now be used by building with target-cpu flags (e.g., -C target-cpu=native).
- Optimized Dictionary::from_path: Dictionary loading from a path has been optimized by reading the magic bytes before memory-mapping the entire file, improving performance for invalid files.
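The magic-byte check can be sketched as a small header read that happens before any expensive mapping. The MAGIC value and has_magic function below are hypothetical; the actual magic bytes used by the crate are not given in these notes.

```rust
use std::io::{self, Read};

/// Hypothetical magic prefix; the real value is defined by the crate.
pub const MAGIC: &[u8] = b"ExampleDictMagic";

/// Reads only the first `MAGIC.len()` bytes and reports whether they match.
/// A file too short to hold the header is treated as "not a dictionary"
/// rather than as an I/O error.
pub fn has_magic<R: Read>(reader: &mut R) -> io::Result<bool> {
    let mut header = [0u8; MAGIC.len()];
    match reader.read_exact(&mut header) {
        Ok(()) => Ok(header.as_slice() == MAGIC),
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

A caller would run this on the opened file first and proceed to memory-mapping only when it returns true, so a wrong or truncated file is rejected after reading a few bytes.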
Pre-compiled Dictionaries
Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2
v0.6.2
This release provides pre-built dictionaries for vibrato-rkyv v0.6.x.
These dictionaries are serialized in a zero-copy format and compressed with Zstandard. It is recommended to load them with the new Dictionary::from_zstd() function introduced in this version, which automatically handles decompression and caching for fast startups.
- mecab-ipadic.tar
  - Source: MeCab IPADIC v2.7.0 (https://taku910.github.io/mecab/)
- unidic-csj.tar
  - Source: UniDic-csj v3.1.1 (https://clrd.ninjal.ac.jp/unidic/)
- unidic-cwj.tar
  - Source: UniDic-cwj v3.1.1 (https://clrd.ninjal.ac.jp/unidic/)
use vibrato_rkyv::Dictionary;
// Recommended: Automatically handles decompression and caching.
let dict = Dictionary::from_zstd("path/to/your/system.dic.zst")?;
// For manually decompressed files.
let dict = Dictionary::from_path("path/to/your/extracted/system.dic")?;