v0.6.7
This release introduces a major restructuring of the project's tooling with a new compiler crate, adds significant new features such as a dictionary downloader, and includes several API improvements and performance fixes.
Breaking Changes
- `Dictionary` is now an enum:
  The `Dictionary` struct has been changed to an enum with `Archived` and `Owned` variants to support both memory-mapped and in-memory dictionaries. New constructors `Dictionary::from_inner` and `Tokenizer::from_inner` are provided.
- `Tokenizer` is now `Clone`, `Worker` no longer has a lifetime:
  To improve flexibility and resolve a design limitation (daac-tools/vibrato#99), `Tokenizer` is now cloneable (cheaply, via `Arc<Dictionary>`). Consequently, `Worker` no longer borrows a `Tokenizer` but owns a clone, removing its lifetime parameter (`Worker<'t>` is now `Worker`).
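The shape of these two breaking changes can be sketched with standard-library types only. Everything below is a simplified stand-in and not the crate's actual definitions: the variant payloads, the `contains` and `count_known` methods, and the `count_in_threads` helper are hypothetical names chosen for illustration. The sketch shows one enum serving two storage strategies, and a worker that owns a cheap `Arc`-backed clone of the tokenizer so it can move into a thread without a lifetime parameter.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the dictionary enum; the real payloads differ.
enum Dictionary {
    /// Models an archive read in place: newline-separated words in raw bytes.
    Archived(Vec<u8>),
    /// Models fully deserialized, in-memory data.
    Owned(Vec<String>),
}

impl Dictionary {
    /// Same lookup API regardless of how the data is stored.
    fn contains(&self, word: &str) -> bool {
        match self {
            Dictionary::Archived(bytes) => bytes
                .split(|&b| b == b'\n')
                .any(|w| w == word.as_bytes()),
            Dictionary::Owned(words) => words.iter().any(|w| w == word),
        }
    }
}

/// Cloning copies only the `Arc` pointer, never the dictionary itself.
#[derive(Clone)]
struct Tokenizer {
    dict: Arc<Dictionary>,
}

impl Tokenizer {
    fn new(dict: Dictionary) -> Self {
        Tokenizer { dict: Arc::new(dict) }
    }

    /// The worker receives its own clone instead of a borrow.
    fn new_worker(&self) -> Worker {
        Worker { tokenizer: self.clone() }
    }
}

/// No lifetime parameter: the worker owns everything it needs.
struct Worker {
    tokenizer: Tokenizer,
}

impl Worker {
    /// Hypothetical workload: count whitespace-separated words found in the dictionary.
    fn count_known(&self, text: &str) -> usize {
        text.split_whitespace()
            .filter(|w| self.tokenizer.dict.contains(w))
            .count()
    }
}

/// Spawns one thread per input; each thread owns its worker outright.
fn count_in_threads(tokenizer: &Tokenizer, inputs: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|text| {
            let worker = tokenizer.new_worker();
            thread::spawn(move || worker.count_known(&text))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

With a borrowed `Worker<'t>`, the `thread::spawn` call above would not compile without scoped threads, because the closure must be `'static`. Owning a clone removes that restriction.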
Major Changes: New compiler Crate
The train, dictgen, compile, and benchmark crates have been removed and their functionalities consolidated into a new, unified compiler crate. This provides a single command-line interface for all dictionary management tasks.
- Subcommands: The new crate offers the following subcommands:
  - `train`: Trains a model from a corpus.
  - `dictgen`: Generates dictionary source files from a model.
  - `build`: Builds a binary dictionary from source files.
  - `full-build`: A new command to execute the `train`, `dictgen`, and `build` pipeline in a single step.
  - `transmute`: A new command to convert legacy `bincode`-formatted dictionaries to the `rkyv` format.
- New Benchmarks: The old benchmark suite has been replaced with `criterion` benchmarks integrated into the `vibrato` library, including tokenization speed comparisons and detailed dictionary loading tests.
New Features
- Dictionary Download and Caching (`download` feature):
  - A new public API, `Dictionary::from_preset_with_download`, is available to download, cache, and load preset dictionaries (e.g., IPADIC, UNIDIC).
  - Downloads are verified using SHA256 checksums.
  - Decompressed dictionaries are cached locally to avoid repeated decompression.
- `TokenBuf` for Owned Token Data:
  - A new `TokenBuf` struct is introduced to hold owned token data, making it suitable for storage or cross-thread communication.
  - `Token<'w>` remains as a lightweight, borrowed view of a token.
  - A `Token::to_buf()` method is provided for conversion, following the `Path -> PathBuf` pattern.
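The borrowed/owned split between `Token<'w>` and `TokenBuf` can be illustrated with a minimal, standard-library-only sketch. The field names (`surface`, `feature`) and the `tokenize_in_thread` helper are assumptions made for this example and may not match the crate's real types. Only the `to_buf()` conversion mirrors the notes above.

```rust
use std::thread;

/// Borrowed view: valid only while the buffer it points into is alive.
/// Field names are hypothetical.
#[derive(Debug, Clone, Copy)]
struct Token<'w> {
    surface: &'w str,
    feature: &'w str,
}

/// Owned counterpart: can be stored or sent to another thread.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TokenBuf {
    surface: String,
    feature: String,
}

impl<'w> Token<'w> {
    /// Copies the borrowed data into an owned value, like `Path::to_path_buf`.
    fn to_buf(&self) -> TokenBuf {
        TokenBuf {
            surface: self.surface.to_owned(),
            feature: self.feature.to_owned(),
        }
    }
}

/// Splits `text` on whitespace in a background thread and returns owned tokens.
/// The borrowed `Token`s cannot leave the thread, but `TokenBuf`s can.
fn tokenize_in_thread(text: String) -> Vec<TokenBuf> {
    thread::spawn(move || {
        // `text` plays the role of the worker's internal buffer here.
        let tokens: Vec<Token<'_>> = text
            .split_whitespace()
            .map(|s| Token { surface: s, feature: "*" })
            .collect();
        // Convert before `text` is dropped at the end of the closure.
        tokens.iter().map(|t| t.to_buf()).collect::<Vec<_>>()
    })
    .join()
    .expect("tokenizer thread panicked")
}
```

The cost of `to_buf()` is one allocation per string field, so the borrowed `Token` remains the better choice inside a hot loop that does not need to keep the results.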
Improvements and Fixes
- AVX2 acceleration is now functional with `rkyv`: The previously broken AVX2 implementation in the `Scorer` has been fixed and is now compatible with `rkyv` serialization. SIMD acceleration can now be used by building with target-cpu flags (e.g., `-C target-cpu=native`).
- Optimized `Dictionary::from_path`: Dictionary loading from a path now reads the magic bytes before memory-mapping the entire file, so invalid files are rejected without the cost of mapping them.
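The magic-byte check can be sketched as follows. This is an illustration of the general technique, not the crate's code: the `MAGIC` value, the 8-byte header length, and the `open_checked` function are all hypothetical, since these notes do not state the real header format.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical magic value; the real header bytes are not given in these notes.
const MAGIC: &[u8; 8] = b"DEMODICT";

/// Opens `path` and validates its header before any expensive loading step.
fn open_checked(path: &Path) -> io::Result<File> {
    let mut file = File::open(path)?;
    let mut header = [0u8; 8];
    // Reads only 8 bytes; fails with `UnexpectedEof` if the file is shorter.
    file.read_exact(&mut header)?;
    if &header != MAGIC {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "bad magic bytes",
        ));
    }
    // Only at this point would a loader memory-map or fully read the file.
    Ok(file)
}
```

A wrong or truncated file is rejected after a single small read, which is the saving described in the item above.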
Pre-compiled Dictionaries
Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2