v0.6.7

o24s released this 27 Oct 13:25
· 28 commits to main since this release

This release introduces a major restructuring of the project's tooling with a new compiler crate, adds significant new features such as a dictionary downloader, and includes several API improvements and performance fixes.

Breaking Changes

  • Dictionary is now an enum:
    The Dictionary struct has been changed to an enum with Archived and Owned variants to support both memory-mapped and in-memory dictionaries. New constructors Dictionary::from_inner and Tokenizer::from_inner are provided.

  • Tokenizer is now Clone, Worker no longer has a lifetime:
    To improve flexibility and resolve a design limitation (daac-tools/vibrato#99), Tokenizer is now cloneable (cheaply, via Arc<Dictionary>). Consequently, Worker no longer borrows a Tokenizer but owns a clone, removing its lifetime parameter (Worker<'t> is now Worker).
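
The two breaking changes can be pictured with a small standalone model. This is only a sketch: the types below are simplified stand-ins written for illustration, not the crate's actual definitions, and the variant payloads, the `describe` helper, and the `dictionary_kind` method are assumptions. It shows why a `Worker` that owns a cheap `Arc`-backed clone of the `Tokenizer` needs no lifetime parameter and can be moved into a thread.

```rust
use std::sync::Arc;
use std::thread;

// Simplified stand-in for the Dictionary enum. In the real crate the
// variants wrap a memory-mapped archive and an in-memory dictionary;
// here they hold plain data so the example compiles on its own.
#[allow(dead_code)]
enum Dictionary {
    Archived(Vec<u8>),  // stand-in for a memory-mapped rkyv archive
    Owned(Vec<String>), // stand-in for a deserialized, in-memory dictionary
}

impl Dictionary {
    // Hypothetical helper, used only to observe which variant is held.
    fn describe(&self) -> &'static str {
        match self {
            Dictionary::Archived(_) => "archived",
            Dictionary::Owned(_) => "owned",
        }
    }
}

// Cloning copies only the Arc pointer, never the dictionary data.
#[derive(Clone)]
struct Tokenizer {
    dict: Arc<Dictionary>,
}

impl Tokenizer {
    fn new(dict: Dictionary) -> Self {
        Tokenizer { dict: Arc::new(dict) }
    }

    // The worker receives its own clone instead of a borrow.
    fn new_worker(&self) -> Worker {
        Worker { tokenizer: self.clone() }
    }
}

// No lifetime parameter: the worker owns everything it needs.
struct Worker {
    tokenizer: Tokenizer,
}

impl Worker {
    fn dictionary_kind(&self) -> &'static str {
        self.tokenizer.dict.describe()
    }
}

fn main() {
    let tokenizer = Tokenizer::new(Dictionary::Owned(vec!["東京".to_string()]));
    let worker = tokenizer.new_worker();

    // Because Worker is 'static, it can be moved into a spawned thread.
    let handle = thread::spawn(move || worker.dictionary_kind());
    println!("worker sees an {} dictionary", handle.join().unwrap());
}
```

With the previous `Worker<'t>` design, the `thread::spawn` call in a model like this would be rejected unless the tokenizer itself had a `'static` lifetime, since the worker would hold a borrow of it.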

Major Changes: New compiler Crate

The train, dictgen, compile, and benchmark crates have been removed, and their functionality has been consolidated into a new, unified compiler crate. This provides a single command-line interface for all dictionary management tasks.

  • Subcommands: The new crate offers the following subcommands:
    • train: Trains a model from a corpus.
    • dictgen: Generates dictionary source files from a model.
    • build: Builds a binary dictionary from source files.
    • full-build: A new command to execute the train, dictgen, and build pipeline in a single step.
    • transmute: A new command to convert legacy bincode-formatted dictionaries to the rkyv format.
  • New Benchmarks: The old benchmark suite has been replaced with criterion benchmarks integrated into the vibrato library, including tokenization speed comparisons and detailed dictionary loading tests.

New Features

  • Dictionary Download and Caching (download feature):

    • A new public API, Dictionary::from_preset_with_download, is available to download, cache, and load preset dictionaries (e.g., IPADIC, UNIDIC).
    • Downloads are verified using SHA256 checksums.
    • Decompressed dictionaries are cached locally to avoid repeated decompression.
  • TokenBuf for Owned Token Data:

    • A new TokenBuf struct is introduced to hold owned token data, making it suitable for storage or cross-thread communication.
    • Token<'w> remains as a lightweight, borrowed view of a token.
    • A Token::to_buf() method is provided for conversion, following the Path -> PathBuf pattern.
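
The borrowed/owned split between Token<'w> and TokenBuf mirrors Path and PathBuf. The following is a minimal standalone sketch of that pattern, not the crate's code: the field names `surface` and `feature`, the `collect_owned` helper, and the whitespace split standing in for real tokenization are all assumptions made for illustration.

```rust
// Borrowed view: its string slices are tied to the buffer that produced
// them, in the way the notes describe Token<'w>.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Token<'w> {
    surface: &'w str,
    feature: &'w str,
}

// Owned counterpart: holds its own Strings, so it has no lifetime and
// can be stored or sent to another thread.
#[derive(Debug, Clone, PartialEq)]
struct TokenBuf {
    surface: String,
    feature: String,
}

impl<'w> Token<'w> {
    // Copies the borrowed data into an owned value (Path -> PathBuf style).
    fn to_buf(&self) -> TokenBuf {
        TokenBuf {
            surface: self.surface.to_owned(),
            feature: self.feature.to_owned(),
        }
    }
}

// Hypothetical helper: a whitespace split stands in for tokenization.
fn collect_owned(sentence: &str) -> Vec<TokenBuf> {
    sentence
        .split_whitespace()
        .map(|s| Token { surface: s, feature: "*" }.to_buf())
        .collect()
}

fn main() {
    let owned = {
        let sentence = String::from("本 を 読む");
        collect_owned(&sentence)
    }; // `sentence` is dropped here; the owned tokens remain valid.

    // Owned tokens can cross a thread boundary.
    let handle = std::thread::spawn(move || owned.len());
    println!("token count: {}", handle.join().unwrap());
}
```

The borrowed view stays cheap for the common case of inspecting tokens in place, and the copy is paid only when a caller explicitly asks for an owned value.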

Improvements and Fixes

  • AVX2 acceleration is now functional with rkyv: The previously broken AVX2 implementation in the Scorer has been fixed and is now compatible with rkyv serialization. SIMD acceleration can now be used by building with target-cpu flags (e.g., -C target-cpu=native).
  • Optimized Dictionary::from_path: Loading a dictionary from a path now reads the magic bytes before memory-mapping the entire file, so invalid files are rejected early without the cost of mapping them.
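
The idea behind the from_path optimization can be sketched with the standard library alone. This is an illustration under stated assumptions, not the crate's implementation: the magic value `DEMODICT`, its 8-byte length, and the `has_valid_magic` and `load` functions are hypothetical, and a plain file read stands in for memory-mapping because std has no mmap API.

```rust
use std::fs::File;
use std::io::{self, Read, Write};
use std::path::Path;

// Hypothetical magic value; the crate's real header bytes are not
// stated in these notes.
const MAGIC: &[u8; 8] = b"DEMODICT";

// Reads only the first 8 bytes, so a wrong file is detected cheaply.
fn has_valid_magic(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut header = [0u8; 8];
    match file.read_exact(&mut header) {
        Ok(()) => Ok(&header == MAGIC),
        // A file shorter than the header cannot be a valid dictionary.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

// Checks the header first and touches the full file only on success.
fn load(path: &Path) -> io::Result<Vec<u8>> {
    if !has_valid_magic(path)? {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic bytes"));
    }
    // A real loader would memory-map the file here; reading it stands in.
    std::fs::read(path)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir();
    let good = dir.join("magic_demo_good.bin");
    let bad = dir.join("magic_demo_bad.bin");
    File::create(&good)?.write_all(b"DEMODICTpayload")?;
    File::create(&bad)?.write_all(b"not a dictionary")?;

    println!("good file loads: {}", load(&good).is_ok());
    println!("bad file loads: {}", load(&bad).is_ok());

    std::fs::remove_file(&good)?;
    std::fs::remove_file(&bad)?;
    Ok(())
}
```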

Pre-compiled Dictionaries

Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2