Releases: o24s/vibrato-rkyv

v0.7.0

06 Nov 15:51

This release overhauls the dictionary loading and caching mechanisms, introducing a more flexible and robust API with improved cross-platform compatibility and safety.

Breaking Changes

  • Dictionary::from_path signature has changed.

    • Before: Dictionary::from_path(path) (Unsafe, no validation in release builds)
    • After: Dictionary::from_path(path, mode: LoadMode)
    • Migration: The previous unsafe behavior is now available via the new unsafe fn from_path_unchecked. For safe loading, choose between LoadMode::Validate (always validates) or LoadMode::TrustCache (validates on first run).
    // New recommended usage
    use vibrato_rkyv::{Dictionary, LoadMode};
    let dict = Dictionary::from_path("path/to/dict.dic", LoadMode::TrustCache)?;
  • Dictionary::from_zstd signature has changed.

    • Before: Dictionary::from_zstd(path)
    • After: Dictionary::from_zstd(path, strategy: CacheStrategy)
    • Migration: To replicate the old behavior of caching locally, use CacheStrategy::Local. For most applications, CacheStrategy::GlobalCache is now the recommended default.
    // New usage
    use vibrato_rkyv::{Dictionary, CacheStrategy};
    let dict = Dictionary::from_zstd("path/to/dict.zst", CacheStrategy::GlobalCache)?;
  • Cache format and location have changed. Caches created with older versions (e.g., decompressed subdirectories, adjacent .sha256 files) will no longer be used and can be safely deleted.
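
To make the difference between the cache strategies concrete, here is a minimal sketch of how a cache directory could be chosen for each one. The `Strategy` enum and `cache_dir` function are hypothetical stand-ins, not the crate's own types, and the paths assume Linux-style defaults with the home directory passed in explicitly.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical mirror of the three strategies named in these notes.
/// This is NOT the crate's `CacheStrategy`; it only models where a cache might live.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Local,
    GlobalCache,
    GlobalData,
}

/// Pick a cache directory for `dict_path`, given the user's home directory.
/// Assumption: Linux-style locations; other platforms use different directories.
pub fn cache_dir(strategy: Strategy, dict_path: &Path, home: &Path) -> PathBuf {
    match strategy {
        // A `.cache` subdirectory next to the dictionary file.
        // This needs write access to the dictionary's own directory.
        Strategy::Local => dict_path
            .parent()
            .unwrap_or_else(|| Path::new("."))
            .join(".cache"),
        // A per-user cache directory, independent of where the dictionary lives.
        Strategy::GlobalCache => home.join(".cache").join("vibrato-rkyv"),
        // A per-user data directory (assumed here to be ~/.local/share).
        Strategy::GlobalData => home.join(".local").join("share").join("vibrato-rkyv"),
    }
}
```

In this model only `Strategy::Local` writes beside the dictionary, which is why a global strategy is the better fit when the dictionary sits on a read-only mount.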

New Features & Enhancements

Unified and Flexible Caching System

The caching logic has been redesigned for both compressed (.zst) and uncompressed (.dic) files to be more maintainable and efficient.

  • CacheStrategy for Compressed Dictionaries: from_zstd now accepts a CacheStrategy to explicitly control cache location, making it usable in read-only environments.
    • CacheStrategy::Local: Caches in a local .cache subdirectory.
    • CacheStrategy::GlobalCache: Caches in a system-wide user cache directory (e.g., ~/.cache/vibrato-rkyv).
    • CacheStrategy::GlobalData: Caches in a system-wide user data directory.
  • LoadMode for Uncompressed Dictionaries: from_path now requires a LoadMode for explicit control over the safety/performance trade-off.
    • LoadMode::Validate: Safest option. Performs full validation on every load and never writes cache files.
    • LoadMode::TrustCache: Fastest for reloads. Skips validation if a valid "proof file" exists, creating one in the global cache on the first run.
  • Hierarchical & Cross-Platform Cache Validation: Caching now uses a two-tiered validation system. It first performs a near-instant check on file metadata, falling back to a full file hash only when necessary. The metadata hashing is now deterministic across platforms.
  • New unsafe fn from_path_unchecked: For advanced users who can guarantee file integrity, this function provides the absolute fastest loading path by skipping all validation.
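
The two-tiered validation described above can be sketched roughly as follows. This is an illustration of the general idea only: `Proof`, `Check`, `make_proof`, and `check` are hypothetical names, and `DefaultHasher` stands in for whatever hash the crate actually uses. The real proof-file format is not described in these notes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::Path;
use std::time::UNIX_EPOCH;

/// Hypothetical record of a file that was fully validated earlier.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Proof {
    pub len: u64,
    pub mtime_nanos: u128,
    pub content_hash: u64,
}

/// Which tier (if any) accepted the file.
#[derive(Debug, PartialEq, Eq)]
pub enum Check {
    MetadataMatch,
    ContentMatch,
    Mismatch,
}

/// Tier 1 input: cheap metadata (size and modification time).
fn metadata_of(path: &Path) -> io::Result<(u64, u128)> {
    let meta = fs::metadata(path)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos())
        .unwrap_or(0);
    Ok((meta.len(), mtime))
}

/// Tier 2 input: a hash of the whole file.
/// Assumption: DefaultHasher is a stand-in, not a cryptographic hash.
fn hash_contents(path: &Path) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    fs::read(path)?.hash(&mut hasher);
    Ok(hasher.finish())
}

/// Record a proof for a file that has just been validated.
pub fn make_proof(path: &Path) -> io::Result<Proof> {
    let (len, mtime_nanos) = metadata_of(path)?;
    Ok(Proof { len, mtime_nanos, content_hash: hash_contents(path)? })
}

/// Try the cheap metadata comparison first; hash the file only if that fails.
pub fn check(path: &Path, proof: &Proof) -> io::Result<Check> {
    let (len, mtime_nanos) = metadata_of(path)?;
    if len == proof.len && mtime_nanos == proof.mtime_nanos {
        return Ok(Check::MetadataMatch);
    }
    // A different length can never hash to the same content, so skip the read.
    if len == proof.len && hash_contents(path)? == proof.content_hash {
        return Ok(Check::ContentMatch);
    }
    Ok(Check::Mismatch)
}
```

A file that was only touched (same bytes, new timestamp) falls through to the content tier and is still accepted, while a file with different contents is rejected and would need full validation again.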

Legacy Dictionary Support (legacy feature)

When the legacy feature is enabled, vibrato-rkyv can now transparently handle older bincode-based dictionaries.

  • Automatic Conversion & Caching: When a legacy dictionary is loaded via from_zstd, it is usable immediately while a background thread converts it to the rkyv format and caches the result for faster subsequent loads.
  • Safe Background Thread Management: The background caching thread is managed using an RAII pattern, which makes a best-effort attempt to complete caching on exit and helps reduce the risk of orphaned temporary files, without blocking the main application.
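
As a rough illustration of the RAII idea, the sketch below wraps a worker thread in a guard whose `Drop` implementation joins the thread. `BackgroundTask` and `run_and_wait` are hypothetical names, and joining on drop is just one simple variant of the pattern. The crate's actual thread management may differ.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical guard that owns a background worker thread.
pub struct BackgroundTask {
    handle: Option<JoinHandle<()>>,
}

impl BackgroundTask {
    /// Start `work` on a new thread and return immediately.
    pub fn spawn<F>(work: F) -> Self
    where
        F: FnOnce() + Send + 'static,
    {
        Self { handle: Some(thread::spawn(work)) }
    }
}

impl Drop for BackgroundTask {
    fn drop(&mut self) {
        if let Some(handle) = self.handle.take() {
            // Best effort: wait for the worker, but ignore a panic inside it
            // instead of propagating it out of drop.
            let _ = handle.join();
        }
    }
}

/// Demo: the closure stands in for "convert and cache the dictionary".
pub fn run_and_wait() -> bool {
    let done = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&done);
    let task = BackgroundTask::spawn(move || {
        flag.store(true, Ordering::SeqCst);
    });
    // The caller is free to do other work here while the thread runs.
    drop(task); // dropping the guard joins the thread
    done.load(Ordering::SeqCst)
}
```

Because the join happens in `drop`, the work gets a chance to finish whenever the guard goes out of scope, including on early returns.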

Other Improvements

  • New Public API Dictionary::decompress_zstd: A utility function is now available for manually decompressing and validating dictionary files.
  • Safer Dictionary Downloads: The download and extraction process now uses a temporary directory, preventing accidental data loss.
  • Enhanced Testing and CI: A comprehensive test suite for the new caching logic has been added, and the CI process has been stabilized for Windows environments.
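
The temporary-directory approach to downloads can be sketched as "stage first, then move into place". The function below is a hypothetical illustration using only the standard library, not the crate's download code. It stages the file in a sibling directory so that the final rename stays on the same filesystem.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write `bytes` to `dest` by way of a staging directory next to it.
/// If writing fails partway, whatever already exists at `dest` is left untouched.
pub fn install_via_staging(dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let parent = dest.parent().unwrap_or_else(|| Path::new("."));
    // Hypothetical naming scheme: one staging directory per process.
    let staging = parent.join(format!(".staging-{}", std::process::id()));
    fs::create_dir_all(&staging)?;

    let result = (|| -> io::Result<()> {
        let tmp = staging.join("payload");
        fs::write(&tmp, bytes)?;
        // Move the finished file into place in a single step.
        fs::rename(&tmp, dest)
    })();

    // Clean up the staging directory whether or not the install succeeded.
    let _ = fs::remove_dir_all(&staging);
    result
}
```

The existing file is only replaced once the new one has been written completely, which is the property that guards against accidental data loss.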

Pre-compiled Dictionaries

Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2

v0.6.7

27 Oct 13:25

This release introduces a major restructuring of the project's tooling with a new compiler crate, adds significant new features such as a dictionary downloader, and includes several API improvements and performance fixes.

Breaking Changes

  • Dictionary is now an enum:
    The Dictionary struct has been changed to an enum with Archived and Owned variants to support both memory-mapped and in-memory dictionaries. New constructors Dictionary::from_inner and Tokenizer::from_inner are provided.

  • Tokenizer is now Clone, Worker no longer has a lifetime:
    To improve flexibility and resolve a design limitation (daac-tools/vibrato#99), Tokenizer is now cloneable (cheaply, via Arc<Dictionary>). Consequently, Worker no longer borrows a Tokenizer but owns a clone, removing its lifetime parameter (Worker<'t> is now Worker).
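
The shape of these two changes can be illustrated with simplified stand-in types. The definitions below are hypothetical models, not the crate's real ones: the variants hold plain byte buffers instead of memory-mapped or deserialized dictionary data, and `dict_len` exists only for the demo.

```rust
use std::sync::Arc;
use std::thread;

/// Simplified stand-in: one variant borrows existing bytes (as a memory map
/// would), the other owns its data in memory.
pub enum Dictionary {
    Archived(&'static [u8]),
    Owned(Vec<u8>),
}

impl Dictionary {
    pub fn bytes(&self) -> &[u8] {
        match self {
            Dictionary::Archived(b) => *b,
            Dictionary::Owned(v) => v.as_slice(),
        }
    }
}

/// Cloning copies only the `Arc` pointer, never the dictionary itself.
#[derive(Clone)]
pub struct Tokenizer {
    dict: Arc<Dictionary>,
}

impl Tokenizer {
    pub fn new(dict: Dictionary) -> Self {
        Self { dict: Arc::new(dict) }
    }

    /// The worker owns its own clone, so it carries no lifetime parameter.
    pub fn new_worker(&self) -> Worker {
        Worker { tokenizer: self.clone() }
    }
}

pub struct Worker {
    tokenizer: Tokenizer,
}

impl Worker {
    pub fn dict_len(&self) -> usize {
        self.tokenizer.dict.bytes().len()
    }
}

/// A lifetime-free worker can be moved into a spawned thread.
pub fn worker_on_thread() -> usize {
    let tokenizer = Tokenizer::new(Dictionary::Owned(vec![1, 2, 3]));
    let worker = tokenizer.new_worker();
    thread::spawn(move || worker.dict_len()).join().unwrap()
}
```

With a borrowed `Worker<'t>`, the `thread::spawn` call above would not compile unless the tokenizer were `'static`. Owning a cheap clone removes that restriction.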

Major Changes: New compiler Crate

The train, dictgen, compile, and benchmark crates have been removed and their functionalities consolidated into a new, unified compiler crate. This provides a single command-line interface for all dictionary management tasks.

  • Subcommands: The new crate offers the following subcommands:
    • train: Trains a model from a corpus.
    • dictgen: Generates dictionary source files from a model.
    • build: Builds a binary dictionary from source files.
    • full-build: A new command to execute the train, dictgen, and build pipeline in a single step.
    • transmute: A new command to convert legacy bincode-formatted dictionaries to the rkyv format.
  • New Benchmarks: The old benchmark suite has been replaced with criterion benchmarks integrated into the vibrato library, including tokenization speed comparisons and detailed dictionary loading tests.

New Features

  • Dictionary Download and Caching (download feature):

    • A new public API, Dictionary::from_preset_with_download, is available to download, cache, and load preset dictionaries (e.g., IPADIC, UNIDIC).
    • Downloads are verified using SHA256 checksums.
    • Decompressed dictionaries are cached locally to avoid repeated decompression.
  • TokenBuf for Owned Token Data:

    • A new TokenBuf struct is introduced to hold owned token data, making it suitable for storage or cross-thread communication.
    • Token<'w> remains as a lightweight, borrowed view of a token.
    • A Token::to_buf() method is provided for conversion, following the Path -> PathBuf pattern.
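
The borrowed/owned split mirrors `Path` and `PathBuf`. The sketch below uses simplified, hypothetical versions of `Token` and `TokenBuf` with just two fields, and splits on whitespace as a stand-in for real tokenization. The crate's actual types carry more data.

```rust
use std::sync::mpsc;
use std::thread;

/// Borrowed view: valid only while the source text is alive.
pub struct Token<'w> {
    surface: &'w str,
    feature: &'w str,
}

/// Owned counterpart: safe to store or send to another thread.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TokenBuf {
    pub surface: String,
    pub feature: String,
}

impl<'w> Token<'w> {
    /// Copy the borrowed data into an owned value.
    pub fn to_buf(&self) -> TokenBuf {
        TokenBuf {
            surface: self.surface.to_owned(),
            feature: self.feature.to_owned(),
        }
    }
}

/// Convert borrowed tokens to owned ones, then pass them across a channel.
pub fn collect_on_other_thread(sentence: &str) -> Vec<TokenBuf> {
    let (tx, rx) = mpsc::channel();
    // Whitespace splitting stands in for morphological analysis.
    let owned: Vec<TokenBuf> = sentence
        .split_whitespace()
        .map(|w| Token { surface: w, feature: "*" }.to_buf())
        .collect();
    // Only owned data can move into the spawned thread.
    let sender = thread::spawn(move || {
        for token in owned {
            tx.send(token).unwrap();
        }
    });
    sender.join().unwrap();
    rx.iter().collect()
}
```

Keeping `Token<'w>` as the default avoids allocation on the hot path, and `to_buf()` is the opt-in point where the copy happens.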

Improvements and Fixes

  • AVX2 acceleration is now functional with rkyv: The previously broken AVX2 implementation in the Scorer has been fixed and is now compatible with rkyv serialization. SIMD acceleration can now be used by building with target-cpu flags (e.g., -C target-cpu=native).
  • Optimized Dictionary::from_path: Loading from a path now reads the magic bytes before memory-mapping the entire file, so invalid files are rejected quickly without the cost of mapping them.
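
Checking a header before mapping a file can be sketched as below. The `MAGIC` value and `has_magic` function are hypothetical. The crate's real magic bytes are not given in these notes, and the sketch stops before any memory-mapping step.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical magic prefix, for illustration only.
pub const MAGIC: &[u8] = b"DEMO-DICT\0";

/// Read just the first few bytes and compare them with the expected prefix.
/// A caller would memory-map the file only when this returns `Ok(true)`.
pub fn has_magic(path: &Path) -> io::Result<bool> {
    let mut file = File::open(path)?;
    let mut header = vec![0u8; MAGIC.len()];
    match file.read_exact(&mut header) {
        Ok(()) => Ok(header == MAGIC),
        // A file shorter than the prefix cannot be a valid dictionary.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}
```

The check reads only `MAGIC.len()` bytes, so a wrong or truncated file is rejected without touching the rest of its contents.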

Pre-compiled Dictionaries

Pre-compiled dictionaries compatible with this version are available at:
https://github.com/stellanomia/vibrato-rkyv/releases/tag/v0.6.2

v0.6.2

16 Oct 14:27

This release provides pre-built dictionaries for vibrato-rkyv v0.6.x.

These dictionaries are serialized in a zero-copy format and compressed with Zstandard. It is recommended to load them with the new Dictionary::from_zstd() function introduced in this version, which automatically handles decompression and caching for fast startups.

use vibrato_rkyv::Dictionary;

// Recommended: Automatically handles decompression and caching.
let dict = Dictionary::from_zstd("path/to/your/system.dic.zst")?;

// For manually decompressed files.
let dict = Dictionary::from_path("path/to/your/extracted/system.dic")?;