Release v0.7.1: feat: Support experimental N-best tokenization · o24s/vibrato-rkyv

v0.7.1
bb4ab3f
Choose a tag to compare

Filter

View all tags

v0.7.1: feat: Support experimental N-best tokenization

v0.7.1
bb4ab3f
Choose a tag to compare

Filter

View all tags

o24s tagged this 08 Nov 16:24

This commit introduces an experimental implementation of N-best tokenization, allowing users to retrieve multiple candidate token sequences sorted by cost. This feature is useful for downstream NLP tasks that need to handle ambiguity.

The implementation uses an A* search algorithm to find the N-best paths. The forward costs calculated during the Viterbi lattice construction are used as the heuristic for the A* search.

Changes:
- New `Worker` API: The primary interface is through the `Worker`. A user first calls `worker.tokenize_nbest(n)` and then uses the following methods to inspect the results:
  - `num_nbest_paths()`: Gets the number of paths found (up to `n`).
  - `path_cost(i)`: Gets the total cost of the i-th best path.
  - `nbest_token_iter(i)`: Returns an iterator over the tokens in the i-th best path.
  This iterator-based API is memory-efficient as it avoids allocating all results at once.

Implementation Details:
- Separate Lattices: A new `LatticeNBest` struct is introduced for N-best analysis. The original `Lattice` for 1-best analysis remains untouched. This separation avoids affecting the performance of the standard tokenize() method.

- A* Search Implementation: A new `nbest_generator` module contains the core A* search logic, which traverses the lattice backward from the EOS node to find the best paths.

Resolves: https://github.com/daac-tools/vibrato/issues/151

Assets 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!