Skip to content

v0.7.1

@o24s o24s tagged this 08 Nov 16:24
This commit introduces an experimental implementation of N-best tokenization, allowing users to retrieve multiple candidate token sequences sorted by cost. This feature is useful for downstream NLP tasks that need to handle ambiguity.

The implementation uses an A* search algorithm to find the N-best paths. The forward costs calculated during the Viterbi lattice construction are used as the heuristic for the A* search.

Changes:
- New `Worker` API: The primary interface is through the `Worker`. A user first calls `worker.tokenize_nbest(n)` and then uses the following methods to inspect the results:
  - `num_nbest_paths()`: Gets the number of paths found (up to `n`).
  - `path_cost(i)`: Gets the total cost of the i-th best path.
  - `nbest_token_iter(i)`: Returns an iterator over the tokens in the i-th best path.
  This iterator-based API is memory-efficient as it avoids allocating all results at once.

Implementation Details:
- Separate Lattices: A new `LatticeNBest` struct is introduced for N-best analysis. The original `Lattice` for 1-best analysis remains untouched. This separation avoids affecting the performance of the standard tokenize() method.

- A* Search Implementation: A new `nbest_generator` module contains the core A* search logic, which traverses the lattice backward from the EOS node to find the best paths.

Resolves: https://github.com/daac-tools/vibrato/issues/151
Assets 2
Loading