This commit introduces an experimental implementation of N-best tokenization, allowing users to retrieve multiple candidate token sequences sorted by cost. This feature is useful for downstream NLP tasks that need to handle ambiguity.
The implementation uses an A* search algorithm to find the N-best paths. The forward costs calculated during the Viterbi lattice construction are used as the heuristic for the A* search.
Changes:
- New `Worker` API: The primary interface is through the `Worker`. A user first calls `worker.tokenize_nbest(n)` and then uses the following methods to inspect the results:
- `num_nbest_paths()`: Gets the number of paths found (up to `n`).
- `path_cost(i)`: Gets the total cost of the i-th best path.
- `nbest_token_iter(i)`: Returns an iterator over the tokens in the i-th best path.
This iterator-based API is memory-efficient as it avoids allocating all results at once.
Implementation Details:
- Separate Lattices: A new `LatticeNBest` struct is introduced for N-best analysis. The original `Lattice` for 1-best analysis remains untouched. This separation avoids affecting the performance of the standard tokenize() method.
- A* Search Implementation: A new `nbest_generator` module contains the core A* search logic, which traverses the lattice backward from the EOS node to find the best paths.
Resolves: https://github.com/daac-tools/vibrato/issues/151