See Nicolas Chopin's book for a really nice introduction to the topic. The algorithm consists in carrying $N$ particles for each sequence in the batch, and at each step to:
We use multinomial resampling in this first PR, although it is known to have a very large variance. To make the implementation easier, we combine (1) and (2) into a single step, similar to what we do with beam search.
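For reference, multinomial resampling can be sketched in a few lines. This is only an illustration using the standard library, not the code in this PR; the function name `multinomial_resample` and its arguments are hypothetical.

```python
import random


def multinomial_resample(particles, weights, rng=None):
    """Draw len(particles) particles with replacement, each one picked
    with probability proportional to its weight.

    `particles` and `weights` are plain lists of equal length; the
    weights need not be normalised. Names are illustrative only.
    """
    if len(particles) != len(weights):
        raise ValueError("particles and weights must have the same length")
    if sum(weights) <= 0:
        raise ValueError("weights must have a positive sum")
    rng = rng or random.Random()
    # random.choices normalises the weights internally and samples
    # independently for each of the k draws (hence the high variance).
    return rng.choices(particles, weights=weights, k=len(particles))
```

Because every draw is independent, the number of copies of a given particle fluctuates a lot from one call to the next, which is the variance issue mentioned above.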
Note that there is a subtlety when doing structured generation. We can think of the following simple scheme to sample from the distribution of sequences that follow the structure:
But this can be very inefficient. Instead, we move particles using a specific proposal: the biased next-token logits. Since this is not exactly sampling from the original distribution, we need to resample the particles using the factor $P_i / \tilde{P}_i$ as a weight, where $P_i$ is the unbiased probability of token $i$ and $\tilde{P}_i$ is the biased probability of token $i$ (importance sampling).
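To make the weight concrete, here is a minimal sketch of the ratio $P_i / \tilde{P}_i$ for a single step. It assumes the bias is a hard mask that sets the logits of disallowed tokens to $-\infty$; the helper names `softmax` and `importance_weight` are hypothetical and not part of this PR.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def importance_weight(logits, allowed, token):
    """Return P_i / P~_i for a token sampled from the biased distribution.

    `logits` are the unbiased next-token logits, `allowed` is the set of
    token ids the structure permits (assumed hard mask), and `token` is
    the id that was actually sampled.
    """
    if token not in allowed:
        raise ValueError("token must be one of the allowed tokens")
    p = softmax(logits)  # unbiased probabilities P
    biased_logits = [
        x if i in allowed else -math.inf for i, x in enumerate(logits)
    ]
    p_tilde = softmax(biased_logits)  # biased probabilities P~
    return p[token] / p_tilde[token]
```

Under the hard-mask assumption, the ratio is the same for every allowed token and equals the total unbiased probability mass of the allowed tokens, so a particle is down-weighted whenever the structure forces it away from what the model would have generated on its own.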
Note: I am wondering if we should correct the Beam Search algorithm as well.