Ancestral sampling consists of sampling a token at each step from the model's probability distribution over tokens, conditioned on the previously generated tokens.
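As a minimal sketch of this procedure (a hypothetical `next_token_logits(prefix)` function and a toy five-token vocabulary stand in for a real language model):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()        # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ancestral_sample(next_token_logits, eos_id, max_len=20):
    """Sample one token per step from p(token | previously generated tokens)."""
    tokens = []
    for _ in range(max_len):
        probs = softmax(next_token_logits(tokens))
        token = int(rng.choice(len(probs), p=probs))   # draw from the full distribution
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

# Toy stand-in for a language model over a five-token vocabulary (token 4 plays the role of EOS).
toy_logits = lambda prefix: np.array([1.0, 0.5, 0.2, 0.1, 0.3])
print(ancestral_sample(toy_logits, eos_id=4))
```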
Greedy decoding consists of taking the token with the greatest probability at each step according to the model's probability distribution over tokens, conditioned on the previously generated tokens. Greedy decoding is an approximation to MAP decoding, where the goal is to find the maximum-probability generation from the language model, i.e., the mode of the distribution.
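A corresponding greedy-decoding sketch under the same toy-model assumptions:

```python
import numpy as np

def greedy_decode(next_token_logits, eos_id, max_len=20):
    """At each step, take the most probable token (argmax over logits = argmax over probabilities)."""
    tokens = []
    for _ in range(max_len):
        token = int(np.argmax(next_token_logits(tokens)))
        tokens.append(token)
        if token == eos_id:
            break
    return tokens

# With the toy model from above, greedy decoding repeats the argmax token until max_len,
# since the toy logits never make the EOS token the most probable one.
toy_logits = lambda prefix: np.array([1.0, 0.5, 0.2, 0.1, 0.3])
print(greedy_decode(toy_logits, eos_id=4))
```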
Beam search maintains a list, called the beam, of the $b$ sequences with the greatest sequence probability (the product of the token probabilities); the hyperparameter $b$ is called the beam width. At each step, for each sequence in the beam, beam search creates one new sequence by appending the token with the greatest probability conditioned on that sequence, another by appending the token with the next greatest probability, and so on, producing $b$ new sequences per beam sequence. It then replaces the beam with the $b$ most probable of the $b^2$ new sequences. Like greedy decoding, beam search is an approximation to MAP decoding: it tries to find the mode of the distribution.
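A minimal beam search sketch under the same toy-model assumptions, ignoring refinements such as length normalization:

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def beam_search(next_token_logits, eos_id, beam_width=3, max_len=20):
    """Keep the beam_width sequences with the greatest cumulative log-probability."""
    beam = [([], 0.0)]                                   # (token sequence, log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beam:
            logp = log_softmax(next_token_logits(tokens))
            # Expand each beam sequence with its beam_width most probable next tokens.
            for token in np.argsort(logp)[-beam_width:]:
                candidates.append((tokens + [int(token)], score + logp[token]))
        # Keep the beam_width best of the (up to) beam_width**2 candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos_id else beam).append((tokens, score))
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[1])[0]

toy_logits = lambda prefix: np.array([1.0, 0.5, 0.2, 0.1, 0.3])
print(beam_search(toy_logits, eos_id=4))
```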
Temperature sampling is like ancestral sampling, except that at each step it divides the model's logits by a positive scalar temperature parameter, passes the scaled logits through the softmax, and samples from the resulting distribution.
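A per-step sketch, assuming we already have the model's logits for the current step:

```python
import numpy as np

rng = np.random.default_rng(0)

def temperature_sample(logits, temperature):
    """Divide the logits by the temperature, apply softmax, and sample."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                             # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(temperature_sample(logits, temperature=0.7))     # T < 1 sharpens, T > 1 flattens
```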
Top-k sampling is like ancestral sampling, except that at each step it samples from the renormalized probability distribution over the $k$ tokens with the greatest probability.
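A per-step top-k sketch under the same assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(logits, k):
    """Sample from the renormalized distribution over the k most probable tokens."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                  # indices of the k largest logits
    z = logits[top] - logits[top].max()
    probs = np.exp(z) / np.exp(z).sum()            # renormalize over the top-k tokens only
    return int(rng.choice(top, p=probs))

print(top_k_sample(np.array([2.0, 1.0, 0.5, -1.0]), k=2))
```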
Nucleus sampling (or top-p sampling) is like ancestral sampling, except that at each step it excludes the tokens outside the smallest set of tokens whose cumulative probability is at least $p$ and then samples from the renormalized probability distribution over the remaining tokens.
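A per-step nucleus sampling sketch under the same assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def top_p_sample(logits, p):
    """Sample from the smallest set of tokens whose cumulative probability is at least p."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(probs)[::-1]                                  # tokens by descending probability
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1    # just enough tokens to reach p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()            # renormalize over the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

print(top_p_sample(np.array([2.0, 1.0, 0.5, -1.0]), p=0.9))
```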
Suppose we have a utility function $u(y, h)$ where $h$ is text generated from the language model conditioned on the prompt $x$ and $y$ is the reference completion for the prompt $x$. We would like to find the $h^{*}$ in $H$ that maximizes $E[u(y, h)]$, where the expectation is over $p(y \mid x)$. The first problem is that $H$ is very large, so we approximate it by a set $S_1$ of samples from the language model. Then, we just have to select the $h^{*}$ in $S_1$ that maximizes $E[u(y, h)]$. The second problem is that we don't know $p(y \mid x)$. We can replace it with the language model $p_{\theta}(y \mid x)$, but the expectation over $p_{\theta}(y \mid x)$ is still intractable, so we approximate it by taking a set $S_2$ of samples from the language model and computing the average of $u(y, h)$ over all $y$ in $S_2$. Putting this together, we construct two sets of samples from the language model ($S_1$ and $S_2$) and pick the $h^{*} \in S_1$ that maximizes the average of $u(y, h)$ over all $y \in S_2$. Note that $S_1$ and $S_2$ can be the same set of samples. This procedure is Minimum Bayes Risk (MBR) decoding.

For example, EA20 train a translation model, sample from it, and pick the translation that maximizes the average METEOR score between that translation and each other translation in the sample. They argue that MBR decoding is more representative of the model's probability distribution. They also show that increasing the number of samples monotonically improves performance, which is not the case with beam width (the so-called "beam search curse"). Unfortunately, MBR decoding is computationally expensive. EA21 suggest some heuristic strategies for reducing the computational cost. An alternative way to view MBR decoding is as medoid estimation, and there is a literature on fast medoid estimation (e.g., BT19). However, I don't think this literature provides much improvement over the heuristic strategies used in EA21. MS21 find that MBR decoding is more robust than beam search in some scenarios (note that the mode is also robust compared to the mean). FGT+21 use neural metrics as the utility function in MBR decoding for translation.
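A minimal MBR sketch, using a single set of samples as both $S_1$ and $S_2$ and a toy word-overlap utility in place of METEOR or a neural metric:

```python
import numpy as np

def mbr_decode(candidates, utility):
    """Pick the candidate with the greatest average utility against the other samples."""
    scores = []
    for h in candidates:                           # h ranges over S_1
        # Monte Carlo estimate of E[u(y, h)], using the other samples as pseudo-references y (S_2).
        others = [y for y in candidates if y is not h]
        scores.append(np.mean([utility(y, h) for y in others]))
    return candidates[int(np.argmax(scores))]

# Toy utility: word overlap (Jaccard); a real system would use METEOR or a neural metric.
def overlap(y, h):
    y_words, h_words = set(y.split()), set(h.split())
    return len(y_words & h_words) / max(len(y_words | h_words), 1)

samples = ["the cat sat", "a cat sat down", "the dog ran", "the cat sat down"]
print(mbr_decode(samples, overlap))                # picks the most "central" sample
```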
A common trade-off in decoding is the diversity-quality trade-off (ZDI+21). Other decoding strategies that try to maintain the quality of beam search while improving diversity are Diverse Beam Search (VCS+18), Stochastic Beam Search (KHW19), and Conditional Poisson Stochastic Beam Search (MAV+21). The papers for both of the stochastic methods also discuss importance sampling methods for more efficiently estimating various distributional quantities. An even simpler way to estimate such quantities efficiently is to use a Sum-and-Sample (SAS) estimator (see Equation 13 in KHW20). Using the SAS estimator, you can estimate distributional quantities from a combination of ancestral samples and beam search candidates, reducing the variance of the estimate.
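As a rough illustration of the sum-and-sample idea (a toy categorical distribution stands in for the sequence distribution, and this is not a reproduction of Equation 13 in KHW20): the contribution of a small set of high-probability outcomes is summed exactly, and the remaining probability mass is covered by a single reweighted sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy categorical distribution standing in for p_theta(y | x), and an arbitrary
# function f whose expectation E_p[f(y)] we want to estimate.
p = np.array([0.4, 0.3, 0.2, 0.07, 0.03])
f = np.array([1.0, 5.0, 2.0, 10.0, 3.0])

# "Beam" C: high-probability outcomes whose contribution we can sum exactly.
C = np.array([0, 1])
exact_part = (p[C] * f[C]).sum()

# Sampled part: one draw from p restricted to the complement of C, weighted by the leftover mass.
rest = np.setdiff1d(np.arange(len(p)), C)
y = rng.choice(rest, p=p[rest] / p[rest].sum())
estimate = exact_part + (1.0 - p[C].sum()) * f[y]

print(estimate, "vs exact", (p * f).sum())
```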
There's also work on constructing lattices that encode a large number of hypotheses instead of just trying to find the mode or the medoid of the distribution (e.g., XJD22).