
How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"? #884

Closed
lsy641 opened this issue Jun 19, 2023 · 1 comment


lsy641 commented Jun 19, 2023

How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"?
Does the sentence sampling score rely on the token scores recorded in the vocabulary file?
If a token's score is its log probability under the unigram model, how does the model compute the sampling score of a whole sentence?
Also, I saw in the original paper that tokens are sorted by the loss in likelihood incurred when the token is removed from the corpus. I thought this loss was a different score. Where can I see it?

Collaborator

taku910 commented Jul 8, 2023

The sampling score relies on the score (log probability) stored in the vocab file.

Given one possible segmentation W = w1, w2, ..., wn, the generation probability of W is computed as P(W) = exp(\sum_k logprob(w_k)). We sample the sequence W in proportion to the probability P(W).
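To make this concrete, here is a minimal sketch (not the SentencePiece implementation; the toy vocab and its logprobs are made up for illustration) that enumerates every segmentation of a string, scores each as the sum of token logprobs, and samples one in proportion to P(W):

```python
import math
import random

# Hypothetical toy unigram vocab: token -> logprob (as a vocab file would store it)
vocab = {"he": -2.0, "ll": -2.5, "o": -1.5, "hello": -4.5, "hell": -3.5}

def segmentations(s):
    """Enumerate every segmentation of s into vocab tokens."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        tok = s[:i]
        if tok in vocab:
            for rest in segmentations(s[i:]):
                yield [tok] + rest

def score(tokens):
    # log P(W) = sum_k logprob(w_k)
    return sum(vocab[t] for t in tokens)

segs = list(segmentations("hello"))
logps = [score(w) for w in segs]

# Normalize via log-sum-exp, then sample W with probability proportional to P(W)
m = max(logps)
probs = [math.exp(lp - m) for lp in logps]
total = sum(probs)
probs = [p / total for p in probs]
sampled = random.choices(segs, weights=probs, k=1)[0]
```

Brute-force enumeration is exponential in the sentence length; it is only meant to show what "sampling in proportion to P(W)" means. The forward-filtering-and-backward-sampling algorithm mentioned below achieves the same distribution efficiently over the segmentation lattice.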

There are several sampling modes (e.g., nbest-sampling, include-best, sampling without replacement), but we use the forward-filtering-and-backward-sampling algorithm as the basic algorithm.

This article is useful for understanding the FFBS algorithm:
https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
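A compact sketch of FFBS over the segmentation lattice, under the same assumptions as above (hypothetical toy vocab; logprobs stand in for the vocab-file scores). The forward pass accumulates, for each prefix, the log of the total probability of all segmentations of that prefix; the backward pass then walks from the end of the string, sampling each final token with probability proportional to exp(alpha[start] + logprob(token)):

```python
import math
import random

# Hypothetical toy unigram vocab: token -> logprob
vocab = {"he": -2.0, "ll": -2.5, "o": -1.5, "hello": -4.5, "hell": -3.5}
MAX_LEN = max(len(t) for t in vocab)

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ffbs_sample(s, rng=random):
    n = len(s)
    # Forward filtering: alpha[i] = log total probability of all
    # segmentations of the prefix s[:i]
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        cands = [alpha[j] + vocab[s[j:i]]
                 for j in range(max(0, i - MAX_LEN), i) if s[j:i] in vocab]
        if cands:
            alpha[i] = logsumexp(cands)
    # Backward sampling: choose each last token with probability
    # proportional to exp(alpha[start] + logprob(token))
    tokens, i = [], n
    while i > 0:
        cands = [(j, s[j:i]) for j in range(max(0, i - MAX_LEN), i)
                 if s[j:i] in vocab]
        weights = [math.exp(alpha[j] + vocab[t] - alpha[i]) for j, t in cands]
        j, tok = rng.choices(cands, weights=weights)[0]
        tokens.append(tok)
        i = j
    return tokens[::-1]
```

Each call returns one segmentation drawn exactly from the P(W) distribution, in time linear in the sentence length (times the maximum token length), which is what makes subword regularization practical at training time.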

By the way, the part about a "token being removed from the corpus" is part of the algorithm used to train the sentencepiece model. We don't use it at inference time.

@taku910 taku910 closed this as completed Jul 8, 2023