
How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"? #884

Closed
lsy641 opened this issue Jun 19, 2023 · 1 comment


lsy641 commented Jun 19, 2023

How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"?
Does the sentence sampling score rely on the token scores recorded in the vocabulary file?
If a token's score is its log probability under the unigram model, how does the model compute the sampling score of a whole sentence?
Also, I saw in the original paper that tokens are sorted by the loss in likelihood incurred when the token is removed from the corpus. I thought this loss was a different score. Where can I see it?

Collaborator

taku910 commented Jul 8, 2023

The sampling score relies on the score (log probability) stored in the vocab file.

Given one possible segmentation W = w1, w2, ..., wn, the generation probability of W is computed as P(W) = exp(\sum_k logprob(w_k)). We sample the sequence W in proportion to the probability P(W).
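To make this concrete, here is a minimal sketch (not the SentencePiece implementation; the toy vocab and its logprobs are made up for illustration) that enumerates every segmentation of a string, scores each as the sum of token logprobs, and samples one in proportion to P(W):

```python
import math
import random

# Hypothetical toy unigram vocab: token -> logprob (as a vocab file would store it)
vocab = {"he": -2.0, "ll": -2.5, "o": -1.5, "hello": -4.5, "hell": -3.5}

def segmentations(s):
    """Enumerate every segmentation of s into vocab tokens."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        tok = s[:i]
        if tok in vocab:
            for rest in segmentations(s[i:]):
                yield [tok] + rest

def score(tokens):
    # log P(W) = sum_k logprob(w_k)
    return sum(vocab[t] for t in tokens)

segs = list(segmentations("hello"))
logps = [score(w) for w in segs]

# Normalize via log-sum-exp, then sample W with probability proportional to P(W)
m = max(logps)
probs = [math.exp(lp - m) for lp in logps]
total = sum(probs)
probs = [p / total for p in probs]
sampled = random.choices(segs, weights=probs, k=1)[0]
```

Brute-force enumeration is exponential in the sentence length; it is only meant to show what "sampling in proportion to P(W)" means. The forward-filtering-and-backward-sampling algorithm mentioned below achieves the same distribution efficiently over the segmentation lattice.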

There are several sampling modes (e.g., nbest-sampling, include-best, sampling without replacement), but we use the forward-filtering-and-backward-sampling algorithm as the basic algorithm.

This article is useful for understanding the FFBS algorithm:
https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15
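A compact sketch of FFBS over the segmentation lattice, under the same assumptions as above (hypothetical toy vocab; logprobs stand in for the vocab-file scores). The forward pass accumulates, for each prefix, the log of the total probability of all segmentations of that prefix; the backward pass then walks from the end of the string, sampling each final token with probability proportional to exp(alpha[start] + logprob(token)):

```python
import math
import random

# Hypothetical toy unigram vocab: token -> logprob
vocab = {"he": -2.0, "ll": -2.5, "o": -1.5, "hello": -4.5, "hell": -3.5}
MAX_LEN = max(len(t) for t in vocab)

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ffbs_sample(s, rng=random):
    n = len(s)
    # Forward filtering: alpha[i] = log total probability of all
    # segmentations of the prefix s[:i]
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        cands = [alpha[j] + vocab[s[j:i]]
                 for j in range(max(0, i - MAX_LEN), i) if s[j:i] in vocab]
        if cands:
            alpha[i] = logsumexp(cands)
    # Backward sampling: choose each last token with probability
    # proportional to exp(alpha[start] + logprob(token))
    tokens, i = [], n
    while i > 0:
        cands = [(j, s[j:i]) for j in range(max(0, i - MAX_LEN), i)
                 if s[j:i] in vocab]
        weights = [math.exp(alpha[j] + vocab[t] - alpha[i]) for j, t in cands]
        j, tok = rng.choices(cands, weights=weights)[0]
        tokens.append(tok)
        i = j
    return tokens[::-1]
```

Each call returns one segmentation drawn exactly from the P(W) distribution, in time linear in the sentence length (times the maximum token length), which is what makes subword regularization practical at training time.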

By the way, the part about a "token being removed from the corpus" is part of the algorithm used to train the sentencepiece model. We don't use it at inference time.

@taku910 taku910 closed this as completed Jul 8, 2023