
Is the loss computation in UnigramTrainer correct? #628

Closed
mbollmann opened this issue Feb 18, 2021 · 3 comments

@mbollmann

When computing logsum_alt, the frequency of a removed piece is re-assigned to alternatives:

// After removing the sentencepiece[i], its frequency freq[i] is
// re-assigned to alternatives.
// new_sum = current_sum - freq[i] + freq[i] * alternatives.size()
// = current_sum + freq[i] (alternatives - 1)
const float logsum_alt = std::log(
static_cast<double>(sum + freq[i] * (alternatives.size() - 1)));

But the code uses alternatives.size() which, if I'm not mistaken, is always equal to sentencepieces.size(). Don't we want to multiply by the number of alternatives for this particular sentencepiece, i.e., alternatives[i].size()? @taku910?
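
For what it's worth, here is a small self-contained sketch (not the upstream sentencepiece code; the vectors and numbers are invented purely for illustration) showing how the two expressions diverge: alternatives.size() counts all pieces, while alternatives[i].size() counts only the alternatives of the piece being removed.

#include <cmath>
#include <cstdio>
#include <vector>

// Standalone sketch contrasting the two expressions. "alternatives" mirrors
// the trainer's per-piece lists of alternative segmentations and "freq" the
// per-piece frequencies; the values are made up for illustration only.
int main() {
  std::vector<std::vector<int>> alternatives = {{1, 2}, {3}, {4, 5, 6}};
  std::vector<double> freq = {10.0, 5.0, 2.0};
  const double sum = 17.0;  // total frequency over all pieces
  const int i = 0;          // piece considered for removal

  // Current code: multiplies by the number of pieces (alternatives.size()),
  // which equals sentencepieces.size().
  const double logsum_current =
      std::log(sum + freq[i] * (alternatives.size() - 1));

  // Proposed fix: freq[i] is redistributed over alternatives[i] only.
  const double logsum_fixed =
      std::log(sum + freq[i] * (alternatives[i].size() - 1));

  std::printf("current=%f fixed=%f\n", logsum_current, logsum_fixed);
  return 0;
}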

@dojoteef

@taku910 I ran into the issue when training a Transformer-XL model on various configurations of WikiText-103 using sentencepiece:

  1. My configuration for sentencepiece does not treat whitespace as a special character, so my vocab includes multi-word phrases.
  2. I first detokenize WikiText-103, then train the Unigram tokenizer with sentencepiece over the training split of the dataset, using a 256k vocab size and a max piece length of 32 (see the training sketch after this comment).
  3. After tokenizing the training split, the number of tokens is ~26m (compared to 103m tokens for the closed-vocabulary version of WikiText-103). This is a 4x reduction in tokens. Yet the tokenized version of the validation set only shows a 1.3x reduction.
  4. As a sanity check I re-trained the Unigram tokenizer over both the training and validation splits, but the number of tokens is roughly the same.
  5. A Transformer-XL model trained over this preprocessed dataset severely overfits to the training set and generalizes very poorly (validation ppl ~52.8).
  6. When I make the proposed fix (changing alternatives.size() to alternatives[i].size()) and re-train the tokenizer over just the training split, the discrepancy in token counts between the training and validation splits disappears (both show an approximately 1.3x reduction in number of tokens).
  7. A Transformer-XL model trained over the fixed version of sentencepiece gets a reasonable perplexity (validation ppl 27.9, compared to the 23.9 ppl I get when training on the closed-vocabulary version of WikiText-103).

I looked at the code and @mbollmann seems to be correct that alternatives.size() is always equal to sentencepieces.size().
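
For context, here is a hypothetical sketch of the kind of training call described in the steps above, using the sentencepiece C++ trainer API. The input path and model prefix are placeholders, and --split_by_whitespace=false is an assumption about how "whitespace not treated as a special character" was configured; the thread does not give the exact flags that were used.

#include <sentencepiece_trainer.h>

// Hypothetical training sketch: unigram model, 256k vocab, max piece length
// of 32, pieces allowed to span whitespace. Paths and the whitespace flag are
// assumptions, not the exact configuration used above.
int main() {
  const auto status = sentencepiece::SentencePieceTrainer::Train(
      "--input=wikitext103.detok.train.txt "   // placeholder path
      "--model_prefix=wt103_unigram "          // placeholder prefix
      "--model_type=unigram "
      "--vocab_size=256000 "
      "--max_sentencepiece_length=32 "
      "--split_by_whitespace=false");
  return status.ok() ? 0 : 1;
}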

@taku910 taku910 added the bug label Sep 2, 2022
@taku910
Collaborator

taku910 commented Apr 24, 2023

Sorry for the late response. Yes, the computation was incorrect. It will be fixed in the next release.

@taku910
Collaborator

taku910 commented May 2, 2023

Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99

@taku910 taku910 closed this as completed May 2, 2023