
Is the loss computation in UnigramTrainer correct? #628

Closed
mbollmann opened this issue Feb 18, 2021 · 3 comments

@mbollmann

When computing logsum_alt, the frequency of a removed piece is re-assigned to alternatives:

// After removing the sentencepiece[i], its frequency freq[i] is
// re-assigned to alternatives.
// new_sum = current_sum - freq[i] + freq[i] * alternatives.size()
// = current_sum + freq[i] (alternatives - 1)
const float logsum_alt = std::log(
static_cast<double>(sum + freq[i] * (alternatives.size() - 1)));

But the code uses alternatives.size() which, if I'm not mistaken, is always equal to sentencepieces.size(). Don't we want to multiply by the number of alternatives for this particular sentencepiece, i.e., alternatives[i].size()? @taku910?
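
For what it's worth, here is a small self-contained sketch (not the upstream sentencepiece code; the vectors and numbers are invented purely for illustration) showing how the two expressions diverge: alternatives.size() counts all pieces, while alternatives[i].size() counts only the alternatives of the piece being removed.

#include <cmath>
#include <cstdio>
#include <vector>

// Standalone sketch contrasting the two expressions. "alternatives" mirrors
// the trainer's per-piece lists of alternative segmentations and "freq" the
// per-piece frequencies; the values are made up for illustration only.
int main() {
  std::vector<std::vector<int>> alternatives = {{1, 2}, {3}, {4, 5, 6}};
  std::vector<double> freq = {10.0, 5.0, 2.0};
  const double sum = 17.0;  // total frequency over all pieces
  const int i = 0;          // piece considered for removal

  // Current code: multiplies by the number of pieces (alternatives.size()),
  // which equals sentencepieces.size().
  const double logsum_current =
      std::log(sum + freq[i] * (alternatives.size() - 1));

  // Proposed fix: freq[i] is redistributed over alternatives[i] only.
  const double logsum_fixed =
      std::log(sum + freq[i] * (alternatives[i].size() - 1));

  std::printf("current=%f fixed=%f\n", logsum_current, logsum_fixed);
  return 0;
}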

@dojoteef

@taku910 I ran into the issue when training a Transformer-XL model on various configurations of WikiText-103 using sentencepiece:

  1. My configuration for sentencepiece does not treat whitespace as a special character, so my vocab includes multi-word phrases.
  2. I first detokenize WikiText-103, then train the Unigram tokenizer with sentencepiece over the training split of the dataset, using a 256k vocab size and a max piece length of 32 (see the training sketch after this comment).
  3. After tokenizing the training split, the number of tokens is ~26m (compared to 103m tokens for the closed-vocabulary version of WikiText-103). This is a 4x reduction in tokens. Yet the tokenized version of the validation set only shows a 1.3x reduction.
  4. As a sanity check I re-trained the Unigram tokenizer over both the training and validation splits, but the number of tokens is roughly the same.
  5. A Transformer-XL model trained over this preprocessed dataset severely overfits to the training set and generalizes very poorly (validation ppl ~52.8).
  6. When I make the proposed fix (changing alternatives.size() to alternatives[i].size()) and re-train the tokenizer over just the training split, the discrepancy in token counts between the training and validation splits disappears (both show an approximately 1.3x reduction in number of tokens).
  7. A Transformer-XL model trained over the fixed version of sentencepiece gets a reasonable perplexity (validation ppl 27.9, compared to the 23.9 ppl I get when training on the closed-vocabulary version of WikiText-103).

I looked at the code and @mbollmann seems to be correct that alternatives.size() is always equal to sentencepieces.size().
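
For context, here is a hypothetical sketch of the kind of training call described in the steps above, using the sentencepiece C++ trainer API. The input path and model prefix are placeholders, and --split_by_whitespace=false is an assumption about how "whitespace not treated as a special character" was configured; the thread does not give the exact flags that were used.

#include <sentencepiece_trainer.h>

// Hypothetical training sketch: unigram model, 256k vocab, max piece length
// of 32, pieces allowed to span whitespace. Paths and the whitespace flag are
// assumptions, not the exact configuration used above.
int main() {
  const auto status = sentencepiece::SentencePieceTrainer::Train(
      "--input=wikitext103.detok.train.txt "   // placeholder path
      "--model_prefix=wt103_unigram "          // placeholder prefix
      "--model_type=unigram "
      "--vocab_size=256000 "
      "--max_sentencepiece_length=32 "
      "--split_by_whitespace=false");
  return status.ok() ? 0 : 1;
}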

@taku910 taku910 added the bug label Sep 2, 2022
@taku910
Collaborator

taku910 commented Apr 24, 2023

Sorry for the late response. Yes, the computation was incorrect. It will be fixed in the next release.

@taku910
Collaborator

taku910 commented May 2, 2023

Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99

@taku910 taku910 closed this as completed May 2, 2023