could not save model against my own txt #226

Closed
wanghaisheng opened this issue Oct 28, 2018 · 2 comments
@wanghaisheng commented Oct 28, 2018

root@7fbdb34e97e0:/opt/sentencepiece# spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000

/opt/sentencepiece/src/trainer_interface.cc(225) LOG(INFO) Loaded 87599 sentences
/opt/sentencepiece/src/trainer_interface.cc(226) LOG(INFO) Loaded 0 test sentences
/opt/sentencepiece/src/trainer_interface.cc(250) LOG(INFO) all chars count=1455858
/opt/sentencepiece/src/trainer_interface.cc(258) LOG(INFO) Done: 99.9501% characters are covered.
/opt/sentencepiece/src/trainer_interface.cc(267) LOG(INFO) Alphabet size=482
/opt/sentencepiece/src/trainer_interface.cc(268) LOG(INFO) Final character coverage=0.999501
/opt/sentencepiece/src/trainer_interface.cc(300) LOG(INFO) Done! 87599 sentences are loaded
/opt/sentencepiece/src/unigram_model_trainer.cc(127) LOG(INFO) Using 87599 sentences for making seed sentencepieces
/opt/sentencepiece/src/unigram_model_trainer.cc(155) LOG(INFO) Making suffix array...
/opt/sentencepiece/src/unigram_model_trainer.cc(159) LOG(INFO) Extracting frequent sub strings...
/opt/sentencepiece/src/unigram_model_trainer.cc(210) LOG(INFO) Initialized 5129 seed sentencepieces
/opt/sentencepiece/src/trainer_interface.cc(306) LOG(INFO) Tokenizing input sentences with whitespace: 87599
/opt/sentencepiece/src/trainer_interface.cc(315) LOG(INFO) Done! 12256
/opt/sentencepiece/src/unigram_model_trainer.cc(502) LOG(INFO) Using 12256 sentences for EM training
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=0 size=2619 obj=27.6329 num_tokens=69501 num_tokens/piece=26.5372
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=1 size=2035 obj=19.1728 num_tokens=70687 num_tokens/piece=34.7356
/opt/sentencepiece/src/trainer_interface.cc(371) LOG(INFO) Saving model: mx.model
/opt/sentencepiece/src/spm_train_main.cc(159) [_status.ok()] Internal: /opt/sentencepiece/src/trainer_interface.cc(362) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 
Program terminated with an unrecoverable error.

@taku910 (Collaborator) commented Oct 29, 2018

This happens when the training data is too small, so the maximum number of pieces that can be reserved is less than the requested 4000.

You might want to decrease --vocab_size, or set --hard_vocab_limit=false, which automatically shrinks the vocab size to fit the data.
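
As a concrete sketch (reusing data/his.txt and the mx prefix from the report above; the exact achievable vocabulary size is an assumption, since it depends on the corpus), either invocation below should get past the check:

# Option 1: keep the requested size but let the trainer shrink it automatically
spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000 --hard_vocab_limit=false

# Option 2: request a vocabulary the corpus can actually support
# (2000 is a hypothetical value below the ~2035 pieces the EM log above reached)
spm_train --input=data/his.txt --model_prefix=mx --vocab_size=2000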

@wanghaisheng (Author)

--hard_vocab_limit=false works
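
For reference, spm_train writes both mx.model and mx.vocab for --model_prefix=mx, and the .vocab file lists one piece (with its score) per line, so a quick sanity check of the vocabulary size the trainer actually settled on is:

wc -l mx.vocab   # number of lines = final vocab size

With --hard_vocab_limit=false this count may come out smaller than the requested --vocab_size.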
