could not save model against my own txt #226

Closed
wanghaisheng opened this issue Oct 28, 2018 · 2 comments
@wanghaisheng commented Oct 28, 2018

root@7fbdb34e97e0:/opt/sentencepiece# spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000

/opt/sentencepiece/src/trainer_interface.cc(225) LOG(INFO) Loaded 87599 sentences
/opt/sentencepiece/src/trainer_interface.cc(226) LOG(INFO) Loaded 0 test sentences
/opt/sentencepiece/src/trainer_interface.cc(250) LOG(INFO) all chars count=1455858
/opt/sentencepiece/src/trainer_interface.cc(258) LOG(INFO) Done: 99.9501% characters are covered.
/opt/sentencepiece/src/trainer_interface.cc(267) LOG(INFO) Alphabet size=482
/opt/sentencepiece/src/trainer_interface.cc(268) LOG(INFO) Final character coverage=0.999501
/opt/sentencepiece/src/trainer_interface.cc(300) LOG(INFO) Done! 87599 sentences are loaded
/opt/sentencepiece/src/unigram_model_trainer.cc(127) LOG(INFO) Using 87599 sentences for making seed sentencepieces
/opt/sentencepiece/src/unigram_model_trainer.cc(155) LOG(INFO) Making suffix array...
/opt/sentencepiece/src/unigram_model_trainer.cc(159) LOG(INFO) Extracting frequent sub strings...
/opt/sentencepiece/src/unigram_model_trainer.cc(210) LOG(INFO) Initialized 5129 seed sentencepieces
/opt/sentencepiece/src/trainer_interface.cc(306) LOG(INFO) Tokenizing input sentences with whitespace: 87599
/opt/sentencepiece/src/trainer_interface.cc(315) LOG(INFO) Done! 12256
/opt/sentencepiece/src/unigram_model_trainer.cc(502) LOG(INFO) Using 12256 sentences for EM training
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=0 size=2619 obj=27.6329 num_tokens=69501 num_tokens/piece=26.5372
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=1 size=2035 obj=19.1728 num_tokens=70687 num_tokens/piece=34.7356
/opt/sentencepiece/src/trainer_interface.cc(371) LOG(INFO) Saving model: mx.model
/opt/sentencepiece/src/spm_train_main.cc(159) [_status.ok()] Internal: /opt/sentencepiece/src/trainer_interface.cc(362) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())] 
Program terminated with an unrecoverable error.

@taku910 (Collaborator) commented Oct 29, 2018

This happens when the training data is too small, so the maximum number of pieces that can be reserved is less than the requested 4000.

You might want to decrease --vocab_size, or set --hard_vocab_limit=false, which automatically shrinks the vocab size to fit the data.
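
As a concrete sketch (reusing data/his.txt and the mx prefix from the report above; the exact achievable vocabulary size is an assumption, since it depends on the corpus), either invocation below should get past the check:

# Option 1: keep the requested size but let the trainer shrink it automatically
spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000 --hard_vocab_limit=false

# Option 2: request a vocabulary the corpus can actually support
# (2000 is a hypothetical value below the ~2035 pieces the EM log above reached)
spm_train --input=data/his.txt --model_prefix=mx --vocab_size=2000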

@wanghaisheng (Author)

--hard_vocab_limit=false works
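
For reference, spm_train writes both mx.model and mx.vocab for --model_prefix=mx, and the .vocab file lists one piece (with its score) per line, so a quick sanity check of the vocabulary size the trainer actually settled on is:

wc -l mx.vocab   # number of lines = final vocab size

With --hard_vocab_limit=false this count may come out smaller than the requested --vocab_size.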
