```
root@7fbdb34e97e0:/opt/sentencepiece# spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000
/opt/sentencepiece/src/trainer_interface.cc(225) LOG(INFO) Loaded 87599 sentences
/opt/sentencepiece/src/trainer_interface.cc(226) LOG(INFO) Loaded 0 test sentences
/opt/sentencepiece/src/trainer_interface.cc(250) LOG(INFO) all chars count=1455858
/opt/sentencepiece/src/trainer_interface.cc(258) LOG(INFO) Done: 99.9501% characters are covered.
/opt/sentencepiece/src/trainer_interface.cc(267) LOG(INFO) Alphabet size=482
/opt/sentencepiece/src/trainer_interface.cc(268) LOG(INFO) Final character coverage=0.999501
/opt/sentencepiece/src/trainer_interface.cc(300) LOG(INFO) Done! 87599 sentences are loaded
/opt/sentencepiece/src/unigram_model_trainer.cc(127) LOG(INFO) Using 87599 sentences for making seed sentencepieces
/opt/sentencepiece/src/unigram_model_trainer.cc(155) LOG(INFO) Making suffix array...
/opt/sentencepiece/src/unigram_model_trainer.cc(159) LOG(INFO) Extracting frequent sub strings...
/opt/sentencepiece/src/unigram_model_trainer.cc(210) LOG(INFO) Initialized 5129 seed sentencepieces
/opt/sentencepiece/src/trainer_interface.cc(306) LOG(INFO) Tokenizing input sentences with whitespace: 87599
/opt/sentencepiece/src/trainer_interface.cc(315) LOG(INFO) Done! 12256
/opt/sentencepiece/src/unigram_model_trainer.cc(502) LOG(INFO) Using 12256 sentences for EM training
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=0 size=2619 obj=27.6329 num_tokens=69501 num_tokens/piece=26.5372
/opt/sentencepiece/src/unigram_model_trainer.cc(518) LOG(INFO) EM sub_iter=1 size=2035 obj=19.1728 num_tokens=70687 num_tokens/piece=34.7356
/opt/sentencepiece/src/trainer_interface.cc(371) LOG(INFO) Saving model: mx.model
/opt/sentencepiece/src/spm_train_main.cc(159) [_status.ok()] Internal: /opt/sentencepiece/src/trainer_interface.cc(362) [(trainer_spec_.vocab_size()) == (model_proto->pieces_size())]
Program terminated with an unrecoverable error.
```
This happens when the training data is too small, so the maximum number of pieces that can be reserved is less than 4000.
You might want to decrease --vocab_size, or set --hard_vocab_limit=false, which automatically shrinks the vocabulary size to fit the data.
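Based on the suggestion above, the failing command can be retried with the hard limit relaxed (the input path and model prefix here are simply the ones from the original report):

```shell
spm_train --input=data/his.txt --model_prefix=mx --vocab_size=4000 --hard_vocab_limit=false
```

With --hard_vocab_limit=false, training saves whatever vocabulary size the data can actually support instead of aborting when the 4000-piece target cannot be reached.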
--hard_vocab_limit=false works
Allow BPE vocabulary to be smaller than the max size. (commit c5592fa)
See google/sentencepiece#226 for more info.