Split longer word, rather than word by word #18

minhtringo141 · 2017-07-14T07:47:37Z

First of all, I greatly appreciate your library; it is very useful and easy to use. However, I have run into a problem. In Vietnamese, a single meaningful unit is sometimes written as more than one whitespace-separated word. For example, in the sentence "I live in Ha Noi", I want "Ha Noi" to stay together after splitting. Is there any way or parameter to handle this case? Best wishes!

taku910 · 2017-07-15T05:30:51Z

Thank you for using sentencepiece.

spm_train --split_by_whitespace=false allows you to extract pieces that cross whitespace. So "Ha Noi" may be extracted as a single piece if it appears frequently in the corpus.

However, in my experience the whitespace constraint is reasonably useful for extracting meaningful pieces.
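
As a rough illustration of the flag above, here is a minimal sketch that assembles such an spm_train invocation from Python. The helper name build_spm_train_command, the file corpus.txt, the model prefix vi_model, and the vocabulary size 8000 are all hypothetical placeholders, not values from this thread. Training is only attempted if the spm_train binary and the corpus file are actually present.

```python
import os
import shutil
import subprocess


def build_spm_train_command(corpus_path, model_prefix, vocab_size):
    """Return an spm_train argument list with the whitespace constraint disabled.

    This helper is hypothetical; only the flags themselves belong to spm_train.
    """
    return [
        "spm_train",
        f"--input={corpus_path}",
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # Allow pieces such as "Ha Noi" to span a whitespace boundary.
        "--split_by_whitespace=false",
    ]


# Placeholder corpus path, model prefix, and vocabulary size (assumptions).
command = build_spm_train_command("corpus.txt", "vi_model", 8000)
print(" ".join(command))

# Run training only when both the binary and the corpus are available.
if shutil.which("spm_train") and os.path.exists("corpus.txt"):
    subprocess.run(command, check=True)
```

Whether "Ha Noi" ends up as one piece still depends on how often it occurs in the training corpus, so it is worth inspecting the resulting vocabulary file after training.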

minhtringo141 · 2017-07-15T05:33:58Z

It works for me, thank you so much! Best wishes!

taku910 closed this as completed Jul 19, 2017

taku910 reopened this Jul 19, 2017

taku910 closed this as completed Jul 19, 2017

taku910 mentioned this issue Jun 18, 2020

sentencepiece==0.1.92 seems breaking something #505

Closed
