
Tokenize words, rather than wordparts #11

Closed
zachmayer opened this issue May 1, 2017 · 2 comments

First of all, thanks for sharing such a useful tool! I really like this library.

Second, I'm working on a non-translation task where I think I want to be working with words, rather than word parts. Are there any settings I can use in sentencepiece that tend to favor longer word units?

justinchiu commented Apr 9, 2018

I'm working on this in a fork -- is there any reason for the 64-character limit in the BPE trainer? Is it okay to increase it?

taku910 (Collaborator) commented Apr 17, 2018

There are three parameters that control the piece length and shape:

--max_sentencepiece_length (default 16)
--split_by_unicode_script (default true)
--split_by_whitespace (default true)

When the input sentences are already tokenized with spaces, setting --split_by_whitespace=false lets the trainer extract pieces that cross word boundaries.

In addition, I've increased the maximum value of --max_sentencepiece_length from 64 to 512 so we can extract longer pieces.
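
For illustration, here is a minimal training sketch that combines these flags to favor longer, word-like pieces. It assumes the Python sentencepiece bindings; the corpus file corpus.txt, the model prefix words_model, and the vocabulary size are hypothetical placeholders, not values from this thread.

```python
import sentencepiece as spm

# Sketch only: corpus.txt, words_model, and vocab_size are placeholders.
# The flag string mirrors the trainer options discussed above.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "
    "--model_prefix=words_model "
    "--vocab_size=32000 "             # a larger vocab leaves more room for whole words
    "--max_sentencepiece_length=32 "  # raised from the default of 16
    "--split_by_unicode_script=true "
    "--split_by_whitespace=true"      # false would allow pieces that cross word boundaries
)

# Load the trained model and inspect how a sentence is segmented.
sp = spm.SentencePieceProcessor()
sp.Load("words_model.model")
print(sp.EncodeAsPieces("Tokenize words, rather than wordparts"))
```

Note that with --split_by_whitespace=true a piece never spans multiple words, so raising --max_sentencepiece_length mainly helps long single words; setting it to false is what permits multi-word pieces, as described above.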

Thank you.
