
Tokenize words, rather than wordparts #11

Closed
zachmayer opened this issue May 1, 2017 · 2 comments

First of all, thanks for sharing such a useful tool! I really like this library.

Second, I'm working on a non-translation task where I think I want to be working with words, rather than word parts. Are there any settings I can use in sentencepiece that tend to favor longer word units?

justinchiu commented Apr 9, 2018

I'm working on this in a fork -- is there any reason for the 64-character limit in the BPE trainer? Is it okay to increase it?

taku910 (Collaborator) commented Apr 17, 2018

There are three parameters that control the piece length and shape:

--max_sentencepiece_length (default 16)
--split_by_unicode_script (default true)
--split_by_whitespace (default true)

When the input sentences are already tokenized with spaces, setting --split_by_whitespace=false lets the trainer extract pieces that cross word boundaries.

In addition, I've increased the maximum value of --max_sentencepiece_length from 64 to 512 so we can extract longer pieces.
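
For illustration, here is a minimal training sketch that combines these flags to favor longer, word-like pieces. It assumes the Python sentencepiece bindings; the corpus file corpus.txt, the model prefix words_model, and the vocabulary size are hypothetical placeholders, not values from this thread.

```python
import sentencepiece as spm

# Sketch only: corpus.txt, words_model, and vocab_size are placeholders.
# The flag string mirrors the trainer options discussed above.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt "
    "--model_prefix=words_model "
    "--vocab_size=32000 "             # a larger vocab leaves more room for whole words
    "--max_sentencepiece_length=32 "  # raised from the default of 16
    "--split_by_unicode_script=true "
    "--split_by_whitespace=true"      # false would allow pieces that cross word boundaries
)

# Load the trained model and inspect how a sentence is segmented.
sp = spm.SentencePieceProcessor()
sp.Load("words_model.model")
print(sp.EncodeAsPieces("Tokenize words, rather than wordparts"))
```

Note that with --split_by_whitespace=true a piece never spans multiple words, so raising --max_sentencepiece_length mainly helps long single words; setting it to false is what permits multi-word pieces, as described above.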

Thank you.
