Would plan to support BBPE #621

Open
MrRace opened this issue Feb 2, 2021 · 2 comments
Labels
feature request Add new feature

Comments

MrRace commented Feb 2, 2021

Hi, sentencepiece currently supports BPE, unigram, char, and word models. Do you plan to support byte-level BPE (BBPE)? Thanks a lot!

Collaborator

taku910 commented Feb 11, 2021

We do not have a plan, but sentencepiece supports a byte-fallback feature (--byte_fallback=true at training time), where unknown (UNK) characters are split into UTF-8 bytes. I guess this gives almost the same effect.
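
For readers unfamiliar with byte fallback, here is a minimal sketch of the idea in plain Python. It is a toy illustration only, not sentencepiece's actual implementation: the function names, the character-level vocabulary, and the example string are all assumptions made for this example. The `<0xNN>` piece format is intended to mirror how byte pieces are usually written.

```python
# Toy illustration of byte fallback; NOT sentencepiece's implementation.
# Function names and the vocabulary below are hypothetical.

def encode_with_byte_fallback(text, vocab):
    """Keep characters found in `vocab`; split any other character
    into one <0xNN> piece per UTF-8 byte."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces):
    """Invert the encoding by turning byte pieces back into raw bytes."""
    out = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            out.append(int(p[3:5], 16))      # byte piece -> single byte
        else:
            out.extend(p.encode("utf-8"))    # ordinary in-vocabulary piece
    return out.decode("utf-8")


vocab = set("helo ")  # assumed character-level vocabulary
pieces = encode_with_byte_fallback("hello é", vocab)
print(pieces)
print(decode_with_byte_fallback(pieces))
```

Because every out-of-vocabulary character can still be written as bytes, nothing maps to UNK, which is the main property people usually want from byte-level BPE. The difference is that BBPE learns merges over bytes for all text, while byte fallback only uses bytes for characters missing from the vocabulary.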

taku910 added the feature request label Feb 11, 2021

asigalov61 commented

@MrRace There is a nice BBPE implementation available in fairseq (from Facebook, Google's competitor). Here is the link: https://github.com/asigalov61/fairseq/tree/master/examples/byte_level_bpe

Also, check out Texar. AFAIK it has similar functionality: https://github.com/asyml/texar
