Monolingual data has a word splitter that won't work for CJK #424

Open
Tracked by #425
gregtatum opened this issue Feb 6, 2024 · 0 comments
Labels
language-coverage Issues related to covering specific languages

Comments

@gregtatum
Member

Right now the word splitter splits on word boundaries and limits the monolingual data to fewer than 100 "words". This needs to change to support a different segmentation strategy for CJK languages, maybe just a byte limit.
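
To illustrate the problem, here is a minimal sketch, not the project's actual code. It assumes the current check splits on whitespace, and the names `within_word_limit`, `within_byte_limit`, and the `max_bytes=500` default are hypothetical. Chinese and Japanese text is typically written without spaces, so a long line can count as a single "word" and slip past a word limit, while a UTF-8 byte limit still catches it.

```python
def count_whitespace_words(line: str) -> int:
    """Count tokens separated by whitespace (assumed current behaviour)."""
    return len(line.split())


def within_word_limit(line: str, max_words: int = 100) -> bool:
    """Keep lines with fewer than `max_words` whitespace-separated tokens."""
    return count_whitespace_words(line) < max_words


def within_byte_limit(line: str, max_bytes: int = 500) -> bool:
    """Keep lines whose UTF-8 encoding is at most `max_bytes` bytes.

    `max_bytes` is a hypothetical threshold chosen for illustration only.
    """
    return len(line.encode("utf-8")) <= max_bytes


if __name__ == "__main__":
    # A long Chinese line with no spaces: one "word" by whitespace splitting.
    long_zh = "这是一个很长的句子" * 40
    print(count_whitespace_words(long_zh), within_word_limit(long_zh))
    print(len(long_zh.encode("utf-8")), within_byte_limit(long_zh))
```

A byte limit is script-agnostic but not script-neutral: CJK characters usually take 3 bytes each in UTF-8, while ASCII letters take 1, so a single threshold allows fewer characters in CJK text than in English.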

@gregtatum added the language-coverage label on Apr 10, 2024