[meta] Train harder to segment languages, like CJK languages #425

Open
1 of 4 tasks
gregtatum opened this issue Feb 6, 2024 · 0 comments
Labels
epic language-coverage Issues related to covering specific languages


gregtatum commented Feb 6, 2024

The harder-to-segment languages include Chinese, Japanese, and Korean. To train them, we'll need to implement better tokenization and segmentation support for these languages. This work should happen after training a subset of the easier-to-segment languages in #524.
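
As a rough illustration of why segmentation is harder here, the sketch below compares plain whitespace splitting with a naive per-character fallback on a Chinese sentence. It is not taken from this project's pipeline: the helper names and the character ranges (Hiragana, Katakana, and the main CJK Unified Ideographs block) are assumptions chosen for illustration, and Hangul is deliberately left out because Korean text does contain spaces.

```python
import re

# Hypothetical range for illustration only: Hiragana, Katakana, and the
# main CJK Unified Ideographs block. Real coverage would need more blocks.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def whitespace_tokens(text):
    """Split on whitespace, as is common for space-delimited languages."""
    return text.split()


def char_fallback_tokens(text):
    """Naive baseline: emit each CJK character as its own token and keep
    other runs of characters whitespace-delimited."""
    tokens = []
    for chunk in text.split():
        buffer = ""
        for ch in chunk:
            if CJK_CHAR.match(ch):
                if buffer:
                    tokens.append(buffer)
                    buffer = ""
                tokens.append(ch)
            else:
                buffer += ch
        if buffer:
            tokens.append(buffer)
    return tokens


sentence = "我喜欢猫"
print(whitespace_tokens(sentence))     # the whole sentence is one token
print(char_fallback_tokens(sentence))  # one token per character
```

Per-character splitting is only a baseline: it ignores word boundaries entirely, which is part of why dedicated tokenization and segmentation support is called for above.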

Tasks

  1. language-coverage
  2. language-coverage
  3. language-coverage

Native Speakers

If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

gregtatum added the epic label Feb 6, 2024
gregtatum added the language-coverage (Issues related to covering specific languages) label Apr 10, 2024
gregtatum changed the title from "[meta] Support training CJK languages" to "[meta] Train harder to segment languages, like CJK languages" Apr 10, 2024