[meta] Train easy to segment LTR languages #524

Open
gregtatum opened this issue Apr 10, 2024 · 2 comments
Labels
epic language-coverage Issues related to covering specific languages

Comments

@gregtatum
Member

In the short term, we are focusing on building up our language list by training easy-to-segment LTR languages, as they don't require changes to the training pipeline and are immediately supported in Firefox. These are broken into 3 groups based on sentence count in the OPUS datasets.

| Data Availability | Sentence Count |
| --- | --- |
| High Resource | > 80 million |
| Med Resource | 20 - 80 million |
| Low Resource | < 20 million |
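
As a rough illustration of the grouping above, here is a minimal sketch in Python. The function name and constants are hypothetical and not part of the training pipeline, and it assumes that counts of exactly 20 or 80 million fall into the middle group, which the table does not spell out.

```python
# Hypothetical thresholds taken from the table above (sentence counts in OPUS).
HIGH_THRESHOLD = 80_000_000
LOW_THRESHOLD = 20_000_000


def resource_group(sentence_count: int) -> str:
    """Map a parallel sentence count to one of the three resource groups.

    Assumption: boundary values (exactly 20M or 80M) count as "Med Resource".
    """
    if sentence_count > HIGH_THRESHOLD:
        return "High Resource"
    if sentence_count < LOW_THRESHOLD:
        return "Low Resource"
    return "Med Resource"


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for count in (120_000_000, 45_000_000, 3_000_000):
        print(f"{count:>12,} -> {resource_group(count)}")
```

The sentence counts in the usage example are invented for illustration and do not correspond to any particular language pair.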

Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:

| High Quality | Medium Quality | Low Quality |
| --- | --- | --- |
| Russian (en-ru) | Vietnamese | Norwegian (Bokmål) |
| Indonesian | Slovak | Basque |
| Czech (en-cs) | Ukrainian (en-uk) | Galician |
| Hungarian (en-hu) | Slovenian (en-sl) | Norwegian (Nynorsk) |
| Turkish (en-tr) | Catalan (ready to ship) | |
| Greek (en-el) | Lithuanian | |
| Finnish (en-fi) | Croatian | |
| Swedish | Serbian | |
| Romanian | Latvian | |
| Danish | Valenciano | |
| Bosnian | | |

We will focus on the potentially "high quality" languages first, and follow up with "medium quality". It's unclear how well the "low quality" languages will perform and whether they will meet our shippable criteria, but that can be evaluated.

More links

  • We have a dashboard for an up-to-date list of what models we have shipped.
  • To request additional languages, post a request on Mozilla Connect, or find an existing request for a language and give it a thumbs up.

Native Speakers

If you are a native speaker (L1 language) of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

@gregtatum gregtatum added language-coverage Issues related to covering specific languages epic labels Apr 10, 2024
@gregtatum gregtatum changed the title Train easy to segment LTR languages [meta] Train easy to segment LTR languages Apr 10, 2024
@gregtatum
Member Author

gregtatum commented May 23, 2024

For our upcoming training run, this table should summarize what monolingual data is available.

| Name | Difficulty | To en | From en | Newscrawl |
| --- | --- | --- | --- | --- |
| Russian | ready to train | Released | Nightly | yes |
| Indonesian | ready to train | | | yes |
| Czech | ready to train | Nightly | Nightly | yes |
| Hungarian | ready to train | Released | Nightly | yes |
| Turkish | ready to train | | | yes |
| Greek | ready to train | | | yes |
| Finnish | ready to train | Released | Nightly | yes |
| Romanian | ready to train | | | yes |
| Ukrainian | medium resource | Released | Nightly | yes |
| Lithuanian | medium resource | Nightly | | yes |
| Croatian | medium resource | | | yes |
| Serbian | medium resource | | | yes |
| Latvian | medium resource | | | yes |
| Bosnian | ready to train | | | yes |
| Vietnamese | medium resource | | | no |
| Swedish | ready to train | | | no |
| Slovak | medium resource | | | no |
| Danish | ready to train | | | no |
| Slovenian | medium resource | | | no |
| Valenciano | medium resource | | | no |

@marco-c
Collaborator

marco-c commented May 23, 2024

MaCoCu has monolingual data for some of these languages: https://macocu.eu/#corpora-section.
