[meta] Train easy to segment LTR languages #524

gregtatum · 2024-04-10T19:44:27Z

In the short term we are focusing on building up our language list by training easy to segment LTR languages, as they don't require changes to the training pipeline and are immediately supported in Firefox. The languages fall into three groups, based on sentence counts from the OPUS datasets.

| Data Availability | Sentence Count |
| --- | --- |
| High Resource | > 80 million |
| Med Resource | 20 - 80 million |
| Low Resource | < 20 million |
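
As a rough illustration of the grouping above, here is a minimal sketch that maps a sentence count to one of the three tiers. The function name `resource_tier` and the example counts are hypothetical and are not taken from the training pipeline or from OPUS. The sketch also assumes that counts of exactly 20 or 80 million fall in the medium tier, which the table does not state explicitly.

```python
# Thresholds from the table above, in sentences.
HIGH_THRESHOLD = 80_000_000
LOW_THRESHOLD = 20_000_000


def resource_tier(sentence_count: int) -> str:
    """Return the data-availability tier for a sentence count.

    Assumption: the boundaries (exactly 20M and 80M) count as "Med Resource".
    """
    if sentence_count < 0:
        raise ValueError("sentence_count must be non-negative")
    if sentence_count > HIGH_THRESHOLD:
        return "High Resource"
    if sentence_count >= LOW_THRESHOLD:
        return "Med Resource"
    return "Low Resource"


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not real OPUS numbers.
    example_counts = {"xx": 120_000_000, "yy": 45_000_000, "zz": 3_000_000}
    for lang, count in example_counts.items():
        print(f"en-{lang}: {resource_tier(count)}")
```
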

Assuming that resource availability is roughly equivalent to the quality we will be able to achieve yields the following table:

| High Quality | Medium Quality | Low Quality |
| --- | --- | --- |
| Russian (en-ru) | Vietnamese | Norwegian (Bokmål) |
| Indonesian | Slovak | Basque |
| Czech (en-cs) | Ukrainian (en-uk) | Galician |
| Hungarian (en-hu) | Slovenian (en-sl) | Norwegian (Nynorsk) |
| Turkish (en-tr) | Catalan (ready to ship) | |
| Greek (en-el) | Lithuanian | |
| Finnish (en-fi) | Croatian | |
| Swedish | Serbian | |
| Romanian | Latvian | |
| Danish | Valenciano | |
| Bosnian | | |

We will focus on the potentially "high quality" languages first, and follow up with the "medium quality" ones. It's unclear how well the "low quality" languages will perform and whether they will meet our shippable criteria, but that can be evaluated.

Native Speakers

If you are a native (L1) speaker of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

gregtatum · 2024-05-23T16:25:41Z

For our upcoming training run, this table should summarize what monolingual data is available.

| Name | Difficulty | To `en` | From `en` | Newscrawl |
| --- | --- | --- | --- | --- |
| Russian | ready to train | Released | Nightly | yes |
| Indonesian | ready to train | | | yes |
| Czech | ready to train | Nightly | Nightly | yes |
| Hungarian | ready to train | Released | Nightly | yes |
| Turkish | ready to train | | | yes |
| Greek | ready to train | | | yes |
| Finnish | ready to train | Released | Nightly | yes |
| Romanian | ready to train | | | yes |
| Ukrainian | medium resource | Released | Nightly | yes |
| Lithuanian | medium resource | Nightly | | yes |
| Croatian | medium resource | | | yes |
| Serbian | medium resource | | | yes |
| Latvian | medium resource | | | yes |
| Bosnian | ready to train | | | yes |
| Vietnamese | medium resource | | | no |
| Swedish | ready to train | | | no |
| Slovak | medium resource | | | no |
| Danish | ready to train | | | no |
| Slovenian | medium resource | | | no |
| Valenciano | medium resource | | | no |

marco-c · 2024-05-23T16:36:02Z

MaCoCu has monolingual data for some of these languages: https://macocu.eu/#corpora-section.

gregtatum added the language-coverage and epic labels Apr 10, 2024

gregtatum changed the title ~~Train easy to segment LTR languages~~ [meta] Train easy to segment LTR languages Apr 10, 2024

This was referenced Apr 10, 2024

[meta] Train harder to segment languages, like CJK languages #425

Open

[meta] Train RTL languages like Arabic and Hebrew #525

Open

eu9ene mentioned this issue May 7, 2024

Report empty alignments separately #571

Merged

eu9ene mentioned this issue May 17, 2024

Issues with dependencies for Python 3.10/11 bitextor/bicleaner-ai#31

Closed
