[meta] Train RTL languages like Arabic and Hebrew #525

Open
1 task
gregtatum opened this issue Apr 10, 2024 · 1 comment
Labels
language-coverage Issues related to covering specific languages meta A collection of sub-issues that uses a tasklist


gregtatum commented Apr 10, 2024

RTL languages shouldn't affect training, but supporting them will require some work on the Firefox side. This meta bug tracks any work that is needed. We should first complete a subset of the easier-to-segment LTR languages in #524, as they do not require Firefox changes. The RTL languages will require a bit more work.

Tasks

There might be some tokenization/segmentation work around Arabic as well.

Native Speakers

If you are a native speaker (L1 language) of any of these languages and want to help out, feel free to leave a comment on this issue or join us in Firefox Translations on Matrix. We can always use help with qualitative model evaluation and with questions about the languages.

gregtatum added the language-coverage and meta labels on Apr 10, 2024
@BynariStar

I've found some resources I believe could be useful for training Hebrew models.

At the moment, the best language-pair datasets for Hebrew are the large multilingual ones on OPUS:

Here are a few more not on OPUS that might be worth checking as well:

  • HebNLI - Manually verified machine translation
  • HebWiki QA - Manually verified machine translation
  • word2word - A simple word-to-word pair dataset
  • Hebrew WordNet (Archived) - The website seems to be down. I will try to see if they have a backup.

Some extra resources:

  • HebSpacy - An NER model used by Azure AI Language for TA4H in Hebrew. Made in collaboration between Microsoft, the Israeli Ministry of Health, an Israeli HMO, and others.
  • The ONLP Lab - An NLP research lab at Bar-Ilan University in Israel. They have a lot of Hebrew NLP resources and models.
  • Hebrew NLP Resources - A list of Hebrew NLP resources compiled by NNLP, an Israeli government Hebrew NLP advancement project.
