Add all ISO-639-3 languages' names to Language.ts #1082
Add all ISO-639-3 languages
|
hi @lbourdois, |
|
oh actually can you generate a separate dict (same format) that includes all the ISO-639-3 tags? I'll rework how we merge the two dicts (in a separate PR) |
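One possible way to merge the two dicts, as a rough sketch only: the actual merge strategy is left to a separate PR and is not described here. The `LanguageEntry` shape, the `mergeLanguageDicts` name, and the rule that hand-curated entries win over generated ones are all assumptions made for illustration.

```typescript
// Hypothetical entry shape, modeled on the "code" / "name" / "nativeName" fields.
interface LanguageEntry {
  code: string;
  name: string;
  nativeName?: string;
}

type LanguageDict = Record<string, LanguageEntry>;

// Assumed rule: when both dicts contain the same key, the curated entry is kept.
function mergeLanguageDicts(curated: LanguageDict, generated: LanguageDict): LanguageDict {
  // Later spreads override earlier ones, so curated entries take precedence.
  return { ...generated, ...curated };
}

// Illustrative data only.
const curated: LanguageDict = {
  fr: { code: "fr", name: "French", nativeName: "français" },
};
const generated: LanguageDict = {
  fr: { code: "fr", name: "French" },
  aaa: { code: "aaa", name: "Ghotuo" },
};

const merged = mergeLanguageDicts(curated, generated);
console.log(Object.keys(merged).sort());
```

With this rule, an entry that already carries a native name is never overwritten by a generated entry that lacks one.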
Delete duplicate lzh and udm tags
|
@julien-c I deleted the Concerning the generation of a dictionary including all ISO-639-3 tags, two questions:
|
|
just to understand better, in your current PR, did you manually remove the tags like |
|
You can find the code I used in the following notebook: https://github.com/lbourdois/iso-639-3/blob/main/Language_ts%20generation.ipynb The logic (although in practice the code performs step 2 before step 1):
|
|
This PR is for ISO-639 codes, but I've just thought it would be quite easy to add the glottocodes used by linguists, and thus encourage them to put their data on the Hub. In this case, we use the name of the language in English as the key. If you're interested in this approach, I've put the complete dictionary here: https://github.com/lbourdois/iso-639-3/blob/main/Language_with_glottocode |
|
BTW linking this PR back to #193 and huggingface/datasets#4881 among others |
|
@lbourdois honestly i feel like Theoretically this will already close huggingface/datasets#4881. Are you ok if i move forward with this plan? (merge your PR, then move the file to the huggingface.js repo, and use it in the Hub) |
|
@julien-c I suppose it's better than nothing, so no problem moving forward. |
|
honestly i don't think we need native names (we almost don't use them in the Hub) |
|
I'll tweak this PR today, merge it, then move it to https://github.com/huggingface/huggingface.js as discussed. I think we'll be done for language integration in the Hub 🎉 Thanks for your help @lbourdois |
|
@julien-c |
Hi!
I was wandering around the Hub and came across https://huggingface.co/jbochi/madlad400-3b-mt, which offers over 400 languages.
Looking at them, it turns out that many have only the ISO-639-3 tag displayed but not their name.
I therefore propose this PR adding all the ISO-639-3 tag names (the data comes from the official SIL International website: https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab).
Languages already present in the Language.ts file are not affected by this PR.
I add the "code" and the "name" each time, but not the "nativeName", which is not present in the SIL file.
I suppose people will open PRs to add the native name of a language if they need it. If Wikipedia is well structured, I might be able to scrape it to retrieve these native names from the language pages, but that would be for another PR.
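The generation step described above could be sketched roughly as follows. This is an illustration under assumptions, not the code actually used for this PR (that lives in a separate notebook): the `LanguageEntry` shape and the `parseIso6393` helper are hypothetical, the SIL table is assumed to be tab-separated with a header row containing `Id` and `Ref_Name` columns, and the sample rows are abbreviated, illustrative data.

```typescript
// Hypothetical entry shape, modeled on the "code" / "name" / "nativeName" fields.
interface LanguageEntry {
  code: string;
  name: string;
  nativeName?: string;
}

// Build { code, name } entries from the tab-separated table,
// skipping every code that is already present in `existing`.
function parseIso6393(tsv: string, existing: ReadonlySet<string>): Record<string, LanguageEntry> {
  const lines = tsv
    .replace(/^\uFEFF/, "") // drop a leading byte-order mark, if any
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "");
  if (lines.length === 0) {
    throw new Error("empty input");
  }

  // Locate columns by header name instead of hard-coding positions.
  const header = lines[0].split("\t");
  const idCol = header.indexOf("Id");
  const nameCol = header.indexOf("Ref_Name");
  if (idCol === -1 || nameCol === -1) {
    throw new Error("expected 'Id' and 'Ref_Name' columns in the header");
  }

  const result: Record<string, LanguageEntry> = {};
  for (const line of lines.slice(1)) {
    const cells = line.split("\t");
    const code = cells[idCol]?.trim();
    const name = cells[nameCol]?.trim();
    if (!code || !name || existing.has(code)) {
      continue; // incomplete row, or language already listed
    }
    result[code] = { code, name }; // no nativeName: the table has none
  }
  return result;
}

// Illustrative sample, not the full SIL file.
const sample =
  "Id\tPart2b\tPart2t\tPart1\tScope\tLanguage_Type\tRef_Name\tComment\n" +
  "aaa\t\t\t\tI\tL\tGhotuo\t\n" +
  "lzh\t\t\t\tI\tH\tLiterary Chinese\t\n";

// Pretend "lzh" is already present in the existing dict.
const added = parseIso6393(sample, new Set(["lzh"]));
console.log(JSON.stringify(added));
```

Looking columns up by header name keeps the sketch independent of the exact column order in the downloaded file.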
Let me know if there's anything to change.