@lbourdois
Contributor

Hi!

I was wandering around the Hub and came across https://huggingface.co/jbochi/madlad400-3b-mt, which offers over 400 languages.
Looking at them, it turns out that many have only the ISO-639-3 tag displayed but not their name.

I therefore propose this PR adding all the ISO-639-3 tag names (the data comes from the official SIL International website: https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab).
Languages already present in the Language.ts file are not affected by this PR.
I add the "code" and the "name" each time, but not the "nativeName" which is not present in the SIL file.
I suppose people will open PRs to add the native name of a language if they express the need. If Wikipedia is well structured, I might be able to scrape it to retrieve these native names from the language pages, but that would be for another PR.
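
To illustrate how "code"/"name" entries can be derived from the SIL file, here is a minimal sketch, not the code actually used for this PR. It assumes the file is tab-separated with a header row containing `Id` and `Ref_Name` columns; the function name `parseIso6393Tab` and the inline sample rows are hypothetical stand-ins for the downloaded file.

```typescript
// Shape of an entry as described in this PR: "code" and "name", no "nativeName".
interface Language {
	code: string;
	name: string;
}

// Parse the contents of iso-639-3.tab (tab-separated, first line is a header).
// Assumption: the identifier column is headed "Id" and the English name "Ref_Name".
function parseIso6393Tab(tsv: string): Record<string, Language> {
	const lines = tsv.split(/\r?\n/).filter((line) => line.trim() !== "");
	const [headerLine, ...rows] = lines;
	if (headerLine === undefined) {
		throw new Error("Empty input");
	}
	const header = headerLine.split("\t").map((h) => h.trim().toLowerCase());
	const idCol = header.indexOf("id");
	const nameCol = header.indexOf("ref_name");
	if (idCol === -1 || nameCol === -1) {
		throw new Error("Unexpected header: expected Id and Ref_Name columns");
	}
	const languages: Record<string, Language> = {};
	for (const row of rows) {
		const cells = row.split("\t");
		const code = cells[idCol]?.trim();
		const name = cells[nameCol]?.trim();
		// Skip rows that lack either field rather than emitting partial entries.
		if (code && name) {
			languages[code] = { code, name };
		}
	}
	return languages;
}

// Tiny illustrative sample standing in for the real file.
const sample = [
	"Id\tPart2B\tPart2T\tPart1\tScope\tLanguage_Type\tRef_Name\tComment",
	"fra\tfre\tfra\tfr\tI\tL\tFrench\t",
	"bsa\t\t\t\tI\tL\tAbinomn\t",
].join("\n");

const parsed = parseIso6393Tab(sample);
console.log(parsed);
```

Looking columns up by header name rather than by position keeps the sketch tolerant of column reordering in the source file.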

Let me know if there's anything to change.

Add all ISO-639-3 languages
@julien-c
Member

julien-c commented Nov 7, 2023

hi @lbourdois, lzh and udm were already listed so they're both listed twice now

@julien-c
Member

julien-c commented Nov 7, 2023

oh actually can you generate a separate dict (same format) that includes all the ISO-639-3 tags? I'll rework how we merge the two dicts (in a separate PR)

Delete duplicate lzh and udm tags
@lbourdois
Contributor Author

@julien-c I deleted the lzh and udm tags that were duplicated.

Concerning the generation of a dictionary including all ISO-639-3 tags, two questions:

  • Do you want me to generate absolutely all languages, i.e. languages currently based on the ISO-639-1 tag in the Language.ts file should also be in ISO-639-3 (for example, French will be tagged fra instead of fr)?
    You would then write your own code to either revert to ISO-639-1 for those languages, or let those languages handle both 639-1 and 639-3 tags.
  • Should I put this generated dictionary in Language.ts or elsewhere?
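
As a rough illustration of the "handle both 639-1 and 639-3 tags" option raised in the first question, a lookup could normalize a 639-3 tag to the 639-1 tag already in use before consulting the dictionary. This is only a sketch under assumptions: `LANGUAGES`, `ISO_639_3_TO_1` and `resolveLanguage` are hypothetical names, and the two small maps are illustrative excerpts, not the real data.

```typescript
interface Language {
	code: string;
	name: string;
}

// Hypothetical excerpt: entries keyed by the tag the dictionary currently uses.
const LANGUAGES = new Map<string, Language>([
	["fr", { code: "fr", name: "French" }],
	["bsa", { code: "bsa", name: "Abinomn" }],
]);

// Hypothetical alias table from an ISO-639-3 tag to the ISO-639-1 tag already in use.
const ISO_639_3_TO_1 = new Map<string, string>([["fra", "fr"]]);

// Accept either form of tag and return the matching entry, if any.
function resolveLanguage(tag: string): Language | undefined {
	const normalized = tag.trim().toLowerCase();
	const canonical = ISO_639_3_TO_1.get(normalized) ?? normalized;
	return LANGUAGES.get(canonical);
}

console.log(resolveLanguage("fra")); // same entry as resolveLanguage("fr")
console.log(resolveLanguage("bsa"));
```

With this approach the dictionary itself keeps a single entry per language, and the alias table is the only place that knows about the two standards.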

@julien-c
Member

julien-c commented Nov 8, 2023

just to understand better, in your current PR, did you manually remove the tags like fra which cover a language that's already in ISO-639-1?

@lbourdois
Contributor Author

lbourdois commented Nov 8, 2023

You can find the code I used in the following notebook: https://github.com/lbourdois/iso-639-3/blob/main/Language_ts%20generation.ipynb

The logic (even if in practice I do 2) then 1) in the code):

  1. I generate the dictionary of all name/tag pairs in ISO-639-3 format from https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab. I've just put the resulting dictionary on https://github.com/lbourdois/iso-639-3/blob/main/Language.ts so that you can see what it looks like until we decide whether to put it elsewhere. So I wonder whether this is the file you want or whether you're talking about something else.
  2. Since some languages are already present in the Language.ts file of the hub-docs repo, I removed those languages from the file obtained in 1) to avoid duplicates. This was done automatically, not manually; the lzh and udm tags were missed because of a typo in my code, which is now fixed.
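
The deduplication in step 2 can be sketched as follows. This is not the notebook's code (which is linked above), only a minimal illustration under assumptions: `removeAlreadyListed` and its `aliases` parameter are hypothetical, and the sample dictionaries are tiny excerpts.

```typescript
interface Language {
	code: string;
	name: string;
}

// Remove from `generated` every entry whose code is already in `existing`.
// `aliases` (hypothetical) maps an ISO-639-3 code to the ISO-639-1 code used by
// the existing dictionary, so that "fra" is dropped when "fr" is present.
function removeAlreadyListed(
	generated: Record<string, Language>,
	existing: Record<string, Language>,
	aliases: Record<string, string> = {}
): Record<string, Language> {
	const existingCodes = new Set(Object.values(existing).map((lang) => lang.code));
	const result: Record<string, Language> = {};
	for (const [key, lang] of Object.entries(generated)) {
		const alias = aliases[lang.code];
		const duplicate =
			existingCodes.has(lang.code) || (alias !== undefined && existingCodes.has(alias));
		if (!duplicate) {
			result[key] = lang;
		}
	}
	return result;
}

// Illustrative excerpts only.
const existing: Record<string, Language> = {
	fr: { code: "fr", name: "French" },
	lzh: { code: "lzh", name: "Literary Chinese" },
	udm: { code: "udm", name: "Udmurt" },
};
const generated: Record<string, Language> = {
	fra: { code: "fra", name: "French" },
	lzh: { code: "lzh", name: "Literary Chinese" },
	udm: { code: "udm", name: "Udmurt" },
	bsa: { code: "bsa", name: "Abinomn" },
};

const deduped = removeAlreadyListed(generated, existing, { fra: "fr" });
console.log(Object.keys(deduped));
```

Comparing on the `code` field rather than on the object key means the check still works if the two dictionaries are keyed differently.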

@lbourdois
Contributor Author

This PR is for ISO-639 codes, but it just occurred to me that it would be quite easy to add the glottocodes used by linguists, and thus encourage them to put their data on the Hub.
The dictionary would then look something like:

Abinomn: {
	code: "bsa",
	glottocode: "abin1243",
	near_code: "None",
	name: "Abinomn"
},
Gallo: {
	code: "None",
	glottocode: "gall1275",
	near_code: "fra",
	name: "Gallo"
},

In this case, the English name of the language is the key.
Each value holds four fields: code is what HF currently uses (the ISO code); glottocode is the equivalent of the ISO code in the glottocode convention; near_code is the closest ISO code when the language has no ISO code of its own; and name is the name of the language. There are around 7,000 ISO codes versus over 20,000 glottocodes, so many more languages and dialects are taken into account, and near_code also tells us which languages are dialects of a main language.

If you're interested in this approach, I've put the complete dictionary here: https://github.com/lbourdois/iso-639-3/blob/main/Language_with_glottocode
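
To make the role of near_code concrete, here is a small typed sketch of the structure described above. It is an assumption-laden illustration: the `GlottoLanguage` interface, the use of `null` instead of the "None" string, and the helpers `effectiveIsoCode` and `byGlottocode` are all hypothetical and not part of the proposed dictionary.

```typescript
// Entry shape following the example above, with `null` in place of the "None" string.
interface GlottoLanguage {
	code: string | null; // ISO code, when the language has one
	glottocode: string;
	near_code: string | null; // closest ISO code when `code` is missing
	name: string;
}

const GLOTTO_LANGUAGES: Record<string, GlottoLanguage> = {
	Abinomn: { code: "bsa", glottocode: "abin1243", near_code: null, name: "Abinomn" },
	Gallo: { code: null, glottocode: "gall1275", near_code: "fra", name: "Gallo" },
};

// ISO code to fall back on: the language's own code, else the nearest one.
function effectiveIsoCode(lang: GlottoLanguage): string | null {
	return lang.code ?? lang.near_code;
}

// Secondary index so entries can also be found by glottocode.
const byGlottocode = new Map<string, GlottoLanguage>(
	Object.values(GLOTTO_LANGUAGES).map((lang) => [lang.glottocode, lang] as const)
);

for (const lang of Object.values(GLOTTO_LANGUAGES)) {
	console.log(lang.name, lang.glottocode, effectiveIsoCode(lang));
}
```

Using `null` rather than the string "None" lets the type checker distinguish a missing code from a real one.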

@julien-c
Member

julien-c commented Nov 9, 2023

BTW linking this PR back to #193 and huggingface/datasets#4881 among others

@julien-c
Member

julien-c commented Nov 9, 2023

@lbourdois honestly i feel like glottocode are a little bit too complex (for our use-case) and i think it's already an improvement if we have the full ISO-639-3 DB in this file (which i'll move to a https://github.com/huggingface/huggingface.js subpackage if we move forward with that PR)

Theoretically this will already close huggingface/datasets#4881

Are you ok if i move forward with this plan? (merge your PR, then move the file to the huggingface.js repo, and use it in the Hub)

@lbourdois
Contributor Author

lbourdois commented Nov 9, 2023

@julien-c I suppose it's better than nothing; glottocode is indeed complicated and not very intuitive.
The real solution would be to complain to SIL International that they're missing a bunch of languages from their list and therefore need to add new tags (edit: after some research, the submission process for proposing an ISO code is very long: a 5-page form to fill out for each language, plus a 3-month review time for each language...).

So no problem moving forward.
I'll just point out that I scraped Wikipedia and retrieved 1,794 native names (22.6% of ISO codes).
I've started proofreading them manually (out of the first 180 lines, I had to correct 10) and still have to finish. I hope to be able to do it this Friday, if not this weekend, and thus be able to offer the native names in this PR as well.

@julien-c
Member

honestly i don't think we need native names (we almost don't use them in the Hub)

@julien-c
Member

I'll tweak this PR today, merge it, then move it to https://github.com/huggingface/huggingface.js as discussed.

I think we'll be done for language integration in the Hub 🎉 Thanks for your help @lbourdois

@julien-c julien-c merged commit d500ff2 into huggingface:main Nov 10, 2023
@lbourdois
Contributor Author

lbourdois commented Nov 10, 2023

@julien-c
In case it's ever useful, I've added the cleaned dataset of 1,790 native names here: https://huggingface.co/datasets/lbourdois/nativenames_iso639-3
