Inconsistent ISO Codes #941

Open
Foggalong opened this issue Aug 30, 2021 · 5 comments
Labels
refactor Issues that change how things are done, but not what is done.

Comments

@Foggalong
Contributor

Foggalong commented Aug 30, 2021

While digging around for #940 I found this very minor and very nerdy issue; the translation docs say CODE should be the 'short country code' but the actual values are a hodgepodge of four different systems (something again just inherited from the original project).

  • Most are ISO 639-1 language codes, like sr for Serbian (whose country code is rs) or ar for Arabic (whose country code is ma).
  • A few are ISO 3166-1 Alpha-2 country codes, like cz for Czechia (whose language code is cs) or kr for South Korea (Korean's language code is ko).
  • Three use ISO/IEC 15897 locale codes to denote country-specific language variants: pt_br, zh-cn, and zh-tw (though not consistently, e.g. there's no en_gb or ko-kr, and the delimiter varies).* To complicate things further, the filenames for these are just the 639-1 codes.
  • Out there on its own, Catalan uses cat which is an ISO 639-2 code; the 639-1 code is ca and its ISO 3166-2 sub-division code is es-ct.**
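To make the four shapes above concrete, here is a rough Python sketch that sorts a code by its shape alone. It is not part of the project; the function name `code_shape` and the category labels are made up for illustration. Shape is all it can check: a two-letter code could be either a language or a country, which is exactly the ambiguity described above.

```python
import re

# Shape-only patterns (hypothetical helper, not project code).
# A two-letter code may be ISO 639-1 (language) or ISO 3166-1 (country);
# the shape alone cannot tell them apart.
TWO_PART = re.compile(r"[a-z]{2}[-_][a-z]{2}")
THREE_LETTER = re.compile(r"[a-z]{3}")
TWO_LETTER = re.compile(r"[a-z]{2}")


def code_shape(code: str) -> str:
    """Classify a translation code by its shape only."""
    code = code.lower()
    if TWO_PART.fullmatch(code):
        # Covers locale-style codes (pt_br, zh-cn) but also
        # subdivision-style codes such as es-ct.
        return "two-part"
    if THREE_LETTER.fullmatch(code):
        return "three-letter"
    if TWO_LETTER.fullmatch(code):
        return "two-letter"
    return "unrecognised"


for example in ["sr", "cz", "pt_br", "zh-cn", "cat"]:
    print(example, "->", code_shape(example))
```

Note that `es-ct` and `ca-es` both come out as "two-part", so telling a subdivision code from a locale still needs a real reference list.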

None of the current translations clash, but it could happen if someone were to submit Argentine Spanish, Canadian French or English, anything from Suriname, etc. If it doesn't otherwise cause issues, I'd suggest maybe moving over to a locale-based coding system (especially if it were being looked at as part of #503), but either way the documentation needs updating to reflect which option people should actually be using.
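The clash risk can be shown with a small Python sketch. The two tables below are tiny hand-picked excerpts of ISO 639-1 and ISO 3166-1 alpha-2 chosen for this example, not the full standards, and `ambiguous_codes` is a hypothetical helper name.

```python
# Hand-picked excerpts for illustration only; the real standards are far larger.
LANGUAGES_639_1 = {
    "sr": "Serbian",
    "ar": "Arabic",
    "ca": "Catalan",
    "cs": "Czech",
    "ko": "Korean",
}
COUNTRIES_3166_1 = {
    "sr": "Suriname",
    "ar": "Argentina",
    "ca": "Canada",
    "rs": "Serbia",
    "cz": "Czechia",
    "kr": "South Korea",
}


def ambiguous_codes(languages: dict, countries: dict) -> list:
    """Return the codes that are valid in both tables, sorted."""
    return sorted(languages.keys() & countries.keys())


for code in ambiguous_codes(LANGUAGES_639_1, COUNTRIES_3166_1):
    print(f"{code}: {LANGUAGES_639_1[code]} or {COUNTRIES_3166_1[code]}?")
```

With these excerpts the overlapping codes are `ar`, `ca`, and `sr`, which are the Argentina, Canada, and Suriname cases mentioned above.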

@Foggalong
Contributor Author

Weird bonus facts I learned researching this:

* The 15897 codes are essentially 3166-1 codes appended to 639-1 codes, and then the POSIX implementation of the standard uses _ while the Windows implementation uses - (where they're known as LCID codes). I kinda feel like the delimiter is a feature that should have been part of the ISO standard, but maybe that's just me 😅

** The subdivision code for Catalonia is es-ct and its locale is ca-es, but in some places online one or the other is erroneously given as es-ca; that's actually the 3166-2 code for Cádiz at the opposite end of Spain.
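Since the only difference between the two spellings described in the first footnote is the delimiter, normalising them is simple. A minimal Python sketch, assuming the codes are always two letters, a `-` or `_`, then two letters; `split_locale` and `normalise_locale` are hypothetical names, and the lowercase output just mirrors how the codes are written in this issue.

```python
import re
from typing import Tuple

# Assumed shape: two-letter language, "-" or "_", two-letter country.
_LOCALE = re.compile(r"([a-z]{2})[-_]([a-z]{2})")


def split_locale(code: str) -> Tuple[str, str]:
    """Split a locale-style code into (language, country), lowercased."""
    match = _LOCALE.fullmatch(code.lower())
    if match is None:
        raise ValueError(f"not a two-part locale code: {code!r}")
    return match.group(1), match.group(2)


def normalise_locale(code: str, sep: str = "_") -> str:
    """Rewrite a locale-style code with a single, consistent delimiter."""
    language, country = split_locale(code)
    return f"{language}{sep}{country}"


print(normalise_locale("zh-cn"))           # zh_cn
print(normalise_locale("pt_br", sep="-"))  # pt-br
```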

@Nightfirecat Nightfirecat added the refactor Issues that change how things are done, but not what is done. label Aug 30, 2021
@tupaschoal
Member

Thanks for the great investigation. I remember something like this was indeed an issue back when #503 was being developed, but as it never got finished, we never followed up either. I'm not sure I get the direction you're suggesting we move in, though, but again I'm no expert in the mess that these language codes are.

@AverageHelper
Contributor

This explains why some of the codes were harder for me to find via Google than others, last time I was working on translations here. I forget which codes were the harder ones.

+1 for choosing one system and sticking to it. @Foggalong any suggestions?

@Foggalong
Contributor Author

If you want to use a standard then, since these are languages, one of the two language-based standards (639-2 or 15897) would be the best bet. I don't know how likely a sudden influx of translations is, but those would be best for really future-proofing.

Honestly though, as you say, having a consistent system is the most important thing. Even just grabbing a complete list of locales from somewhere else and using that as the reference for translators would help remove the inconsistency.
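As a rough idea of what checking against a reference list could look like, here is a Python sketch. `REFERENCE_LOCALES` is a made-up, deliberately tiny stand-in for whatever complete list ends up being chosen, and `unknown_codes` is a hypothetical helper; none of this exists in the project.

```python
# Hypothetical stand-in for a complete, agreed-upon list of locales.
REFERENCE_LOCALES = frozenset({"ca_es", "cs_cz", "ko_kr", "pt_br", "zh_cn", "zh_tw"})


def unknown_codes(codes, reference=REFERENCE_LOCALES):
    """Return the codes that are not in the reference list, in input order.

    Codes are lowercased and "-" is treated as "_" before the lookup, so
    delimiter differences alone do not count as errors.
    """
    return [c for c in codes if c.lower().replace("-", "_") not in reference]


# A few of the codes mentioned in this issue.
print(unknown_codes(["pt_br", "zh-cn", "zh-tw", "cat", "cz", "kr"]))
# ['cat', 'cz', 'kr']
```

A check like this could run over the translation files so that translators get told straight away when a code is not on the agreed list.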

@tupaschoal
Member

I don't really like 639-2 because it doesn't account for language specifics, like my own Portuguese / Portuguese (Brazilian) which can have significant differences.

Sadly 15897 doesn't seem to sport a list on Wikipedia, which would be the most convenient. Whatever we choose, I think it needs to have a reasonably serious, reliable list that's accessible, and it needs to account for these language sub-branches.
