Standardize Languages #8

deobald · 2020-09-12T13:43:58Z

Current Thinking (2021 July 22)

Language: ISO 639-3 (eng, hin, etc.)
Script: ISO 15924 only for Chinese (hans vs. hant)
- Also see the IANA Subtag Registry
Kosa: Construct a flat list of languages on the server side, but content is divided by text vs. audio/video. Non-Chinese text and audio happen to have the same keys. Chinese text content is keyed as either zho-hant or zho-hans, and nothing else. Chinese audio content is keyed as cmn, yue, nan, or hak.
Mobile App: Users select a language and the app constructs a "preferred languages list" behind the scenes. All languages have a backup language of English. Selecting a Chinese language also presents the user with a script selection box, with options of Traditional or Simplified. For Chinese users, the most-preferred language is always zho-hant or zho-hans and the second-most-preferred language is the spoken Chinese language (one of cmn, yue, nan, or hak). Chinese users also get a final backup of English.
Thinking: This system allows Kosa to serve content with a flat language key for anything, greatly simplifying how it tracks languages and preventing a language tree from emerging anywhere in the API. The "preferred languages list" allows us to (a) back up everything with English content and (b) add flexible language preferences and new script options later, if required. The language-selection algorithm can be dumb-but-flexible, allowing us to avoid lookup trees entirely.
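A rough sketch of how the "preferred languages list" and the dumb-but-flexible selection could fit together. This is only an illustration of the idea above, not the actual Kosa or mobile app code: the function names (`preferred_languages`, `pick_content`), the constants, and the shape of the `available` dict are all assumptions.

```python
from __future__ import annotations

from typing import Optional

# Flat keys described above. Constant names are hypothetical.
CHINESE_AUDIO_KEYS = {"cmn", "yue", "nan", "hak"}
CHINESE_SCRIPTS = {"hant", "hans"}
FALLBACK = "eng"


def preferred_languages(language: str, script: Optional[str] = None) -> list[str]:
    """Build the ordered preference list for one user selection."""
    preferred = []
    if language in CHINESE_AUDIO_KEYS:
        if script not in CHINESE_SCRIPTS:
            raise ValueError("Chinese languages need a script: 'hant' or 'hans'")
        preferred.append(f"zho-{script}")  # written form is most preferred
    preferred.append(language)             # then the selected (spoken) language
    if FALLBACK not in preferred:
        preferred.append(FALLBACK)         # English is always the final backup
    return preferred


def pick_content(available: dict[str, str], preferred: list[str]) -> Optional[str]:
    """Walk the flat preference list and return the first match, if any."""
    for key in preferred:
        if key in available:
            return available[key]
    return None
```

With this shape, a Cantonese user who picks Traditional script gets `["zho-hant", "yue", "eng"]`, and the server only ever has to compare flat keys. Adding a new script option later would mean adding one more entry to the list rather than a new level in a tree.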

Requirements

embed standardized language names in Dart and Ruby/Clojure using ICU libraries (?)
a minimum required set of languages includes:
- pali
- english
- espanol
- italiano
- simplified chinese
- francais
- portugues
- srpsko-hrvatski (serbo-croatian)

The complete list of languages currently supported by Pariyatti:

https://store.pariyatti.org/Languages_c_86.html

It seems that ISO 639-3 (an extension of ISO 639-2) has reasonably comprehensive support:

https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes (includes 639-3)
http://www.loc.gov/standards/iso639-2/faq.html#24
https://iso639-3.sil.org/code/zho
https://www.unicode.org/iso15924/iso15924-en.html (hans / hant instead of _CN / _TW?)
- also used by IANA: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

My current thinking is ISO 639-3 + (optional) region specifier. Alternatively, some BCP 47 subset... but it's just so complicated.

Wikipedia uses a number of hacks to get around BCP 47 limitations:

Examples explaining why flattening Chinese languages won't work:

Taiwan speaks cmn, nan, hak but always uses zho-hant
Fujian / Guangdong (China) speak nan and hak but always use zho-hans
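The two examples above can be written down as data to show the problem. This is a small sketch with a hypothetical `REGIONS` table and `scripts_for` helper; it only contains the two regions listed above, so it is not a complete picture of where each language is spoken.

```python
from __future__ import annotations

# Spoken languages and script per region, copied from the two examples above.
REGIONS = {
    "Taiwan": {"spoken": {"cmn", "nan", "hak"}, "script": "zho-hant"},
    "Fujian / Guangdong": {"spoken": {"nan", "hak"}, "script": "zho-hans"},
}


def scripts_for(spoken: str) -> set[str]:
    """Collect every script used in regions where `spoken` is spoken."""
    return {r["script"] for r in REGIONS.values() if spoken in r["spoken"]}
```

Here `scripts_for("nan")` and `scripts_for("hak")` each come back with both zho-hant and zho-hans, so a single flattened key per Chinese language cannot tell us which script to serve. The spoken language and the script have to be tracked separately.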

Chinese scripts can be decoded here:

https://www.chineseconverter.com/en/convert/find-out-if-simplified-or-traditional-chinese

Old notes from Asana:

1:

My first round of research turned up this:

A Language should have three fields: IANA code, English name ("Hindi"), Actual name ("हिंदी")

IANA tag registry is here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Prefer tag combinations which are the nearest matches to the Gettext locale standard, wherever possible:

https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html#Locale-Names
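The three-field Language record from this note might look something like the following. It is a minimal sketch: the class name, field names, and `BY_CODE` index are made up for illustration, and the codes shown are the IANA primary subtags for Hindi and English.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    code: str          # IANA subtag, e.g. "hi"
    english_name: str  # name in English, e.g. "Hindi"
    native_name: str   # name in the language itself, e.g. "हिंदी"


LANGUAGES = [
    Language(code="hi", english_name="Hindi", native_name="हिंदी"),
    Language(code="en", english_name="English", native_name="English"),
]

# Flat lookup by code; no nesting by region or script.
BY_CODE = {lang.code: lang for lang in LANGUAGES}
```

A language picker could then show `BY_CODE["hi"].native_name` to the user while storing only the code.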

2:

Ooooohhhkkkayyyyy. It looks like THIS is maybe the standard way to do this? At least according to friends at Wikipedia:

https://github.com/unicode-org/cldr/tree/release-37/common/main

This list is available through ICU libraries. This CLDR format also contains the language name equivalents (अंग्रेज़ी / English vs. Hindi / हिंदी vs. every other possible combination).

3:

The canonical ICU webpage is here: http://site.icu-project.org/home

The Ruby library is listed here (gem icu): http://site.icu-project.org/related

There is a Dart package: https://pub.dev/packages/icu

4: (post-Asana)

Clojure: https://github.com/Vincit/satakieli (wraps ICU4J)
Java: http://site.icu-project.org (ICU4J)

deobald · 2022-01-17T04:16:01Z

This is completed for Kosa with the inclusion of a complete superset of both dhamma.org and pariyatti.org languages (as a flattened list).

deobald self-assigned this Sep 12, 2020

deobald transferred this issue from pariyatti/kosa-rails Dec 21, 2020

deobald added this to the v1 milestone Dec 21, 2020

deobald mentioned this issue Jun 8, 2021

Standardize Languages pariyatti/mobile-app#49

Closed

deobald closed this as completed Jan 17, 2022
