This repository has been archived by the owner on Oct 26, 2024. It is now read-only.

Standardize Languages #8

Closed
2 tasks
deobald opened this issue Sep 12, 2020 · 1 comment

@deobald
Member

deobald commented Sep 12, 2020

Current Thinking (2021 July 22)

  1. Language: ISO 639-3 (eng, hin, etc.)
  2. Script: ISO 15924 only for Chinese (hans vs. hant)
  3. Kosa: Construct a flat list of languages on the server side, but content is divided by text vs. audio/video. Non-Chinese text and audio happen to have the same keys. Chinese text content is either keyed as zho-hant or zho-hans, but nothing else. Audio content is keyed as cmn, yue, nan, or hak.
  4. Mobile App: Users select a language and the app constructs a "preferred languages list" behind the scenes. All languages have a backup language of English. Selecting a Chinese language also presents the user with a script selection box, with options of Traditional or Simplified. For Chinese users, the most-preferred language is always zho-hant or zho-hans, and the second-most-preferred language is the spoken Chinese language (one of cmn, yue, nan, or hak). Chinese users also get a final backup of English.
  5. Thinking: This system allows Kosa to serve content with a flat language key for anything, greatly simplifying how it tracks languages and preventing a language tree from emerging anywhere in the API. The "preferred languages list" allows us to (a) back up everything with English content and (b) add flexible language preferences and new script options later, if required. The language-selection algorithm can be dumb-but-flexible, allowing us to avoid lookup trees entirely.
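
A minimal sketch of points 4 and 5, in Python for illustration only (the mobile app itself would not be written in Python). The function names `preferred_languages` and `pick_content` are hypothetical, not taken from Kosa or the app; the language keys come from the list above.

```python
from __future__ import annotations

from typing import Optional

# Keys taken from the issue text; everything else here is an assumption.
SPOKEN_CHINESE = {"cmn", "yue", "nan", "hak"}
CHINESE_SCRIPTS = {"hans", "hant"}
FALLBACK = "eng"


def preferred_languages(language: str, script: Optional[str] = None) -> list[str]:
    """Build the ordered preference list for a user's selection."""
    if language in SPOKEN_CHINESE:
        if script not in CHINESE_SCRIPTS:
            raise ValueError("Chinese selections need a script: hans or hant")
        # Text key first, spoken-language key second.
        prefs = [f"zho-{script}", language]
    else:
        prefs = [language]
    # Every list ends with English as the final backup.
    if FALLBACK not in prefs:
        prefs.append(FALLBACK)
    return prefs


def pick_content(available: dict[str, str], prefs: list[str]) -> Optional[str]:
    """Return the content for the first preferred key that exists (flat lookup)."""
    for key in prefs:
        if key in available:
            return available[key]
    return None
```

With this sketch, `preferred_languages("yue", "hant")` returns `["zho-hant", "yue", "eng"]`. The same flat list works for both kinds of content: a text lookup matches on `zho-hant`, an audio lookup skips it and matches on `yue`, and either falls through to `eng` if nothing else is available.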

Requirements

  • embed standardized language names in Dart and Ruby/Clojure using ICU libraries (?)
  • the minimum required set of languages includes:
    • pali
    • english
    • espanol
    • italiano
    • simplified chinese
    • francais
    • portugues
    • srpsko-hrvatski (serbo-croatian)

The complete list of languages currently supported by Pariyatti:

It seems that ISO 639-3 (an extension of ISO 639-2) has reasonably comprehensive support:

My current thinking is ISO 639-3 + (optional) region specifier. Alternatively, some BCP 47 subset... but it's just so complicated.

Wikipedia uses a number of hacks to get around BCP 47 limitations:

Examples explaining why flattening Chinese languages won't work:

  1. Taiwan speaks cmn, nan, hak but always uses zho-hant
  2. Fujian / Guangdong (China) speak nan and hak but always use zho-hans
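
The two examples can be checked mechanically. This is a small sketch using only the data listed above; the variable names are made up for illustration. It shows that a spoken language does not determine a script, so the two cannot be collapsed into one entry.

```python
from collections import defaultdict

# (region, spoken languages, text key), as listed in the examples above.
REGIONS = [
    ("Taiwan", ["cmn", "nan", "hak"], "zho-hant"),
    ("Fujian / Guangdong", ["nan", "hak"], "zho-hans"),
]

# Collect every text key seen for each spoken language.
scripts_by_spoken = defaultdict(set)
for _region, spoken_languages, text_key in REGIONS:
    for spoken in spoken_languages:
        scripts_by_spoken[spoken].add(text_key)

# Spoken languages that map to more than one text key cannot be
# flattened into a single "language + script" entry.
ambiguous = sorted(s for s, keys in scripts_by_spoken.items() if len(keys) > 1)
print(ambiguous)  # ['hak', 'nan']
```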

The script of a Chinese text (Simplified vs. Traditional) can be identified here:

https://www.chineseconverter.com/en/convert/find-out-if-simplified-or-traditional-chinese


Old notes from Asana:

1:

My first round of research turned up this:

A Language should have three fields: IANA code, English name ("Hindi"), Actual name ("हिंदी")
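
As a rough sketch, that record could look like the following. The class and field names are hypothetical, and Python is used only for illustration; `hi` is the IANA subtag for Hindi (the later thinking at the top of this issue would key it as `hin` instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    code: str          # language tag, e.g. "hi" in the IANA registry
    english_name: str  # name in English, e.g. "Hindi"
    native_name: str   # name in the language itself, e.g. "हिंदी"


hindi = Language(code="hi", english_name="Hindi", native_name="हिंदी")
```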

IANA tag registry is here: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

Prefer tag combinations that are the nearest matches to the Gettext locale standard, wherever possible:

https://www.gnu.org/software/gettext/manual/html_node/Locale-Names.html#Locale-Names


2:

Ooooohhhkkkayyyyy. It looks like THIS is maybe the standard way to do this? At least according to friends at Wikipedia:

https://github.com/unicode-org/cldr/tree/release-37/common/main

  • This list is available through ICU libraries. The CLDR format also contains the language name equivalents (अंग्रेज़ी / English vs. Hindi / हिंदी vs. every other possible combination).

3:

The canonical ICU webpage is here: http://site.icu-project.org/home

The Ruby library is listed here (gem icu): http://site.icu-project.org/related

There is a Dart package: https://pub.dev/packages/icu


4: (post-Asana)

Clojure: https://github.com/Vincit/satakieli (wraps ICU4J)
Java: http://site.icu-project.org (ICU4J)

@deobald deobald self-assigned this Sep 12, 2020
@deobald deobald transferred this issue from pariyatti/kosa-rails Dec 21, 2020
@deobald deobald added this to the v1 milestone Dec 21, 2020
@deobald
Member Author

deobald commented Jan 17, 2022

This is completed for Kosa with the inclusion of a complete superset of both dhamma.org and pariyatti.org languages (as a flattened list).

@deobald deobald closed this as completed Jan 17, 2022