Number localisation for Asian #16683

tribela · 2021-08-31T09:40:33Z

Pitch

Currently, Mastodon displays status counts, user counts, etc. as "30k" or "1.8M".
But this cannot be translated properly for most Asian locales, because many Asian languages use a 10,000-based numbering system instead of the 1,000-based (western) one.

For example, Twitter has already implemented this behaviour.
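
To make the difference concrete, here is a minimal sketch of the two grouping systems. The unit tables and the `compact` helper are hypothetical and hand-written for illustration only; real suffixes and thresholds should come from CLDR data, not from a hard-coded list like this.

```typescript
// Hypothetical unit tables, for illustration only (not taken from CLDR).
type Unit = { divisor: number; suffix: string };

// Western grouping: a new unit every 3 digits.
const WESTERN_UNITS: Unit[] = [
  { divisor: 1_000_000, suffix: "M" },
  { divisor: 1_000, suffix: "K" },
];

// Japanese (myriad) grouping: a new unit every 4 digits.
const MYRIAD_UNITS_JA: Unit[] = [
  { divisor: 100_000_000, suffix: "億" },
  { divisor: 10_000, suffix: "万" },
];

// Shortens a non-negative integer using the first unit it reaches,
// keeping at most one fraction digit (truncated, not rounded).
function compact(value: number, units: Unit[]): string {
  for (const unit of units) {
    if (value >= unit.divisor) {
      const scaled = Math.floor((value * 10) / unit.divisor) / 10;
      return `${scaled}${unit.suffix}`;
    }
  }
  return String(value);
}

console.log(compact(30_000, WESTERN_UNITS));      // "30K"
console.log(compact(30_000, MYRIAD_UNITS_JA));    // "3万"
console.log(compact(1_800_000, WESTERN_UNITS));   // "1.8M"
console.log(compact(1_800_000, MYRIAD_UNITS_JA)); // "180万"
```

The same count ends up with a different divisor and suffix depending on the locale, which is why translating only the "K" or "M" suffix is not enough.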

Motivation

For better localisation.

tribela · 2021-08-31T10:15:20Z

With TwitterCldr, it looks like this (before/after screenshots):

ClearlyClaire · 2021-09-01T11:36:02Z

This is, I think, supposed to be handled by number_to_human, but we have overridden the precision and other settings because the data from rails-i18n was incorrect for several locales.

It could probably be fixed by using just number_to_human but we'd have to review how it's called as well as each locale definition.

brawaru · 2021-09-09T15:07:58Z

[More of a maintenance note than a contribution to the issue]

My PR #14061 is affected by this issue too and it's a bit tricky to fix because it's nothing but a hack.

The problem is that the creators of the Intl.NumberFormat and Intl.PluralRules browser APIs (which we rely on) did not supply us with methods to correctly pluralise words based on a number in compact notation. In some languages, and I can speak for Russian here, plural rules change for compact notation: for example, ‘10 321 пост’, but ‘1.3 тыс. постов’.
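
A small sketch of the mismatch, using only the standard `Intl.PluralRules` API. It assumes a runtime with Russian locale data available (for example a Node.js build with full ICU); the variable name is arbitrary.

```typescript
// Plural categories for Russian, as selected by the runtime's CLDR data.
const ruPlurals = new Intl.PluralRules("ru");

// The exact number 10 321 ends in 1 (and not in 11), so it selects "one": ‘пост’.
console.log(ruPlurals.select(10_321)); // "one"

// The compact form ‘1.3 тыс.’ stands for 1 300, which selects "many": ‘постов’.
console.log(ruPlurals.select(1_300)); // "many"

// Passing the displayed fraction itself selects "other", which is a different form again.
console.log(ruPlurals.select(1.3)); // "other"
```

Nothing here tells `Intl.PluralRules` that the number will be shown in compact notation, so the caller has to decide which value to feed into `select`.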

In comparison, in ICU4J (Java library), you can format a number to short notation and it returns you an object which you can either convert to string (if you need a value right away) or supply to your plural rule select function, which gets you a plural category to use when localizing the string. That is how things should be, actually, but uh oh. We've got tc39/ecma402#397, but there have been no updates for over a year now.

Currently, our solution manually finds a ‘best way’ to shorten a number based, unfortunately, on this 1000-based ‘western’ system. While doing so, it gives us a division, from which we calculate the value we then base the plural on. This is why, if you worked on a translation, you saw two variables, count and counter: count is that exact ‘pluralisation’ value and counter is the actual number in compact notation.

Walk-through

Given number 10,321:

10,321 is less than a million, so we use the ‘thousands’ division. Because it is also less than ten thousand we allow up to 1 fraction digit. The result is 10,321 / 1,000 = 10.321 (that is the number we're going to display in counter, after formatting it and appending ‘K’ for ‘thousands’).
To get the ‘plural ready’ number:
1. We check that the division is not less than one hundred (100); if it is, we return the number as is.
2. We take the division (which is 1,000) and divide it by 10 to get the ‘closest scale’ (100).
3. Then we divide our number, 10,321, by the closest scale and throw away the fraction (giving us a nice 103).
4. We then multiply the result by the closest scale, which gets us 10,300.
Then we simply use the ‘plural ready’ number to select the plural form when formatting a message, while a placeholder inside the plural placeholder holds the real display value (count in the examples below).
- {pluralReady, plural, one {{count} user} other {{count} users}} + { pluralReady: 10_300, count: '10.3K' } ⇒ 10.3K users.
- {pluralReady, plural, one {{count} пользователь} few {{count} пользователя} many {{count} пользователей} other {{count} пользователя}} + { pluralReady: 10_300, count: '10.3 тыс.' } ⇒ 10.3 тыс. пользователей.
- {count}ユーザー + { pluralReady: 10_300, count: '10.3K' } ⇒ 10.3Kユーザー (expected count to equal 1万)
- {count}사용자 + { pluralReady: 10_300, count: '10.3K' } ⇒ 10.3K사용자 (expected count to equal 1만)
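
The numbered steps above can be sketched as follows. This is only an illustration of the walk-through, not the actual Mastodon source; the function and constant names are assumptions, and only the ‘thousands’ and ‘millions’ divisions mentioned above are covered.

```typescript
// Constant and function names are assumptions made for this sketch.
const THOUSAND = 1_000;
const MILLION = 1_000_000;

// Picks the division for a non-negative count, following the 1000-based system.
function pickDivision(n: number): number {
  if (n < THOUSAND) return 1;
  if (n < MILLION) return THOUSAND;
  return MILLION;
}

// Returns the value to feed into plural selection.
function pluralReady(n: number, division: number): number {
  // Step 1: divisions below 100 mean the number is displayed as is.
  if (division < 100) return n;
  // Step 2: the closest scale is one tenth of the division.
  const closestScale = division / 10;
  // Steps 3 and 4: drop everything below the closest scale.
  // Counts are non-negative, so Math.floor equals truncation here.
  return Math.floor(n / closestScale) * closestScale;
}

const division = pickDivision(10_321);       // 1000
const counter = 10_321 / division;           // 10.321, still to be formatted
const forPlural = pluralReady(10_321, division);

console.log(division, counter, forPlural);   // 1000 10.321 10300
```

The divisor is hard-coded to powers of 1,000, which is exactly the part that breaks for the Japanese and Korean examples in the list above.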

I'll think about a solution to that in my free time, but right now I have no worthwhile ideas. In the worst case, we'll have to wait for someone to propose a solution in TC39.

The showstopper is that we need CLDR data in order to perform smart calculations. In this case we need the patterns and thresholds for compact notations (example for Korean, for Japanese, for Russian). Without this data we can't be sure how exactly to get those numbers: one of the steps in the CLDR guide basically tells you ‘from a threshold yeet a number of zeroes from the pattern and then you get a divisor’ (e.g. for threshold 10_000 you remove... no zeroes, because one zero is always skipped; but from 100_000 you remove 1 zero (pattern ‘00만’)).
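
As a rough sketch of that CLDR step: the divisor is the threshold with one zero removed for every zero in the pattern except the first. The helper name is hypothetical, and the two sample entries are written by hand from the description above (the ‘0만’ pattern for 10_000 is an assumption implied by ‘no zeroes removed’); real entries would be read from CLDR data.

```typescript
// Hypothetical helper: derives the divisor for one compact-notation entry.
function compactDivisor(threshold: number, pattern: string): number {
  // Count the '0' placeholders in the pattern, e.g. "00만" has two.
  const zeros = (pattern.match(/0/g) || []).length;
  if (zeros < 1) {
    throw new Error(`pattern "${pattern}" has no digit placeholder`);
  }
  // One zero is always skipped; each additional one removes a zero from the threshold.
  return threshold / Math.pow(10, zeros - 1);
}

// Hand-written sample entries, for illustration only.
const samples: Array<[number, string]> = [
  [10_000, "0만"],
  [100_000, "00만"],
];

for (const [threshold, pattern] of samples) {
  // Both entries yield the same divisor, 10000.
  console.log(threshold, pattern, compactDivisor(threshold, pattern));
}
```

Both thresholds share the divisor 10,000, so 12,345 and 123,456 would be scaled to 1.2345 and 12.3456 before being formatted and combined with the ‘만’ suffix.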

Even Twitter is prone to this issue:

It'd be ‘твита’ for 21,243 (matching ‘few’ per Russian plural rules), but it's ‘твитов’ for 21.2 thousand tweets (matching ‘many’ for the [visible] 21,200). It'd be ‘other’ for 21.2, which would result in ‘21.2 тыс. твита’, which reads just like the ‘few’ form and so is incorrect.

So yeah, that's the note. ‘thanks for coming to my TED talk [about how internationalisation can be a headache]’.

mashirozx · 2021-10-04T00:49:17Z

Another strange behaviour is that it displays something like 1.539k, which seems meaningless…

brawaru · 2021-10-06T15:58:54Z

https://codesandbox.io/s/pn5qp?file=/src/main.tsx

This is one hacky solution. It:

- forcefully replaces the browser-provided Intl.NumberFormat with a polyfill
- loads data for the required locales (the example loads all locales at once, but it's quite possible to load them individually on demand, as long as that happens after injecting Intl.NumberFormat and before actually using it)
- (ab)uses internal APIs of the polyfill implementation, like ComputeExponent and FormatNumericToString
- implements a CompactNumber component (similar to the ShortNumber we already have)

The prototype code and ‘design’ are a mess of course; I'm pretty sure the implementation can be made cleaner in several ways.

But yeah, here's that

brawaru · 2023-10-17T18:43:37Z

Since Mastodon has been updated to the newest versions of the FormatJS libraries, perhaps you can check out @vintl/compact-number, which I developed based on the solution above. It doesn't require a polyfill since I generate a slice of CLDR data manually, but it's still not the smallest library out there, although some stuff will be de-duplicated because it relies on FormatJS APIs.

I don't think a direct fix for this problem will land in browsers any time soon, so we can only resort to all sorts of hacks.

brawaru mentioned this issue Apr 1, 2022

Fix unusual number formatting in some locales #17929

Merged

vmstan added i18n Internationalization and localization suggestion Feature suggestion area/web interface Related to the Mastodon web interface labels Nov 16, 2023
