Evaluate CLDR as database for cultures #128

Closed
jzaefferer opened this Issue Oct 18, 2012 · 44 comments


@jzaefferer
Contributor

We're currently having various issues with the culture files generated from .NET (see label "cultures-bug"). We're considering moving to CLDR as the database for these files.

They're actually working on providing JSON files: http://unicode.org/cldr/trac/ticket/2733

We need to build a prototype to figure out if we can transform the CLDR data into something we can use here directly, or if we need to adapt the Globalize API to the CLDR data.

/cc @clarkbox @slexaxton @krinkle

@SlexAxton

I know they don't love keeping it working (they do, it just is a hassle), but Dojo has scripts to take a lot of the CLDR info and turn it into usable JSON data and functions:

http://bugs.dojotoolkit.org/browser/dojo/util/trunk/buildscripts/cldr

It has some hefty pre-reqs, but that's to be expected with XSLT and whatnot. I'd imagine you could do it a touch cleaner in node these days, but the core logic would be the same. I'm definitely looking into doing this, but I'd love for CLDR to just give us this stuff natively. :D

Adam Peller is the expert on this stuff and is/was active on that cldr ticket.

@jzaefferer
Contributor

Thanks Alex, appreciate your input! More puzzle pieces to work with.

@jzaefferer
Contributor

There's some activity on the cldr ticket: http://unicode.org/cldr/trac/ticket/2733#comment:22

Looks like for now our best bet is to just wait a few more weeks and hope for a release of those tools.

@jzaefferer
Contributor

Btw. Tim Wood of Moment.js had general interest in using CLDR, but outlined some issues here: moment/moment#315 - he now also closed that ticket, since he doesn't have time to work on it. If we ever want to add relative time functions to Globalize, Moment.js would be a good starting point.

@jzaefferer jzaefferer referenced this issue in moment/moment Jan 10, 2013
Closed

Explore CLDR for locale database #315

@jzaefferer
Contributor

There's a Ldml2JsonConverter in CLDR now: http://unicode.org/cldr/trac/changeset/7886 Need to try that out.

@jzaefferer
Contributor

Google's Closure library uses CLDR as the datasource. I haven't yet figured out how they import the data, but the data itself is in this file: http://docs.closure-library.googlecode.com/git/closure_goog_i18n_datetimesymbols.js.source.html

The equivalent of our format method is in DateTimeFormat: http://docs.closure-library.googlecode.com/git/class_goog_i18n_DateTimeFormat.html
Source, using the above DateTimeSymbols: http://docs.closure-library.googlecode.com/git/closure_goog_i18n_datetimeformat.js.source.html

The source lists the tokens it supports, which I can't find in the API document. The header says it's based on CLDR standards: "Datetime formatting functions following the pattern specification as defined in JDK, ICU and CLDR, with minor modification for typical usage in JS. Pattern specification: (Refer to JDK/ICU/CLDR)"

@jzaefferer
Contributor

Twitter published a JavaScript port of their Ruby CLDR wrapper 8 months ago: https://github.com/twitter/twitter-cldr-js

@jzaefferer
Contributor

Thanks @dilvie for those links. It looks like we could just load those files with JSON.parse, then output the properties we actually need. Using the full CLDR files seems like a bad idea, since they are pretty massive.

It's nice to see that there are properties for relative time formatting, e.g. "6 months ago". That's not something Globalize supports right now, but we could consider adding it.

@ericelliott

@jzaefferer Agreed. There's a lot of data that is not necessary most of the time. Maybe somebody could create a custom build script similar to http://projects.jga.me/jquery-builder/, so we get only the features we really need. That would actually be really nice in Globalize -- especially if all you need are string translations and number formatting. Or string translations and relative time (gentle nudge to anybody with more time than I can spare at the moment).

@scottgonzalez
Contributor

It would be nice if we could just filter, as opposed to reformatting, the data. This would allow anyone who has full CLDR already available to just use that instead of duplicating some data with the Globalize files. We'll have to see if using the existing structure would become awkward.

@andyearnshaw

Hey guys, I noticed this issue earlier when I was investigating how to convert the CLDR data to JSON format. You need the CLDR core.zip and tools.zip files, then you need to use the Ldml2JsonConverter in the tools to convert the data.

You can use what I have so far as an example, in the tools folder at https://github.com/andyearnshaw/Intl.js.

@ericelliott

@andyearnshaw Thanks for the tips!

@rxaviers
Member
rxaviers commented Jul 4, 2013

What about coverage? Does CLDR data cover all culture data needed by Globalize? @jzaefferer have you or anyone looked into it already?

@jzaefferer
Contributor

@rxaviers I don't think anyone did. Though Tim Wood, of moment.js, said that it lacks support for relative time, suggesting that everything else was there. We need to verify that either way.

@rxaviers
Member
rxaviers commented Jul 5, 2013

I started mapping the languages. There are 79 missing languages/cultures on CLDR (that are present on globalize). They are: https://gist.github.com/rxaviers/5933900#file-missing-globalize-cultures-in-cldr

First question: Are we ok dropping that?

Next step: content mapping.

PS: Note that there are languages supported by CLDR that we don't currently support. They are the green ones https://gist.github.com/rxaviers/5933900#file-globalize-vs-cldr-diff


Update: By the time I made this comment, I wasn't aware of LDML inheritance rules. So, my conclusion of 79 missing languages is not correct. See below comments for more accurate info.

@scottgonzalez
Contributor

Anything that's not in CLDR will be dropped. If someone wants us to support a new locale, they'll have to go through CLDR. We will no longer maintain our own data set.

For content mapping, we'll be filtering the data, but we cannot change the structure. We want to be fully compatible with the full data set, so if someone already has the JSON files available from somewhere else, they should be able to use those with Globalize.

@jzaefferer
Contributor

Yeah, now that CLDR has an official JSON format, we should build on top of that, and do the mapping internally.

Might make sense to pick one very simple formatting task and port that to CLDR, to see if we can just rename a few references or have to do heavy refactorings.

@andyearnshaw

You'll probably find the CLDR JSON to be a little bloated and require a
little extra processing (for example, dateTimeFormats), so you're better
off converting that format to the format you're using now.

As far as I can tell at a quick glance, Globalize would lose englishName
and nativeName from its JSON as those aren't included in CLDR.


@rxaviers
Member
rxaviers commented Jul 8, 2013

Anything that's not in CLDR will be dropped. If someone wants us to support a new locale, they'll have to go through CLDR. We will no longer maintain our own data set.

Great. So, language coverage is not an issue. My next comment is about the content.

@rxaviers
Member
rxaviers commented Jul 8, 2013

Just mapped the content. My findings follow.

Some definitions are mappable, some are not. What are we going to do with those?


Currency

This is implemented in a different way on CLDR. Current Globalize associates one currency symbol per culture/locale. CLDR doesn't. CLDR defines a list of currencies per country code (which is more accurate IMHO). The closest you get using CLDR is: (a) get a list of the territories of a language (by using supplemental.languageData.<lang>["@territories"]), (b) for each territory, get a list of its currencies (by using supplemental.currencyData.region.<territory>).

Note that I am not talking about number symbols (decimal symbol, group separator, plus and minus sign). These are defined on CLDR per locale (just like we do on Globalize).
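A minimal sketch of that two-step lookup in Python, assuming a tiny hand-written excerpt shaped like the supplemental paths named above. The values and the function name `currencies_for_language` are illustrative, not copied from CLDR; the real data carries more detail per currency.

```python
# Illustrative excerpt only: the nesting follows the paths named in the
# comment above, but the values are hand-written, not copied from CLDR.
supplemental = {
    "languageData": {
        "en": {"@territories": ["US", "GB"]},
    },
    "currencyData": {
        "region": {
            "US": ["USD"],
            "GB": ["GBP"],
        },
    },
}


def currencies_for_language(data, lang):
    """Return {territory: [currency codes]} for every territory of `lang`."""
    # Step (a): territories listed for the language.
    territories = data["languageData"].get(lang, {}).get("@territories", [])
    # Step (b): currencies listed for each of those territories.
    regions = data["currencyData"]["region"]
    return {territory: regions.get(territory, []) for territory in territories}


print(currencies_for_language(supplemental, "en"))
```

The result is a list of candidates per territory rather than a single symbol, which is the structural difference from the current one-symbol-per-culture model.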

Calendar

This is also implemented in a different way on CLDR. Current Globalize defines the calendar preference (calendars.standard), and the firstDay preference per locale.

Analogous to the currency above, CLDR defines the calendar preference and the firstDay preference per territory (not per locale) (by using respectively supplemental.calendarPreferenceData.<territory> and supplemental.weekData.firstDay.<territory>).

Each calendar's definitions (e.g. Gregorian's names of the days) are defined on CLDR per locale, like Globalize does. Some definitions are missing (e.g. separator of parts of a time). Some have mappings, but not a simple, straightforward one, e.g. eras.
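The per-territory firstDay lookup could be sketched like this in Python. The table is a hand-written stand-in for supplemental.weekData.firstDay, and falling back to the "001" (world) entry when a territory has no entry of its own is an assumption of this sketch.

```python
# Hand-written stand-in for supplemental.weekData.firstDay; values are
# illustrative and should be read from CLDR in practice.
WEEK_DATA_FIRST_DAY = {"001": "mon", "US": "sun"}


def first_day(territory, first_day_data=WEEK_DATA_FIRST_DAY):
    """Return the first day of the week for a territory code.

    Assumption: territories without their own entry use the "001" entry.
    """
    return first_day_data.get(territory, first_day_data["001"])


print(first_day("US"))
print(first_day("DE"))
```

The key is a territory, not a locale, so a caller holding only a language code first has to decide which territory applies.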

Number

Some definitions are missing, eg. negative pattern, decimals preference (except for currencies), and groupSizes.


The full mapping is here:
https://gist.github.com/rxaviers/5933850
The comment above each property has its corresponding CLDR map.

@andyearnshaw

Negative patterns aren't missing, they just aren't included where the format is just -. Excerpt from http://www.unicode.org/reports/tr35/tr35-numbers.html:

A pattern contains a positive and may contain a negative subpattern, for
example, "#,##0.00;(#,##0.00)". Each subpattern has a prefix, a numeric
part, and a suffix. If there is no explicit negative subpattern, the
negative subpattern is the localized minus sign prefixed to the positive
subpattern. That is, "0.00" alone is equivalent to "0.00;-0.00". If there
is an explicit negative subpattern, it serves only to specify the negative
prefix and suffix; the number of digits, minimal digits, and other
characteristics are ignored in the negative subpattern. That means that
"#,##0.0#;(#)" has precisely the same result as "#,##0.0#;(#,##0.0#)".
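The rule in that excerpt can be sketched in Python as follows. This is a simplified illustration, not a full TR35 pattern parser: it treats only `#`, `0`, `,` and `.` as the numeric part and ignores quoted literals; the function names are hypothetical.

```python
NUMERIC_CHARS = set("#0,.")  # simplification: other pattern symbols are ignored


def split_affixes(subpattern):
    """Split a subpattern into (prefix, numeric part, suffix)."""
    start = 0
    while start < len(subpattern) and subpattern[start] not in NUMERIC_CHARS:
        start += 1
    end = len(subpattern)
    while end > start and subpattern[end - 1] not in NUMERIC_CHARS:
        end -= 1
    return subpattern[:start], subpattern[start:end], subpattern[end:]


def resolve_subpatterns(pattern, minus_sign="-"):
    """Return (positive, negative) subpatterns for a number pattern."""
    positive, separator, negative = pattern.partition(";")
    if not separator:
        # No explicit negative subpattern: minus sign + positive subpattern.
        return positive, minus_sign + positive
    # An explicit negative subpattern only contributes its prefix and suffix;
    # the numeric part is always taken from the positive subpattern.
    _, numeric, _ = split_affixes(positive)
    prefix, _, suffix = split_affixes(negative)
    return positive, prefix + numeric + suffix


print(resolve_subpatterns("0.00"))
print(resolve_subpatterns("#,##0.0#;(#)"))
```

Under these assumptions a separate negativePattern property becomes redundant, since it can always be derived from the single pattern string.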

Also, the patterns define the group sizes:

The grouping separator is a character that separates clusters of integer digits to make large numbers more legible. It is commonly used for thousands, but in some locales it separates ten-thousands. The grouping size is the number of digits between the grouping separators, such as 3 for "100,000,000" or 4 for "1 0000 0000". There are actually two different grouping sizes: One used for the least significant integer digits, the primary grouping size, and one used for all others, the secondary grouping size. In most locales these are the same, but sometimes they are different. For example, if the primary grouping interval is 3, and the secondary is 2, then this corresponds to the pattern "#,##,##0", and the number 123456789 is formatted as "12,34,56,789". If a pattern contains multiple grouping separators, the interval between the last one and the end of the integer defines the primary grouping size, and the interval between the last two defines the secondary grouping size. All others are ignored, so "#,##,###,####" == "###,###,####" == "##,#,###,####".
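A small Python sketch of how the grouping sizes can be read off a pattern and applied to a string of digits. The function names are hypothetical and only the integer part of the positive subpattern is considered.

```python
def grouping_sizes(pattern):
    """Return (primary, secondary) grouping sizes, or (0, 0) if ungrouped."""
    positive = pattern.split(";", 1)[0]
    integer_part = positive.split(".", 1)[0]
    # Keep only digit placeholders and grouping separators.
    integer_part = "".join(ch for ch in integer_part if ch in "#0,")
    groups = integer_part.split(",")
    if len(groups) == 1:
        return 0, 0
    primary = len(groups[-1])
    # With a single separator the secondary size equals the primary size.
    secondary = len(groups[-2]) if len(groups) > 2 else primary
    return primary, secondary


def group_integer(digits, primary, secondary, separator=","):
    """Insert grouping separators into a string of integer digits."""
    if primary <= 0 or len(digits) <= primary:
        return digits
    head, tail = digits[:-primary], digits[-primary:]
    chunks = [tail]
    while head:
        chunks.insert(0, head[-secondary:])
        head = head[:-secondary]
    return separator.join(chunks)


primary, secondary = grouping_sizes("#,##,##0")
print(primary, secondary)
print(group_integer("123456789", primary, secondary))
```

This reproduces the "12,34,56,789" example from the excerpt, and the three patterns the excerpt calls equivalent all yield the same pair of sizes.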

I'd recommend a thorough read of the relevant tr35 sections because you will probably need the knowledge those documents provide, like where multiple inheritance is concerned (for example, with calendar properties).

@rxaviers
Member
rxaviers commented Jul 8, 2013

Great, thank you. So, two fewer missing maps.

@jzaefferer
Contributor

Regarding currency, the current implementation in Globalize has its limitations anyway. There's some background about that in #66.

We also have a bunch of issues labelled culture-bugs. Switching to CLDR should resolve, or at least help to resolve, those.

As Andy commented, negative pattern and groupSize are there, in the pattern, and the same seems to apply to the decimals field. So instead of having three properties for these, there's just one pattern that defines them all.

I suppose negativeInfinity can be inferred just like the negativePattern can be inferred, though there might actually be locales which have a definition for that.

Are there patterns for percent? If so, those should cover the three pattern, decimals and groupSizes properties, just like the regular number patterns.

There's a bunch more stuff in your gist where I don't have a reply. It seems like we can just drop a few things, like the AM/PM fields.

shall we map every single pattern below, or we are just going to stick with new CLDR patterns?

Stick with new CLDR patterns. It seems like we have to change so much anyway that this needs a 2.0 release. I don't think we'll be able to provide backwards compatibility, for example for the currency changes.

@rxaviers
Member

I have updated my gist based on Jörn's and Andy's comments. As it points out, for some areas of the Globalize code the update won't be a simple matter of getting the locale data from somewhere else; it's going to be a full refactoring. But across all of those updates, it seems we won't lose any features. Actually, CLDR seems more complete and better structured.

@jzaefferer
Contributor

Chatted with Rafael on IRC about this. He'll start with some prototyping to figure out what's needed to support those number or date patterns. We can then discuss API changes based on the prototyping results.

@williamkapke

The CLDR data is one of those things that requires us to shift our thinking- and for well vetted reasons.

I've been working a lot with the data- here are a few things I've noticed:

1) Language vs. Territory. A few versions back they moved language_irrelevant data into the supplemental files. For instance: currency is related to the territory and not to the language (Dollars are used in the US whether you speak Spanish, French, or English). This created a problem for frameworks where they don't know the territory; (eg: user supplies "en" for their culture- what is the currency?). To help with that issue, they have likelySubtags which "suggest" the best guess you should use... as determined by the CLDR committee based on feedback. Obviously this can't be perfect- I'm sure many folks in the UK are sad when they see "Dollars" because they were defaulted to en_Latn_US.

2) Inheritance. For those that don't know: CLDR uses a crazy custom multiple inheritance scheme with the XML data. Although the extracted JSON data represents a flattened view of that inheritance, inheritance shouldn't be forgotten. Their locale inheritance scheme (they call it truncation inheritance) still somewhat applies. For instance, there isn't an en_US JSON file because en_US.xml is (basically) empty: it inherits everything from en.xml, which inherits from root.xml. Frameworks still need to use CLDR's inheritance to get the user the correct data. @rxaviers's gist lists en-029 (Caribbean) as missing, whereas CLDR logic says "well, it depends on what you're looking for".

  • territoryContainment includes 029 as part of 019 (Americas) and 003 (North America) and 419 (Latin America and the Caribbean). These references may link data to 029.
  • Each language has a translation in //territory[@type="029"] for "Caribbean".
  • It is still "en" so it inherits from there.
  • ... so what does the Caribbean territory have that needs to override what it inherits?
    (I'm not personally asking this question- I'm just saying this is how CLDR approaches it).

If there is indeed something special about 029 that cannot be linked in other areas- then they will create a en_029.xml file.

If all that isn't enough; the documentation also has THIS:

The attribute validSubLocales allows sublocales in a given tree to be treated as though a file for them were present when there is not one.

NOTE: That example is not intended to "tear down" @rxaviers's data. I have no idea how many hoops he jumped through to compile that list; I just picked something to illustrate that CLDR is not straightforward.

Another big thing not to miss is /supplemental/parentLocales, which overrides the truncation inheritance. A good example is en_AU, which has its parent redirected to en_GB instead of "en". Last night I pushed the nodejs ldml module I created to query the XML files, for anyone who is curious or wants to know more.

3) Only 100% approved data. There is a ton of "draft" data in the XML files that is not included in the JSON. This is probably desired, but it means that with the JSON data any changes submitted to CLDR may not be 100% confirmed for a few versions (versions come out every 6 months, I believe). CLDR is in the standardization field and has the duty of analyzing the overall consequences of its changes before approving them. This delay for Globalize users is something to acknowledge.

4) Some data seems to be missing from the JSON. There are aa.xml, aa_DJ.xml, aa_ER.xml and aa_ET.xml files, but not even an aa.json file. I'm not sure why.
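The truncation inheritance from point 2, together with the parentLocales override, can be sketched in Python as below. The `PARENT_LOCALES` table holds only the en_AU example from the comment, and the `data` dictionary and function names are hypothetical; real values should be read from CLDR.

```python
# Hypothetical excerpt of /supplemental/parentLocales, using only the example
# given in the comment above.
PARENT_LOCALES = {"en_AU": "en_GB"}


def lookup_chain(locale, parent_locales=PARENT_LOCALES):
    """Return the list of locales to consult, most specific first."""
    chain = [locale]
    while locale != "root":
        if locale in parent_locales:
            # An explicit parent overrides truncation.
            locale = parent_locales[locale]
        elif "_" in locale:
            # Truncation inheritance: drop the last subtag.
            locale = locale.rsplit("_", 1)[0]
        else:
            locale = "root"
        chain.append(locale)
    return chain


def resolve(key, locale, data):
    """Return the first value found for `key` along the lookup chain."""
    for candidate in lookup_chain(locale):
        if key in data.get(candidate, {}):
            return data[candidate][key]
    return None


# Illustrative data: en_US defines nothing itself, so it falls through to en.
data = {"root": {"greeting": "?"}, "en": {"greeting": "hello"}}
print(lookup_chain("en_US"))
print(lookup_chain("en_AU"))
print(resolve("greeting", "en_US", data))
```

A missing en_US file is therefore not missing data: the lookup simply continues to the parent.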

Ok, so hopefully it's clear that the CLDR data is complicated if you are to use it correctly. It is this way because that is the nature of globalization. I'm very glad to see @jzaefferer has been evangelizing it. It is a good way to go, but will require refactors- maybe BIG ones to reap the benefits.

This last part is out of the scope of the jQuery Globalize project, but it's something to ponder:
The CLDR data comes from a mindset (and time) where the entire dataset is intended to be with the user. That's why physical devices have used CLDR for a long time. In the web world- it is WAY too much data to ship so frameworks break it out into chunks that are reasonable and only within the scope of their needs.

So, while it's cool to get all the data from the same source, the web as a whole could benefit if those chunks were framework independent so they could be shared. For example, if someone is using Moment and Globalize, they could share the raw data (from CLDR) so it doesn't need to be downloaded twice. Using package managers like Bower and Component, developers could pick and choose which languages they want to include.

Sorry so long. Hope it helps.

@rxaviers
Member

@williamwicks don't worry, you haven't, and thanks for your message. LDML has a more accurate definition. It distinguishes {language, region, and script} in a more realistic way, whereas current Globalize keeps these three definitions kind of fuzzy (inherited from the standards it had initially chosen to follow). So, I agree with you that CLDR is one of those things that requires us to shift our thinking. By the way, thanks for pointing out that the en-029 locale/culture isn't missing (and others probably aren't either), which is good: the initial number of cultures/locales was already satisfying, and in fact it is even bigger. Anyway...

Your module is part of the solution, and here's how I see it: https://github.com/rxaviers/globalize/wiki/Globalize-and-CLDR

Ping me on IRC Freenode @rxaviers.

@rxaviers
Member

Note: the Date Format Patterns are not equivalent http://www.unicode.org/reports/tr35/tr35-dates.html#Date_Format_Patterns

@rxaviers
Member

Does anyone know the difference between "short day" (E...EEE) and "short name" (EEEEEE) on Date_Field_Symbol_Table? What would each respective path be? (eg. dates.calendars.gregorian.days.format.short)

@scottgonzalez
Contributor

I would guess that E..EEE would result in Tue not Tues and be the abbreviated day width, while EEEEEE would be the short day width. If that's correct, the docs need to be updated. I'll contact the editor and see if we can get some help with our questions.
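Assuming that reading, and taking EEEE as wide and EEEEE as narrow from the same TR35 table, the mapping from a run of E symbols to a day width could be sketched in Python like this. The function names are hypothetical and the path layout follows the dates.calendars.gregorian.days.format example given above.

```python
def day_width(symbol_run):
    """Map a run of 'E' pattern symbols to a day-name width."""
    if not symbol_run or set(symbol_run) != {"E"}:
        raise ValueError("expected a run of 'E' symbols")
    count = len(symbol_run)
    if count <= 3:
        return "abbreviated"  # e.g. Tue
    widths = {4: "wide", 5: "narrow", 6: "short"}
    if count not in widths:
        raise ValueError("runs longer than six symbols are not defined")
    return widths[count]


def day_path(symbol_run):
    """Build the data path for the day names matching the symbol run."""
    return "dates.calendars.gregorian.days.format." + day_width(symbol_run)


print(day_path("EEE"))
print(day_path("EEEEEE"))
```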

@rxaviers
Member

Your suggestion makes sense, Scott. It's analogous to the order of era, year, and others.

@papandreou

Just stumbled upon this issue as part of figuring out why moment.js isn't already using ICU date formats :)

You can use https://github.com/papandreou/node-cldr to extract data from the CLDR XML files. It takes care of resolving the crazy inheritance scheme and just gives you the resolved chunks of data as JavaScript objects.

Ping me if you need assistance implementing it or if you need some CLDR data that it doesn't yet have an extraction method for.

@papandreou

My inter library has some helper functions for matching and adapting ICU formats: https://github.com/papandreou/inter/blob/master/lib/inter.js#L932-L981 -- could be useful.

And if you want to format date intervals according to CLDR's locale-specific rules (<greatestDifferences> and all that jazz), that's covered as well: https://github.com/papandreou/inter/blob/master/lib/inter.js#L748-L930

@rxaviers
Member
rxaviers commented Nov 1, 2013

@papandreou, our goal is to provide a set of tools that leverage the official CLDR JSON data. Check our in progress implementation on #172. We are using this library https://github.com/rxaviers/cldr to get help on CLDR data access.

That said, converting XML data into the official JSON bindings is something we delegate to the official CLDR JSON tool.

If you find any issues or want to help us on this process you are welcome.

@papandreou

@rxaviers Yeah, I understood that. But since the JSON data is still incomplete, I just wanted to note that I've written a tool that extracts the data you need from the XML files.

@rxaviers
Member
rxaviers commented Nov 1, 2013

JSON data is still incomplete

What's missing?

@papandreou

What's missing?

The current http://www.unicode.org/Public/cldr/24/json.zip only has 39 locales as opposed to 650+ in the XML data.

I don't know whether the data for the locales that are included is complete. It wasn't when I checked a few months ago.

@rxaviers
Member
rxaviers commented Nov 1, 2013

The available JSON data for download has the top 20 languages they (unicode.org CLDR staff) consider to be the "most used" languages. It contains the complete amount of data per language though. Also, they have been fully resolved.

You can use their official conversion tool (tools.zip) to generate the JSON representation of the languages not available in the ZIP. This ZIP contains a README with instructions on how to build the data. tools/scripts/CLDRWrapper may also be useful. Using the tool, you can opt to either generate resolved data, or unresolved to save space (or bandwidth) (-r false option of the conversion tool).

@papandreou

You can use their official conversion tool (tools.zip) to generate the JSON representation of the languages not available in the ZIP.

Oh, I missed that part. So json.zip is just a small sample of the real stuff. That's good news, I've been waiting for this to arrive :)

Thanks for the info.

@rxaviers
Member
rxaviers commented Nov 1, 2013

Yeap :). You are welcome. If you happen to find any flaws in the generated JSON, I would very much like to know too. So, please let us know.

@ragulka
ragulka commented Nov 18, 2013

@rxaviers do you have any insight on how to actually use the conversion tool? I tried to follow the readme, but for someone who is not experienced with Java, it's just too vague. I also posted the question on SO: http://stackoverflow.com/questions/20046099/how-to-build-json-data-from-cldr-data-using-the-java-conversion-tool

@rxaviers
Member

@ragulka I completely understand your pain. I had the exact same issue as you had. Their README's instructions are currently misleading and should be fixed according to http://unicode.org/cldr/trac/ticket/6726.

@rxaviers
Member

Closed by PR #172

@rxaviers rxaviers closed this Dec 17, 2013
@rxaviers rxaviers added this to the 1.0.0 milestone Mar 20, 2014
@ashensis ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
@parndt parndt Fixes #128 by adding all locale attributes to attr_accessible.
Also by not using mass assignment in set_translations.
This also cleans up after #150 which was the first step in the right direction
9cb6c08
@ashensis ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
@parndt parndt Add :locale to attr_accessible list.
For #128
fc4e110
@ashensis ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
@parndt parndt Merge branch 'attr_accessible_locale'
Fixes #141
Fixes #128
34c5163