CLDR data is reported as missing when locale is set with a region #357

Closed
arbrown opened this Issue Dec 5, 2014 · 26 comments

8 participants
@arbrown

arbrown commented Dec 5, 2014

When I try to use Globalize with a language-region combo (like en-US), it always gives me the error:
Error: E_MISSING_CLDR: Missing required CLDR content 'main/en/dates/calendars/gregorian/dateTimeFormats/short'.

The basic code is as follows:

Globalize.load({
  "main": {
    "en-US": {
      ... // content omitted
    }
  }
});
Globalize.locale('en-US');
var formatter = Globalize.dateFormatter({datetime: 'short'});

At that point, Globalize always throws the error. However, if I provide it with CLDR data that is not region-dependent (like just plain 'en') and set my locale to the same, the code works. In both cases, it looks for CLDR data under just the language code, ignoring the region in the CLDR path. I have tested de-DE and pt-BR and see the same results in those languages: it only looks for the language's CLDR data and ignores the language-REGION data that was provided.

I saw an issue on cldrjs that implied that the issue could be solved by using just en instead of en-US (which I am doing as a stopgap) but this same issue occurs for any region specific locale, and just using the language won't always be the right thing to do.

I'm not sure if this issue is better here or on cldrjs. Let me know if I need to move it, or provide more information.

@rxaviers

Member

rxaviers commented Dec 8, 2014

Hi @arbrown, your statement is not precisely correct, although I agree that such behavior goes against common sense.

Globalize is clever enough to figure out the succinct form of any locale, called languageId in the table below.

locale  languageId  maxLanguageId  language  script  region
en      en          en-Latn-US     en        Latn    US
en-US   en          en-Latn-US     en        Latn    US
de      de          de-Latn-DE     de        Latn    DE
zh      zh          zh-Hans-CN     zh        Hans    CN
zh-TW   zh-TW       zh-Hant-TW     zh        Hant    TW
ar      ar          ar-Arab-EG     ar        Arab    EG
pt      pt          pt-Latn-BR     pt        Latn    BR
pt-BR   pt          pt-Latn-BR     pt        Latn    BR
pt-PT   pt-PT       pt-Latn-PT     pt        Latn    PT
es      es          es-Latn-ES     es        Latn    ES
es-AR   es-AR       es-Latn-AR     es        Latn    AR

You should load CLDR JSON files using the locale in its succinct form. Therefore, you should load en when you want any of en, en-US, en-Latn, or en-Latn-US (they are all the same). But you should load en-GB when you want English as spoken in England, or en-IN when you want English as spoken in India.

You can figure out what the succinct form is via [1] or [2].

  1. By looking at the error message. https://gist.github.com/rxaviers/aeb955c22c51d9c172a7
  2. By looking at a Globalize variable. https://gist.github.com/rxaviers/bb143a6715d1392ecc96

A few more details

Globalize (via cldrjs) figures out the correct languageId, which is used when traversing CLDR paths. As explained in the issue you've referenced (rxaviers/cldrjs#17 (comment)), it looks up the path using main/{languageId}/..., and languageId is always in the succinct form, obtained by removing the likely subtags from maxLanguageId according to the specification.
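
To make the "remove likely subtags" step concrete, here is a minimal sketch of how the succinct form in the table above can be derived. It is not cldrjs's actual code: the function names (`parseTag`, `joinTag`, `addLikelySubtags`, `removeLikelySubtags`) are made up for illustration, the `likelySubtags` object is a tiny hand-picked subset of the CLDR supplemental data, and only tags of the form language, language-Script, language-REGION, or language-Script-REGION are handled.

```javascript
// Illustrative subset of CLDR's supplemental likelySubtags data (assumption:
// just enough entries to reproduce the table above).
const likelySubtags = {
  "en": "en-Latn-US",
  "de": "de-Latn-DE",
  "zh": "zh-Hans-CN",
  "zh-TW": "zh-Hant-TW",
  "ar": "ar-Arab-EG",
  "pt": "pt-Latn-BR",
  "es": "es-Latn-ES"
};

// Split a simple tag into language, script (4 letters) and region (2 letters or 3 digits).
function parseTag(tag) {
  const parts = tag.split("-");
  const result = { language: parts[0], script: undefined, region: undefined };
  for (const part of parts.slice(1)) {
    if (/^[A-Za-z]{4}$/.test(part)) result.script = part;
    else if (/^([A-Za-z]{2}|\d{3})$/.test(part)) result.region = part;
  }
  return result;
}

// Join the parts back together, skipping the missing ones.
function joinTag({ language, script, region }) {
  return [language, script, region].filter(Boolean).join("-");
}

// Fill in the missing script/region from the most specific matching entry.
function addLikelySubtags(tag) {
  const { language, script, region } = parseTag(tag);
  const candidates = [
    script && region ? joinTag({ language, script, region }) : null,
    region ? joinTag({ language, region }) : null,
    script ? joinTag({ language, script }) : null,
    language
  ].filter(Boolean);
  for (const key of candidates) {
    const match = likelySubtags[key];
    if (match) {
      const likely = parseTag(match);
      return joinTag({
        language,
        script: script || likely.script,
        region: region || likely.region
      });
    }
  }
  return tag; // unknown language: leave the tag untouched
}

// Return the shortest tag that maximizes to the same result as the input.
function removeLikelySubtags(tag) {
  const max = addLikelySubtags(tag);
  const { language, script, region } = parseTag(max);
  const trials = [
    language,
    joinTag({ language, region }),
    joinTag({ language, script })
  ];
  for (const trial of trials) {
    if (addLikelySubtags(trial) === max) return trial;
  }
  return max;
}

for (const locale of ["en-US", "pt-BR", "pt-PT", "zh-TW"]) {
  console.log(locale, "->", removeLikelySubtags(locale));
}
// en-US -> en
// pt-BR -> pt
// pt-PT -> pt-PT
// zh-TW -> zh-TW
```

This is why en-US collapses to en while pt-PT stays as it is: en alone already maximizes to en-Latn-US, but pt alone maximizes to pt-Latn-BR, so the region in pt-PT carries information.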

To-Do

Improving documentation for clarity is an obvious step that could be taken.

  • Improve documentation for clarity.

I'm open to hearing any other suggestions.

@rxaviers rxaviers added the docs label Dec 8, 2014

@scottgonzalez

Contributor

scottgonzalez commented Dec 8, 2014

This will be a huge stumbling block for users. While docs will help, I don't think it's the right solution.

It seems we could take two different approaches.

  1. Check the provided locale first, and if not found, then check the succinct form.
  2. Make Globalize.load() smarter so that it will normalize the data for you. For example, if you load en-US and there is no en locale defined, then store the data in en instead.

Option 1 seems faster and safer. What are your thoughts about that @rxaviers?
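
For illustration, option 1 could look roughly like the sketch below. `pickBundle` is a hypothetical helper, not a Globalize or cldrjs API, and it assumes the succinct form of the locale has already been computed elsewhere.

```javascript
// Hypothetical helper: decide which key under "main" to read from.
// "main" is the object passed as the "main" property to Globalize.load().
function pickBundle(main, requestedLocale, succinctLocale) {
  const has = (key) => Object.prototype.hasOwnProperty.call(main, key);
  if (has(requestedLocale)) return requestedLocale; // exact match wins
  if (has(succinctLocale)) return succinctLocale;   // otherwise try the succinct form
  return null;                                      // nothing loaded for this locale
}

const onlyRegionLoaded = { "en-US": { dates: {} } };
const onlyLanguageLoaded = { "en": { dates: {} } };

console.log(pickBundle(onlyRegionLoaded, "en-US", "en"));   // en-US
console.log(pickBundle(onlyLanguageLoaded, "en-US", "en")); // en
console.log(pickBundle(onlyLanguageLoaded, "de-DE", "de")); // null
```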

@jzaefferer

Contributor

jzaefferer commented Dec 8, 2014

> Check the provided locale first, and if not found, then check the succinct form.

That could happen when calling Globalize.locale. That might also be the right place to throw an error, if a locale is specified that later won't match any data.

@arbrown

arbrown commented Dec 8, 2014

Improved documentation could help, but I agree with @scottgonzalez that this will be a huge stumbling block. Previously, I was checking a request's accept-language header for their preferred language and loading cldr data based on that. I have cldr data for en-US and I set Globalize's locale to en-US, so I expect it to work. Even if en and en-US have the exact same data, I see no reason why you need to change to the succinct form.

Currently, I would need to implement logic in my program to change a user's accept-language string when it is one of the entries in the above table, and while I could do that, I don't think future users should need to.

I think both @scottgonzalez and @jzaefferer have very good suggestions. At the very least, Globalize should warn you if you set your locale to a long form like pt-BR or en-US when you didn't need to, but I think it should automatically store the data in the right place, or just use the full path when specified. I would prefer the latter since there is no guarantee that in the future en-US will always be the same as en, pt-BR as pt, etc...

@rxaviers

Member

rxaviers commented Dec 8, 2014

Quick comment.

I do agree users (developers) shouldn't have to bother about it. This should be transparent for them. So, we ought to think of a solution.

About the solution, I would not encourage a workaround here. Either our implementation (which follows UTS#35, aka CLDR) or UTS#35 itself is not correct (and should be properly fixed). I'm pinging more people to figure out what's wrong and will get this issue updated accordingly. Basically, either rxaviers/cldrjs#17 or the docs must be fixed.

@srl295

srl295 commented Dec 9, 2014

@rxaviers maybe it's a good discussion for cldr-users, but I don't read the CLDR docs as saying that "removing likely subtags" is recommended in all, or most, cases. Edit: and with fully resolved data this could mean duplicate data. So maybe we need to discuss this further.

@rxaviers rxaviers removed the docs label Dec 9, 2014

@rxaviers

Member

rxaviers commented Dec 9, 2014

@srl295 thanks for your comment (here and in the cldrjs issue as well). In which cases is "removing likely subtags" recommended? By the way, I hope I'm making an implementation mistake, because this seems too obvious to be a specification problem.

@srl295

srl295 commented Dec 9, 2014

@rxaviers - I would think it could be valid or recommended in the case of "what language tag is this data valid for?" But it is not recommended before lookup. It'd be better to add tags before lookup, if anything, but it depends on what you are looking up. I recommend bringing it up on cldr-users. Edit: I don't see where it is specified.

//cc:@slaneyrw

@patch

Contributor

patch commented Dec 10, 2014

The way I'm used to working with CLDR-based libraries is that I provide any locale and, whether it's supported or not, it works because of locale fallback, even if it has to fall back to root. At my work a user could have any combination of 20 supported languages and 250 different countries, which means we have up to 5,000 different locale codes in use. Most of those locales are not defined and would fall back, such as ja-UK to ja and pl-MX to pl, but we pass them through anyway, and if a new locale is defined in the future, we'll automatically take advantage of it. That's how I implemented CLDR::Number, and it even accepts undefined language codes like xx, falling back to root, because having root formatting is better than no formatting. Plus, a new language could (and will) be added to the CLDR in the future, or a new language could be added to your system while you're stuck with an old version of the CLDR.
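
A minimal sketch of that kind of fallback, assuming plain truncation of subtags with root as the last resort. `fallbackChain` and `resolveLocale` are hypothetical names; this is neither CLDR::Number's nor Globalize's implementation, and real CLDR inheritance also consults parentLocales, which this ignores.

```javascript
// Build the list of candidates by dropping one trailing subtag at a time,
// ending with "root".
function fallbackChain(locale) {
  const parts = locale.split("-");
  const chain = [];
  while (parts.length > 0) {
    chain.push(parts.join("-"));
    parts.pop();
  }
  chain.push("root");
  return chain;
}

// Return the first candidate for which data is available (undefined if none).
function resolveLocale(available, locale) {
  return fallbackChain(locale).find((candidate) => available.has(candidate));
}

const available = new Set(["root", "ja", "pl", "en", "en-GB"]);

console.log(fallbackChain("ja-UK"));            // [ 'ja-UK', 'ja', 'root' ]
console.log(resolveLocale(available, "ja-UK")); // ja
console.log(resolveLocale(available, "pl-MX")); // pl
console.log(resolveLocale(available, "en-GB")); // en-GB
console.log(resolveLocale(available, "xx"));    // root
```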

@rxaviers

Member

rxaviers commented Dec 10, 2014

@srl295 let's give it some time for a couple more answers on rxaviers/cldrjs#17, and if we can't figure it out, I'll go to cldr-users as you suggest. Thanks so far.

@michael-fischer

michael-fischer commented Jan 8, 2015

@rxaviers, It doesn't look like you got any more answers on rxaviers/cldrjs#17 so I thought I would weigh in with thoughts about a particular use case.

If Globalize is going to run within a web page, there is a high likelihood that the browser settings will be considered when deciding which language to load. This will most likely come in as a two-part locale (e.g. en-US or es-BO). At least, it seems that way in the browsers I have checked. It seems odd to me that if the browser is set to "en-US" that I would have to use var globalize = Globalize("en") and likewise key the JSON data off of "en". It is most natural to assume that "en-US" would be used. This nuance of CLDR is not something that the consumer of Globalize should need to know.

This is especially so since the browser locale settings have to be provided dynamically through server-side processing of the request variables. Right now, since all of my data is stored based on culture, I have to manually tweak both the server side and the client side to adapt for this [perceived] idiosyncrasy. I already felt bad coercing some locales into another (i.e. any two-part locale that wasn't es-ES into es-419), but I could attribute that to a domain decision and isolate it to the business logic layer on the server, its use being completely transparent on the client [unless someone was purposely abusing the system].

Let me know which way you think you are going to go [or if I am completely out in left field]. Thanks.

Totally irrelevant side note. At least in my use case en-US is not the exception. All of the languages we are trying to support use the root with the exception of en-GB.

@rxaviers

Member

rxaviers commented Jan 9, 2015

@michael-fischer, your thoughts and contributions are welcome.

Regarding rxaviers/cldrjs#17, although the issue itself is quiet, a lot has happened on the cldr-users mailing list. So, I've just updated the issue accordingly, and I'm still waiting for a couple more answers there.

Regarding your proposal,

> if the browser is set to "en-US" that I would have to use var globalize = Globalize("en").

You don't. You can use var globalize = Globalize( "en-US" ) just fine. What happens is that Globalize( "en-US" ), Globalize( "en" ), and Globalize( "en-Latn-US" ) will all look for the en bundle (according to globalize.cldr.attributes.languageId). To give another example, both Globalize( "en-GB" ) and Globalize( "en-Latn-GB" ) will look for the en-GB bundle. Note this is how it works now, and it may eventually change after rxaviers/cldrjs#17 is closed.

Before going too deep here, what's your use case? I understood your client-side detects the user's browser locale and then fetches the needed CLDR data dynamically on your server. Is that the case? I also assume your problem is that you're willing to fetch "cldr/main/en-US/ca-gregorian.json" instead of "cldr/main/" + globalize.cldr.attributes.languageId + "/ca-gregorian.json". Is that correct? If so, I understand your pain. Otherwise, please could you clarify?

> Totally irrelevant side note. At least in my use case en-US is not the exception. All of the languages we are trying to support use the root with the exception of en-GB.

Please, could you clarify?

@michael-fischer

michael-fischer commented Jan 9, 2015

@rxaviers,

Short version:
I guess my thought boils down to this: if you use the Language-Region, your message table still needs to be keyed off the language. So even if you use Globalize( "de-DE" ), you still need to have the language in your JSON.

Longer version:

> You don't. You can use var globalize = Globalize( "de-DE" ) just fine.

I concede this point, as using the Language subtag was my way of working around the problem that I was really trying to solve, namely the disconnect when using Language-Region with loadMessages. I assume globalize.cldr.attributes.languageId is used internally, so the loaded JSON can't use Language-Region. Or at least I have only had success with:

{
    "root": {
        "actions": "Actions",
        "activityNotes": "Activity Notes",
        ...
    },
    "de": {
        "actions": "Aktionen",
        "activityNotes": "Aktivitätshinweise",
        ...
    }
}

as opposed to

{
    "root": {
        "actions": "Actions",
        "activityNotes": "Activity Notes",
        ...
    },
    "de-DE": {
        "actions": "Aktionen",
        "activityNotes": "Aktivitätshinweise",
        ...
    }
}

Note, I changed your example to "de-DE" from "en-US" since en-US gets loaded into "root" as a fallback and therefore just works.

> Before going too deep here, what's your use case? I understood your client-side detects the user's browser locale and then fetches the needed CLDR data dynamically on your server. Is that the case? I also assume your problem is that you're willing to fetch "cldr/main/en-US/ca-gregorian.json" instead of "cldr/main/" + globalize.cldr.attributes.languageId + "/ca-gregorian.json". Is that correct? If so, I understand your pain. Otherwise, please could you clarify?

My use case is slightly more complicated than that, but we can boil it down to this: we use the browser's locale. I actually build what to download using a string similar to that; I just don't get the value from globalize.cldr.attributes.languageId. However, doesn't your example have a chicken-and-egg problem? Is it possible to initialize Globalize prior to fetching the CLDR data? In any case, I could do something like the following demo code:

var culture = "de-DE";   // hard coded for example only.
var globalize = Globalize(culture);
var language = globalize.cldr.attributes.languageId;

var jsonData = $.ajax({
    url: "json/DB2-" + culture + ".json",
    async: false,
    cache: false,
    dataType: 'json'
}).responseText;

var strings = JSON.parse(jsonData);

// Hack the language key here.
if (strings.hasOwnProperty(culture) && language !== culture) {
    strings[language] = strings[culture];
    delete strings[culture];
}

Globalize.loadMessages(strings);

This would make the hack local to the client, where it deals with CLDR, rather than something sprinkled through the system. It is still something that I shouldn't need to do, but it is an acceptable workaround.
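
The key-renaming step of the demo above can also be pulled out into a small pure function, which makes it easy to check without a browser or a server. This is only a sketch: `rekeyMessages` is a hypothetical name, and the `languageId` argument is assumed to come from globalize.cldr.attributes.languageId as in the demo.

```javascript
// Return a copy of "strings" in which the bundle stored under "culture"
// (e.g. "de-DE") is stored under "languageId" (e.g. "de") instead.
// Note: an existing "languageId" bundle would be overwritten.
function rekeyMessages(strings, culture, languageId) {
  const hasCulture = Object.prototype.hasOwnProperty.call(strings, culture);
  if (!hasCulture || culture === languageId) {
    return strings; // nothing to rename
  }
  const { [culture]: bundle, ...rest } = strings;
  return { ...rest, [languageId]: bundle };
}

const loaded = {
  "root": { "actions": "Actions" },
  "de-DE": { "actions": "Aktionen" }
};

console.log(Object.keys(rekeyMessages(loaded, "de-DE", "de"))); // [ 'root', 'de' ]
```

The result can then be handed to Globalize.loadMessages() exactly as in the demo.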

@srl295

srl295 commented Jan 10, 2015

@rxaviers still not sure I follow the issue re: CLDR data. Perhaps we should chat sometime.

@rxaviers

Member

rxaviers commented Jan 12, 2015

Sure, I've sent you an email.

@rxaviers

Member

rxaviers commented Jan 14, 2015

😂 You're hacking the likelySubtags... Good job, but you cannot use Zzzz or ZZ as subtags in its entries (as far as I know). That said, this is beside the point: you could use (for example) "en": "en-Latn-FR" and your example would be valid and work just fine for your case. Note that likelySubtags informs Globalize which locales have default content [1]. Default content means that the child content is all in the parent. For example, en's default content is en-Latn-US (unless "hacked"), which means en-Latn-US (or en-US) must inherit everything from en. But if you tell Globalize that en's default content is actually en-Latn-FR (France), Globalize will expect different data in en-US and will therefore use those different messages.

In summary, if you know what you're doing, go ahead. But I'd first ask myself why my application really needs a different message for en-US than for en. All en-US messages should be in en. Other English-speaking countries (e.g. en-GB) should specify the overriding messages (again, unless you really know what you're doing).

1: "The likelySubtag supplemental data provides default information" (ref).
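
As a rough illustration of the inheritance idea above (shared messages live in en, and en-GB only overrides what differs), here is a sketch that walks an explicit parent chain. `findMessage` is a hypothetical helper and the chains are written out by hand; this is not how Globalize resolves messages internally.

```javascript
// Look a key up in each bundle of the chain, most specific first.
function findMessage(messages, chain, key) {
  for (const bundle of chain) {
    const table = messages[bundle];
    if (table && Object.prototype.hasOwnProperty.call(table, key)) {
      return table[key];
    }
  }
  return undefined; // key not defined anywhere in the chain
}

const messages = {
  "root": { "bye": "Bye" },
  "en": { "hello": "Hello", "color": "Color" },
  "en-GB": { "color": "Colour" } // only the override is listed here
};

// en-US is the default content of en, so it uses the same chain as en.
const enUsChain = ["en", "root"];
const enGbChain = ["en-GB", "en", "root"];

console.log(findMessage(messages, enUsChain, "color")); // Color
console.log(findMessage(messages, enGbChain, "color")); // Colour
console.log(findMessage(messages, enGbChain, "hello")); // Hello
console.log(findMessage(messages, enGbChain, "bye"));   // Bye
```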


EDIT: My response was to @gkindel, who has deleted his comment.

@rxaviers

Member

rxaviers commented Jan 14, 2015

Update: so far, the most appealing solution for this issue (and rxaviers/cldrjs#17) is the "A simple & bulletproof solution for bundle lookup" described in Bundle Lookup thoughts, which is basically @scottgonzalez's 2nd approach. It's still not final, though, and the conversation is still going on.

@gkindel

gkindel commented Jan 15, 2015

I redacted my earlier comment because I realized it was silly, but I have gotten along using parentLocales. My main goal was to have my en-US strings actually labelled as such. Who knows, maybe population data will change and some other culture will become the dominant 'en'. I don't like the fact that my strings file HAS to coordinate with the likelySubtags file. With the solution below, "en" and "en-GB" inherit from "en-US".

    Globalize.load({
        "supplemental": {
            "parentLocales": {
                "parentLocale": {
                    "en" : "en-US"
                }
            },
            "likelySubtags": {
                "de": "de",
                "en": "en",
                "fr": "fr"
            }
        }
    });

    Globalize.loadMessages({
        "en-US": {
            hello : "'Sup",
            world : "world"
        },
        "en-GB": {
            hello: "Cheerio"
        },
        "fr": {
            hello: "Bonjour",
            world : "tout le monde"
        },
        "de": {
            hello: "Guten Tag",
            world : "welt"
        }
    });
@rxaviers

Member

rxaviers commented Jan 15, 2015

@gkindel this is wrong. Feel free to ping on IRC if you'd like to chat about it.

@gkindel

gkindel commented Jan 15, 2015

"Wrong" is pretty severe and nonspecific. I realize this is still alpha and I'm muddling along with the documentation that is available, and will look for any updates. And of course I appreciate the tremendous amount of effort you are all putting into an open source cause. I don't expect hand holding, but I can surface some end-user confusion on these matters. A link to docs, further reading, or keywords to google would be helpful.

As an end user, I don't actually understand why likelySubtags is a hard dependency at all, or why it forces a less specific naming convention for my messages JSON. My impression was that it exists to resolve ambiguity, not introduce it. It seems that it should do nothing for the message formatter if I set a locale of "en-US", but should step in and help with a locale of "en" or "en-foo".

The code may be hacked, but it works to allow me to specify 'en-US' strings and have them be accessible via 'en-US' or 'en' without throwing an error, which seemed to match the problem that @arbrown was having.

@rxaviers

Member

rxaviers commented Jan 16, 2015

@gkindel, the data you used for likelySubtags and parentLocales is wrong. It may work for your specific example, but it will cause other problems. More info about likelySubtags and parentLocales can be found at http://www.unicode.org/reports/tr35/tr35.html#Likely_Subtags and http://www.unicode.org/reports/tr35/tr35.html#Parent_Locales, respectively.

Having said that, you're very welcome to help and thanks for your comments so far. But, the solution for this issue is going in this path: https://docs.google.com/document/d/1qZwEVb4kfODi2TK5f4x15FYWj5rJRijXmSIg5m6OH8s/edit and https://docs.google.com/document/d/1qLbuz659VvCVhgyd08KRP0SMuqCvK9bSS3-0W-kMuuw/edit#heading=h.irqx9cg2caef

@rxaviers

Member

rxaviers commented Jan 19, 2015

Hello everyone,

rxaviers/cldrjs#26 is being considered as a solution to this issue. More details and examples can be found at https://github.com/rxaviers/cldrjs/blob/fix-17/doc/bundle_lookup_matcher.md.

Feel free to review it and add your comments. I would very much appreciate your feedback.


In other words, the original example of this issue will work just fine.

Globalize.load({
  "main": {
    "en-US": {
      ... // content omitted
    }
  }
});
Globalize.locale('en-US');
var formatter = Globalize.dateFormatter({datetime: 'short'});

The same is also valid for messages. For example:

Globalize.loadMessages({
    "en-US": {
        hello : "'Sup"
    }
});

Globalize( "en-US" ).formatMessage( "hello" ); // Sup
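
One way to picture a bundle lookup that makes both examples work is to index the loaded bundles by their succinct form and resolve every requested locale through that index. The sketch below is only an illustration of that idea, not the code in rxaviers/cldrjs#26: `buildBundleMap` and `lookupBundle` are hypothetical names, and `minimize` is a stand-in for "remove likely subtags" that only knows the few tags used here.

```javascript
// Map the succinct form of each loaded bundle to the bundle's own name.
function buildBundleMap(availableBundles, minimize) {
  const map = new Map();
  for (const bundle of availableBundles) {
    map.set(minimize(bundle), bundle);
  }
  return map;
}

// Resolve a requested locale to the name of a loaded bundle, or null.
function lookupBundle(bundleMap, locale, minimize) {
  const bundle = bundleMap.get(minimize(locale));
  return bundle === undefined ? null : bundle;
}

// Stand-in for "remove likely subtags" (assumption: covers only these tags).
const succinctForms = { "en-US": "en", "en-Latn-US": "en", "pt-BR": "pt" };
const minimize = (tag) => succinctForms[tag] || tag;

// Data was loaded under "en-US" and "pt".
const bundleMap = buildBundleMap(["en-US", "pt"], minimize);

console.log(lookupBundle(bundleMap, "en-US", minimize)); // en-US
console.log(lookupBundle(bundleMap, "en", minimize));    // en-US
console.log(lookupBundle(bundleMap, "pt-BR", minimize)); // pt
console.log(lookupBundle(bundleMap, "de", minimize));    // null
```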

PS: @gkindel has added a comment about formatting messages using inheritance, which currently works and will still work after this fix. However, his example uses different data in en and en-US (the default content for en), which is fundamentally wrong. This particular problem could be discussed via IRC or by creating a separate issue.

@michael-fischer

michael-fischer commented Jan 20, 2015

@rxaviers, this is good to know. Thank you. I still think that it seems odd that loading en and en-US would change the bundle ID of en-US but hopefully that won't happen much and won't cause consternation moving forward.

@rxaviers

Member

rxaviers commented Jan 20, 2015

Great.

> I still think that it seems odd that loading en and en-US would change the bundle ID of en-US.

Can you think of any problem this could cause?

@rxaviers

Member

rxaviers commented Jan 20, 2015

I've created PR #384 to address this issue. The interesting unit test is this one. It tests all cases listed as problematic by Mark Davis' Fixing Inheritance document and all cases listed here. All of them pass.

@rxaviers rxaviers closed this in 725e09f Jan 22, 2015

rxaviers added a commit that referenced this issue Mar 4, 2015

Documentation: Messages inheritance update
- Define `de`, `en`, `en-GB`, `fr`, and `pt-PT` with an empty set.
- Update bundle inheritance chain of `en-GB`.

Fixes #408
Ref #357

ashensis pushed a commit to ashensis/globalize that referenced this issue Mar 17, 2016
