I18N isn't working (mostly) via changing the environment variables #783

Closed
brlin-tw opened this issue Nov 29, 2018 · 7 comments

@brlin-tw
Contributor

brlin-tw commented Nov 29, 2018

I noticed that currently (5.6.0-64-gc6f1946) tidy isn't printing localized messages even when I change the environment variables according to tidy-html5/localize at next · htacg/tidy-html5:

export LANG=fr_FR
export LC_ALL=fr_FR

Digging into the code, I found the setlocale call in tidy-html5/tidylib.c at 86b52dc1081ca4b0582c7bad279bf254bad268e1 · htacg/tidy-html5:

    /* Set the locale for tidy's output. This both configures
    ** LibTidy to use the environment's locale as well as the
    ** standard library.
    */
#if SUPPORT_LOCALIZATIONS
    if ( TY_(tidyGetLanguageSetByUser)() == no )
    {
        TY_(tidySetLanguage)( setlocale( LC_ALL, "") );
    }
#endif

I wrote a small program to test it:

#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

int main(void){
	/* An empty locale string asks setlocale() to use the environment. */
	char* result = setlocale(LC_ALL, "");

	if(result != NULL){
		/* Quote the value so a composite name is easy to spot. */
		printf("setlocale(LC_ALL, \"\") returns \"%s\"\n", result);
	}else{
		printf("setlocale(LC_ALL, \"\") returns NULL.\n");
	}
	return EXIT_SUCCESS;
}

The following is my locale:

LANG=zh_TW.UTF-8
LANGUAGE=zh_TW:zh_HK:zh_CN:en_US:en
LC_CTYPE="zh_TW.UTF-8"
LC_NUMERIC=zh_TW.UTF-8
LC_TIME=zh_TW.UTF-8
LC_COLLATE="zh_TW.UTF-8"
LC_MONETARY=zh_TW.UTF-8
LC_MESSAGES="zh_TW.UTF-8"
LC_PAPER=zh_TW.UTF-8
LC_NAME=zh_TW.UTF-8
LC_ADDRESS=zh_TW.UTF-8
LC_TELEPHONE=zh_TW.UTF-8
LC_MEASUREMENT=zh_TW.UTF-8
LC_IDENTIFICATION=zh_TW.UTF-8
LC_ALL=

And the locale definitions compiled on my system:

$ cat /var/lib/locales/supported.d/*
en_US.UTF-8 UTF-8
en_GB.UTF-8 UTF-8
fr_FR.UTF-8 UTF-8

zh_CN.UTF-8 UTF-8
zh_TW.UTF-8 UTF-8
zh_HK.UTF-8 UTF-8

Here's the test results:

$ ./test_setlocale 
setlocale(LC_ALL, "") returns "zh_TW.UTF-8"

$ env LANG=zh_TW.UTF-8 ./test_setlocale 
setlocale(LC_ALL, "") returns "zh_TW.UTF-8"

$ env LANG=fr_FR ./test_setlocale 
setlocale(LC_ALL, "") returns NULL.

$ env LANG=fr_FR.UTF-8 ./test_setlocale 
setlocale(LC_ALL, "") returns "LC_CTYPE=fr_FR.UTF-8;LC_NUMERIC=zh_TW.UTF-8;LC_TIME=zh_TW.UTF-8;LC_COLLATE=fr_FR.UTF-8;LC_MONETARY=zh_TW.UTF-8;LC_MESSAGES=fr_FR.UTF-8;LC_PAPER=zh_TW.UTF-8;LC_NAME=zh_TW.UTF-8;LC_ADDRESS=zh_TW.UTF-8;LC_TELEPHONE=zh_TW.UTF-8;LC_MEASUREMENT=zh_TW.UTF-8;LC_IDENTIFICATION=zh_TW.UTF-8"

$ env LC_ALL=zh_TW.UTF-8 ./test_setlocale 
setlocale(LC_ALL, "") returns "zh_TW.UTF-8"

$ env LC_ALL=zh_TW ./test_setlocale 
setlocale(LC_ALL, "") returns NULL.

$ env LC_ALL=fr_FR.UTF-8 ./test_setlocale 
setlocale(LC_ALL, "") returns "fr_FR.UTF-8"

$ env LC_ALL=fr_FR ./test_setlocale 
setlocale(LC_ALL, "") returns NULL.

Speculations

  • I believe the LC_CTYPE=... results cause I18N to fail, as the locale LC_CT doesn't exist
  • I'm not sure why LANG=zh_TW.UTF-8 and LANG=fr_FR.UTF-8 give different results
  • The only configuration I found that makes Tidy's I18N work is env LC_ALL=fr_FR.UTF-8 tidy -help
  • I noticed that env LC_ALL=fr_FR.UTF-8 tidy -help only works when the locale definition of fr_FR.UTF-8 is installed

@brlin-tw changed the title from "I18N isn't via changing the LANG environment variable" to "I18N isn't working via changing the LANG environment variable" on Nov 29, 2018
@brlin-tw changed the title from "I18N isn't working via changing the LANG environment variable" to "I18N isn't working (mostly) via changing the environment variables" on Nov 29, 2018

@geoffmcl
Contributor

@Lin-Buo-Ren thanks for the issue... but I am still trying to understand the problem here...

I agree the docs could maybe add, say, "language of your choice, if installed", of course... in reply to your last point... you can't select a temporary language change if it does not exist... docs can always be improved... but...

As part of TY_(tidySetLanguage), the setting of the two dictionaries, the call to TY_(tidyNormalizedLocaleName)( ctmbstr locale ) makes for hard reading... but it then goes on to dict1 = TY_(tidyTestLanguage( wantCode )); /* WANTED language */ - yikes, lots to understand here... but it works to select the language... most of the time, if set up properly, maybe...

But, as several of your tests point out, e.g. $ env LC_ALL=fr_FR ./test_setlocale, setlocale returns NULL, so tidy defaults to English... What is the problem here?

How can this be fixed? That is, if setlocale( LC_ALL, "") is fried, and returns NULL... where else to look?

Maybe, at least in UNIX, libtidy could search for say ENV LANG[UAGE]=xx_yy as the selector? Don't know...

Seek further feedback here, even to help understand the problem... thanks...

@brlin-tw
Contributor Author

brlin-tw commented Dec 2, 2018

I've changed the test code and the speculated results to match what Tidy does.

in reply to your last point... you can't select a temporary language change if it does not exist

Please disregard the part where I said that the localization works only when the locale is installed; it was merely speculation due to my unfamiliarity with how glibc locales work.

The obvious problems with the doc and code are:

  • The documented method to switch the language doesn't work on at least the GNU+Linux platform (it requires at least language_territory.codeset; language_territory alone is not enough). I would suggest just mentioning the LANG environment variable (without example values), as readers may otherwise be confused.
  • In some circumstances the setlocale(3) call returns the entire list of locale categories and their values (LC_CTYPE=...;LC_NUMERIC=...;LC_TIME=...) rather than a single locale name, and Tidy will parse that as language lc_ct, if I understand the code correctly.

Checking LANG (and probably LC_MESSAGES) works, but as you've said it isn't portable.
(NOTE: LANG and LC_MESSAGES come from POSIX rather than standard C; ISO C does not require locale.h to define LC_MESSAGES.)

Reference: Setting the Locale (The GNU C Library)
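
To illustrate the second point, here is a minimal sketch of how a normalizer that lowercases its input and keeps only an "ll_cc" prefix ends up with lc_ct. This is an assumption about the general approach, not Tidy's actual TY_(tidyNormalizedLocaleName) code; normalize_locale_sketch is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified normalizer (NOT Tidy's actual code): lowercase the
 * input and keep only an "ll" or "ll_cc" prefix. out must hold 6 bytes. */
void normalize_locale_sketch(const char *locale, char out[6])
{
    size_t len, i;

    assert(locale != NULL && out != NULL);

    len = strlen(locale);
    if (len > 5)
        len = 5;                    /* "ll_cc" is five characters long */

    for (i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)locale[i]);
    out[len] = '\0';

    /* Keep the region part only when a separator follows the language. */
    if (len > 2) {
        if (out[2] == '-')
            out[2] = '_';
        if (out[2] != '_')
            out[2] = '\0';
    }
}
```

With this sketch, "fr_FR.UTF-8" becomes "fr_fr", while a composite name starting with "LC_CTYPE=fr_FR.UTF-8;..." becomes "lc_ct", which matches no language.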

@brlin-tw
Contributor Author

brlin-tw commented Dec 2, 2018

At least in glibc, the setlocale(LC_ALL, "") or setlocale(LC_ALL, NULL) calls will return a regular locale name when:

  • All the non-LC_ALL locale categories are set to the same locale name.

The calls will return a composite locale name, which is a semicolon-separated list of entries of the form CATEGORY=VALUE, if:

  • At least one of the non-LC_ALL locale categories is set to a different locale name.

Related source: https://sourceware.org/git/?p=glibc.git;a=blob;f=locale/setlocale.c;h=e4de907e1f48396f7f505be359b44479bb1a39b8;hb=HEAD#l269

Not sure for other platforms, though.
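
For illustration, a composite name in the glibc layout described above could be picked apart roughly like this. It is only a sketch under that assumption: locale_for_category is a hypothetical helper, not part of Tidy or glibc, and other C libraries may format the string differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the value of `category` (e.g. "LC_MESSAGES") out of a setlocale()
 * result. A regular name such as "fr_FR.UTF-8" applies to every category and
 * is copied as-is. Returns 1 on success, 0 if the category is missing or the
 * buffer is too small. Assumes the glibc "CATEGORY=VALUE;..." layout. */
int locale_for_category(const char *name, const char *category,
                        char *out, size_t out_size)
{
    size_t cat_len, val_len;
    const char *p, *end;

    assert(name != NULL && category != NULL && out != NULL);
    cat_len = strlen(category);

    if (strchr(name, '=') == NULL) {        /* regular, non-composite name */
        val_len = strlen(name);
        if (val_len >= out_size)
            return 0;
        memcpy(out, name, val_len + 1);
        return 1;
    }

    p = name;
    while (*p != '\0') {
        end = strchr(p, ';');
        if (end == NULL)
            end = p + strlen(p);            /* last entry has no ';' */

        if ((size_t)(end - p) > cat_len &&
            strncmp(p, category, cat_len) == 0 && p[cat_len] == '=') {
            val_len = (size_t)(end - p) - cat_len - 1;
            if (val_len >= out_size)
                return 0;
            memcpy(out, p + cat_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }
        p = (*end == ';') ? end + 1 : end;  /* move on to the next entry */
    }
    return 0;
}
```

On POSIX systems, calling setlocale(LC_MESSAGES, NULL) after setlocale(LC_ALL, "") should also report just that one category, which may avoid the parsing altogether.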

@brlin-tw
Contributor Author

brlin-tw commented Dec 2, 2018

I would suggest Tidy determine the language in the following order:

  1. Parse the value of LANGUAGE according to https://www.gnu.org/software/gettext/manual/html_node/The-LANGUAGE-variable.html and choose the first language name for which a translation is available
  2. Try LC_MESSAGES
  3. Try LANG
  4. Fall back to the current setlocale call
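
The order above could be sketched roughly as follows. This is only an illustration, not the code from any actual patch: pick_message_locale, has_translation_fn, and demo_has_translation are hypothetical names, and the demo callback simply pretends that French and German translations exist.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback: nonzero if a translation exists for `name`. */
typedef int (*has_translation_fn)(const char *name);

/* Demo callback (assumption): pretend only "fr*" and "de*" are translated. */
int demo_has_translation(const char *name)
{
    return strncmp(name, "fr", 2) == 0 || strncmp(name, "de", 2) == 0;
}

/* Pick a locale name in the proposed order. Writes it to out and returns 1,
 * or returns 0 if nothing usable was found. */
int pick_message_locale(has_translation_fn has_translation,
                        char *out, size_t out_size)
{
    static const char *const vars[] = { "LC_MESSAGES", "LANG" };
    const char *value;
    size_t i, len;

    assert(has_translation != NULL && out != NULL && out_size > 0);

    /* 1. LANGUAGE is a colon-separated priority list, e.g. "zh_TW:en_US". */
    value = getenv("LANGUAGE");
    while (value != NULL && *value != '\0') {
        len = strcspn(value, ":");
        if (len > 0 && len < out_size) {
            memcpy(out, value, len);
            out[len] = '\0';
            if (has_translation(out))
                return 1;
        }
        value += len;
        if (*value == ':')
            value++;
    }

    /* 2 and 3. Try LC_MESSAGES, then LANG. */
    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        value = getenv(vars[i]);
        if (value != NULL && strlen(value) < out_size &&
            has_translation(value)) {
            strcpy(out, value);
            return 1;
        }
    }

    /* 4. Fall back to whatever setlocale() reports (may be composite). */
    value = setlocale(LC_ALL, "");
    if (value != NULL && strlen(value) < out_size) {
        strcpy(out, value);
        return 1;
    }
    return 0;
}
```

Note that step 4 can still hand back a composite name on glibc, so the caller would need to cope with that case.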

brlin-tw added a commit to brlin-tw/tidy-html5 that referenced this issue Dec 3, 2018
The `setlocale` call doesn't return a single locale name in glibc;
instead it returns a composite locale name, which is a concatenation of
the entire list of locale categories and their values, causing language
detection to fail.

This patch attempts to set the language via the LC_MESSAGES and LANG
environment variables, which are commonly used in POSIX-like systems,
then falls back to `setlocale` as the last resort.
@brlin-tw
Contributor Author

brlin-tw commented Dec 3, 2018

I implemented a fix that takes care of 2, 3, and 4.

brlin-tw added a commit to brlin-tw/tidy-html5 that referenced this issue Dec 3, 2018
The `setlocale` call doesn't return a single locale name in glibc when
any of the locale category variables has a different value; instead it
returns a composite locale name, which is a concatenation of the entire
list of locale categories and their values, causing language detection
to fail.

This patch attempts to set the language via the LC_MESSAGES and LANG
environment variables, which are commonly used in POSIX-like systems,
then falls back to `setlocale` as the last resort.
@geoffmcl
Contributor

geoffmcl commented Dec 3, 2018

@Lin-Buo-Ren thanks for all this, and the PR #785 ... all looking good...

In some ways there is an overlap with issue #770, and please see my comment there on the PR... thanks

@balthisar
Member

Looks like this was merged in the PR, so closing.
