[libc++][windows] Use _wsetlocale() in __locale_guard #160479
Thank you for submitting a Pull Request (PR) to the LLVM Project! This PR will be automatically labeled and the relevant teams will be notified. If you wish to, you can add reviewers by using the "Reviewers" section on this page. If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers. If you have further questions, they may be answered by the LLVM GitHub User Guide. You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.
@llvm/pr-subscribers-libcxx
Author: None (lb90)
Changes
Querying the current locale string on Windows should always be done with _wsetlocale(). The OS and the CRT support localized language and country names, for example "Norwegian Bokmål_Norway". Narrow setlocale doesn't know what the expected encoding is.
Fixes #160478
Full diff: https://github.com/llvm/llvm-project/pull/160479.diff
1 Files Affected:
diff --git a/libcxx/include/__locale_dir/support/windows.h b/libcxx/include/__locale_dir/support/windows.h
index 0df8709f118d0..1d54d4cb119e0 100644
--- a/libcxx/include/__locale_dir/support/windows.h
+++ b/libcxx/include/__locale_dir/support/windows.h
@@ -162,6 +162,12 @@ inline _LIBCPP_HIDE_FROM_ABI char* __setlocale(int __category, const char* __loc
std::__throw_bad_alloc();
return __new_locale;
}
+inline _LIBCPP_HIDE_FROM_ABI wchar_t* __wsetlocale(int __category, const wchar_t* __locale) {
+ wchar_t* __new_locale = ::_wsetlocale(__category, __locale);
+ if (__new_locale == nullptr)
+ std::__throw_bad_alloc();
+ return __new_locale;
+}
_LIBCPP_EXPORTED_FROM_ABI __lconv_t* __localeconv(__locale_t& __loc);
#endif // _LIBCPP_BUILDING_LIBRARY
@@ -309,7 +315,11 @@ struct __locale_guard {
// each category. In the second case, we know at least one category won't
// be what we want, so we only have to check the first case.
if (std::strcmp(__l.__get_locale(), __lc) != 0) {
- __locale_all = _strdup(__lc);
+ // Use wsetlocale to query the current locale string. This avoids a lossy
+ // conversion of the locale string from UTF-16 to the current LC_CTYPE
+ // charset. The Windows CRT allows language/country strings outside of
+ // ASCII, e.g. "Norwegian Bokmål_Norway.utf8"
+ __locale_all = _wcsdup(__locale::__wsetlocale(LC_ALL, nullptr));
if (__locale_all == nullptr)
std::__throw_bad_alloc();
__locale::__setlocale(LC_ALL, __l.__get_locale());
@@ -321,13 +331,13 @@ struct __locale_guard {
// for the different categories in the same format as returned by
// setlocale(LC_ALL, nullptr).
if (__locale_all != nullptr) {
- __locale::__setlocale(LC_ALL, __locale_all);
+ __locale::__wsetlocale(LC_ALL, __locale_all);
free(__locale_all);
}
_configthreadlocale(__status);
}
int __status;
- char* __locale_all = nullptr;
+ wchar_t* __locale_all = nullptr;
};
#endif // _LIBCPP_BUILDING_LIBRARY
Querying the current locale string on Windows should always be done with _wsetlocale(). The OS and the CRT support localized language and country names, for example "Norwegian Bokmål_Norway". Narrow setlocale() internally calls _wsetlocale() and converts the returned wide string using the current LC_CTYPE charset. However the string may not be representable in the current LC_CTYPE charset. Additionally, if the LC_CTYPE charset is changed after the query, the returned string becomes invalidly-encoded and cannot be used to restore the locale. This is a problem for code that temporarily changes the thread locale using RAII methods. Fixes llvm#160478
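To illustrate why the narrow query can be lossy, here is a minimal sketch. It does not call the CRT: narrow_to_ascii and widen_from_ascii are hypothetical helpers that model a narrow charset limited to 7-bit ASCII, which is an assumption made for the example. The real conversion depends on the active LC_CTYPE charset.

```cpp
#include <cassert>
#include <string>

// Hypothetical model of a narrow charset that cannot represent every
// character: anything outside 7-bit ASCII is replaced with '?'.
// Assumption: this only stands in for the CRT's LC_CTYPE-based conversion.
inline std::string narrow_to_ascii(const std::wstring& wide) {
  std::string out;
  out.reserve(wide.size());
  for (wchar_t ch : wide)
    out.push_back(static_cast<unsigned long>(ch) < 0x80
                      ? static_cast<char>(ch)
                      : '?');
  return out;
}

// Widen the narrow string back, one byte per wide character.
inline std::wstring widen_from_ascii(const std::string& narrow) {
  std::wstring out;
  out.reserve(narrow.size());
  for (unsigned char ch : narrow)
    out.push_back(static_cast<wchar_t>(ch));
  return out;
}

// True when a locale name survives a wide -> narrow -> wide round trip.
inline bool survives_round_trip(const std::wstring& name) {
  return widen_from_ascii(narrow_to_ascii(name)) == name;
}
```

In this model, L"Norwegian Bokm\u00E5l_Norway" narrows to "Norwegian Bokm?l_Norway", so the name handed back for restoring the locale no longer matches the original. An all-ASCII name round-trips unchanged.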
7848bdf to 231ab41
I have now also modified include/__cxx03/__locale_dir/locale_base_api/locale_guard.h. That header is only ever included when A bit off-topic:
For info, this issue has been affecting GIMP, which has been crashing on Windows when set to specific languages (e.g. Norwegian Bokmål or Turkish): https://gitlab.gnome.org/GNOME/gimp/-/issues/12626 For our next release, we are building Exiv2 with an extraordinarily ugly patch because of this: Exiv2/exiv2#3361 We would really appreciate it if the real bug could be fixed at the source, i.e. here. Thanks all!
The patch LGTM, but I've got one discussion point about it.
// conversion of the locale string from UTF-16 to the current LC_CTYPE
// charset. The Windows CRT allows language / country strings outside of
// ASCII, e.g. "Norwegian Bokm\u00E5l_Norway.utf8".
__locale_all = _wcsdup(__wsetlocale(nullptr));
It's a bit of a pity that this requires calling __wsetlocale() a second time after the first __setlocale() above; I'm wondering if this is a risk for performance degradation.
See e.g. #56202 for a case where this has been measured to be a bottleneck - CC @alvinhochun.
Here, I guess the alternative is to unconditionally use __wsetlocale() above for fetching the name of the current locale, and that requires us to do more of the potentially messy charset conversions. Likewise - the form that this patch suggests feels a bit asymmetrical, when both narrow and wide APIs are being used for the same thing. But I see that it would require more of a mess and more local charset conversions (and require us to decide which charset to use for conversions) if we'd switch over entirely.
So all in all, this is probably fine in this form; I think I agree that this is a reasonable compromise form.
Please provide a test to ensure we don't regress on this.
Regarding a test here; I guess a test requires that the host system has a locale available with a non-ASCII name. The libcxx tests do have some infrastructure for checking whether certain locales are available, for marking tests as unsupported if missing - but I'm not sure how Unicode-safe that whole pipe is - from the libcxx test framework scripts, through lit up to the tests. So perhaps this would need to be a test that just unconditionally tries to set such a locale, and bails out if it is not available. (Is there a return code we can return to mark the test as skipped, rather than succeeded? Otherwise, succeeding as a no-op is probably the best we can do.)
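The "try to set the locale, bail out if unavailable" idea could be sketched roughly as below. This is a portable illustration that uses only std::setlocale. Outcome and check_locale_round_trip are hypothetical names, and a real libc++ test would need to go through the code path that uses __locale_guard rather than call setlocale directly.

```cpp
#include <clocale>
#include <string>

enum class Outcome { Skipped, Passed, Failed };

// Hypothetical helper: try to select `name`, then check that the string
// reported by setlocale(LC_ALL, nullptr) can be used to select it again.
inline Outcome check_locale_round_trip(const char* name) {
  // Remember the locale in effect on entry so it can be restored.
  const char* initial = std::setlocale(LC_ALL, nullptr);
  const std::string original = initial ? initial : "C";

  // A null return means the host does not provide this locale; the
  // current locale is left unchanged in that case.
  if (std::setlocale(LC_ALL, name) == nullptr)
    return Outcome::Skipped;

  // Query the name of the now-active locale and try to re-apply it.
  const char* queried = std::setlocale(LC_ALL, nullptr);
  const std::string reported = queried ? queried : "";
  const bool ok = !reported.empty() &&
                  std::setlocale(LC_ALL, reported.c_str()) != nullptr;

  // Put the original locale back before reporting.
  std::setlocale(LC_ALL, original.c_str());
  return ok ? Outcome::Passed : Outcome::Failed;
}
```

How Outcome::Skipped would map to a result in the test framework is left open here, as in the comment above. The sketch only separates "locale not installed" from "round trip failed".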
Thanks for this context; I was wondering how this led to crashes - I thought this fault situation would only lead to the wrong locale being used. But https://developercommunity.visualstudio.com/t/setlocale-may-crash-CRT-in-some-cases/10603395 is the interesting missing clue here;
On Windows, locale strings are not limited to ASCII characters, and narrow locale strings are encoded relative to the LC_CTYPE charset.

The __locale_guard class does three things in sequence:

However, the locale string obtained from 1) may be in an encoding different from what setlocale expects at 3).

To avoid any problem with narrow string encodings, use _wsetlocale in __locale_guard.

Question: can we also modify https://github.com/llvm/llvm-project/blob/main/libcxx/include/__cxx03/__locale_dir/locale_base_api/locale_guard.h#L37?

EDIT: done

Fixes #160478
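For readers outside libc++, the save / switch / restore pattern in the diff can be sketched as a standalone RAII helper. This is not the libc++ implementation: ScopedLocale is a hypothetical name, it omits the _configthreadlocale handling visible in the diff, and it silently skips the restore when the query fails instead of throwing. On Windows the saved name is kept as a wide string, following the approach of this PR. On other platforms the sketch falls back to the narrow API.

```cpp
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical RAII helper, not the libc++ __locale_guard itself.
class ScopedLocale {
public:
  explicit ScopedLocale(const char* temporary) {
#ifdef _WIN32
    // Wide query: the saved name never passes through the LC_CTYPE charset.
    if (const wchar_t* current = ::_wsetlocale(LC_ALL, nullptr))
      saved_ = current;
#else
    // Assumption for this sketch: other platforms use the narrow query.
    if (const char* current = std::setlocale(LC_ALL, nullptr))
      saved_ = current;
#endif
    // Switch to the temporary locale; a null return means it is unavailable.
    active_ = std::setlocale(LC_ALL, temporary) != nullptr;
  }

  ~ScopedLocale() {
    if (saved_.empty())
      return;
    // Restore using the same string width that was used for the query.
#ifdef _WIN32
    ::_wsetlocale(LC_ALL, saved_.c_str());
#else
    std::setlocale(LC_ALL, saved_.c_str());
#endif
  }

  ScopedLocale(const ScopedLocale&) = delete;
  ScopedLocale& operator=(const ScopedLocale&) = delete;

  // True when the temporary locale was actually installed.
  bool active() const { return active_; }

private:
#ifdef _WIN32
  std::wstring saved_;
#else
  std::string saved_;
#endif
  bool active_ = false;
};
```

Keeping the query and the restore on the same (wide) API means the saved string never depends on which LC_CTYPE charset is active at restore time, which is the failure mode described above.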