@lb90 lb90 commented Sep 24, 2025

On Windows, locale strings are not limited to ASCII characters, and narrow locale strings are encoded relative to the LC_CTYPE charset.

The __locale_guard class does three things in sequence:

  1. Queries the current locale string
  2. Sets a temporary locale
  3. Restores the locale retrieved at 1

However, the locale string obtained in step 1 may be in a different encoding from the one setlocale expects in step 3.

To avoid any problem with narrow string encodings, use _wsetlocale in __locale_guard.
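
The three steps above can be sketched as a small RAII class. This is only an illustration under stated assumptions, not the actual `__locale_guard`: the names `scoped_locale`, `locale_string` and `query_or_set` are hypothetical, the sketch leaves out the `_configthreadlocale` handling that the real guard performs, and on non-Windows platforms it falls back to narrow `setlocale` so that it stays compilable there.

```cpp
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical helper: query (name == nullptr) or set the LC_ALL locale.
// On Windows the wide CRT entry point is used so the saved string never
// passes through the current LC_CTYPE charset.
#ifdef _WIN32
using locale_string = std::wstring;
inline const wchar_t* query_or_set(const wchar_t* name) {
  return ::_wsetlocale(LC_ALL, name);
}
#else
using locale_string = std::string;
inline const char* query_or_set(const char* name) {
  return std::setlocale(LC_ALL, name);
}
#endif

// Hypothetical guard for illustration; this is not libc++'s __locale_guard.
class scoped_locale {
public:
  explicit scoped_locale(const char* temporary) {
    // Step 1: query the current locale string and keep a private copy.
    if (const auto* current = query_or_set(nullptr)) {
      saved_ = current;
      have_saved_ = true;
    }
    // Step 2: switch to the temporary locale.
    std::setlocale(LC_ALL, temporary);
  }

  ~scoped_locale() {
    // Step 3: restore the string from step 1 through the same API that
    // produced it, so query and restore agree on the encoding.
    if (have_saved_)
      query_or_set(saved_.c_str());
  }

  scoped_locale(const scoped_locale&) = delete;
  scoped_locale& operator=(const scoped_locale&) = delete;

private:
  locale_string saved_;
  bool have_saved_ = false;
};
```

The point of the sketch is that the query in step 1 and the restore in step 3 go through the same wide API on Windows, so a later change of the LC_CTYPE charset cannot invalidate the saved string.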

Question: can we also modify https://github.com/llvm/llvm-project/blob/main/libcxx/include/__cxx03/__locale_dir/locale_base_api/locale_guard.h#L37? EDIT: done

Fixes #160478

@lb90 lb90 requested a review from a team as a code owner September 24, 2025 09:49
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Sep 24, 2025

llvmbot commented Sep 24, 2025

@llvm/pr-subscribers-libcxx

Author: None (lb90)

Changes

Querying the current locale string on Windows should always be done with _wsetlocale(). The OS and the CRT support localized language and country names, for example "Norwegian Bokmål_Norway". Narrow setlocale doesn't know what the expected encoding is.

Fixes #160478


Full diff: https://github.com/llvm/llvm-project/pull/160479.diff

1 Files Affected:

  • (modified) libcxx/include/__locale_dir/support/windows.h (+13-3)
diff --git a/libcxx/include/__locale_dir/support/windows.h b/libcxx/include/__locale_dir/support/windows.h
index 0df8709f118d0..1d54d4cb119e0 100644
--- a/libcxx/include/__locale_dir/support/windows.h
+++ b/libcxx/include/__locale_dir/support/windows.h
@@ -162,6 +162,12 @@ inline _LIBCPP_HIDE_FROM_ABI char* __setlocale(int __category, const char* __loc
     std::__throw_bad_alloc();
   return __new_locale;
 }
+inline _LIBCPP_HIDE_FROM_ABI wchar_t* __wsetlocale(int __category, const wchar_t* __locale) {
+  wchar_t* __new_locale = ::_wsetlocale(__category, __locale);
+  if (__new_locale == nullptr)
+    std::__throw_bad_alloc();
+  return __new_locale;
+}
 _LIBCPP_EXPORTED_FROM_ABI __lconv_t* __localeconv(__locale_t& __loc);
 #endif // _LIBCPP_BUILDING_LIBRARY
 
@@ -309,7 +315,11 @@ struct __locale_guard {
     // each category.  In the second case, we know at least one category won't
     // be what we want, so we only have to check the first case.
     if (std::strcmp(__l.__get_locale(), __lc) != 0) {
-      __locale_all = _strdup(__lc);
+      // Use wsetlocale to query the current locale string. This avoids a lossy
+      // conversion of the locale string from UTF-16 to the current LC_CTYPE
+      // charset. The Windows CRT allows language/country strings outside of
+      // ASCII, e.g. "Norwegian Bokmål_Norway.utf8"
+      __locale_all = _wcsdup(__locale::__wsetlocale(LC_ALL, nullptr));
       if (__locale_all == nullptr)
         std::__throw_bad_alloc();
       __locale::__setlocale(LC_ALL, __l.__get_locale());
@@ -321,13 +331,13 @@ struct __locale_guard {
     // for the different categories in the same format as returned by
     // setlocale(LC_ALL, nullptr).
     if (__locale_all != nullptr) {
-      __locale::__setlocale(LC_ALL, __locale_all);
+      __locale::__wsetlocale(LC_ALL, __locale_all);
       free(__locale_all);
     }
     _configthreadlocale(__status);
   }
   int __status;
-  char* __locale_all = nullptr;
+  wchar_t* __locale_all = nullptr;
 };
 #endif // _LIBCPP_BUILDING_LIBRARY
 

@philnik777 philnik777 requested a review from mstorsjo September 24, 2025 09:54
Querying the current locale string on Windows should always be done
with _wsetlocale(). The OS and the CRT support localized language
and country names, for example "Norwegian Bokmål_Norway".

Narrow setlocale() internally calls _wsetlocale() and converts the
returned wide string using the current LC_CTYPE charset. However,
the string may not be representable in the current LC_CTYPE charset.
Additionally, if the LC_CTYPE charset is changed after the query,
the returned string is no longer validly encoded and cannot be used
to restore the locale.

This is a problem for code that temporarily changes the thread locale
using RAII methods.

Fixes llvm#160478
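
The representability problem described in the commit message can be probed with a round trip through the standard C conversion functions. This is a minimal sketch, assuming the standard `wcstombs`/`mbstowcs` pair is a reasonable stand-in for the conversion the CRT performs internally; the function name `round_trips_in_current_locale` is hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: returns true if `wide` survives a conversion to the
// current LC_CTYPE charset and back without loss.
bool round_trips_in_current_locale(const std::wstring& wide) {
  // Wide -> narrow, using the current LC_CTYPE charset.
  std::vector<char> narrow(wide.size() * MB_CUR_MAX + 1);
  const std::size_t n =
      std::wcstombs(narrow.data(), wide.c_str(), narrow.size());
  if (n == static_cast<std::size_t>(-1) || n >= narrow.size())
    return false; // a character is not representable (or no room for '\0')

  // Narrow -> wide, again using the current LC_CTYPE charset.
  std::vector<wchar_t> back(wide.size() + 1);
  const std::size_t m =
      std::mbstowcs(back.data(), narrow.data(), back.size());
  if (m == static_cast<std::size_t>(-1))
    return false;

  return std::wstring(back.data(), m) == wide;
}
```

An ASCII name such as `L"English_United States"` round-trips in the "C" locale. Whether a name containing "å" does depends on the platform and on the active LC_CTYPE charset, which is exactly why a saved narrow string is fragile.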
lb90 commented Sep 26, 2025

I have now also modified include/__cxx03/__locale_dir/locale_base_api/locale_guard.h. That header is only ever included when _LIBCPP_LOCALE__L_EXTENSIONS is not defined, but _LIBCPP_LOCALE__L_EXTENSIONS is always defined on Windows. So I have modified code that is effectively unused, but things may change in the future...

A bit off-topic:

__libcpp_locale_guard(__libcpp_locale_guard const&) = delete;
uses deleted function declarations, which are a C++11 extension. Isn't that header used for C++03 compilation?
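
For comparison, the strict C++03 way to make a class non-copyable is to declare the copy operations private and never define them. The sketch below uses a hypothetical class name (`guard_cxx03`) and is not taken from the header. As far as I know, Clang also accepts `= delete` in C++03 mode as an extension (with a warning), which may be why the header can use it.

```cpp
// Hypothetical class showing the pre-C++11 non-copyable idiom.
class guard_cxx03 {
public:
  guard_cxx03() : active_(true) {}
  bool active() const { return active_; }

private:
  // Declared but intentionally never defined: copying from outside the
  // class fails to compile, and copying from inside fails to link.
  guard_cxx03(const guard_cxx03&);
  guard_cxx03& operator=(const guard_cxx03&);

  bool active_;
};
```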

Jehan commented Sep 30, 2025

For info, this bug has been affecting GIMP, which has been crashing on Windows when set to specific languages (e.g. Norwegian Bokmål or Turkish): https://gitlab.gnome.org/GNOME/gimp/-/issues/12626

For our next release, we are building Exiv2 with an extraordinarily ugly patch because of this: Exiv2/exiv2#3361

We would really appreciate it if the real bug could be fixed at the source, i.e. here. Thanks all!

@mstorsjo mstorsjo left a comment

The patch LGTM, but I've got one discussion point about it.

// conversion of the locale string from UTF-16 to the current LC_CTYPE
// charset. The Windows CRT allows language / country strings outside of
// ASCII, e.g. "Norwegian Bokm\u00E5l_Norway.utf8".
__locale_all = _wcsdup(__wsetlocale(nullptr));

It's a bit of a pity that this requires calling __wsetlocale() a second time after the first __setlocale() above; I'm wondering if this is a risk for performance degradation.

See e.g. #56202 for a case where this has been measured to be a bottleneck - CC @alvinhochun.

Here, I guess the alternative is to unconditionally use __wsetlocale() above for fetching the name of the current locale, and that requires us to do more of the potentially messy charset conversions. Likewise - the form that this patch suggests feels a bit asymmetrical, when both narrow and wide APIs are being used for the same thing. But I see that it would require more of a mess and more local charset conversions (and require us to decide which charset to use for conversions) if we'd switch over entirely.

So all in all, this is probably fine in this form; I think I agree that this is a reasonable compromise form.

@philnik777 philnik777 left a comment

Please provide a test to ensure we don't regress on this.

mstorsjo commented Oct 6, 2025

Please provide a test to ensure we don't regress on this.

Regarding a test here: I guess a test requires that the host system has a locale available with a non-ASCII name. The libcxx tests do have some infrastructure for checking whether certain locales are available, for marking tests as unsupported if missing, but I'm not sure how Unicode-safe that whole pipe is, from the libcxx test framework scripts, through lit, up to the tests. So perhaps this would need to be a test that just unconditionally tries to set such a locale and bails out if it is not available. (Is there a return code we can return to mark the test as skipped, rather than succeeded? Otherwise, succeeding as a no-op is probably the best we can do.)
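
A test of the "try to set the locale, bail out if unavailable" shape could look roughly like the sketch below. Everything here is an assumption for illustration: the names `outcome`, `current_locale_name` and `check_locale_preserved` are hypothetical, the callback stands in for whatever library code exercises the locale guard, and the sketch says nothing about how lit would report a skipped test. Note also that passing a non-ASCII name through the narrow `setlocale` call is itself encoding-sensitive on Windows, so a real test might need a wide name there.

```cpp
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical result type; how a skip is reported to the test runner is
// left open.
enum class outcome { skipped, passed, failed };

// Hypothetical helper: read the current LC_ALL locale string, using the
// wide CRT API on Windows to avoid any charset conversion.
#ifdef _WIN32
inline std::wstring current_locale_name() {
  const wchar_t* name = ::_wsetlocale(LC_ALL, nullptr);
  return name ? std::wstring(name) : std::wstring();
}
#else
inline std::string current_locale_name() {
  const char* name = std::setlocale(LC_ALL, nullptr);
  return name ? std::string(name) : std::string();
}
#endif

// Sets `name` as the global C locale, runs `body` (the code under test),
// and checks that the locale string is unchanged afterwards.
template <class Body>
outcome check_locale_preserved(const char* name, Body body) {
  if (std::setlocale(LC_ALL, name) == nullptr)
    return outcome::skipped; // locale not available on this host

  const auto before = current_locale_name();
  body();
  const auto after = current_locale_name();

  std::setlocale(LC_ALL, "C"); // leave the process in a known state
  return before == after ? outcome::passed : outcome::failed;
}
```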

For info, this bug has been affecting GIMP, which has been crashing on Windows when set to specific languages (e.g. Norwegian Bokmål or Turkish): https://gitlab.gnome.org/GNOME/gimp/-/issues/12626

Thanks for this context; I was wondering how this led to crashes - I thought this fault situation would only lead to the wrong locale being used. But https://developercommunity.visualstudio.com/t/setlocale-may-crash-CRT-in-some-cases/10603395 is the interesting missing clue here; setlocale() aborts the process when given a faulty/mangled locale name here - even though it's not supposed to.

Successfully merging this pull request may close these issues.

[libc++] [windows] Crash with Norwegian regional settings
5 participants