[libc++] Speed up classic locale #72112

dvyukov · 2023-11-13T12:32:21Z

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads.

Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads.

                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)

llvmbot · 2023-11-13T12:32:52Z

@llvm/pr-subscribers-libcxx

Author: Dmitry Vyukov (dvyukov)

Changes

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads.

Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads.

                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)

Full diff: https://github.com/llvm/llvm-project/pull/72112.diff

2 Files Affected:

(modified) libcxx/benchmarks/stringstream.bench.cpp (+59-3)
(modified) libcxx/src/locale.cpp (+34-14)

diff --git a/libcxx/benchmarks/stringstream.bench.cpp b/libcxx/benchmarks/stringstream.bench.cpp
index ea602557ccd770e..c10ee3a8cc5b83c 100644
--- a/libcxx/benchmarks/stringstream.bench.cpp
+++ b/libcxx/benchmarks/stringstream.bench.cpp
@@ -1,11 +1,12 @@
 #include "benchmark/benchmark.h"
 #include "test_macros.h"
 
+#include <mutex>
 #include <sstream>
 
 TEST_NOINLINE double istream_numbers();
 
-double istream_numbers() {
+double istream_numbers(std::locale* l) {
   const char* a[] = {"-6  69 -71  2.4882e-02 -100 101 -2.00005 5000000 -50000000",
                      "-25 71   7 -9.3262e+01 -100 101 -2.00005 5000000 -50000000",
                      "-14 53  46 -6.7026e-02 -100 101 -2.00005 5000000 -50000000"};
@@ -14,17 +15,72 @@ double istream_numbers() {
   double f1 = 0.0, f2 = 0.0, q = 0.0;
   for (int i = 0; i < 3; i++) {
     std::istringstream s(a[i]);
+    if (l)
+      s.imbue(*l);
     s >> a1 >> a2 >> a3 >> f1 >> a4 >> a5 >> f2 >> a6 >> a7;
     q += (a1 + a2 + a3 + a4 + a5 + a6 + a7 + f1 + f2) / 1000000;
   }
   return q;
 }
 
+struct LocaleSelector {
+  std::locale* imbue;
+  std::locale old;
+
+  LocaleSelector(benchmark::State& state) {
+    static std::mutex mu;
+    std::lock_guard l(mu);
+    switch (state.range(0)) {
+    case 0: {
+      old   = std::locale::global(std::locale::classic());
+      imbue = nullptr;
+      break;
+    }
+    case 1: {
+      old = std::locale::global(std::locale::classic());
+      thread_local std::locale l("en_US.UTF-8");
+      imbue = &l;
+      break;
+    }
+    case 2: {
+      old = std::locale::global(std::locale::classic());
+      static std::locale l("en_US.UTF-8");
+      imbue = &l;
+      break;
+    }
+    case 3: {
+      old   = std::locale::global(std::locale("en_US.UTF-8"));
+      imbue = nullptr;
+      break;
+    }
+    }
+  }
+
+  ~LocaleSelector() {
+    static std::mutex mu;
+    std::lock_guard l(mu);
+    std::locale::global(old);
+  }
+};
+
 static void BM_Istream_numbers(benchmark::State& state) {
+  LocaleSelector sel(state);
   double i = 0;
   while (state.KeepRunning())
-    benchmark::DoNotOptimize(i += istream_numbers());
+    benchmark::DoNotOptimize(i += istream_numbers(sel.imbue));
+}
+BENCHMARK(BM_Istream_numbers)->DenseRange(0, 3)->UseRealTime()->Threads(1)->ThreadPerCpu();
+
+static void BM_Ostream_number(benchmark::State& state) {
+  LocaleSelector sel(state);
+  while (state.KeepRunning()) {
+    std::ostringstream ss;
+    if (sel.imbue)
+      ss.imbue(*sel.imbue);
+    ss << 0;
+    benchmark::DoNotOptimize(ss.str().c_str());
+  }
 }
+BENCHMARK(BM_Ostream_number)->DenseRange(0, 3)->UseRealTime()->Threads(1)->ThreadPerCpu();
 
-BENCHMARK(BM_Istream_numbers)->RangeMultiplier(2)->Range(1024, 4096);
 BENCHMARK_MAIN();
diff --git a/libcxx/src/locale.cpp b/libcxx/src/locale.cpp
index c37e091dcc4671b..58b2d6c33606ba9 100644
--- a/libcxx/src/locale.cpp
+++ b/libcxx/src/locale.cpp
@@ -8,6 +8,7 @@
 
 #include <__utility/unreachable.h>
 #include <algorithm>
+#include <atomic>
 #include <clocale>
 #include <codecvt>
 #include <cstddef>
@@ -80,7 +81,7 @@ locale_t __cloc() {
 
 namespace {
 
-struct release
+struct releaser
 {
     void operator()(locale::facet* p) {p->__release_shared();}
 };
@@ -154,12 +155,16 @@ class _LIBCPP_HIDDEN locale::__imp
         {return static_cast<size_t>(id) < facets_.size() && facets_[static_cast<size_t>(id)];}
     const locale::facet* use_facet(long id) const;
 
+    void acquire();
+    void release();
+
     static const locale& make_classic();
     static       locale& make_global();
 private:
     void install(facet* f, long id);
     template <class F> void install(F* f) {install(f, f->id.__get());}
     template <class F> void install_from(const __imp& other);
+    static std::atomic<__imp*> classic_;
 };
 
 locale::__imp::__imp(size_t refs)
@@ -501,7 +506,7 @@ locale::__imp::__imp(const __imp& other, facet* f, long id)
       name_("*")
 {
     f->__add_shared();
-    unique_ptr<facet, release> hold(f);
+    unique_ptr<facet, releaser> hold(f);
     facets_ = other.facets_;
     for (unsigned i = 0; i < other.facets_.size(); ++i)
         if (facets_[i])
@@ -520,7 +525,7 @@ void
 locale::__imp::install(facet* f, long id)
 {
     f->__add_shared();
-    unique_ptr<facet, release> hold(f);
+    unique_ptr<facet, releaser> hold(f);
     if (static_cast<size_t>(id) >= facets_.size())
         facets_.resize(static_cast<size_t>(id+1));
     if (facets_[static_cast<size_t>(id)])
@@ -538,6 +543,8 @@ locale::__imp::use_facet(long id) const
 
 // locale
 
+std::atomic<locale::__imp*> locale::__imp::classic_;
+
 const locale&
 locale::__imp::make_classic()
 {
@@ -545,9 +552,22 @@ locale::__imp::make_classic()
     alignas(locale) static std::byte buf[sizeof(locale)];
     locale* c = reinterpret_cast<locale*>(&buf);
     c->__locale_ = &make<__imp>(1u);
+    classic_.store(c->__locale_, std::memory_order_relaxed);
     return *c;
 }
 
+void locale::__imp::acquire()
+{
+    if (this != classic_.load(std::memory_order_relaxed))
+        __add_shared();
+}
+
+void locale::__imp::release()
+{
+    if (this != classic_.load(std::memory_order_relaxed))
+        __release_shared();
+}
+
 const locale&
 locale::classic()
 {
@@ -574,25 +594,25 @@ locale::__global()
 locale::locale() noexcept
     : __locale_(__global().__locale_)
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& l) noexcept
     : __locale_(l.__locale_)
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::~locale()
 {
-    __locale_->__release_shared();
+    __locale_->release();
 }
 
 const locale&
 locale::operator=(const locale& other) noexcept
 {
-    other.__locale_->__add_shared();
-    __locale_->__release_shared();
+    other.__locale_->acquire();
+    __locale_->release();
     __locale_ = other.__locale_;
     return *this;
 }
@@ -601,32 +621,32 @@ locale::locale(const char* name)
     : __locale_(name ? new __imp(name)
                      : (__throw_runtime_error("locale constructed with null"), nullptr))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const string& name)
     : __locale_(new __imp(name))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const char* name, category c)
     : __locale_(name ? new __imp(*other.__locale_, name, c)
                      : (__throw_runtime_error("locale constructed with null"), nullptr))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const string& name, category c)
     : __locale_(new __imp(*other.__locale_, name, c))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const locale& one, category c)
     : __locale_(new __imp(*other.__locale_, *one.__locale_, c))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 string
@@ -642,7 +662,7 @@ locale::__install_ctor(const locale& other, facet* f, long id)
         __locale_ = new __imp(*other.__locale_, f, id);
     else
         __locale_ = other.__locale_;
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale

dvyukov · 2023-11-13T12:40:48Z

This is an alternative version of #70631 (without inlining of ctor/dtor).

mordante

Thanks for the patch! I had a short look and see some things I want to look at later this week in more detail.

libcxx/src/locale.cpp

dvyukov

Please take another look.

libcxx/src/locale.cpp

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. │ baseline │ optimized │ │ sec/op │ sec/op vs base │ Istream_numbers/0/threads:1 4.672µ ± 0% 4.419µ ± 0% -5.42% (p=0.000 n=30+39) Istream_numbers/0/threads:72 539.817µ ± 0% 9.842µ ± 1% -98.18% (p=0.000 n=30+40) Istream_numbers/1/threads:1 4.890µ ± 0% 4.750µ ± 0% -2.85% (p=0.000 n=30+40) Istream_numbers/1/threads:72 66.44µ ± 1% 10.14µ ± 1% -84.74% (p=0.000 n=30+40) Istream_numbers/2/threads:1 4.888µ ± 0% 4.746µ ± 0% -2.92% (p=0.000 n=30+40) Istream_numbers/2/threads:72 494.8µ ± 0% 410.2µ ± 1% -17.11% (p=0.000 n=30+40) Istream_numbers/3/threads:1 4.697µ ± 0% 4.695µ ± 5% ~ (p=0.391 n=30+37) Istream_numbers/3/threads:72 421.5µ ± 7% 421.9µ ± 9% ~ (p=0.665 n=30) Ostream_number/0/threads:1 183.0n ± 0% 141.0n ± 2% -22.95% (p=0.000 n=30) Ostream_number/0/threads:72 24196.5n ± 1% 343.5n ± 3% -98.58% (p=0.000 n=30) Ostream_number/1/threads:1 250.0n ± 0% 196.0n ± 2% -21.60% (p=0.000 n=30) Ostream_number/1/threads:72 16260.5n ± 0% 407.0n ± 2% -97.50% (p=0.000 n=30) Ostream_number/2/threads:1 254.0n ± 0% 196.0n ± 1% -22.83% (p=0.000 n=30) Ostream_number/2/threads:72 28.49µ ± 1% 18.89µ ± 5% -33.72% (p=0.000 n=30) Ostream_number/3/threads:1 185.0n ± 0% 185.0n ± 0% 0.00% (p=0.017 n=30) Ostream_number/3/threads:72 19.38µ ± 4% 19.33µ ± 5% ~ (p=0.425 n=30)

ldionne · 2023-11-23T16:25:52Z

I took the freedom of applying the suggestion that doesn't require atomics at all, and rebasing on top of my other refactoring to save you the trouble of conflict resolution. @mordante @dvyukov What do you folks think about this?

github-actions · 2023-11-23T16:26:53Z

✅ With the latest revision this PR passed the C/C++ code formatter.

libcxx/src/locale.cpp

libcxx/include/__utility/no_destroy.h

libcxx/include/__locale

dvyukov · 2023-11-23T16:32:49Z

I agree this is better. The change looks good to me.
+1 for merging

ldionne

This LGTM in the current state.

mordante

Great to see we can do it without the atomics!

In general happy, but there is one issue and two nits.

libcxx/benchmarks/stringstream.bench.cpp

libcxx/include/__utility/no_destroy.h

libcxx/include/CMakeLists.txt

libcxx/benchmarks/stringstream.bench.cpp

mordante

LGTM! Please address @ldionne's comments before merging.

dvyukov · 2023-11-27T05:06:02Z

Should I squash and merge this? Or what's the next step?

There are some test failures, but they don't look related:

Failed Tests (1):
  llvm-libc++-shared.cfg.in :: std/input.output/string.streams/stringstream/stringstream.members/gcount.pass.cpp

Timed Out Tests (2):
  llvm-libc++-shared.cfg.in :: std/containers/sequences/deque/deque.modifiers/insert_iter_iter.pass.cpp
  llvm-libc++-shared.cfg.in :: std/numerics/rand/rand.dist/rand.dist.bern/rand.dist.bern.negbin/eval.pass.cpp

dvyukov · 2023-11-27T05:59:49Z

I've re-run ninja check-cxx on the latest version and it passed. The change also shows comparable performance improvement.

I am going to merge it now. Let's see if it sticks.

                                       │ ./bench/stream.baseline.res │         bench/stream.latest.res         │
                                       │           sec/op            │    sec/op     vs base                   │
Istream_numbers/0/real_time/threads:1                    4.686µ ± 0%   4.482µ ±  0%   -4.35% (p=0.000 n=24+30)
Istream_numbers/0/real_time/threads:72                 339.712µ ± 4%   9.567µ ±  1%  -97.18% (p=0.000 n=24+30)
Istream_numbers/1/real_time/threads:1                    4.900µ ± 0%   4.810µ ±  0%   -1.84% (p=0.000 n=24+30)
Istream_numbers/1/real_time/threads:72                   49.31µ ± 3%   10.12µ ±  1%  -79.47% (p=0.000 n=24+30)
Istream_numbers/2/real_time/threads:1                    4.902µ ± 0%   4.831µ ±  7%   -1.45% (p=0.037 n=24+30)
Istream_numbers/2/real_time/threads:72                   416.6µ ± 9%   424.0µ ± 15%        ~ (p=0.993 n=24+30)
Istream_numbers/3/real_time/threads:1                    4.716µ ± 0%   4.764µ ±  0%   +1.01% (p=0.000 n=24+30)
Istream_numbers/3/real_time/threads:72                   419.7µ ± 6%   400.9µ ±  5%        ~ (p=0.116 n=24+30)
Ostream_number/0/real_time/threads:1                     184.0n ± 1%   144.0n ±  7%  -21.74% (p=0.000 n=24+30)
Ostream_number/0/real_time/threads:72                  17279.0n ± 4%   339.0n ±  0%  -98.04% (p=0.000 n=24+30)
Ostream_number/1/real_time/threads:1                     253.5n ± 1%   195.0n ±  1%  -23.08% (p=0.000 n=24+30)
Ostream_number/1/real_time/threads:72                  13957.5n ± 2%   412.0n ±  0%  -97.05% (p=0.000 n=24+30)
Ostream_number/2/real_time/threads:1                     253.0n ± 1%   195.0n ±  1%  -22.92% (p=0.000 n=24+30)
Ostream_number/2/real_time/threads:72                    28.04µ ± 4%   18.52µ ± 19%  -33.97% (p=0.000 n=24+30)
Ostream_number/3/real_time/threads:1                     187.0n ± 1%   187.0n ±  0%        ~ (p=0.943 n=24+30)
Ostream_number/3/real_time/threads:72                    20.25µ ± 7%   19.80µ ±  7%        ~ (p=0.473 n=24+30)
geomean                                                  8.260µ        3.463µ        -58.08%

dvyukov · 2023-11-27T06:36:41Z

This caused the following error on bots.

@ldionne perhaps classic_locale_imp_ should be constinit?

https://lab.llvm.org/buildbot/#/builders/168/builds/17053

==1639930==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x7f9576845828 at pc 0x7f95767c189d bp 0x7fff76544cb0 sp 0x7fff76544ca8
WRITE of size 8 at 0x7f9576845828 thread T0
    #0 0x7f95767c189c in __shared_count /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11
    #1 0x7f95767c189c in facet /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__locale:150:11
    #2 0x7f95767c189c in std::__1::locale::__imp::__imp(unsigned long) /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:169:7
    #3 0x7f95767c4382 in __emplace<unsigned int> /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__utility/no_destroy.h:41:19
    #4 0x7f95767c4382 in operator() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:554:40
    #5 0x7f95767c4382 in std::__1::locale::classic() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:552:69
    #6 0x7f95767cae5d in __global /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:561:33
    #7 0x7f95767cae5d in std::__1::locale::locale() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:575:39
    #8 0x7f95767309f2 in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::basic_streambuf() /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/streambuf:314:35
    #9 0x7f9576762283 in __stdinbuf /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/std_stream.h:76:21
    #10 0x7f9576762283 in std::__1::DoIOSInit::DoIOSInit() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:124:59
    #11 0x7f9576762e23 in std::__1::ios_base::Init::Init() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:161:22
    #12 0x7f9576766450 in __cxx_global_var_init /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream_init.h:2:31
    #13 0x7f9576766450 in _GLOBAL__I_000100 /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp
    #14 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #15 0x7f9576a0f427  (/lib64/ld-linux-x86-64.so.2+0x5427) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #16 0x7f9576a26e0f  (/lib64/ld-linux-x86-64.so.2+0x1ce0f) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

0x7f9576845828 is located 8 bytes inside of global variable 'std::__1::locale::__imp::classic_locale_imp_' defined in '/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp' (0x7f9576845820) of size 312
  registered at:
    #0 0x557b4c0dae06 in __asan_register_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x557b4c0dbf29 in __asan_register_elf_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

SUMMARY: AddressSanitizer: initialization-order-fiasco /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11 in __shared_count
Shadow bytes around the buggy address:
  0x7f9576845580: 00 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845600: f9 f9 f9 f9 00 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x7f9576845680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845780: 00 00 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9
=>0x7f9576845800: 00 f9 f9 f9 f6[f6]f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845880: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845900: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6

ldionne · 2023-11-27T16:04:22Z

This caused the following error on bots.

@ldionne perhaps classic_locale_imp_ should be constinit?

https://lab.llvm.org/buildbot/#/builders/168/builds/17053

==1639930==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x7f9576845828 at pc 0x7f95767c189d bp 0x7fff76544cb0 sp 0x7fff76544ca8
WRITE of size 8 at 0x7f9576845828 thread T0
    #0 0x7f95767c189c in __shared_count /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11
    #1 0x7f95767c189c in facet /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__locale:150:11
    #2 0x7f95767c189c in std::__1::locale::__imp::__imp(unsigned long) /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:169:7
    #3 0x7f95767c4382 in __emplace<unsigned int> /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__utility/no_destroy.h:41:19
    #4 0x7f95767c4382 in operator() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:554:40
    #5 0x7f95767c4382 in std::__1::locale::classic() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:552:69
    #6 0x7f95767cae5d in __global /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:561:33
    #7 0x7f95767cae5d in std::__1::locale::locale() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:575:39
    #8 0x7f95767309f2 in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::basic_streambuf() /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/streambuf:314:35
    #9 0x7f9576762283 in __stdinbuf /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/std_stream.h:76:21
    #10 0x7f9576762283 in std::__1::DoIOSInit::DoIOSInit() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:124:59
    #11 0x7f9576762e23 in std::__1::ios_base::Init::Init() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:161:22
    #12 0x7f9576766450 in __cxx_global_var_init /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream_init.h:2:31
    #13 0x7f9576766450 in _GLOBAL__I_000100 /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp
    #14 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #15 0x7f9576a0f427  (/lib64/ld-linux-x86-64.so.2+0x5427) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #16 0x7f9576a26e0f  (/lib64/ld-linux-x86-64.so.2+0x1ce0f) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

0x7f9576845828 is located 8 bytes inside of global variable 'std::__1::locale::__imp::classic_locale_imp_' defined in '/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp' (0x7f9576845820) of size 312
  registered at:
    #0 0x557b4c0dae06 in __asan_register_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x557b4c0dbf29 in __asan_register_elf_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

SUMMARY: AddressSanitizer: initialization-order-fiasco /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11 in __shared_count
Shadow bytes around the buggy address:
  0x7f9576845580: 00 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845600: f9 f9 f9 f9 00 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x7f9576845680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845780: 00 00 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9
=>0x7f9576845800: 00 f9 f9 f9 f6[f6]f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845880: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845900: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6

Thanks for the heads up. See #73533

Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio This reverts commit f8afc53.

kstoimenov · 2023-11-27T20:26:16Z

@dvyukov FYI...

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. This is a re-application of f8afc53 (aka PR llvm#72112) which was reverted in 4e0c48b because it broke the sanitizer builds due to an initialization order fiasco. This issue has now been fixed by ensuring that the locale is constinit'ed. Co-authored-by: Louis Dionne <ldionne.2@gmail.com>

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. This is a re-application of f8afc53 (aka PR #72112) which was reverted in 4e0c48b because it broke the sanitizer builds due to an initialization order fiasco. This issue has now been fixed by ensuring that the locale is constinit'ed. Co-authored-by: Dmitry Vyukov <dvyukov@google.com>

dvyukov requested a review from a team as a code owner November 13, 2023 12:32

llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Nov 13, 2023

dvyukov assigned philnik777 Nov 13, 2023

dvyukov mentioned this pull request Nov 13, 2023

[libc++] Speed up classic locale #70631

Closed

mordante self-requested a review November 14, 2023 09:22

mordante reviewed Nov 14, 2023

View reviewed changes

mordante requested changes Nov 15, 2023

View reviewed changes

libcxx/src/locale.cpp Outdated Show resolved Hide resolved

libcxx/src/locale.cpp Outdated Show resolved Hide resolved

dvyukov commented Nov 16, 2023

View reviewed changes

libcxx/src/locale.cpp Outdated Show resolved Hide resolved

libcxx/src/locale.cpp Outdated Show resolved Hide resolved

ldionne requested changes Nov 16, 2023

View reviewed changes

libcxx/src/locale.cpp Outdated Show resolved Hide resolved

dvyukov and others added 4 commits November 23, 2023 11:12

add more comments

8199841

Adjust after rebase

905907b

Implement suggestion without needing atomics

3a4039d

ldionne force-pushed the dvyukov-locale3 branch from c38f931 to 3a4039d Compare November 23, 2023 16:24

ldionne reviewed Nov 23, 2023

View reviewed changes

libcxx/src/locale.cpp Show resolved Hide resolved

ldionne reviewed Nov 23, 2023

View reviewed changes

libcxx/include/__utility/no_destroy.h Show resolved Hide resolved

dvyukov commented Nov 23, 2023

View reviewed changes

libcxx/include/__locale Show resolved Hide resolved

ldionne added 2 commits November 23, 2023 14:02

Fix c++03 issue

8e64a98

Use union to avoid type punning

ecce059

ldionne approved these changes Nov 23, 2023

View reviewed changes

mordante reviewed Nov 23, 2023

View reviewed changes

libcxx/benchmarks/stringstream.bench.cpp Outdated Show resolved Hide resolved

libcxx/include/__utility/no_destroy.h Outdated Show resolved Hide resolved

libcxx/include/CMakeLists.txt Show resolved Hide resolved

ldionne added 4 commits November 23, 2023 17:07

Apply formatting changes

fd1d4ec

Use proper variable names in benchmark

185b2b9

Make mutex guarding calls to global() static

5d17dec

Address review comments

47e346c

ldionne reviewed Nov 23, 2023

View reviewed changes

libcxx/benchmarks/stringstream.bench.cpp Show resolved Hide resolved

mordante approved these changes Nov 25, 2023

View reviewed changes

dvyukov merged commit f8afc53 into llvm:main Nov 27, 2023
36 of 39 checks passed

kstoimenov added a commit that referenced this pull request Nov 27, 2023

Revert "[libc++] Speed up classic locale (#72112)"

4e0c48b

Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio This reverts commit f8afc53.

ldionne mentioned this pull request Nov 28, 2023

[libc++] Speed up classic locale (take 2) #73533

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[libc++] Speed up classic locale #72112

[libc++] Speed up classic locale #72112

dvyukov commented Nov 13, 2023

llvmbot commented Nov 13, 2023

dvyukov commented Nov 13, 2023

mordante left a comment

dvyukov left a comment

ldionne commented Nov 23, 2023

github-actions bot commented Nov 23, 2023 •

edited

dvyukov commented Nov 23, 2023

ldionne left a comment

mordante left a comment

mordante left a comment

dvyukov commented Nov 27, 2023

dvyukov commented Nov 27, 2023

dvyukov commented Nov 27, 2023

ldionne commented Nov 27, 2023

kstoimenov commented Nov 27, 2023

[libc++] Speed up classic locale #72112

[libc++] Speed up classic locale #72112

Conversation

dvyukov commented Nov 13, 2023

llvmbot commented Nov 13, 2023

dvyukov commented Nov 13, 2023

mordante left a comment

Choose a reason for hiding this comment

dvyukov left a comment

Choose a reason for hiding this comment

ldionne commented Nov 23, 2023

github-actions bot commented Nov 23, 2023 • edited

dvyukov commented Nov 23, 2023

ldionne left a comment

Choose a reason for hiding this comment

mordante left a comment

Choose a reason for hiding this comment

mordante left a comment

Choose a reason for hiding this comment

dvyukov commented Nov 27, 2023

dvyukov commented Nov 27, 2023

dvyukov commented Nov 27, 2023

ldionne commented Nov 27, 2023

kstoimenov commented Nov 27, 2023

github-actions bot commented Nov 23, 2023 •

edited