Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[libc++] Speed up classic locale #72112

Merged
merged 10 commits into from Nov 27, 2023
Merged

[libc++] Speed up classic locale #72112

merged 10 commits into from Nov 27, 2023

Conversation

dvyukov
Copy link
Collaborator

@dvyukov dvyukov commented Nov 13, 2023

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads.

Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads.

                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)

@dvyukov dvyukov requested a review from a team as a code owner November 13, 2023 12:32
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Nov 13, 2023
@llvmbot
Copy link
Collaborator

llvmbot commented Nov 13, 2023

@llvm/pr-subscribers-libcxx

Author: Dmitry Vyukov (dvyukov)

Changes

Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads.

Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads.

                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)

Full diff: https://github.com/llvm/llvm-project/pull/72112.diff

2 Files Affected:

  • (modified) libcxx/benchmarks/stringstream.bench.cpp (+59-3)
  • (modified) libcxx/src/locale.cpp (+34-14)
diff --git a/libcxx/benchmarks/stringstream.bench.cpp b/libcxx/benchmarks/stringstream.bench.cpp
index ea602557ccd770e..c10ee3a8cc5b83c 100644
--- a/libcxx/benchmarks/stringstream.bench.cpp
+++ b/libcxx/benchmarks/stringstream.bench.cpp
@@ -1,11 +1,12 @@
 #include "benchmark/benchmark.h"
 #include "test_macros.h"
 
+#include <mutex>
 #include <sstream>
 
 TEST_NOINLINE double istream_numbers();
 
-double istream_numbers() {
+double istream_numbers(std::locale* l) {
   const char* a[] = {"-6  69 -71  2.4882e-02 -100 101 -2.00005 5000000 -50000000",
                      "-25 71   7 -9.3262e+01 -100 101 -2.00005 5000000 -50000000",
                      "-14 53  46 -6.7026e-02 -100 101 -2.00005 5000000 -50000000"};
@@ -14,17 +15,72 @@ double istream_numbers() {
   double f1 = 0.0, f2 = 0.0, q = 0.0;
   for (int i = 0; i < 3; i++) {
     std::istringstream s(a[i]);
+    if (l)
+      s.imbue(*l);
     s >> a1 >> a2 >> a3 >> f1 >> a4 >> a5 >> f2 >> a6 >> a7;
     q += (a1 + a2 + a3 + a4 + a5 + a6 + a7 + f1 + f2) / 1000000;
   }
   return q;
 }
 
+struct LocaleSelector {
+  std::locale* imbue;
+  std::locale old;
+
+  LocaleSelector(benchmark::State& state) {
+    static std::mutex mu;
+    std::lock_guard l(mu);
+    switch (state.range(0)) {
+    case 0: {
+      old   = std::locale::global(std::locale::classic());
+      imbue = nullptr;
+      break;
+    }
+    case 1: {
+      old = std::locale::global(std::locale::classic());
+      thread_local std::locale l("en_US.UTF-8");
+      imbue = &l;
+      break;
+    }
+    case 2: {
+      old = std::locale::global(std::locale::classic());
+      static std::locale l("en_US.UTF-8");
+      imbue = &l;
+      break;
+    }
+    case 3: {
+      old   = std::locale::global(std::locale("en_US.UTF-8"));
+      imbue = nullptr;
+      break;
+    }
+    }
+  }
+
+  ~LocaleSelector() {
+    static std::mutex mu;
+    std::lock_guard l(mu);
+    std::locale::global(old);
+  }
+};
+
 static void BM_Istream_numbers(benchmark::State& state) {
+  LocaleSelector sel(state);
   double i = 0;
   while (state.KeepRunning())
-    benchmark::DoNotOptimize(i += istream_numbers());
+    benchmark::DoNotOptimize(i += istream_numbers(sel.imbue));
+}
+BENCHMARK(BM_Istream_numbers)->DenseRange(0, 3)->UseRealTime()->Threads(1)->ThreadPerCpu();
+
+static void BM_Ostream_number(benchmark::State& state) {
+  LocaleSelector sel(state);
+  while (state.KeepRunning()) {
+    std::ostringstream ss;
+    if (sel.imbue)
+      ss.imbue(*sel.imbue);
+    ss << 0;
+    benchmark::DoNotOptimize(ss.str().c_str());
+  }
 }
+BENCHMARK(BM_Ostream_number)->DenseRange(0, 3)->UseRealTime()->Threads(1)->ThreadPerCpu();
 
-BENCHMARK(BM_Istream_numbers)->RangeMultiplier(2)->Range(1024, 4096);
 BENCHMARK_MAIN();
diff --git a/libcxx/src/locale.cpp b/libcxx/src/locale.cpp
index c37e091dcc4671b..58b2d6c33606ba9 100644
--- a/libcxx/src/locale.cpp
+++ b/libcxx/src/locale.cpp
@@ -8,6 +8,7 @@
 
 #include <__utility/unreachable.h>
 #include <algorithm>
+#include <atomic>
 #include <clocale>
 #include <codecvt>
 #include <cstddef>
@@ -80,7 +81,7 @@ locale_t __cloc() {
 
 namespace {
 
-struct release
+struct releaser
 {
     void operator()(locale::facet* p) {p->__release_shared();}
 };
@@ -154,12 +155,16 @@ class _LIBCPP_HIDDEN locale::__imp
         {return static_cast<size_t>(id) < facets_.size() && facets_[static_cast<size_t>(id)];}
     const locale::facet* use_facet(long id) const;
 
+    void acquire();
+    void release();
+
     static const locale& make_classic();
     static       locale& make_global();
 private:
     void install(facet* f, long id);
     template <class F> void install(F* f) {install(f, f->id.__get());}
     template <class F> void install_from(const __imp& other);
+    static std::atomic<__imp*> classic_;
 };
 
 locale::__imp::__imp(size_t refs)
@@ -501,7 +506,7 @@ locale::__imp::__imp(const __imp& other, facet* f, long id)
       name_("*")
 {
     f->__add_shared();
-    unique_ptr<facet, release> hold(f);
+    unique_ptr<facet, releaser> hold(f);
     facets_ = other.facets_;
     for (unsigned i = 0; i < other.facets_.size(); ++i)
         if (facets_[i])
@@ -520,7 +525,7 @@ void
 locale::__imp::install(facet* f, long id)
 {
     f->__add_shared();
-    unique_ptr<facet, release> hold(f);
+    unique_ptr<facet, releaser> hold(f);
     if (static_cast<size_t>(id) >= facets_.size())
         facets_.resize(static_cast<size_t>(id+1));
     if (facets_[static_cast<size_t>(id)])
@@ -538,6 +543,8 @@ locale::__imp::use_facet(long id) const
 
 // locale
 
+std::atomic<locale::__imp*> locale::__imp::classic_;
+
 const locale&
 locale::__imp::make_classic()
 {
@@ -545,9 +552,22 @@ locale::__imp::make_classic()
     alignas(locale) static std::byte buf[sizeof(locale)];
     locale* c = reinterpret_cast<locale*>(&buf);
     c->__locale_ = &make<__imp>(1u);
+    classic_.store(c->__locale_, std::memory_order_relaxed);
     return *c;
 }
 
+void locale::__imp::acquire()
+{
+    if (this != classic_.load(std::memory_order_relaxed))
+        __add_shared();
+}
+
+void locale::__imp::release()
+{
+    if (this != classic_.load(std::memory_order_relaxed))
+        __release_shared();
+}
+
 const locale&
 locale::classic()
 {
@@ -574,25 +594,25 @@ locale::__global()
 locale::locale() noexcept
     : __locale_(__global().__locale_)
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& l) noexcept
     : __locale_(l.__locale_)
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::~locale()
 {
-    __locale_->__release_shared();
+    __locale_->release();
 }
 
 const locale&
 locale::operator=(const locale& other) noexcept
 {
-    other.__locale_->__add_shared();
-    __locale_->__release_shared();
+    other.__locale_->acquire();
+    __locale_->release();
     __locale_ = other.__locale_;
     return *this;
 }
@@ -601,32 +621,32 @@ locale::locale(const char* name)
     : __locale_(name ? new __imp(name)
                      : (__throw_runtime_error("locale constructed with null"), nullptr))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const string& name)
     : __locale_(new __imp(name))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const char* name, category c)
     : __locale_(name ? new __imp(*other.__locale_, name, c)
                      : (__throw_runtime_error("locale constructed with null"), nullptr))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const string& name, category c)
     : __locale_(new __imp(*other.__locale_, name, c))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale::locale(const locale& other, const locale& one, category c)
     : __locale_(new __imp(*other.__locale_, *one.__locale_, c))
 {
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 string
@@ -642,7 +662,7 @@ locale::__install_ctor(const locale& other, facet* f, long id)
         __locale_ = new __imp(*other.__locale_, f, id);
     else
         __locale_ = other.__locale_;
-    __locale_->__add_shared();
+    __locale_->acquire();
 }
 
 locale

@dvyukov
Copy link
Collaborator Author

dvyukov commented Nov 13, 2023

This is an alternative version of #70631 (without inlining of ctor/dtor).

Copy link
Member

@mordante mordante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! I had a short look and see some things I want to look at later this week in more detail.

libcxx/src/locale.cpp Outdated Show resolved Hide resolved
libcxx/src/locale.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator Author

@dvyukov dvyukov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take another look.

libcxx/src/locale.cpp Outdated Show resolved Hide resolved
libcxx/src/locale.cpp Outdated Show resolved Hide resolved
libcxx/src/locale.cpp Outdated Show resolved Hide resolved
dvyukov and others added 4 commits November 23, 2023 11:12
Locale objects use atomic reference counting, which may be very expensive
in parallel applications. The classic locale is used by default by all
streams and can be very contended. But it's never destroyed, so the reference
counting is also completely pointless on the classic locale. Currently ~70%
of time in the parallel stringstream benchmarks is spent in locale ctor/dtor.
And the execution radically slows down with more threads.

Avoid reference counting on the classic locale. With this change parallel
benchmarks start to scale with threads.

                              │   baseline   │    optimized                            │
                              │    sec/op    │    sec/op      vs base                  │
Istream_numbers/0/threads:1      4.672µ ± 0%   4.419µ ± 0%     -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72   539.817µ ± 0%   9.842µ ± 1%    -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1      4.890µ ± 0%   4.750µ ± 0%     -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72     66.44µ ± 1%   10.14µ ± 1%    -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1      4.888µ ± 0%   4.746µ ± 0%     -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72     494.8µ ± 0%   410.2µ ± 1%    -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1      4.697µ ± 0%   4.695µ ± 5%          ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72     421.5µ ± 7%   421.9µ ± 9%          ~ (p=0.665 n=30)
Ostream_number/0/threads:1       183.0n ± 0%   141.0n ± 2%    -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72    24196.5n ± 1%   343.5n ± 3%    -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1       250.0n ± 0%   196.0n ± 2%    -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72    16260.5n ± 0%   407.0n ± 2%    -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1       254.0n ± 0%   196.0n ± 1%    -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72      28.49µ ± 1%   18.89µ ± 5%    -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1       185.0n ± 0%   185.0n ± 0%      0.00% (p=0.017 n=30)
Ostream_number/3/threads:72      19.38µ ± 4%   19.33µ ± 5%          ~ (p=0.425 n=30)
@ldionne
Copy link
Member

ldionne commented Nov 23, 2023

I took the freedom of applying the suggestion that doesn't require atomics at all, and rebasing on top of my other refactoring to save you the trouble of conflict resolution. @mordante @dvyukov What do you folks think about this?

Copy link

github-actions bot commented Nov 23, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@dvyukov
Copy link
Collaborator Author

dvyukov commented Nov 23, 2023

I agree this is better. The change looks good to me.
+1 for merging

Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM in the current state.

Copy link
Member

@mordante mordante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see we can do it without the atomics!

In general happy, but there is one issue and two nits.

libcxx/benchmarks/stringstream.bench.cpp Outdated Show resolved Hide resolved
libcxx/include/__utility/no_destroy.h Outdated Show resolved Hide resolved
libcxx/include/CMakeLists.txt Show resolved Hide resolved
Copy link
Member

@mordante mordante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Please address @ldionne's comments before merging.

@dvyukov
Copy link
Collaborator Author

dvyukov commented Nov 27, 2023

Should I squash and merge this? Or what's the next step?

There are some test failures, but they don't look related:

Failed Tests (1):
  llvm-libc++-shared.cfg.in :: std/input.output/string.streams/stringstream/stringstream.members/gcount.pass.cpp

Timed Out Tests (2):
  llvm-libc++-shared.cfg.in :: std/containers/sequences/deque/deque.modifiers/insert_iter_iter.pass.cpp
  llvm-libc++-shared.cfg.in :: std/numerics/rand/rand.dist/rand.dist.bern/rand.dist.bern.negbin/eval.pass.cpp

@dvyukov
Copy link
Collaborator Author

dvyukov commented Nov 27, 2023

I've re-run ninja check-cxx on the latest version and it passed. The change also shows comparable performance improvement.

I am going to merge it now. Let's see if it sticks.

                                       │ ./bench/stream.baseline.res │         bench/stream.latest.res         │
                                       │           sec/op            │    sec/op     vs base                   │
Istream_numbers/0/real_time/threads:1                    4.686µ ± 0%   4.482µ ±  0%   -4.35% (p=0.000 n=24+30)
Istream_numbers/0/real_time/threads:72                 339.712µ ± 4%   9.567µ ±  1%  -97.18% (p=0.000 n=24+30)
Istream_numbers/1/real_time/threads:1                    4.900µ ± 0%   4.810µ ±  0%   -1.84% (p=0.000 n=24+30)
Istream_numbers/1/real_time/threads:72                   49.31µ ± 3%   10.12µ ±  1%  -79.47% (p=0.000 n=24+30)
Istream_numbers/2/real_time/threads:1                    4.902µ ± 0%   4.831µ ±  7%   -1.45% (p=0.037 n=24+30)
Istream_numbers/2/real_time/threads:72                   416.6µ ± 9%   424.0µ ± 15%        ~ (p=0.993 n=24+30)
Istream_numbers/3/real_time/threads:1                    4.716µ ± 0%   4.764µ ±  0%   +1.01% (p=0.000 n=24+30)
Istream_numbers/3/real_time/threads:72                   419.7µ ± 6%   400.9µ ±  5%        ~ (p=0.116 n=24+30)
Ostream_number/0/real_time/threads:1                     184.0n ± 1%   144.0n ±  7%  -21.74% (p=0.000 n=24+30)
Ostream_number/0/real_time/threads:72                  17279.0n ± 4%   339.0n ±  0%  -98.04% (p=0.000 n=24+30)
Ostream_number/1/real_time/threads:1                     253.5n ± 1%   195.0n ±  1%  -23.08% (p=0.000 n=24+30)
Ostream_number/1/real_time/threads:72                  13957.5n ± 2%   412.0n ±  0%  -97.05% (p=0.000 n=24+30)
Ostream_number/2/real_time/threads:1                     253.0n ± 1%   195.0n ±  1%  -22.92% (p=0.000 n=24+30)
Ostream_number/2/real_time/threads:72                    28.04µ ± 4%   18.52µ ± 19%  -33.97% (p=0.000 n=24+30)
Ostream_number/3/real_time/threads:1                     187.0n ± 1%   187.0n ±  0%        ~ (p=0.943 n=24+30)
Ostream_number/3/real_time/threads:72                    20.25µ ± 7%   19.80µ ±  7%        ~ (p=0.473 n=24+30)
geomean                                                  8.260µ        3.463µ        -58.08%

@dvyukov dvyukov merged commit f8afc53 into llvm:main Nov 27, 2023
36 of 39 checks passed
@dvyukov
Copy link
Collaborator Author

dvyukov commented Nov 27, 2023

This caused the following error on bots.

@ldionne perhaps classic_locale_imp_ should be constinit?

https://lab.llvm.org/buildbot/#/builders/168/builds/17053

==1639930==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x7f9576845828 at pc 0x7f95767c189d bp 0x7fff76544cb0 sp 0x7fff76544ca8
WRITE of size 8 at 0x7f9576845828 thread T0
    #0 0x7f95767c189c in __shared_count /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11
    #1 0x7f95767c189c in facet /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__locale:150:11
    #2 0x7f95767c189c in std::__1::locale::__imp::__imp(unsigned long) /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:169:7
    #3 0x7f95767c4382 in __emplace<unsigned int> /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__utility/no_destroy.h:41:19
    #4 0x7f95767c4382 in operator() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:554:40
    #5 0x7f95767c4382 in std::__1::locale::classic() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:552:69
    #6 0x7f95767cae5d in __global /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:561:33
    #7 0x7f95767cae5d in std::__1::locale::locale() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:575:39
    #8 0x7f95767309f2 in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::basic_streambuf() /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/streambuf:314:35
    #9 0x7f9576762283 in __stdinbuf /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/std_stream.h:76:21
    #10 0x7f9576762283 in std::__1::DoIOSInit::DoIOSInit() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:124:59
    #11 0x7f9576762e23 in std::__1::ios_base::Init::Init() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:161:22
    #12 0x7f9576766450 in __cxx_global_var_init /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream_init.h:2:31
    #13 0x7f9576766450 in _GLOBAL__I_000100 /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp
    #14 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #15 0x7f9576a0f427  (/lib64/ld-linux-x86-64.so.2+0x5427) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #16 0x7f9576a26e0f  (/lib64/ld-linux-x86-64.so.2+0x1ce0f) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

0x7f9576845828 is located 8 bytes inside of global variable 'std::__1::locale::__imp::classic_locale_imp_' defined in '/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp' (0x7f9576845820) of size 312
  registered at:
    #0 0x557b4c0dae06 in __asan_register_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x557b4c0dbf29 in __asan_register_elf_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

SUMMARY: AddressSanitizer: initialization-order-fiasco /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11 in __shared_count
Shadow bytes around the buggy address:
  0x7f9576845580: 00 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845600: f9 f9 f9 f9 00 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x7f9576845680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845780: 00 00 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9
=>0x7f9576845800: 00 f9 f9 f9 f6[f6]f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845880: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845900: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6

@ldionne
Copy link
Member

ldionne commented Nov 27, 2023

This caused the following error on bots.

@ldionne perhaps classic_locale_imp_ should be constinit?

https://lab.llvm.org/buildbot/#/builders/168/builds/17053

==1639930==ERROR: AddressSanitizer: initialization-order-fiasco on address 0x7f9576845828 at pc 0x7f95767c189d bp 0x7fff76544cb0 sp 0x7fff76544ca8
WRITE of size 8 at 0x7f9576845828 thread T0
    #0 0x7f95767c189c in __shared_count /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11
    #1 0x7f95767c189c in facet /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__locale:150:11
    #2 0x7f95767c189c in std::__1::locale::__imp::__imp(unsigned long) /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:169:7
    #3 0x7f95767c4382 in __emplace<unsigned int> /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__utility/no_destroy.h:41:19
    #4 0x7f95767c4382 in operator() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:554:40
    #5 0x7f95767c4382 in std::__1::locale::classic() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:552:69
    #6 0x7f95767cae5d in __global /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:561:33
    #7 0x7f95767cae5d in std::__1::locale::locale() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp:575:39
    #8 0x7f95767309f2 in std::__1::basic_streambuf<char, std::__1::char_traits<char>>::basic_streambuf() /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/streambuf:314:35
    #9 0x7f9576762283 in __stdinbuf /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/std_stream.h:76:21
    #10 0x7f9576762283 in std::__1::DoIOSInit::DoIOSInit() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:124:59
    #11 0x7f9576762e23 in std::__1::ios_base::Init::Init() /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp:161:22
    #12 0x7f9576766450 in __cxx_global_var_init /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream_init.h:2:31
    #13 0x7f9576766450 in _GLOBAL__I_000100 /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/iostream.cpp
    #14 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #15 0x7f9576a0f427  (/lib64/ld-linux-x86-64.so.2+0x5427) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)
    #16 0x7f9576a26e0f  (/lib64/ld-linux-x86-64.so.2+0x1ce0f) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

0x7f9576845828 is located 8 bytes inside of global variable 'std::__1::locale::__imp::classic_locale_imp_' defined in '/b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/libcxx/src/locale.cpp' (0x7f9576845820) of size 312
  registered at:
    #0 0x557b4c0dae06 in __asan_register_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:369:3
    #1 0x557b4c0dbf29 in __asan_register_elf_globals /b/sanitizer-x86_64-linux-bootstrap-asan/build/llvm-project/compiler-rt/lib/asan/asan_globals.cpp:352:3
    #2 0x7f9576a0f33d  (/lib64/ld-linux-x86-64.so.2+0x533d) (BuildId: cc23bad777ae59d26be0445659cd1c94244d6bc7)

SUMMARY: AddressSanitizer: initialization-order-fiasco /b/sanitizer-x86_64-linux-bootstrap-asan/build/libcxx_build_asan/include/c++/v1/__memory/shared_ptr.h:160:11 in __shared_count
Shadow bytes around the buggy address:
  0x7f9576845580: 00 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845600: f9 f9 f9 f9 00 f9 f9 f9 01 f9 f9 f9 00 f9 f9 f9
  0x7f9576845680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x7f9576845780: 00 00 00 00 00 00 00 00 00 00 00 00 00 f9 f9 f9
=>0x7f9576845800: 00 f9 f9 f9 f6[f6]f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845880: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6
  0x7f9576845900: f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6

Thanks for the heads up. See #73533

kstoimenov added a commit that referenced this pull request Nov 27, 2023
@kstoimenov
Copy link
Contributor

@dvyukov FYI...

ldionne added a commit to ldionne/llvm-project that referenced this pull request Nov 28, 2023
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.

Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.

This is a re-application of f8afc53 (aka PR llvm#72112) which was
reverted in 4e0c48b because it broke the sanitizer builds due
to an initialization order fiasco. This issue has now been fixed by
ensuring that the locale is constinit'ed.

Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
ldionne added a commit that referenced this pull request Nov 29, 2023
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.

Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.

This is a re-application of f8afc53 (aka PR #72112) which was
reverted in 4e0c48b because it broke the sanitizer builds due
to an initialization order fiasco. This issue has now been fixed by
ensuring that the locale is constinit'ed.

Co-authored-by: Dmitry Vyukov <dvyukov@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants