__cxa_guard_*() functions non-compliant on ARM and big endian platforms #9

Closed
mdempsky opened this Issue Jun 3, 2013 · 10 comments

Projects

None yet

4 participants

@mdempsky
Collaborator
mdempsky commented Jun 3, 2013

Section 2.8 of the Itanium C++ ABI says "The size of the guard variable is 64 bits. The first byte (i.e. the byte at the address of the full variable) shall contain the value 0 prior to initialization of the associated variable, and 1 after initialization is complete."

On non-ARM platforms, libcxxrt uses the upper most byte of the 64-bit guard value to indicate whether initialization is complete. But on big endian platforms, the upper most byte is the last byte of the guard variable.

Also, for ARM platforms, section 3.2.3 of the ARM C++ ABI describes that the least significant bit of the 32-bit guard variable should be used to indicate whether the guard has been successfully initialized. However, libcxxrt compares the full 32-bits for equality against 0x80000000 to indicate successful initialization.

(Tangentially, using (1<<32) on a 32-bit platform triggers undefined behavior according to the C++, because 1<<32 overflows a signed integer. See C++11 section 5.8 paragraph 2: "The value of E1 << E2 is E1 left-shifted E2 bit positions; [...] Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^{E2} is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.".)

@davidchisnall
Collaborator

On 3 Jun 2013, at 04:53, Matthew Dempsky notifications@github.com wrote:

Section 2.8 of the Itanium C++ ABI says "The size of the guard variable is 64 bits. The first byte (i.e. the byte at the address of the full variable) shall contain the value 0 prior to initialization of the associated variable, and 1 after initialization is complete."

On non-ARM platforms, libcxxrt uses the upper most byte of the 64-bit guard value to indicate whether initialization is complete. But on big endian platforms, the upper most byte is the last byte of the guard variable.

I believe this only matters when using two C++ runtime libraries in the same program, which is something that is decidedly nontrivial to achieve. I'd happily accept patches that fix it, but it's not high up my todo list.

Also, for ARM platforms, section 3.2.3 of the ARM C++ ABI describes that the least significant bit of the 32-bit guard variable should be used to indicate whether the guard has been successfully initialized. However, libcxxrt compares the full 32-bits for equality against 0x80000000 to indicate successful initialization.

Hmm. The code here was wrong in several ways. I believe the new version should be correct.

(Tangentially, using (1<<32) on a 32-bit platform triggers undefined behavior according to the C++, because 1<<32 overflows a signed integer. See C++11 section 5.8 paragraph 2: "The value of E1 << E2 is E1 left-shifted E2 bit positions; [...] Otherwise, if E1 has a signed type and non-negative value, and E1 × 2^{E2} is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.".)

I think you mean 1<<31 here.

David

@mdempsky
Collaborator
mdempsky commented Jun 3, 2013

I believe this only matters when using two C++ runtime libraries in the same program,

No, the compiler is allowed to emit checks to see if the first byte is set, and if so, it can skip calling __cxa_guard_acquire(), and both GCC 4.2 and Clang 3.3 do this. (E.g., see lines 1142 through 1171 of http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/ItaniumCXXABI.cpp?revision=182870&view=markup)

If __cxa_guard_acquire() uses the first byte as part of its lock, then there's a risk that racing threads will misinterpret that lock byte as indicating initialization has completed when it hasn't.

I think you mean 1<<31 here.

Doh, yes, I did. Sorry about that! :)

@mdempsky
Collaborator
mdempsky commented Jun 3, 2013

Hmm. The code here was wrong in several ways. I believe the new version should be correct.

Just reviewed your ARM changes; looks good to me!

@davidchisnall
Collaborator

On 3 Jun 2013, at 16:43, Matthew Dempsky notifications@github.com wrote:

No, the compiler is allowed to emit checks to see if the first byte is set, and if so, it can skip calling __cxa_guard_acquire(), and both GCC 4.2 and Clang 3.3 do this. (E.g., see lines 1142 through 1171 of http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/CodeGen/ItaniumCXXABI.cpp?revision=182870&view=markup)

If __cxa_guard_acquire() uses the first byte as part of its lock, then there's a risk that racing threads will misinterpret that lock byte as indicating initialization has completed when it hasn't.

Ah, good point. I think the ARM version is actually cleaner than the not-ARM version, and I've appended a patch that should unify the two implementations. I'd appreciate it if you and Chris could review it. The detection of endian is a bit ugly, but GCC only started defining endian macros consistently around 4.5-4.6, and we need to be able to build with GCC 4.2.1.

(If anyone cares enough, then with recent compilers we can weaken the memory barriers slightly, but I think the overhead of the full barrier is likely to be lost in the noise most of the time).

David

diff --git a/src/atomic.h b/src/atomic.h
index a3d2710..cfaff3f 100644
--- a/src/atomic.h
+++ b/src/atomic.h
@@ -28,13 +28,3 @@
(__sync_synchronize(), *addr)
#endif

-#if __has_builtin(__c11_atomic_store)
-#define ATOMIC_STORE(addr, value)\

  • c11_atomic_store((_Atomic(__typeof(addr)))addr, value, __ATOMIC_RELEASE)
    -#else
    -#define ATOMIC_STORE(addr, value)\
  • do {\
  •   *addr = value;\
    
  •   (__sync_synchronize(), *addr);\
    
  • } while (0)
    -#endif
    diff --git a/src/guard.cc b/src/guard.cc
    index 72a53e5..ebfabb4 100644
    --- a/src/guard.cc
    +++ b/src/guard.cc
    @@ -47,22 +47,48 @@
    #include <assert.h>
    #include "atomic.h"

-#ifdef arm
-// ARM ABI - 32-bit guards.
+// Older GCC doesn't define LITTLE_ENDIAN
+#ifndef LITTLE_ENDIAN

  • // If BYTE_ORDER is defined, use that instead
    +# ifdef BYTE_ORDER
    +# if BYTE_ORDER == ORDER_LITTLE_ENDIAN
    +# define LITTLE_ENDIAN
    +# endif
  • // x86 and ARM are the most common little-endian CPUs, so let's have a
  • // special case for them (ARM is already special cased). Assume everything
  • // else is big endian.
    +# elif defined(x86_64) || defined(__i386)
    +# define __LITTLE_ENDIAN

    +# endif
    +#endif

/*

  • The least significant bit of the guard variable indicates that the object
  • has been initialised, the most significant bit is used for a spinlock.
    */
    -static const uint32_t LOCKED = ((uint32_t)1) << 31;
    +#ifdef arm
    +// ARM ABI - 32-bit guards.
    +typedef uint32_t guard_t;
    +static const uint32_t LOCKED = ((guard_t)1) << 31;
    static const uint32_t INITIALISED = 1;
    +#else
    +typedef uint64_t guard_t;
    +# if defined(LITTLE_ENDIAN)
    +static const guard_t LOCKED = ((guard_t)1) << 63;
    +static const guard_t INITIALISED = 1;
    +# else
    +static const guard_t LOCKED = 1;
    +static const guard_t INITIALISED = ((guard_t)1) << 63;
    +# endif
    +#endif

/**

  • Acquires a lock on a guard, returning 0 if the object has already been
  • initialised, and 1 if it has not. If the object is already constructed then
  • this function just needs to read a byte from memory and return.
    */
    -extern "C" int __cxa_guard_acquire(volatile uint32_t *guard_object)
    +extern "C" int __cxa_guard_acquire(volatile guard_t *guard_object)
    {
    // Not an atomic read, doesn't establish a happens-before relationship, but
    // if one is already established and we end up seeing an initialised state
    @@ -105,7 +131,7 @@ extern "C" int __cxa_guard_acquire(volatile uint32_t *guard_object)
  • Releases the lock without marking the object as initialised. This function
  • is called if initialising a static causes an exception to be thrown.
    */
    -extern "C" void **cxa_guard_abort(uint32_t guard_object)
    +extern "C" void __cxa_guard_abort(guard_t guard_object)
    {
    __attribute
    ((unused))
    bool reset = __sync_bool_compare_and_swap(guard_object, LOCKED, 0);
    @@ -115,7 +141,7 @@ extern "C" void __cxa_guard_abort(uint32_t *guard_object)
  • Releases the guard and marks the object as initialised. This function is
  • called after successful initialisation of a static.
    */
    -extern "C" void **cxa_guard_release(uint32_t guard_object)
    +extern "C" void __cxa_guard_release(guard_t guard_object)
    {
    __attribute
    ((unused))
    bool reset = __sync_bool_compare_and_swap(guard_object, LOCKED, INITIALISED);
    @@ -123,74 +149,3 @@ extern "C" void __cxa_guard_release(uint32_t *guard_object)
    }

-#else

-// Itanium ABI: 64-bit guards

-/**

  • * Returns a pointer to the low 32 bits in a 64-bit value, respecting the
  • * platform's byte order.
  • */
    -static int32_t *low_32_bits(volatile int64_t *ptr)
    -{
  • int32_t low= (int32_t)ptr;
  • // Test if the machine is big endian - constant propagation at compile
  • // time should eliminate this completely.
  • int one = 1;
  • if ((char)&one != 1)
  • {
  •   low++;
    
  • }
  • return low;
    -}

-/**

  • * Acquires a lock on a guard, returning 0 if the object has already been
  • * initialised, and 1 if it has not. If the object is already constructed then
  • * this function just needs to read a byte from memory and return.
  • */
    -extern "C" int __cxa_guard_acquire(volatile int64_t *guard_object)
    -{
  • char first_byte = (*guard_object) >> 56;
  • if (1 == first_byte) { return 0; }
  • int32_t *lock = low_32_bits(guard_object);
  • // Simple spin lock using the low 32 bits. We assume that concurrent
  • // attempts to initialize statics are very rare, so we don't need to
  • // optimise for the case where we have lots of threads trying to acquire
  • // the lock at the same time.
  • while (!__sync_bool_compare_and_swap_4(lock, 0, 1))
  • {
  •   if (1 == ((*guard_object) >> 56))
    
  •   {
    
  •       break;
    
  •   }
    
  •   sched_yield();
    
  • }
  • // We have to test the guard again, in case another thread has performed
  • // the initialisation while we were trying to acquire the lock.
  • first_byte = (*guard_object) >> 56;
  • return (1 != first_byte);
    -}

-/**

  • * Releases the lock without marking the object as initialised. This function
  • * is called if initialising a static causes an exception to be thrown.
  • */
    -extern "C" void __cxa_guard_abort(int64_t *guard_object)
    -{
  • int32_t *lock = low_32_bits(guard_object);
  • ATOMIC_STORE(lock, 0);
    -}
    -/**
  • * Releases the guard and marks the object as initialised. This function is
  • * called after successful initialisation of a static.
  • */
    -extern "C" void __cxa_guard_release(int64_t *guard_object)
    -{
  • // Set the first byte to 1
  • // Note: We don't do an atomic load here because getting a stale value
  • // is not a problem in the current implementation.
  • ATOMIC_STORE(guard_object, (*guard_object | ((int64_t)1) << 56));
  • __cxa_guard_abort(guard_object);
    -}

-#endif

@mdempsky
Collaborator
mdempsky commented Jun 3, 2013
On Mon, Jun 3, 2013 at 10:04 AM, davidchisnall <notifications@github.com>wrote:

> Ah, good point. I think the ARM version is actually cleaner than the
> not-ARM version, and I've appended a patch that should unify the two
> implementations. I'd appreciate it if you and Chris could review it. The
> detection of endian is a bit ugly, but GCC only started defining endian
> macros consistently around 4.5-4.6, and we need to be able to build with
> GCC 4.2.1.

Argh, annoying.  There really doesn't seem to be a good portable solution
to this, so I'm happy with whatever you use, and in OpenBSD we can patch it
to use our appropriate endian detection.

POSIX Issue 8 is going to add <endian.h> (
http://austingroupbugs.net/view.php?id=162#c665), so something like

    static const guard_t INITIALISED = le64toh(1);

would work without endian specific checks, but I don't think anybody's
providing <endian.h> yet.  (It's currently provided as <sys/endian.h> on
most systems, and OpenBSD uses letoh64() instead of le64toh() at the
moment... sigh.)

    
Something like:
    
static guard_t initval() {
guard_t r = 0;
*(char *)&r = 1;
return r;
}
    
static const guard_t INITIALISED = initval();
    
is actually portable too, and clang++ can optimize this, but g++ 4.2 is
stuck doing a run-time global initializer.
    
Something not-100% kosher (but supported by G++ and Clang) would be:
    
static const union { char bytes[8]; guard_t g; } INITIALISED_BYTES = {{1}};
#define INITIALISED INITIALISED_BYTES.g
    
but G++ can't inline that either.
    
Blargh!
    
> (If anyone cares enough, then with recent compilers we can weaken the
> memory barriers slightly, but I think the overhead of the full barrier is
> likely to be lost in the noise most of the time).

Noted, though at the moment my focus is on correctness and simplicity
rather than performance, so as long as x86 performance is fast enough for
you, I'm not worried about onerous barriers.

    
> + // x86 and ARM are the most common little-endian CPUs, so let's have a
> + // special case for them (ARM is already special cased). Assume
> everything
> + // else is big endian.
> +# elif defined(x86_64) || defined(__i386)
> +# define __LITTLE_ENDIAN

> +# endif

FWIW, OpenBSD's little endian platforms are alpha, amd64, arm, i386, ia64,
mips64el, sh, and vax.

    
Also tangentially, I think a lot of our architectures can't do 64-bit CAS,
so sync_*_compare_and_swap() is going to fail to compile on those
architectures. But no need to worry about that just yet. :)
    
> +# if defined(__LITTLE_ENDIAN
)
> +static const guard_t LOCKED = ((guard_t)1) << 63;
> +static const guard_t INITIALISED = 1;
> +# else
> +static const guard_t LOCKED = 1;
> +static const guard_t INITIALISED = ((guard_t)1) << 63;
> +# endif
    
Itanium ABI says the first byte should be set to 1, so in the second case
you want to set INITIALISED to ((guard_t)1) << 56.
    
> -extern "C" void __cxa_guard_abort(uint32_t *guard_object)
> +extern "C" void __cxa_guard_abort(guard_t *guard_object)

Perhaps worth making this "volatile guard_t *guard_object"?  (Also, for
__cxa_guard_release.)


Thanks for tackling this!!
@mdempsky
Collaborator
mdempsky commented Jun 3, 2013

[Wow, github loves mangling email comments, and I can't seem to repair it even with editing... still can't get this to look okay, but it's readable now at least.]

On Mon, Jun 3, 2013 at 10:04 AM, davidchisnall notifications@github.comwrote:

Ah, good point. I think the ARM version is actually cleaner than the
not-ARM version, and I've appended a patch that should unify the two
implementations. I'd appreciate it if you and Chris could review it. The
detection of endian is a bit ugly, but GCC only started defining endian
macros consistently around 4.5-4.6, and we need to be able to build with
GCC 4.2.1.

Argh, annoying. There really doesn't seem to be a good portable solution
to this, so I'm happy with whatever you use, and in OpenBSD we can patch it
to use our appropriate endian detection.

POSIX Issue 8 is going to add endian.h (
http://austingroupbugs.net/view.php?id=162#c665), so something like

static const guard_t INITIALISED = le64toh(1);

would work without endian specific checks, but I don't think anybody's
providing <endian.h> yet. (It's currently provided as <sys/endian.h> on
most systems, and OpenBSD uses letoh64() instead of le64toh() at the
moment... sigh.)

Something like:

static guard_t initval() {
    guard_t r = 0;
    *(char *)&r = 1;
    return r;
}

static const guard_t INITIALISED = initval();

is actually portable too, and clang++ can optimize this, but g++ 4.2 is
stuck doing a run-time global initializer.

Something not-100% kosher (but supported by G++ and Clang) would be:

static const union { char bytes[8]; guard_t g; } INITIALISED_BYTES = {{1}};
#define INITIALISED INITIALISED_BYTES.g

but G++ can't inline that either.

Blargh!

(If anyone cares enough, then with recent compilers we can weaken the
memory barriers slightly, but I think the overhead of the full barrier is
likely to be lost in the noise most of the time).

Noted, though at the moment my focus is on correctness and simplicity
rather than performance, so as long as x86 performance is fast enough for
you, I'm not worried about onerous barriers.

  • // x86 and ARM are the most common little-endian CPUs, so let's have a
  • // special case for them (ARM is already special cased). Assume everything
  • // else is big endian.
    +# elif defined(x86_64) || defined(__i386)
    +# define __LITTLE_ENDIAN

    +# endif

FWIW, OpenBSD's little endian platforms are alpha, amd64, arm, i386, ia64,
mips64el, sh, and vax.

Also tangentially, I think a lot of our architectures can't do 64-bit CAS,
so _sync*_compare_and_swap() is going to fail to compile on those
architectures. But no need to worry about that just yet. :)

+# if defined(LITTLE_ENDIAN)
+static const guard_t LOCKED = ((guard_t)1) << 63;
+static const guard_t INITIALISED = 1;
+# else
+static const guard_t LOCKED = 1;
+static const guard_t INITIALISED = ((guard_t)1) << 63;
+# endif

Itanium ABI says the first byte should be set to 1, so in the second case
you want to set INITIALISED to ((guard_t)1) << 56.

-extern "C" void __cxa_guard_abort(uint32_t *guard_object)
+extern "C" void __cxa_guard_abort(guard_t *guard_object)

Perhaps worth making this "volatile guard_t *guard_object"? (Also, for
__cxa_guard_release.)

@davidchisnall
Collaborator

On 3 Jun 2013, at 21:45, Matthew Dempsky notifications@github.com wrote:

[Wow, github loves mangling email comments, and I can't seem to repair it even with editing... still can't get this to look okay, but it's readable now at least.]

On Mon, Jun 3, 2013 at 10:04 AM, davidchisnall notifications@github.comwrote:

Ah, good point. I think the ARM version is actually cleaner than the
not-ARM version, and I've appended a patch that should unify the two
implementations. I'd appreciate it if you and Chris could review it. The
detection of endian is a bit ugly, but GCC only started defining endian
macros consistently around 4.5-4.6, and we need to be able to build with
GCC 4.2.1.

Argh, annoying. There really doesn't seem to be a good portable solution
to this, so I'm happy with whatever you use, and in OpenBSD we can patch it
to use our appropriate endian detection.

Okay.

POSIX Issue 8 is going to add (
http://austingroupbugs.net/view.php?id=162#c665), so something like

static const guard_t INITIALISED = le64toh(1);

would work without endian specific checks, but I don't think anybody's
providing yet. (It's currently provided as on
most systems, and OpenBSD uses letoh64() instead of le64toh() at the
moment... sigh.)

If we could guaranteed endian.h, then we could just use the BIG_ENDIAN and LITTLE_ENDIAN macros that it is supposed to define.
Unfortunately, we can't even add a __has_include() to check for it because the compiler that doesn't sensibly set these macros for the target is also the compiler that doesn't support __has_include().

I can't wait until we can stop caring about gcc 4.2.1...

Something like:

static guard_t initval() {
guard_t r = 0;
*(char *)&r = 1;
return r;
}

static const guard_t INITIALISED = initval();

is actually portable too, and clang++ can optimize this, but g++ 4.2 is
stuck doing a run-time global initializer.

Something not-100% kosher (but supported by G++ and Clang) would be:

static const union { char bytes[8]; guard_t g; } INITIALISED_BYTES = {{1}};
#define INITIALISED INITIALISED_BYTES.g

but G++ can't inline that either.

Blargh!

I think I'll stick with the macro-selected constants, and hopefully it isn't too much effort for you to add a -D__LITTLE_ENDIAN__ where required.

(If anyone cares enough, then with recent compilers we can weaken the
memory barriers slightly, but I think the overhead of the full barrier is
likely to be lost in the noise most of the time).

Noted, though at the moment my focus is on correctness and simplicity
rather than performance, so as long as x86 performance is fast enough for
you, I'm not worried about onerous barriers.

Likewise. These locks should almost always be uncontended, so the strong barriers are likely to be cheap as the cache line will already be in the exclusive state. If anyone can provide a real-world example where they're a bottleneck, then I'll care about optimising them...

• // x86 and ARM are the most common little-endian CPUs, so let's have a
• // special case for them (ARM is already special cased). Assume everything
• // else is big endian. +# elif defined(__x86_64) || defined(__i386) +# define LITTLE_ENDIAN +# endif
FWIW, OpenBSD's little endian platforms are alpha, amd64, arm, i386, ia64,
mips64el, sh, and vax.

If you can give me a list of the compiler predefined macros that are used for detecting these, then I'd be happy to include them.

Also tangentially, I think a lot of our architectures can't do 64-bit CAS,
so _sync*_compare_and_swap() is going to fail to compile on those
architectures. But no need to worry about that just yet. :)

It should just call out to a function in libgcc that implements them, although OpenBSD is missing some of these. Disabling preemption and reenabling it for a bit may be a performance problem though.

I guess a check if sizeof(sigatomic_t) < sizeof(int64_t) should detect the problematic platforms. For the releases, we only really need an atomic store (and it could be two smaller atomic stores if we first set the initialised flag then release the lock), I'm mainly doing the CAS for extra error checking.

+# if defined(LITTLE_ENDIAN)
+static const guard_t LOCKED = ((guard_t)1) << 63;
+static const guard_t INITIALISED = 1;
+# else
+static const guard_t LOCKED = 1;
+static const guard_t INITIALISED = ((guard_t)1) << 63;
+# endif

Itanium ABI says the first byte should be set to 1, so in the second case
you want to set INITIALISED to ((guard_t)1) << 56.

Ooops, yes.

-extern "C" void __cxa_guard_abort(uint32_t *guard_object)
+extern "C" void __cxa_guard_abort(guard_t *guard_object)

Perhaps worth making this "volatile guard_t *guard_object"? (Also, for
__cxa_guard_release.)

It shouldn't be needed if we're doing _sync* things for all of the accesses, but it also won't hurt.

David

@jsonn
Collaborator
jsonn commented Jul 5, 2013

On Mon, Jun 03, 2013 at 10:04:00AM -0700, davidchisnall wrote:

/*

  • The least significant bit of the guard variable indicates that the object
  • has been initialised, the most significant bit is used for a spinlock.
    */
    -static const uint32_t LOCKED = ((uint32_t)1) << 31;
    +#ifdef arm
    +// ARM ABI - 32-bit guards.
    +typedef uint32_t guard_t;
    +static const uint32_t LOCKED = ((guard_t)1) << 31;
    static const uint32_t INITIALISED = 1;
    +#else
    +typedef uint64_t guard_t;
    +# if defined(LITTLE_ENDIAN)
    +static const guard_t LOCKED = ((guard_t)1) << 63;
    +static const guard_t INITIALISED = 1;
    +# else
    +static const guard_t LOCKED = 1;
    +static const guard_t INITIALISED = ((guard_t)1) << 63;
    +# endif

The problem with this definition and the version that ended up being
committed is that it fails if 64bit atomic ops are not supported.

Joerg

@cbergstrom
Collaborator

Joerg - When would atomic ops not be supported? At some point it's time to let some things go :P

@jsonn
Collaborator
jsonn commented Jul 6, 2013

The problem is not atomic ops in general, but 64bit atomic ops. Those are not available on many (older) 32bit architectures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment