
Opportunity to use BTS instruction instead of cmpxchg #37322

Open
davidbolvansky opened this issue Jun 28, 2018 · 4 comments
Labels: backend:X86, bugzilla (issues migrated from bugzilla)

Comments

davidbolvansky (Collaborator) commented Jun 28, 2018

Bugzilla Link: 37974
Version: trunk
OS: Linux
CC: @topperc, @efriedma-quic, @hfinkel, @LebedevRI, @RKSimon, @rotateright

Extended Description

Hello,

Code:

bool foo(int *n, int bit) {
  unsigned mask = (1u << bit);
  return (__sync_fetch_and_or (n, mask) & mask) != 0;
}

Clang -O3:

foo(int*, int):
  mov edx, 1
  mov ecx, esi
  shl edx, cl
  mov eax, dword ptr [rdi]
.LBB0_1: 
  mov ecx, eax
  or ecx, edx
  lock cmpxchg dword ptr [rdi], ecx
  jne .LBB0_1
  test eax, edx
  setne al
  ret

GCC 7+ generates code using the BTS instruction:

foo(int*, int):
  lock bts DWORD PTR [rdi], esi
  setc al
  ret

For more code examples, see https://godbolt.org/g/28nkRu
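
The same single-bit pattern with fetch-and-and and fetch-and-xor (a sketch for illustration, not part of the reduced test case above; the function names are made up) should admit the analogous lowering to lock btr and lock btc:

bool clear_bit(int *n, int bit) {
  unsigned mask = (1u << bit);
  // old value of the bit would land in CF via lock btr, no cmpxchg loop needed
  return (__sync_fetch_and_and (n, ~mask) & mask) != 0;
}

bool flip_bit(int *n, int bit) {
  unsigned mask = (1u << bit);
  // old value of the bit would land in CF via lock btc
  return (__sync_fetch_and_xor (n, mask) & mask) != 0;
}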

LebedevRI (Member) commented:

Just as a comment, https://reviews.llvm.org/D48606#1144023 suggested that memory versions of BT* should not be used.

topperc (Collaborator) commented Jun 28, 2018

We also don't optimize the single-bit immediate-mask version:

bool foo(int *n, int bit) {
  unsigned mask = 0x80;
  return (__sync_fetch_and_or (n, mask) & mask) != 0;
}

topperc (Collaborator) commented Jun 28, 2018

Even though the memory forms of BTS/BTR/BTC are flows of more than 10 µops, they are probably still worthwhile for this atomic case because we can remove the entire loop around the cmpxchg.

Unfortunately, this is tricky to implement. AtomicExpandPass currently turns every atomicrmw or/and/xor that needs the previous value into a binop + cmpxchg loop before the X86 SelectionDAG runs. We could try to disable this for the single-bit case, but if the shift instruction and the atomicrmw are in separate basic blocks, we won't be able to find the bit position during basic-block-at-a-time instruction selection.

We almost need an IR construct that expresses atomic bit-test-and-set/clear/complement as a single operation.
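
For illustration (my own sketch, not from the original report), one shape where the shift and the atomicrmw naturally end up in different basic blocks is a spinlock-style loop: the mask computation sits in the preheader while the atomicrmw is inside the loop body.

void spin_set(int *n, int bit) {
  unsigned mask = (1u << bit);                        // shl stays outside the loop
  while ((__sync_fetch_and_or (n, mask) & mask) != 0)
    ;                                                 // atomicrmw or + test in the loop body
}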

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
dr-m commented Oct 29, 2022

Much of this seems to work in GCC 12.2.0 as well as in clang++-15. For GCC there is a related ticket: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

I noticed a missed optimization in both g++-12 and clang++-15: some operations involving bit 31 degrade to loops around lock cmpxchg. I compiled the following with "-c -O2" (AMD64) or "-c -O2 -m32 -march=i686" (IA-32):

#include <atomic>
#include <cstdint>
template<uint32_t b>
void lock_bts(std::atomic<uint32_t> &a) { while (!(a.fetch_or(b) & b)); }
template<uint32_t b>
void lock_btr(std::atomic<uint32_t> &a) { while (a.fetch_and(~b) & b); }
template<uint32_t b>
void lock_btc(std::atomic<uint32_t> &a) { while (a.fetch_xor(b) & b); }
template void lock_bts<1U<<30>(std::atomic<uint32_t> &a);
template void lock_btr<1U<<30>(std::atomic<uint32_t> &a);
template void lock_btc<1U<<30>(std::atomic<uint32_t> &a);
// bug: uses lock cmpxchg
template void lock_bts<1U<<31>(std::atomic<uint32_t> &a);
template void lock_btr<1U<<31>(std::atomic<uint32_t> &a);
template void lock_btc<1U<<31>(std::atomic<uint32_t> &a);
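
A minimal non-template reproduction of the bit-31 case (my own sketch, assuming the non-loop form behaves the same way as the loops above):

#include <atomic>
#include <cstdint>

bool set_sign_bit(std::atomic<uint32_t> &a) {
  constexpr uint32_t b = 1u << 31;
  // expected: lock bts + setc; for bit 31 this reportedly falls back to a lock cmpxchg loop
  return (a.fetch_or(b) & b) != 0;
}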
