
Opportunity to use BTS instruction instead of cmpxchg #37322

Open
davidbolvansky opened this issue Jun 28, 2018 · 4 comments
Labels: backend:X86, bugzilla (issues migrated from bugzilla)

Comments

davidbolvansky (Collaborator) commented Jun 28, 2018

Bugzilla Link: 37974
Version: trunk
OS: Linux
CC: @topperc, @efriedma-quic, @hfinkel, @LebedevRI, @RKSimon, @rotateright

Extended Description

Hello,

Code:

bool foo(int *n, int bit) {
  unsigned mask = (1u << bit);
  return (__sync_fetch_and_or (n, mask) & mask) != 0;
}

Clang -O3:

foo(int*, int):
  mov edx, 1
  mov ecx, esi
  shl edx, cl
  mov eax, dword ptr [rdi]
.LBB0_1: 
  mov ecx, eax
  or ecx, edx
  lock cmpxchg dword ptr [rdi], ecx
  jne .LBB0_1
  test eax, edx
  setne al
  ret

GCC 7+ generates code using the BTS instruction:

foo(int*, int):
  lock bts DWORD PTR [rdi], esi
  setc al
  ret

For more code examples, see https://godbolt.org/g/28nkRu
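
The same single-bit pattern with fetch-and-and and fetch-and-xor (a sketch for illustration, not part of the reduced test case above; the function names are made up) should admit the analogous lowering to lock btr and lock btc:

bool clear_bit(int *n, int bit) {
  unsigned mask = (1u << bit);
  // old value of the bit would land in CF via lock btr, no cmpxchg loop needed
  return (__sync_fetch_and_and (n, ~mask) & mask) != 0;
}

bool flip_bit(int *n, int bit) {
  unsigned mask = (1u << bit);
  // old value of the bit would land in CF via lock btc
  return (__sync_fetch_and_xor (n, mask) & mask) != 0;
}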

LebedevRI (Member) commented:

Just as a comment, https://reviews.llvm.org/D48606#1144023 suggested that memory versions of BT* should not be used.

topperc (Collaborator) commented Jun 28, 2018

We also don't optimize the single-bit immediate-mask version:

bool foo(int *n, int bit) {
  unsigned mask = 0x80;
  return (__sync_fetch_and_or (n, mask) & mask) != 0;
}

topperc (Collaborator) commented Jun 28, 2018

Even though the memory forms of BTS/BTR/BTC are flows of more than 10 µops, they are probably still worthwhile for this atomic case because we can remove the entire loop around the cmpxchg.

Unfortunately, this is tricky to implement. AtomicExpandPass currently turns every atomicrmw or/and/xor that needs the previous value into a binop + cmpxchg loop before the X86 SelectionDAG runs. We could try to disable this for the single-bit case, but if the shift instruction and the atomicrmw are in separate basic blocks, we won't be able to find the bit position during basic-block-at-a-time instruction selection.

We almost need an IR construct that expresses atomic bit-test-and-set/clear/complement as a single operation.
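
For illustration (my own sketch, not from the original report), one shape where the shift and the atomicrmw naturally end up in different basic blocks is a spinlock-style loop: the mask computation sits in the preheader while the atomicrmw is inside the loop body.

void spin_set(int *n, int bit) {
  unsigned mask = (1u << bit);                        // shl stays outside the loop
  while ((__sync_fetch_and_or (n, mask) & mask) != 0)
    ;                                                 // atomicrmw or + test in the loop body
}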

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
dr-m commented Oct 29, 2022

Much of this seems to work in GCC 12.2.0 as well as in clang++-15. For GCC there is a related ticket: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

I noticed a missed optimization in both g++-12 and clang++-15: some operations involving bit 31 degrade to loops around lock cmpxchg. I compiled the following with "-c -O2" (AMD64) or "-c -O2 -m32 -march=i686" (IA-32):

#include <atomic>
#include <cstdint>
template<uint32_t b>
void lock_bts(std::atomic<uint32_t> &a) { while (!(a.fetch_or(b) & b)); }
template<uint32_t b>
void lock_btr(std::atomic<uint32_t> &a) { while (a.fetch_and(~b) & b); }
template<uint32_t b>
void lock_btc(std::atomic<uint32_t> &a) { while (a.fetch_xor(b) & b); }
template void lock_bts<1U<<30>(std::atomic<uint32_t> &a);
template void lock_btr<1U<<30>(std::atomic<uint32_t> &a);
template void lock_btc<1U<<30>(std::atomic<uint32_t> &a);
// bug: uses lock cmpxchg
template void lock_bts<1U<<31>(std::atomic<uint32_t> &a);
template void lock_btr<1U<<31>(std::atomic<uint32_t> &a);
template void lock_btc<1U<<31>(std::atomic<uint32_t> &a);
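
A minimal non-template reproduction of the bit-31 case (my own sketch, assuming the non-loop form behaves the same way as the loops above):

#include <atomic>
#include <cstdint>

bool set_sign_bit(std::atomic<uint32_t> &a) {
  constexpr uint32_t b = 1u << 31;
  // expected: lock bts + setc; for bit 31 this reportedly falls back to a lock cmpxchg loop
  return (a.fetch_or(b) & b) != 0;
}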
