Odd inline asm code generation with pointless memory operand #56789

Closed
torvalds opened this issue Jul 28, 2022 · 9 comments
Labels: backend:X86, clang (Clang issues not falling into any other category)

Comments

torvalds commented Jul 28, 2022

So I noticed this while trying to improve on the kernel bit finding functions.

On x86-64, we use tzcnt in inline asm, encoded as rep bsf for legacy reasons. The inline asm looks like this:

static __always_inline unsigned long __ffs(unsigned long word)
{
        asm("rep; bsf %1,%0"
                : "=r" (word)
                : "rm" (word));
        return word;
}

because the instruction can take either a register or memory operand as input, and obviously a register output.

However, this seems to confuse Clang no end, and it generates completely nonsensical code for it. The code happens to work, but it's unbelievably stupid:

unsigned long test(unsigned long arg)
{
        return __ffs(arg);
}

will generate

test: 
        movq    %rdi, -8(%rsp)
        rep ; bsfq    -8(%rsp), %rax
        retq

instead of the much simpler and more obvious

test:
        rep; bsf %rdi,%rax
        ret

that GCC generates.

There is obviously nothing special about the rep bsf (or tzcnt) instruction.

@EugeneZelenko added the clang label (Clang issues not falling into any other category) and removed the new issue label on Jul 28, 2022

torvalds commented Jul 28, 2022

Side note: I know about __builtin_ctzl(). We could use it in the kernel. But it seems to generate suboptimal code in another way: it generates bsf instead of tzcnt (aka rep bsf).

And because bsf writes to its destination register conditionally, it's not a good instruction to use.

GCC screws up the builtin too: when using __builtin_ctzl(), it seems to generate

xorl	%eax, %eax
rep bsfq	%rdi, %rax
cltq
ret

for some unfathomable reason. Neither the xor nor the cltq make any sense to me.

So using the builtin might fix the "pointlessly use a memory operand" issue, but only causes other issues.
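
For reference, a minimal sketch of the builtin-based variant, in the same style as the __ffs above (the name __ffs_builtin is illustrative and not a kernel API; __always_inline is the kernel macro already used above). The zero-input caveat still applies, since __builtin_ctzl(0) is undefined:

static __always_inline unsigned long __ffs_builtin(unsigned long word)
{
        /* Caveat from above: undefined for word == 0, just like bsf/tzcnt. */
        return __builtin_ctzl(word);
}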

nikic commented Jul 29, 2022

Clang generates tzcnt for __builtin_ctzl() with -mbmi: https://c.godbolt.org/z/cc7qY9Kz9

Not sure whether there are any disadvantages to always using it, even if the zero-behavior cannot be relied on.

RKSimon commented Jul 29, 2022

@nikic #34191 probably needs to be addressed first for that case, but this one appears to be about removing a store + (folded) load because it's hidden inside an inline asm block:

define i64 @test(i64 noundef %arg) {
entry:
  %0 = tail call i64 asm "rep; bsf $1,$0", "=r,rm,~{dirflag},~{fpsr},~{flags}"(i64 %arg)
  ret i64 %0
}

aaronpuchert commented

> Neither the xor nor the cltq make any sense to me.

My understanding is that the xor works around a false dependency hardware bug, see #33216. No idea about cltq though.

> Clang generates tzcnt for __builtin_ctzl() with -mbmi.

Or, if you don't want to add global flags, use an attribute like __attribute__((target("bmi"))).
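
A rough sketch of that per-function approach (the function name ctzl_bmi is illustrative), assuming a GCC/Clang-style target attribute:

__attribute__((target("bmi")))
static unsigned long ctzl_bmi(unsigned long word)
{
        /* With BMI enabled for just this function, the builtin can lower
         * to tzcnt without passing -mbmi globally. */
        return __builtin_ctzl(word);
}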

nickdesaulniers commented

> The code happens to work, but it's unbelievably stupid:

Right, this is a duplicate of #20571. Recall our discussion about __builtin_ia32_readeflags_u64; this isn't specific to that sequence, but to all inline asm using "rm" constraints. LLVM pessimizes "rm" to "m". It's on the TODO list to fix this.
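
Until that's fixed, one workaround sketch (an editorial illustration, not a patch from this thread) is to drop the memory alternative so there is nothing to pessimize:

static __always_inline unsigned long __ffs_reg(unsigned long word)
{
        /* "r" instead of "rm": LLVM can no longer force the operand
         * through memory, at the cost of never folding a load on
         * compilers that do handle "rm" well. */
        asm("rep; bsf %1,%0"
                : "=r" (word)
                : "r" (word));
        return word;
}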

torvalds commented Aug 1, 2022

> Clang generates tzcnt for __builtin_ctzl() with -mbmi: https://c.godbolt.org/z/cc7qY9Kz9
>
> Not sure whether there are any disadvantages to always using it, even if the zero-behavior cannot be relied on.

Yeah, a compiler should always generate tzcnt over bsf. There is basically no advantage to generating the old bsf instruction, except for it being one byte shorter if you run it on an old CPU.

Old CPUs will treat the two identically, and new CPUs will perform better with tzcnt (it just writes a different result value in the undefined case).

That's why Intel literally used that odd "rep bsf" encoding: so that you can use the "new" instruction even on old CPUs if you truly are OK with the undefined semantics of bsf with a zero input.

You're supposed to care about the CPUID bit only if you absolutely require the new semantics. And if LLVM currently generates bsf, then clearly that cannot be the case.

(The documented behavior of bsf with a zero input is "undefined", but I think it was "don't write the output register" in most - probably all - cores. And people actually depended on that, which is why TZCNT exists as a new instruction at all, rather than Intel just changing the "undocumented" behavior).
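
For illustration, a hypothetical sketch (not from any real codebase) of the kind of code that relied on that unofficial behavior:

unsigned long first_bit_or_64(unsigned long word)
{
        unsigned long ret = 64;  /* sentinel; relied on to survive word == 0 */
        /* Depends on the undocumented "bsf leaves the destination
         * unchanged on zero input" behavior described above. */
        asm("bsf %1,%0"
                : "+r" (ret)
                : "rm" (word));
        return ret;
}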

phoebewang commented

I put up a patch to always generate the rep prefix for bsf: https://reviews.llvm.org/D130956

nickdesaulniers commented

Closing this as a duplicate of #20571.

We'll cherry-pick https://reviews.llvm.org/D130956 / c2066d1 into the clang-15 release (hopefully; otherwise clang-16 will have it).

It sounds like #33216 is also relevant.

Please let me know if I missed a nuanced issue also described in this thread; just reopen the issue.

@aaronpuchert closed this as not planned (duplicate) on Aug 3, 2022