Odd inline asm code generation with pointless memory operand #56789

Closed
torvalds opened this issue Jul 28, 2022 · 9 comments
Labels: backend:X86, clang (Clang issues not falling into any other category)

Comments

torvalds commented Jul 28, 2022

So I noticed this while trying to improve on the kernel bit finding functions.

On x86-64, we use tzcnt in inline asm, encoded as rep bsf for legacy reasons. The inline asm looks like this:

static __always_inline unsigned long __ffs(unsigned long word)
{
        asm("rep; bsf %1,%0"
                : "=r" (word)
                : "rm" (word));
        return word;
}

because the instruction can take either a register or memory operand as input, and obviously a register output.

However, this seems to confuse Clang no end, and it generates completely nonsensical code for it. The code happens to work, but it's unbelievably stupid:

unsigned long test(unsigned long arg)
{
        return __ffs(arg);
}

will generate

test: 
        movq    %rdi, -8(%rsp)
        rep ; bsfq    -8(%rsp), %rax
        retq

instead of the much simpler and more obvious

test:
        rep; bsf %rdi,%rax
        ret

that GCC generates.

There is obviously nothing special about the rep bsf (or tzcnt) instruction.

@EugeneZelenko added the clang label (Clang issues not falling into any other category) and removed the new issue label on Jul 28, 2022

torvalds commented Jul 28, 2022

Side note: I know about __builtin_ctzl(). We could use it in the kernel. But it seems to generate suboptimal code in another way: it generates bsf instead of tzcnt (aka rep bsf).

And because bsf writes to its destination register conditionally, it's not a good instruction to use.

GCC screws up the builtin too: when using __builtin_ctzl(), it seems to generate

xorl	%eax, %eax
rep bsfq	%rdi, %rax
cltq
ret

for some unfathomable reason. Neither the xor nor the cltq make any sense to me.

So using the builtin might fix the "pointlessly use a memory operand" issue, but only causes other issues.
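
For reference, a minimal sketch of the builtin-based variant, in the same style as the __ffs above (the name __ffs_builtin is illustrative and not a kernel API; __always_inline is the kernel macro already used above). The zero-input caveat still applies, since __builtin_ctzl(0) is undefined:

static __always_inline unsigned long __ffs_builtin(unsigned long word)
{
        /* Caveat from above: undefined for word == 0, just like bsf/tzcnt. */
        return __builtin_ctzl(word);
}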

nikic commented Jul 29, 2022

Clang generates tzcnt for __builtin_ctzl() with -mbmi: https://c.godbolt.org/z/cc7qY9Kz9

Not sure whether there are any disadvantages to always using it, even if the zero-behavior cannot be relied on.

RKSimon commented Jul 29, 2022

@nikic #34191 probably needs to be addressed first for that case, but this one appears to be about removing a store + (folded) load because it's hidden inside an inline asm block:

define i64 @test(i64 noundef %arg) {
entry:
  %0 = tail call i64 asm "rep; bsf $1,$0", "=r,rm,~{dirflag},~{fpsr},~{flags}"(i64 %arg)
  ret i64 %0
}

aaronpuchert commented

> Neither the xor nor the cltq make any sense to me.

My understanding is that the xor works around a false dependency hardware bug, see #33216. No idea about cltq though.

> Clang generates tzcnt for __builtin_ctzl() with -mbmi.

Or, if you don't want to add global flags, use an attribute like __attribute__((target("bmi"))).
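
A rough sketch of that per-function approach (the function name ctzl_bmi is illustrative), assuming a GCC/Clang-style target attribute:

__attribute__((target("bmi")))
static unsigned long ctzl_bmi(unsigned long word)
{
        /* With BMI enabled for just this function, the builtin can lower
         * to tzcnt without passing -mbmi globally. */
        return __builtin_ctzl(word);
}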

nickdesaulniers commented

> The code happens to work, but it's unbelievably stupid:

Right, this is a duplicate of #20571. Recall our discussion about __builtin_ia32_readeflags_u64; this isn't specific to that sequence, but to all inline asm using "rm" constraints. LLVM pessimizes "rm" to "m". It's on the TODO list to fix this.
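
Until that's fixed, one workaround sketch (an editorial illustration, not a patch from this thread) is to drop the memory alternative so there is nothing to pessimize:

static __always_inline unsigned long __ffs_reg(unsigned long word)
{
        /* "r" instead of "rm": LLVM can no longer force the operand
         * through memory, at the cost of never folding a load on
         * compilers that do handle "rm" well. */
        asm("rep; bsf %1,%0"
                : "=r" (word)
                : "r" (word));
        return word;
}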

torvalds commented Aug 1, 2022

> Clang generates tzcnt for __builtin_ctzl() with -mbmi: https://c.godbolt.org/z/cc7qY9Kz9
>
> Not sure whether there are any disadvantages to always using it, even if the zero-behavior cannot be relied on.

Yeah, a compiler should always generate tzcnt over bsf. There is basically no advantage to generating the old bsf instruction, except for it being one byte shorter if you run it on an old CPU.

Old CPUs will treat the two identically, and new CPUs will perform better with tzcnt (it just writes a different result value in the undefined case).

That's why Intel literally used that odd "rep bsf" encoding: so that you can use the "new" instruction even on old CPUs if you truly are OK with the undefined semantics of bsf with a zero input.

You're supposed to care about the CPUID bit only if you absolutely require the new semantics. And if LLVM currently generates bsf, then clearly that cannot be the case.

(The documented behavior of bsf with a zero input is "undefined", but I think it was "don't write the output register" in most - probably all - cores. And people actually depended on that, which is why TZCNT exists as a new instruction at all, rather than Intel just changing the "undocumented" behavior).
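
For illustration, a hypothetical sketch (not from any real codebase) of the kind of code that relied on that unofficial behavior:

unsigned long first_bit_or_64(unsigned long word)
{
        unsigned long ret = 64;  /* sentinel; relied on to survive word == 0 */
        /* Depends on the undocumented "bsf leaves the destination
         * unchanged on zero input" behavior described above. */
        asm("bsf %1,%0"
                : "+r" (ret)
                : "rm" (word));
        return ret;
}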

phoebewang commented

I put up a patch to always generate the rep prefix for bsf: https://reviews.llvm.org/D130956

nickdesaulniers commented

Closing this as a duplicate of #20571.

We'll cherry-pick https://reviews.llvm.org/D130956 / c2066d1 into the clang-15 release (hopefully; otherwise clang-16 will have it).

It sounds like #33216 is also relevant.

Please let me know if I missed a nuanced issue also described in this thread; just reopen the issue.

@aaronpuchert closed this as not planned (duplicate) on Aug 3, 2022