Skip to content

Optimize __builtin_parity #2355

@efriedma-quic

Description

@efriedma-quic
Bugzilla Link 1983
Resolution FIXED
Resolved on Oct 10, 2020 12:12
Version trunk
OS Linux
CC @topperc,@gnzlbg,@MaskRay,@RKSimon,@rotateright,@stoklund

Extended Description

Spun off of Bug 1979, because I don't want to pollute that bug with a lot of irrelevant discussion.

Eli: some random unrelated thoughts :). If you're interested in parity, I
wonder if llvm-gcc generates efficient code for __builtin_parity. Could you
check?

It appears llvm-gcc doesn't support __builtin_parity... at least the version at http://llvm.org/demo/index.cgi (I don't have a trunk build).

LLVM currently generates bad code for llvm.ctpop.i32(x) & 1; it just generates the ctpop and ands it with 1, which is really slow for cpu's without a popcount instruction.

It would also be useful to teach instcombine about various idioms for parity,
turning it into llvm.popct(x)&1

What exactly are common idioms for parity? The only idioms that would be easy to detect are popcount & 1 and maybe the pure shift+xor method.

I've dome some rough testing of various ways of computing parity using a benchmark that looks like the testcase for this bug.

For x86, something like the following is probably best; it appears to be as fast as a lookup table.

static unsigned parity2(unsigned x) {
int result;
asm ("movl %1, %%ecx\n"
"shrl $16, %%ecx\n"
"xorl %1, %%ecx\n"
"xorb %%ch, %%cl\n"
"setnp %%cl\n"
"movzbl %%cl, %0\n"
:"=r"(result) /* output /
:"g" (x) /
input /
:"%ecx" /
clobbered register */
);
return result;
}

I don't know how difficult it would be to make LLVM generate something like that, though.

Besides that, the fastest method is a lookup table; however, a full lookup table is at least 256 bytes. Without a lookup table, the method from my testcase is the fastest. The method from http://graphics.stanford.edu/~seander/bithacks.html#ParityParallel is a little slower. Purely using xor+shift is a little slower than that (it adds up to being about twice as slow as the asm or table lookup).

That said, the way I'm benchmarking is not at all representative of real-world code, so my results might be a bit off.

The version of gcc I have converts __builtin_parity into a call to __paritysi2, a method in libgcc which uses a lookup table; however, I think that's changed (http://www.archivum.info/gcc-patches@gcc.gnu.org/2007-02/msg00999.html).

Is there actually any real-world use for 32-bit parity, though? I just looked at it because I was curious...

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions