Optimize __builtin_parity

|  |  |
| --- | --- |
| Bugzilla Link | [1983](https://llvm.org/bz1983) |
| Resolution | FIXED |
| Resolved on | Oct 10, 2020 12:12 |
| Version | trunk |
| OS | Linux |
| CC | @topperc,@gnzlbg,@MaskRay,@RKSimon,@rotateright,@stoklund |

## Extended Description 
Spun off of Bug 1979, because I don't want to pollute that bug with a lot of irrelevant discussion.


> Eli: some random unrelated thoughts :).  If you're interested in parity, I
> wonder if llvm-gcc generates efficient code for __builtin_parity.  Could you
> check?

It appears llvm-gcc doesn't support __builtin_parity... at least the version at http://llvm.org/demo/index.cgi (I don't have a trunk build).

LLVM currently generates bad code for llvm.ctpop.i32(x) & 1; it just generates the ctpop and ands it with 1, which is really slow for cpu's without a popcount instruction.


> It would also be useful to teach instcombine about various idioms for parity,
> turning it into llvm.popct(x)&1

What exactly are common idioms for parity? The only idioms that would be easy to detect are popcount & 1 and maybe the pure shift+xor method.


I've dome some rough testing of various ways of computing parity using a benchmark that looks like the testcase for this bug.

For x86, something like the following is probably best; it appears to be as fast as a lookup table.

static unsigned parity2(unsigned x) {
  int result;
  asm ("movl    %1, %%ecx\n"
       "shrl    $16, %%ecx\n"
       "xorl    %1, %%ecx\n"
       "xorb    %%ch, %%cl\n"
       "setnp   %%cl\n"
       "movzbl  %%cl, %0\n"
        :"=r"(result)    /* output */
        :"g" (x)         /* input */
        :"%ecx"         /* clobbered register */
       );
  return result;
}

I don't know how difficult it would be to make LLVM generate something like that, though.

Besides that, the fastest method is a lookup table; however, a full lookup table is at least 256 bytes.  Without a lookup table, the method from my testcase is the fastest. The method from http://graphics.stanford.edu/~seander/bithacks.html#ParityParallel is a little slower. Purely using xor+shift is a little slower than that (it adds up to being about twice as slow as the asm or table lookup).

That said, the way I'm benchmarking is not at all representative of real-world code, so my results might be a bit off.

The version of gcc I have converts __builtin_parity into a call to __paritysi2, a method in libgcc which uses a lookup table; however, I think that's changed  (http://www.archivum.info/gcc-patches@gcc.gnu.org/2007-02/msg00999.html).

Is there actually any real-world use for 32-bit parity, though?  I just looked at it because I was curious...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize __builtin_parity #2355

Extended Description

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development


Bugzilla Link	1983
Resolution	FIXED
Resolved on	Oct 10, 2020 12:12
Version	trunk
OS	Linux
CC	@topperc,@gnzlbg,@MaskRay,@RKSimon,@rotateright,@stoklund

Optimize __builtin_parity #2355

Description

Extended Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions