Open
Description
Noticed while triaging #158646
Many of the basic BMI bit operations could easily be performed purely on the predicate (mask) registers, but instead they do a round trip through the GPRs:
inline __mmask8 kblsmsk(__mmask8 x) {
  return x ^ (x - 1);
}
inline __mmask8 kblsr(__mmask8 x) {
  return x & (x - 1);
}
inline __mmask8 kblsi(__mmask8 x) {
  return x & -x;
}
(NOTE: The above are hacky implementations that rely on the mmask types being plain integers in the intrinsics headers.)
(x - 1) can be performed using kadd + allones.
-x might be trickier, but it should be doable as not(x) + 1 (the 1 can be done as allones + kshift).
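To make the rewrites above concrete, here is a minimal scalar sketch that models the 8-bit mask operations on a plain uint8_t and checks the proposed sequences against the integer forms for every possible mask value. All helper names (k_add, k_not, k_allones, k_one and the *_model functions) are hypothetical stand-ins invented for illustration; they are not intrinsics or LLVM APIs, and the actual lowering would be up to the backend.

```cpp
#include <cassert>
#include <cstdint>

using mask8 = std::uint8_t;  // stand-in for __mmask8 (assumption: 8-bit mask)

// Scalar models of the mask instructions (hypothetical helper names).
inline mask8 k_add(mask8 a, mask8 b) { return static_cast<mask8>(a + b); }
inline mask8 k_and(mask8 a, mask8 b) { return static_cast<mask8>(a & b); }
inline mask8 k_xor(mask8 a, mask8 b) { return static_cast<mask8>(a ^ b); }
inline mask8 k_not(mask8 a) { return static_cast<mask8>(~a); }
inline mask8 k_xnor(mask8 a, mask8 b) { return static_cast<mask8>(~(a ^ b)); }
// Shift right by a constant; counts above 7 produce zero.
inline mask8 k_shiftr(mask8 a, unsigned n) {
  return n > 7 ? mask8{0} : static_cast<mask8>(a >> n);
}

// allones: xnor of a register with itself.
inline mask8 k_allones() { return k_xnor(0, 0); }
// The constant 1: allones shifted right by 7.
inline mask8 k_one() { return k_shiftr(k_allones(), 7); }

// x - 1 == x + allones (mod 256)
inline mask8 kblsmsk_model(mask8 x) { return k_xor(x, k_add(x, k_allones())); }
inline mask8 kblsr_model(mask8 x) { return k_and(x, k_add(x, k_allones())); }
// -x == not(x) + 1 (mod 256)
inline mask8 kblsi_model(mask8 x) { return k_and(x, k_add(k_not(x), k_one())); }

// Exhaustively compare the mask-only sequences with the integer forms.
inline void check_all_masks() {
  for (unsigned v = 0; v < 256; ++v) {
    const mask8 x = static_cast<mask8>(v);
    assert(kblsmsk_model(x) == static_cast<mask8>(x ^ (x - 1)));
    assert(kblsr_model(x) == static_cast<mask8>(x & (x - 1)));
    assert(kblsi_model(x) == static_cast<mask8>(x & -x));
  }
}
```

Calling check_all_masks() exercises all 256 inputs, so the two identities are verified for the full 8-bit mask domain rather than for a few samples.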
The BMI2 BZHI op and many of the TBM patterns could be easy to implement as well.
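As a rough illustration of the BZHI case, the sketch below models clearing all mask bits at positions N and above using only mask-style operations: an allones value shifted right by 8 - N, then an AND. It assumes the index N is a compile-time constant, since the mask shift instructions take an immediate count; the names k_shiftr_imm and kbzhi_model are hypothetical and exist only for this example.

```cpp
#include <cstdint>

using mask8 = std::uint8_t;  // stand-in for __mmask8 (assumption: 8-bit mask)

// Scalar model of a mask shift right by an immediate (hypothetical helper).
// Counts above 7 produce zero.
template <unsigned Count>
inline mask8 k_shiftr_imm(mask8 a) {
  return Count > 7 ? mask8{0} : static_cast<mask8>(a >> (Count & 7));
}

// BZHI on a mask: keep bits [0, N), clear bits [N, 8).
// Modelled as x AND (allones >> (8 - N)), with N known at compile time.
template <unsigned N>
inline mask8 kbzhi_model(mask8 x) {
  static_assert(N <= 8, "index must be in [0, 8] for an 8-bit mask");
  const mask8 allones = 0xFF;
  const mask8 low_bits = k_shiftr_imm<8 - N>(allones);
  return static_cast<mask8>(x & low_bits);
}
```

For example, kbzhi_model<3>(0xFF) keeps only the low three bits. A variable index would not map onto this sequence directly, because the shift count here has to be a constant.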
https://clang.godbolt.org/z/f1v636frj
Something like the blsi case doesn't even manage to keep the AND on the predicate masks:
test_blsi(long long vector[8], long long vector[8], long long vector[8], long long vector[8]):
vpcmpeqq %zmm1, %zmm0, %k0
kmovd %k0, %eax
movl %eax, %ecx
negb %cl
andb %al, %cl
kmovd %ecx, %k1
vpandq %zmm1, %zmm0, %zmm0 {%k1}
retq
-->
test_blsi(long long vector[8], long long vector[8], long long vector[8], long long vector[8]):
vpcmpeqq %zmm1, %zmm0, %k0
kmovd %k0, %eax
movl %eax, %ecx
negb %cl
kmovd %ecx, %k1
kandb %k0, %k1, %k1
vpandq %zmm1, %zmm0, %zmm0 {%k1}
retq