Description
This is related to #60632
We've noticed that when dealing with partially demanded or short vectors, the values in the undemanded elements can cause performance drops in some fp instructions (most notably divps, but also sqrtps/dpps), even with DAZ/FTZ enabled. This has been noticed most on btver2 targets, but I expect there are other CPUs that can be affected in other ways.
Sometimes this appears to be caused by values that would raise fp-exceptions (fdivzero etc., even if they've been disabled); other times it's just because the values are particularly large or poorly canonicalized. Basically, if the element's bits don't represent a typical float value, then it seems some weaker fdiv units are likely to drop to a slower execution path.
Pulling out exact examples is proving to be tricky, but something like:
define <2 x float> @fdiv_post_shuffle(<2 x float> %a0, <2 x float> %a1) {
  %d = fdiv <2 x float> %a0, %a1
  %s = shufflevector <2 x float> %d, <2 x float> poison, <2 x i32> <i32 1, i32 0>
  ret <2 x float> %s
}
fdiv_post_shuffle:
  vdivps %xmm1, %xmm0, %xmm0
  vpermilps $225, %xmm0, %xmm0 # xmm0 = xmm0[1,0,2,3]
  retq
would be better if actually performed as something like:
fdiv_pre_shuffle:
  vpermilps $17, %xmm0, %xmm0 # xmm0 = xmm0[1,0,1,0]
  vpermilps $17, %xmm1, %xmm1 # xmm1 = xmm1[1,0,1,0]
  vdivps %xmm1, %xmm0, %xmm0
  retq
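
To make the difference between the two sequences concrete, here is a rough lane-level model in Python. It is only a sketch and not a measurement: the helper names (vpermilps, divps) are hypothetical stand-ins for the instructions, and the zeros placed in the upper two lanes are an assumed example of undemanded contents (in practice those lanes are undefined). It shows that the post-shuffle form divides whatever happens to sit in lanes 2 and 3, while the pre-shuffle form fills those lanes with copies of the demanded elements before dividing.

```python
import math

def vpermilps(imm, src):
    """Model the 128-bit immediate form: result lane i = src[(imm >> 2*i) & 3]."""
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]

def divps(num, den):
    """Model a 4-lane divide and record lanes that hit a divide-by-zero case."""
    out, flagged = [], []
    for i, (n, d) in enumerate(zip(num, den)):
        if d == 0.0:
            flagged.append(i)
            if n == 0.0:
                out.append(math.nan)  # 0/0 is an invalid operation
            else:
                sign = math.copysign(1.0, n) * math.copysign(1.0, d)
                out.append(sign * math.inf)
        else:
            out.append(n / d)
    return out, flagged

# Lanes 0-1 hold the <2 x float> arguments; lanes 2-3 are undemanded.
# ASSUMPTION: zeros in the upper lanes, chosen only for illustration.
xmm0 = [6.0, 8.0, 0.0, 0.0]  # %a0
xmm1 = [2.0, 4.0, 0.0, 0.0]  # %a1

# fdiv_post_shuffle: divide first, then swap lanes 0 and 1 (imm 225).
post_div, post_flagged = divps(xmm0, xmm1)
post = vpermilps(225, post_div)

# fdiv_pre_shuffle: splat the demanded lanes (imm 17), then divide.
pre, pre_flagged = divps(vpermilps(17, xmm0), vpermilps(17, xmm1))

print("post:", post, "exceptional lanes:", post_flagged)
print("pre: ", pre, "exceptional lanes:", pre_flagged)
```

Both forms agree on the two demanded lanes; only the pre-shuffle form keeps every lane of the divide on ordinary values, at the cost of one extra shuffle.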