Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

Open
RKSimon opened this issue Mar 29, 2019 · 13 comments
Open

[SLP] Missed icmp/fcmp allof/anyof reductions #40657

RKSimon opened this issue Mar 29, 2019 · 13 comments
Assignees
Labels

Comments

@RKSimon
Copy link
Collaborator

@RKSimon RKSimon commented Mar 29, 2019

Bugzilla Link 41312
Version trunk
OS Windows NT
Blocks #49274
CC @alexey-bataev,@dtemirbulatov,@LebedevRI,@MattPD,@rotateright

Extended Description

We should be able to combine these to allof/anyof vector reductions, instead we end up with nested branch trees and scalar selects.

https://gcc.godbolt.org/z/FImIuY

#include <x86intrin.h>

float test_merge_allof_v4sf(__v4sf t) {
if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0) ||
(t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
return 0;
return t[0] + t[1];
}
float test_merge_anyof_v4sf(__v4sf t) {
if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0) ||
(t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
return 0;
return t[0] + t[1];
}

float test_separate_allof_v4sf(__v4sf t) {
if((t[0] < 0.0 && t[1] < 0.0 && t[2] < 0.0 && t[3] < 0.0))
return 0;
if((t[0] > 1.0 && t[1] > 1.0 && t[2] > 1.0 && t[3] > 1.0))
return 0;
return t[0] + t[1];
}
float test_separate_anyof_v4sf(__v4sf t) {
if((t[0] < 0.0 || t[1] < 0.0 || t[2] < 0.0 || t[3] < 0.0))
return 0;
if((t[0] > 1.0 || t[1] > 1.0 || t[2] > 1.0 || t[3] > 1.0))
return 0;
return t[0] + t[1];
}

int test_separate_allof_v4si(__v4si t) {
if((t[0] < 1 && t[1] < 1 && t[2] < 1 && t[3] < 1))
return 0;
if((t[0] > 255 && t[1] > 255 && t[2] > 255 && t[3] > 255))
return 0;
return t[0] + t[1];
}
int test_separate_anyof_v4si(__v4si t) {
if((t[0] < 1 || t[1] < 1 || t[2] < 1 || t[3] < 1))
return 0;
if((t[0] > 255 || t[1] > 255 || t[2] > 255 || t[3] > 255))
return 0;
return t[0] + t[1];
}

@RKSimon
Copy link
Collaborator Author

@RKSimon RKSimon commented Mar 29, 2019

assigned to @rotateright

@LebedevRI
Copy link
Member

@LebedevRI LebedevRI commented May 20, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

@rotateright
Copy link
Contributor

@rotateright rotateright commented May 21, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:
bool test(const unsigned char* input) {
return input[0] != 0xFF &&
input[1] != 0xFF &&
input[2] != 0xFF &&
input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to load any further because that could be unmapped memory. I think we would require the 'dereferenceable' attribute to handle that case.

@LebedevRI
Copy link
Member

@LebedevRI LebedevRI commented May 21, 2019

I also just arrived at this.
https://godbolt.org/z/zIyI0_
I did some benchmark (not of just the snippet, but of the entire code),
and this results in up to ~-2% improvement in some tested cases.
(and a +18% regression in a few cases, so cost model!)

Any pointers?

The more general case:
bool test(const unsigned char* input) {
return input[0] != 0xFF &&
input[1] != 0xFF &&
input[2] != 0xFF &&
input[3] != 0xFF;
}

...is hard/impossible. If the 1st byte is 0xFF, then we are not allowed to
load any further because that could be unmapped memory.

I think we would
require the 'dereferenceable' attribute to handle that case.

Yeah, the C variant isn't well-preserved. The whole load if i32 is legal there,
i just kind of forgot to about it.

@rotateright
Copy link
Contributor

@rotateright rotateright commented Sep 16, 2020

There are multiple things potentially to fix here...but here's one that I can try to fix:

SLP does not actually sort the candidates for a reduction, so the instruction order in IR affects the groupings and what comes out. That's part of why we see reductions for some examples but not others.

@rotateright
Copy link
Contributor

@rotateright rotateright commented Sep 16, 2020

SLP does not actually sort the candidates for a reduction, so the
instruction order in IR affects the groupings and what comes out. That's
part of why we see reductions for some examples but not others.

Proposal:
https://reviews.llvm.org/D87772

@RKSimon
Copy link
Collaborator Author

@RKSimon RKSimon commented Jun 2, 2021

The code is now typically something like:

define float @​test_merge_anyof_v4sf(<4 x float> %0) {
%2 = extractelement <4 x float> %0, i32 0
%3 = fcmp olt float %2, 0.000000e+00
%4 = extractelement <4 x float> %0, i32 1
%5 = fcmp olt float %4, 0.000000e+00
%6 = select i1 %3, i1 true, i1 %5
%7 = extractelement <4 x float> %0, i32 2
%8 = fcmp olt float %7, 0.000000e+00
%9 = select i1 %6, i1 true, i1 %8
%10 = extractelement <4 x float> %0, i32 3
%11 = fcmp olt float %10, 0.000000e+00
%12 = select i1 %9, i1 true, i1 %11
%13 = fcmp ogt float %2, 1.000000e+00
%14 = select i1 %12, i1 true, i1 %13
%15 = fcmp ogt float %4, 1.000000e+00
%16 = select i1 %14, i1 true, i1 %15
%17 = fcmp ogt float %7, 1.000000e+00
%18 = select i1 %16, i1 true, i1 %17
%19 = fcmp ogt float %10, 1.000000e+00
%20 = select i1 %18, i1 true, i1 %19
%21 = fadd float %2, %4
%22 = select i1 %20, float 0.000000e+00, float %21
ret float %22
}

which lowers to:

test_merge_anyof_v4sf:
vmovss .LCPI0_0(%rip), %xmm5 # xmm5 = mem[0],zero,zero,zero
vmovshdup %xmm0, %xmm1 # xmm1 = xmm0[1,1,3,3]
vpermilps $255, %xmm0, %xmm3 # xmm3 = xmm0[3,3,3,3]
vpermilpd $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vaddss %xmm1, %xmm0, %xmm4
vcmpltss %xmm3, %xmm5, %xmm6
vandnps %xmm4, %xmm6, %xmm4
vcmpltss %xmm2, %xmm5, %xmm6
vandnps %xmm4, %xmm6, %xmm4
vcmpltss %xmm1, %xmm5, %xmm6
vcmpltss %xmm0, %xmm5, %xmm5
vandnps %xmm4, %xmm6, %xmm4
vandnps %xmm4, %xmm5, %xmm4
vxorps %xmm5, %xmm5, %xmm5
vcmpltss %xmm5, %xmm3, %xmm3
vcmpltss %xmm5, %xmm2, %xmm2
vcmpltss %xmm5, %xmm1, %xmm1
vcmpltss %xmm5, %xmm0, %xmm0
vandnps %xmm4, %xmm3, %xmm3
vandnps %xmm3, %xmm2, %xmm2
vandnps %xmm2, %xmm1, %xmm1
vandnps %xmm1, %xmm0, %xmm0
retq

@RKSimon
Copy link
Collaborator Author

@RKSimon RKSimon commented Jun 2, 2021

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction intrinsics from it?

@rotateright
Copy link
Contributor

@rotateright rotateright commented Jul 7, 2021

We have to be careful in IR as the

select i1 %12, i1 true, i1 %13

pattern is sensitive to poison, I don't think we can create reduction
intrinsics from it?

We can do it with freeze:
https://alive2.llvm.org/ce/z/RQ_yhV

So we need to:

  1. Have SLP recognize the select form of logic ops
  2. Insert a freeze to make a transform poison-safe.

@rotateright
Copy link
Contributor

@rotateright rotateright commented Jul 7, 2021

I drafted a hack for SLP reduction matching, and it partly works on the minimal patterns without obviously breaking anything else.

I'll try to clean that up.

Not sure yet if we'll still need other changes (SimplifyCFG or codegen?).

@rotateright
Copy link
Contributor

@rotateright rotateright commented Jul 9, 2021

@RKSimon
Copy link
Collaborator Author

@RKSimon RKSimon commented Jul 10, 2021

https://reviews.llvm.org/D105730

+1 Thank goodness for freeze :)

@RKSimon
Copy link
Collaborator Author

@RKSimon RKSimon commented Nov 27, 2021

mentioned in issue #49274

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants