Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[opt] Failure to recognise equivalent shuffled ops #60632

Closed
RKSimon opened this issue Feb 9, 2023 · 4 comments
Closed

[opt] Failure to recognise equivalent shuffled ops #60632

RKSimon opened this issue Feb 9, 2023 · 4 comments

Comments

@RKSimon
Copy link
Collaborator

RKSimon commented Feb 9, 2023

https://gcc.godbolt.org/z/dnsebnPz5

We seeing cases where multiple uses of a node is preventing vector-combine from merging equivalent shuffles.

Sorry the test case is still more convoluted than necessary :(

define <2 x float> @foo(<2 x float> %a0, <2 x float> %a1, <2 x float> %a3, <2 x float> %a4) {
  ; dot product
  %dp0 = fmul <2 x float> %a0, %a1
  %dp1 = shufflevector <2 x float> %dp0, <2 x float> poison, <2 x i32> <i32 1, i32 undef>
  %dp2 = fadd <2 x float> %dp0, %dp1
  %dp3x = extractelement <2 x float> %dp2, i32 0

  ; scalar fdiv
  %a3x = extractelement <2 x float> %a3, i32 0
  %x = fdiv float %dp3x, %a3x

  ; first use
  %xsplat0 = insertelement <2 x float> poison, float %x, i32 0
  %xsplat1 = shufflevector <2 x float> %xsplat0, <2 x float> poison, <2 x i32> zeroinitializer
  %vv = fmul <2 x float> %xsplat1, %a4

  ; second use
  %a4x = extractelement <2 x float> %a4, i32 0
  %q = fmul float %x, %a4x
  %qsplat0 = insertelement <2 x float> poison, float %q, i32 0
  %qsplat1 = shufflevector <2 x float> %qsplat0, <2 x float> poison, <2 x i32> zeroinitializer

  %res = fadd <2 x float> %vv, %qsplat1
  ret <2 x float> %res
}

opt -O3

define <2 x float> @foo(<2 x float> %a0, <2 x float> %a1, <2 x float> %a3, <2 x float> %a4) {
  %dp0 = fmul <2 x float> %a0, %a1
  %dp1 = shufflevector <2 x float> %dp0, <2 x float> poison, <2 x i32> <i32 1, i32 undef>
  %dp2 = fadd <2 x float> %dp0, %dp1
  %1 = fdiv <2 x float> %dp2, %a3
  %xsplat1 = shufflevector <2 x float> %1, <2 x float> poison, <2 x i32> zeroinitializer
  %vv = fmul <2 x float> %xsplat1, %a4
  %2 = fmul <2 x float> %1, %a4
  %qsplat1 = shufflevector <2 x float> %2, <2 x float> poison, <2 x i32> zeroinitializer
  %res = fadd <2 x float> %vv, %qsplat1
  ret <2 x float> %res
}

as the %2 fmul case will be splatted, we should be able to use %vv again:

define <2 x float> @foo(<2 x float> %a0, <2 x float> %a1, <2 x float> %a3, <2 x float> %a4) {
  %dp0 = fmul <2 x float> %a0, %a1
  %dp1 = shufflevector <2 x float> %dp0, <2 x float> poison, <2 x i32> <i32 1, i32 undef>
  %dp2 = fadd <2 x float> %dp0, %dp1
  %1 = fdiv <2 x float> %dp2, %a3
  %xsplat1 = shufflevector <2 x float> %1, <2 x float> poison, <2 x i32> zeroinitializer
  %vv = fmul <2 x float> %xsplat1, %a4
  %qsplat1 = shufflevector <2 x float> %vv, <2 x float> poison, <2 x i32> zeroinitializer
  %res = fadd <2 x float> %vv, %qsplat1
  ret <2 x float> %res
}
@rotateright
Copy link
Contributor

rotateright commented Feb 9, 2023

We've seen kind of problem before, but I'm not finding a previous report. Either way, we never had a solution.
This isn't really a "combine" in LLVM terms because we need to look at a vector instruction with limited demand of its elements, do a search of independent values, and find a suitable replacement.

Here's a reduction of the original example:

define <2 x i4> @src(<2 x i4> %x, <2 x i4> %y) {
  %xsplat = shufflevector <2 x i4> %x, <2 x i4> poison, <2 x i32> zeroinitializer
  %vv = mul <2 x i4> %xsplat, %y
  %m = mul <2 x i4> %x, %y
  %msplat = shufflevector <2 x i4> %m, <2 x i4> poison, <2 x i32> zeroinitializer
  %res = add <2 x i4> %vv, %msplat
  ret <2 x i4> %res
}

define <2 x i4> @tgt(<2 x i4> %x, <2 x i4> %y) {
  %xsplat = shufflevector <2 x i4> %x, <2 x i4> poison, <2 x i32> zeroinitializer
  %vv = mul <2 x i4> %xsplat, %y
  %msplat = shufflevector <2 x i4> %vv, <2 x i4> poison, <2 x i32> zeroinitializer
  %res = add <2 x i4> %vv, %msplat
  ret <2 x i4> %res
}

And here's an example with an extractelement rather than splat at the end:

declare void @use(<2 x i4>)
define i4 @src(<2 x i4> %x, <2 x i4> %y) {
  %xsplat = shufflevector <2 x i4> %x, <2 x i4> poison, <2 x i32> zeroinitializer
  %vv = mul <2 x i4> %xsplat, %y
  call void @use(<2 x i4> %vv)
  %m = mul <2 x i4> %x, %y
  %m0 = extractelement <2 x i4> %m, i32 0
  ret i4 %m0
}

define i4 @tgt(<2 x i4> %x, <2 x i4> %y) {
  %xsplat = shufflevector <2 x i4> %x, <2 x i4> poison, <2 x i32> zeroinitializer
  %vv = mul <2 x i4> %xsplat, %y
  call void @use(<2 x i4> %vv)
  %m0 = extractelement <2 x i4> %vv, i32 0
  ret i4 %m0
}

@RKSimon
Copy link
Collaborator Author

RKSimon commented Feb 16, 2023

If we consider looking at this in DAG - we already have SelectionDAG::doesNodeExist - I wonder how well a SelectionDAG::doesShuffleNodeExist would work in helping us see if an equivalent shuffle already exists?

Then the existing shuffle(binop(x,y)) -> binop(shuffle(x),shuffle(y)) folds might be extended to use it.

It wouldn't help with the extractelement case though

rotateright added a commit that referenced this issue Feb 23, 2023
… shuffle-of-binop

This fold was added with https://reviews.llvm.org/D135876 ,
but we missed the one-use check.

This might be the root cause for issue #60632.
@rotateright
Copy link
Contributor

Attempt to get this in IR via demanded elements:
https://reviews.llvm.org/D144760

Also, I'm not sure if 40d772c changed anything for the motivating program(s) - might want to see how the redundant math is created originally.

rotateright added a commit that referenced this issue Feb 28, 2023
…undant instructions

In issue #60632, we have vector math ops that differ because an
operand is shuffled, but the math has limited demanded elements,
so it can be replaced by another instruction:
https://alive2.llvm.org/ce/z/TKqq7H

I don't think we have anything like this yet - it's like a
CSE/GVN fold, but driven by demanded elements of a vector op.
This is limited to splat-0 as a first step to keep it simple.

Differential Revision: https://reviews.llvm.org/D144760
@rotateright
Copy link
Contributor

I think between 40d772c and 3b090ff, this is fixed. If not, please re-open or file another example.

There's a TODO in the demanded elements code about detecting more than splat-of-zero-element shuffles, but I can't tell if that is a concern for the motivating program(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants