New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
At some point llvm re-added pavgw intrinsics #6302
Conversation
This is a good thing, because these do not reliably trigger from the pattern in runtime/x86.ll
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what version of LLVM these came back in? I guess we can just wait and see what the build bots say.
@@ -30,28 +30,6 @@ define weak_odr <8 x i16> @psubuswx8(<8 x i16> %a0, <8 x i16> %a1) nounwind alwa | |||
ret <8 x i16> %3 | |||
} | |||
|
|||
; Note that this is only used for LLVM 6.0+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are also some things in x86_avx2.ll that should be deleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x86_avx.ll had some. Deleted.
That was my plan. I'll see what the bots say. |
What do you mean by "reliably"? After inlining? Any particular problematic snippets? |
When multiple averaging operations are combined together into a tree, I wasn't seeing as many pavgw instructions in the generated code as expected. This PR fixes it, and makes Halide generate many fewer ops for averaging trees. In general whenever LLVM removes an intrinsic and replaces it with pattern matching, instruction selection gets worse whenever that pattern has instructions that can be folded with surrounding ones. This is also true of Halide in the places where we rely on pattern matching. You can emit the perfect Expr for a particular instruction, and it compiles to that instruction if that's the only thing the Func is doing, but in a more complex expression as soon as the compiler sees a chance to do some CSE or simplification, things can go south. Sadly, pattern-based instruction selection is inherently brittle. I'll go dig up the llvm IR that was causing the problem ... |
Halide Expr:
Halide Stmt:
llvm IR before this PR. pavgw lowered as i16((i32(a) + i32(b) + 1)>>1)
llvm IR after this PR, with direct use of intrinsics:
Assembly before. God knows what has gone wrong here. I see two pavgw instructions, but also lots of 32-bit integer math.
Assembly after using intrinsics directly:
|
Thanks!
(on godbolt: https://godbolt.org/z/T63WE4xcn) |
Also being fixed upstream: Thanks @LebedevRI ! I'll leave this PR as-is, so that we have two chances to catch pavgw. If we miss it in our pattern-matching, hopefully llvm will catch it for us. |
…131) As noted in halide/Halide#6302, we hilariously fail to match PAVG if we even as much as look at it the wrong way. In this particular case, the problem stems from the fact that `PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s, and InstCombine really loves to get rid of both of these, for example replace them with a bit mask. So we may not have said `zext`. Instead of checking for that + type match, i think we should rely on the actual active type, as per the knownbits. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111571
…131) As noted in halide/Halide#6302, we hilariously fail to match PAVG if we even as much as look at it the wrong way. In this particular case, the problem stems from the fact that `PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s, and InstCombine really loves to get rid of both of these, for example replace them with a bit mask. So we may not have said `zext`. Instead of checking for that + type match, i think we should rely on the actual active type, as per the knownbits. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D111571
This is a good thing, because these do not reliably trigger from the
pattern in runtime/x86.ll