Closed
Description
| | |
| --- | --- |
| Bugzilla Link | 44113 |
| Version | 9.0 |
| OS | All |
| Reporter | LLVM Bugzilla Contributor |
| CC | @topperc,@DougGregor,@RKSimon,@zygoloid,@rotateright |
Extended Description
In a recent Stack Overflow discussion (link: https://stackoverflow.com/questions/58954801/avx-equivalent-for-mm-movelh-ps) we found that Clang compiles the intrinsic _mm256_shuffle_ps(_, _, 0x44) to vunpcklpd. This is a valid optimization for Skylake and other processors on which shuffles and unpacks have identical throughput.
But as Stack Overflow user Peter Cordes mentioned in his answer, Ice Lake processors have higher throughput for shuffles than for unpacks.
Therefore, replacing the shuffle with an unpack is counterproductive on an Ice Lake processor.
Even with -march=icelake-client, Clang replaces the shuffle with vunpcklpd.
The same goes for _mm256_shuffle_ps(_, _, 0xee), which is compiled to vunpckhpd.
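
For readers who want to see why the compiler is allowed to make this substitution at all, here is a minimal scalar sketch in portable C. It does not use the real intrinsics. The helpers `model_shuffle_ps`, `model_unpck_pd`, and `shuffle_matches_unpack` are hypothetical names written for this illustration, and they model only the documented lane-wise element selection of the instructions, not their performance.

```c
#include <assert.h>
#include <string.h>

/* Scalar model of _mm256_shuffle_ps(a, b, imm) (illustrative helper, not a
   real API): in each 128-bit lane, the two low results are picked from a and
   the two high results from b, each by one 2-bit field of imm. */
static void model_shuffle_ps(const float a[8], const float b[8],
                             unsigned imm, float out[8])
{
    for (int lane = 0; lane < 8; lane += 4) {
        out[lane + 0] = a[lane + ((imm >> 0) & 3)];
        out[lane + 1] = a[lane + ((imm >> 2) & 3)];
        out[lane + 2] = b[lane + ((imm >> 4) & 3)];
        out[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
}

/* Scalar model of vunpcklpd (high == 0) and vunpckhpd (high != 0), treating
   every 64-bit element as a pair of adjacent floats. */
static void model_unpck_pd(const float a[8], const float b[8],
                           int high, float out[8])
{
    int off = high ? 2 : 0;
    for (int lane = 0; lane < 8; lane += 4) {
        out[lane + 0] = a[lane + off];
        out[lane + 1] = a[lane + off + 1];
        out[lane + 2] = b[lane + off];
        out[lane + 3] = b[lane + off + 1];
    }
}

/* Returns 1 if the modelled shuffle with imm yields the same vector as the
   modelled low/high unpack, 0 otherwise. */
static int shuffle_matches_unpack(unsigned imm, int high)
{
    const float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const float b[8] = {10, 11, 12, 13, 14, 15, 16, 17};
    float s[8], u[8];

    model_shuffle_ps(a, b, imm, s);
    model_unpck_pd(a, b, high, u);
    return memcmp(s, u, sizeof s) == 0;
}

/* Checks the two immediates named in this report. */
static void check_issue_immediates(void)
{
    assert(shuffle_matches_unpack(0x44, 0)); /* 0x44 behaves like vunpcklpd */
    assert(shuffle_matches_unpack(0xee, 1)); /* 0xee behaves like vunpckhpd */
}
```

Calling `check_issue_immediates()` from a `main` function exercises both cases. The model only shows that the results are identical, so the choice between the two encodings is purely a matter of per-target cost, which is the point of this report.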