cmd/compile: optimize overhead from CPU feature detection #36351
As investigated in #36196, the overhead of checking for hardware FMA support on every iteration of a loop slows the loop down. @josharian's CL 212360, which introduces a
One method is to hoist the check outside the loop. To quote #15808 (comment):
For large loops and operations that permit > 2 implementations, the above optimization could result in inflated binaries, but it works well for smaller loops.
Another method is to set a function pointer to the preferred implementation at program initialization, so that every invocation pays the overhead of an indirect function call, with the benefit that the implementation cannot change at runtime. This would be akin to the dispatcher used by GCC's function multi-versioning.
It is worth further investigating opportunities for optimization in this space.
One option would be to allow people to hoist the check manually, and then have the compiler elide the checks inside the loop.
This becomes even more noticeable when the loop is unrolled at all, because then there isn't even another branch between the feature checks.
I did some testing to find out what the impact was. Because I was not awake enough at the time, I ended up with an implementation that had the untaken branches to call the library functions, but not the popcount instructions. I discovered that the cost of that, plus the cost of the popcount instructions without the branches, is much smaller than the cost of the branches and the popcount instructions together. I'm not sure why. But the net impact is that there is a real cost to the branch before every popcount, even though it's obviously a completely predictable branch (and