I've added additional C++ optimizations and limited the number of threads we create #4
One of the biggest hotspots in the code looked like this:
This means the function was spending over a third of its execution time in those last couple of instructions. The processor can't parallelize them well because instruction 3 depends on the previous two instructions, and instruction 4 depends on instruction 3. Since the check is usually false, the idea was to move the final subtraction to after the jump, so the jump can be executed speculatively without a stall. And because the jump points to the loop bounds check (which is all integer math), it won't have to wait on the floating-point computations. It seems to have worked:
Now the only dependent instruction is the compare, so we can go straight into the predicted jump. It's only a 3% total improvement, but they all add up.
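For readers without the profiler output, here is a rough sketch of this kind of rewrite. It is not the code from this pull request: the function names, the `b`/`c` arrays, and the `sqrt` in the branch body are all hypothetical stand-ins, and it assumes the hot check had the shape `b * b - c > 0`. The first version subtracts before comparing, so the compare waits on the subtract. The second compares the product against `c` directly and only subtracts once the branch is taken.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the hot loop. The names and data layout are
// assumptions made for illustration, not the code from this pull request.

// Before: multiply -> subtract -> compare -> jump form one dependency chain.
int count_hits_before(const std::vector<float>& b,
                      const std::vector<float>& c, float& sum) {
    assert(b.size() == c.size());
    int hits = 0;
    sum = 0.0f;
    for (std::size_t i = 0; i < b.size(); ++i) {
        float q = b[i] * b[i] - c[i];  // subtract depends on the multiply
        if (q > 0.0f) {                // compare depends on the subtract
            sum += std::sqrt(q);
            ++hits;
        }
    }
    return hits;
}

// After: the compare depends only on the multiply; the subtraction runs
// only inside the (assumed rarely taken) branch.
int count_hits_after(const std::vector<float>& b,
                     const std::vector<float>& c, float& sum) {
    assert(b.size() == c.size());
    int hits = 0;
    sum = 0.0f;
    for (std::size_t i = 0; i < b.size(); ++i) {
        float bb = b[i] * b[i];
        if (bb > c[i]) {               // same test as bb - c[i] > 0
            float q = bb - c[i];       // deferred until it is needed
            sum += std::sqrt(q);
            ++hits;
        }
    }
    return hits;
}
```

The two tests should agree for ordinary finite inputs, but an optimizing compiler may already perform this transformation or fuse the multiply and subtract, so whether it helps has to be confirmed in the profiler, as was done above.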
Thanks for the explanation. I thought it would be something along these lines, and I made a few similar changes on the Go side that seem to have paid off as well.
Thanks to your changes, C++ is way ahead of the Go version now. I am rerunning the benchmarks and putting up a comparo.