-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Powerpc64 fixes #1326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Powerpc64 fixes #1326
Conversation
Detecting overflow with the XER is slow, partially because we have to clear it before use. We can do better by using a trick where we compare the high 64 bits of the result with the low 64 bits shifted right 63 bits. This is 7% faster on a POWER8 running a simple testcase: <?php function testcase($count = 100000000) { for ($i = 0; $i < $count; $i++) { $x = 1; $x = $x * 2; $x = $x * 2; $x = $x * 2; $x = $x * 2; } } testcase(); ?>
Detecting overflow with the XER is slow, partially because we have to clear it before use. gcc does a better job of detecting overflow of an increment or decrement than we can with inline assembly. It knows that an increment will only overflow if it is one less than the overflow value. This means we end up with a simple compare/branch. Furthermore, leaving it in c allows gcc to schedule the instructions better. This is 6% faster on a POWER8 running a simple testcase: <?php function testcase($count = 100000000) { $x = 1; for ($i = 0; $i < $count; $i++) { $x++; $x++; $x++; $x++; $x++; } } testcase(); ?>
Detecting overflow with the XER is slow, partially because we have to clear it before use. PHP already has a fast way of detecting overflow in its fallback c implementation. Overflow only occurs if the signs of the two operands are the same and the sign of the result is different. Furthermore, leaving it in c allows gcc to schedule the instructions better. This is 9% faster on a POWER8 running a simple testcase: <?php function testcase($count = 100000000) { $x = 1; for ($i = 0; $i < $count; $i++) { $x = $x + 1; $x = $x + 1; $x = $x + 1; $x = $x + 1; $x = $x + 1; } } testcase(); ?>
@remicollet could you please check this one? ... I'm not sure removing a lot of code were a good idea, but if it's something worky and proven ... Thanks. |
I'm happy to discuss this more. The current inline assembly was added without any performance data to back it up. It is much slower than what gcc does, verified by running both versions through our simulators. I spent time working on better inline assembly versions, and in most cases gcc beat the best I could do. ZEND_SIGNED_MULTIPLY_LONG is the exception. Finally, I was able to benchmark clear improvements with each patch in isolation via microbenchmarks. The performance improvements are in the commit messages. |
@weltling I will try to get a ppc64 to check this. |
Just run some tests
I don't see any regression in test suite, but observe quite different results:
@antonblanchard any idea about these results ? |
Hi @remicollet, we do know of a gcc TOC save/restore issue that could be masking any improvements. Can you try testing with -msave-toc-indirect? My numbers were on POWER8, I'll grab a POWER7 to test. |
Hi @remicollet, I tested on a POWER7 running Fedora20. CFLAGS="-O3 -g -msave-toc-indirect" I can't reproduce the add issue: baseline: 8.733s Are the results repeatable? Perhaps you are running on a shared processor box? |
Actually I see a big regression when I add -mcpu=power7 to my CFLAGS, my baseline drops from 8.733s to 10.9s. Investigating. |
I see the problem - the Fedora 20 toolchain is adding static branch hints when -mcpu=power7 is added:
Static branch prediction will disable any dynamic branch prediction. Get it wrong and you will mispredict every time. We stopped gcc from generating static branch quite a while ago because it caused such bad performance. @remicollet: could you retry your runs without -mcpu=power7? |
I've just tried dropping the increment assembly on x86-64 and seeing a 2.5% improvement there on your test script. So I think we should just drop assembly for increment/decrement altogether (for all platforms). |
Also relevant for #1245 @nikic - If dropping is better, I think should go for it. @remicollet - any opinion here? |
Maybe we shouldn't kick it out but make it still available per config option. I was doing some research some time ago about this topic https://github.com/weltling/ml64_exercise about using ASM on 64-bit Windows. It turned out, that in many cases we could enormously optimize the bottlenecks, just in this particular case the Visual Studio is not able to inline ASM on 64-bit. But in general - one could still imagine that these ASM bricks can be useful in some particular situation or platform. Of course, in many cases a C compiler would do a great job, that's not to dispute. But IMHO these inlined pieces can be still reasonable, especially as we won't be able to test every single possible platform. Thanks. |
I can confirm that this improves performance - I've compiled on PowerPC64 Little Endian on Ubuntu - in my benchmark of a tight increment loop the performance goes up by over 10%. If you want to keep ASM for other platforms, maybe we could just merge this to improve performance on POWER and leave the other architectures in? |
@remicollet did you have any luck rerunning the tests with the suggested compilation flag changes? |
@weltling @remicollet we are keen to get this patch upstream. Is there anything else you need us to do? |
@antonblanchard this doesn't make sense to me, as this is the default flag on RHEL-7. P.S. I don't mean -mtune7 is the right choice for everyone, just that +10% for power8 vs -40% for power7 should be considered. |
@remicollet Thanks. FYI the original patch was added without any performance data to back it up. All of us at IBM agree that it should be removed, I've even run it through our simulation environment to confirm. |
@antonblanchard I personally would be more of fan to deactivate by a config option, at least for now. Everyone can switch and compare, also on the other platforms. With the concrete case - we probably should wait for the feedback from Remi's collegues. Power64 is a rare platform so not everyone can test on it, therefore the more feedback - the better. When it's confirmed by @remicollet - probably it were fine to switch to the plain C, also for the maintainability reasons. Thanks. |
I've just committed the multiply and increment part of this PR (as we observe perf improvment in all case) |
Comment on behalf of remi at php.net: Merged |
Everything is merged. |
I've reviewed the powerpc64 inline assembly and found a number of issues. The following patches show measurable improvements in performance on POWER8.