Skip to content

Powerpc64 fixes #1326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Conversation

antonblanchard
Copy link
Contributor

I've reviewed the powerpc64 inline assembly and found a number of issues. The following patches show measurable improvements in performance on POWER8.

Detecting overflow with the XER is slow, partially because we have to
clear it before use. We can do better by using a trick where we compare
the high 64 bits of the result with the low 64 bits shifted right
63 bits.

This is 7% faster on a POWER8 running a simple testcase:

<?php
        function testcase($count = 100000000) {
                for ($i = 0; $i < $count; $i++) {
                        $x = 1;
                        $x = $x * 2;
                        $x = $x * 2;
                        $x = $x * 2;
                        $x = $x * 2;
                }
        }

        testcase();
?>
Detecting overflow with the XER is slow, partially because we have to
clear it before use.

gcc does a better job of detecting overflow of an increment or decrement
than we can with inline assembly. It knows that an increment will only
overflow if it is one less than the overflow value. This means we end
up with a simple compare/branch. Furthermore, leaving it in c allows gcc
to schedule the instructions better.

This is 6% faster on a POWER8 running a simple testcase:

<?php
        function testcase($count = 100000000) {
		$x = 1;
                for ($i = 0; $i < $count; $i++) {
                        $x++;
                        $x++;
                        $x++;
                        $x++;
                        $x++;
                }
        }

        testcase();
?>
Detecting overflow with the XER is slow, partially because we have to
clear it before use.

PHP already has a fast way of detecting overflow in its fallback
c implementation. Overflow only occurs if the signs of the two
operands are the same and the sign of the result is different.
Furthermore, leaving it in c allows gcc to schedule the instructions
better.

This is 9% faster on a POWER8 running a simple testcase:

<?php
        function testcase($count = 100000000) {
		$x = 1;
                for ($i = 0; $i < $count; $i++) {
                        $x = $x + 1;
                        $x = $x + 1;
                        $x = $x + 1;
                        $x = $x + 1;
                        $x = $x + 1;
                }
        }

        testcase();
?>
@weltling
Copy link
Contributor

@remicollet could you please check this one? ... I'm not sure removing a lot of code were a good idea, but if it's something worky and proven ...

Thanks.

@antonblanchard
Copy link
Contributor Author

I'm happy to discuss this more. The current inline assembly was added without any performance data to back it up. It is much slower than what gcc does, verified by running both versions through our simulators.

I spent time working on better inline assembly versions, and in most cases gcc beat the best I could do. ZEND_SIGNED_MULTIPLY_LONG is the exception.

Finally, I was able to benchmark clear improvements with each patch in isolation via microbenchmarks. The performance improvements are in the commit messages.

@remicollet
Copy link
Member

@weltling I will try to get a ppc64 to check this.

@remicollet
Copy link
Member

Just run some tests

cpu  : POWER8 (architected), altivec supported
os   : RHEL 7.1 (ppc64, big endian)
build: (default RH RPM build optioins) -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mcpu=power7 -mtune=power7

I don't see any regression in test suite, but observe quite different results:

  • muliply: from 10.2 to 8.9 (12% saved)
  • increment: from 8.3 to 7.7 (7% saved)
  • add: from 10.1 to 14.4 (40% lost)

@antonblanchard any idea about these results ?

@antonblanchard
Copy link
Contributor Author

Hi @remicollet, we do know of a gcc TOC save/restore issue that could be masking any improvements. Can you try testing with -msave-toc-indirect?

My numbers were on POWER8, I'll grab a POWER7 to test.

@antonblanchard
Copy link
Contributor Author

Hi @remicollet, I tested on a POWER7 running Fedora20.

CFLAGS="-O3 -g -msave-toc-indirect"

I can't reproduce the add issue:

baseline: 8.733s
add patch: 7.971s

Are the results repeatable? Perhaps you are running on a shared processor box?

@antonblanchard
Copy link
Contributor Author

Actually I see a big regression when I add -mcpu=power7 to my CFLAGS, my baseline drops from 8.733s to 10.9s. Investigating.

@antonblanchard
Copy link
Contributor Author

I see the problem - the Fedora 20 toolchain is adding static branch hints when -mcpu=power7 is added:

104b0b74:   bne-    cr7,104b0bc8 <.ZEND_ADD_SPEC_CV_CONST_HANDLER+0x88>
104b0b78:   lwz     r5,8(r31)
104b0b7c:   cmplwi  cr7,r5,4
104b0b80:   bne-    cr7,104b0c3c <.ZEND_ADD_SPEC_CV_CONST_HANDLER+0xfc>

Static branch prediction will disable any dynamic branch prediction. Get it wrong and you will mispredict every time. We stopped gcc from generating static branch quite a while ago because it caused such bad performance.

@remicollet: could you retry your runs without -mcpu=power7?

@nikic
Copy link
Member

nikic commented Jun 16, 2015

I've just tried dropping the increment assembly on x86-64 and seeing a 2.5% improvement there on your test script. So I think we should just drop assembly for increment/decrement altogether (for all platforms).

@kaplanlior
Copy link
Contributor

Also relevant for #1245

@nikic - If dropping is better, I think should go for it. @remicollet - any opinion here?

@weltling
Copy link
Contributor

Maybe we shouldn't kick it out but make it still available per config option. I was doing some research some time ago about this topic https://github.com/weltling/ml64_exercise about using ASM on 64-bit Windows. It turned out, that in many cases we could enormously optimize the bottlenecks, just in this particular case the Visual Studio is not able to inline ASM on 64-bit.

But in general - one could still imagine that these ASM bricks can be useful in some particular situation or platform. Of course, in many cases a C compiler would do a great job, that's not to dispute. But IMHO these inlined pieces can be still reasonable, especially as we won't be able to test every single possible platform.

Thanks.

@daxtens
Copy link
Contributor

daxtens commented Jun 30, 2015

I can confirm that this improves performance - I've compiled on PowerPC64 Little Endian on Ubuntu - in my benchmark of a tight increment loop the performance goes up by over 10%.

If you want to keep ASM for other platforms, maybe we could just merge this to improve performance on POWER and leave the other architectures in?

@antonblanchard
Copy link
Contributor Author

@remicollet did you have any luck rerunning the tests with the suggested compilation flag changes?

@antonblanchard
Copy link
Contributor Author

@weltling @remicollet we are keen to get this patch upstream. Is there anything else you need us to do?

@remicollet
Copy link
Member

could you retry your runs without -mcpu=power7?

@antonblanchard this doesn't make sense to me, as this is the default flag on RHEL-7.
I have pinged PowerPC at work export about this, but haven't got any other feedback.

P.S. I don't mean -mtune7 is the right choice for everyone, just that +10% for power8 vs -40% for power7 should be considered.

@antonblanchard
Copy link
Contributor Author

@remicollet Thanks. FYI the original patch was added without any performance data to back it up. All of us at IBM agree that it should be removed, I've even run it through our simulation environment to confirm.

@weltling
Copy link
Contributor

@antonblanchard I personally would be more of fan to deactivate by a config option, at least for now. Everyone can switch and compare, also on the other platforms.

With the concrete case - we probably should wait for the feedback from Remi's collegues. Power64 is a rare platform so not everyone can test on it, therefore the more feedback - the better. When it's confirmed by @remicollet - probably it were fine to switch to the plain C, also for the maintainability reasons.

Thanks.

@remicollet
Copy link
Member

I've just committed the multiply and increment part of this PR (as we observe perf improvment in all case)

@php-pulls
Copy link

Comment on behalf of remi at php.net:

Merged

@php-pulls php-pulls closed this Jul 29, 2015
@remicollet
Copy link
Member

Everything is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants