Powerpc64 fixes #1326

antonblanchard · 2015-06-05T11:22:43Z

I've reviewed the powerpc64 inline assembly and found a number of issues. The following patches show measurable improvements in performance on POWER8.

Detecting overflow with the XER is slow, partially because we have to clear it before use. We can do better with a trick: compare the high 64 bits of the result with the low 64 bits arithmetically shifted right by 63 bits. The two differ exactly when the multiplication overflowed. This is 7% faster on a POWER8 running a simple testcase:

<?php
function testcase($count = 100000000) {
    for ($i = 0; $i < $count; $i++) {
        $x = 1;
        $x = $x * 2;
        $x = $x * 2;
        $x = $x * 2;
        $x = $x * 2;
    }
}
testcase();
?>
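The high/low comparison described in this commit message can be sketched in portable-ish C. This is only an illustration, not the patch itself: `mul_overflows` is a hypothetical helper name, and the sketch assumes a GCC or Clang compiler (for `__int128`, arithmetic right shift of negative values, and wrapping conversion to `int64_t`).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not the PHP macro): multiply two signed 64-bit
 * values and report overflow by comparing the high half of the 128-bit
 * product with the sign extension of the low half. */
static bool mul_overflows(int64_t a, int64_t b, int64_t *result)
{
    /* __int128 is a GCC/Clang extension; on powerpc64 the two halves
     * would come from the mulld and mulhd instructions. */
    __int128 product = (__int128)a * (__int128)b;
    int64_t low  = (int64_t)product;          /* low 64 bits  */
    int64_t high = (int64_t)(product >> 64);  /* high 64 bits */

    *result = low;

    /* low >> 63 is 0 or -1 (arithmetic shift, as GCC implements it).
     * The product fits in 64 bits only if the high half is exactly
     * that sign extension. */
    return high != (low >> 63);
}
```

A caller would fall back to a floating-point multiply when the function returns true, and use `*result` otherwise.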

Detecting overflow with the XER is slow, partially because we have to clear it before use. gcc does a better job of detecting overflow of an increment or decrement than we can with inline assembly. It knows that an increment will only overflow if the value is already the maximum, so we end up with a simple compare/branch. Furthermore, leaving it in C allows gcc to schedule the instructions better. This is 6% faster on a POWER8 running a simple testcase:

<?php
function testcase($count = 100000000) {
    $x = 1;
    for ($i = 0; $i < $count; $i++) {
        $x++;
        $x++;
        $x++;
        $x++;
        $x++;
    }
}
testcase();
?>
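As a rough illustration of the compare/branch form this commit message describes, here is a minimal C sketch. The helper names `increment_overflows` and `decrement_overflows` are hypothetical and are not taken from the PHP source.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: increment *value unless that would overflow.
 * The only value whose increment overflows is LONG_MAX, so the check
 * is a single compare and branch. */
static bool increment_overflows(long *value)
{
    if (*value == LONG_MAX) {
        return true;   /* value left untouched; caller handles overflow */
    }
    (*value)++;
    return false;
}

/* Mirror image for decrement: only LONG_MIN underflows. */
static bool decrement_overflows(long *value)
{
    if (*value == LONG_MIN) {
        return true;
    }
    (*value)--;
    return false;
}
```

Because the check happens before the arithmetic, the sketch never executes a signed overflow, which would be undefined behaviour in C.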

Detecting overflow with the XER is slow, partially because we have to clear it before use. PHP already has a fast way of detecting overflow in its fallback C implementation. Overflow only occurs if the signs of the two operands are the same and the sign of the result is different. Furthermore, leaving it in C allows gcc to schedule the instructions better. This is 9% faster on a POWER8 running a simple testcase:

<?php
function testcase($count = 100000000) {
    $x = 1;
    for ($i = 0; $i < $count; $i++) {
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
        $x = $x + 1;
    }
}
testcase();
?>
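The sign rule stated in this commit message can be sketched in C as below. This is an illustration under assumptions, not PHP's actual fallback code: `add_overflows` is a hypothetical name, and the addition is done on unsigned values so that the wraparound is well defined in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: add two signed 64-bit values and report overflow
 * using the sign rule: overflow happens only when both operands have the
 * same sign and the result has the opposite sign. */
static bool add_overflows(int64_t a, int64_t b, int64_t *result)
{
    const uint64_t sign = UINT64_C(1) << 63;   /* mask for the sign bit */
    uint64_t ua = (uint64_t)a;
    uint64_t ub = (uint64_t)b;
    uint64_t sum = ua + ub;                    /* unsigned add wraps, no UB */

    /* Converting back is implementation-defined for large values;
     * GCC and Clang give the two's complement result. */
    *result = (int64_t)sum;

    bool same_sign      = (ua & sign) == (ub & sign);
    bool result_flipped = (ua & sign) != (sum & sign);
    return same_sign && result_flipped;
}
```

As with the multiply case, a caller would switch to a floating-point addition when the function returns true.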

weltling · 2015-06-10T20:48:00Z

@remicollet could you please check this one? ... I'm not sure removing a lot of code is a good idea, but if it's something that works and is proven ...

Thanks.

antonblanchard · 2015-06-16T06:13:58Z

I'm happy to discuss this more. The current inline assembly was added without any performance data to back it up. It is much slower than what gcc does, verified by running both versions through our simulators.

I spent time working on better inline assembly versions, and in most cases gcc beat the best I could do. ZEND_SIGNED_MULTIPLY_LONG is the exception.

Finally, I was able to benchmark clear improvements with each patch in isolation via microbenchmarks. The performance improvements are in the commit messages.

remicollet · 2015-06-16T06:32:04Z

@weltling I will try to get a ppc64 to check this.

remicollet · 2015-06-16T07:22:55Z

I just ran some tests:

cpu  : POWER8 (architected), altivec supported
os   : RHEL 7.1 (ppc64, big endian)
build: (default RH RPM build options) -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 -mcpu=power7 -mtune=power7

I don't see any regression in the test suite, but I observe quite different results:

multiply: from 10.2 to 8.9 (12% saved)
increment: from 8.3 to 7.7 (7% saved)
add: from 10.1 to 14.4 (40% lost)

@antonblanchard any idea about these results ?

antonblanchard · 2015-06-16T08:17:08Z

Hi @remicollet, we do know of a gcc TOC save/restore issue that could be masking any improvements. Can you try testing with -msave-toc-indirect?

My numbers were on POWER8, I'll grab a POWER7 to test.

antonblanchard · 2015-06-16T10:31:08Z

Hi @remicollet, I tested on a POWER7 running Fedora20.

CFLAGS="-O3 -g -msave-toc-indirect"

I can't reproduce the add issue:

baseline: 8.733s
add patch: 7.971s

Are the results repeatable? Perhaps you are running on a shared processor box?

antonblanchard · 2015-06-16T10:32:04Z

Actually I see a big regression when I add -mcpu=power7 to my CFLAGS, my baseline drops from 8.733s to 10.9s. Investigating.

antonblanchard · 2015-06-16T11:43:54Z

I see the problem - the Fedora 20 toolchain is adding static branch hints when -mcpu=power7 is added:

104b0b74:   bne-    cr7,104b0bc8 <.ZEND_ADD_SPEC_CV_CONST_HANDLER+0x88>
104b0b78:   lwz     r5,8(r31)
104b0b7c:   cmplwi  cr7,r5,4
104b0b80:   bne-    cr7,104b0c3c <.ZEND_ADD_SPEC_CV_CONST_HANDLER+0xfc>

Static branch prediction will disable any dynamic branch prediction. Get it wrong and you will mispredict every time. We stopped gcc from generating static branch hints quite a while ago because they caused such bad performance.

@remicollet: could you retry your runs without -mcpu=power7?

nikic · 2015-06-16T16:45:43Z

I've just tried dropping the increment assembly on x86-64 and I'm seeing a 2.5% improvement there on your test script. So I think we should just drop assembly for increment/decrement altogether (for all platforms).

kaplanlior · 2015-06-18T14:56:03Z

Also relevant for #1245

@nikic - If dropping is better, I think we should go for it. @remicollet - any opinion here?

weltling · 2015-06-19T09:45:12Z

Maybe we shouldn't kick it out but keep it available via a config option. Some time ago I did some research on this topic, about using ASM on 64-bit Windows: https://github.com/weltling/ml64_exercise. It turned out that in many cases we could optimize the bottlenecks enormously; it's just that in this particular case Visual Studio is not able to inline ASM on 64-bit.

But in general, one could still imagine that these ASM bricks can be useful in some particular situation or platform. Of course, in many cases a C compiler will do a great job, that's not in dispute. But IMHO these inlined pieces can still be reasonable, especially as we won't be able to test every single possible platform.

Thanks.

daxtens · 2015-06-30T01:54:25Z

I can confirm that this improves performance - I've compiled on PowerPC64 Little Endian on Ubuntu - in my benchmark of a tight increment loop the performance goes up by over 10%.

If you want to keep ASM for other platforms, maybe we could just merge this to improve performance on POWER and leave the other architectures in?

antonblanchard · 2015-07-06T01:56:43Z

@remicollet did you have any luck rerunning the tests with the suggested compilation flag changes?

antonblanchard · 2015-07-28T15:47:50Z

@weltling @remicollet we are keen to get this patch upstream. Is there anything else you need us to do?

remicollet · 2015-07-28T15:54:08Z

> could you retry your runs without -mcpu=power7?

@antonblanchard this doesn't make sense to me, as this is the default flag on RHEL-7.
I have pinged a PowerPC expert at work about this, but haven't got any feedback yet.

P.S. I don't mean -mtune=power7 is the right choice for everyone, just that +10% for power8 vs -40% for power7 should be considered.

antonblanchard · 2015-07-28T16:06:14Z

@remicollet Thanks. FYI the original patch was added without any performance data to back it up. All of us at IBM agree that it should be removed, I've even run it through our simulation environment to confirm.

weltling · 2015-07-28T16:20:20Z

@antonblanchard I personally would prefer to deactivate it via a config option, at least for now. Everyone can switch and compare, also on the other platforms.

As for the concrete case, we should probably wait for the feedback from Remi's colleagues. Power64 is a rare platform so not everyone can test on it, therefore the more feedback, the better. Once it's confirmed by @remicollet, it would probably be fine to switch to plain C, also for maintainability reasons.

Thanks.

remicollet · 2015-07-28T16:39:15Z

I've just committed the multiply and increment parts of this PR (as we observe a perf improvement in all cases).

php-pulls · 2015-07-29T07:24:14Z

Comment on behalf of remi at php.net:

Merged

remicollet · 2015-07-29T07:25:04Z

Everything is merged.

antonblanchard added 3 commits June 1, 2015 11:45

smalyshev added the Feature label Jun 26, 2015

php-pulls closed this Jul 29, 2015
