New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

Open
laboger opened this Issue Sep 14, 2016 · 8 comments

Comments

Projects
None yet
8 participants
@laboger
Contributor

laboger commented Sep 14, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version devel +9fcbde7 Wed Sep 14 10:07:17 2016 -0500 linux/ppc64le

What operating system and processor architecture are you using (go env)?

Ubuntu 16.04 ppc64le

What did you do?

Built the test from the math package, then inspected the objdump.

What did you expect to see?

Looking for opportunities for improvement.

What did you see instead?

In main.main, this code is generated:
111c8: 1f 00 e0 3f lis r31,31
111cc: 98 29 9f e8 ld r4,10648(r31)
111d0: 1f 00 e0 3f lis r31,31
111d4: 90 29 bf e8 ld r5,10640(r31)
111d8: 1f 00 e0 3f lis r31,31
111dc: e0 29 df e8 ld r6,10720(r31)
111e0: 1f 00 e0 3f lis r31,31
111e4: d8 29 ff e8 ld r7,10712(r31)
111e8: 1f 00 e0 3f lis r31,31
111ec: d0 29 1f e9 ld r8,10704(r31)
111f0: 1f 00 e0 3f lis r31,31
111f4: a0 29 3f e9 ld r9,10656(r31)
111f8: 1f 00 e0 3f lis r31,31
111fc: b0 29 5f e9 ld r10,10672(r31)
11200: 1f 00 e0 3f lis r31,31
11204: b8 29 7f e9 ld r11,10680(r31)
11208: 1f 00 e0 3f lis r31,31
1120c: c0 29 9f e9 ld r12,10688(r31)

r31 is reloaded each time with the same value. I believe this happens because this is a MOVD based on SB which implicitly generates two instructions, and uses r31 for the temp register and possibly that is not visible to the register allocator.

@mwhudson

This comment has been minimized.

Contributor

mwhudson commented Oct 5, 2016

Yeah, this happens because a single AMOVD becomes two real instructions so there's not really a stage where this can be fixed in the current way of working. I guess the move needs to be decomposed earlier.

I think a similar issue affects ARM64 (and probably the other fixed-width instruction architectures).

@dr2chase

This comment has been minimized.

Contributor

dr2chase commented Oct 6, 2016

Did this very thing on Sparc decades ago; the inner loop of fpppp (which was huge) had 42 different "base registers" which was of course far more than the number of registers.

@dr2chase

This comment has been minimized.

Contributor

dr2chase commented Nov 2, 2016

I'd like to push this off until 1.9. Note that this is problem occurs on other architectures, not just PPC.

@laboger

This comment has been minimized.

Contributor

laboger commented Nov 2, 2016

OK, so do you want the title changed?

@dr2chase

This comment has been minimized.

Contributor

dr2chase commented Nov 2, 2016

That would be fine. Any architecture with load/store instructions using a register + small immediate to form the address is eligible for this optimization. I suspect it might need linker or at least assembler assistance, since we'll be emitting relocations along the lines of

  base := &global1
  x := *(base + (&global2-&global1))

and we'd want to usually guess right at the rough magnitude of the difference of those two globals.

@rsc rsc modified the milestones: Go1.9, Go1.8 Nov 2, 2016

@rsc rsc changed the title from cmd/compile: avoid reloads of temp register on ppc64le to cmd/compile: avoid reloads of temp register on ppc64le, others Nov 2, 2016

@randall77 randall77 modified the milestones: Go1.10, Go1.9 May 31, 2017

@laboger

This comment has been minimized.

Contributor

laboger commented Jul 7, 2017

Where this problem shows up most often is when referencing constants, which is common in many of functions in the math package. It seems that with constants, they should all be in rodata, and then the base could be the start of the rodata (named: runtime.rodata). Then the offset of the constant from the start of rodata would be known at compile time.

Likewise something similar could be done for static data.

@laboger

This comment has been minimized.

Contributor

laboger commented Sep 19, 2017

@dr2chase I'd like to try and look at this further and try to get it working. If you have some clues or pointers that would be helpful.

  • I can see that linker assistance is needed but not sure how to do that.
  • I see that the base register needs to be split from the load of the constant because now they are always done together, but where does that need to be done so that common base registers can be recognized.
    Any help appreciated...
@gopherbot

This comment has been minimized.

gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68250 mentions this issue: cmd/compile: don't fold address of global into load/store on PPC64

gopherbot pushed a commit that referenced this issue Oct 31, 2017

cmd/compile: don't fold address of global into load/store on PPC64
On PPC64 (and a few other architectures), accessing global
requires multiple instructions and use of temp register.
The compiler emits a single MOV prog, and the assembler
expands it to multiple instructions. If globals are accessed
multiple times, each time it generates a reload of the temp
register. As this is done by the assembler, the compiler
cannot optimize it.

This CL makes the compiler not fold address of global into load
and store. If a global is accessed multiple times, or multiple
fields of a struct are accessed, the compiler can CSE the
address. Currently, this doesn't help the case where different
globals are accessed, even though they may be close to each
other in the address space (which we don't know at compile time).

It helps a little bit in go1 benchmark:

name                     old time/op    new time/op    delta
BinaryTree17-2              4.84s ± 1%     4.84s ± 1%    ~     (p=0.796 n=10+10)
Fannkuch11-2                4.10s ± 0%     4.08s ± 0%  -0.58%  (p=0.000 n=9+8)
FmtFprintfEmpty-2          97.9ns ± 1%    96.8ns ± 1%  -1.08%  (p=0.000 n=10+10)
FmtFprintfString-2          147ns ± 0%     147ns ± 1%    ~     (p=0.129 n=9+10)
FmtFprintfInt-2             152ns ± 0%     152ns ± 0%    ~     (p=0.294 n=10+8)
FmtFprintfIntInt-2          218ns ± 1%     217ns ± 0%  -0.64%  (p=0.000 n=10+8)
FmtFprintfPrefixedInt-2     263ns ± 1%     256ns ± 0%  -2.77%  (p=0.000 n=10+8)
FmtFprintfFloat-2           375ns ± 1%     368ns ± 0%  -1.95%  (p=0.000 n=10+7)
FmtManyArgs-2               849ns ± 0%     850ns ± 0%    ~     (p=0.621 n=8+9)
GobDecode-2                12.3ms ± 1%    12.2ms ± 1%  -0.94%  (p=0.003 n=10+10)
GobEncode-2                10.3ms ± 1%    10.5ms ± 1%  +2.03%  (p=0.000 n=10+10)
Gzip-2                      414ms ± 1%     414ms ± 0%    ~     (p=0.842 n=9+10)
Gunzip-2                   66.3ms ± 0%    66.4ms ± 0%    ~     (p=0.077 n=9+9)
HTTPClientServer-2         66.3µs ± 5%    66.4µs ± 1%    ~     (p=0.661 n=10+9)
JSONEncode-2               23.9ms ± 1%    23.9ms ± 1%    ~     (p=0.905 n=10+9)
JSONDecode-2                119ms ± 1%     116ms ± 0%  -2.65%  (p=0.000 n=10+10)
Mandelbrot200-2            5.11ms ± 0%    4.92ms ± 0%  -3.71%  (p=0.000 n=10+10)
GoParse-2                  5.81ms ± 1%    5.84ms ± 1%    ~     (p=0.052 n=10+10)
RegexpMatchEasy0_32-2       315ns ± 0%     317ns ± 0%  +0.67%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K-2       658ns ± 0%     638ns ± 0%  -3.01%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-2       315ns ± 1%     317ns ± 0%  +0.56%  (p=0.000 n=9+9)
RegexpMatchEasy1_1K-2       935ns ± 0%     926ns ± 0%  -0.96%  (p=0.000 n=9+9)
RegexpMatchMedium_32-2      394ns ± 0%     396ns ± 1%  +0.46%  (p=0.001 n=10+10)
RegexpMatchMedium_1K-2     65.1µs ± 0%    64.5µs ± 0%  -0.90%  (p=0.000 n=9+9)
RegexpMatchHard_32-2       3.16µs ± 0%    3.17µs ± 0%  +0.35%  (p=0.000 n=10+9)
RegexpMatchHard_1K-2       89.4µs ± 0%    89.3µs ± 0%    ~     (p=0.136 n=9+9)
Revcomp-2                   703ms ± 2%     694ms ± 2%  -1.41%  (p=0.009 n=10+10)
Template-2                  107ms ± 1%     107ms ± 1%    ~     (p=0.053 n=9+10)
TimeParse-2                 526ns ± 0%     524ns ± 0%  -0.34%  (p=0.002 n=9+9)
TimeFormat-2                534ns ± 0%     504ns ± 1%  -5.51%  (p=0.000 n=10+10)
[Geo mean]                 93.8µs         93.1µs       -0.70%

It also helps in the case mentioned in issue #17110, main.main
in package math's test. Now it generates 4 loads of R31 instead
of 10, for the same piece of code.

This causes a slight increase of binary size: cmd/go increases
0.66%.

If this is a good idea, we should do it on other architectures
where accessing global is expensive.

Updates #17110.

Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6
Reviewed-on: https://go-review.googlesource.com/68250
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment