Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

Open
laboger opened this issue Sep 14, 2016 · 15 comments
Open

cmd/compile: avoid reloads of temp register on ppc64le, others #17110

laboger opened this issue Sep 14, 2016 · 15 comments

Comments

@laboger
Copy link
Contributor

@laboger laboger commented Sep 14, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version devel +9fcbde7 Wed Sep 14 10:07:17 2016 -0500 linux/ppc64le

What operating system and processor architecture are you using (go env)?

Ubuntu 16.04 ppc64le

What did you do?

Built the test from the math package, then inspected the objdump.

What did you expect to see?

Looking for opportunities for improvement.

What did you see instead?

In main.main, this code is generated:
111c8: 1f 00 e0 3f lis r31,31
111cc: 98 29 9f e8 ld r4,10648(r31)
111d0: 1f 00 e0 3f lis r31,31
111d4: 90 29 bf e8 ld r5,10640(r31)
111d8: 1f 00 e0 3f lis r31,31
111dc: e0 29 df e8 ld r6,10720(r31)
111e0: 1f 00 e0 3f lis r31,31
111e4: d8 29 ff e8 ld r7,10712(r31)
111e8: 1f 00 e0 3f lis r31,31
111ec: d0 29 1f e9 ld r8,10704(r31)
111f0: 1f 00 e0 3f lis r31,31
111f4: a0 29 3f e9 ld r9,10656(r31)
111f8: 1f 00 e0 3f lis r31,31
111fc: b0 29 5f e9 ld r10,10672(r31)
11200: 1f 00 e0 3f lis r31,31
11204: b8 29 7f e9 ld r11,10680(r31)
11208: 1f 00 e0 3f lis r31,31
1120c: c0 29 9f e9 ld r12,10688(r31)

r31 is reloaded each time with the same value. I believe this happens because this is a MOVD based on SB which implicitly generates two instructions, and uses r31 for the temp register and possibly that is not visible to the register allocator.

@mwhudson
Copy link
Contributor

@mwhudson mwhudson commented Oct 5, 2016

Yeah, this happens because a single AMOVD becomes two real instructions so there's not really a stage where this can be fixed in the current way of working. I guess the move needs to be decomposed earlier.

I think a similar issue affects ARM64 (and probably the other fixed-width instruction architectures).

Loading

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Oct 6, 2016

Did this very thing on Sparc decades ago; the inner loop of fpppp (which was huge) had 42 different "base registers" which was of course far more than the number of registers.

Loading

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 2, 2016

I'd like to push this off until 1.9. Note that this is problem occurs on other architectures, not just PPC.

Loading

@laboger
Copy link
Contributor Author

@laboger laboger commented Nov 2, 2016

OK, so do you want the title changed?

Loading

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 2, 2016

That would be fine. Any architecture with load/store instructions using a register + small immediate to form the address is eligible for this optimization. I suspect it might need linker or at least assembler assistance, since we'll be emitting relocations along the lines of

  base := &global1
  x := *(base + (&global2-&global1))

and we'd want to usually guess right at the rough magnitude of the difference of those two globals.

Loading

@rsc rsc added this to the Go1.9 milestone Nov 2, 2016
@rsc rsc removed this from the Go1.8 milestone Nov 2, 2016
@rsc rsc changed the title cmd/compile: avoid reloads of temp register on ppc64le cmd/compile: avoid reloads of temp register on ppc64le, others Nov 2, 2016
@randall77 randall77 added this to the Go1.10 milestone May 31, 2017
@randall77 randall77 removed this from the Go1.9 milestone May 31, 2017
@laboger
Copy link
Contributor Author

@laboger laboger commented Jul 7, 2017

Where this problem shows up most often is when referencing constants, which is common in many of functions in the math package. It seems that with constants, they should all be in rodata, and then the base could be the start of the rodata (named: runtime.rodata). Then the offset of the constant from the start of rodata would be known at compile time.

Likewise something similar could be done for static data.

Loading

@laboger
Copy link
Contributor Author

@laboger laboger commented Sep 19, 2017

@dr2chase I'd like to try and look at this further and try to get it working. If you have some clues or pointers that would be helpful.

  • I can see that linker assistance is needed but not sure how to do that.
  • I see that the base register needs to be split from the load of the constant because now they are always done together, but where does that need to be done so that common base registers can be recognized.
    Any help appreciated...

Loading

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68250 mentions this issue: cmd/compile: don't fold address of global into load/store on PPC64

Loading

gopherbot pushed a commit that referenced this issue Oct 31, 2017
On PPC64 (and a few other architectures), accessing global
requires multiple instructions and use of temp register.
The compiler emits a single MOV prog, and the assembler
expands it to multiple instructions. If globals are accessed
multiple times, each time it generates a reload of the temp
register. As this is done by the assembler, the compiler
cannot optimize it.

This CL makes the compiler not fold address of global into load
and store. If a global is accessed multiple times, or multiple
fields of a struct are accessed, the compiler can CSE the
address. Currently, this doesn't help the case where different
globals are accessed, even though they may be close to each
other in the address space (which we don't know at compile time).

It helps a little bit in go1 benchmark:

name                     old time/op    new time/op    delta
BinaryTree17-2              4.84s ± 1%     4.84s ± 1%    ~     (p=0.796 n=10+10)
Fannkuch11-2                4.10s ± 0%     4.08s ± 0%  -0.58%  (p=0.000 n=9+8)
FmtFprintfEmpty-2          97.9ns ± 1%    96.8ns ± 1%  -1.08%  (p=0.000 n=10+10)
FmtFprintfString-2          147ns ± 0%     147ns ± 1%    ~     (p=0.129 n=9+10)
FmtFprintfInt-2             152ns ± 0%     152ns ± 0%    ~     (p=0.294 n=10+8)
FmtFprintfIntInt-2          218ns ± 1%     217ns ± 0%  -0.64%  (p=0.000 n=10+8)
FmtFprintfPrefixedInt-2     263ns ± 1%     256ns ± 0%  -2.77%  (p=0.000 n=10+8)
FmtFprintfFloat-2           375ns ± 1%     368ns ± 0%  -1.95%  (p=0.000 n=10+7)
FmtManyArgs-2               849ns ± 0%     850ns ± 0%    ~     (p=0.621 n=8+9)
GobDecode-2                12.3ms ± 1%    12.2ms ± 1%  -0.94%  (p=0.003 n=10+10)
GobEncode-2                10.3ms ± 1%    10.5ms ± 1%  +2.03%  (p=0.000 n=10+10)
Gzip-2                      414ms ± 1%     414ms ± 0%    ~     (p=0.842 n=9+10)
Gunzip-2                   66.3ms ± 0%    66.4ms ± 0%    ~     (p=0.077 n=9+9)
HTTPClientServer-2         66.3µs ± 5%    66.4µs ± 1%    ~     (p=0.661 n=10+9)
JSONEncode-2               23.9ms ± 1%    23.9ms ± 1%    ~     (p=0.905 n=10+9)
JSONDecode-2                119ms ± 1%     116ms ± 0%  -2.65%  (p=0.000 n=10+10)
Mandelbrot200-2            5.11ms ± 0%    4.92ms ± 0%  -3.71%  (p=0.000 n=10+10)
GoParse-2                  5.81ms ± 1%    5.84ms ± 1%    ~     (p=0.052 n=10+10)
RegexpMatchEasy0_32-2       315ns ± 0%     317ns ± 0%  +0.67%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K-2       658ns ± 0%     638ns ± 0%  -3.01%  (p=0.000 n=9+9)
RegexpMatchEasy1_32-2       315ns ± 1%     317ns ± 0%  +0.56%  (p=0.000 n=9+9)
RegexpMatchEasy1_1K-2       935ns ± 0%     926ns ± 0%  -0.96%  (p=0.000 n=9+9)
RegexpMatchMedium_32-2      394ns ± 0%     396ns ± 1%  +0.46%  (p=0.001 n=10+10)
RegexpMatchMedium_1K-2     65.1µs ± 0%    64.5µs ± 0%  -0.90%  (p=0.000 n=9+9)
RegexpMatchHard_32-2       3.16µs ± 0%    3.17µs ± 0%  +0.35%  (p=0.000 n=10+9)
RegexpMatchHard_1K-2       89.4µs ± 0%    89.3µs ± 0%    ~     (p=0.136 n=9+9)
Revcomp-2                   703ms ± 2%     694ms ± 2%  -1.41%  (p=0.009 n=10+10)
Template-2                  107ms ± 1%     107ms ± 1%    ~     (p=0.053 n=9+10)
TimeParse-2                 526ns ± 0%     524ns ± 0%  -0.34%  (p=0.002 n=9+9)
TimeFormat-2                534ns ± 0%     504ns ± 1%  -5.51%  (p=0.000 n=10+10)
[Geo mean]                 93.8µs         93.1µs       -0.70%

It also helps in the case mentioned in issue #17110, main.main
in package math's test. Now it generates 4 loads of R31 instead
of 10, for the same piece of code.

This causes a slight increase of binary size: cmd/go increases
0.66%.

If this is a good idea, we should do it on other architectures
where accessing global is expensive.

Updates #17110.

Change-Id: I2687af6eafc04f2a57c19781ec300c33567094b6
Reviewed-on: https://go-review.googlesource.com/68250
Run-TryBot: Cherry Zhang <cherryyz@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
@bradfitz bradfitz removed this from the Go1.10 milestone Nov 28, 2017
@bradfitz bradfitz added this to the Go1.11 milestone Nov 28, 2017
@bradfitz bradfitz removed this from the Go1.11 milestone May 18, 2018
@bradfitz bradfitz added this to the Unplanned milestone May 18, 2018
@laboger
Copy link
Contributor Author

@laboger laboger commented Apr 9, 2020

@cherrymui @jeremyfaller Will there be support in the new linker to fix this problem? It still exists, mostly in the math package.

Loading

@laboger
Copy link
Contributor Author

@laboger laboger commented Apr 9, 2020

Here's an example of some code from math.sin. R31 is reloaded with the same value over and over.

                x = Sin(Pi * (x - 2))
  0x94cb4               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cb8               c85fba90                LFD -17776(R31),F2                   // lfd f2,-17776(r31)      
  0x94cbc               fc211028                FSUB F1,F2,F1                        // fsub f1,f1,f2           
  0x94cc0               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cc4               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94cc8               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94ccc               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94cd0               480027f1                CALL math.Sin(SB)                    // bl 0x974c0              
  0x94cd4               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
        return -x
  0x94cd8               fc200850                FNEG F1,F1                           // fneg f1,f1      
  0x94cdc               d8210060                STFD 96(R1),F1                       // stfd f1,96(r1)  
  0x94ce0               ebe10000                MOVD 0(R1),R31                       // ld r31,0(r1)    
  0x94ce4               7fe803a6                MOVD R31,LR                          // mtlr r31        
  0x94ce8               38210038                ADD R1,$56,R1                        // addi r1,r1,56   
  0x94cec               4e800020                RET                                  // bclr 20,lt,0    
                x = Cos(Pi * (0.5 - x))
  0x94cf0               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94cf4               c85fb8e0                LFD -18208(R31),F2                   // lfd f2,-18208(r31)      
  0x94cf8               fc220828                FSUB F2,F1,F1                        // fsub f1,f2,f1           
  0x94cfc               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94d00               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94d04               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94d08               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94d0c               480027d5                CALL math.Cos(SB)                    // bl 0x974e0              
  0x94d10               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
  0x94d14               4bffffc4                BR 0x94cd8                           // b 0x94cd8               
                x = Sin(Pi * x)
  0x94d18               3fe0001b                ADDIS $0,$27,R31                     // lis r31,27              
  0x94d1c               c85fbaf8                LFD -17672(R31),F2                   // lfd f2,-17672(r31)      
  0x94d20               fc2100b2                FMUL F1,F2,F1                        // fmul f1,f1,f2           
  0x94d24               d8210020                STFD 32(R1),F1                       // stfd f1,32(r1)          
  0x94d28               48002799                CALL math.Sin(SB)                    // bl 0x974c0              
  0x94d2c               c8210028                LFD 40(R1),F1                        // lfd f1,40(r1)           
  0x94d30               4bffffa8                BR 0x94cd8                           // b 0x94cd8               

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Apr 10, 2020

I don't think the new linker does anything with this. This is all in the compiler.

I wonder if marking global address (i.e. a MOVDaddr with the operand being SB) not rematerializeable helps. If we do that, the register allocator will try to keep the address in a register, instead of rematerialize it each time. It doesn't last across a call, though.

Loading

@laboger
Copy link
Contributor Author

@laboger laboger commented Apr 10, 2020

There are two problems here that I am aware of.

First, when loading a constant or static variable, the base address is always generated using 2 instructions in asm9.go. The relocated value for the upper part of the base address is loaded into r31 (temp register) then added into the relocated value for the lower part of the base address. Because it is done this way, the register allocator is not able to find the first value (in r31) in common, even though it frequently is reloading the same value over and over. This could be split out into 2 instructions in obj9.go (or elsewhere?) and I would be willing to experiment with this.

The second problem is that these values are created as relocated values relative to the constant or static variable. The actual values aren't known until link time and I don't think they really look like the same value to the register allocator so it wouldn't know they are the same even if it had the information to try.

We would need to do something like David Chase mentions in his post on Nov. 2016. Maybe I'm jumping to conclusions, but I thought maybe since it looks like a TOC is now always being generated that would help this problem. I thought all addresses of this type would be in the TOC and the offset within the TOC would be known so that the same base could be used when referencing them. Or maybe I'm making some incorrect assumptions on how this works.

Loading

@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 10, 2020

I agree - we'll need to put all the constants together in a group (or groups, if there are too many?), add an SSA instruction to get the base of that group, and have each constant load be an offset from its group base.

There's no PC-relative load in powerpc? That would make life much easier.

Loading

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Apr 11, 2020

This is a place where a globally reserved SB register can be useful. Maybe we could do that now with the internal ABI.

Loading

@mwhudson
Copy link
Contributor

@mwhudson mwhudson commented Apr 11, 2020

There's no PC-relative load in powerpc? That would make life much easier.

There is not, but I'm not actually sure it would help. cmd/compile has code to maintain R2 as a TOC pointer (it's needed for shared libraries) which could be globally enabled but the load instructions only have a 16 bit displacement and I don't think you know enough at this point of codegen to know if the base value can be shared between two loads?

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants