Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: add GOAMD64 architecture variants #25489

josharian opened this issue May 22, 2018 · 20 comments

cmd/compile: add GOAMD64 architecture variants #25489

josharian opened this issue May 22, 2018 · 20 comments


Copy link

@josharian josharian commented May 22, 2018

This is a data-gathering bug to collect instances in which a GOAMD64 environment variable would be helpful. Please feel free to add to it. My hope is that over time, if/when these use cases accumulate, we will get a clearer picture of whether to add GOAMD64, and if so, what shape it should take.


cc @randall77 @martisch @TocarIP @quasilyte @dr2chase

@josharian josharian added this to the Unplanned milestone May 22, 2018
Copy link

@TocarIP TocarIP commented May 22, 2018

I'd like to add:

  • non-destructive floating point math (X2 = X1 + X0, instead of X1 += X0)
  • faster duffcopy and OpMove via AVX
  • some scalar instructions like ANDN (and-not)
  • sane versions of Ctz, without undefined results on 0 argument
  • removing CPUID checks from intrinsified math functions (mostly rounding)
  • shifts/rotates that don't clobber flags and allow
  • AVX512-specific fp-to int64 conversions
  • extra AVX512 registers

There is also a problem with opting into 512-bit version of AVX512.
By default only 256-bit version should be used to avoid frequency drop, but there should be a mechanism to opt-in for 512-bit version for faster memmove/crypto/etc

Copy link

@mvdan mvdan commented May 22, 2018

I thought that the Go runtime and standard library already had ways to check for certain hardware features at run-time. How do these enhancements compare to that? If these were to be made at compile-time, would the run-time checks be removed?

Copy link

@randall77 randall77 commented May 22, 2018

It is preferable to use runtime detection when possible. For any sizeable chunk of code, that's what we currently do. But especially for individual instructions, the overhead for runtime checking can be large. In that case, it can be preferable to "bake in" the decision during compilation. That's what such an environment variable would do.

Copy link

@mvdan mvdan commented May 22, 2018

Apologues for asking more questions - if we add an environment variable, wouldn't it be simpler and faster to simply use the optimal code for the hardware, getting rid of all run-time detection?

Unless the point is to do a best-effort combination of the both, so that very expensive code like crypto would still perform well on newer hardware even if the environment variable wasn't set up properly?

Copy link

@randall77 randall77 commented May 22, 2018

We want to use the environment variable as little as possible, as it bakes the optimization decision in at compile time. As a consequence, the binary cannot then adapt at runtime; it would fail on an old machine, or not use optimal code on a new machine.
This will lead to people having to distribute multiple copies of their binary, possibly many depending on how many options are available.
It's also extra API surface to the compiler which we would have to spec & maintain.
So I think we'd need a really good reason to go down this road.

Copy link

@quasilyte quasilyte commented May 22, 2018

function multi-versioning was proposed some time ago as a compromise between less run-time checks and portability between older/newer CPUs.

I can't find discussion right now but the approach with checking CPUID features once and then using
specialized version looks good (at least on paper).

I have no specific proposals, more over, I'm not familiar with referenced technique well.
This is just a note that there is another approaches between "full compile-time" and "full run-time" selection. Thank you.

Copy link

@quasilyte quasilyte commented May 22, 2018

For some cases indirect call would be slower than call + branch inside it though. Needs a lot of investigation.

There is also JIT-like approach where we re-write function code segment during startup if there are more optimized versions available. This is not possible for every imaginable platforms, but given this is mostly about AMD64, seems doable in that context.

Copy link

@agnivade agnivade commented May 23, 2018

Great ! I have been wanting this for a long time -!topic/golang-nuts/b75xYuORaok :) My thinking was that using the non-destructive 3 operand form of VEX encoded instructions will alleviate register pressure and will make the binaries much smaller (Unfortunately I know nothing about compiler internals to do a PoC and confirm this).

@ianlancetaylor had mentioned -

I don't think it's worth adding a GOAMD64 environment variable for
this. I don't think the gain is worth the complexity. Happy to be
proven wrong.

Copy link

@TocarIP TocarIP commented Jun 7, 2018

I've implemented a prototype in to see how beneficial this can be.
@agnivade can you check if it with your binaries?

Copy link

@agnivade agnivade commented Jun 8, 2018

With pleasure :)

Copy link

@agnivade agnivade commented Jun 10, 2018

I have done some testing.

On binary size, I am seeing a slight decrease 🎉. I tested with 3 production applications. 2 network services and 1 webapp. This is without -ldflags="-s -w". Baseline was with current master.

Service1 - 17488784 to 17482928 = 5856
Service2 - 16163230 to 16151311 = 11919
Webapp - 21802366 to 21794461= 7905

On the perf side, general application benchmarks do not show noticeable improvements. But if I do micro-benchmarks targeting very specific functions - for eg if I benchmark math.Atan2 which has lot of FMA statements, I can see improvement of 2ns.

I looked into the objdump of the binaries to investigate possible reasons for this (with whatever meagre assembly knowledge I have).

One thing that I noticed is that the V(ADD|SUB|MUL|DIV)SD operations always act on registers, and not on memory operands which they are especially useful for in reducing register pressure.

For eg. I see patterns like this throughout -

vmovsd 0x66323(%rip),%xmm3        # 477a38 <$f64.4059000000000000>
vdivsd %xmm3,%xmm2,%xmm2

vmovsd 0x662bf(%rip),%xmm3        # 4779e0 <$f64.3ff0000000000000>
vsubsd %xmm3,%xmm0,%xmm0

vmovsd 0xc6b83(%rip),%xmm4        # 4d82b0 <runtime.memstats+0x1730>
vsubsd %xmm4,%xmm2,%xmm5

which I believe can be optimized to -

vdivsd 0x66323(%rip),%xmm2,%xmm2

vsubsd 0x662bf(%rip),%xmm0,%xmm0

vsubsd 0xc6b83(%rip),%xmm2,%xmm5

That's all I could glean. Happy to provide more info if required.

Copy link

@dr2chase dr2chase commented Jun 11, 2018

Benchmarks that I ran Friday, got busy and forgot to post:
Quiet machine, using Austin's perflock to make things even more sane.
Saw -0.5% geomean time change, one benchmark got worse for unclear reasons.
Binaries tended slightly smaller:

size *_*
   text	   data	    bss	    dec	    hex	filename
3745235	 104712	 137584	3987531	 3cd84b	capnproto2_Tip
3743659	 104712	 137584	3985955	 3cd223	capnproto2_TryCL117925
2380835	 101072	 136296	2618203	 27f35b	ethereum_bitutil_Tip
2379013	 101072	 136296	2616381	 27ec3d	ethereum_bitutil_TryCL117925
4703293	 118072	 181832	5003197	 4c57bd	ethereum_corevm_Tip
4700462	 118072	 181832	5000366	 4c4cae	ethereum_corevm_TryCL117925
3568662	 116248	 137736	3822646	 3a5436	ethereum_ecies_Tip
3566582	 116248	 137736	3820566	 3a4c16	ethereum_ecies_TryCL117925
7089434	 260153	 149576	7499163	 726d9b	ethereum_ethash_Tip
7087579	 260153	 149576	7497308	 72665c	ethereum_ethash_TryCL117925
2408673	  95248	 135720	2639641	 284719	ethereum_sha3_Tip
2407012	  95248	 135720	2637980	 28409c	ethereum_sha3_TryCL117925
4680040	 108704	 139984	4928728	 4b34d8	ethereum_storage_Tip
4678379	 108704	 139984	4927067	 4b2e5b	ethereum_storage_TryCL117925
5023812	 111384	 139656	5274852	 507ce4	ethereum_trie_Tip
5021601	 111384	 139656	5272641	 507441	ethereum_trie_TryCL117925
9392334	 881770	 152488	10426592	 9f18e0	ethereum_whisperv5_Tip
9390867	 881770	 152488	10425125	 9f1325	ethereum_whisperv5_TryCL117925
3297273	 174104	 135432	3606809	 370919	gonum_blas_native_Tip
3292393	 174104	 135432	3601929	 36f609	gonum_blas_native_TryCL117925
2768825	 122864	 145672	3037361	 2e58b1	gonum_community_Tip
2767891	 122864	 145672	3036427	 2e550b	gonum_community_TryCL117925
5086714	 287120	 135816	5509650	 541212	gonum_lapack_native_Tip
5086303	 287120	 135816	5509239	 541077	gonum_lapack_native_TryCL117925
4601987	 124912	 143528	4870427	 4a511b	gonum_mat_Tip
4600506	 124912	 143528	4868946	 4a4b52	gonum_mat_TryCL117925
2840813	 222312	 143400	3206525	 30ed7d	gonum_path_Tip
2839674	 222312	 143400	3205386	 30e90a	gonum_path_TryCL117925
2444690	 103504	 144296	2692490	 29158a	gonum_topo_Tip
2442897	 103504	 144296	2690697	 290e89	gonum_topo_TryCL117925
2237492	  97360	 143080	2477932	 25cf6c	gonum_traverse_Tip
2235723	  97360	 143080	2476163	 25c883	gonum_traverse_TryCL117925
14978433	 400569	 153168	15532170	 ed008a	hugo_helpers_Tip
14976682	 400569	 153168	15530419	 ecf9b3	hugo_helpers_TryCL117925
20526259	 436953	 186064	21149276	142b65c	hugo_hugolib_sitebuilding_Tip
20521440	 436953	 186064	21144457	142a389	hugo_hugolib_sitebuilding_TryCL117925
20526259	 436953	 186064	21149276	142b65c	hugo_hugolib_Tip
20521440	 436953	 186064	21144457	142a389	hugo_hugolib_TryCL117925
24354787	 323009	 162576	24840372	17b08b4	k8s_api_Tip
24343935	 323009	 162576	24829520	17ade50	k8s_api_TryCL117925
14824181	 283489	 155440	15263110	 e8e586	k8s_schedulercache_Tip
14818560	 283489	 155440	15257489	 e8cf91	k8s_schedulercache_TryCL117925
2534866	  96176	 133736	2764778	 2a2fea	semver_Tip
2533177	  96176	 133736	2763089	 2a2951	semver_TryCL117925
2418677	  94992	 133640	2647309	 28650d	spexs2_Tip
2416868	  94992	 133640	2645500	 285dfc	spexs2_TryCL117925
6954182	 226921	 147440	7328543	 6fd31f	uber_zap_Tip
6953004	 226921	 147440	7327365	 6fce85	uber_zap_TryCL117925
3659912	 107880	 143376	3911168	 3bae00	uuid_Tip
3658142	 107880	 143376	3909398	 3ba716	uuid_TryCL117925
Copy link

@agnivade agnivade commented Jun 11, 2018

I apologize for my stupidity. I was comparing with 1.10.2. 😞 I have updated my comment. The results match with @dr2chase's observations.

Copy link

@dr2chase dr2chase commented Jun 11, 2018

No need for the s-word, benchmarking is picky work that's easy to get wrong, I ended up writing a tool to help me do it because I could not trust myself to do it all correctly and reproducibly, and I really ought to make that tool automate even more of the steps.

Copy link

@TocarIP TocarIP commented Jun 12, 2018

@dr2chase what machine are you using? I've tried to reproduce Growth_MultiSegment regression, but it showed no changes.

Copy link

@dr2chase dr2chase commented Jun 12, 2018

It is "Model name: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz"
I used Austin's perflock to reduce risk of heat-related performance changes, used a quiet machine.

Copy link
Contributor Author

@josharian josharian commented May 4, 2019

It'd nice to be able to use BLSR from BMI1, e.g. to rapidly iterate over set bits, which is something that we do in a few hot functions in the runtime (adjustpointers, scanblock).

Copy link

@seebs seebs commented Aug 17, 2019

I haven't got data on the performance impact yet, because I have to hack the compiler to do it, but we have a tight loop that's oring two 8KB blocks together, and we also want to popcount them, and right now, on amd64, adding popcount more than doubles runtime for the loop according to naive benchmarks.

The difference from adding a popcount inside the loop is larger than I'd have expected:

        movq    (CX)(DX*8), BX
        movq    (AX)(DX*8), SI
        orq     SI, BX
        movq    BX, (CX)(DX*8)


        movq    8(CX)(DX*8), DI
        movq    8(AX)(DX*8), R8
        orq    R8, DI
        movq    DI, 8(CX)(DX*8)
        addq    SI, BX
        cmpb    runtime.x86HasPOPCNT(SB), $0
        jeq     unionBitmapBitmapInPlaceTemps_pc349
        xorq    SI, SI
        popcntq DI, SI

Note, the "addq" in the middle is actually adding a previous popcount, this loop's popcount is added to a counter about four instructions later. So this isn't the actual instructions for this step, but it's the same instructions as them.

I don't understand why there's an xorq SI, SI right before the popcount, because the popcount just overwrites that register anyway, I think? But my expectation would have been that popcount would have fairly low impact, because the entire operation is trivial enough that I wouldn't expect it to be running out of computation before it gets stuck waiting for cache lines. I did not expect it to double runtime cost. (I expected just the popcountq and addq instructions, basically.)

If time allows, I'm gonna hack my local compiler to ditch that and see what the impact is like.

Copy link

@zeebo zeebo commented Aug 17, 2019

For the XORQ SI, SI, if I recall correctly, this is because on some architectures popcnt has a false dependency on the output register.

Copy link

@seebs seebs commented Aug 17, 2019

So I made a funny mistake which actually produced informative results: I wrote a benchmark for this in which I omitted any actual use of the results of popcount operations.

As it happens, this has the curious effect of generating the conditional jumps to call the popcount function if necessary, but not generating the actual popcount instructions (as the compiler figured out that the value being set to their sum was unused).

So, for basically similar code, my reported times were:

no popcount: 450ns/op
conditionals but no popcount: 600ns/op
popcount included: 800ns/op

(somewhat approximate values)

I then went ahead and did the fairly obvious thing to just take away the runtime tests for OnesCount64, and with those, I get 444/444/520.

What's interesting about this is that the cost of adding the popcount instructions without the conditional tests is dramatically lower (~80ns) than the cost of adding the popcount instructions with the conditional tests (~200ns more than the conditionals without the popcounts).

So, for this specific function, it looks like the difference would be ~520ns vs. ~800ns. A 35% reduction in runtime for a hot loop is pretty tempting.

I don't really know enough about modern machines to know how common it is to find amd64 hardware without popcount. But I'd say that, yeah, I would be pretty interested in a target that allowed me to get that improvement on something that has been a pain point for us in the past.

Just in case it's relevant: Looking in more detail with perf, it looks like the conditionals are noticably slowing execution down, even though they're virtually never mispredicted.

using 1.12.9:

       3966.613205      task-clock (msec)         #    1.001 CPUs utilized          
               842      context-switches          #    0.212 K/sec                  
                28      cpu-migrations            #    0.007 K/sec                  
             1,233      page-faults               #    0.311 K/sec                  
    14,233,634,600      cycles                    #    3.588 GHz                      (30.71%)
    51,543,275,201      instructions              #    3.62  insn per cycle           (38.51%)
     7,791,177,587      branches                  # 1964.189 M/sec                    (38.43%)
         5,647,268      branch-misses             #    0.07% of all branches          (38.31%)
    15,594,961,940      L1-dcache-loads           # 3931.556 M/sec                    (38.44%)
         2,310,978      L1-dcache-load-misses     #    0.01% of all L1-dcache hits    (38.47%)
            52,691      LLC-loads                 #    0.013 M/sec                    (30.64%)
            17,671      LLC-load-misses           #   33.54% of all LL-cache hits     (30.63%)
   <not supported>      L1-icache-loads                                             
           556,635      L1-icache-load-misses                                         (30.73%)
    15,533,175,254      dTLB-loads                # 3915.979 M/sec                    (30.83%)
             3,938      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.93%)
             6,249      iTLB-loads                #    0.002 M/sec                    (31.06%)
               856      iTLB-load-misses          #   13.70% of all iTLB cache hits   (30.83%)

Using the same thing, but with the conditional jump for OnesCount64 omitted:

       2615.934480      task-clock (msec)         #    1.001 CPUs utilized          
               770      context-switches          #    0.294 K/sec                  
                34      cpu-migrations            #    0.013 K/sec                  
             1,238      page-faults               #    0.473 K/sec                  
     9,387,417,295      cycles                    #    3.589 GHz                      (30.25%)
    37,574,374,842      instructions              #    4.00  insn per cycle           (38.10%)
     1,338,302,028      branches                  #  511.596 M/sec                    (38.27%)
         5,577,182      branch-misses             #    0.42% of all branches          (38.38%)
    10,322,451,044      L1-dcache-loads           # 3945.990 M/sec                    (38.52%)
         2,035,316      L1-dcache-load-misses     #    0.02% of all L1-dcache hits    (38.58%)
            42,063      LLC-loads                 #    0.016 M/sec                    (30.87%)
            12,212      LLC-load-misses           #   29.03% of all LL-cache hits     (31.17%)
   <not supported>      L1-icache-loads                                             
           379,094      L1-icache-load-misses                                         (31.17%)
    10,403,203,012      dTLB-loads                # 3976.859 M/sec                    (31.17%)
             1,747      dTLB-load-misses          #    0.00% of all dTLB cache hits   (30.85%)
             2,515      iTLB-loads                #    0.961 K/sec                    (30.46%)
               577      iTLB-load-misses          #   22.94% of all iTLB cache hits   (30.31%)

Branch misses don't seem to be a significant factor here, but something about this is causing some stalling, giving us a noticable drop in instructions/cycle, but also increasing instructions executed about 35%.


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.