
runtime: BinaryTree17 performance regression #13535


Closed

aclements opened this issue Dec 8, 2015 · 3 comments

@aclements (Member)

BinaryTree17 from test/bench/go1 clearly slowed down by ~5% with commit 7407d8e, which made the proportional sweeper less aggressive.

BinaryTree17-12   3.04s ± 3%   3.19s ± 1%  +4.94%  (p=0.008 n=5+5)
@aclements aclements self-assigned this Dec 8, 2015
@aclements aclements added this to the Go1.6 milestone Dec 8, 2015
@aclements (Member, Author)

This commit has a significant effect on the behavior of the background sweeper, which means the profiling information is largely masked by #13527. In fact, by just about every performance counter, it looks like things should have improved:

Before:

       5780.387371 task-clock (msec)         #    1.298 CPUs utilized            ( +-  0.14% )
            80,332 context-switches          #    0.014 M/sec                    ( +-  1.42% )
               218 cpu-migrations            #    0.038 K/sec                    ( +-  7.59% )
            98,357 page-faults               #    0.017 M/sec                    ( +-  0.20% )
    21,038,437,476 cycles                    #    3.640 GHz                      ( +-  0.16% )
     8,109,376,876 stalled-cycles-frontend   #   38.55% frontend cycles idle     ( +-  0.52% )
   <not supported> stalled-cycles-backend  
    44,019,548,586 instructions              #    2.09  insns per cycle        
                                             #    0.18  stalled cycles per insn  ( +-  0.05% )
     7,870,279,001 branches                  # 1361.549 M/sec                    ( +-  0.06% )
        20,604,627 branch-misses             #    0.26% of all branches          ( +-  0.36% )

        21,155,717 cache-misses              #   32.480 % of all cache refs      ( +-  1.38% )
        65,135,328 cache-references                                              ( +-  1.44% )

       4.454431364 seconds time elapsed                                          ( +-  0.14% )

After:

       5501.692468 task-clock (msec)         #    1.159 CPUs utilized            ( +-  1.13% )
            43,624 context-switches          #    0.008 M/sec                    ( +-  1.65% )
               253 cpu-migrations            #    0.046 K/sec                    ( +- 20.66% )
            92,468 page-faults               #    0.017 M/sec                    ( +-  4.19% )
    20,038,538,003 cycles                    #    3.642 GHz                      ( +-  1.00% )
     7,398,799,878 stalled-cycles-frontend   #   36.92% frontend cycles idle     ( +-  2.71% )
   <not supported> stalled-cycles-backend  
    43,409,563,204 instructions              #    2.17  insns per cycle        
                                             #    0.17  stalled cycles per insn  ( +-  0.10% )
     7,726,564,010 branches                  # 1404.398 M/sec                    ( +-  0.11% )
        19,121,674 branch-misses             #    0.25% of all branches          ( +-  0.19% )

        20,157,127 cache-misses              #   33.563 % of all cache refs      ( +-  0.19% )
        60,058,099 cache-references                                              ( +-  0.58% )

       4.748749757 seconds time elapsed                                          ( +-  1.68% )

The one thing that doesn't improve is the amount of time we spend blocked in futex: it goes from 43528.4 total ms to 45066.6 total ms (though the number of futex calls drops from 272919 to 106681!).

(The futex stats are computed from `perf trace -- $BENCH -test.bench BinaryTree17 | awk '{N+=$2;T+=$2*$4}END{print N " " T}'`.)

@aclements (Member, Author)

With the background sweeper disabled as a workaround, the performance metrics become much clearer:

Before:

       4901.222760 task-clock (msec)         #    1.054 CPUs utilized            ( +-  0.13% )
             2,136 context-switches          #    0.436 K/sec                    ( +-  2.96% )
                99 cpu-migrations            #    0.020 K/sec                    ( +-  6.83% )
            82,005 page-faults               #    0.017 M/sec                    ( +-  0.18% )
    18,211,805,777 cycles                    #    3.716 GHz                      ( +-  0.18% )
     5,917,997,422 stalled-cycles-frontend   #   32.50% frontend cycles idle     ( +-  0.53% )
   <not supported> stalled-cycles-backend  
    42,052,410,589 instructions              #    2.31  insns per cycle        
                                             #    0.14  stalled cycles per insn  ( +-  0.04% )
     7,711,177,418 branches                  # 1573.317 M/sec                    ( +-  0.04% )
        16,829,018 branch-misses             #    0.22% of all branches          ( +-  0.21% )

       4.650601834 seconds time elapsed                                          ( +-  0.12% )

        12,926,113 cache-misses              #   41.799 % of all cache refs      ( +-  0.22% )
        30,924,384 cache-references                                              ( +-  0.42% )

       4.643145976 seconds time elapsed                                          ( +-  0.15% )

After:

       5000.849191 task-clock (msec)         #    1.050 CPUs utilized            ( +-  0.24% )
             2,122 context-switches          #    0.424 K/sec                    ( +-  2.10% )
               107 cpu-migrations            #    0.021 K/sec                    ( +-  6.88% )
            82,336 page-faults               #    0.016 M/sec                    ( +-  0.15% )
    18,617,298,137 cycles                    #    3.723 GHz                      ( +-  0.11% )
     6,249,607,127 stalled-cycles-frontend   #   33.57% frontend cycles idle     ( +-  0.32% )
   <not supported> stalled-cycles-backend  
    42,576,852,645 instructions              #    2.29  insns per cycle        
                                             #    0.15  stalled cycles per insn  ( +-  0.02% )
     7,554,217,419 branches                  # 1510.587 M/sec                    ( +-  0.02% )
        17,991,876 branch-misses             #    0.24% of all branches          ( +-  0.11% )

       4.762153202 seconds time elapsed                                          ( +-  0.16% )

        19,667,208 cache-misses              #   42.460 % of all cache refs      ( +-  0.11% )
        46,319,111 cache-references                                              ( +-  0.30% )

       4.751449585 seconds time elapsed                                          ( +-  0.12% )

It's clear we're getting far more LLC references and misses. The increased LLC references are presumably from lower-level cache misses. All of the other counters are virtually identical between the runs.

A differential profile of cache-references shows that runtime.memclr goes from 0.75% to 27.82% (1st place) and runtime.mCentral_Grow goes from 0.34% to 2.70% (6th place). The rest of the profile is basically unchanged.

According to `perf record -g -e cache-references ./bench.after.fp -test.bench Binary`, 99% of the cache-references in memclr are from newobject -> mCache_Refill -> mCentral_CacheSpan -> mCentral_Grow -> mHeap_Alloc -> memclr.

Counting the number of mCentral_Grow calls reveals that this commit raised it roughly 6×, from 38,805 to 232,021. However, we're clearly satisfying these from the heap (not from the OS), because the max RSS is virtually unchanged at ~600MB.

@gopherbot (Contributor)

CL https://golang.org/cl/17745 mentions this issue.

@golang golang locked and limited conversation to collaborators Dec 14, 2016