runtime: consider adding 24-byte size class #8885

Open
dvyukov opened this Issue Oct 7, 2014 · 20 comments

dvyukov (Member) commented Oct 7, 2014

Currently we have only 16- and 32-byte size classes.
A 24-byte size class would reduce memory consumption by 25% for programs that extensively
use 24-byte objects (e.g. type Node struct { left, right *Node; data int }).
Leaving aside other issues with vector instructions (16-byte alignment is not enforced
for the stack, and AVX 32-byte alignment is not enforced at all), 24-byte objects are hardly
ever used with SSE, since they contain only one and a half vector elements.

Here is a prototype CL:
https://golang.org/cl/128090043/

On 64-bit it reduces heap size by 6%, and on 32-bit by 12%, on the garbage benchmark.
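
As a point of reference (my addition, not part of the original report), the node size is easy to confirm with unsafe.Sizeof:

package main

import (
	"fmt"
	"unsafe"
)

// Node matches the example above: two pointers and an int,
// i.e. three 8-byte words on a 64-bit platform.
type Node struct {
	left, right *Node
	data        int
}

func main() {
	fmt.Println(unsafe.Sizeof(Node{})) // 24 on 64-bit
	// With only 16- and 32-byte classes, each heap-allocated Node
	// is rounded up to 32 bytes, wasting 8 bytes per object,
	// which is where the 25% figure above comes from.
}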

bradfitz (Member) commented Nov 20, 2016

See the useful discussion in the duplicate #17633 too.

/cc @josharian @randall77 @aclements @dr2chase @RLH

@aclements aclements modified the milestones: Go1.10Early, Go1.9 May 18, 2017

@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

bradfitz (Member) commented Apr 3, 2018

Should we try this?

/cc @aclements @dvyukov @griesemer @josharian

griesemer (Contributor) commented Apr 3, 2018

Just a reference point: the package syntax AST nodes are all bigger than 24 bytes (on 64-bit platforms), except for one node (EmptyStmt), which virtually never occurs. All nodes have a 16-byte minimum cost (position information), and most nodes have two additional fields, or perhaps a single interface or string field, which bumps the size up to at least 32 bytes. For this package I think the sweet spot is actually higher, perhaps around 40 or 56 bytes.

I think @dvyukov's tree node example is misleading: in the real world the payload is unlikely to be just an int. I could see a string, but that would bump the example up to 32 bytes. Many lightweight (but large) data structures are also not homogeneous in their node types, and the way to define large graphs of non-homogeneous node types in Go is to use interfaces. In other words, I suspect a node is more likely to look like type node interface{...}, with concrete nodes of a form such as type xnode struct { left, right node; payload T }, rather than what @dvyukov suggested. Such a node is already at 32 bytes with the two node interface values, and then there's the payload.

I'd run the experiment of adding a 40-byte size class. In the "best" case that could still reduce waste by ~20% (40 bytes instead of 48).
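
To make the arithmetic concrete (my sketch; the type shapes follow the description above, and the names are hypothetical):

package main

import (
	"fmt"
	"unsafe"
)

// node is the interface-based graph node described above.
type node interface{ isNode() }

// xnode carries two interface values (16 bytes each on 64-bit),
// so it is at 32 bytes before any payload is added.
type xnode struct {
	left, right node
	payload     int
}

func (xnode) isNode() {}

func main() {
	fmt.Println(unsafe.Sizeof(xnode{})) // 40 on 64-bit
}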

@griesemer griesemer closed this Apr 3, 2018

griesemer (Contributor) commented Apr 3, 2018

Didn't mean to close this.

@griesemer griesemer reopened this Apr 3, 2018

bradfitz (Member) commented Apr 3, 2018

A slice header is 24 bytes, though, and I have to imagine those are somewhat common. Actual numbers from running all.bash or something would be interesting.

For reference, here are the current size classes: https://golang.org/src/runtime/sizeclasses.go
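
For illustration (my addition): the 24 bytes are the slice header's three words, i.e. pointer, length, and capacity.

package main

import (
	"fmt"
	"unsafe"
)

func main() {
	var s []int
	// pointer + len + cap = 3 words = 24 bytes on 64-bit.
	fmt.Println(unsafe.Sizeof(s)) // 24
}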

griesemer (Contributor) commented Apr 3, 2018

A slice rarely comes alone. Also, we don't usually have pointers to slices.

josharian (Contributor) commented Apr 3, 2018

From #17633:

Here's one data point from the compiler for 1.9 about the prevalence of slice allocations: While compiling std+cmd, 3.42% of all allocations were []*gc.Node; 1.16% of allocations were []*gc.Field.

josharian (Contributor) commented Apr 3, 2018

@griesemer Pointers to slices occur when the slice escapes.
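
A minimal example of that (my illustration):

package main

import "fmt"

// grow returns a pointer to a slice, so the 24-byte slice header
// itself escapes to the heap (the compiler reports "moved to
// heap: s" under -gcflags=-m).
func grow() *[]int {
	s := make([]int, 0, 4)
	return &s
}

func main() {
	p := grow()
	*p = append(*p, 1)
	fmt.Println(*p)
}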

griesemer (Contributor) commented Apr 3, 2018

@josharian Yes, but is that a common scenario? Like millions of escaped slices?

bradfitz (Member) commented Apr 3, 2018

Another example is when a slice type implements an interface (e.g. sort.Interface) and a slice value needs to be boxed into an interface.
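
For instance (a made-up type, for illustration):

package main

import (
	"fmt"
	"sort"
)

// byLen is a slice type implementing sort.Interface.
type byLen []string

func (s byLen) Len() int           { return len(s) }
func (s byLen) Less(i, j int) bool { return len(s[i]) < len(s[j]) }
func (s byLen) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }

func main() {
	s := byLen{"go", "gopher", "vet"}
	// Passing s as a sort.Interface copies the 24-byte slice header
	// into the interface value, which heap-allocates it.
	sort.Sort(s)
	fmt.Println(s)
}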

griesemer (Contributor) commented Apr 3, 2018

@bradfitz Agreed. But unless we have millions of those, it just doesn't add up in the big picture. I think we need to look at large graph-like data structures, and there I think the typical node size is higher.

bradfitz (Member) commented Apr 3, 2018

Yeah, if the whole point of this is to save memory lost to internal fragmentation, it only really matters for structures that are retained for long-ish periods of time. It's not so obvious how to measure that in a realistic way.

Or, if there are just millions of short-lived ones, the benefit is that they'll be packed tighter and cause the GC to run less often, saving CPU. That's easier to measure. If that's 4.5% of total allocations in the compiler, it might show up in CPU usage numbers? Maybe.

dvyukov (Member) commented Apr 3, 2018

It's also strings, byte slices, and small structs in interfaces (I have a bunch of real examples, namely structs with a few ints, maybe a pointer; and if there is a point in reducing the size to 24 bytes, some ints can be made uint32/16/8). Also, 32-bit is not completely dead. At the time I measured it on a realistic benchmark:

On 64-bit it reduces heap size by 6%, and on 32-bit by 12%, on the garbage benchmark.

On our internal apps, small allocation sizes (16, 8, 32, 48) dominate.

josharian (Contributor) commented Apr 3, 2018

FWIW, I just instrumented mallocgc by adding println("mallocgc", size) at the top of it (a sketch of the patch follows the table). I then ran make.bash, which is mostly the compiler but also a little cmd/go and the linker. The head of the distribution:

 19.33% 17244825 mallocgc 8
  8.56% 7632691 mallocgc 128
  8.25% 7356719 mallocgc 24
  8.00% 7139293 mallocgc 16
  6.11% 5446505 mallocgc 32
  5.90% 5263621 mallocgc 88
  4.15% 3698890 mallocgc 48
  3.72% 3321218 mallocgc 56
  3.42% 3053224 mallocgc 40
  2.64% 2354769 mallocgc 64
  2.58% 2298379 mallocgc 232
  2.12% 1892513 mallocgc 0
  1.60% 1425121 mallocgc 4
  1.20% 1066084 mallocgc 144
  1.16% 1033258 mallocgc 112
  0.99% 885371 mallocgc 20
  0.74% 658981 mallocgc 6
  0.66% 589444 mallocgc 104
  0.63% 566173 mallocgc 9
  0.62% 550690 mallocgc 96
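
The patch was presumably along these lines (my reconstruction from the description above, not the actual diff):

--- a/src/runtime/malloc.go
+++ b/src/runtime/malloc.go
@@ func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer {
+	// Temporary instrumentation: print every allocation size;
+	// the output is then counted and sorted into the table above.
+	println("mallocgc", size)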

josharian (Contributor) commented Apr 3, 2018

Same experiment in 386 mode:

 20.47% 17877316 mallocgc 8
  8.48% 7401837 mallocgc 76
  7.68% 6711803 mallocgc 12
  6.35% 5543303 mallocgc 32
  5.93% 5180986 mallocgc 52
  5.70% 4980797 mallocgc 16
  5.06% 4421608 mallocgc 20
  4.71% 4113028 mallocgc 4
  2.98% 2599396 mallocgc 132
  2.86% 2496997 mallocgc 24
  2.70% 2354861 mallocgc 28
  2.15% 1877865 mallocgc 0
  1.58% 1377753 mallocgc 36
  1.57% 1374487 mallocgc 64
  0.98% 852984 mallocgc 48
  0.79% 692507 mallocgc 128
  0.77% 669959 mallocgc 6
  0.73% 633511 mallocgc 56
  0.72% 631136 mallocgc 68
  0.66% 578699 mallocgc 9

The relevant sizes are 20 bytes and 24 bytes (both currently round up to the 32-byte class, and both would fit in a 24-byte class).

So on 64-bit it's about 9.2% of allocations in total; on 32-bit it's about 7.9%.

bradfitz (Member) commented Apr 3, 2018

That's more than I expected, but still only ~60MB of total internal fragmentation from 24-byte allocations. Can you weight it by total bytes allocated instead of by bytes per allocation?

Or more interesting might be to weight it by the number of wasted bytes.
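
(Checking the arithmetic behind the ~60MB figure: the 7,356,719 24-byte allocations above each waste 32 − 24 = 8 bytes in the current 32-byte class, i.e. 7,356,719 × 8 ≈ 59 MB over the whole make.bash run.)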

griesemer (Contributor) commented Apr 3, 2018

@josharian Interesting. Do you know what the 24-byte objects are? (And, perhaps more interesting, the 8-byte objects?)

josharian (Contributor) commented Apr 4, 2018

Do you know what the 24-byte objects are? (And, perhaps more interesting, the 8-byte objects?)

New mallocgc instrumentation:

  if size == 8 {
    if typ != nil {
      println("mallocgc", size, typ.string())
    } else {
      println("mallocgc", size, "??")
    }
  }

Top 10 for size == 8:

 35.20% 6070097 mallocgc 8 int64
 17.98% 3100649 mallocgc 8 *ssa.Value
 15.52% 2676905 mallocgc 8 ??
 15.02% 2590231 mallocgc 8 *gc.Node
  4.77% 821973 mallocgc 8 *types.Field
  2.87% 495187 mallocgc 8 [1]int
  2.87% 494969 mallocgc 8 [1]*gc.Node
  1.34% 230264 mallocgc 8 [1]*types.Field
  0.97% 167398 mallocgc 8 *dwarf.Var
  0.63% 108184 mallocgc 8 *ssa.Block

In my experience, allocations of [1]T types are the result of constructing a single-element slice to pass to a variadic function.
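
A sketch of that pattern (hypothetical names):

package main

import "fmt"

var saved [][]int

// record is variadic over int. Because ns escapes (it is stored in
// a global), each call like record(42) heap-allocates a one-element
// backing array, which mallocgc sees as an 8-byte [1]int object.
func record(ns ...int) {
	saved = append(saved, ns)
}

func main() {
	record(42)
	fmt.Println(saved)
}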

Top 10 for size == 24:

 29.36% 2159906 mallocgc 24 []*gc.Node
 22.87% 1682615 mallocgc 24 types.Struct
 15.68% 1153834 mallocgc 24 []*types.Field
  9.40% 691918 mallocgc 24 ssa.use
  5.38% 395983 mallocgc 24 ld.chain
  3.34% 245676 mallocgc 24 obj.Auto
  2.37% 174656 mallocgc 24 ??
  1.97% 144761 mallocgc 24 []uint8
  1.13% 83123 mallocgc 24 dwarf.byChildIndex
  0.90% 66389 mallocgc 24 ast.Comment

Of these, only ssa.use and ld.chain are linked lists.

josharian (Contributor) commented Apr 4, 2018

Can you weight it by total bytes allocated instead of bytes per allocation?

Not easily. But I tried a different experiment that will hopefully shed some light. I changed mksizeclasses.go to explicitly drop the 32-byte size class:

diff --git a/src/runtime/mksizeclasses.go b/src/runtime/mksizeclasses.go
index b146dbcd6c..37d6b49be1 100644
--- a/src/runtime/mksizeclasses.go
+++ b/src/runtime/mksizeclasses.go
@@ -136,6 +136,9 @@ func makeClasses() []class {
                        classes[len(classes)-1].size = size
                        continue
                }
+               if size == 32 {
+                       continue
+               }
                classes = append(classes, class{size: size, npages: npages})
        }
 
@@ -157,7 +160,7 @@ func makeClasses() []class {
                }
        }
 
-       if len(classes) != 67 {
+       if len(classes) != 66 {
                panic("number of size classes has changed")
        }

and then regenerated sizeclasses.go and compared compiler performance. Results:

name        old time/op       new time/op       delta
Template          194ms ± 5%        192ms ± 3%  -0.82%  (p=0.021 n=50+46)
Unicode          84.3ms ± 4%       84.5ms ± 5%    ~     (p=0.515 n=49+50)
GoTypes           550ms ± 3%        551ms ± 4%    ~     (p=0.175 n=47+48)
Compiler          2.58s ± 3%        2.60s ± 3%    ~     (p=0.054 n=48+49)
SSA               5.94s ± 2%        5.97s ± 3%    ~     (p=0.092 n=50+49)
Flate             119ms ± 7%        118ms ± 4%    ~     (p=0.627 n=50+46)
GoParser          142ms ± 3%        142ms ± 4%    ~     (p=0.962 n=48+48)
Reflect           353ms ± 4%        357ms ± 4%  +1.18%  (p=0.006 n=48+46)
Tar               163ms ± 5%        167ms ± 4%  +2.18%  (p=0.000 n=48+49)
XML               203ms ± 5%        205ms ± 5%  +0.93%  (p=0.037 n=49+50)
[Geo mean]        350ms             351ms       +0.50%

name        old user-time/op  new user-time/op  delta
Template          243ms ± 4%        242ms ± 4%    ~     (p=0.184 n=49+48)
Unicode           108ms ± 8%        108ms ± 7%    ~     (p=0.775 n=50+50)
GoTypes           718ms ± 4%        723ms ± 3%  +0.61%  (p=0.029 n=48+48)
Compiler          3.40s ± 3%        3.42s ± 3%    ~     (p=0.119 n=49+48)
SSA               8.45s ± 3%        8.53s ± 3%  +1.00%  (p=0.000 n=50+46)
Flate             145ms ± 6%        144ms ± 5%    ~     (p=0.528 n=50+49)
GoParser          173ms ± 4%        173ms ± 4%    ~     (p=0.934 n=49+48)
Reflect           438ms ± 5%        446ms ± 6%  +1.78%  (p=0.001 n=48+49)
Tar               201ms ± 4%        204ms ± 5%  +1.56%  (p=0.000 n=47+48)
XML               250ms ± 4%        252ms ± 5%    ~     (p=0.068 n=49+50)
[Geo mean]        444ms             447ms       +0.52%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       40.0MB ± 0%  +2.55%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       29.3MB ± 0%  +0.69%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        118MB ± 0%  +2.51%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        506MB ± 0%  +2.09%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.43GB ± 0%  +1.80%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.6MB ± 0%  +2.25%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.8MB ± 0%  +2.53%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       79.5MB ± 0%  +3.34%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       40.2MB ± 0%  +2.37%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.7MB ± 0%  +1.99%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.8MB       +2.21%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         380k ± 0%  -1.25%  (p=0.008 n=5+5)
Unicode            336k ± 0%         324k ± 0%  -3.67%  (p=0.008 n=5+5)
GoTypes           1.20M ± 0%        1.18M ± 0%  -1.29%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.62M ± 0%  -1.29%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.5M ± 0%  -1.40%  (p=0.008 n=5+5)
Flate              237k ± 0%         234k ± 0%  -1.42%  (p=0.008 n=5+5)
GoParser           319k ± 0%         315k ± 0%  -1.28%  (p=0.008 n=5+5)
Reflect            959k ± 0%         948k ± 0%  -1.13%  (p=0.008 n=5+5)
Tar                392k ± 0%         388k ± 0%  -1.16%  (p=0.008 n=5+5)
XML                417k ± 0%         412k ± 0%  -1.36%  (p=0.016 n=4+5)
[Geo mean]         795k              782k       -1.53%

So not having a 32-byte size class slows things down, but not by very much. Predictably, it increases the overall amount of memory allocated. Surprisingly, the number of allocations goes down. I believe this is because append rounds the requested capacity up to the nearest size class, so append-based growth happens more aggressively, thereby avoiding some allocations.
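
That rounding is observable from the resulting capacities (a small demonstration, added here; exact numbers can vary across Go versions):

package main

import "fmt"

func main() {
	var s []byte
	s = append(s, 'x')  // asks for capacity 1...
	fmt.Println(cap(s)) // ...but gets 8, the nearest size class.
	// Bigger size classes give append more slack per growth step,
	// so fewer growth allocations are needed overall.
}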

I did the same experiment eliminating the 48-byte size class (and restoring the 32-byte one). Results:

name        old time/op       new time/op       delta
Template          192ms ± 5%        192ms ± 5%    ~     (p=0.456 n=50+50)
Unicode          83.7ms ± 4%       85.3ms ± 5%  +1.89%  (p=0.000 n=46+47)
GoTypes           596ms ± 4%        599ms ± 4%    ~     (p=0.052 n=48+48)
Compiler          2.85s ± 3%        2.87s ± 3%    ~     (p=0.188 n=50+50)
SSA               6.55s ± 2%        6.59s ± 2%  +0.51%  (p=0.001 n=48+48)
Flate             124ms ± 5%        125ms ± 6%    ~     (p=0.102 n=49+49)
GoParser          146ms ± 7%        144ms ± 5%  -1.20%  (p=0.020 n=49+48)
Reflect           355ms ± 5%        356ms ± 5%    ~     (p=0.849 n=48+49)
Tar               166ms ± 4%        166ms ± 6%    ~     (p=0.868 n=48+50)
XML               204ms ± 4%        205ms ± 4%    ~     (p=0.245 n=48+49)
[Geo mean]        362ms             364ms       +0.36%

name        old user-time/op  new user-time/op  delta
Template          242ms ± 5%        241ms ± 5%    ~     (p=0.149 n=50+50)
Unicode           108ms ± 6%        107ms ± 6%    ~     (p=0.082 n=45+49)
GoTypes           758ms ± 3%        762ms ± 4%    ~     (p=0.092 n=47+50)
Compiler          3.67s ± 3%        3.68s ± 3%    ~     (p=0.416 n=50+50)
SSA               9.11s ± 2%        9.11s ± 3%    ~     (p=0.845 n=49+50)
Flate             150ms ± 6%        151ms ± 7%    ~     (p=0.678 n=49+50)
GoParser          177ms ± 8%        175ms ± 5%  -1.13%  (p=0.042 n=50+47)
Reflect           442ms ± 5%        441ms ± 5%    ~     (p=0.388 n=49+49)
Tar               203ms ± 4%        203ms ± 6%    ~     (p=0.804 n=47+50)
XML               250ms ± 4%        249ms ± 4%    ~     (p=0.264 n=47+48)
[Geo mean]        457ms             456ms       -0.24%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       39.4MB ± 0%  +1.25%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       30.0MB ± 0%  +3.24%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        117MB ± 0%  +1.15%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        501MB ± 0%  +1.19%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.42GB ± 0%  +1.16%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.4MB ± 0%  +1.32%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.3MB ± 0%  +1.12%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       78.2MB ± 0%  +1.63%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       39.8MB ± 0%  +1.31%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.3MB ± 0%  +1.13%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.2MB       +1.45%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         385k ± 0%  -0.02%  (p=0.008 n=5+5)
Unicode            336k ± 0%         336k ± 0%  -0.01%  (p=0.040 n=5+5)
GoTypes           1.20M ± 0%        1.20M ± 0%  -0.03%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.68M ± 0%  -0.02%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.6M ± 0%  -0.04%  (p=0.008 n=5+5)
Flate              237k ± 0%         237k ± 0%  -0.02%  (p=0.016 n=5+5)
GoParser           319k ± 0%         319k ± 0%    ~     (p=0.333 n=5+5)
Reflect            959k ± 0%         959k ± 0%  -0.01%  (p=0.008 n=5+5)
Tar                393k ± 0%         392k ± 0%    ~     (p=0.222 n=5+5)
XML                417k ± 0%         417k ± 0%  -0.02%  (p=0.024 n=5+5)
[Geo mean]         795k              794k       -0.02%

Similar, but less impactful.

I'm not sure exactly what to make of these numbers. Is a 0.5% speed improvement significant enough? Not sure. But at least this gives us a rough feel.

griesemer (Contributor) commented Apr 4, 2018

Thanks @josharian, this is good information. Clearly, my intuition was wrong here. Regarding the 8-byte allocations that might come from single-element slices for variadic functions: fmt.Printf comes to mind... (fmt.go for creating type name symbols?).

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018

@gopherbot gopherbot modified the milestones: Go1.12, Unplanned May 23, 2018
