
runtime: consider adding 24-byte size class #8885

Closed · dvyukov opened this issue Oct 7, 2014 · 28 comments

@dvyukov (Member) commented Oct 7, 2014

Currently we have only 16- and 32-byte size classes.
A 24-byte size class would reduce memory consumption by 25% for programs that extensively
use 24-byte objects (e.g. type Node struct { left, right *Node; data int }).
Setting aside other issues with vector instructions (16-byte alignment is not enforced
for the stack, and AVX 32-byte alignment is not enforced at all), 24-byte objects are hardly
used with SSE anyway, as they contain only one and a half vector elements.

Here is a prototype CL:
https://golang.org/cl/128090043/

On 64-bit it reduces heap size by 6%, and on 32-bit by 12%, on the garbage benchmark.
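
As a quick sketch of the arithmetic (assuming a 64-bit platform): the example Node occupies 24 bytes, but with only 16- and 32-byte classes it is served from the 32-byte class, so 8 of every 32 bytes (25%) are wasted.

  package main

  import (
    "fmt"
    "unsafe"
  )

  // Node is the example from above: two pointers plus an int.
  type Node struct {
    left, right *Node
    data        int
  }

  func main() {
    size := unsafe.Sizeof(Node{})
    fmt.Println(size)                  // 24 on 64-bit platforms
    fmt.Println(float64(32-size) / 32) // 0.25: fraction of a 32-byte slot wasted
  }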

@bradfitz (Contributor) commented Nov 20, 2016

See useful discussion in dup #17633 too.

/cc @josharian @randall77 @aclements @dr2chase @RLH

@aclements aclements modified the milestones: Go1.10Early, Go1.9 May 18, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@bradfitz (Contributor) commented Apr 3, 2018

Should we try this?

/cc @aclements @dvyukov @griesemer @josharian

@griesemer (Contributor) commented Apr 3, 2018

Just a reference point: the package syntax AST nodes are all bigger than 24 bytes (on 64-bit platforms), except for one node (EmptyStmt), which virtually never occurs. All nodes have a 16-byte minimal cost (position information), and most nodes have two fields in addition, or perhaps a single interface or string field, which bumps the size up to at least 32 bytes. For this package I think the sweet spot is actually higher, perhaps around 40 or 56 bytes.

I think @dvyukov's tree node example is misleading: it's unlikely in the real world for the payload to be just an int. I could see a string, but that would bump the example up to 32 bytes. I think many light-weight (but large) data structures are also not so homogeneous in terms of node types, and the way to define large graphs of non-homogeneous node types in Go is to use interfaces. In other words, I suspect a node is more likely to look like type node interface{...}, and concrete nodes are more likely of a form such as type xnode struct { left, right node; payload T } than what @dvyukov suggested. That is, such a node is already at 32 bytes with the two node interfaces, and then there's the payload.

I'd run the experiment and add a 40-byte size class. In the "best" case that could still reduce waste by ~20% (40 bytes instead of 48 bytes).

@griesemer griesemer closed this Apr 3, 2018

@griesemer (Contributor) commented Apr 3, 2018

Didn't mean to close this.

@griesemer griesemer reopened this Apr 3, 2018

@bradfitz (Contributor) commented Apr 3, 2018

A slice header is 24 bytes, though, and I have to imagine those are somewhat common. Actual numbers from running all.bash or something would be interesting.

For reference here are the current size classes: https://golang.org/src/runtime/sizeclasses.go
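
As a sketch (64-bit assumed; exact escape decisions depend on the compiler version and can be checked with -gcflags='-m'), here are two cases where the 24-byte slice header itself ends up as its own heap allocation:

  package main

  var sink interface{}

  func main() {
    p := new([]int) // the 24-byte slice header itself is heap-allocated, since p escapes via sink
    sink = p

    sink = []byte("hi") // boxing a slice in an interface also heap-allocates a copy of its 24-byte header
  }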

@griesemer (Contributor) commented Apr 3, 2018

A slice rarely comes alone. Also, we don't usually have pointers to slices.

@josharian (Contributor) commented Apr 3, 2018

FWIW, I just instrumented mallocgc by adding println("mallocgc", size) at the top of it. I then ran make.bash, which is mostly the compiler but also a little cmd/go and linker. The head of the distribution:

 19.33% 17244825 mallocgc 8
  8.56% 7632691 mallocgc 128
  8.25% 7356719 mallocgc 24
  8.00% 7139293 mallocgc 16
  6.11% 5446505 mallocgc 32
  5.90% 5263621 mallocgc 88
  4.15% 3698890 mallocgc 48
  3.72% 3321218 mallocgc 56
  3.42% 3053224 mallocgc 40
  2.64% 2354769 mallocgc 64
  2.58% 2298379 mallocgc 232
  2.12% 1892513 mallocgc 0
  1.60% 1425121 mallocgc 4
  1.20% 1066084 mallocgc 144
  1.16% 1033258 mallocgc 112
  0.99% 885371 mallocgc 20
  0.74% 658981 mallocgc 6
  0.66% 589444 mallocgc 104
  0.63% 566173 mallocgc 9
  0.62% 550690 mallocgc 96
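
(One way to aggregate the raw "mallocgc <size>" lines into a histogram like the one above, as a rough sketch; the exact pipeline used isn't shown here. Pipe the captured build output through it.)

  package main

  import (
    "bufio"
    "fmt"
    "os"
    "sort"
    "strings"
  )

  func main() {
    counts := map[string]int{}
    total := 0
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
      if line := sc.Text(); strings.HasPrefix(line, "mallocgc ") {
        counts[line]++
        total++
      }
    }
    type entry struct {
      line string
      n    int
    }
    var all []entry
    for l, n := range counts {
      all = append(all, entry{l, n})
    }
    // Sort by count, descending, and print the head of the distribution.
    sort.Slice(all, func(i, j int) bool { return all[i].n > all[j].n })
    if len(all) > 20 {
      all = all[:20]
    }
    for _, e := range all {
      fmt.Printf("%6.2f%% %d %s\n", 100*float64(e.n)/float64(total), e.n, e.line)
    }
  }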

@josharian (Contributor) commented Apr 3, 2018

Same experiment in 386 mode:

 20.47% 17877316 mallocgc 8
  8.48% 7401837 mallocgc 76
  7.68% 6711803 mallocgc 12
  6.35% 5543303 mallocgc 32
  5.93% 5180986 mallocgc 52
  5.70% 4980797 mallocgc 16
  5.06% 4421608 mallocgc 20
  4.71% 4113028 mallocgc 4
  2.98% 2599396 mallocgc 132
  2.86% 2496997 mallocgc 24
  2.70% 2354861 mallocgc 28
  2.15% 1877865 mallocgc 0
  1.58% 1377753 mallocgc 36
  1.57% 1374487 mallocgc 64
  0.98% 852984 mallocgc 48
  0.79% 692507 mallocgc 128
  0.77% 669959 mallocgc 6
  0.73% 633511 mallocgc 56
  0.72% 631136 mallocgc 68
  0.66% 578699 mallocgc 9

The relevant sizes are 20 bytes and 24 bytes, since both would be served by a 24-byte class.

So for 64-bit that's about 9.2% of allocations in total (0.99% + 8.25%); for 32-bit it's about 7.9% (5.06% + 2.86%).

@bradfitz (Contributor) commented Apr 3, 2018

That's more than I expected, but still only ~60MB of total internal fragmentation from 24-byte allocations. Can you weight it by total bytes allocated instead of bytes per allocation?

Or, more interestingly, weight it by the number of wasted bytes.

@griesemer (Contributor) commented Apr 3, 2018

@josharian Interesting. Do you know what the 24-byte objects are? (And perhaps more interestingly, the 8-byte objects?)

@josharian (Contributor) commented Apr 4, 2018

Do you know what the 24-byte objects are? (And perhaps more interestingly, the 8-byte objects?)

New mallocgc instrumentation:

  if size == 8 {
    if typ != nil {
      println("mallocgc", size, typ.string())
    } else {
      println("mallocgc", size, "??")
    }
  }

Top 10 for size == 8:

 35.20% 6070097 mallocgc 8 int64
 17.98% 3100649 mallocgc 8 *ssa.Value
 15.52% 2676905 mallocgc 8 ??
 15.02% 2590231 mallocgc 8 *gc.Node
  4.77% 821973 mallocgc 8 *types.Field
  2.87% 495187 mallocgc 8 [1]int
  2.87% 494969 mallocgc 8 [1]*gc.Node
  1.34% 230264 mallocgc 8 [1]*types.Field
  0.97% 167398 mallocgc 8 *dwarf.Var
  0.63% 108184 mallocgc 8 *ssa.Block

In my experience, allocations of [1]T types are the result of constructing a single-element slice to pass to a variadic function.
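
A minimal sketch of that pattern (hypothetical names): passing a single argument to a variadic function builds an implicit one-element slice, and its [1]T backing array is heap-allocated when it escapes.

  package main

  type Node struct{ left, right *Node }

  var sink []*Node

  func keep(ns ...*Node) { sink = ns } // ns escapes, so its backing array does too

  func main() {
    n := &Node{}
    keep(n) // allocates a [1]*Node (8 bytes on 64-bit) to back the implicit ...*Node slice
  }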

Top 10 for size == 24:

 29.36% 2159906 mallocgc 24 []*gc.Node
 22.87% 1682615 mallocgc 24 types.Struct
 15.68% 1153834 mallocgc 24 []*types.Field
  9.40% 691918 mallocgc 24 ssa.use
  5.38% 395983 mallocgc 24 ld.chain
  3.34% 245676 mallocgc 24 obj.Auto
  2.37% 174656 mallocgc 24 ??
  1.97% 144761 mallocgc 24 []uint8
  1.13% 83123 mallocgc 24 dwarf.byChildIndex
  0.90% 66389 mallocgc 24 ast.Comment

Of these, only ssa.use and ld.chain are linked lists.

@josharian (Contributor) commented Apr 4, 2018

Can you weight it by total bytes allocated instead of bytes per allocation?

Not easily. But I tried a different experiment that will hopefully shed some light. I changed mksizeclasses.go to explicitly exclude the 32-byte size class:

diff --git a/src/runtime/mksizeclasses.go b/src/runtime/mksizeclasses.go
index b146dbcd6c..37d6b49be1 100644
--- a/src/runtime/mksizeclasses.go
+++ b/src/runtime/mksizeclasses.go
@@ -136,6 +136,9 @@ func makeClasses() []class {
                        classes[len(classes)-1].size = size
                        continue
                }
+               if size == 32 {
+                       continue
+               }
                classes = append(classes, class{size: size, npages: npages})
        }
 
@@ -157,7 +160,7 @@ func makeClasses() []class {
                }
        }
 
-       if len(classes) != 67 {
+       if len(classes) != 66 {
                panic("number of size classes has changed")
        }

and then regenerated sizeclasses.go and compared compiler performance. Results:

name        old time/op       new time/op       delta
Template          194ms ± 5%        192ms ± 3%  -0.82%  (p=0.021 n=50+46)
Unicode          84.3ms ± 4%       84.5ms ± 5%    ~     (p=0.515 n=49+50)
GoTypes           550ms ± 3%        551ms ± 4%    ~     (p=0.175 n=47+48)
Compiler          2.58s ± 3%        2.60s ± 3%    ~     (p=0.054 n=48+49)
SSA               5.94s ± 2%        5.97s ± 3%    ~     (p=0.092 n=50+49)
Flate             119ms ± 7%        118ms ± 4%    ~     (p=0.627 n=50+46)
GoParser          142ms ± 3%        142ms ± 4%    ~     (p=0.962 n=48+48)
Reflect           353ms ± 4%        357ms ± 4%  +1.18%  (p=0.006 n=48+46)
Tar               163ms ± 5%        167ms ± 4%  +2.18%  (p=0.000 n=48+49)
XML               203ms ± 5%        205ms ± 5%  +0.93%  (p=0.037 n=49+50)
[Geo mean]        350ms             351ms       +0.50%

name        old user-time/op  new user-time/op  delta
Template          243ms ± 4%        242ms ± 4%    ~     (p=0.184 n=49+48)
Unicode           108ms ± 8%        108ms ± 7%    ~     (p=0.775 n=50+50)
GoTypes           718ms ± 4%        723ms ± 3%  +0.61%  (p=0.029 n=48+48)
Compiler          3.40s ± 3%        3.42s ± 3%    ~     (p=0.119 n=49+48)
SSA               8.45s ± 3%        8.53s ± 3%  +1.00%  (p=0.000 n=50+46)
Flate             145ms ± 6%        144ms ± 5%    ~     (p=0.528 n=50+49)
GoParser          173ms ± 4%        173ms ± 4%    ~     (p=0.934 n=49+48)
Reflect           438ms ± 5%        446ms ± 6%  +1.78%  (p=0.001 n=48+49)
Tar               201ms ± 4%        204ms ± 5%  +1.56%  (p=0.000 n=47+48)
XML               250ms ± 4%        252ms ± 5%    ~     (p=0.068 n=49+50)
[Geo mean]        444ms             447ms       +0.52%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       40.0MB ± 0%  +2.55%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       29.3MB ± 0%  +0.69%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        118MB ± 0%  +2.51%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        506MB ± 0%  +2.09%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.43GB ± 0%  +1.80%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.6MB ± 0%  +2.25%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.8MB ± 0%  +2.53%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       79.5MB ± 0%  +3.34%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       40.2MB ± 0%  +2.37%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.7MB ± 0%  +1.99%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.8MB       +2.21%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         380k ± 0%  -1.25%  (p=0.008 n=5+5)
Unicode            336k ± 0%         324k ± 0%  -3.67%  (p=0.008 n=5+5)
GoTypes           1.20M ± 0%        1.18M ± 0%  -1.29%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.62M ± 0%  -1.29%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.5M ± 0%  -1.40%  (p=0.008 n=5+5)
Flate              237k ± 0%         234k ± 0%  -1.42%  (p=0.008 n=5+5)
GoParser           319k ± 0%         315k ± 0%  -1.28%  (p=0.008 n=5+5)
Reflect            959k ± 0%         948k ± 0%  -1.13%  (p=0.008 n=5+5)
Tar                392k ± 0%         388k ± 0%  -1.16%  (p=0.008 n=5+5)
XML                417k ± 0%         412k ± 0%  -1.36%  (p=0.016 n=4+5)
[Geo mean]         795k              782k       -1.53%

So not having a 32-byte size class slows things down, but not by very much. Predictably, it increases the overall amount of allocation. Surprisingly, the number of allocs goes down. I believe this is because append rounds the new capacity up to the nearest size class, so append-based growth happens more aggressively, thereby avoiding some allocations.
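
A rough illustration of that effect (exact capacities depend on the Go version's growth policy and size-class table): append rounds the grown slice's byte size up to a size class, so with the 32-byte class removed, a slice that would have grown into a 32-byte backing array grows straight into a 48-byte one, and subsequent appends need one fewer reallocation to reach the same length.

  package main

  import "fmt"

  func main() {
    var s []int64
    prev := 0
    for i := 0; i < 8; i++ {
      s = append(s, int64(i))
      if cap(s) != prev { // a reallocation happened; the new cap reflects the size class chosen
        fmt.Println("len", len(s), "-> cap", cap(s))
        prev = cap(s)
      }
    }
  }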

I did the same experiment eliminating the 48-byte size class (and restoring the 32-byte size class). Results:

name        old time/op       new time/op       delta
Template          192ms ± 5%        192ms ± 5%    ~     (p=0.456 n=50+50)
Unicode          83.7ms ± 4%       85.3ms ± 5%  +1.89%  (p=0.000 n=46+47)
GoTypes           596ms ± 4%        599ms ± 4%    ~     (p=0.052 n=48+48)
Compiler          2.85s ± 3%        2.87s ± 3%    ~     (p=0.188 n=50+50)
SSA               6.55s ± 2%        6.59s ± 2%  +0.51%  (p=0.001 n=48+48)
Flate             124ms ± 5%        125ms ± 6%    ~     (p=0.102 n=49+49)
GoParser          146ms ± 7%        144ms ± 5%  -1.20%  (p=0.020 n=49+48)
Reflect           355ms ± 5%        356ms ± 5%    ~     (p=0.849 n=48+49)
Tar               166ms ± 4%        166ms ± 6%    ~     (p=0.868 n=48+50)
XML               204ms ± 4%        205ms ± 4%    ~     (p=0.245 n=48+49)
[Geo mean]        362ms             364ms       +0.36%

name        old user-time/op  new user-time/op  delta
Template          242ms ± 5%        241ms ± 5%    ~     (p=0.149 n=50+50)
Unicode           108ms ± 6%        107ms ± 6%    ~     (p=0.082 n=45+49)
GoTypes           758ms ± 3%        762ms ± 4%    ~     (p=0.092 n=47+50)
Compiler          3.67s ± 3%        3.68s ± 3%    ~     (p=0.416 n=50+50)
SSA               9.11s ± 2%        9.11s ± 3%    ~     (p=0.845 n=49+50)
Flate             150ms ± 6%        151ms ± 7%    ~     (p=0.678 n=49+50)
GoParser          177ms ± 8%        175ms ± 5%  -1.13%  (p=0.042 n=50+47)
Reflect           442ms ± 5%        441ms ± 5%    ~     (p=0.388 n=49+49)
Tar               203ms ± 4%        203ms ± 6%    ~     (p=0.804 n=47+50)
XML               250ms ± 4%        249ms ± 4%    ~     (p=0.264 n=47+48)
[Geo mean]        457ms             456ms       -0.24%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       39.4MB ± 0%  +1.25%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       30.0MB ± 0%  +3.24%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        117MB ± 0%  +1.15%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        501MB ± 0%  +1.19%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.42GB ± 0%  +1.16%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.4MB ± 0%  +1.32%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.3MB ± 0%  +1.12%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       78.2MB ± 0%  +1.63%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       39.8MB ± 0%  +1.31%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.3MB ± 0%  +1.13%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.2MB       +1.45%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         385k ± 0%  -0.02%  (p=0.008 n=5+5)
Unicode            336k ± 0%         336k ± 0%  -0.01%  (p=0.040 n=5+5)
GoTypes           1.20M ± 0%        1.20M ± 0%  -0.03%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.68M ± 0%  -0.02%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.6M ± 0%  -0.04%  (p=0.008 n=5+5)
Flate              237k ± 0%         237k ± 0%  -0.02%  (p=0.016 n=5+5)
GoParser           319k ± 0%         319k ± 0%    ~     (p=0.333 n=5+5)
Reflect            959k ± 0%         959k ± 0%  -0.01%  (p=0.008 n=5+5)
Tar                393k ± 0%         392k ± 0%    ~     (p=0.222 n=5+5)
XML                417k ± 0%         417k ± 0%  -0.02%  (p=0.024 n=5+5)
[Geo mean]         795k              794k       -0.02%

Similar, but less impactful.

I'm not sure exactly what to make of these numbers. Is a 0.5% speed improvement significant enough? Not sure. But at least this gives us a rough feel.

@griesemer (Contributor) commented Apr 4, 2018

Thanks @josharian, this is good information. Clearly, my intuition was wrong here. Regarding the 8-byte allocations that might come from single-element slices for variadic functions: fmt.Printf comes to mind... (fmt.go for creating type name symbols?).

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@gopherbot gopherbot modified the milestones: Go1.12, Unplanned May 23, 2018
@aclements aclements modified the milestones: Unplanned, Go1.15 Nov 19, 2019

@aclements (Member) commented Nov 19, 2019

We'll consider doing this for 1.15.

@dmitshur (Member) commented Feb 19, 2020

This issue is currently labeled as early-in-cycle for Go 1.15. That time is now, so friendly ping. If it no longer needs to be done early in cycle, that label can be removed.

@martisch (Contributor) commented Feb 20, 2020

Still working on the heap bitmap code. It will likely be ready in March, just not in the first week after the tree opens.

@martisch (Contributor) commented Mar 4, 2020

Unfortunately, due to shifts in time commitments, I won't be able to work on this anymore for the Go 1.15 cycle.

@martisch martisch removed their assignment Mar 4, 2020
@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.16 May 18, 2020
@martisch martisch self-assigned this Jun 22, 2020

@martisch (Contributor) commented Jul 13, 2020

Update: I have an all.bash-passing CL that adds the new size class. It was a bit more complicated than I initially thought: it is the only size class (on amd64) above 8 bytes that isn't 16-byte aligned, some of the documentation about checkmarks is outdated or (to me) easy to misread, and I initially forgot that the size class also admits slices of 3 pointers, so the ptrmask needs to be unrolled into the heap bits. I'm currently working on documentation and benchmarks. My current plan is to be finished before the tree opens for Go 1.16 so this can get in early.
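
For illustration (64-bit assumed), here are three layouts that all land in a 24-byte class but have different pointer patterns, which is why the ptrmask has to be unrolled into the heap bits rather than assuming one fixed pattern:

  package main

  import (
    "fmt"
    "unsafe"
  )

  func main() {
    fmt.Println(unsafe.Sizeof([]int(nil))) // 24: slice header (ptr, len, cap) -> pointer, scalar, scalar
    fmt.Println(unsafe.Sizeof([3]*int{}))  // 24: backing array for a slice of 3 pointers -> pointer, pointer, pointer
    fmt.Println(unsafe.Sizeof([24]byte{})) // 24: no pointers at all
  }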

@aclements (Member) commented Jul 13, 2020

Neat! FYI, I already have a CL (not yet mailed) to move checkmarks out of the bitmap and was planning to send it for 1.16, precisely because they make everything with the bitmap annoying. :) I could send that now and you could rebase on top of it if that would be helpful?

@martisch (Contributor) commented Jul 13, 2020

Unless the CL that removes checkmarks from the heap bits gets more complex when the 24-byte size class is merged first, I think removing the checkmarks from the heap bits can go in first. It would at least make this CL easier by not needing to change the checkmark set and check functions. Some masks in heapBitsSetType may need to be adjusted, but that should be trivial.

My complication was more around understanding, from the existing code, when to set the checkmark bit and when not to. The documentation and the comments in heapBitsSetType don't seem to align.

For example, this code path doesn't seem concerned with whether the checkmark bit is set or not:

// 2-element slice of pointer.

whereas the checkmark bit actually seems to usually be set to 0 in heapBitsSetType. The actual setting doesn't matter, since it seems to be cleared when checkmarking is initialized; yet the documentation says:

// When not in checkmark mode, this bit is set to 1.

@gopherbot commented Jul 13, 2020

Change https://golang.org/cl/242258 mentions this issue: runtime: add 24 byte allocation size class

@gopherbot gopherbot closed this in 14c7caa Sep 14, 2020