Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: consider adding 24-byte size class #8885

Closed
dvyukov opened this issue Oct 7, 2014 · 28 comments
Closed

runtime: consider adding 24-byte size class #8885

dvyukov opened this issue Oct 7, 2014 · 28 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@dvyukov
Copy link
Member

dvyukov commented Oct 7, 2014

Currently we have only 16 and 32 byte size classes.
24-byte size class would reduce memory consumption by 25% for programs that extensively
use 24-byte objects (e.g. Node struct { left, right *Node; data int }).
Leaving aside other issues with vector instructions (16-byte alignment is not enforced
for stack, AVX 32-byte alignment is not enforced at all); 24-byte objects are hardly
used with SSE as they contains only one and a half of vector elements.

Here is a prototype CL:
https://golang.org/cl/128090043/

On 64-bits it reduces heap size by 6%, on 32-bits - by 12% on garbage benchmark.
@bradfitz
Copy link
Contributor

See useful discussion in dup #17633 too.

/cc @josharian @randall77 @aclements @dr2chase @RLH

@aclements aclements modified the milestones: Go1.10Early, Go1.9 May 18, 2017
@bradfitz bradfitz added early-in-cycle A change that should be done early in the 3 month dev cycle. and removed early-in-cycle A change that should be done early in the 3 month dev cycle. labels Jun 14, 2017
@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz
Copy link
Contributor

bradfitz commented Apr 3, 2018

Should we try this?

/cc @aclements @dvyukov @griesemer @josharian

@griesemer
Copy link
Contributor

Just a reference point: The package syntax AST nodes are all bigger than 24bytes (on 64bit platforms), except for one node (EmptyStmt), which virtually never occurs. All nodes have a 16byte minimal cost (position information), and most nodes have two fields in addition, or perhaps a single interface or string field which bumps up the size to at least 32bytes. For this package I think the sweet spot is actually higher, perhaps around 40 or 56 bytes.

I think @dvyukov's tree node example is misleading: It's unlikely in the real world for the payload to be just an int. I could see a string, but that would bump that example up to 32bytes. I think many light-weight (but large) data structures are also not so homogenous in terms of node types, and the way to define large graphs of non-homogenous node (types) in Go is to use interfaces. In other words, I suspect a node is more likely to look like type node interface{...} and concrete nodes are more likely of a form such as type xnode struct { left, right node; payload T } rather than what @dvyukov suggested. That is, such a node is already at 32bytes with the two node interfaces, and then there's the payload.

I'd make the experiment and add a 40byte size class. In the "best" case that could still reduce waste by ~20% (40bytes instead of 48bytes).

@griesemer
Copy link
Contributor

Didn't mean to close this.

@griesemer griesemer reopened this Apr 3, 2018
@bradfitz
Copy link
Contributor

bradfitz commented Apr 3, 2018

A slice is 24 bytes, though. I have to imagine that they're somewhat common. Actual numbers from running all.bash or something would be interesting, though.

For reference here are the current size classes: https://golang.org/src/runtime/sizeclasses.go

@griesemer
Copy link
Contributor

A slice rarely comes alone. Also, we don't usually have pointers to slices.

@josharian
Copy link
Contributor

Do you know what the 24byte objects are? (And perhaps more interesting, the 8byte objects?)

New mallocgc instrumentation:

  if size == 8 {
    if typ != nil {
      println("mallocgc", size, typ.string())
    } else {
      println("mallocgc", size, "??")
    }
  }

Top 10 for size == 8:

 35.20% 6070097 mallocgc 8 int64
 17.98% 3100649 mallocgc 8 *ssa.Value
 15.52% 2676905 mallocgc 8 ??
 15.02% 2590231 mallocgc 8 *gc.Node
  4.77% 821973 mallocgc 8 *types.Field
  2.87% 495187 mallocgc 8 [1]int
  2.87% 494969 mallocgc 8 [1]*gc.Node
  1.34% 230264 mallocgc 8 [1]*types.Field
  0.97% 167398 mallocgc 8 *dwarf.Var
  0.63% 108184 mallocgc 8 *ssa.Block

In my experience, allocation of [1]T types are the result of constructing a single-element slice to pass to a variadic function.

Top 10 for size == 24:

 29.36% 2159906 mallocgc 24 []*gc.Node
 22.87% 1682615 mallocgc 24 types.Struct
 15.68% 1153834 mallocgc 24 []*types.Field
  9.40% 691918 mallocgc 24 ssa.use
  5.38% 395983 mallocgc 24 ld.chain
  3.34% 245676 mallocgc 24 obj.Auto
  2.37% 174656 mallocgc 24 ??
  1.97% 144761 mallocgc 24 []uint8
  1.13% 83123 mallocgc 24 dwarf.byChildIndex
  0.90% 66389 mallocgc 24 ast.Comment

Of these, only ssa.use and ld.chain are linked lists.

@josharian
Copy link
Contributor

Can you weight it by total bytes allocated instead of bytes per allocation?

Not easily. But I tried a different experiment that hopefully will shed some light. I changed mksizeclasses.go to explicitly prevent the 32 byte size class:

diff --git a/src/runtime/mksizeclasses.go b/src/runtime/mksizeclasses.go
index b146dbcd6c..37d6b49be1 100644
--- a/src/runtime/mksizeclasses.go
+++ b/src/runtime/mksizeclasses.go
@@ -136,6 +136,9 @@ func makeClasses() []class {
                        classes[len(classes)-1].size = size
                        continue
                }
+               if size == 32 {
+                       continue
+               }
                classes = append(classes, class{size: size, npages: npages})
        }
 
@@ -157,7 +160,7 @@ func makeClasses() []class {
                }
        }
 
-       if len(classes) != 67 {
+       if len(classes) != 66 {
                panic("number of size classes has changed")
        }

and then regenerated sizeclasses.go and compared compiler performance. Results:

name        old time/op       new time/op       delta
Template          194ms ± 5%        192ms ± 3%  -0.82%  (p=0.021 n=50+46)
Unicode          84.3ms ± 4%       84.5ms ± 5%    ~     (p=0.515 n=49+50)
GoTypes           550ms ± 3%        551ms ± 4%    ~     (p=0.175 n=47+48)
Compiler          2.58s ± 3%        2.60s ± 3%    ~     (p=0.054 n=48+49)
SSA               5.94s ± 2%        5.97s ± 3%    ~     (p=0.092 n=50+49)
Flate             119ms ± 7%        118ms ± 4%    ~     (p=0.627 n=50+46)
GoParser          142ms ± 3%        142ms ± 4%    ~     (p=0.962 n=48+48)
Reflect           353ms ± 4%        357ms ± 4%  +1.18%  (p=0.006 n=48+46)
Tar               163ms ± 5%        167ms ± 4%  +2.18%  (p=0.000 n=48+49)
XML               203ms ± 5%        205ms ± 5%  +0.93%  (p=0.037 n=49+50)
[Geo mean]        350ms             351ms       +0.50%

name        old user-time/op  new user-time/op  delta
Template          243ms ± 4%        242ms ± 4%    ~     (p=0.184 n=49+48)
Unicode           108ms ± 8%        108ms ± 7%    ~     (p=0.775 n=50+50)
GoTypes           718ms ± 4%        723ms ± 3%  +0.61%  (p=0.029 n=48+48)
Compiler          3.40s ± 3%        3.42s ± 3%    ~     (p=0.119 n=49+48)
SSA               8.45s ± 3%        8.53s ± 3%  +1.00%  (p=0.000 n=50+46)
Flate             145ms ± 6%        144ms ± 5%    ~     (p=0.528 n=50+49)
GoParser          173ms ± 4%        173ms ± 4%    ~     (p=0.934 n=49+48)
Reflect           438ms ± 5%        446ms ± 6%  +1.78%  (p=0.001 n=48+49)
Tar               201ms ± 4%        204ms ± 5%  +1.56%  (p=0.000 n=47+48)
XML               250ms ± 4%        252ms ± 5%    ~     (p=0.068 n=49+50)
[Geo mean]        444ms             447ms       +0.52%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       40.0MB ± 0%  +2.55%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       29.3MB ± 0%  +0.69%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        118MB ± 0%  +2.51%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        506MB ± 0%  +2.09%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.43GB ± 0%  +1.80%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.6MB ± 0%  +2.25%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.8MB ± 0%  +2.53%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       79.5MB ± 0%  +3.34%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       40.2MB ± 0%  +2.37%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.7MB ± 0%  +1.99%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.8MB       +2.21%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         380k ± 0%  -1.25%  (p=0.008 n=5+5)
Unicode            336k ± 0%         324k ± 0%  -3.67%  (p=0.008 n=5+5)
GoTypes           1.20M ± 0%        1.18M ± 0%  -1.29%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.62M ± 0%  -1.29%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.5M ± 0%  -1.40%  (p=0.008 n=5+5)
Flate              237k ± 0%         234k ± 0%  -1.42%  (p=0.008 n=5+5)
GoParser           319k ± 0%         315k ± 0%  -1.28%  (p=0.008 n=5+5)
Reflect            959k ± 0%         948k ± 0%  -1.13%  (p=0.008 n=5+5)
Tar                392k ± 0%         388k ± 0%  -1.16%  (p=0.008 n=5+5)
XML                417k ± 0%         412k ± 0%  -1.36%  (p=0.016 n=4+5)
[Geo mean]         795k              782k       -1.53%

So not having a 32 byte size classes slows things down, but not by very much. Predictably, it increases the overall amount of allocation. Surprisingly, the number of allocs go down. I believe that this is because append rounds up to the nearest size class, so append-based growth happens more aggressively, thereby avoiding some allocations.

I did the same experiment eliminating the 48 byte size class (and restoring the 32 byte size class). Results:

name        old time/op       new time/op       delta
Template          192ms ± 5%        192ms ± 5%    ~     (p=0.456 n=50+50)
Unicode          83.7ms ± 4%       85.3ms ± 5%  +1.89%  (p=0.000 n=46+47)
GoTypes           596ms ± 4%        599ms ± 4%    ~     (p=0.052 n=48+48)
Compiler          2.85s ± 3%        2.87s ± 3%    ~     (p=0.188 n=50+50)
SSA               6.55s ± 2%        6.59s ± 2%  +0.51%  (p=0.001 n=48+48)
Flate             124ms ± 5%        125ms ± 6%    ~     (p=0.102 n=49+49)
GoParser          146ms ± 7%        144ms ± 5%  -1.20%  (p=0.020 n=49+48)
Reflect           355ms ± 5%        356ms ± 5%    ~     (p=0.849 n=48+49)
Tar               166ms ± 4%        166ms ± 6%    ~     (p=0.868 n=48+50)
XML               204ms ± 4%        205ms ± 4%    ~     (p=0.245 n=48+49)
[Geo mean]        362ms             364ms       +0.36%

name        old user-time/op  new user-time/op  delta
Template          242ms ± 5%        241ms ± 5%    ~     (p=0.149 n=50+50)
Unicode           108ms ± 6%        107ms ± 6%    ~     (p=0.082 n=45+49)
GoTypes           758ms ± 3%        762ms ± 4%    ~     (p=0.092 n=47+50)
Compiler          3.67s ± 3%        3.68s ± 3%    ~     (p=0.416 n=50+50)
SSA               9.11s ± 2%        9.11s ± 3%    ~     (p=0.845 n=49+50)
Flate             150ms ± 6%        151ms ± 7%    ~     (p=0.678 n=49+50)
GoParser          177ms ± 8%        175ms ± 5%  -1.13%  (p=0.042 n=50+47)
Reflect           442ms ± 5%        441ms ± 5%    ~     (p=0.388 n=49+49)
Tar               203ms ± 4%        203ms ± 6%    ~     (p=0.804 n=47+50)
XML               250ms ± 4%        249ms ± 4%    ~     (p=0.264 n=47+48)
[Geo mean]        457ms             456ms       -0.24%

name        old alloc/op      new alloc/op      delta
Template         39.0MB ± 0%       39.4MB ± 0%  +1.25%  (p=0.008 n=5+5)
Unicode          29.1MB ± 0%       30.0MB ± 0%  +3.24%  (p=0.008 n=5+5)
GoTypes           115MB ± 0%        117MB ± 0%  +1.15%  (p=0.008 n=5+5)
Compiler          496MB ± 0%        501MB ± 0%  +1.19%  (p=0.008 n=5+5)
SSA              1.40GB ± 0%       1.42GB ± 0%  +1.16%  (p=0.008 n=5+5)
Flate            25.0MB ± 0%       25.4MB ± 0%  +1.32%  (p=0.008 n=5+5)
GoParser         31.0MB ± 0%       31.3MB ± 0%  +1.12%  (p=0.008 n=5+5)
Reflect          77.0MB ± 0%       78.2MB ± 0%  +1.63%  (p=0.008 n=5+5)
Tar              39.3MB ± 0%       39.8MB ± 0%  +1.31%  (p=0.008 n=5+5)
XML              44.8MB ± 0%       45.3MB ± 0%  +1.13%  (p=0.008 n=5+5)
[Geo mean]       79.1MB            80.2MB       +1.45%

name        old allocs/op     new allocs/op     delta
Template           385k ± 0%         385k ± 0%  -0.02%  (p=0.008 n=5+5)
Unicode            336k ± 0%         336k ± 0%  -0.01%  (p=0.040 n=5+5)
GoTypes           1.20M ± 0%        1.20M ± 0%  -0.03%  (p=0.008 n=5+5)
Compiler          4.68M ± 0%        4.68M ± 0%  -0.02%  (p=0.008 n=5+5)
SSA               11.6M ± 0%        11.6M ± 0%  -0.04%  (p=0.008 n=5+5)
Flate              237k ± 0%         237k ± 0%  -0.02%  (p=0.016 n=5+5)
GoParser           319k ± 0%         319k ± 0%    ~     (p=0.333 n=5+5)
Reflect            959k ± 0%         959k ± 0%  -0.01%  (p=0.008 n=5+5)
Tar                393k ± 0%         392k ± 0%    ~     (p=0.222 n=5+5)
XML                417k ± 0%         417k ± 0%  -0.02%  (p=0.024 n=5+5)
[Geo mean]         795k              794k       -0.02%

Similar, but less impactful.

I'm not sure exactly what to make of these numbers. Is a 0.5% speed improvement significant enough? Not sure. But at least this gives us a rough feel.

@griesemer
Copy link
Contributor

Thanks @josharian, this is good information. Clearly, my intuition was wrong here. Regarding the 8byte allocations that might come from single-element slices for variadic functions: fmt.Printf comes to mind... (fmt.go for creating type name symbols?).

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 May 18, 2018
@gopherbot gopherbot modified the milestones: Go1.12, Unplanned May 23, 2018
@aclements aclements modified the milestones: Unplanned, Go1.15 Nov 19, 2019
@aclements
Copy link
Member

We'll consider doing this for 1.15

@martisch martisch self-assigned this Nov 19, 2019
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 19, 2019
@dmitshur
Copy link
Contributor

This issue is currently labeled as early-in-cycle for Go 1.15. That time is now, so friendly ping. If it no longer needs to be done early in cycle, that label can be removed.

@martisch martisch removed the early-in-cycle A change that should be done early in the 3 month dev cycle. label Feb 20, 2020
@martisch
Copy link
Contributor

Still working on the heapbitmap code. Likely ready in march, just not the first week the tree opens.

@martisch
Copy link
Contributor

martisch commented Mar 4, 2020

Unfortunately due to shifts in time commitments I wont be able to work on this anymore for the go1.15 cycle.

@martisch martisch removed their assignment Mar 4, 2020
@ianlancetaylor ianlancetaylor modified the milestones: Go1.15, Go1.16 May 18, 2020
@martisch martisch self-assigned this Jun 22, 2020
@martisch
Copy link
Contributor

Update: I have a ./all.bash passing CL that adds the new size class. Was a bit more complicated then I initially thought due to being the only size class (on amd64) above 8 bytes that isnt 16byte aligned, some outdated or to me misreadable documentation about checkmarks and I initally forgot that the size class also allows slices of 3 pointers so needs unrolling of the ptrmask into the heap bits.. Currently doing documentation and benchmarks. My current plan is to be finished before the tree opens for go1.16 so this can get in early.

@aclements
Copy link
Member

Neat! FYI, I already have a CL (not yet mailed) to move checkmarks out of the bitmap and was planning to send it for 1.16, precisely because they make everything with the bitmap annoying. :) I could send that now and you could rebase on top of it if that would be helpful?

@martisch
Copy link
Contributor

martisch commented Jul 13, 2020

Depending on if the remove checkmarks from heapbits CL gets more complex if the 24byte size class is merged first I think removing the checkmark from heapbits can go in first. It at least would make this CL easier by not needing to change the set and check checkmark functions. Some masks may need to be adjusted for heapBitsSetType but that should be trivial.

My complication was more around understanding when to set them or not from the existing code. The documentation and comments in heapBitsSetType seem to not align.

code not concerned if checkmark set or not set:

// 2-element slice of pointer.

seems checkmark bit actually set to 0 usually in heapBitsSetType. the actual setting doesnt matter since it seems to be cleared when initializing checkmarking:

// When not in checkmark mode, this bit is set to 1.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/242258 mentions this issue: runtime: add 24 byte allocation size class

@golang golang locked and limited conversation to collaborators Sep 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

10 participants