runtime: don't allocate when putting a bool into an interface #17725

Closed
bradfitz opened this Issue Nov 1, 2016 · 32 comments

Comments

Projects
None yet
@bradfitz
Owner

bradfitz commented Nov 1, 2016

Shoving a bool into an interface probably doesn't need to allocate. I imagine most binaries already have a static 1 byte and a static 0 byte somewhere whose address we could use in the interface's data pointer.

bradfitz@gdev:~$ cat alloc.go
package main

import "testing"

func main() {
        var x interface{}
        x = true
        println(x)
        x = false
        println(x)
        x = true
        println(x)
        
        println(testing.AllocsPerRun(5000, func() {
                x = true
        }))
}

bradfitz@gdev:~$ go run ~/alloc.go
(0x49b960,0xc42003bf4f)
(0x49b960,0xc42003bf4e)
(0x49b960,0xc42003bf4d)
+1.000000e+000

/cc @josharian @randall77 @mdempsky

@bradfitz bradfitz added the Performance label Nov 1, 2016

@bradfitz bradfitz added this to the Unplanned milestone Nov 1, 2016

@bradfitz bradfitz changed the title from cmd/compile: to cmd/compile: don't allocate when putting a bool into an interface Nov 1, 2016

@mdempsky

This comment has been minimized.

Show comment Hide comment
@mdempsky

mdempsky Nov 1, 2016

Member

Relatedly, I've been thinking that converting len-1 []byte to string shouldn't need to allocate either.

Member

mdempsky commented Nov 1, 2016

Relatedly, I've been thinking that converting len-1 []byte to string shouldn't need to allocate either.

@mdempsky mdempsky changed the title from cmd/compile: don't allocate when putting a bool into an interface to runtime: don't allocate when putting a bool into an interface Nov 1, 2016

@mdempsky

This comment has been minimized.

Show comment Hide comment
@mdempsky

mdempsky Nov 1, 2016

Member

This can already be done within the runtime in convT2I. I don't believe it needs compiler assistance.

Member

mdempsky commented Nov 1, 2016

This can already be done within the runtime in convT2I. I don't believe it needs compiler assistance.

@dr2chase

This comment has been minimized.

Show comment Hide comment
@dr2chase

dr2chase Nov 1, 2016

Contributor

How do we feel about smallish integers? Seems like we could profile in convT2I to see how this would pay off. One possible value for smallish is {2,1,0,-1}

Contributor

dr2chase commented Nov 1, 2016

How do we feel about smallish integers? Seems like we could profile in convT2I to see how this would pay off. One possible value for smallish is {2,1,0,-1}

@bradfitz

This comment has been minimized.

Show comment Hide comment
@bradfitz

bradfitz Nov 1, 2016

Owner

I feel like last time smallish integers was brought up, the concern was that it would be inconsistent and thus surprising. But with bools at least it would be consistent.

Then again, the better the compiler and runtime get, the more that performance things get surprising, so maybe it's okay nowadays.

I agree some convT2I stats for all this would be interesting.

Owner

bradfitz commented Nov 1, 2016

I feel like last time smallish integers was brought up, the concern was that it would be inconsistent and thus surprising. But with bools at least it would be consistent.

Then again, the better the compiler and runtime get, the more that performance things get surprising, so maybe it's okay nowadays.

I agree some convT2I stats for all this would be interesting.

@mdempsky

This comment has been minimized.

Show comment Hide comment
@mdempsky

mdempsky Nov 1, 2016

Member

I was thinking we just have a 256-byte slab of "\x00\x01...\xff", and we can index into that for any read-only 1-byte value that needs to be converted to a pointer (e.g., converting a len-1 []byte to string, or storing a bool/byte/int8/uint8 into an interface).

If we instead used [...]int64{..., -3, -2, -1, 0, 1, 2, 3, ...} for a certain range of integers, we could avoid heap allocs for a wider variety of values.

Agreed collecting stats on recognizable patterns would be good.

Member

mdempsky commented Nov 1, 2016

I was thinking we just have a 256-byte slab of "\x00\x01...\xff", and we can index into that for any read-only 1-byte value that needs to be converted to a pointer (e.g., converting a len-1 []byte to string, or storing a bool/byte/int8/uint8 into an interface).

If we instead used [...]int64{..., -3, -2, -1, 0, 1, 2, 3, ...} for a certain range of integers, we could avoid heap allocs for a wider variety of values.

Agreed collecting stats on recognizable patterns would be good.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Nov 1, 2016

Contributor

I was thinking we just have a 256-byte slab of "\x00\x01...\xff"

That was my exact reaction as well.

Contributor

josharian commented Nov 1, 2016

I was thinking we just have a 256-byte slab of "\x00\x01...\xff"

That was my exact reaction as well.

@randall77

This comment has been minimized.

Show comment Hide comment
@randall77

randall77 Nov 1, 2016

Contributor

I had a CL a long time ago which did [...]int64{0..1024}. We ended up not using it because the performance cliff from 1024->1025 would be weird.

The 256-byte one would not have that problem, although it would weirdly encourage people to cram their data into 1 byte chunks.

Contributor

randall77 commented Nov 1, 2016

I had a CL a long time ago which did [...]int64{0..1024}. We ended up not using it because the performance cliff from 1024->1025 would be weird.

The 256-byte one would not have that problem, although it would weirdly encourage people to cram their data into 1 byte chunks.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Nov 1, 2016

Contributor

To spare others the search, that was CL 2500, which in turn also points to #8618. I'd still like to see this get done.

Contributor

josharian commented Nov 1, 2016

To spare others the search, that was CL 2500, which in turn also points to #8618. I'd still like to see this get done.

@griesemer

This comment has been minimized.

Show comment Hide comment
@griesemer

griesemer Nov 1, 2016

Contributor

I'm also still in favor of this. I think the benefits outweigh the oddity in performance behavior.

Contributor

griesemer commented Nov 1, 2016

I'm also still in favor of this. I think the benefits outweigh the oddity in performance behavior.

@bradfitz

This comment has been minimized.

Show comment Hide comment
@bradfitz

bradfitz Nov 1, 2016

Owner

We ended up not using it because the performance cliff from 1024->1025 would be weird.

Clearly we just need to smooth out that cliff with some fastrand(), slowly increasing the chance of allocating from 0 to 1024, where 1025 is 100%! /s

Owner

bradfitz commented Nov 1, 2016

We ended up not using it because the performance cliff from 1024->1025 would be weird.

Clearly we just need to smooth out that cliff with some fastrand(), slowly increasing the chance of allocating from 0 to 1024, where 1025 is 100%! /s

@mrjrieke

This comment has been minimized.

Show comment Hide comment
@mrjrieke

mrjrieke Nov 1, 2016

@mdempsky would slow compiler tiny bit.... of course, @bradfitz presents a difficult challenge for prediction with fastrand(). cqtm

mrjrieke commented Nov 1, 2016

@mdempsky would slow compiler tiny bit.... of course, @bradfitz presents a difficult challenge for prediction with fastrand(). cqtm

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Jan 22, 2017

Contributor

Byte-sized values handled in CL 35555. I went the compiler route because it was simple and meant that byte-size values required no runtime call at all, shrinking ns/op from ~5 to ~1.

After handling byte-sized values and constant values (#18704), I instrumented the runtime to dump what integer-ish values made it into convT2E. During make.bash, I got this (unsurprising) distribution, top 30 values only:

 percent cnt  bits  uint value
 28.83%  3995 64bit 0
  2.11%   292 64bit 1
  1.45%   201 64bit 4
  1.35%   187 64bit 3
  1.28%   177 64bit 2
  0.87%   120 64bit 17
  0.84%   117 64bit 6
  0.81%   112 64bit 5
  0.71%    99 64bit 7
  0.71%    98 64bit 47
  0.66%    91 64bit 8
  0.48%    66 64bit 13
  0.43%    60 32bit 118
  0.38%    53 64bit 12
  0.36%    50 64bit 9
  0.32%    45 64bit 42
  0.23%    32 64bit 39
  0.22%    30 64bit 44
  0.21%    29 64bit 10
  0.20%    28 64bit 30
  0.17%    24 64bit 16
  0.17%    23 64bit 14
  0.16%    22 32bit 106
  0.16%    22 64bit 11
  0.15%    21 32bit 101
  0.15%    21 64bit 85
  0.13%    18 32bit 281
  0.13%    18 32bit 395
  0.12%    17 32bit 398
  0.12%    16 32bit 123

That says to me we might get the best bang for our buck by inserting a check for integer-ish values of size <= uintptr in the generated code like:

  if val == 0 {
    e = {itab, &zerobase}
  } else {
    e = convT2E(...)
  }
Contributor

josharian commented Jan 22, 2017

Byte-sized values handled in CL 35555. I went the compiler route because it was simple and meant that byte-size values required no runtime call at all, shrinking ns/op from ~5 to ~1.

After handling byte-sized values and constant values (#18704), I instrumented the runtime to dump what integer-ish values made it into convT2E. During make.bash, I got this (unsurprising) distribution, top 30 values only:

 percent cnt  bits  uint value
 28.83%  3995 64bit 0
  2.11%   292 64bit 1
  1.45%   201 64bit 4
  1.35%   187 64bit 3
  1.28%   177 64bit 2
  0.87%   120 64bit 17
  0.84%   117 64bit 6
  0.81%   112 64bit 5
  0.71%    99 64bit 7
  0.71%    98 64bit 47
  0.66%    91 64bit 8
  0.48%    66 64bit 13
  0.43%    60 32bit 118
  0.38%    53 64bit 12
  0.36%    50 64bit 9
  0.32%    45 64bit 42
  0.23%    32 64bit 39
  0.22%    30 64bit 44
  0.21%    29 64bit 10
  0.20%    28 64bit 30
  0.17%    24 64bit 16
  0.17%    23 64bit 14
  0.16%    22 32bit 106
  0.16%    22 64bit 11
  0.15%    21 32bit 101
  0.15%    21 64bit 85
  0.13%    18 32bit 281
  0.13%    18 32bit 395
  0.12%    17 32bit 398
  0.12%    16 32bit 123

That says to me we might get the best bang for our buck by inserting a check for integer-ish values of size <= uintptr in the generated code like:

  if val == 0 {
    e = {itab, &zerobase}
  } else {
    e = convT2E(...)
  }
@minux

This comment has been minimized.

Show comment Hide comment
@minux

minux Jan 22, 2017

Member
Member

minux commented Jan 22, 2017

@randall77

This comment has been minimized.

Show comment Hide comment
@randall77

randall77 Jan 22, 2017

Contributor

It could be done if we increased the size of interfaces to 3 words. Otherwise, I'm afraid that there are a host of complications (mostly in the garbage collector) which make sometimes-a-pointer-sometimes-a-scalar fields hard. I can elaborate more if you wish.

Contributor

randall77 commented Jan 22, 2017

It could be done if we increased the size of interfaces to 3 words. Otherwise, I'm afraid that there are a host of complications (mostly in the garbage collector) which make sometimes-a-pointer-sometimes-a-scalar fields hard. I can elaborate more if you wish.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Jan 22, 2017

Contributor

Lots of prior discussion in #8405.

Contributor

josharian commented Jan 22, 2017

Lots of prior discussion in #8405.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Jan 22, 2017

Contributor

CL 35563 implements the strategy "use zerobase for small pointer-free zero values". It could probably be optimized a bit, but it's a first look at the cost/benefit of doing it entirely in the runtime. Checking if val == 0 in the calling code should be more efficient, because the == 0 check will be more efficient, but it's also considerably more of a pain to implement. :)

While running the stdlib tests, CL 35563 eliminated roughly 170,000 allocations. Increasing the size of zerobase did not show any improvement; larger pointer-free zero values sent to convT2E/convT2I are rare.

Contributor

josharian commented Jan 22, 2017

CL 35563 implements the strategy "use zerobase for small pointer-free zero values". It could probably be optimized a bit, but it's a first look at the cost/benefit of doing it entirely in the runtime. Checking if val == 0 in the calling code should be more efficient, because the == 0 check will be more efficient, but it's also considerably more of a pain to implement. :)

While running the stdlib tests, CL 35563 eliminated roughly 170,000 allocations. Increasing the size of zerobase did not show any improvement; larger pointer-free zero values sent to convT2E/convT2I are rare.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Jan 22, 2017

Contributor

And with that, I'm going to pause my hacking on interface conversions. I hope that making the code and numbers concrete will help the conversation about the correct direction to go from here.

Contributor

josharian commented Jan 22, 2017

And with that, I'm going to pause my hacking on interface conversions. I hope that making the code and numbers concrete will help the conversation about the correct direction to go from here.

@gopherbot

This comment has been minimized.

Show comment Hide comment
@gopherbot

gopherbot Jan 24, 2017

CL https://golang.org/cl/35555 mentions this issue.

CL https://golang.org/cl/35555 mentions this issue.

@gopherbot

This comment has been minimized.

Show comment Hide comment
@gopherbot

gopherbot Jan 24, 2017

CL https://golang.org/cl/35563 mentions this issue.

CL https://golang.org/cl/35563 mentions this issue.

@minux

This comment has been minimized.

Show comment Hide comment
@minux

minux Jan 25, 2017

Member
Member

minux commented Jan 25, 2017

gopherbot pushed a commit that referenced this issue Feb 2, 2017

cmd/compile, runtime: convert byte-sized values to interfaces without…
… allocation

Based in part on khr's CL 2500.

Updates #17725
Updates #18121

Change-Id: I744e1f92fc2104e6c5bd883a898c30b2eea8cc31
Reviewed-on: https://go-review.googlesource.com/35555
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@ancaemanuel

This comment has been minimized.

Show comment Hide comment
@ancaemanuel

ancaemanuel Feb 3, 2017

How can an an static enum from 0x00 to 0xFF can help ?
in 0358367
Can anyone explain ?

How can an an static enum from 0x00 to 0xFF can help ?
in 0358367
Can anyone explain ?

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 3, 2017

Contributor

Bools are represented as 0 or 1, so they can use the static values for 0 and 1.

Contributor

josharian commented Feb 3, 2017

Bools are represented as 0 or 1, so they can use the static values for 0 and 1.

@ancaemanuel

This comment has been minimized.

Show comment Hide comment
@ancaemanuel

ancaemanuel Feb 3, 2017

I know that, and thanks for your reply, but I still do not understand the theory behind this optimisation.
Why do we need to list all the values of one byte ?

I know that, and thanks for your reply, but I still do not understand the theory behind this optimisation.
Why do we need to list all the values of one byte ?

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 3, 2017

Contributor

I plan to write a blog post (commaok.xyz) sometime in the next few days that explains a lot of this.

Contributor

josharian commented Feb 3, 2017

I plan to write a blog post (commaok.xyz) sometime in the next few days that explains a lot of this.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 3, 2017

Contributor

I just want to finish CL 35563 first, and that will take some experimentation.

Contributor

josharian commented Feb 3, 2017

I just want to finish CL 35563 first, and that will take some experimentation.

@rasky

This comment has been minimized.

Show comment Hide comment
@rasky

rasky Feb 4, 2017

Contributor

@ancaemanuel an interface is two words: a pointer to a type and a pointer to a value. To store "12" in an interface, you need to have a position in memory that holds the value "12", to be pointed by the interface's second word. Up until now, that position in memory would be allocated on the heap, every time anew. With this patch, for special values between 0 and 255, the compiler knows that there is a location in memory that already holds those values and use it, rather than allocating on the heap.

Contributor

rasky commented Feb 4, 2017

@ancaemanuel an interface is two words: a pointer to a type and a pointer to a value. To store "12" in an interface, you need to have a position in memory that holds the value "12", to be pointed by the interface's second word. Up until now, that position in memory would be allocated on the heap, every time anew. With this patch, for special values between 0 and 255, the compiler knows that there is a location in memory that already holds those values and use it, rather than allocating on the heap.

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 7, 2017

Contributor

@ancaemanuel perhaps http://commaok.xyz/post/interface-allocs/ will also provide some useful background and context, though you're probably better off reading https://research.swtch.com/interfaces :)

Contributor

josharian commented Feb 7, 2017

@ancaemanuel perhaps http://commaok.xyz/post/interface-allocs/ will also provide some useful background and context, though you're probably better off reading https://research.swtch.com/interfaces :)

@ancaemanuel

This comment has been minimized.

Show comment Hide comment
@ancaemanuel

ancaemanuel Feb 10, 2017

Thank you.
I have an ideea: the compiler knows the constants used for interfaces in a program. How about generating an static array (with the constants already in the program) at compile time ?

Thank you.
I have an ideea: the compiler knows the constants used for interfaces in a program. How about generating an static array (with the constants already in the program) at compile time ?

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 10, 2017

Contributor

@ancaemanuel if I understand correctly, that is https://golang.org/issue/18704, which is done. (This is also discussed a bit more readably at http://commaok.xyz/post/interface-allocs/.)

Contributor

josharian commented Feb 10, 2017

@ancaemanuel if I understand correctly, that is https://golang.org/issue/18704, which is done. (This is also discussed a bit more readably at http://commaok.xyz/post/interface-allocs/.)

@ancaemanuel

This comment has been minimized.

Show comment Hide comment
@ancaemanuel

ancaemanuel Feb 10, 2017

I read the article.
This is an optimisation for small numbers.
If an program uses 500 constants, they are already in the compiled program. How about using that array (or map) ?

I read the article.
This is an optimisation for small numbers.
If an program uses 500 constants, they are already in the compiled program. How about using that array (or map) ?

@josharian

This comment has been minimized.

Show comment Hide comment
@josharian

josharian Feb 10, 2017

Contributor

I'm not sure I understand. Would you file a new issue with some more details and cc me and we can discuss there? Thanks!

Contributor

josharian commented Feb 10, 2017

I'm not sure I understand. Would you file a new issue with some more details and cc me and we can discuss there? Thanks!

josharian added a commit to josharian/go that referenced this issue Feb 14, 2017

runtime: use zeroVal for small zero values in interfaces
Fixes #17725

name                old time/op    new time/op    delta
ConvT2EInt/const-8    0.90ns ± 4%    0.90ns ± 2%      ~     (p=0.623 n=20+15)
ConvT2EInt/zero-8     21.5ns ± 2%     7.4ns ± 6%   -65.33%  (p=0.000 n=19+18)
ConvT2EInt/one-8      21.6ns ± 2%    22.8ns ± 2%    +5.21%  (p=0.000 n=20+19)

name                old alloc/op   new alloc/op   delta
ConvT2EInt/const-8     0.00B          0.00B           ~     (all equal)
ConvT2EInt/zero-8      8.00B ± 0%     0.00B       -100.00%  (p=0.000 n=20+20)
ConvT2EInt/one-8       8.00B ± 0%     8.00B ± 0%      ~     (all equal)

name                old allocs/op  new allocs/op  delta
ConvT2EInt/const-8      0.00           0.00           ~     (all equal)
ConvT2EInt/zero-8       1.00 ± 0%      0.00       -100.00%  (p=0.000 n=20+20)
ConvT2EInt/one-8        1.00 ± 0%      1.00 ± 0%      ~     (all equal)

Change-Id: I5b71f9e44e3de8b8f2284a3821c5176c13ab2c61
@randall77

This comment has been minimized.

Show comment Hide comment
@randall77

randall77 Mar 26, 2017

Contributor

This is fixed with 0358367

Contributor

randall77 commented Mar 26, 2017

This is fixed with 0358367

@randall77 randall77 closed this Mar 26, 2017

@gopherbot gopherbot locked and limited conversation to collaborators Mar 26, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.