-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reflect: Call is slow #7818
Comments
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Call frame allocations can account for significant portion of all allocations in a program, if call is executed in an inner loop (e.g. to process every line in a log). On the other hand, the allocation is easy to remove using sync.Pool since the allocation is strictly scoped. benchmark old ns/op new ns/op delta BenchmarkCall 634 338 -46.69% BenchmarkCall-4 496 167 -66.33% benchmark old allocs new allocs delta BenchmarkCall 1 0 -100.00% BenchmarkCall-4 1 0 -100.00% Update #7818 Change-Id: Icf60cce0a9be82e6171f0c0bd80dee2393db54a7 Reviewed-on: https://go-review.googlesource.com/1954 Reviewed-by: Keith Randall <khr@golang.org>
Looking at some internal fleet-wide CPU usage, I see It might be time to optimize this. In Go 1.8, there's now precedent in the reflect package for returning a worker func ( Investigate the top 2014 comment's /cc @dsnet |
Value.Interface() sort of does this, but you have to call the result like a real function, not using Value.Call(). If you can fit code into using Value.Interface(), then there's no allocation overhead (unless the frame pool is empty) and no validation. The compiler constructs the arguments frame and interprets the result frame and reflect just has to do some memory copying. (That said, it still does two copies of both the arguments and the results. I feel like it should be possible to get that down to one.) |
This comment has been minimized.
This comment has been minimized.
For some concrete numbers, as of 1.9 on darwin/amd64,
|
This comment has been minimized.
This comment has been minimized.
Caching function layout in a Caller reduce 50% reflect call overhead. https://gist.github.com/sillyousu/606e4874839456cc02335bd1c5045f27 update @bruth 's benchmark:
|
This comment has been minimized.
This comment has been minimized.
It seems that java sets a buffer slice to avoid this problem. |
Change https://golang.org/cl/166462 mentions this issue: |
mustBe was barely over budget, so manually inlining the first flag.kind call is enough. Add a TODO to reverse that in the future, once the compiler gets better. mustBeExported and mustBeAssignable were over budget by a larger amount, so add slow path functions instead. This is the same strategy used in the sync package for common methods like Once.Do, for example. Lots of exported reflect.Value methods call these assert-like unexported methods, so avoiding the function call overhead in the common case does shave off a percent from most exported APIs. Finally, add the methods to TestIntendedInlining. While at it, replace a couple of uses of the 0 Kind with its descriptive name, Invalid. name old time/op new time/op delta Call-8 68.0ns ± 1% 66.8ns ± 1% -1.81% (p=0.000 n=10+9) PtrTo-8 8.00ns ± 2% 7.83ns ± 0% -2.19% (p=0.000 n=10+9) Updates #7818. Change-Id: Ic1603b640519393f6b50dd91ec3767753eb9e761 Reviewed-on: https://go-review.googlesource.com/c/go/+/166462 Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Taking a look at a CPU profile, IsVariadic was calculated multiple times unnecessarily. I also added a new BechmarkCallMethod to measure this use-case. name old time/op new time/op delta Call-16 20.3ns ± 8% 20.3ns ±16% ~ (p=0.443 n=18+20) CallMethod-16 99.2ns ± 3% 90.1ns ± 2% -9.22% (p=0.000 n=20+17) CallArgCopy/size=128-16 15.0ns ± 2% 14.5ns ± 3% -2.76% (p=0.000 n=20+19) CallArgCopy/size=256-16 15.9ns ± 7% 15.3ns ± 5% -4.26% (p=0.000 n=20+20) CallArgCopy/size=1024-16 17.6ns ± 7% 17.2ns ± 6% -1.73% (p=0.044 n=19+20) CallArgCopy/size=4096-16 25.3ns ± 4% 24.9ns ± 4% -1.66% (p=0.016 n=18+20) CallArgCopy/size=65536-16 375ns ± 4% 376ns ± 4% ~ (p=0.644 n=20+20) name old alloc/op new alloc/op delta Call-16 0.00B 0.00B ~ (all equal) CallMethod-16 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Call-16 0.00 0.00 ~ (all equal) CallMethod-16 0.00 0.00 ~ (all equal) name old speed new speed delta CallArgCopy/size=128-16 8.56GB/s ± 2% 8.80GB/s ± 3% +2.84% (p=0.000 n=20+19) CallArgCopy/size=256-16 16.1GB/s ± 6% 16.8GB/s ± 5% +4.45% (p=0.000 n=20+20) CallArgCopy/size=1024-16 58.2GB/s ± 7% 59.4GB/s ± 6% +2.16% (p=0.026 n=20+20) CallArgCopy/size=4096-16 161GB/s ± 4% 165GB/s ± 4% +1.95% (p=0.007 n=17+20) CallArgCopy/size=65536-16 175GB/s ± 4% 174GB/s ± 4% ~ (p=0.640 n=20+20) Updates golang#7818 Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
These calls are cachable, so do that in order to avoid doing extra work. This opportunity was discovered while taking a look at a CPU profile while investigating golang#7818. I added a BenchmarkCallMethod which is similar to BechmarkCall but for a method receiver. Benchmark results, including the new BenchmarkCallMethod: name old time/op new time/op delta Call-16 22.0ns ±19% 20.2ns ±17% -8.08% (p=0.000 n=40+40) CallMethod-16 100ns ± 3% 91ns ± 2% -9.13% (p=0.000 n=40+39) CallArgCopy/size=128-16 15.7ns ± 1% 14.3ns ± 4% -8.98% (p=0.000 n=38+37) CallArgCopy/size=256-16 15.9ns ± 3% 15.0ns ± 5% -6.12% (p=0.000 n=39+39) CallArgCopy/size=1024-16 18.8ns ± 6% 17.1ns ± 6% -9.03% (p=0.000 n=38+38) CallArgCopy/size=4096-16 26.6ns ± 3% 25.2ns ± 4% -5.19% (p=0.000 n=39+40) CallArgCopy/size=65536-16 379ns ± 3% 371ns ± 5% -2.11% (p=0.000 n=39+40) name old alloc/op new alloc/op delta Call-16 0.00B 0.00B ~ (all equal) CallMethod-16 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Call-16 0.00 0.00 ~ (all equal) CallMethod-16 0.00 0.00 ~ (all equal) name old speed new speed delta CallArgCopy/size=128-16 8.13GB/s ± 1% 8.92GB/s ± 4% +9.77% (p=0.000 n=38+38) CallArgCopy/size=256-16 16.1GB/s ± 3% 17.1GB/s ± 5% +6.56% (p=0.000 n=39+39) CallArgCopy/size=1024-16 54.6GB/s ± 6% 60.1GB/s ± 5% +9.93% (p=0.000 n=38+38) CallArgCopy/size=4096-16 154GB/s ± 5% 163GB/s ± 4% +5.63% (p=0.000 n=40+40) CallArgCopy/size=65536-16 173GB/s ± 3% 177GB/s ± 5% +2.18% (p=0.000 n=39+40) Updates golang#7818. Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
These calls are cachable, so do that in order to avoid doing extra work. This opportunity was discovered while taking a look at a CPU profile while investigating golang#7818. I added a BenchmarkCallMethod which is similar to BechmarkCall but for a method receiver. Benchmark results, including the new BenchmarkCallMethod: name old time/op new time/op delta Call-16 22.0ns ±19% 20.2ns ±17% -8.08% (p=0.000 n=40+40) CallMethod-16 100ns ± 3% 91ns ± 2% -9.13% (p=0.000 n=40+39) CallArgCopy/size=128-16 15.7ns ± 1% 14.3ns ± 4% -8.98% (p=0.000 n=38+37) CallArgCopy/size=256-16 15.9ns ± 3% 15.0ns ± 5% -6.12% (p=0.000 n=39+39) CallArgCopy/size=1024-16 18.8ns ± 6% 17.1ns ± 6% -9.03% (p=0.000 n=38+38) CallArgCopy/size=4096-16 26.6ns ± 3% 25.2ns ± 4% -5.19% (p=0.000 n=39+40) CallArgCopy/size=65536-16 379ns ± 3% 371ns ± 5% -2.11% (p=0.000 n=39+40) name old alloc/op new alloc/op delta Call-16 0.00B 0.00B ~ (all equal) CallMethod-16 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Call-16 0.00 0.00 ~ (all equal) CallMethod-16 0.00 0.00 ~ (all equal) name old speed new speed delta CallArgCopy/size=128-16 8.13GB/s ± 1% 8.92GB/s ± 4% +9.77% (p=0.000 n=38+38) CallArgCopy/size=256-16 16.1GB/s ± 3% 17.1GB/s ± 5% +6.56% (p=0.000 n=39+39) CallArgCopy/size=1024-16 54.6GB/s ± 6% 60.1GB/s ± 5% +9.93% (p=0.000 n=38+38) CallArgCopy/size=4096-16 154GB/s ± 5% 163GB/s ± 4% +5.63% (p=0.000 n=40+40) CallArgCopy/size=65536-16 173GB/s ± 3% 177GB/s ± 5% +2.18% (p=0.000 n=39+40) Updates golang#7818. Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
Change https://golang.org/cl/281252 mentions this issue: |
These calls are cachable, so do that in order to avoid doing extra work. This opportunity was discovered while taking a look at a CPU profile while investigating golang#7818. I added a BenchmarkCallMethod which is similar to BechmarkCall but for a method receiver. Benchmark results, including the new BenchmarkCallMethod: name old time/op new time/op delta Call-16 22.0ns ±19% 20.2ns ±17% -8.08% (p=0.000 n=40+40) CallMethod-16 100ns ± 3% 91ns ± 2% -9.13% (p=0.000 n=40+39) CallArgCopy/size=128-16 15.7ns ± 1% 14.3ns ± 4% -8.98% (p=0.000 n=38+37) CallArgCopy/size=256-16 15.9ns ± 3% 15.0ns ± 5% -6.12% (p=0.000 n=39+39) CallArgCopy/size=1024-16 18.8ns ± 6% 17.1ns ± 6% -9.03% (p=0.000 n=38+38) CallArgCopy/size=4096-16 26.6ns ± 3% 25.2ns ± 4% -5.19% (p=0.000 n=39+40) CallArgCopy/size=65536-16 379ns ± 3% 371ns ± 5% -2.11% (p=0.000 n=39+40) name old alloc/op new alloc/op delta Call-16 0.00B 0.00B ~ (all equal) CallMethod-16 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Call-16 0.00 0.00 ~ (all equal) CallMethod-16 0.00 0.00 ~ (all equal) name old speed new speed delta CallArgCopy/size=128-16 8.13GB/s ± 1% 8.92GB/s ± 4% +9.77% (p=0.000 n=38+38) CallArgCopy/size=256-16 16.1GB/s ± 3% 17.1GB/s ± 5% +6.56% (p=0.000 n=39+39) CallArgCopy/size=1024-16 54.6GB/s ± 6% 60.1GB/s ± 5% +9.93% (p=0.000 n=38+38) CallArgCopy/size=4096-16 154GB/s ± 5% 163GB/s ± 4% +5.63% (p=0.000 n=40+40) CallArgCopy/size=65536-16 173GB/s ± 3% 177GB/s ± 5% +2.18% (p=0.000 n=39+40) Updates golang#7818. Signed-off-by: Ignacio Hagopian <jsign.uy@gmail.com>
These calls are cacheable, so do that to avoid doing extra work. This opportunity was discovered while taking a look at a CPU profile while investigating #7818. I added a BenchmarkCallMethod, which is similar to BechmarkCall but for a method receiver. Benchmark results, including the new BenchmarkCallMethod: name old time/op new time/op delta Call-16 22.0ns ±19% 20.2ns ±17% -8.08% (p=0.000 n=40+40) CallMethod-16 100ns ± 3% 91ns ± 2% -9.13% (p=0.000 n=40+39) CallArgCopy/size=128-16 15.7ns ± 1% 14.3ns ± 4% -8.98% (p=0.000 n=38+37) CallArgCopy/size=256-16 15.9ns ± 3% 15.0ns ± 5% -6.12% (p=0.000 n=39+39) CallArgCopy/size=1024-16 18.8ns ± 6% 17.1ns ± 6% -9.03% (p=0.000 n=38+38) CallArgCopy/size=4096-16 26.6ns ± 3% 25.2ns ± 4% -5.19% (p=0.000 n=39+40) CallArgCopy/size=65536-16 379ns ± 3% 371ns ± 5% -2.11% (p=0.000 n=39+40) name old alloc/op new alloc/op delta Call-16 0.00B 0.00B ~ (all equal) CallMethod-16 0.00B 0.00B ~ (all equal) name old allocs/op new allocs/op delta Call-16 0.00 0.00 ~ (all equal) CallMethod-16 0.00 0.00 ~ (all equal) name old speed new speed delta CallArgCopy/size=128-16 8.13GB/s ± 1% 8.92GB/s ± 4% +9.77% (p=0.000 n=38+38) CallArgCopy/size=256-16 16.1GB/s ± 3% 17.1GB/s ± 5% +6.56% (p=0.000 n=39+39) CallArgCopy/size=1024-16 54.6GB/s ± 6% 60.1GB/s ± 5% +9.93% (p=0.000 n=38+38) CallArgCopy/size=4096-16 154GB/s ± 5% 163GB/s ± 4% +5.63% (p=0.000 n=40+40) CallArgCopy/size=65536-16 173GB/s ± 3% 177GB/s ± 5% +2.18% (p=0.000 n=39+40) Updates #7818. Change-Id: I94f88811ea9faf3dc2543984a13b360b5db66a4b GitHub-Last-Rev: 9bbaa18 GitHub-Pull-Request: #43475 Reviewed-on: https://go-review.googlesource.com/c/go/+/281252 Reviewed-by: Daniel Martí <mvdan@mvdan.cc> Reviewed-by: Keith Randall <khr@golang.org> Trust: Daniel Martí <mvdan@mvdan.cc> Trust: Brad Fitzpatrick <bradfitz@golang.org> Run-TryBot: Daniel Martí <mvdan@mvdan.cc> TryBot-Result: Go Bot <gobot@golang.org>
Now that we have #49340, which is a reasonable proposal which is just waiting for a prototype, should we close this issue as a duplicate? Is there any other part of Call which is slower than it needs to be besides the allocation for the results slice? |
I updated benchmark by @bruth here and these are results on my machine:
I added method invocation of using
s := reflect.ValueOf(v)
f := s.MethodByName("Inc").Interface().(func())
for k := 0; k < n; k++ {
f()
} I would expect and assume that So I think there are definitely some improvements needed here. |
The text was updated successfully, but these errors were encountered: