
cmd/compile: generic functions are significantly slower than identical non-generic functions in some cases #50182

Closed
argusdusty opened this issue Dec 15, 2021 · 22 comments
Labels
NeedsInvestigation Performance

Comments

@argusdusty commented Dec 15, 2021

What version of Go are you using (go version)?

$ go version
go version go1.18beta1 windows/amd64

Does this issue reproduce with the latest release?

It reproduces on the 1.18 Beta 1 release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\argusdusty\AppData\Local\go-build
set GOENV=C:\Users\argusdusty\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\argusdusty\Code\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:\Users\argusdusty\Code
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.18beta1
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\argusdusty\Code\src\foo\go.mod
set GOWORK=
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\ARGUSD~1\AppData\Local\Temp\go-build3090171098=/tmp/go-build -gno-record-gcc-switches

What did you do?

sort_test.go:

package sort_test

import (
	"sort"
	"testing"
)

func IsSortedGeneric[T sort.Interface](data T) bool {
	n := data.Len()
	for i := n - 1; i > 0; i-- {
		if data.Less(i, i-1) {
			return false
		}
	}
	return true
}

var data = sort.IntSlice{-10, -5, 0, 1, 2, 3, 5, 7, 11, 100, 100, 100, 1000, 10000}

func BenchmarkIsSorted(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sort.IsSorted(data)
	}
}

func BenchmarkIsSortedGeneric(b *testing.B) {
	for i := 0; i < b.N; i++ {
		IsSortedGeneric[sort.IntSlice](data)
	}
}

go test -bench=. -cpu=1 -benchmem sort_test.go

What did you expect to see?

IsSortedGeneric to be faster than, or at least very similar in runtime to, the otherwise identical sort.IsSorted.

What did you see instead?

goos: windows
goarch: amd64
cpu: AMD Ryzen 5 3600X 6-Core Processor
BenchmarkIsSorted               21008770                55.26 ns/op           24 B/op          1 allocs/op
BenchmarkIsSortedGeneric         2859728               421.3 ns/op           336 B/op         14 allocs/op
PASS
ok      command-line-arguments  3.003s

This appears to be a problem specifically when methods are part of the type constraint - and you may note above that the number of allocs (14) is exactly equal to the number of method calls (1 .Len, 13 .Less calls). Compiler output points to runtime.convTslice triggered at each method call as the cause of the allocs.

A generic IsSorted implementation that restricts to a slice of constraints.Ordered does not have this slowdown. Experimentation also finds that supplying a pointer type instead of a value does not have this issue:

type PtrIntSlice []int

func (x *PtrIntSlice) Len() int           { return len(*x) }
func (x *PtrIntSlice) Less(i, j int) bool { return (*x)[i] < (*x)[j] }
func (x *PtrIntSlice) Swap(i, j int)      { (*x)[i], (*x)[j] = (*x)[j], (*x)[i] }

var ptrdata = PtrIntSlice{-10, -5, 0, 1, 2, 3, 5, 7, 11, 100, 100, 100, 1000, 10000}

func BenchmarkIsSortedPtrGeneric(b *testing.B) {
	for i := 0; i < b.N; i++ {
		IsSortedGeneric[*PtrIntSlice](&ptrdata)
	}
}

BenchmarkIsSortedPtrGeneric 44432265 27.07 ns/op 0 B/op 0 allocs/op

@ALTree ALTree changed the title runtime: generic functions are significantly slower than identical non-generic functions in some cases cmd/compile: generic functions are significantly slower than identical non-generic functions in some cases Dec 15, 2021
@ALTree ALTree added Performance NeedsInvestigation labels Dec 15, 2021
@ALTree (Member) commented Dec 15, 2021

Not sure if this is expected at this stage or not, but it could be a consequence of the current implementation.

cc @randall77 @danscales

@komuw (Contributor) commented Dec 15, 2021

If you use -gcflags=all='-G=3 -d=unified=1', the generic version becomes much faster than the non-generic one.

gotip version
go version devel go1.18-0f05ed3b78 Tue Dec 14 20:25:03 2021 +0000 linux/amd64

gotip test -bench=. -cpu=1 -benchmem ./...

goos: linux
goarch: amd64
pkg: cool
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkIsSorted        	23439408	        51.01 ns/op	      24 B/op	       1 allocs/op
BenchmarkIsSortedGeneric 	 3412116	       341.3 ns/op	     336 B/op	      14 allocs/op

gotip test -gcflags=all='-G=3 -d=unified=1' -bench=. -cpu=1 -benchmem ./...

goos: linux
goarch: amd64
pkg: cool
cpu: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
BenchmarkIsSorted        	24290271	        49.23 ns/op	      24 B/op	       1 allocs/op
BenchmarkIsSortedGeneric 	232407577	         5.210 ns/op	       0 B/op	       0 allocs/op

I still do not know what the unified flag does; I saw it mentioned once on @mdempsky's Twitch livestream.

@randall77 (Contributor) commented Dec 15, 2021

This is because we convert data (which is a slice) into an interface so we can call a method on it. That conversion requires an allocation (because the callee is not statically known and could save the receiver away somewhere).
Maybe we could lift that conversion out of the loop? That is, if we need to convert a variable of type-parameter type to an interface, do so at the point where that variable is last assigned instead of where the method call actually happens. This would require some analysis, but easy cases (like this one, where data is never reassigned) could be doable.

@randall77 (Contributor) commented Dec 15, 2021

unified has no allocations because it is fully stenciled and the receiver is known at the callsite.

@aclements (Member) commented Dec 15, 2021

@randall77, why does that lead to 14 allocations per iteration, though? I would expect it to result in 1 additional allocation per iteration. Instead, we seem to be allocating for each element of the slice.

@randall77 (Contributor) commented Dec 15, 2021

It's one allocation per iteration of the loop in IsSortedGeneric. That's len(data) allocations per benchmark iteration.

@randall77 (Contributor) commented Dec 15, 2021

(len(data)-1 allocations for the Less calls, plus one for the Len call.)

@ianlancetaylor ianlancetaylor added this to the Go1.19 milestone Dec 15, 2021
@DmitriyMV commented Dec 16, 2021

@komuw
It was interesting to me, so I took a dive, and here's what I found: as described here, unified IR doesn't currently use the dictionary (shape-based) approach for generics; instead it does stenciling (i.e., separate code for each type, regardless of memory representation). So the resulting binary is indeed faster, but a lot bigger. That also explains why there were no allocations with this flag.

Further reading:
[dev.typeparams] all: add GOEXPERIMENT=unified knob
#46786
[dev.typeparams] cmd/compile: unified IR construction

@ConsoleTVs commented Dec 17, 2021

Is there a reason to add this to the Go 1.19 milestone rather than 1.18 itself, considering 1.18 is still in beta?

@randall77 (Contributor) commented Dec 17, 2021

I suspect any fix for this is going to be large + difficult. We don't want to put changes like that in at this point, especially if it is just for performance.
If it turns out the fix is small + easy, we could put it in 1.18.

@g-talbot commented Dec 19, 2021

Why isn't data.Less() inlined into the loop and then, particularly for integers, it becomes a single comparison instruction? You know the precise type at the call site, amirite?

@randall77 (Contributor) commented Dec 20, 2021

Why isn't data.Less() inlined into the loop and then, particularly for integers, it becomes a single comparison instruction? You know the precise type at the call site, amirite?

No, we don't. We know the type at the callsite implements sort.Interface, and we know its memory layout looks like a slice, but we don't know its exact type. In particular, we don't know what the callee is. We can't inline in that situation.

@danscales (Contributor) commented Dec 23, 2021

This is because we convert data (which is a slice) into an interface so we can call a method on it. That conversion requires an allocation (because the callee is not statically known and could save the receiver away somewhere). Maybe we could lift that conversion out of the loop? i.e. if we need to convert a variable of type parameter to an interface, do so at the place where that variable is last assigned instead of where the method call actually happens. This would require some analysis but easy cases (like this one where data is never reassigned) could be doable.

@randall77 It would be useful to represent explicitly the knowledge that all dictionaries are read-only (static). Maybe we should represent access to a dictionary entry as a single operation that survives into the SSA passes, rather than breaking it down immediately into the indexing and loading operations, as we currently do during instantiation (stenciling). That way, we could ensure that the dictionary access operation takes no memory input, and therefore can always be CSE'd when it has identical inputs.

In this case (and probably most cases), it seems like we also need to make sure the CONVIDATA calls can be CSE'd, but they are transformed during the order() part of Walk() into run-time calls. We might want the CONVIDATA call to survive into SSA (so, again, it can easily be CSE'd for the same input value), but that seems harder, since all the logic for understanding/translating CONVIDATA is in walk.go.

@g-talbot commented Dec 24, 2021

IsSortedGeneric[sort.IntSlice](data)

@randall77 But you do know the exact type. It's passed right here in the brackets. The compiler should be able to deduce that it can short-circuit the type here and that only sort.IntSlice's Less() implementation is being called. This is a real issue. What's the point of generics if the compiler can't short-circuit the interface machinery when it knows the precise types? You lose the performance benefit via inlining if you can't do that.

@randall77 (Contributor) commented Dec 24, 2021

@randall77 But you do know the exact type. It's passed right here in the brackets.

We don't know it when compiling IsSortedGeneric. We don't compile IsSortedGeneric[T] for every T that instantiates it. We compile it only for each distinct shape of the instantiating T. See https://github.com/golang/proposal/blob/master/design/generics-implementation-gcshape.md, and in particular the second bullet in the risks section.
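A minimal illustration of why the exact type is unknown inside the shared instantiation (the types A, B and the function name here are hypothetical, invented for this sketch): two distinct instantiating types can share a single gcshape, so the one compiled body must dispatch through the dictionary rather than hard-coding (or inlining) either method.

```go
package main

import "fmt"

// Two distinct types with the same underlying memory layout ([]int).
// Under the gcshape scheme, the compiler emits ONE instantiation of
// name for this shape; the method is looked up via a per-type
// dictionary, so the call site cannot be devirtualized to A.Name or
// B.Name.
type A []int
type B []int

func (A) Name() string { return "A" }
func (B) Name() string { return "B" }

func name[T interface{ Name() string }](x T) string { return x.Name() }

func main() {
	fmt.Println(name(A{}), name(B{})) // same generated code, different results
}
```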

@g-talbot commented Dec 25, 2021

@randall77 So, for example, a generic sorting implementation (think sort.Stable, but generic rather than interface-based) compiles down to something functionally identical to the existing sort.Stable that takes an interface?

I've run into performance problems with sort.Sort and sort.Stable -- copying the implementation to a separate file and running s/Interface/MyType/g can be many times faster -- and I was hoping that generics would resolve this?

@ianlancetaylor (Contributor) commented Dec 26, 2021

@g-talbot The goal of generics is not to provide faster execution time. The goal is to make it easier to write certain kinds of code with less repetitive boilerplate and more compile-time type checking. Rewriting code to use generics may sometimes make it faster. It may sometimes make it slower (as in this example). While we will continue to improve the compiler balancing the needs of fast builds and fast runtimes, it will never be the case that generic code will always be faster than non-generic code. That is not a goal.

@gopherbot commented Jan 13, 2022

Change https://golang.org/cl/378178 mentions this issue: cmd/compile: stop interface conversions for generic method calls from allocating

@bboreham (Contributor) commented Jan 13, 2022

Just wanted to reference #48849 which to me is a very similar issue.

@gopherbot commented Feb 11, 2022

Change https://go.dev/cl/385274 mentions this issue: cmd/compile: use method expression closures to implement bound method calls

@gopherbot commented Feb 11, 2022

Change https://go.dev/cl/385254 mentions this issue: cmd/compile: implement generic method expressions with closures in dictionary

@thepudds commented Feb 15, 2022

Just wanted to share some benchmarks showing the improvement for the example at the top of this issue.

  • The first change in CL 378178 that was merged in January reduced the time by ~90% (compared to tip as of 4f6f68e, which was before any change for this issue).

  • The pending CL 385274 reduces the time by another ~20% (and also simplifies the implementation).

$ benchstat tip-4f6f68e cl-378178 cl-385274 

name \ time/op      tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32   461ns ± 1%    42ns ± 0%    34ns ± 0%
IsSorted-32         65.4ns ± 1%  65.3ns ± 1%  65.5ns ± 1%

name \ alloc/op     tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32    336B ± 0%      0B           0B     
IsSorted-32          24.0B ± 0%   24.0B ± 0%   24.0B ± 0%

name \ allocs/op    tip-4f6f68e  cl-378178    cl-385274
IsSortedGeneric-32    14.0 ± 0%     0.0          0.0     
IsSorted-32           1.00 ± 0%    1.00 ± 0%    1.00 ± 0%

gopherbot pushed a commit that referenced this issue Mar 26, 2022
…ctionary

Currently we do quite a dance for method expressions on generic
types. We write a new closure, and in that closure convert the
receiver to an interface with the required method, then call the
target using an interface call.

Instead in this CL, we just allocate a (captureless) closure in the
dictionary which implements that method expression.

This CL makes method expressions faster and simpler. But the real win
comes in some follow-on CLs, where we can use the same closure to
implement bound method calls, instead of converting to interface and
having wrappers convert back. Much faster and simpler.

Still thinking about how to do method values. The receiver still
needs to be captured, so there must be some closure involved, I think.

Update #50182

Change-Id: I1fbd57e7105663f8b049955b8f4111649a5f4aa8
Reviewed-on: https://go-review.googlesource.com/c/go/+/385254
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 26, 2022
… calls

When we have x.M(args) where x is a value of type parameter type, we
currently cast x to the bound of that type parameter (which is an interface)
and then invoke the method on that interface. That's pretty inefficient
because:
 1) We need to convert x to an interface, which often requires allocation.
    With CL 378178 it is at least stack allocation, but an allocation nonetheless.
 2) We need to call through wrapper functions to unpack the interface
    into the right argument locations for the callee.

Instead, let's just call the target directly. The previous CL to this one
added method expression closures to the dictionary, which is a simple
captureless closure that implements T.M for type parameter T and method M.
So to implement x.M(args) for x of type T, we use methodexpr(T,M)(x, args).
We just need to move x from the receiver slot to the first argument, and
use the dictionary entry to implement the polymorphism. This works because
we stencil by shape, so we know how to marshal x for the call even though
we don't know its exact type.

We should be able to revert CL 378178 after this one, as that optimization
will no longer be necessary, as we're no longer converting values to
interfaces to implement this language construct.

Update #50182

Change-Id: I813de4510e41ab63626e58bd1167f9ae93016202
Reviewed-on: https://go-review.googlesource.com/c/go/+/385274
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
… allocating

Let T be a type parameter, and say we instantiate it with S, a type
that isn't pointer-like (e.g. a pair of ints, or as in 50182, a
slice). Then to call a method m on a variable of type T, the compiler
does essentially:

var v T = ...
i := (interface{m()})(v)
i.m()

The conversion at that second line allocates, as we need to make the
data word for an interface. And in the general case, that interface
may live an arbitrarily long time. But in this case, we know it
doesn't.

The data word of i has type *S.  When we call i.m, we can't call S.m
directly. It is expecting an S, not a *S. We call through a wrapper
defined on *S, which looks like:

func (p *S) m() {
   var s S = *p
   s.m()
}

The value passed in for p is exactly the data word mentioned above. It
never escapes anywhere - the wrapper copies a type S variable out of
*p and p is dead after that. That means that in the situation where we
build an interface for the explicit purpose of calling a method on it,
and use that built interface nowhere else, the allocation of the data
word for that interface is known to die before the call returns and
thus can be stack allocated.

One tricky case is that although the allocation of the backing store
of the interface conversion doesn't escape, pointers we store *inside*
that allocation might escape (in fact they definitely will, unless we
can devirtualize the receiver).

Fixes golang#50182

Change-Id: I40e893955c2e6871c54ccecf1b9f0cae17871b0d
Reviewed-on: https://go-review.googlesource.com/c/go/+/378178
Trust: Keith Randall <khr@golang.org>
Run-TryBot: Keith Randall <khr@golang.org>
Trust: Dan Scales <danscales@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: David Chase <drchase@google.com>
Reviewed-by: Dan Scales <danscales@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>