runtime: defer is slow #14939

Open
minux opened this Issue Mar 24, 2016 · 23 comments

@minux
Member

minux commented Mar 24, 2016

package p

import "testing"

//go:noinline
func defers() (r int) {
        defer func() {
                r = 42
        }()
        return 0
}

func BenchmarkDefer(b *testing.B) {
        for i := 0; i < b.N; i++ {
                defers()
        }
}

On my system, BenchmarkDefer takes 77.7ns/op. This issue arises from the investigation of #9704: if I remove the "defer endcgo(mp)" and place the call at the end of func cgocall, the benchmark in #9704 improves from 144ns/op to 63.7ns/op. (Note: we can't eliminate the defer in func cgocall, though, as that would break defer/recover in the Go->C->Go scenario.)

@minux minux added this to the Go1.7 milestone Mar 24, 2016

@minux

Member

minux commented Mar 24, 2016

One way to improve defer for such simple cases is to have the
compiler allocate the _defer record on the stack. But since
a function can defer an unbounded number of functions,
I'm not sure whether maintaining two different _defer allocation
mechanisms is acceptable.
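
A minimal illustration of the unbounded case (example code added here, not from the original comment): a defer inside a loop makes the number of pending _defer records depend on runtime data, so they cannot all be reserved in the frame at compile time.

package p

import "fmt"

// The number of pending _defer records depends on n at runtime,
// so the compiler cannot reserve stack slots for all of them.
func deferInLoop(n int) {
        for i := 0; i < n; i++ {
                defer fmt.Println(i)
        }
}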

@cespare

Contributor

cespare commented Mar 24, 2016

Some prior discussion at #6980.

@martisch

Member

martisch commented Mar 24, 2016

I also had a CL recently where a single defer would have been the preferred solution, but it was not usable because of a ~65ns performance hit (CL 20512).

So improving simple cases, like a single defer in some branch (which might call no other methods, or only a select few), would have helped there.

name          old time/op  new time/op  delta
SprintfInt-2  137ns ± 7%   202ns ± 6%   +47.72%  (p=0.000 n=20+20)

@aclements

Member

aclements commented Jul 1, 2016

Given the current cost of defers, I think it's acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don't think the runtime side of this is very complicated.

This is on the list for 1.8. I'm planning to either do it myself or get someone else to do it. :)

If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.

@aclements aclements self-assigned this Jul 1, 2016

@ianlancetaylor

Contributor

ianlancetaylor commented Jul 1, 2016

Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn't have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn't need to call systemstack.
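
A sketch of the distinction (hypothetical example code, not from the comment): the arguments of the deferred call are what must be saved when the defer statement executes, and a no-argument closure sidesteps that copy.

package p

import "sync"

var mu sync.Mutex

// One implicit argument: the receiver &mu must be copied into the
// defer record when the defer statement executes.
func withArg() {
        mu.Lock()
        defer mu.Unlock()
}

// No arguments: the closure references the package-level mu directly,
// so there is nothing to copy at the defer point.
func noArg() {
        mu.Lock()
        defer func() { mu.Unlock() }()
}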

redbaron added a commit to redbaron/prometheus that referenced this issue Sep 18, 2016

redbaron added a commit to redbaron/prometheus that referenced this issue Sep 19, 2016

redbaron added a commit to redbaron/prometheus that referenced this issue Sep 19, 2016

@bradfitz

Member

bradfitz commented Sep 19, 2016

@bmizerany sent https://golang.org/cl/29379 to replace a defer with explicit mutex unlocks in x/time/rate. Consider reverting that optimization if this more general optimization goes in.

@aclements

Member

aclements commented Sep 22, 2016

/cc @dr2chase, who I've been talking with about compiler changes to support this.

@aclements

Member

aclements commented Sep 22, 2016

@dr2chase and I have been discussing an alternate approach to this that's somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies.

The idea is to take a page out of the C++ exception-handling book: open-code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).
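
A hand-expanded sketch of what that could look like (hypothetical names; the real compiler output would differ, and the panic-handling metadata is omitted):

package p

import "sync"

var (
        mu sync.Mutex
        m  = map[string]int{}
)

// What the programmer writes:
func get(k string) (int, bool) {
        mu.Lock()
        defer mu.Unlock()
        v, ok := m[k]
        return v, ok
}

// Roughly what the compiler would emit for the common (non-panicking)
// path: the defer body is open-coded at the exit point, so a normal
// return pays only the cost of a direct call.
func getExpanded(k string) (int, bool) {
        mu.Lock()
        v, ok := m[k]
        mu.Unlock()
        return v, ok
}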

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact in general, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in this table, so each entry would contain at most one defer closure offset and the PC to use to look up the next defer closure offset.

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.

beorn7 added a commit to prometheus/prometheus that referenced this issue Sep 22, 2016

Avoid `defer` in seriesMap.get
This is related to golang/go#14939 .
It's probably the only occurrence where it matters.
@minux

Member

minux commented Sep 22, 2016

@ianlancetaylor

Contributor

ianlancetaylor commented Sep 22, 2016

I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

@ianlancetaylor

Contributor

ianlancetaylor commented Sep 22, 2016

A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame. You could even choose to always start the list from the same place in the stack frame, just below the return address/frame pointer, so you would only need a single bit in the traceback information. Then each time you defer a function, you update the linked list to point to the new stack allocated closure. At the end of the function, you remove each closure from the list as you execute it.

@dr2chase

Contributor

dr2chase commented Sep 22, 2016

@minux - no. If there's any point in the generated code (which is not the same as the input program) that two different defer sets can reach, then the PC-range technique won't work. Unless, say, we record which defers need running by storing into a bitmask of defers-to-(not-)run; this would also handle Ian's problem. Store a zero byte on entry, and store a different value on the branches to the common area that don't enable all the defers. The PC range has to mention the location of the which-defers-not-to-run byte, of course.
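
A hand-expanded sketch of the bitmask idea (hypothetical names, and with the convention inverted so that a set bit means "run this defer" rather than "don't"):

package p

import (
        "os"
        "sync"
)

var mu sync.Mutex

func process(path string, cond bool) error {
        var deferBits uint8 // zeroed on entry: no defers armed yet
        var f *os.File

        mu.Lock()
        deferBits |= 1 // defer #0 armed: mu.Unlock()

        if cond {
                var err error
                f, err = os.Open(path)
                if err != nil {
                        if deferBits&1 != 0 { // early exit: run armed defers, LIFO
                                mu.Unlock()
                        }
                        return err
                }
                deferBits |= 2 // defer #1 armed: f.Close()
        }

        // Common exit point reached with two possible defer sets; the
        // byte, not the PC alone, records what remains to run.
        if deferBits&2 != 0 {
                f.Close()
        }
        if deferBits&1 != 0 {
                mu.Unlock()
        }
        return nil
}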

@aclements

Member

aclements commented Sep 22, 2016

I don't quite see how your proposal would correctly handle the case of a panic that occurs while executing a deferred function. At that point some of the deferred functions have been executed and some have not, but what is the PC value you will use to determine which remain to be executed?

That's a good question. If you're simply in the function epilogue executing local defers and one of them panics, then the PC in the epilogue tells you what's left. However, if you're running defers because of a panic and one of them panics, then you're right that things get more complicated. One possibility would be that when a panic happens, before running any defers, you walk the stack to find any frames that have PC value-based defers and weave those into the linked list of defers at the right points. Then panic just works from the linked list. If a second panic happens while walking this linked list, we know that we're already handling a panic, so it wouldn't rebuild the list, it would just keep going from the current list.

@aclements

Member

aclements commented Sep 22, 2016

A similar approach that might work would be to record a linked list of stack allocated deferred closures in the stack frame.

Possibly. I was hoping to avoid the overhead of even dealing with a linked list in the regular return case. If I understand your proposal, it seems like the cost at function exit wouldn't be substantially different from the cost of deferreturn today, which alone is about 36% of the time in @minux's benchmark.

@ianlancetaylor

Contributor

ianlancetaylor commented Sep 22, 2016

I am imagining that given

defer f()
...
defer g()

the defer statements would be compiled as

deferredFunctions = &deferredFunction{next: deferredFunctions, run: f}
...
deferredFunctions = &deferredFunction{next: deferredFunctions, run: g}

and the function exit sequence would be compiled as

deferredFunctions = deferredFunctions.next
g()
deferredFunctions = deferredFunctions.next
f()

So I don't think the performance would be like that of deferreturn. The compiler would of course have to ensure that deferredFunctions was stored on the stack and that it was always accurate before any function call. And this approach would take extra execution space as there would in effect be two copies of each deferred function--I suspect that is true of any efficient implementation.
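
Filled in as compilable Go, the sketch might look like this (the explicit list head stands in for a compiler-managed frame slot):

package p

type deferredFunction struct {
        next *deferredFunction
        run  func()
}

func f() {}
func g() {}

// Hand-expansion of a body containing "defer f()" and later "defer g()",
// following the sketch above: the list exists only so that a panic can
// find what remains to run; a normal return makes direct calls.
func example() {
        var deferredFunctions *deferredFunction

        d1 := deferredFunction{next: deferredFunctions, run: f}
        deferredFunctions = &d1

        // ... body between the two defer statements ...

        d2 := deferredFunction{next: deferredFunctions, run: g}
        deferredFunctions = &d2

        // Exit sequence: pop before each call so that a panic inside
        // g() still finds f() on the list.
        deferredFunctions = deferredFunctions.next
        g()
        deferredFunctions = deferredFunctions.next
        f()
}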

@aclements

Member

aclements commented Sep 22, 2016

I see. I thought you were saying the function return would use the linked list (rather than just unwinding it for the benefit of a panic). I imagine that would perform well.

janisz added a commit to allegro/bigcache that referenced this issue Apr 18, 2017

Changed locking strategy (#37)
There is a known issue in Go with deferred operations being much slower than explicit operations, which directly and indirectly impacts locking. On OS X and Windows, the change generally gives a 10-15% performance gain.

Relates: golang/go#14939

@bradfitz bradfitz modified the milestones: Go1.9Early, Go1.10Early May 3, 2017

@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017

@starius

starius commented Jul 25, 2017

Notably, defer + lambda + method call is faster than defer + method call.
I wrote a simple program that compares the performance of

defer mutex.Unlock()

and

defer func() { mutex.Unlock() }()

The results are as follows for the master branch (ee392ac):

With lambda: 46.691357 ns
Without lambda: 50.014426 ns

My explanation: defer mutex.Unlock() is probably slower because it has to put the argument (mutex) on the stack. In the lambda case, mutex is not an argument; it is a reference to a captured variable, so defer doesn't need to put it on the stack.

The speedup from the lambda version is 1.07x. Can the Go compiler detect cases where a defer's arguments do not change (like mutex in my code) and introduce a lambda to get that speedup?
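
For reference, a minimal benchmark in the same spirit might look like this (a reconstruction, not necessarily the author's exact program):

package p

import (
        "sync"
        "testing"
)

var mu sync.Mutex

func direct() {
        mu.Lock()
        defer mu.Unlock()
}

func lambda() {
        mu.Lock()
        defer func() { mu.Unlock() }()
}

func BenchmarkDeferDirect(b *testing.B) {
        for i := 0; i < b.N; i++ {
                direct()
        }
}

func BenchmarkDeferLambda(b *testing.B) {
        for i := 0; i < b.N; i++ {
                lambda()
        }
}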

PS: I also ran the program under Go 1.8.3 and got the following output:

With lambda: 45.962607 ns
Without lambda: 48.770406 ns

It demonstrates almost the same speedup from lambdas (1.06x), but the more important part is that master is slower than 1.8.3. Shouldn't it have been faster after https://go-review.googlesource.com/c/29656/ ?

@aclements

Member

aclements commented Aug 2, 2017

@starius, interesting case.

The speedup from the lambda version is 1.07x. Can the Go compiler detect cases where a defer's arguments do not change (like mutex in my code) and introduce a lambda to get that speedup?

As you point out, it's important that mutex doesn't change, or else this transformation isn't actually sound. Unfortunately, while that sort of thing is easy to detect in SSA, defer is mostly implemented in the front-end, where it's much harder to tell.

For the defer optimization I'd like to ultimately do, though, I suspect this would fall out naturally, since the defer call would simply be transformed into a function call at the end of the function (plus some panic handling metadata), so SSA would recognize that the argument saved at the defer point is unchanged at the prologue and CSE them.

It demonstrates almost the same speedup from lambdas (1.06x), but the more important part is that master is slower than 1.8.3. Shouldn't it have been faster after https://go-review.googlesource.com/c/29656/ ?

I suspect this is noise. CL 29656 was released in Go 1.8, and the defer implementation hasn't changed significantly (perhaps at all) between 1.8 and master.

@CAFxX

Contributor

CAFxX commented Sep 1, 2018

As you point out, it's important that mutex doesn't change, or else this transformation isn't actually sound.

Wouldn't it be sound if the transformation yielded the following instead?

{
  receiver := mutex
  defer func() { receiver.Unlock() }()
}
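
(The inner copy pins the receiver at the point of the defer statement, matching the evaluate-the-arguments-immediately semantics of defer mutex.Unlock(), so later reassignment of mutex would not change which lock gets unlocked.)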

navytux added a commit to navytux/og-rek that referenced this issue Sep 24, 2018

decoder: Fix panic on dict with non-comparable keys
Even though we tried to catch whether dict keys are OK to be used via
reflect.TypeOf(key).Comparable() (see da5f034 "decoder: Fix crashes
found by fuzzer (kisielk#32)"), that turned out not to be enough. For example,
if a key is a struct, e.g. of the following type

	type Ref struct {
		Pid interface{}
	}

it will be comparable. But the comparison, depending on the dynamic .Pid
type, might panic. This is what was actually caught by fuzz-testing
recently:

	kisielk#50 (second part of the report)

So instead of recursively walking a key type and checking each subfield
with reflect.TypeOf().Comparable(), switch to using panic/recover to
detect the "unhashable key" situation.

This slows down decoding a bit (showing only the cumulative figure for decoding all test vectors):

	name          old time/op    new time/op    delta
	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)

but that is the price of correctness. And with manually walking the key
type recursively, I doubt it would be faster.

The defer overhead should be less once golang/go#14939 is fixed.

Updates: kisielk#30

kisielk added a commit to kisielk/og-rek that referenced this issue Sep 25, 2018

decoder: Fix panic on dict with non-comparable keys
(same message as the commit above)