Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: missed deferred calls [1.16 backport] #43941

Closed
mdempsky opened this issue Jan 27, 2021 · 21 comments
Closed

runtime: missed deferred calls [1.16 backport] #43941

mdempsky opened this issue Jan 27, 2021 · 21 comments
Labels
CherryPickCandidate
Milestone

Comments

@mdempsky
Copy link
Member

@mdempsky mdempsky commented Jan 27, 2021

This program should run successfully: https://play.golang.org/p/XIVsN45Oyzz

It does with gccgo and with cmd/compile using -gcflags=-N, but not when compiled normally. It also still fails at CL 286712, PS 3.

The problem is the "defer step(12)" and "defer step(13)" calls are being missed. Instead, execution jumps directly to the "step(14)" defer.

Sorry the reproducer is still so large. I'm having trouble minimizing it any further. It seems particularly sensitive to changes that I wouldn't expect to have any effect.

/cc @danscales @randall77

@mdempsky mdempsky added the NeedsFix label Jan 27, 2021
@mdempsky
Copy link
Member Author

@mdempsky mdempsky commented Jan 27, 2021

FWIW, since this is a valid program that's misbehaving silently (modulo the test program's own consistency checks), I think this is probably a higher priority issue than #43920 or #43942, which are also valid programs but failing loudly.

@mdempsky
Copy link
Member Author

@mdempsky mdempsky commented Jan 27, 2021

Somewhat more minimized test case: https://play.golang.org/p/60U6DFD6Ew- https://play.golang.org/p/Zpgkg3tZ8BO https://play.golang.org/p/jP3CpfpxSku https://play.golang.org/p/mX8ZnRSzZEh https://play.golang.org/p/AjhP4QHH1QU

After the panic(4), the ok = true assignment should execute. But instead execution jumps directly to the expect(4, recover())+log.Fatal defer.

It seems like it's particularly sensitive to stack frame layout. In particular, the issue was I needed to keep making matching changes to both of the main function literals within main. It seems like some of the state left on the stack by the first function is getting reused without proper reinitialization by the second function?

@mdempsky
Copy link
Member Author

@mdempsky mdempsky commented Jan 27, 2021

Having stared at these cases a bunch, I'm starting to think they're likely duplicates. In each of the reproducers, there's some code of the shape:

defer func() {
	defer func() {
		expect(3, recover())
	}()
	defer panic(3)
	panic(2)
}()
defer func() {
	expect(1, recover())
}()
panic(1)

The differences are all in the code around the above. I suspect the above code is just corrupting the defer stack, and then the surrounding code just influences how that corruption ultimately manifests. Most commonly it's simply "bad defer entry in panic" (#43920), but when the stars align (or stacks align?) it manifests instead as issues like #43941 and #43942.

@go101
Copy link

@go101 go101 commented Oct 9, 2021

A small variant:

package main

func main() {
	defer func() {
		println(recover().(int))
	}()
	func() {
		func() (_ [2]int) { type _ int; return }()
		func() {
			defer func() {
				defer func() {
					recover()
				}()
				defer panic(3)
				panic(2)
			}()
			defer func() {
				recover()
			}()
			panic(1)
		}()
		defer func() {}()
	}()
	
	var x = 123
	func() {
		defer print(x) // not executed!
		func() {
			defer func() {}()
			panic(4)
		}()
	}()
}

@batara666
Copy link

@batara666 batara666 commented Oct 9, 2021

this need to be fix

@ianlancetaylor ianlancetaylor added this to the Go1.18 milestone Oct 9, 2021
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 9, 2021

@go101 I think that your example is a different problem. Please open a different issue for it. Thanks.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 9, 2021

@danscales The test case in the initial comment appears to work with Go 1.14, Go 1.17, and tip. It fails with Go 1.15 and Go 1.16. Go 1.15 is no longer supported. Do you think it would be feasible to backport the fix (whatever it is) to Go 1.16?

@ianlancetaylor ianlancetaylor removed this from the Go1.18 milestone Oct 9, 2021
@ianlancetaylor ianlancetaylor added this to the Go1.16.10 milestone Oct 9, 2021
@go101
Copy link

@go101 go101 commented Oct 10, 2021

@ianlancetaylor My test case is based on @mdempsky's one: #43941 (comment), I think they are same in principle.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 10, 2021

@go101 They may be the same in principle but they are clearly different in practice. The original test case in this issue works on Go 1.17 and on tip. Your test case, if I understand it correctly, fails on Go 1.17 and tip. Therefore the two test cases, though they may be similar, demonstrate different problems. Please open a new issue. Thanks.

@danscales
Copy link
Contributor

@danscales danscales commented Oct 11, 2021

@ianlancetaylor OK, I did the bisection for the test case in the original comment. As you say, it worked in 1.14, which is when fast defers appeared. It started not working with this change:

https://go-review.googlesource.com/c/go/+/229601

[cmd/compile: move fixVariadicCall from walk to order]. That seems very unrelated, so maybe that change revealed an existing bug with defers.

Then, not unexpectedly, the test started working again with:

https://go-review.googlesource.com/c/go/+/310175

[internal/buildcfg: enable regabidefer by default], which enabled a major change in how defers are implemented by @aclements (in service of regabi, I believe).

So, I don't think there is an obvious simple fix for 1.16, but I will investigate further. If I don't see anything fairly obvious/easy to fix, I would probably lean toward not back-porting a fix for 1.16, since this case is so unusual.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 11, 2021

I agree that if there isn't a simple fix we should just move on, since it works in Go 1.17. Thanks for looking.

@aclements
Copy link
Member

@aclements aclements commented Oct 13, 2021

I'm really surprised that enabling regabidefer would have fixed a missed defer. It was a significant change to how defers are invoked, but not to our overall handling of the defer stack.

@aclements
Copy link
Member

@aclements aclements commented Oct 13, 2021

It might be that once we understand #48898 we'll understand why regabidefer fixed this test program.

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 14, 2021

Change https://golang.org/cl/356011 mentions this issue: cmd/compile: fix some defer bugs

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Oct 29, 2021

This issue is currently in Go1.16.10 milestone and has a release-blocker label, which means the release of Go 1.16.10 will be blocked on it next week. It doesn't have a CherryPickCandidate label, so we missed it in our review of backports.

@ianlancetaylor Based on your latest comment above, it sounds like you may no longer think this issue needs to block Go 1.16.10, is that right?

We should either add CherryPickCandidate so this can be reviewed, or move this back to Go 1.18 milestone and ask gopherbot to make a 1.16 backport issue from it.

@danscales
Copy link
Contributor

@danscales danscales commented Oct 29, 2021

The fix ends up being fairly simple, see the fix for Go 1.18 that is about to go into master: https://go-review.googlesource.com/c/go/+/356011 . Ignoring comments & tests, there's only 3 lines of code change.

However, I would lean toward not backporting to 1.16.10. We will get soak time of several months for Go 1.18 before it goes out, whereas we would not get that for a point release. Also, the test cases are quite elaborate and probably unlikely in "real life". If really desired, maybe we can put in to the next Go 1.16 point release after the Go 1.18 fix has been checked into master for a few weeks?

gopherbot pushed a commit that referenced this issue Oct 29, 2021
Fix two defer bugs related to adding/removing open defer entries.
The bugs relate to the way that we add and remove open defer entries
from the defer chain. At the point of a panic, when we want to start
processing defer entries in order during the panic process, we need to
add entries to the defer chain for stack frames with open defers, since
the normal fast-defer code does not add these entries. We do this by
calling addOneOpenDeferFrame() at the beginning of each time around the
defer loop in gopanic(). Those defer entries get sorted with other open
and non-open-coded defer frames.

However, the tricky part is that we also need to remove defer entries if
they end not being needed because of a recover (which means we are back
to executing the defer code inline at function exits). But we need
to deal with multiple panics and in-process defers on the stack, so we
can't just remove all open-coded defers from the the defer chain during
a recover.

The fix (and new invariant) is that we should not add any open-coded
defers to the defer chain that are higher up the stack than an open-coded
defer that is in progress. We know that open-coded defer will still be
run until completed, and when it is completed, then a more outer frame
will be added (if there is one). This fits with existing code in gopanic
that only removes open-coded defer entries up to any defer in progress.

These bugs were because of the previous inconsistency between adding and
removing open defer entries, which meant that stale defer entries could
be left on the list, in these unusual cases with both recursive
panics plus multiple independent (non-nested) cases of panic & recover.

The test for #48898 was difficult to add to defer_test.go (while keeping
the failure mode), so I added as a go/test/fixedbug test instead.

Fixes #43920
Updates #43941
Fixes #48898

Change-Id: I593b77033e08c33094315abf8089fbc4cab07376
Reviewed-on: https://go-review.googlesource.com/c/go/+/356011
Trust: Dan Scales <danscales@google.com>
Trust: Cuong Manh Le <cuong.manhle.vn@gmail.com>
Reviewed-by: Austin Clements <austin@google.com>
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 29, 2021

I'm fine either way.

@danscales
Copy link
Contributor

@danscales danscales commented Nov 1, 2021

OK, we don't need to check into 1.16.10 right now. As I said, let's see how the fix in 1.18 goes. I'm removing this as a release blocker for 1.16.10. It doesn't look like 1.16.11 exists yet, but I would move it to that milestone when that exists.

@thanm thanm removed this from the Go1.16.10 milestone Nov 4, 2021
@thanm thanm added this to the Go1.16.11 milestone Nov 4, 2021
@heschi heschi removed this from the Go1.16.11 milestone Dec 2, 2021
@heschi heschi added this to the Go1.16.12 milestone Dec 2, 2021
@toothrot toothrot removed this from the Go1.16.12 milestone Dec 9, 2021
@toothrot toothrot added this to the Go1.16.13 milestone Dec 9, 2021
@cagedmantis cagedmantis removed this from the Go1.16.13 milestone Jan 6, 2022
@cagedmantis cagedmantis added this to the Go1.16.14 milestone Jan 6, 2022
@cherrymui cherrymui removed this from the Go1.16.14 milestone Feb 9, 2022
@cherrymui cherrymui added this to the Go1.16.15 milestone Feb 9, 2022
@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Feb 19, 2022

@danscales Some time has passed here, and this fix (CL 356011) had a chance to be included in 1.18 betas and RC 1. Do you think there's enough signal to be able to tell whether it'd be safe and worthwhile to backport it to Go 1.16.15 now?

I'll edit this issue to fit the backport issue format and add CherryPickCandidate label, that way we'll get a chance to see it in upcoming meetings. (Otherwise it's likely to be missed.)

@dmitshur dmitshur added CherryPickCandidate and removed NeedsFix labels Feb 19, 2022
@dmitshur dmitshur changed the title runtime: missed deferred calls runtime: missed deferred calls [1.16 backport] Feb 19, 2022
@danscales
Copy link
Contributor

@danscales danscales commented Feb 21, 2022

@dmitshur Yes, I think it is fine to be backported to Go 1.16.15 now! Thanks for following up on this. It's also possible to leave it out of Go 1.16.15, depending on the consensus in your meetings, since the only known/reported case of the bug is for the complex example at the top of this issue that was created by a form of fuzzing.

@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Mar 2, 2022

Go 1.16 is very close to the end of its release cycle. The team members think that approving this could possibly introduce a regression at the end of a cycle. Because of that, this is considered a risky change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CherryPickCandidate
Projects
None yet
Development

No branches or pull requests