Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: unexpected return pc for runtime.sigpanic called from 0x1 #43496

Open
egonelbre opened this issue Jan 4, 2021 · 10 comments
Open

runtime: unexpected return pc for runtime.sigpanic called from 0x1 #43496

egonelbre opened this issue Jan 4, 2021 · 10 comments

Comments

@egonelbre
Copy link
Contributor

@egonelbre egonelbre commented Jan 4, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.6 linux/amd64

Does this issue reproduce with the latest release?

Unable to reproduce.

What operating system and processor architecture are you using (go env)?

It's using golang:1.15.6 docker image as the base for CI testing.

What did you do?

We have automated tests running for changes and one of them failed with a crash.
The race-detector is enabled for the tests.

The relevant part of the crash seems to be:

unexpected fault address 0x20000e25d000
fatal error: fault
runtime: unexpected return pc for runtime.sigpanic called from 0x1
stack: frame={sp:0x7fff71c22dd0, fp:0x7fff71c22e00} stack=[0xc0016e4000,0xc0016e8000)

fatal error: unknown caller pc

runtime stack:
runtime.throw(0x21593f2, 0x11)
	/usr/local/go/src/runtime/panic.go:1116 +0x72
runtime.gentraceback(0x4672b3, 0x7fff71c22dd0, 0x0, 0xc00059af00, 0x0, 0x0, 0x7fffffff, 0x7fff71c22ed8, 0x0, 0x0, ...)
	/usr/local/go/src/runtime/traceback.go:273 +0x1aec
runtime.addOneOpenDeferFrame.func1()
	/usr/local/go/src/runtime/panic.go:721 +0x91
runtime.systemstack(0x487db4)
	/usr/local/go/src/runtime/asm_amd64.s:370 +0x66
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1116


goroutine 8278 [running]:
runtime: unexpected return pc for runtime.systemstack_switch called from 0x0
stack: frame={sp:0x7fff71c22cc0, fp:0x7fff71c22cc8} stack=[0xc0016e4000,0xc0016e8000)

runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0x7fff71c22cc8 sp=0x7fff71c22cc0 pc=0x487ee0
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/pkg/mod/golang.org/x/sync@v0.0.0-20200625203802-6e8e738ad208/errgroup/errgroup.go:54 +0x74


goroutine 1 [chan receive]:
testing.(*T).Run(0xc000500600, 0x214ec46, 0x9, 0x223ef60, 0x0)
	/usr/local/go/src/testing/testing.go:1169 +0x5f4
testing.runTests.func1(0xc000500600)
	/usr/local/go/src/testing/testing.go:1439 +0xa7
testing.tRunner(0xc000500600, 0xc000121ce0)
	/usr/local/go/src/testing/testing.go:1123 +0x203
testing.runTests(0xc0001b6d80, 0x2daffe0, 0xc, 0xc, 0xbff39cd96b1b208f, 0x1179127ca7c, 0x2e3c840, 0xc0004219b0)
	/usr/local/go/src/testing/testing.go:1437 +0x613
testing.(*M).Run(0xc0001ac300, 0x0)
	/usr/local/go/src/testing/testing.go:1345 +0x3b4
main.main()
	_testmain.go:67 +0x237

...
@egonelbre
Copy link
Contributor Author

@egonelbre egonelbre commented Jan 4, 2021

Maybe related #41099

@egonelbre
Copy link
Contributor Author

@egonelbre egonelbre commented Jan 4, 2021

Going through the stack trace multiple times, the additional thing that stuck out was:

goroutine 8315 [runnable]:
golang.org/x/crypto/argon2.processBlockSSE(0xc002c99800, 0xc002c99400, 0xc00382cc00, 0x1)
	/go/pkg/mod/golang.org/x/crypto@v0.0.0-20200820211705-5c72a883971a/argon2/blamka_amd64.go:24 +0x62b
golang.org/x/crypto/argon2.processBlockXOR(...)
	/go/pkg/mod/golang.org/x/crypto@v0.0.0-20200820211705-5c72a883971a/argon2/blamka_amd64.go:59
golang.org/x/crypto/argon2.processBlocks.func1(0x300000000, 0xc000000000, 0xc000326000)
	/go/pkg/mod/golang.org/x/crypto@v0.0.0-20200820211705-5c72a883971a/argon2/argon2.go:223 +0x21b
created by golang.org/x/crypto/argon2.processBlocks
	/go/pkg/mod/golang.org/x/crypto@v0.0.0-20200g820211705-5c72a883971a/argon2/argon2.go:234 +0x1bd

And maybe:

goroutine 8530 [runnable]:
strings.Index(0x287dad8, 0x2a, 0x2147016, 0x1, 0x2)
	/usr/local/go/src/strings/strings.go:1024 +0x825
strings.genSplit(0x287dad8, 0x2a, 0x2147016, 0x1, 0x0, 0x2, 0x36, 0x14d, 0xbcc060)
	/usr/local/go/src/strings/strings.go:251 +0x131
strings.Split(...)
	/usr/local/go/src/strings/strings.go:299

@egonelbre
Copy link
Contributor Author

@egonelbre egonelbre commented Jan 4, 2021

My best guess is that a bad BP in asm code can cause traceback to crash. Based on seeing something similar in minio/sha256-simd#55

@toothrot toothrot added this to the Backlog milestone Jan 5, 2021
@toothrot
Copy link
Contributor

@toothrot toothrot commented Jan 5, 2021

@randall77
Copy link
Contributor

@randall77 randall77 commented Jan 5, 2021

Traceback doesn't use BP currently, so that isn't it. (Using BP for traceback is #16638.)

That minio bug involves playing with SP, not BP, which could cause problems. (Our traceback code can't handle variable-sized frames.)

goroutine 8278 does look like frames are missing as it starts with a systemstack_switch, which doesn't seem right. The first frame should be a errgroup.(*Group).Go.func1 frame, I think.
Possibly defer related? @danscales

@danscales
Copy link

@danscales danscales commented Jan 5, 2021

Since sighandler/preparePanic do some manipulation of the stack to make it look like there was a direct call to sigpanic(), is it possible that this is some extremely rare race in the sighander or preparePanic?

From the list of stack traces, we can't definitively pinpoint where the original panic happened. But during panic handling, open-coded defers do need to look at the stack trace of the panic goroutine, hence will cause a problem if the backtrace has an issue.

For reference, here is the source code for errgroup.(*Group).Go(). It seems possible that goroutine 8278 (the one that started at line 54 below) is the one that panicked and has a corrupted stack and hence cause secondary panic in the addOneOpenDeferFrame/gentraceback. It's not clear to me if the original panic or problem is defer-related, or just that the defer-processing is highlighting the bad backtrace.

func (g *Group) Go(f func() error) {
	g.wg.Add(1)

	go func() {  // line 54
		defer g.wg.Done()

		if err := f(); err != nil {
			g.errOnce.Do(func() {
				g.err = err
				if g.cancel != nil {
					g.cancel()
				}
			})
		}
	}()
}

@egonelbre
Copy link
Contributor Author

@egonelbre egonelbre commented Jan 6, 2021

That minio bug involves playing with SP.

Sidenote, is or should there be a linter that verifies that SP is not being used?

@randall77
Copy link
Contributor

@randall77 randall77 commented Jan 6, 2021

@egonelbre Yes, that would make sense. We could also be more conservative with tracebacks during profiling interrupts, as that's the only case where we would try to traceback assembly like that.

@bcmills
Copy link
Member

@bcmills bcmills commented Aug 10, 2021

Is this possibly the same cause as #43942?

@seebs
Copy link
Contributor

@seebs seebs commented Oct 15, 2021

I have what appears at the moment to be a reproducer for this, sort of, but it's not an isolated reproducer, and also it's extremely unstable. MacOS amd64 host, running go test -race on a fairly large code base.

Symptoms: First, all of yesterday, any attempt to run go test -race just failed out with a complaint about too many address collisions for the race detector. These failures happened essentially-instantly at test startup. I ignored this and went and did something else for a bit, changing nothing that would even have been hit that early in test startup, and started getting cryptic complaints about SIGILL showing instruction bytes which were faultfcntlfetch[...] and looked like maybe part of a string table or something. That was 1.16.7. So I tried 1.17.2, and got:

unexpected fault address 0x2304d0000040
fatal error: fault
runtime: unexpected return pc for runtime.sigpanic called from 0x0
stack: frame={sp:0x70000a1b9e20, fp:0x70000a1b9e70} stack=[0xc00d24f000,0xc00d250000)

fatal error: unknown caller pc

runtime stack:
runtime.throw({0x59ceab8, 0x6757b40})
        /usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.gentraceback(0x1, 0x1, 0xc0006376c0, 0x70000a1b9ef0, 0x0, 0x0, 0x7fffffff, 0x70000a1b9f08, 0x59bb535, 0x0)
        /usr/local/go/src/runtime/traceback.go:274 +0x1956
runtime.addOneOpenDeferFrame.func1()
        /usr/local/go/src/runtime/panic.go:751 +0x6b
runtime.systemstack()
        /usr/local/go/src/runtime/asm_amd64.s:383 +0x49

goroutine 12918 [running]:
runtime: unexpected return pc for runtime.systemstack_switch called from 0x0
stack: frame={sp:0x70000a1b9d18, fp:0x70000a1b9d20} stack=[0xc00d24f000,0xc00d250000)

runtime.systemstack_switch()
        /usr/local/go/src/runtime/asm_amd64.s:350 fp=0x70000a1b9d20 sp=0x70000a1b9d18 pc=0x406d840
created by golang.org/x/sync/errgroup.(*Group).Go
        /Users/seebs/go/src/github.com/molecula/featurebase/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0xf1

I don't know exactly what's happening, but I do know that this appears to happen at the same point in the test run every time (the previous message is the same, but it's somewhere during a test which might take quite a few seconds, so I don't know). The specific test it happens in completes without issue, though, if I run it alone; it's only with the other tests before it that it blows up. Also, there's about 1k goroutines (this is perhaps not surprising for the tests in question). In one of the dumps, four goroutines were all in the same function, makeString, which is not really identical to the following, but basically like this:

func makeBytes(unused, n1, n2 string, ignored uint64) []byte {
    r := make([]byte, 0, 16)
    r = append(r, '~')
    r = append(r, []byte(n1)...)
    r = append(r, ';')
    r = append(r, []byte(n2)...)
    r = append(r, '<')
    return r
}

func makeString(unused, n1, n2 string, ignored uint64) string {
    return string(makeBytes(n1, n2))
}

These calls are occurring at roughly the same point in several goroutines, but there's nothing about these that makes me think they should be taking over a microsecond or so, and yet, sometimes I see several goroutines stuck in this, or on a line which looks like otherFunc(makeString(a, b, c, d), e), which makes me suspicious that it's trying to do something here which is failing weirdly.

I don't know what's going on but it's definitely something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants