Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: running with delve on arm64 throws nosplit stack over 792 byte limit #64113

Closed
mauri870 opened this issue Nov 14, 2023 · 15 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@mauri870
Copy link
Member

mauri870 commented Nov 14, 2023

What version of Go are you using (go version)?

tip

Does this issue reproduce with the latest release?

Only tip

What operating system and processor architecture are you using (go env)?

darwin/arm64

What did you do?

dlv debug cmd/go # any go program if that matters

What did you expect to see?

delve starts up

What did you see instead?

syscall.ptrace: nosplit stack over 792 byte limit
syscall.ptrace<1>
    grows 80 bytes, calls syscall.ptrace1<1>
        grows 160 bytes, calls syscall.syscall6<1>
            grows 192 bytes, calls runtime.exitsyscall<1>
                grows 96 bytes, calls runtime.exitsyscallfast<1>
                    grows 80 bytes, calls runtime.wirep<1>
                        grows 64 bytes, calls runtime.throw<1>
                            grows 48 bytes, calls runtime.fatalthrow<1>
                                grows 64 bytes, calls runtime.systemstack<0>
                                    grows 16 bytes, calls indirect
                                        grows 0 bytes, calls runtime.morestack<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.abort<0>
                                    8 bytes over limit
                                    grows 16 bytes, calls gosave_systemstack_switch<22>
                                        grows 0 bytes, calls runtime.abort<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.save_g<0>
                                    8 bytes over limit
                                    grows 16 bytes, calls indirect
                                        grows 0 bytes, calls runtime.morestack<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.save_g<0>
                                    8 bytes over limit
                        grows 64 bytes, calls runtime.throw<1>
                            grows 48 bytes, calls runtime.fatalthrow<1>
                                grows 64 bytes, calls runtime.systemstack<0>
                                    grows 16 bytes, calls indirect
                                        grows 0 bytes, calls runtime.morestack<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.abort<0>
                                    8 bytes over limit
                                    grows 16 bytes, calls gosave_systemstack_switch<22>
                                        grows 0 bytes, calls runtime.abort<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.save_g<0>
                                    8 bytes over limit
                                    grows 16 bytes, calls indirect
                                        grows 0 bytes, calls runtime.morestack<0>
                                        8 bytes over limit
                                    grows 16 bytes, calls runtime.save_g<0>
                                    8 bytes over limit
exit status 1

This was introduced by CL 494187. Based on the comments there and my short investigation it appears that what is causing this is the new tracer allocated on the stack in exitsyscall.

cc @mknyszek

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 14, 2023
@mauri870 mauri870 added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 14, 2023
@mknyszek
Copy link
Contributor

mknyszek commented Nov 14, 2023

Thanks for filing a bug; I'm aware of this issue but forgot to write it down. I'll fix this before the release candidate.

@mknyszek mknyszek self-assigned this Nov 14, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Nov 14, 2023
@sylr
Copy link

sylr commented Nov 22, 2023

I can reproduce that with a simple go build without delve:

CGO_ENABLED=0 GOEXPERIMENTS=arenas GOOS=darwin GOARCH=arm64 gotip build -tags sql_scanner -gcflags="all=-N -l" -pgo=auto -ldflags ' -X github.com/xxxxxx/yyyyyy/cmd.Version=1a8f599-dirty -extldflags ""' -o dist/yyyyyy-1a8f599-dirty \
	./cmd/yyyyyy
# github.com/xxxxxxx/yyyyyy/cmd/yyyyyy
syscall.ptrace: nosplit stack over 792 byte limit
syscall.ptrace<1>
    grows 80 bytes, calls syscall.ptrace1<1>
        grows 160 bytes, calls syscall.syscall6<1>
            grows 192 bytes, calls runtime.entersyscall<1>
                grows 32 bytes, calls runtime.reentersyscall<1>
                    grows 112 bytes, calls runtime.casgstatus<1>
                        grows 80 bytes, calls runtime.nanotime<1>
                            grows 32 bytes, calls runtime.nanotime1<0>
                                grows 48 bytes, calls runtime.libcCall<1>
                                    grows 48 bytes, calls runtime.asmcgocall<0>
                                        grows 16 bytes, calls gosave_systemstack_switch<20>
...

@mknyszek
Copy link
Contributor

Sorry for the delay here. We discussed this offline last week and I have a path forward, just haven't gotten around to it. I'll send a change to fix this today.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/544535 mentions this issue: runtime: throw from the systemstack in wirep

@mknyszek mknyszek moved this from Todo to In Progress in Go Compiler / Runtime Nov 22, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Nov 22, 2023
@mauri870
Copy link
Member Author

mauri870 commented Nov 23, 2023

@mknyszek Reopening this since I started seeing this again on master, even after your fix, upon running dlv debug cmd/trace on darwin/arm64 as linked below. Seems to be a little different stack this time around.

I did some digging and found a fix, happy to submit it if you agree with the changes. I tried looking into using systemstack but there is not much room for it, it seems like the culprit is nanotime in casgstatus. The function wrapper is probably not nice for nosplit, but it is the only way I found to supress the issue, other than running nanotime on the system stack, which is probably not ok either.

patch
diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 6348335804..b9db1cef56 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1133,22 +1133,30 @@ func casgstatus(gp *g, oldval, newval uint32) {
 	const yieldDelay = 5 * 1000
 	var nextYield int64
 
+	// Wrap nanotime to avoid a nosplit overflow build failure
+	// on some platforms when built with -N -l. See #64113.
+	var nowfunc = func() int64 {
+		return nanotime()
+	}
+
 	// loop if gp->atomicstatus is in a scan state giving
 	// GC time to finish and change the state to oldval.
 	for i := 0; !gp.atomicstatus.CompareAndSwap(oldval, newval); i++ {
 		if oldval == _Gwaiting && gp.atomicstatus.Load() == _Grunnable {
-			throw("casgstatus: waiting for Gwaiting but is Grunnable")
+			systemstack(func() {
+				throw("casgstatus: waiting for Gwaiting but is Grunnable")
+			})
 		}
 		if i == 0 {
-			nextYield = nanotime() + yieldDelay
+			nextYield = nowfunc() + yieldDelay
 		}
-		if nanotime() < nextYield {
+		if nowfunc() < nextYield {
 			for x := 0; x < 10 && gp.atomicstatus.Load() != oldval; x++ {
 				procyield(1)
 			}
 		} else {
 			osyield()
-			nextYield = nanotime() + yieldDelay/2
+			nextYield = nowfunc() + yieldDelay/2
 		}
 	}
 
@@ -1173,7 +1181,7 @@ func casgstatus(gp *g, oldval, newval uint32) {
 		// We transitioned out of runnable, so measure how much
 		// time we spent in this state and add it to
 		// runnableTime.
-		now := nanotime()
+		now := nowfunc()
 		gp.runnableTime += now - gp.trackingStamp
 		gp.trackingStamp = 0
 	case _Gwaiting:
@@ -1186,7 +1194,7 @@ func casgstatus(gp *g, oldval, newval uint32) {
 		// a more representative estimate of the absolute value.
 		// gTrackingPeriod also represents an accurate sampling period
 		// because we can only enter this state from _Grunning.
-		now := nanotime()
+		now := nowfunc()
 		sched.totalMutexWaitTime.Add((now - gp.trackingStamp) * gTrackingPeriod)
 		gp.trackingStamp = 0
 	}
@@ -1197,12 +1205,12 @@ func casgstatus(gp *g, oldval, newval uint32) {
 			break
 		}
 		// Blocking on a lock. Write down the timestamp.
-		now := nanotime()
+		now := nowfunc()
 		gp.trackingStamp = now
 	case _Grunnable:
 		// We just transitioned into runnable, so record what
 		// time that happened.
-		now := nanotime()
+		now := nowfunc()
 		gp.trackingStamp = now
 	case _Grunning:
 		// We're transitioning into running, so turn off
syscall.ptrace: nosplit stack over 792 byte limit
dlv debug cmd/trace
# cmd/trace
syscall.ptrace: nosplit stack over 792 byte limit
syscall.ptrace<1>
  grows 80 bytes, calls syscall.ptrace1<1>
      grows 160 bytes, calls syscall.syscall6<1>
          grows 192 bytes, calls runtime.entersyscall<1>
              grows 32 bytes, calls runtime.reentersyscall<1>
                  grows 112 bytes, calls runtime.casgstatus<1>
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                      grows 80 bytes, calls runtime.nanotime<1>
                          grows 32 bytes, calls runtime.nanotime1<0>
                              grows 48 bytes, calls runtime.libcCall<1>
                                  grows 48 bytes, calls runtime.asmcgocall<0>
                                      grows 16 bytes, calls gosave_systemstack_switch<34>
                                          grows 0 bytes, calls runtime.abort<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit
                                      grows 16 bytes, calls runtime.save_g<0>
                                      8 bytes over limit
                                      grows 16 bytes, calls indirect
                                          grows 0 bytes, calls runtime.morestack<0>
                                          8 bytes over limit

@mauri870 mauri870 reopened this Nov 23, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Go Compiler / Runtime Nov 23, 2023
@mknyszek
Copy link
Contributor

@mauri870 Thanks. Unfortunately that patch doesn't help because nowfunc also needs to be nosplit. If the call to nowfunc tries to grow the stack, then something could break at runtime.

Running nanotime on the system stack is fine and would fix it, but may impact performance. I don't see an easy solution yet; I'll keep thinking.

@mauri870
Copy link
Member Author

Thanks for looking into it, I tried running nanotime on the system stack but as you said it impacted performance heavily, so not really a solution we can consider.

@mauri870
Copy link
Member Author

mauri870 commented Nov 27, 2023

@mknyszek What about wrapping casgstatus itself in systemstack? I see that casgstatus is called on the systemstack in quite a lot of places, maybe that is fine to do in reentersyscall?

@mknyszek
Copy link
Contributor

I think we should just wrap the contents of nanotime1 in src/runtime/sys_darwin.go with systemstack. It's going to switch to the system stack anyway in asmcgocall. (Thanks @cherrymui for the pointer.)

I tried running nanotime on the system stack but as you said it impacted performance heavily

What did you use to measure it? It occurs to me that most platforms already have to switch to the system stack for nanotime at some point (also thanks to @cherrymui). An extra switch just means an extra call and copying the result back to the caller through the closure. This is probably a few nanoseconds at most, though I guess it already only takes 10s of nanoseconds to call on linux/amd64, so I suppose if you called nanotime on the system stack in a loop I could see that possibly looking like a bigger regression.

At this point, I think my main concern is more about having to add wrappers around quite a few call sites. Since this issue is specific to darwin, handling it in nanotime1 seems cleaner to me. I'll give that a try and send a CL if it works.

@cherrymui
Copy link
Member

I took a look at the code generation for syscall.ptrace1 on darwin/arm64 with and without -N (-l is applied on both). Without -N, it uses 80 bytes of stack, with -N, 160. In the -N mode it has code like

        0x0048 00072 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R0, syscall..autotmp_10-24(SP)
        0x004c 00076 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R1, syscall..autotmp_11-32(SP)
        0x0050 00080 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R2, syscall..autotmp_12-40(SP)
        0x0054 00084 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R0, syscall..autotmp_6-48(SP)
        0x0058 00088 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R1, syscall..autotmp_7-56(SP)
        0x005c 00092 (syscall/zsyscall_darwin_arm64.go:2015) MOVD    R2, syscall..autotmp_8-64(SP)

which spill registers to autotmps but they are literally never used. We should be able to remove those without affecting debugging experience.

I wonder for nosplit functions, even with -N, we should still do some optimizations, at least the ones that can improve stack usage without affecting debugging experience, like removing dead autotmps. Also, nosplit functions are rare and are usually runtime-related code (like the syscall package). It is very unlikely user want to debug those packages.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/545275 mentions this issue: runtime: call libcCall on the systemstack in nanotime1 on Darwin

@mauri870
Copy link
Member Author

mauri870 commented Nov 27, 2023

I tried running nanotime on the system stack but as you said it impacted performance heavily

What did you use to measure it?

I believe I misinterpreted my benchmark results. The one that became slow was wrapping casgstatus in systemstack, I tested the fix and it looks fine.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/545276 mentions this issue: syscall: remove ptrace1 on darwin

gopherbot pushed a commit that referenced this issue Nov 27, 2023
On Darwin, the ptrace syscall is called in ptrace1, which then be
called in ptrace. This allows ptrace1 be disabled on iOS (by
implementing ptrace differently). But we can also achieve this by
adding a conditional directly in ptrace. This reduces stack usage
with -N -l, while keeping ptrace disabled on iOS.

For #64113.

Change-Id: I89d8e317e77352fffdbb5a25ba21ee9cdf2e1e20
Reviewed-on: https://go-review.googlesource.com/c/go/+/545276
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@mknyszek
Copy link
Contributor

I think this should be fixed now as of Cherry's fix (not mine).

@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Nov 27, 2023
@cherrymui
Copy link
Member

cherrymui commented Nov 27, 2023

CL https://go.dev/cl/545276 eliminates the ptrace1 frame, which should save ~80 bytes. I think this should get us back to under the limit. I tried building cmd/go with -N -l and it doesn't overflow.

In the next cycle maybe we'll make -N less aggressive on nosplit functions, while keeping a reasonable debugging experience.

@golang golang locked and limited conversation to collaborators Nov 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests

5 participants