Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: panic in appendLocsForStack on fuchsia-arm64 #40823

Open
tamird opened this issue Aug 16, 2020 · 18 comments
Open

runtime/pprof: panic in appendLocsForStack on fuchsia-arm64 #40823

tamird opened this issue Aug 16, 2020 · 18 comments
Milestone

Comments

@tamird
Copy link
Contributor

@tamird tamird commented Aug 16, 2020

We've been seeing this issue for a while; filing this bug now that we've upgraded to go1.15. (See https://fxbug.dev/52575).

This happens sometimes when running the runtime/pprof tests:

panic: runtime error: slice bounds out of range [3:2]

goroutine 6 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x3e93c15c160, 0x3e93c130840, 0x0, 0x8, 0x3e93c0560b0, 0x2, 0x2, 0x0, 0x2, 0x8) 
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:410 +0x5d4
runtime/pprof.(*profileBuilder).build(0x3e93c15c160)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:365 +0xf4
runtime/pprof.profileWriter(0xb4050418c0, 0x3e93c386030)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0xd4
created by runtime/pprof.StartCPUProfile
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x114

There's no diff between Fuchsia's fork and go1.15 in src/runtime/pprof. See https://fuchsia.googlesource.com/third_party/go/+diff/go1.15..master.

Possibly a regression of #35538, related to #37446 and its fix in https://golang.org/cl/221577.

@prattmic

@randall77
Copy link
Contributor

@randall77 randall77 commented Aug 16, 2020

@prattmic prattmic modified the milestones: Backlog, Go1.16 Aug 17, 2020
@prattmic
Copy link
Member

@prattmic prattmic commented Aug 17, 2020

I agree that at face value this looks like a regression of #35538, likely in http://golang.org/cl/221577.

http://golang.org/cl/221577 was merged into the Fuchsia tree in https://fuchsia-review.googlesource.com/c/third_party/go/+/374435 (April 4).

However, https://fxbug.dev/52575 was filed on May 13 just a few hours after https://fuchsia-review.googlesource.com/c/third_party/go/+/389033 (Go 1.14.3 merge), which IMO strongly implicates something in that merge.

@tamird do you know if the May 13 bug was really the first cases of this you saw, or was there perhaps some earlier bug that's not attached?

@tamird
Copy link
Contributor Author

@tamird tamird commented Aug 17, 2020

Yes, there were earlier bugs; sorry to have mislead you!

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718 was filed on January 24
https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563 was filed on March 3

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 17, 2020

Ah, interesting.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718

I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563

This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.

@hyangah
Copy link
Contributor

@hyangah hyangah commented Aug 17, 2020

Ah, interesting.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=44718

I don't recognize this crash (log, search for pprof) (@hyangah maybe you do?), but it does not seem related to this bug.

TestBlockProfile failed with debug=1 option. That's not the code path that exercises the proto encoding code for debug>0. Not sure why the block profile does not include the expected channel close event, but it's not a related bug to appendLocsForStack.

https://bugs.fuchsia.dev/p/fuchsia/issues/detail?id=47563

This crash looks like #38605. The May 13 merge included the fix for this. Though it is not clear to me if the appendLocsForStack panics were really new after that merge or the other allocToCache crash was just so much more frequent that we never saw the appendLocsForStack panic.

@gopherbot
Copy link

@gopherbot gopherbot commented Aug 17, 2020

Change https://golang.org/cl/248728 mentions this issue: runtime/pprof: add logging for debugging issue 40823

@tamird
Copy link
Contributor Author

@tamird tamird commented Aug 21, 2020

With the change above applied, here's what we see:

=== RUN   TestCPUProfileMultithreaded
stk:#   0xde474a80eb    runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
#       0xde474b1f83    runtime._System+0x3     /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/proc.go:3911
l.pcs:# 0xde474a80eb    runtime.nanotime1+0x10b /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/os_fuchsia.go:478
#       0xde474a80e4    runtime.nanotime+0x104  /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/time_nofake.go:19
#       0xde474a8050    runtime.suspendG+0x70   /b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/preempt.go:252
panic: runtime error: slice bounds out of range [3:2]

goroutine 6 [running]:
runtime/pprof.(*profileBuilder).appendLocsForStack(0x1fd1c0b3340, 0x1fd1c130840, 0x0, 0x8, 0x1fd1c216130, 0x2, 0x2, 0x0, 0x6, 0x8) 
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:415 +0x640
runtime/pprof.(*profileBuilder).build(0x1fd1c0b3340)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/proto.go:366 +0xf4
runtime/pprof.profileWriter(0xde4768f8c0, 0x1fd1c10a120)
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:813 +0xd4
created by runtime/pprof.StartCPUProfile
/b/s/w/ir/x/w/fuchsia/out/not-default/host-tools/goroot/src/runtime/pprof/pprof.go:784 +0x114
@prattmic
Copy link
Member

@prattmic prattmic commented Aug 21, 2020

Ah, this is very helpful. It looks like this might be a bug in runtime.expandFinalInlineFrame.

From the PCs, I suspect that nanotime1 and nanotime are inlined into suspendG, so expandFinalInlineFrame should have expanded the stack to include nanotime and suspendG, but it seems that it didn't for some reason.

Could you provide a copy of the crashing binary, or tell me how to build it so I can take a closer look?

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 21, 2020

So I think the problem here is that expandFinalInlineFrame doesn't account for the magic _System, _GC, etc frames. It should expand the second frame if one of those appear at the bottom. I'll send out a patch for this.

That said, I don't understand why the traceback would be failing here: https://fuchsia.googlesource.com/third_party/go/+/refs/heads/master/src/runtime/proc.go#4069

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 21, 2020

Regarding why the traceback is failing in the first place:

Since nanotime1 does call into the VDSO, it is possible there is some interaction with the handling of vdsoPC/vdsoSP or some weirdness with how profiling signals are delivered on Fuchsia (haven't looked at that much), though I'll want to take a closer at the binary.

As an aside, you all will want to port http://golang.org/cl/246763 for Fuchsia. I don't think that is the direct cause of these crashes, but is certainly suspicious.

@tamird
Copy link
Contributor Author

@tamird tamird commented Aug 21, 2020

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 21, 2020

If you can reproduce this issue, I think it would be valuable to see if making nanotime1 reentrant fixes the issue, to rule that out. I wrote https://fuchsia-review.googlesource.com/c/third_party/go/+/419640, which I believe is what you need, though it is untested since I don't have a Fuchsia environment.

@tamird
Copy link
Contributor Author

@tamird tamird commented Aug 23, 2020

I haven't been able to reproduce locally, but the issue only appears on arm64. The previous file I provided was the amd64 binary; here's the arm64: https://gofile.io/d/xPjFh3.

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 24, 2020

Reproducing only on ARM64 makes me wonder if this is a memory ordering issue.

VDSO calls set m.vdsoPC followed by m.vdsoSP. This order is important, because if sigprof observes that m.vdsoSP is set, then it assumes m.vdsoPC is valid. On Unix, sigprof is running on the same thread as the m it is profiling, therefore no memory synchronization is necessary.

However, on Fuchsia it is running on a different thread. Thus given that ARM allows stores to be reordered after stores, sigprof may actually observe non-zero m.vdsoSP without a valid m.vdsoPC.

Before calling sigprof, we suspend the target thread with zx_task_suspend_token. Does that call include a complete memory barrier to ensure we see fully-consistent values? The docs don't specify.

On AMD64, stores are ordered, so without a memory barrier we will still see consistent values (though perhaps stale!).

I see that the Windows port has a very similar structure, though there is no Windows ARM64 port, and I don't know if _SuspendThread includes memory barrier either.

cc @aclements @mknyszek to see if this sounds right, plus if you know more about Windows.

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Aug 24, 2020

@prattmic Your theory seems plausible to me, though I'm still tracking down how "bad m.vdsoPC" precisely translates into this crash. I suppose what this means is the gentraceback based on mp.vdsoPC and mp.vdsoSP will fail, we'll fall into the next case (the one you mentioned earlier with funcPC(_System)) and that's why stk gets populated the way it does.

Oh, I see, in an earlier post you posit that nanotime and nanotime1 are inlined into suspendG at the PC we encounter, so after expandFinalInlineFrame the _System is the extra element causing the crash. Have you confirmed yet that they got inlined? If so, I can see this all connecting together. For this inlined case the profiling code implicitly expects not to find a symbol that's marked as inline if it was marked with the pseudo-symbol _System, but because we observe a bad vdsoPC we end up falling into the wrong case (actually, any ordering where they're not both observed to be set could lead us to drop down into that case, making this somewhat likelier).

If this is the problem then we need either (1) synchronization of vdsoPC and vdsoSP or (2) make the profiling code be able to handle this extra _System element in the inlined case (which is what you suggested earlier), though it means some stacks might not be attributed "correctly." Does all this align with your thinking @prattmic?

Windows is the only platform (AFAIK) we officially support that has the "suspend thread at a distance" behavior, everything else uses signals. We had a windows-arm port (not arm64, as you point out) going at some point, but the builder was routinely broken.

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 24, 2020

Yes, nanotime and nanotime1 are inlined in the binary from #40823 (comment). It looks like:

  :0                    0x7b0d8                 390083e1                MOVB R1, 32(RSP)                                
  :0                    0x7b0dc                 b4000284                CBZ R4, 20(PC)                                  
  :0                    0x7b0e0                 f90027e5                MOVD R5, 72(RSP)                                
  :0                    0x7b0e4                 d503201f                NOOP                                            
  :0                    0x7b0e8                 9400d2de                CALL syscall/zx.Sys_clock_get_monotonic(SB)     
  :0                    0x7b0ec                 f94007e0                MOVD 8(RSP), R0                                 
  :0                    0x7b0f0                 f94027e1                MOVD 72(RSP), R1                                
  :0                    0x7b0f4                 eb01001f                CMP R1, R0                                      
  :0                    0x7b0f8                 540000ca                BGE 6(PC)                                       
  :0                    0x7b0fc                 d2800140                MOVD $10, R0                                    
  :0                    0x7b100                 b9000be0                MOVW R0, 8(RSP)                                 
  :0                    0x7b104                 9400ad4f                CALL runtime.procyield(SB)                      
  :0                    0x7b108                 f94027e0                MOVD 72(RSP), R0                                
  :0                    0x7b10c                 17ffffc8                JMP -56(PC)                                     
  :0                    0x7b110                 9400b0dc                CALL runtime.osyield(SB)                        
  :0                    0x7b114                 d503201f                NOOP                                            
  :0                    0x7b118                 9400d2d2                CALL syscall/zx.Sys_clock_get_monotonic(SB)     
  :0                    0x7b11c                 f94007e0                MOVD 8(RSP), R0                                 
  :0                    0x7b120                 d282711b                MOVD $5000, R27                                 
  :0                    0x7b124                 8b1b0000                ADD R27, R0, R0                                 
  :0                    0x7b128                 17ffffc1                JMP -63(PC)                                     
...

Note that this isn't the exact same binary that crashed, but these instructions match up almost perfectly with the stack, so perhaps it is identical.

I think we ultimately need (1) as you suggest. (2) should stop the immediate crashing, but stacks will be misattributed, and I wonder if there are other odd cases we'll hit with odd memory ordering.

We could make these vdso fields explicitly atomics, or perhaps the ARM64 assembly should just manually include DMB/DSB instructions at the right point?

Of course, this is predicated on the assumption that zx_task_suspend_token doesn't include an implicit memory barrier. I briefly looked at the Zircon source and didn't notice any, but I'm far from familiar. @tamird do you know, or know who would be a better authority on this?

@prattmic
Copy link
Member

@prattmic prattmic commented Aug 24, 2020

w.r.t. Windows, I forgot that Windows has no VDSO and thus doesn't use vdsoSP/vdsoPC at all! That said, it has very similar libcallsp/libcallpc, which may have similar issues.

@dhobsd
Copy link
Contributor

@dhobsd dhobsd commented Aug 24, 2020

This makes sense. I believe there are implicit memory barriers in zx_object_wait_one (due to locking) -- which is called to wait for the profiled thread to stop after the zx_task_suspend_token but before sigprof is called. My familiarity with ARM RMO is poor; wouldn't we also need a barrier between the store to vdsoPC and vdsoSP to guarantee remote stores are seen in-order? I think sequential consistency is easier to reason about; my vote would be to make these atomic given that those APIs are generally expected to make that guarantee.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.