
runtime: panic on plan9_arm builders #42303

Open
millerresearch opened this issue Oct 30, 2020 · 19 comments

Comments

@millerresearch commented Oct 30, 2020

The plan9_arm builders have been getting panics on every run since the afternoon of 27 Oct. The first one seems to be https://build.golang.org/log/6a299fffd128c3ed0bfdd1c471c2ca891dee8b34 after CL 232298 was merged.

The immediate cause of the panic varies. Could be a memory corruption.

@dmitshur (Member) commented Oct 30, 2020

This may be the same as or related to #42237.

@dmitshur added this to the Backlog milestone Oct 30, 2020
@dmitshur added the arch-arm label Oct 30, 2020
@millerresearch (Author) commented Oct 31, 2020

This may be the same as or related to #42237.

I don't think so. Those examples are deliberate panics because of a timeout. The plan9_arm panics are unexpected signals triggered by something going weirdly wrong in the runtime.

@millerresearch (Author) commented Nov 3, 2020

In some cases, the immediate cause of the panic is that _g_.m.p is nil at the top of findrunnable(). Shouldn't that be impossible?

@ianlancetaylor (Contributor) commented Nov 3, 2020

@millerresearch (Author) commented Nov 4, 2020

Another common panic cause is that _g_.m is nil after calling acquirep() in mstart1(). Also impossible?

@prattmic (Member) commented Nov 4, 2020

Thanks for the heads up. I agree that this looks different from the failures in #42237, though perhaps it is related. Hopefully this one is reproducible on gomote.

@prattmic (Member) commented Nov 4, 2020

Does this builder have stability issues (not sure who maintains it)? After waiting 4hr 40min (!!!!) for a gomote, I tried to run some tests:

$ gomote run user-mpratt-plan9-arm-0 ./go/src/all.rc  
Building Go cmd/dist using /usr/glenda/go
Building Go toolchain1 using /usr/glenda/go.
2020/11/04 16:02:28 user-mpratt-plan9-arm-0: timeout after 10s waiting for headers for /status
2020/11/04 16:02:50 Buildlet https://farmer.golang.org:443 failed three heartbeats; final error: 502 Bad Gateway
Error running run: Error trying to execute ./go/src/all.rc: Buildlet https://farmer.golang.org:443 failed heartbeat after 86.211812ms; marking dead; err=502 Bad Gateway

And it seems to still be gone:

$ gomote ping user-mpratt-plan9-arm-0
Error running ping: 502 Bad Gateway

I notice on the build dashboard that the builds that don't have these panics contain errors like "communication error to buildlet (promoted to terminal error): network error promoted to terminal error: runTests: dist test failed: all buildlets had network errors or timeouts, yet tests remain", which sounds awfully similar.

I'm not thrilled by the prospect of waiting another 5hr to see if the next gomote will share a similar fate.

cc @golang/release

@millerresearch (Author) commented Nov 4, 2020

Does this builder have stability issues

Well, that's one way of putting it. As I said above, these builders have been crashing on every run since the 27 October CL I referred to. Sometimes they panic; sometimes the corruption is so bad that the Go runtime just hangs. There isn't a full-time attendant to keep restarting them manually.

I've been putting in some time trying to debug this, and the common factor seems to be that _g_.m is nil or garbage. I don't know enough about the internals of the scheduler to understand how this could happen.

@prattmic (Member) commented Nov 4, 2020

I see at least one crash that does immediately look like a bad g.m:

Overall, this looks like general memory corruption to me, but that it is limited to runtime internals is interesting. Perhaps bad TLS somehow?

@dmitshur (Member) commented Nov 4, 2020

not sure who maintains it

Builder owners and additional notes are available at https://farmer.golang.org/builders.

@millerresearch (Author) commented Nov 5, 2020

To corroborate the circumstantial evidence implicating CL 232298, I've tried patching the current head of the master branch with the reverse of commit 8fdc79e. (It needed a bit of hand editing because of later changes in runtime/proc.go.)

With the commit reversed, all.rc completes with ALL TESTS PASSED, and no memory reference traps.

I'll run it a few more times, but it does appear that something in that commit has introduced a bug or tickled an existing bug whose effect is specific to plan9_arm.

@millerresearch (Author) commented Nov 5, 2020

it does appear that something in that commit has introduced a bug

I've isolated it to the new call to wakep in wakeNetPoller. If I remove that, the panics stop.

The ongoing investigation into #42237 seems to show it's related after all, so I'll wait to see how that one is resolved.

@prattmic (Member) commented Nov 5, 2020

Does http://golang.org/cl/267257 (merged to tip now) fix the crashes? That was also directly fixing the startm call in wakep from wakeNetPoller.

@millerresearch (Author) commented Nov 6, 2020

Does http://golang.org/cl/267257 (merged to tip now) fix the crashes?

No, it still gets memory faults.

Since the Plan 9 runtime doesn't actually have a network poller, does the wakep serve any useful purpose there? Could we just make it conditional on GOOS != "plan9"?

If I understand correctly, there's no network poller for wasm/js either. Should the wakep be skipped there too, to avoid unnecessary starting of threads?

@prattmic (Member) commented Nov 6, 2020

The netpoll delay sleep is used to wait for timers, so we still need that wakep to ensure some thread is waiting on timers.

@millerresearch (Author) commented Nov 6, 2020

The netpoll delay sleep is used to wait for timers

Is this documented somewhere? Or can you point me to where in the code this happens on systems with no netpoller?

@prattmic (Member) commented Nov 6, 2020

I'm not aware of great overview docs, but this was part of @ianlancetaylor's new timers last year. You can see the changes in http://golang.org/cl/171821 and its relation chain.

Notably, Plan 9's netpoll sleeps as requested, despite not performing any actual polling. findrunnable determines when the next timer expires (pollUntil here), and will later ask netpoll to wait that long.

@millerresearch (Author) commented Nov 6, 2020

Thanks, that's helpful. I actually fixed a bug in netpoll_stub earlier this year without being fully aware of how it's used. All I know is that, empirically, removing that wakep call makes the difference between memory faults (every test run) and ALL TESTS PASSED. If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

@millerresearch (Author) commented Nov 7, 2020

If the wakep is "needed to ensure some thread is waiting", why don't I see deadlocks when it's removed?

The stub netpoll calls notetsleep to do an OS-level timed wait. notetsleep is being run by the m (aka thread, aka Plan 9 process, yes?) associated with g0 (the program's main goroutine, yes?). The Plan 9 implementation of notetsleep is semasleep which blocks on a timed semaphore in the OS. So isn't g0.m already the thread which is waiting on the timer?

Before calling netpoll, the code in findrunnable explicitly checks that there's no current p associated with the running thread, and after netpoll returns there's a call to pidleget to (try to) associate a p with the thread. So while the g0.m thread is blocked on the timed semaphore, what is the point of wakep looking for a spare p and m to hook together?

There must be some detail I'm not seeing, sorry.
