runtime: "spinning with local work" #10573

Closed
aclements opened this issue Apr 24, 2015 · 8 comments

@aclements
Member

Several of the builders have started crashing intermittently with "fatal error: spinning with local work". Here's the earliest one: http://build.golang.org/log/ed94ee0fce2dfcdf7cfca23716438f7f19596db2 (linux-386-387 on 0e6a6c5). This is presumably the fault of e870f06, which changed the scheduler to hand off the remainder of the time slice to the most recently ready()d G.
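
For context, here is a minimal sketch of the mechanism that change introduced (a toy model with made-up method names, not the runtime source): a goroutine that has just been ready()d is parked in a p.runnext slot and inherits the remainder of the current time slice, instead of going to the tail of the P's run queue.

```go
package main

import "fmt"

// Toy model of the runnext slot; field names mirror the runtime's,
// but this is an illustration only.
type g struct{ id int }

type p struct {
    runnext *g   // most recently ready()d G; it runs next and inherits the time slice
    runq    []*g // ordinary FIFO local run queue
}

// readyToy models ready(): the readied G is parked in runnext rather than
// appended to the tail of the run queue.
func (pp *p) readyToy(gp *g) {
    if old := pp.runnext; old != nil {
        pp.runq = append(pp.runq, old) // a displaced G falls back to the queue
    }
    pp.runnext = gp
}

// runqget prefers runnext; the bool reports whether the returned G keeps
// the remainder of the current time slice.
func (pp *p) runqget() (*g, bool) {
    if gp := pp.runnext; gp != nil {
        pp.runnext = nil
        return gp, true
    }
    if len(pp.runq) > 0 {
        gp := pp.runq[0]
        pp.runq = pp.runq[1:]
        return gp, false
    }
    return nil, false
}

func main() {
    pp := &p{}
    pp.readyToy(&g{id: 1})
    pp.readyToy(&g{id: 2})
    gp, inherit := pp.runqget()
    fmt.Println(gp.id, inherit) // "2 true": the most recently readied G runs first
}
```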

@fsouza
Contributor

fsouza commented Apr 24, 2015

I've seen this failure on travis as well: https://travis-ci.org/tsuru/tsuru/jobs/59940568#L575.

@aclements
Member Author

I can't reproduce this locally, so debugging has been slow going. Here's what I've figured out so far.

The "local work" returned by runqget always comes from p.runnext (the newly introduced field) and the run queue is otherwise empty (runqhead == runqtail). The sequence of events is something like:

  1. Something calls startm(p, true) where p is non-nil (so startm does not get a P off the idle list).
  2. startm() fails to mget() an existing m, so it calls newm(mspinning, p). At this point, runqempty(p) is true, so p.runnext == 0.
  3. The new M starts and enters mstart, which calls mstart1.
  4. mstart1 calls mspinning, which sets m.spinning to 1.
  5. mstart1 calls schedule.
  6. schedule calls runqget, which, somehow, returns a G from p.runnext, causing schedule to throw.

Either something in the above trace is wrong, or we've somehow added something to the P's run queue in the middle of it.
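
To make the suspected window concrete, here is a toy model of that sequence (a sketch that borrows the runtime's names, not the actual scheduler code): a spinning M is only handed a P whose run queue looked empty, so schedule throws if it immediately finds local work.

```go
package main

import "fmt"

// Toy model of the window in steps 1-6 above; names mirror the runtime's,
// but this is a sketch, not the actual scheduler.
type g struct{ id int }

type p struct {
    runnext *g
    runq    []*g
}

type m struct {
    p        *p
    spinning bool // a spinning M is only started for a P that looked empty
}

func (pp *p) runqempty() bool { return pp.runnext == nil && len(pp.runq) == 0 }

func (pp *p) runqget() *g {
    if gp := pp.runnext; gp != nil {
        pp.runnext = nil
        return gp
    }
    if len(pp.runq) > 0 {
        gp := pp.runq[0]
        pp.runq = pp.runq[1:]
        return gp
    }
    return nil
}

// schedule models the check that throws: a spinning M must never find
// work already sitting on its own P.
func (mp *m) schedule() {
    if gp := mp.p.runqget(); gp != nil && mp.spinning {
        panic(fmt.Sprintf("schedule: spinning with local work (g%d)", gp.id))
    }
}

func main() {
    pp := &p{}
    // Steps 1-2: startm(p, true) observes runqempty(p) and decides to hand
    // pp to a brand-new spinning M via newm(mspinning, p).
    if !pp.runqempty() {
        return
    }
    // The window between steps 2 and 3: if anything puts a G on pp here,
    // the invariant is already broken before the new M runs.
    pp.runnext = &g{id: 1}
    // Steps 3-6: the new M starts, marks itself spinning, and schedules.
    mp := &m{p: pp, spinning: true}
    mp.schedule() // panics, standing in for "fatal error: spinning with local work"
}
```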

@minux
Member

minux commented Apr 24, 2015 via email

@aclements
Member Author

Finally caught it red-handed calling runqput between steps 2 and 3. newm calls allocm, which "borrows" the P we're about to start and uses it to allocate, which assists GC, which has some chance of finishing the mark, which ready()s the main GC goroutine, which gets put on the borrowed P's run queue. This bug was introduced in ce502b0 when I switched this synchronization from using notes to using park/ready. It doesn't actually have anything to do with the new time-slice handoff like I initially thought; this would have happened even without that. Now I just have to figure out a good way to fix it.
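
A rough toy model of that call chain (an illustration only; the helper names are made up, and the "runqput at bad time" message just mirrors the debug throw in the trace below):

```go
package main

import "fmt"

// Toy model: allocm "borrows" the P that is about to be handed to a new
// spinning M, and the allocation it performs can reenter the GC, which
// ready()s the mark goroutine onto that same borrowed P.
type g struct{ name string }

type p struct {
    runq     []*g
    borrowed bool // set while the P is lent out for allocation
}

func (pp *p) runqput(gp *g) {
    if pp.borrowed {
        // Stands in for the "runqput at bad time" debug throw shown below.
        fmt.Println("runqput at bad time:", gp.name)
    }
    pp.runq = append(pp.runq, gp)
}

// gcAssistToy stands in for mallocgc -> gcAssistAlloc -> gcBgMarkDone:
// if the assist happens to finish the mark, it ready()s the GC goroutine.
func gcAssistToy(pp *p, markFinished bool) {
    if markFinished {
        pp.runqput(&g{name: "GC coordinator"}) // ready() -> runqput on the borrowed P
    }
}

// allocmToy models newm -> allocm: it borrows pp to allocate the new M's
// structures, and that allocation may trigger a GC assist.
func allocmToy(pp *p) {
    pp.borrowed = true
    gcAssistToy(pp, true)
    pp.borrowed = false
}

func main() {
    pp := &p{} // startm already decided pp was empty and safe to hand off
    allocmToy(pp)
    fmt.Println("goroutines queued on the handed-off P:", len(pp.runq))
}
```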

The full failure is here: https://storage.googleapis.com/go-build-log/7ab81449/windows-amd64-gce_edd15a5c.log from my debug hacks in https://go-review.googlesource.com/#/c/9331/.

And the relevant part:

```
--- FAIL: TestCgoCrashHandler (1.94s)
crash_test.go:92: building source: exit status 2
    # _/C_/Users/WINGOP~1/AppData/Local/Temp/go-build787559523
    runqput with p.startm = 112 p 0xc082014000
    fatal error: runqput at bad time

    runtime stack:
    runtime.throw(0x677af0, 0x13)
        C:/workdir/go/src/runtime/panic.go:543 +0x9d
    runtime.runqput(0xc082014000, 0xc082485a40, 0x420501)
        C:/workdir/go/src/runtime/proc1.go:3311 +0xd2
    runtime.ready(0xc082485a40, 0x0)
        C:/workdir/go/src/runtime/proc1.go:148 +0xfe
    runtime.gcBgMarkDone()
        C:/workdir/go/src/runtime/mgc.go:1102 +0xbc
    runtime.gcAssistAlloc.func1()
        C:/workdir/go/src/runtime/mgcmark.go:248 +0x12a
    runtime.systemstack(0x39fc60)
        C:/workdir/go/src/runtime/asm_amd64.s:278 +0xb5
    runtime.gcAssistAlloc(0x400, 0x401)
        C:/workdir/go/src/runtime/mgcmark.go:258 +0x16f
    runtime.mallocgc(0x400, 0x63c5a0, 0xc000000000, 0x220000)
        C:/workdir/go/src/runtime/malloc.go:694 +0x560
    runtime.newobject(0x63c5a0, 0x0)
        C:/workdir/go/src/runtime/malloc.go:725 +0x50
    runtime.allocm(0xc082014000, 0x6bb290, 0x4138a6)
        C:/workdir/go/src/runtime/proc1.go:777 +0x7d
    runtime.newm(0x6bb290, 0xc082014000)
        C:/workdir/go/src/runtime/proc1.go:995 +0x39
    runtime.startm(0xc082014000, 0x100000001)
        C:/workdir/go/src/runtime/proc1.go:1088 +0x195
    runtime.handoffp(0xc082014000)
        C:/workdir/go/src/runtime/proc1.go:1126 +0x126
    runtime.retake(0x55bd86dfcc, 0x0)
        C:/workdir/go/src/runtime/proc1.go:3016 +0x1ea
    runtime.sysmon()
        C:/workdir/go/src/runtime/proc1.go:2939 +0x209
    runtime.mstart1()
        C:/workdir/go/src/runtime/proc1.go:735 +0x100
    runtime.mstart()
        C:/workdir/go/src/runtime/proc1.go:705 +0x79
```

aclements self-assigned this Apr 25, 2015
aclements added this to the Go1.5 milestone Apr 25, 2015
@aclements
Member Author

https://go-review.googlesource.com/#/c/9332/

@minux, since you seem to be able to reproduce this pretty reliably, would you mind giving that a whirl?

@josharian
Contributor

I could reproduce reliably, and CL 9332 fixes it for me.

@minux
Member

minux commented Apr 25, 2015 via email

@gopherbot
Contributor

CL https://golang.org/cl/9332 mentions this issue.

golang locked and limited conversation to collaborators Jun 25, 2016