Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: shrinkstack during mark termination significantly increases GC STW time #12967

Closed
rhysh opened this issue Oct 17, 2015 · 27 comments
Closed
Milestone

Comments

@rhysh
Copy link
Contributor

@rhysh rhysh commented Oct 17, 2015

I have a process that experiences significant STW times during the mark termination phase of garbage collection. When it has around 100,000 goroutines and a live heap of around 500MB, its STW pauses take tens of milliseconds. The process runs on linux/amd64 with GOMAXPROCS=18, with a recent version of Go tip (30b9663).

I profiled its CPU usage with perf_events, looking specifically at samples with runtime.gcMark on the stack. 7.5% of samples include runtime.gcMark -> runtime.freeStackSpans (non-parallelized work, discussed in #11465). 90% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot, and 75% of samples include runtime.gcMark -> runtime.parfordo -> runtime.markroot -> runtime.shrinkstack.

The work for stack shrinking is split 18 ways in my case, but at 75% of CPU cycles it's still a huge contributor to mark termination pauses, and the pauses my program sees are well beyond the 10ms goal with only a modest heap size.

Does stack shrinking need to take place while the world is stopped?

$ go version
go version devel +30b9663 Tue Oct 13 21:06:58 2015 +0000 linux/amd64

@aclements

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 17, 2015

@randall77
Copy link
Contributor

@randall77 randall77 commented Oct 17, 2015

We could move stack shrinking to the phase where we stop one goroutine at a time.

@aclements
Copy link
Member

@aclements aclements commented Oct 19, 2015

I agree. I don't think there would be any problem with shrinking each stack right before we do the concurrent scan of that stack. We have to be careful not to use any cross-stack pointers between when we shrink our own stack and when we return from the system stack, but that may actually be easier to ensure if we do the shrink as part of stack scanning.

@rhysh
Copy link
Contributor Author

@rhysh rhysh commented Oct 20, 2015

I've written a reproducer, included below. Its perf_events profile, collected on linux/amd64, looks very similar to what I see for my production program in terms of CPU time spent in runtime.freeStackSpans (10%), in runtime.shrinkstack (83%), in runtime.stackalloc (37%), and in runtime.stackfree (37%).

The reproducer starts 10,000 goroutines which each repeatedly grow and shrink their stack by 10kB. The production program has around 200,000 goroutines and has a garbage collection every 60-120 seconds, so any goroutines that grow and shrink their stacks during that time will contribute to the mark termination pause. (Ironic that generating more garbage, which would lead to more frequent collections, could reduce the duration of each pause.)

Here's output from some runs of the program on darwin/amd64.

$ go version
go version devel +3bc0601 Tue Oct 20 03:16:09 2015 +0000 darwin/amd64
$ GODEBUG=gctrace=1 ./12967 
gc 1 @0.002s 20%: 22+9.1+0.15+2.4+0.72 ms clock, 45+9.1+0+0/4.5/0+1.4 ms cpu, 10->14->13 MB, 15 MB goal, 8 P
gc 2 @0.126s 7%: 0.24+24+0.001+0.065+0.50 ms clock, 0.48+24+0+1.0/0/6.2+1.0 ms cpu, 15->15->15 MB, 27 MB goal, 8 P
gc 3 @0.856s 1%: 0.31+24+0.001+0.067+0.54 ms clock, 2.4+24+0+1.4/0.001/10+4.3 ms cpu, 22->22->16 MB, 30 MB goal, 8 P
gc 4 @2.118s 0%: 0.25+24+0.001+0.066+0.51 ms clock, 2.0+24+0+0.029/0/13+4.1 ms cpu, 28->28->16 MB, 32 MB goal, 8 P
gc 5 @3.684s 2%: 0.19+20+0.002+0.068+79 ms clock, 1.5+20+0+0.95/0.003/6.7+638 ms cpu, 31->31->16 MB, 32 MB goal, 8 P
gc 6 @5.239s 2%: 0.36+24+0.001+0.10+0.53 ms clock, 2.9+24+0+0.031/0.021/13+4.2 ms cpu, 31->31->16 MB, 32 MB goal, 8 P
gc 7 @6.904s 1%: 0.25+29+0.002+0.069+0.59 ms clock, 2.0+29+0+7.2/0.002/0.039+4.7 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 8 @8.551s 1%: 0.22+20+0.002+0.065+23 ms clock, 1.8+20+0+1.7/0/7.1+185 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 9 @10.175s 1%: 0.24+25+0.001+0.071+0.57 ms clock, 1.9+25+0+0/0.014/10+4.5 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 10 @11.690s 1%: 0.21+26+0.002+0.072+0.60 ms clock, 1.6+26+0+5.6/0.002/0.016+4.8 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 11 @13.316s 1%: 0.25+23+0.001+0.054+13 ms clock, 2.0+23+0+1.8/0.006/12+105 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 12 @14.942s 1%: 0.24+19+0.001+0.059+8.8 ms clock, 1.9+19+0+0.80/0.001/7.1+70 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 13 @16.603s 1%: 0.22+25+0.001+0.085+0.61 ms clock, 1.7+25+0+3.8/0.020/7.4+4.9 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 14 @18.239s 1%: 0.22+20+0.001+0.15+14 ms clock, 1.7+20+0+0.034/0.010/10+112 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 15 @19.875s 1%: 0.23+21+0.001+0.069+9.7 ms clock, 1.8+21+0+0/0.002/8.1+78 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 16 @21.392s 0%: 0.21+23+0.001+0.067+0.59 ms clock, 1.6+23+0+0.26/0.002/10+4.7 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 17 @23.014s 0%: 0.20+21+0.001+0.067+13 ms clock, 1.6+21+0+1.7/0.015/9.0+105 ms cpu, 32->32->16 MB, 33 MB goal, 8 P

Disabling stack shrinking makes the mark termination STW pauses significantly shorter.

$ GODEBUG=gctrace=1,gcshrinkstackoff=1 ./12967 
gc 1 @0.002s 20%: 22+6.9+0.21+2.2+0.64 ms clock, 45+6.9+0+0/4.2/0+1.2 ms cpu, 10->14->13 MB, 15 MB goal, 8 P
gc 2 @0.128s 6%: 0.30+23+0.001+0.069+0.56 ms clock, 0.60+23+0+0.82/0.002/8.5+1.1 ms cpu, 15->15->15 MB, 27 MB goal, 8 P
gc 3 @0.841s 1%: 0.21+24+0.001+0.053+0.47 ms clock, 1.7+24+0+1.1/0.004/7.2+3.7 ms cpu, 22->22->16 MB, 30 MB goal, 8 P
gc 4 @2.059s 0%: 0.29+24+0.001+0.058+0.43 ms clock, 2.0+24+0+0/0.010/8.1+3.0 ms cpu, 28->28->15 MB, 32 MB goal, 8 P
gc 5 @3.475s 0%: 0.28+21+0.002+0.077+0.28 ms clock, 2.2+21+0+0.049/0.001/11+2.3 ms cpu, 29->29->16 MB, 30 MB goal, 8 P
gc 6 @4.995s 0%: 0.20+19+0.002+0.077+0.28 ms clock, 1.6+19+0+0/0.027/9.5+2.3 ms cpu, 31->31->15 MB, 32 MB goal, 8 P
gc 7 @6.509s 0%: 0.20+28+0.001+0.068+0.56 ms clock, 1.6+28+0+6.0/0.001/0.032+4.5 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 8 @8.138s 0%: 0.21+20+0.001+0.077+0.27 ms clock, 1.6+20+0+0/0.018/7.0+2.1 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 9 @9.652s 0%: 0.25+21+0.001+0.070+0.26 ms clock, 2.0+21+0+0.28/0.015/10+2.1 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 10 @11.275s 0%: 0.23+22+0.002+0.078+0.64 ms clock, 1.8+22+0+0/0.018/7.6+5.1 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 11 @12.791s 0%: 0.26+21+0.001+0.070+0.23 ms clock, 1.8+21+0+0/0.014/10+1.6 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 12 @14.302s 0%: 0.20+19+0.002+0.073+0.28 ms clock, 1.6+19+0+0/0.002/7.2+2.2 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 13 @15.815s 0%: 0.26+28+0.001+0.066+0.47 ms clock, 2.1+28+0+6.6/0/0.020+3.7 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
gc 14 @17.441s 0%: 0.26+24+0.001+0.067+0.56 ms clock, 2.0+24+0+1.2/0.001/8.5+4.5 ms cpu, 32->32->16 MB, 33 MB goal, 8 P
gc 15 @19.064s 0%: 0.21+20+0.001+0.077+0.24 ms clock, 1.7+20+0+0/0.017/9.6+1.9 ms cpu, 32->32->15 MB, 33 MB goal, 8 P
gc 16 @20.577s 0%: 0.22+22+0.001+0.079+0.54 ms clock, 1.8+22+0+0/0.002/9.5+4.3 ms cpu, 30->30->15 MB, 31 MB goal, 8 P
gc 17 @22.091s 0%: 0.21+26+0.001+0.070+0.55 ms clock, 1.6+26+0+4.8/0.001/0.014+4.4 ms cpu, 30->30->16 MB, 31 MB goal, 8 P
package main

import (
    "sync"
    "time"
)

const (
    ballastSize   = 10 << 20
    garbageSize   = 1 << 20
    garbagePeriod = 100 * time.Millisecond

    stackCount  = 10000
    stackSize   = 10 << 10
    stackPeriod = 5 * time.Second
)

var (
    ballast []byte
    garbage []byte
)

func churn() {
    ballast = make([]byte, ballastSize)

    for {
        time.Sleep(garbagePeriod)
        garbage = make([]byte, garbageSize)
    }
}

func stack(a, b *sync.WaitGroup) {
    for {
        grow(a)
        b.Wait()
    }
}

func grow(a *sync.WaitGroup) byte {
    var s [stackSize]byte
    a.Wait()
    return s[0]
}

func main() {
    go churn()

    var a, b sync.WaitGroup
    a.Add(1)
    for i := 0; i < stackCount; i++ {
        go stack(&a, &b)
    }

    for {
        time.Sleep(stackPeriod / 2)
        b.Add(1)
        a.Add(-1)

        time.Sleep(stackPeriod / 2)
        a.Add(1)
        b.Add(-1)
    }
}
@aclements
Copy link
Member

@aclements aclements commented Oct 26, 2015

I thought of one problem with shrinking stacks during concurrent stack scan. If the goroutine is blocked in a channel operation or select, there are pointers in to that G's stack from runtime structures and updating these pointers during shrinking could race with another G completing that channel operation or select. This isn't an issue for stack growing because that's always done synchronously.

I can think of a few solutions:

  1. We could not shrink the stack concurrently. This would be unfortunate, since goroutines are often blocked on channels for a long time.

  2. The stack shrinking code could detect this condition and lock the appropriate channels while shrinking to prevent a channel operation from completing during the shrink. This might not be so hard: we could add the *hchan to the sudog and then walk the g.waiting list in shrinkstack. We'd probably want to rearrange copystack a bit so we didn't have to keep these channels locked for the whole copystack. This has the downside of blocking all other operations on these channels even if the G being shrunk is way back in the channel's queue.

  3. Channel operations could detect this condition and wait for the stack shrink to finish or just requeue the waiting G (which doesn't help if this is the only G waiting on the channel). Channel operations would also have to block a stack shrink from starting during the operation. This has the advantage of only blocking operations that actually deal with the G being shrunk. We could probably do this by synchronizing on the G's status. Introduce a new status for "in a channel op" and make the channel completion put the G in to this status. If the G is in _Gcopystack, that will spin until the copy stack is done. If it's not in _Gcopystack, that will cause a copystack to spin during the channel op completion until it can put it in to _Gcopystack.

@rhysh
Copy link
Contributor Author

@rhysh rhysh commented Oct 27, 2015

Does your first solution mean that the stacks would be shrunk during a STW phase, or that they would not be shrunk at all?

Most of the goroutines in the application I described are blocked on channel operations nearly all the time. The goroutines occasionally get a message on their channel that makes them call a function with a large stack requirement (and which returns quickly).

Locking all goroutines that are attempting to communicate on a channel, as described in the second solution, would make "quit" channels more expensive if shared across many goroutines. Stack shrinking on any one of the participating goroutines would delay all of the others.

The third solution seems easy to reason about, and sounds like it will have predictable performance.

@rsc
Copy link
Contributor

@rsc rsc commented Nov 5, 2015

Per Austin, this is too invasive for Go 1.6. He may look into having it ready for the start of the 1.7 cycle.

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 8, 2015

So as I see it there are a couple of options to enable shrinking outside a STW phase.

  1. Have receivers allocate space for the received data on the heap.
  2. Have unbuffered channels allocate one slot which is used to pass the sent value.
  3. Shrinker must grab the channel lock.
  4. Sender & shrinker must grab a lock on the stack (or G, as Austin proposes)
  5. Shrink stacks in place.

1 means allocation per channel receive.
2 means some extra memory + some extra synchronization.
3 might be difficult for select statements, we'd need to grab all the chan locks.
4 means a second lock per send.
5 means possibly greater fragmentation.

As others have also pointed out, I don't think preventing shrinking while in receive is a viable option.
Any others?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 9, 2015

You probably considered this, but everything flows through the sudog, so we could add a lock there. Once the sudog is selected for a channel op, we can lock it. Stack shrinking already knows about the sudogs associated with the channel, it could lock all of them first. If stack shrinking sees that a sudog is already locked, it can punt and wait until next time. If the channel op sees that the sudog is locked, we know there can be at must one channel op pending for a goroutine, so the channel op can sleep on a note until the stack shrinking is complete. So I think the sudog lock can be a single atomic integer managed through cas, plus a note. I think the overhead in the normal case is one additional atomic cas and one atomic store per channel send.

@aclements
Copy link
Member

@aclements aclements commented Nov 9, 2015

@RLH and I were talking about another solution that sort of combines @randall77's first suggestion and @ianlancetaylor's suggestion: reserve a few words in the sudog for receiving small values (which are presumably the majority of channel ops) and otherwise allocate space on the heap for a slot. The receiver would then copy the value in to its own stack when it wakes up. This would eliminate cross-stack writes, at the cost of slightly larger sudogs and a heap allocation for passing large values.

@RLH
Copy link
Contributor

@RLH RLH commented Nov 9, 2015

I suspect that we can also make the sudog slot solution lock free, just cas
in the value, possible using a 2 word cas so we can get status bits also.

On Mon, Nov 9, 2015 at 11:16 AM, Austin Clements notifications@github.com
wrote:

@RLH https://github.com/RLH and I were talking about another solution
that sort of combines @randall77 https://github.com/randall77's first
suggestion and @ianlancetaylor https://github.com/ianlancetaylor's
suggestion: reserve a few words in the sudog for receiving small values
(which are presumably the majority of channel ops) and otherwise allocate
space on the heap for a slot. The receiver would then copy the value in to
its own stack when it wakes up. This would eliminate cross-stack writes, at
the cost of slightly larger sudogs and a heap allocation for passing large
values.


Reply to this email directly or view it on GitHub
#12967 (comment).

@aclements
Copy link
Member

@aclements aclements commented Nov 9, 2015

@RLH, I think the sudog solution doesn't introduce any additional locks or even atomic ops over what we have now. Existing locking already ensures the sudog isn't going anywhere and there's no competition for writing to a slot in it since the sudog represents a single (goroutine, channel) pair.

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 9, 2015

The trouble with putting some space in the sudog is that it needs to be typed correctly for GC. So if we wanted 4 words of space, we'd need 16 pseudog types to encode ptr/nonptr for each of those 4 slots.

This isn't a showstopper, but it complicates things like acquireSudog.

@randall77
Copy link
Contributor

@randall77 randall77 commented Nov 9, 2015

Of course, if we went the route of different sudog types we could make them exactly the required size.

@aclements
Copy link
Member

@aclements aclements commented Jan 13, 2016

@davecheney's email (https://groups.google.com/d/topic/golang-dev/SJ3LAW5ZXDs/discussion) got me thinking about this again; specifically @randall77's 5th option of shrinking stacks in place. The benefit is of course that this would turn a fairly expensive moving operation into a cheap metadata operation.

The concern with shrinking stacks in place is that it would increase fragmentation. However, I wonder how bad of a problem this really is and to what extent we could mitigate it. I think the worst-case here is a large number of goroutines start and grow their stacks to 32K, then all shrink their stacks to 2K. If we do this all in place, each 2K stack will pin a 32K span for stacks, a 16X fragmentation ratio. However, this is nothing new; you can do the same thing by starting lots of goroutines and selectively exiting them, or even carefully allocating and releasing objects. But all of these situations naturally remedy themselves as a program continues doing real work. Fragmentation will decrease as other goroutines start (and use the free slots in the existing stack spans) or as the running goroutines grow their stacks (forcing them to move) or exit.

We could also actively mitigate fragmentation. We could move stacks if we detect large fragmentation, but I'd just as soon not. Alternatively, we could combat fragmentation generally by switching from the current size-segregated stack free lists to a buddy allocator. The constraints of stack allocation fit quite nicely with a buddy allocator. This would let one span be used for multiple stack size classes, reducing size calcification. It would efficiently merge neighboring free slots and break up large slots when necessary, while keeping allocation and freeing very efficient. It would only require 16 bits of metadata per stack span, which we could easily steal from several places. We could even do opportunistic in-place stack growth for small stacks. This wouldn't prevent the above worst-case fragmentation, but it would remedy it faster by allowing the free space to be used for other size classes.

@aclements
Copy link
Member

@aclements aclements commented Feb 29, 2016

Well that turned out to be hairier than I expected.

@rhysh, care to give https://go-review.googlesource.com/20044 a try? For me it reduces the STW time on your test program to under 2ms.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20033 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20041 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20034 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20039 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20042 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20038 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20040 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20037 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20035 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 29, 2016

CL https://golang.org/cl/20044 mentions this issue.

@rhysh
Copy link
Contributor Author

@rhysh rhysh commented Mar 3, 2016

Thanks @aclements, your changes result in a big improvement in mark termination time for my application, bringing it down to around 20ms with ~120k goroutines and ~500MB live heap (compared with 50ms at 90k goroutines and 330MB heap on go1.6). Some of the extra pause time may be attributable to #14419 or to #14406, which I did not mitigate while testing this change.

I've been testing with CL 20044 PS 2.

go1.6 with framepointer, ~90k goroutines:

gc 463 @23853.946s 0%: 14+69+49 ms clock, 478+716/606/779+1583 ms cpu, 652->653->334 MB, 669 MB goal, 36 P
gc 464 @23920.847s 0%: 5.9+66+48 ms clock, 190+716/573/826+1546 ms cpu, 653->653->334 MB, 669 MB goal, 36 P
gc 465 @23991.665s 0%: 5.5+68+48 ms clock, 177+646/575/780+1567 ms cpu, 651->652->334 MB, 668 MB goal, 36 P
gc 466 @24060.214s 0%: 6.8+75+50 ms clock, 185+939/657/439+1350 ms cpu, 651->651->333 MB, 668 MB goal, 36 P
gc 467 @24132.336s 0%: 5.7+64+53 ms clock, 185+796/559/650+1721 ms cpu, 650->651->333 MB, 667 MB goal, 36 P
gc 468 @24204.829s 0%: 6.4+80+61 ms clock, 205+778/665/805+1972 ms cpu, 650->650->333 MB, 666 MB goal, 36 P
gc 469 @24280.941s 0%: 5.1+69+56 ms clock, 165+645/578/808+1804 ms cpu, 650->652->332 MB, 667 MB goal, 36 P
gc 470 @24347.731s 0%: 5.3+80+53 ms clock, 172+754/689/611+1718 ms cpu, 647->649->332 MB, 664 MB goal, 36 P
gc 471 @24423.086s 0%: 10+74+48 ms clock, 326+595/635/875+1567 ms cpu, 647->649->331 MB, 663 MB goal, 36 P
gc 472 @24493.659s 0%: 5.5+69+50 ms clock, 176+1110/611/452+1616 ms cpu, 644->644->330 MB, 660 MB goal, 36 P

CL 20044 PS 2, devel +f9e73e9 Mon Feb 29 11:13:35 2016 -0500 with framepointer, ~120k goroutines:

gc 659 @42602.290s 0%: 4.6+108+21 ms clock, 74+1353/956/1017+345 ms cpu, 945->945->485 MB, 969 MB goal, 36 P
gc 660 @42655.022s 0%: 5.7+115+22 ms clock, 182+1215/1004/1070+709 ms cpu, 946->946->487 MB, 970 MB goal, 36 P
gc 661 @42701.260s 0%: 5.1+111+22 ms clock, 165+1230/967/1143+718 ms cpu, 951->952->488 MB, 975 MB goal, 36 P
gc 662 @42748.210s 0%: 5.0+106+19 ms clock, 100+1417/929/914+391 ms cpu, 953->953->489 MB, 977 MB goal, 36 P
gc 663 @42801.338s 0%: 8.2+98+17 ms clock, 265+1200/867/1211+569 ms cpu, 954->956->490 MB, 978 MB goal, 36 P
gc 664 @42847.753s 0%: 4.0+123+24 ms clock, 130+1347/948/984+796 ms cpu, 954->956->490 MB, 978 MB goal, 36 P
gc 665 @42891.652s 0%: 5.7+93+18 ms clock, 184+1097/821/1143+582 ms cpu, 957->957->490 MB, 981 MB goal, 36 P
gc 666 @42947.239s 0%: 8.0+109+18 ms clock, 258+1210/941/1224+577 ms cpu, 957->960->492 MB, 981 MB goal, 36 P
gc 667 @42993.313s 0%: 11+105+20 ms clock, 368+1298/908/1167+644 ms cpu, 958->959->492 MB, 982 MB goal, 36 P
gc 668 @43047.532s 0%: 10+102+19 ms clock, 325+1245/896/1163+636 ms cpu, 960->962->493 MB, 984 MB goal, 36 P
mk0x9 pushed a commit to mk0x9/go that referenced this issue Mar 14, 2016
Currently copystack adjusts pointers in the old stack and then copies
the adjusted stack to the new stack. In addition to being generally
confusing, this is going to make concurrent stack shrinking harder.

Switch this around so that we first copy the stack and then adjust
pointers on the new stack (never writing to the old stack).

This reprises CL 15996, but takes a different and simpler approach. CL
15996 still walked the old stack while adjusting pointers on the new
stack. In this CL, we adjust auxiliary structures before walking the
stack, so we can just walk the new stack.

For golang#12967.

Change-Id: I94fa86f823ba9ee478e73b2ba509eed3361c43df
Reviewed-on: https://go-review.googlesource.com/20033
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Mar 14, 2016
In particular, write down the rules for stack ownership because the
details of this are about to get very important with concurrent stack
shrinking. (Interestingly, the details don't actually change, but
anything that's currently skating on thin ice is likely to fall
through.)

Fox #12967.

Change-Id: I561e2610e864295e9faba07717a934aabefcaab9
Reviewed-on: https://go-review.googlesource.com/20034
Reviewed-by: Rick Hudson <rlh@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Given a G, there's currently no way to find the channel it's blocking
on. We'll need this information to fix a (probably theoretical) bug in
select and to implement concurrent stack shrinking, so record the
channel in the sudog.

For #12967.

Change-Id: If8fb63a140f1d07175818824d08c0ebeec2bdf66
Reviewed-on: https://go-review.googlesource.com/20035
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently the g.waiting list created by a select is in poll order.
However, nothing depends on this, and we're going to need access to
the channel lock order in other places shortly, so modify select to
put the waiting list in channel lock order.

For #12967.

Change-Id: If0d38816216ecbb37a36624d9b25dd96e0a775ec
Reviewed-on: https://go-review.googlesource.com/20037
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
gopark calls the unlock function after setting the G to _Gwaiting.
This means it's generally unsafe to access the G's stack from the
unlock function because the G may start running on another P. Once we
start shrinking stacks concurrently, a stack shrink could also move
the stack the moment after it enters _Gwaiting and before the unlock
function is called.

Document this restriction and fix the two places where we currently
violate it.

This is unlikely to be a problem in practice for these two places
right now, but they're already skating on thin ice. For example, the
following sequence could in principle cause corruption, deadlock, or a
panic in the select code:

On M1/P1:
1. G1 selects on channels A and B.
2. selectgoImpl calls gopark.
3. gopark puts G1 in _Gwaiting.
4. gopark calls selparkcommit.
5. selparkcommit releases the lock on channel A.

On M2/P2:
6. G2 sends to channel A.
7. The send puts G1 in _Grunnable and puts it on P2's run queue.
8. The scheduler runs, selects G1, puts it in _Grunning, and resumes G1.
9. On G1, the sellock immediately following the gopark gets called.
10. sellock grows and moves the stack.

On M1/P1:
11. selparkcommit continues to scan the lock order for the next
channel to unlock, but it's now reading from a freed (and possibly
reused) stack.

This shouldn't happen in practice because step 10 isn't the first call
to sellock, so the stack should already be big enough. However, once
we start shrinking stacks concurrently, this reasoning won't work any
more.

For #12967.

Change-Id: I3660c5be37e5be9f87433cb8141bdfdf37fadc4c
Reviewed-on: https://go-review.googlesource.com/20038
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
With concurrent stack shrinking, the stack can move the instant after
a G enters _Gwaiting. There are only two places that put a G into
_Gwaiting: gopark and newstack. We fixed uses of gopark. This commit
fixes newstack by simplifying its G transitions and, in particular,
eliminating or narrowing the transient _Gwaiting states it passes
through so it's clear nothing in the G is accessed while in _Gwaiting.

For #12967.

Change-Id: I2440ead411d2bc61beb1e2ab020ebe3cb3481af9
Reviewed-on: https://go-review.googlesource.com/20039
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently sudog.elem is never accessed concurrently, so in several
cases we drop the channel lock just before reading/writing the
sent/received value from/to sudog.elem. However, concurrent stack
shrinking is going to have to adjust sudog.elem to point to the new
stack, which means it needs a way to synchronize with accesses to
sudog.elem. Hence, add sudog.elem to the fields protected by
hchan.lock and scoot the unlocks down past the uses of sudog.elem.

While we're here, better document the channel synchronization rules.

For #12967.

Change-Id: I3ad0ca71f0a74b0716c261aef21b2f7f13f74917
Reviewed-on: https://go-review.googlesource.com/20040
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently, locking a G's stack by setting its status to _Gcopystack or
_Gscan is unordered with respect to channel locks. However, when we
make stack shrinking concurrent, stack shrinking will need to lock the
G and then acquire channel locks, which imposes an order on these.

Document this lock ordering and fix closechan to respect it.
Everything else already happens to respect it.

For #12967.

Change-Id: I4dd02675efffb3e7daa5285cf75bf24f987d90d4
Reviewed-on: https://go-review.googlesource.com/20041
Reviewed-by: Rick Hudson <rlh@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Mar 16, 2016
Currently shinkstack is only safe during STW because it adjusts
channel-related stack pointers and moves send/receive stack slots
without synchronizing with the channel code. Make it safe to use when
the world isn't stopped by:

1) Locking all channels the G is blocked on while adjusting the sudogs
   and copying the area of the stack that may contain send/receive
   slots.

2) For any stack frames that may contain send/receive slot, using an
   atomic CAS to adjust pointers to prevent races between adjusting a
   pointer in a receive slot and a concurrent send writing to that
   receive slot.

In principle, the synchronization could be finer-grained. For example,
we considered synchronizing around the sudogs, which would allow
channel operations involving other Gs to continue if the G being
shrunk was far enough down the send/receive queue. However, using the
channel lock means no additional locks are necessary in the channel
code. Furthermore, the stack shrinking code holds the channel lock for
a very short time (much less than the time required to shrink the
stack).

This does not yet make stack shrinking concurrent; it merely makes
doing so safe.

This has negligible effect on the go1 and garbage benchmarks.

For #12967.

Change-Id: Ia49df3a8a7be4b36e365aac4155a2416b94b988c
Reviewed-on: https://go-review.googlesource.com/20042
Reviewed-by: Keith Randall <khr@golang.org>
Run-TryBot: Austin Clements <austin@google.com>
@gopherbot gopherbot closed this in f11e4eb Mar 16, 2016
@aclements aclements modified the milestones: Go1.7, Unplanned Mar 18, 2016
@golang golang locked and limited conversation to collaborators Mar 19, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.