runtime: fall back to fair locks after repeated sleep-acquire failures #13086
Some notes on the measuring strategy. The same unfair algorithm is used on two levels: sync.Mutex and runtime semaphores. If you change only one of them you can get all kinds of weird non-linear effects that can both improve and worsen fairness and performance. In the test you statically partition iterations among goroutines, and this strategy strongly prefers fair execution: all goroutines need to finish at the same time. That can explain the better numbers in benchmarks. I did a quick prototype of a fully fair (FIFO) sync.Mutex and runtime sema:

```go
func benchmarkRand(b *testing.B, n int) {
	b.SetParallelism(n)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rand.Int()
		}
	})
}
```

Here are results on a silly 2 HT core notebook (and the number of goroutines is not as high as in typical Go servers):
You can see that the things Doug is talking about are not fake. Even with GOMAXPROCS=1 the slowdown is significant (that's our goroutine switching overhead).
I think we can do something along the lines of your proposal. However, we should not rely on an absolute number of failed acquire attempts. What matters for both user-perceived latency and performance is real time (for performance it is probably the frequency of acquisitions, but we can approximate that with real time). If we take a frequently acquired mutex (like the rand test), 3 is effectively the same as 1 (see the benchmark results). The user does not care about the number of mutex acquisitions; he cares about milliseconds.
@dvyukov, thanks for building that prototype and showing what I did wrong. :-) You reported new benchmark results with two different things changed: both the implementation and the benchmark. What about your corrected benchmark with my pseudo-fair implementation? Perhaps your results show only that we don't want complete fairness on by default (or perhaps even at all). My pseudo-fairness change fixed the real-world pathology with only a 5% slowdown in the macrobenchmarks (bench -bench=http), so maybe that's a better starting place. If you wanted to switch the detection to time instead of the number of spurious wakeups, how would you incorporate real time efficiently? My concern is that I didn't want to put new time calls into every Lock acquisition.
If 10ms is the Go preemption period, might we have a crude+cheap counter available based on that? If the thread at the head of the queue is denied access 3 times in a row, it can check that counter on each denial after that, and if the counter changes twice, that's between 10 and 19.9 ms.
@dr2chase The sysmon thread can provide such a counter; sysmon runs periodically and queries the time anyway. However, my plan has always been to get rid of sysmon entirely in favor of Windows UMS and Linux fibers. @rsc Hmm... a context switch after unlock is exactly what we want to avoid. The microbenchmark does not test the real thing here. Two points: (1) the context switch will destroy cache locality (in the benchmark there is no significant data); (2) the context switch will increase latency and memory consumption. Consider a server with 10 requests to process, each taking 1 ms: if you process them one-by-one to completion, the first request is serviced in 1 ms, the second in 2 ms, and so on; if you constantly round-robin between them, all requests are serviced in 10 ms and the working set is the sum of all 10 requests. Servers need to prioritize the oldest work and get rid of it entirely as soon as possible, freeing all associated resources before switching to new work.
With multiple goroutines the effect is still there, just not as severe as the two-goroutine case illustrated in the test. I think the fundamental cause is that the scheduler always favors keeping the runnable goroutine running, instead of looking at the goroutines in the waiting queue, some of which may also be just about ready to run. If I remember correctly, Linux has a third queue describing threads that can actually run (they aren't blocked on anything) but have not yet been scheduled to run; at scheduling time all of them are picked relatively fairly (I don't know the exact algorithm, but I'd imagine even a random pick would provide some basic fairness). I wonder if this could be useful in Go, since it seems a very common case that a bunch of goroutines block waiting on a lock while one goroutine does work holding it. When the holder releases the lock, all of the parked goroutines waiting on it should just be in the runnable queue and get a fair pick along with the other runnables. I think Gosched() is pretty much doing this artificially. dvyukov@ also has a point about destroying cache locality, but I wonder whether seconds of blocking that prevents work from continuing is really better than context switches measured in microseconds (and we aren't even talking about constant context switches; in practice you don't unlock and then lock again right after).
The Go scheduler has just one queue, and it holds exactly such goroutines. The queue is serviced in a fair manner (FIFO).
It happens quite frequently in practice, e.g. in some helper function that is called frequently and takes a mutex internally.
Inspiration might be found in an older version of the Solaris Nevada sources: "LIFO queue ordering is unfair and can lead to starvation, but it gives better performance for heavily contended locks..." Obviously, this applies to OS-level thread scheduling, but the principles should remain applicable. |
@rsc ping! |
NB for your consideration: we have seen this in production in a similar form. We had many incoming requests contending on a lock protecting a long-running (order of milliseconds) journal transaction. The lock unfairness stretched the long tail into hours(!). The solution was to track the waiters explicitly in a queue and guard the queue with a mutex; this sorted out nearly all of the unfairness.
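One concrete shape for that kind of explicit-queue workaround is a classic ticket lock, which serves waiters strictly in arrival order so no request can be starved. This sketch (illustrative, not the poster's actual code) spins with Gosched instead of parking waiters, so it is not production-ready, but it shows the FIFO property:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// ticketLock hands out numbered tickets; waiters are served strictly
// in ticket order, so the lock is FIFO-fair by construction.
type ticketLock struct {
	next    uint32 // next ticket to hand out
	serving uint32 // ticket currently allowed to hold the lock
}

func (t *ticketLock) Lock() {
	ticket := atomic.AddUint32(&t.next, 1) - 1
	for atomic.LoadUint32(&t.serving) != ticket {
		runtime.Gosched() // yield instead of burning CPU waiting our turn
	}
}

func (t *ticketLock) Unlock() {
	atomic.AddUint32(&t.serving, 1) // admit the next ticket holder
}

func main() {
	var l ticketLock
	var wg sync.WaitGroup
	counter := 0
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				l.Lock()
				counter++ // protected by the ticket lock
				l.Unlock()
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 4000: mutual exclusion preserved
}
```

This trades away all barging throughput for strict fairness, which is exactly the trade-off being debated in this thread.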
I wrote initially "The proposal is that we detect severe unfairness and revert to handoffs in that case. How to do that is an open question, which is why this is not a design doc." That's still true, unfortunately. We should probably plan to investigate this for Go 1.9 though. Thanks for the ping. |
Right, it would be great if locks, while not observing perfect fairness, nevertheless followed the principle of least surprise. Thanks for keeping the issue open.
We don't know what to do here. It seems like maybe I should retract this proposal, but this is a problem for some people, and it would be nice to solve it before they realize they have it (that is, not require people to type things like "FairMutex"). Does anyone have any suggestions for what we might do?
Do something crudely random, say allow barging (N-1)/N of the time, hand off 1/N? Pick N a power of two, use a super-crude (i.e., fast) RNG and just check for the lower log2(N) bits zero. |
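A sketch of that scheme, using a xorshift generator as the super-crude RNG (the function names and the choice of xorshift are illustrative, not part of the suggestion):

```go
package main

import "fmt"

// xorshift32 is a super-crude (i.e., fast) RNG, cheap enough to
// consult on every Unlock.
func xorshift32(s uint32) uint32 {
	s ^= s << 13
	s ^= s >> 17
	s ^= s << 5
	return s
}

// shouldHandoff reports whether this Unlock should hand the lock off
// directly instead of allowing barging: true roughly 1/n of the time,
// where n is a power of two, by checking that the lower log2(n) bits
// of the random value are zero.
func shouldHandoff(r, n uint32) bool {
	return r&(n-1) == 0
}

func main() {
	const n = 8 // hand off 1/8 of the time, allow barging 7/8
	s := uint32(2463534242)
	handoffs, iters := 0, 100000
	for i := 0; i < iters; i++ {
		s = xorshift32(s)
		if shouldHandoff(s, n) {
			handoffs++
		}
	}
	fmt.Printf("handed off %d/%d unlocks (expect roughly 1/%d)\n", handoffs, iters, n)
}
```

Because the check is a single AND and compare, it adds essentially nothing to the uncontended Unlock path, which is the constraint rsc raised about avoiding time calls.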
It seems like the unfairness only becomes a noticeable problem when the lock is held for a "very long time" (or, in other words, when the lock holder reschedules while holding the lock). You could detect that condition by looking at how long the lock is held on average, or by looking at the lock queue depth (or the wait time of the goroutine at the head of the queue). |
In other words, if the semaphore was fair (I know it's not now), then you could choose lock handoff if the next goroutine on the semaphore had been waiting for longer than 10ms. |
CL https://golang.org/cl/34310 mentions this issue. |
Since Dmitriy figured out a way to do this, proposal accepted. |
Preparation to switch to workqueue, to work around [1]. The Informer is written in a way that new updates are starved when the process loop holds the lock too long. [1] golang/go#13086
Go's locks make no guarantee of fairness. However, there exists at least one common pattern that exhibits severely unfair behavior, to the extent where it might be considered a serious bug. I propose we look into whether we can do anything in the sync.Mutex implementation that would ameliorate this severe unfairness without hurting the common case. I have one concrete suggestion at the bottom, but I hope to spur thought and discussion about this. I don't claim that this specific solution is the right one.
Problem
I've seen two different reports now of severe lock unfairness under roughly the same conditions, in two independent programs. The general problem occurs when there are two goroutines contending for a lock and one of them holds it for a long time while only briefly releasing it, and the other releases it for a long time and only briefly acquires it.
The demo program available by
go get rsc.io/tmp/lockskew
boils down to this:

Here, goroutine 1 basically wants to hold the lock all the time, releasing it briefly, while goroutine 2 only wants to hold the lock for brief amounts of time. Developers naively (but I don't think completely unreasonably) expect that although goroutine 2 may not get the lock quite as much as goroutine 1, it will get the lock once in a while, maybe after a second or two at worst. In fact, on my Linux workstation it's common for goroutine 2 to take 100+ seconds to acquire the lock just once.
When goroutine 1 executes mu.Unlock(), it marks the lock unlocked and wakes up goroutine 2, intending that goroutine 2 will run and acquire the lock itself. But goroutine 2 does not run immediately. Goroutine 1 keeps running, goes around the loop again, calls mu.Lock(), and reacquires the lock. By the time goroutine 2 actually runs and tries to acquire the lock, goroutine 1 has taken it back. The basic pattern is:
Given that g2 is woken by g1 only once per 100µs, the fact that g2 takes 100+ seconds to acquire the lock means that this cycle repeats millions of times before g1 is interrupted enough to let g2 run.
If you run
lockskew
it will acquire the lock as much as it can and print information about acquisitions, at most once per second. The performance is sensitive to the specific length of the sleep, but they're all bad. On my system, here is lockskew with 100µs sleeps, as in the program above:

And here is lockskew with 100ms sleeps:
Barging vs Handoff
The fundamental problem here is when g1 decides to wake up g2, it leaves the sync.Mutex unlocked and lets g2 deal with reacquiring it. In the time between those two events, any goroutine other than g2 (here g1, but in general any other goroutine) can arrive at mu.Lock and grab the mutex instead of g2. This "drive-by acquisition" is sometimes called barging.
The alternative to barging is for g1 to leave the lock locked when waking g2, explicitly handing off the lock to g2. This coordinated handoff leaves the lock in the "locked" state the entire time, eliminating the window in which barging can occur.
In his work on java.util.concurrent, Doug Lea found that barging helped throughput [1], and that fact has entered the received wisdom surrounding locking. Following standard Java at the time, the evaluation used operating system threads and an operating system scheduler (specifically, Linux 2.4 NPTL on Pentium3/4 and Solaris 9 on UltraSparc2/3). As I understand it, barging is a win because it makes some bad operating system scheduling decisions impossible. Specifically, if the operating system delays the scheduling of g2, making the handoff delay long, not holding the lock during the handoff lets another thread do useful work in the interim.
As the test program and the real programs that inspired it show, however, at least in the Go implementation, barging enables very bad states. Also, the delays that barging helps avoid may not apply to Go, which is primarily using goroutines and a goroutine scheduler.
Alternatives
It's worth noting that there is a known workaround to this problem. If you use a pair of locks, with mu2 protecting mu.Lock, then that makes barging impossible in the two-goroutine case and at least less of an issue in the case of more goroutines. Specifically, the above code changes to:
Now, because g1 sleeps without holding mu2, goroutine 2 can get mu2 and block in mu.Lock, and then when goroutine 1 goes back around its loop, it blocks on mu2.Lock and cannot barge in on goroutine 2's acquisition of mu.
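The changed code is elided above; a runnable sketch of the two-lock idiom as described (`lockFair` is a hypothetical helper name, and the 100µs timing follows the variant discussed above):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var mu, mu2 sync.Mutex

// lockFair acquires mu through the outer lock mu2. A goroutine already
// queued on mu holds mu2 while it waits, so a later arrival blocks on
// mu2 and cannot barge past it to grab mu.
func lockFair() {
	mu2.Lock()
	mu.Lock()
	mu2.Unlock()
}

func main() {
	done := make(chan struct{})
	g2 := 0
	go func() { // goroutine 1: holds mu almost continuously
		for {
			select {
			case <-done:
				return
			default:
			}
			lockFair()
			time.Sleep(100 * time.Microsecond) // "work" while holding mu
			mu.Unlock()
		}
	}()
	go func() { // goroutine 2: brief, occasional acquisitions
		for i := 0; i < 20; i++ {
			lockFair()
			g2++
			mu.Unlock()
			time.Sleep(100 * time.Microsecond)
		}
		close(done)
	}()
	<-done
	fmt.Println("g2 acquired mu", g2, "times") // g2 acquired mu 20 times
}
```

Crucially, goroutine 1 does its "work" (the sleep) while holding mu but not mu2, which is the window in which goroutine 2 can enqueue itself.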
If we take this approach in the lockskew test, it works much better:
Note that there is only one print per second but the program is averaging over 2500 loop iterations per second, about 500,000X faster than above.
Ideally developers would not need to know about, much less use, this kind of hack. Another possibility is to wrap this idiom or some other implementation into a (say) sync.FairMutex with guaranteed fairness properties. But again ideally developers would not need to know or think about these awful details. I do not want to see blog posts explaining when to use sync.Mutex and when to use sync.FairMutex.
If Go's sync.Mutex can do something better by default without sacrificing performance, it seems it should.
Proposal
The proposal is that we detect severe unfairness and revert to handoffs in that case. How to do that is an open question, which is why this is not a design doc.
Performance
As an approximate lower bound on the cost of this in the general case, I added one line to the implementation of sync.Mutex's Unlock:
Today at least, the implementation of Gosched essentially guarantees to run the goroutine just woken up by the Semrelease. This is not strictly speaking a handoff, since barging can still happen between the unlock and the acquisition in g2, but the handoff period is much smaller than it used to be, and in particular g1 is kept from reacquiring the lock. Anyway, the point is to evaluate how costly this might be, not to do a perfect job. A perfect job would require more lines changed in both sync.Mutex and the runtime (although perhaps not many).
With this line, lockskew with only a single lock runs at about the same speed as with two locks:
The pathological case is gone, but what about the common case? Barging is supposed to help throughput, so one expects that average throughput must have gone down. Doug Lea's paper uses as its benchmark multiple threads calling a random number generator protected by a lock. Each call acquires a lock, does a tiny amount of work, and releases the lock. The paper found that with 256 threads running on even single-processor systems, implementations allowing barging ran 10-100X faster than handoffs. Again this was with operating system threads, not goroutines, so the result may not carry over directly. A goroutine switch is much cheaper than a thread switch.
In fact, the result does not seem to carry over to goroutines, at least not in the random number case. Running
go test -bench . rsc.io/tmp/lockskew
will benchmark Go's rand.Int (with the same mutex around simple math) called from multiple goroutines simultaneously, from 1 to 256. As a worst case, I ran it with GOMAXPROCS=256 even though my system has only 12 cores. Here is the effect of adding the Gosched:

For the most part, the Gosched makes things better, not worse.
Although this microbenchmark doesn't show it, there may be a small degradation in real programs. I ran the golang.org/x/benchmarks/bench program with -bench=http, with and without the change. The change seemed to cause a roughly 5% increase in the reported "RunTime" result.

Perhaps with work to detect and fall back, instead of having the "handoff" on always, we could keep real programs running as they do today but fix these pathological (but real) behaviors at the same time. For example, one idea is to have the blocked mu.Lock call do the detection. In the pathological behavior, it goes to sleep, is awoken, and then can't acquire the lock and goes back to sleep. If it is woken up to find the lock unavailable (say) three times in a row, it could set a "now we're in handoff mode" bit in the lock word. The handoff mode bit could be sticky for the life of the lock, or it could be cleared on a call to mu.Unlock that finds no goroutines waiting.
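A toy model of just the detection part of that idea (condition-variable based, so barging genuinely can occur; acting on the handoff bit is omitted, and all names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// detectMutex illustrates only the detection heuristic: each waiter
// counts consecutive wakeups that find the lock already retaken by a
// barging goroutine; after three in a row it sets a sticky handoff
// flag. What Unlock would do differently in handoff mode is left out.
type detectMutex struct {
	mu      sync.Mutex
	cond    *sync.Cond
	locked  bool
	handoff bool // the sticky "now we're in handoff mode" bit
}

func newDetectMutex() *detectMutex {
	m := &detectMutex{}
	m.cond = sync.NewCond(&m.mu)
	return m
}

func (m *detectMutex) Lock() {
	m.mu.Lock()
	defer m.mu.Unlock()
	spurious := 0
	for m.locked {
		m.cond.Wait()
		if m.locked { // woken, but another goroutine barged in first
			spurious++
			if spurious >= 3 {
				m.handoff = true
			}
		}
	}
	m.locked = true
}

func (m *detectMutex) Unlock() {
	m.mu.Lock()
	m.locked = false
	m.cond.Signal() // wake one waiter; barging can occur before it runs
	m.mu.Unlock()
}

func main() {
	m := newDetectMutex()
	m.Lock()
	m.Unlock()
	fmt.Println("handoff mode:", m.handoff) // handoff mode: false
}
```

An uncontended lock never trips the counter, so the common case pays nothing beyond the existing wakeup path, which is the property the proposal is after.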
[1] http://gee.cs.oswego.edu/dl/papers/aqs.pdf but ignore section 5.2. @RLH says the theoretical analysis there is known to be flawed.
/cc @randall77 @dvyukov @aclements @RLH @dr2chase