Blocking mutexes #4
Conversation
To test the fairness, I wrote this little program:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"

	"github.com/juju/mutex"
	"github.com/juju/utils/clock"
)

func main() {
	spec := mutex.Spec{
		Name:  "foo",
		Clock: clock.WallClock,
		Delay: 250 * time.Millisecond,
	}
	var count uint64
	// Report and reset the acquisition count every 5 seconds.
	go func() {
		ticker := time.NewTicker(5 * time.Second)
		last := time.Now()
		for now := range ticker.C {
			count := atomic.SwapUint64(&count, 0)
			fmt.Println(count, "acquisitions over", now.Sub(last))
			last = now
		}
	}()
	// Repeatedly acquire the mutex, hold it briefly, and release it.
	for {
		m, err := mutex.Acquire(spec)
		if err != nil {
			panic(err)
		}
		atomic.AddUint64(&count, 1)
		time.Sleep(100 * time.Millisecond)
		m.Release()
	}
}
```

Running two instances side by side: with the old mutex implementation, one process would mostly starve out the other; with the new implementation, they're fairly equal.
I can confirm the fairness on Mac OS X. With the old code and 2 processes it was: … With the new implementation it says: …
I tried to confirm the fairness of the updated code on Windows, but the test suite does not pass:

```
FAIL: mutex_test.go:166: mutexSuite.TestLockBlocks
    mutex_test.go:176:
PASS: mutex_test.go:123: mutexSuite.TestLockContentionWithinProcessCancel 0.000s
FAIL: mutex_test.go:195: mutexSuite.TestProcessReleasesWhenDead
    mutex_test.go:206:
    mutex_test.go:219:
PASS: mutex_test.go:145: mutexSuite.TestSecondReleaseFine 0.001s
```

And it seems to have a bug: both the left and right processes ended up reporting 0 acquisitions / 5s, which means something is just broken.
jameinel
requested changes
Nov 8, 2017
Broken on Windows, I haven't dug into why.
Running the "fair" script, I got 13 acquisitions and then everything fell down into 0. I wonder if there is some failure to actually release the lock?
Interesting note: once I killed the original process, the second process was able to start acquiring the lock. So I ran it again, and if I run just one script, it manages to get the lock and reports 50/5s reliably. However, as soon as I start a second one, they both go to 0/5s. So it seems that contention causes the one that has held the lock to try to let go but not actually let go, until that process actually dies.
@jameinel thanks for picking that up. The issue was that I had mistakenly dropped a call to ReleaseMutex. The owning thread needs to call ReleaseMutex before closing the handle; I thought closing it would be enough, but apparently not.
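For anyone reading this later, here is a minimal sketch of the ordering being described. It is illustrative only: it uses golang.org/x/sys/windows rather than this package's own syscall wrappers, and the mutex name is made up.

```go
//go:build windows

// Sketch only (not the juju/mutex implementation): Win32 named-mutex
// ownership rules, showing ReleaseMutex before CloseHandle.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/windows"
)

func main() {
	// Named-mutex ownership belongs to an OS thread, so pin the goroutine.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	name, err := windows.UTF16PtrFromString("example-mutex")
	if err != nil {
		panic(err)
	}
	h, err := windows.CreateMutex(nil, false, name)
	if err != nil {
		panic(err)
	}

	// Block until this thread owns the mutex.
	if _, err := windows.WaitForSingleObject(h, windows.INFINITE); err != nil {
		panic(err)
	}
	fmt.Println("holding the mutex")

	// Release ownership explicitly. Closing the handle alone does not release
	// the mutex: other waiters stay blocked until the owning thread exits,
	// which matches the "stuck until the process dies" behaviour seen above.
	if err := windows.ReleaseMutex(h); err != nil {
		panic(err)
	}
	windows.CloseHandle(h)
}
```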
How does the fairness fare if we drop the wait from release to acquire to 10ms or 1ms?
'go test' passes on Windows, and 'go run fair\fair.go' shows the two sides
fairly sharing the mutex.
Updated to remove the "active" mutex tracking on Windows, since we have to lock the goroutine to an OS thread anyway, which prevents reentrancy.
Are you talking about the code in Andrew's manual test case?
I tried running 2 clients, each with a 10ms sleep, and on Windows they
settled down to both getting 250 acq per 5s.
I then changed it to 1 client with the 10ms sleep (as before), and one with a 1ms
sleep.
They stabilized at 400-450 acq per 5s.
So even when one holds it longer than the other, they still stay fair.
Running 3 clients (2 @ 1ms, 1 @ 10ms), they all stayed around 400 acq/5s.
This is all on Windows.
Running 3 Windows fair.go processes at 1ms sleep got 1500 acq/5s across all
three.
Do we have reasonable tests around behavior when we've decided that we
don't need the lock anymore and we get it anyway?
@howbazaar it fluctuates quite a lot more, but is still unfair with the old code. With the new code, it's fair either way.
With the old code and 1ms sleep (instead of 100ms), I see it bounce between
the two, where one wanders down to 430 acq, and then up to 2825 acq, and
then back down again. So it doesn't starve, but it isn't stable. The new
code is very close to perfectly fair.
I think the test I added (TestLockAfterTimeout) covers this reasonably well. It's non-deterministic in that the second (timed-out) acquire may wake up before or after the final acquire has started. I don't think we can force a particular path, but I'm open to ideas, of course. Are there any particular cases where you think we're lacking?
jameinel
reviewed
Nov 8, 2017
The fairness is much improved, so I think we have that right.
I'm curious if it is fair when you have multiple local goroutines, though.
```diff
+acquired at the same time, within and across process boundaries. If a
+process dies while the mutex is held, the mutex is automatically released.
+
+The *nix implementation uses flock, while the Windows implementation uses
```
jameinel
Nov 8, 2017
Owner
This "*" causes The diff Markdown preview to freak out for the entire rest of the file treating everything as italics. Hopefully its not actually a problem.
axw
Nov 8, 2017
Member
I foolishly assumed that godoc2md would escape, but alas. I'm going to be lazy this time and just spell out Linux and MacOS.
```diff
 ErrCancelled is returned.
-## type Spec
+
+## <a name="Spec">type</a> [Spec](/src/target/mutex.go?s=986:1583#L38)
```
jameinel
Nov 8, 2017
Owner
How were these generated? They feel like something that will become inaccurate quickly.
I see the godoc2md at the end. Is that already part of a make target (or similar) so it happens naturally?
```diff
+	// we must also acquire a legacy mutex. We acquire the new
+	// version first, since that provides the preferred blocking
+	// behaviour.
+	m2, err := acquireLegacy(
```
jameinel
Nov 8, 2017
Owner
Have we done any testing around what happens with racing new-and-old? I don't think we can deadlock, so we're probably fine. But just want to make sure we thought that part through.
```diff
+	var sent bool
+	waiters := active[name]
+	for !sent && len(waiters) > 0 {
+		w, waiters = waiters[0], waiters[1:]
```
axw
Nov 8, 2017
Member
We do, below: depending on whether len(waiters) > 0 or not, we update active or delete from it.
```diff
+		select {
+		case w.ch <- result:
+			sent = true
+		case <-w.done:
```
jameinel
Nov 8, 2017
Owner
Is this done equivalent to Cancel in the other lock code? Is there a reason to have the attributes named differently? Or does this loop not actually handle the case where we aren't looking for the lock anymore?
axw
Nov 8, 2017
Member
w.done is closed when the caller of Acquire aborts for whatever reason. That might be because it cancelled, or it might be because of a timeout, or because of a panic.
jameinel
Nov 8, 2017
Owner
So maybe it's just that you're looping until you find a waiter that isn't already done?
In that case we are releasing the flock between waiters, so that seems good.
```diff
+			// remove it from the list
+			// and try the next one
+		}
+	}
```
jameinel
Nov 8, 2017
Owner
It seems odd that we grab a single flock() and then run all waiters.
Does that mean if we have 10 goroutines that want the lock in this process, and 10 in another process, all of this process will run if we get the lock, before we release the flock and let any of their 10 run?
I think it would be better to drop and re-acquire the flock for every waiter.
axw
Nov 8, 2017
Member
Note the "!sent" in the for condition: we stop as soon as we've sent to one of them.
```diff
-func waitForSingleObject(handle syscall.Handle) error {
-	noWait := 0
-	result, _, errno := syscall.Syscall(procWaitForSingleObject.Addr(), 2, uintptr(handle), uintptr(noWait), 0)
+func waitForMultipleObjects(handles []syscall.Handle, timeout time.Duration) (int, error) {
```
jameinel
Nov 8, 2017
Owner
I'm trusting your wrapper for this; the ranging with _WAIT_ABANDONED_0 seems really strange.
Do you have a link for the docs so that we can have someone else understand this in the future?
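For the record, the "ranging" comes from the documented Win32 contract for WaitForMultipleObjects: the return value is WAIT_OBJECT_0 + i when handle i was signalled, WAIT_ABANDONED_0 + i when handle i is a mutex whose owner exited without releasing it, and WAIT_TIMEOUT or WAIT_FAILED otherwise. Below is a rough sketch of how such a result is typically decoded; the constants are spelled out locally and this is not the wrapper from this PR.

```go
package main

import "fmt"

// Win32 wait return values, as documented for WaitForMultipleObjects.
const (
	waitObject0    = 0x00000000
	waitAbandoned0 = 0x00000080
	waitTimeout    = 0x00000102
	waitFailed     = 0xFFFFFFFF
)

// decodeWait interprets a WaitForMultipleObjects result for n handles. It
// returns the index of the handle that satisfied the wait and whether that
// handle was a mutex abandoned by a dead owner.
func decodeWait(ret uint32, n int) (index int, abandoned bool, err error) {
	switch {
	case ret < waitObject0+uint32(n): // WAIT_OBJECT_0 .. WAIT_OBJECT_0+n-1
		return int(ret - waitObject0), false, nil
	case ret >= waitAbandoned0 && ret < waitAbandoned0+uint32(n):
		// WAIT_ABANDONED_0 .. WAIT_ABANDONED_0+n-1: the owner died while
		// holding the mutex; the caller now owns it.
		return int(ret - waitAbandoned0), true, nil
	case ret == waitTimeout:
		return -1, false, fmt.Errorf("wait timed out")
	default: // includes waitFailed
		return -1, false, fmt.Errorf("wait failed: %#x", ret)
	}
}

func main() {
	index, abandoned, err := decodeWait(waitAbandoned0+1, 2)
	fmt.Println(index, abandoned, err) // 1 true <nil>
}
```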
jameinel
approved these changes
Nov 8, 2017
I think my concern about fairness with multiple goroutines was unfounded, so with the minor tweaks that you've already addressed, LGTM.
axw
commented
Nov 8, 2017 (edited)
Update mutex implementations to make blocking
syscalls, rather than polling. This should ensure
fairer scheduling.
On Windows, we now use named mutexes.
On Linux and MacOS, we use blocking flock.
Due to the way signals work in Go we cannot
interrupt a flock syscall, so instead we
allow the syscall to complete and then release
the lock if nothing is waiting for it. To
avoid an unbounded growth of goroutines
making flock syscalls, we ensure that only
one goroutine is making a call at a time.
For compatibility with older implementations,
the Linux implementation also acquires an
abstract domain socket. On Windows, we do
the same with a named semaphore.
Tested on Linux and Wine (Windows).
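A rough sketch of the flock strategy described in this message, under the assumption that a dedicated goroutine serialises the blocking syscalls. The names flockAcquire and request are invented for illustration; this is not the juju/mutex code itself.

```go
//go:build linux || darwin

// Illustrative only: a single goroutine performs the blocking flock syscall
// and hands the locked fd to the caller over a channel. The syscall cannot be
// interrupted, so if the caller stops waiting, the lock is released as soon
// as the syscall returns.
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

type request struct {
	result chan int      // receives the locked file descriptor
	done   chan struct{} // closed if the caller stopped waiting
}

// flockAcquire serialises blocking flock calls for a single lock file.
func flockAcquire(path string, reqs <-chan request) {
	for req := range reqs {
		fd, err := unix.Open(path, unix.O_CREAT|unix.O_RDONLY, 0600)
		if err != nil {
			panic(err) // sketch: real code reports this to the caller
		}
		if err := unix.Flock(fd, unix.LOCK_EX); err != nil { // blocks, uninterruptible
			unix.Close(fd)
			panic(err)
		}
		select {
		case req.result <- fd:
			// The caller now owns the lock and must close fd to release it.
		case <-req.done:
			// The caller gave up (timeout/cancel) while we were blocked:
			// release the lock immediately so others can make progress.
			unix.Close(fd)
		}
	}
}

func main() {
	reqs := make(chan request)
	go flockAcquire("/tmp/example.lock", reqs)

	req := request{result: make(chan int), done: make(chan struct{})}
	reqs <- req
	select {
	case fd := <-req.result:
		fmt.Println("acquired lock on fd", fd)
		unix.Close(fd) // closing the fd drops the flock
	case <-time.After(time.Second):
		close(req.done) // stop waiting; flockAcquire releases on our behalf
	}
}
```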