Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: shrink types in sync package #37142

Open
balasanjay opened this issue Feb 8, 2020 · 24 comments
Open

sync: shrink types in sync package #37142

balasanjay opened this issue Feb 8, 2020 · 24 comments
Labels
Milestone

Comments

@balasanjay
Copy link
Contributor

@balasanjay balasanjay commented Feb 8, 2020

Proposal: shrink types in sync package

Current API

The types in the sync package make use of semaphore operations provided by the runtime package. Specifically, they use the following two APIs

semacquire(s *uint32): waits until *s > 0 and atomically decrements it.

semrelease(s *uint32): atomically increments *s and notifies any goroutine blocked in a semacquire (if any).

Each semaphore logically manages a queue of goroutines. The sync types store their atomically-modified state separate from their semaphores; for this reason, semaphores have to support out-of-order operations (e.g. when unlocking a sync.Mutex, you could have a semrelease show up to the semaphore 'before' a corresponding semacquire). The separation into atomic state and semaphores also results in the sync types having a larger memory footprint.

Proposed new API

Let's consider a slightly different API for queueing and dequeueing goroutines.

semqueue(s *uint32, mask uint32, cmp uint32, bit uint8) bool: checks (*s & mask) == cmp; if false, will immediately return false (indicating 'barging'). Otherwise, puts the calling goroutine to sleep; if this goroutine is the first to sleep for the given s and bit, will also perform *s = *s | (1 << bit).

semdequeue(s *uint32, bit uint8, dequeueAll bool) int: wakes up 1 goroutine sleeping on a corresponding semqueue call with a matching s and bit (or all, if dequeueAll is true). If there are no more goroutines sleeping, then will unset the corresponding bit in *s: *s = *s & ^(1 << bit). Returns the number of woken goroutines.

Notes about this API:
The names listed above are placeholders (other suggestions welcome).

All of the operations for the given function are atomic (by making use of a lock keyed on s).

Also, the real signatures will likely need to be more complicated to support mutex profiling, and FIFO vs LIFO queueing.

Benefits

Given this kind of API, I believe we could shrink the types in the sync package. This is because this new API only ever modifies a single bit in an atomic word, so all the other bits are available to store atomic state. We can manage multiple logical queues using the same 4-byte atomic word; this will allow it to handle types like sync.RWMutex that have a queue for readers and a queue for writers. Here are the possible savings for reimplementing various types in the sync package:

Type Current Size (bytes) New Size (bytes)
sync.Once 12 4
sync.Mutex 8 4
sync.RWMutex 24 4

These are the ones I'm fairly confident about. I also think we can shrink sync.Waitgroup, and sync.Cond (the former from 12 bytes to 4 bytes, and I think we can shave off 28 bytes from the latter), but I'm less sure of these two as I'm unfamiliar with their implementations.

Also, while this isn't the goal, I think this might also improve the performance of (*sync.RWMutex).Unlock. It currently repeatedly calls semrelease in a loop to wake readers, acquiring and releasing locks for each iteration. The API above offers batch dequeue functionality, which will allow us to avoid a number of atomic operations.

Backwards Compatibility

This is only modifying internal details of the Go runtime, so from that perspective it should be backwards compatible. There are two other considerations here:

If any users are carefully sizing their types to avoid false sharing or to align with cache-line boundaries, changing the size of types in sync could be problematic for them.

There are users who are using go:linkname to directly invoke semacquire and semrelease. See gVisor for an example of this. So, even if we convert all uses in the standard library, we will likely want to have these functions stick around for a little while.

Edits

  • Reordered types in order of implementation simplicity
  • Increased size of sync.RWMutex from 4 to 8 bytes (see comments for explanation)
  • Decreased size of sync.RWMutex back down to 4 bytes (see comments for explanation)
@gopherbot gopherbot added this to the Proposal milestone Feb 8, 2020
@gopherbot gopherbot added the Proposal label Feb 8, 2020
@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 8, 2020

Tagging owners of sync package:
@rsc @ianlancetaylor @dvyukov @aclements

@networkimprov
Copy link

@networkimprov networkimprov commented Feb 9, 2020

It would help if any new implementation would also accommodate a future RWMutex.TryRLock(), which is not trivial to implement (unlike TryLock).

Discussion at https://groups.google.com/d/topic/golang-nuts/h_Dm-XtrlOA/discussion

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 10, 2020

@networkimprov That seems entirely orthogonal to this proposal (it could be implemented with the existing internal semaphore APIs or with the proposed new ones). This proposal also doesn't involve any public API changes. It would therefore seem best to file a separate proposal for that.

@networkimprov
Copy link

@networkimprov networkimprov commented Feb 10, 2020

I wasn't suggesting you add a new API in this proposal, but flagging a missing feature of sync.RWMutex which any reimplementation should allow for.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 10, 2020

Got it, thanks. I believe the implementation I have in mind could accommodate a TryRLock method.

@rsc
Copy link
Contributor

@rsc rsc commented Feb 12, 2020

@networkimprov, I don't believe we've decided that TryRLock is a good idea to add at all, so it seems not necessary to condition this proposal on support for it.

This is not a proposal in the sense that it does not create new API.
It definitely needs a discussion and decision from the owners of the package
(me, @dvyukov, @aclements is a good set).

Removing from the proposal process and changing to NeedsDecision.

@rsc rsc added NeedsDecision and removed Proposal labels Feb 12, 2020
@gopherbot gopherbot added Proposal and removed NeedsDecision labels Feb 12, 2020
@rsc rsc modified the milestones: Proposal, Unplanned Feb 12, 2020
@rsc rsc changed the title Proposal: shrink types in sync package sync: shrink types in sync package Feb 12, 2020
@rsc rsc added NeedsDecision and removed Proposal labels Feb 12, 2020
@dvyukov
Copy link
Member

@dvyukov dvyukov commented Feb 13, 2020

Are there any significant downsides? It seems that this mostly improve things and don't change public interfaces, if so this like a no brainer to me wrt to decision. Somebody just needs to do it :)
I understand it may slightly increase code complexity, or maybe not (comparable complexity to current code). It looks reasonable to me, moreover the complexity is contained within semaphore/mutex.
Re false sharing, I don't think we should significantly take this into account. Nobody never promised exact sizes, and there are ways to resolve this and hopefully the careful users have build asserts for their assumptions re implementation details.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 13, 2020

I'd be happy to implement this, assuming that owners are ok with the plan.

(Since I'm not a fan of chainsaw juggling, I'd want to first write the new implementations in Promela or Pluscal first to make sure they're sound and the APIs proposed above are usable)

I don't know of any significant downsides. I don't have concrete implementations yet, so I don't have an answer for you regarding code complexity. Based on the implementation in my head, I think it'd be comparable.


One question: what is the maximum number of concurrent RLocks we need to support in RWMutex?

I realized that the original implementation I had in mind for RWMutex used up 7 bits for control state, leaving only 25 bits for a count of readers; this would cap the number of concurrent RLocks to 33 million, which feels like it might be too low. So for now, I've increased the RWMutex type up to 8 bytes.

I think its possible to compress the control state down into 5 bits (two queue bits, 1 "writerStarving" bit, and 2 bits that record whether the mutex is unlocked/read-locked/write-locked), which increases the max count to 134 million concurrent RLocks. If that's enough, then we should be able to use 4 bytes again.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Feb 13, 2020

I don't think we have exact number.
I hope nobody blocks more than 32M goroutines on a single mutex :) However, this number seems to be close to max number of goroutine users may use today. And assuming all worker goroutines rlock some mutex and there is 1 episodic writer, this scenario does not look completely impossible.
Perhaps it's possible to provide some kind of safe but slow fallback? E.g. make all excessive readers spin and sleep until number of readers drops.

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 13, 2020

what is the maximum number of concurrent RLocks we need to support in RWMutex?

Perhaps it's possible to provide some kind of safe but slow fallback?

sync.RWMutex already scales very poorly anyway (#17973), so it is unlikely that many folks are using a large number of concurrent readers, and a safe-but-slow fallback seems fine for now.

(I notice that @balasanjay is the author of CL 215361, so this probably not new information but perhaps useful to other readers of this thread.)

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 13, 2020

I am now pretty sure that 5 bits of control state are enough. I think 134 million concurrent RLocks should probably be enough for everybody.

If anyone disagrees with this, then I think one possible solution is to have readers queue even when the mutex is in read-locked mode. And RUnlocks that see readers queued with no writers queued would wake a single reader. That overflow behavior seems difficult to model without a state-space explosion, though, so I'd rather avoid it. If we do need to avoid it, then representing an RWMutex in two words is probably the simplest solution.

@dvyukov One other downside: the current RLock implementation's fast-path is an AddUint32. I think the new implementation would need to increment the reader count with a CAS-loop (i.e. only increment the reader count if the mutex stays in read-locked mode).

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Feb 13, 2020

Unless this solves actual scalability problems (the benefits section does not list much in this regard, where the only ones listed are apparently using less memory and making the Unlock slow path faster) I would recommend to ensure we don't regress performance of the fast paths.

E.g. making the Unlock slow path a bit faster at the expense of the fast path does not seem like a worthwhile direction.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 14, 2020

Fair enough; I'll do some work to quantify the difference between an AddUint32-based fast-path and a CAS-loop-based fast path (by just making the current implementation use a CAS-loop).

The fast path is only actually fast when there is no contention, and in that case, I'd expect a Load+CAS to be roughly as fast as an atomic add. FWIW, I did check that both Folly's SharedMutex and Abseil's Mutex types use a Load+CAS for their fast paths.

To be clear, the benefit I'm after is simply to reduce the space used by the sync types; when trying to write cache-line optimized data structures, you don't really have a lot of room to spare and usually are forced to resort to terrible bit-packing hacks.
Using 24 bytes for an RWMutex type makes this much harder than necessary.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 14, 2020

Hmm, I think it might be impossible to make an inlineable fast-path using a Load+CAS. This function is marked as not-inlineable:

func (rw *RWMutex) rlock() {
	rc := atomic.LoadInt32(&rw.readerCount)
	if atomic.CompareAndSwapInt32(&rw.readerCount, rc, rc+16) {
		return
	}

	rw.rlockSlow()
}                                                             

This has a cost of 82, and I can't seem to get it below 80. Also, any actual implementation would need slightly more in here (e.g. a test of "rc & constantMask == constantValue" before the CAS). That's rather unfortunate. I suppose I could just add a hack to the compiler to increase the inlining limit for some functions in the sync package, but that feels somewhat gross.

@OneOfOne
Copy link
Contributor

@OneOfOne OneOfOne commented Feb 14, 2020

@balasanjay #17566 and #22310 are slightly related to that last comment.

@josharian
Copy link
Contributor

@josharian josharian commented Feb 14, 2020

@balasanjay I think you should hack the compiler temporarily to inline this, in order to make progress. If it turns out the Load+CAS isn't performant enough, no harm done. And if it is a viable path forward, we can focus our attention on getting it past the execrable inliner.

E.g. I just noticed that we're charging a budget of 2 for intrinsic calls, one for the intrinsic and one for the CALLFUNC. But given that they're assumed to be comparable in cost to an arithmetic operation, we should just assign them a cost of 1. I suspect this is an oversight/accident. (Insofar as any of this can be called principled.) Fixing this gets your function inlined, barely. :)

diff --git a/src/cmd/compile/internal/gc/inl.go b/src/cmd/compile/internal/gc/inl.go
index 48c7de327d..4379be18ba 100644
--- a/src/cmd/compile/internal/gc/inl.go
+++ b/src/cmd/compile/internal/gc/inl.go
@@ -321,7 +321,7 @@ func (v *hairyVisitor) visit(n *Node) bool {
                }
 
                if isIntrinsicCall(n) {
-                       v.budget--
+                       // Treat like any other node.
                        break
                }

I'll plan to mail this as a CL during Go 1.15 and see whether it flies.

In the meantime, please forge ahead using whatever means necessary. :)

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 15, 2020

Alrighty, I benchmarked AddInt32 vs a Load+CAS (both a branch-before variant, and a branch-after variant), and it appears I was wrong to be optimistic:

name              time/op
Add               5.91ns ± 4%
AddCASBranchy     10.4ns ± 8%
AddCASBranchLess  11.0ns ± 3%
Code
func BenchmarkAdd(b *testing.B) {
        foo := int32(0)
        for i := 0; i < b.N; i++ {
                nv := atomic.AddInt32(&foo, 2)
                if nv&1 == 0 {
                        continue
                }
                panic("slow path")
        }
        _ = foo
}

func BenchmarkAddCASBranchy(b *testing.B) {
        foo := int32(0)
        for i := 0; i < b.N; i++ {
                ov := atomic.LoadInt32(&foo)
                if ov&1 == 0 && atomic.CompareAndSwapInt32(&foo, ov, ov+2) {
                        continue
                }
                panic("slow path")
        }

        _ = foo
}

func BenchmarkAddCASBranchLess(b *testing.B) {
        foo := int32(0)
        for i := 0; i < b.N; i++ {
                ov := atomic.LoadInt32(&foo) & (^1)
                if atomic.CompareAndSwapInt32(&foo, ov, ov+2) {
                        continue
                }
                panic("slow path")
        }

        _ = foo
}

That's unfortunate. I think I know how to implement RLock with an AddUint32-based fast-path and a fuzzy count of readers, but its not quite as straightforward as what I had in mind.

I'll try to spend some time this weekend writing a model of my idea to validate it.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 16, 2020

I wrote a model of sync.Mutex using the proposed new API, and simulated all interleavings for 4 concurrent goroutines. I successfully verified safety (specifically that no two goroutines were ever in the critical section at the same time).

I then tried to verify starvation-freedom for the model, but even with 2 goroutines, the model checker quickly found a trivial counter-example. I think this counter-example is an issue even with the current Mutex implementation. Essentially, the sequence of events is:

  • W1 locks mutex
  • W2 tries to lock mutex, finds it locked and queues
  • W1 unlocks mutex, waking W2 (but W2 isn't running yet)
  • W1 locks mutex (barging ahead of W2)
  • W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex
  • W1 unlocks mutex (using fast-path unlock CAS)
  • W2's starvation mode CAS fails (because the mutex was unlocked)
  • W1 locks mutex (using fast-path lock CAS)
  • W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex

We're in a loop, so the strong starvation freedom property fails to verify for goroutine W2 (the property I was trying to verify was "if a goroutine starts trying to lock the mutex, then that goroutine will eventually acquire the mutex").

That said, this execution trace seems fairly unrealistic. I don't think we've seen starvation of this sort with the current implementation in practice (which AFAICT, has an exactly analogous starvation event possible).

I tried a weaker starvation freedom property ("if a goroutine starts trying to lock the mutex, then some goroutine will eventually acquire the mutex") and verified that for 3 goroutines. That gives me confidence that there isn't some sequence of events which leaves the mutex in a corrupt state that can never be locked again. I might leave 4 goroutines running overnight (liveness properties are much slower to check).

I'll work on adding read-locks to the model next.

Here's the model so far:
rwmutex.tla
rendered to pdf

@CAFxX
Copy link
Contributor

@CAFxX CAFxX commented Feb 16, 2020

W2 tries to lock mutex, finds it locked and decides to enable "starvation mode" for the mutex
W1 unlocks mutex (using fast-path unlock CAS)

Isn't this impossible in the current implementation? If starving we will fall back to the slow path, since new will not be 0.

go/src/sync/mutex.go

Lines 186 to 191 in 6917529

new := atomic.AddInt32(&m.state, -mutexLocked)
if new != 0 {
// Outlined slow path to allow inlining the fast path.
// To hide unlockSlow during tracing we skip one extra frame when tracing GoUnblock.
m.unlockSlow(new)
}

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 17, 2020

If the mutex were successfully put into starvation mode, then yep, it would prevent the fast path unlock.

However, what I'm saying is that W2 decides to enable starvation mode but has not actually done so yet.
Then, W1 unlocks the mutex (using the fast-path).
Then W2 actually tries to CAS in the starvation mode bit; this CAS to enable starvation mode will fail (because W1 unlocked the mutex).

In this trace, the mutex never successfully gets put into starvation mode.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 17, 2020

Alrighty, I have a model of an RWMutex implementation.

I think it ticks all the boxes (has the same fast paths, handles writer starvation, alternates between writers and batches of readers (also, as a nice-to-have, it has 3 states stored in 2 bits, leaving 1 leftover state for use by a future solution to #17973), and fits in a single uint32 (assuming we're ok with max 133 million readers)). I believe we could implement Mutex in terms of this RWMutex; I think if no one ever calls any of the reader APIs, it'll basically identically to today's Mutex.

One issue I ran into is that it appears that the semdequeue API I proposed above is a bit too simple. In particular, when a writer is handing off a lock to a waiting batch of readers, it wants to atomically switch the state of mutex from "write-locked" to "read-locked", and set the reader-count to the number of dequeued goroutines, and clear the flag indicating that there are readers queued. The original API I proposed above will only do the last one of these.

I think its possible to work around this by letting the reader count go "negative" and try and work with async events again.

But I think its simpler to just split the dequeue API into two to accommodate this use-case cleanly. One is the semdequeueSingle (pretty much the same as what I said before). Dequeues a single goroutine and if its the last one in the queue unsets the waiting bit (we don't need anything fancier for this).

The other is semdequeueAll, which will wake all goroutines, and crucially before unsetting the queue bit, it invokes a passed-in function pointer to compute a new value for the atomic word given the old value and the number of dequeuing goroutines. It takes the output of that function, clears the queue-bit from that, and then CAS-es that final value into place (all of this in a loop, until the CAS succeeds). This additional hook will let us run the necessary logic to atomically wake the goroutines and handoff the mutex to them (it'd be atomic because we're holding the runtime-internal mutex while we do all this, meaning no new readers can queue up while we're doing this). I admit its a tiny bit sketchy to call arbitrary callbacks while holding a runtime lock; if folks prefer, we can just hard-code the behavior we want directly in the runtime (we'll likely need to pass a flag to enable the "RWMutex" dequeue behavior, because I suspect when we get around to migrating WaitGroup or Cond, they'll need slightly different semantics because they probably don't need to store a count in the same way).


I've verified safety properties for the model with 3 readers and 1 writer, with 1 reader and 3 writers, and with 2 readers and 2 writers (which found a number of errors in the model, I should note). Anything bigger looked like it would take too long on my laptop.

I want to run a simulation with 3 readers and 2 writers (after 3 hours of simulation, it had only gotten ~60 steps deep, and I think it'll finally end up being something like 100 steps). I'll try that on my workstation sometime this week, or failing that I guess I'll need to figure out how to make the distributed simulation version work.

I should probably try to optimize the model a bit. I wrote it like pseudocode, so each function is composed of many atomic steps. I bet if I wrote it more like math, I could reduce the number of atomic steps in each function, which might dramatically reduce the state space that needs to be explored.

Updated model files:
rwmutex.tla
rwmutex.pdf

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Feb 17, 2020

I would say it's OK to have semdequeue to know about rwmutex details and just do the necessary update. It's nice to have it generic, but we also don't have to, it's completely internal api. We could make it accept an "enum mode" parameter to support rwmutex/cond/waitgroup/etc.

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 19, 2020

Cool, 3 readers and 2 writers verified successfully. Also the number of states dramatically increased:

1972646627 states generated, 435337469 distinct states found, 0 states left on queue.

Any other objections to me working on this when the tree's open?

@balasanjay
Copy link
Contributor Author

@balasanjay balasanjay commented Feb 24, 2020

@rsc @aclements Gentle ping.

(If this sounds good, I'm hoping to work on it next weekend and have CLs out for Monday)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.