Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: RWMutex scales poorly with CPU count #17973

Open
bcmills opened this issue Nov 18, 2016 · 50 comments
Open

sync: RWMutex scales poorly with CPU count #17973

bcmills opened this issue Nov 18, 2016 · 50 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Nov 18, 2016

On a machine with many cores, the performance of sync.RWMutex.R{Lock,Unlock} degrades dramatically as GOMAXPROCS increases.

This test program:

package benchmarks_test

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkRWMutex(b *testing.B) {
	for ng := 1; ng <= 256; ng <<= 2 {
		b.Run(fmt.Sprint(ng), func(b *testing.B) {
			var mu sync.RWMutex
			mu.Lock()

			var wg sync.WaitGroup
			wg.Add(ng)

			n := b.N
			quota := n / ng

			for g := ng; g > 0; g-- {
				if g == 1 {
					quota = n
				}

				go func(quota int) {
					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}
					wg.Done()
				}(quota)

				n -= quota
			}

			if n != 0 {
				b.Fatalf("Incorrect quota assignments: %v remaining", n)
			}

			b.StartTimer()
			mu.Unlock()
			wg.Wait()
			b.StopTimer()
		})
	}
}

degrades by a factor of 8x as it saturates threads and cores, presumably due to cache contention on &rw.readerCount:

# ./benchmarks.test -test.bench . -test.cpu 1,4,16,64
testing: warning: no tests to run
BenchmarkRWMutex/1      20000000                72.6 ns/op
BenchmarkRWMutex/1-4    20000000                72.4 ns/op
BenchmarkRWMutex/1-16   20000000                72.8 ns/op
BenchmarkRWMutex/1-64   20000000                72.5 ns/op
BenchmarkRWMutex/4      20000000                72.6 ns/op
BenchmarkRWMutex/4-4    20000000               105 ns/op
BenchmarkRWMutex/4-16   10000000               130 ns/op
BenchmarkRWMutex/4-64   20000000               160 ns/op
BenchmarkRWMutex/16     20000000                72.4 ns/op
BenchmarkRWMutex/16-4   10000000               125 ns/op
BenchmarkRWMutex/16-16  10000000               263 ns/op
BenchmarkRWMutex/16-64   5000000               287 ns/op
BenchmarkRWMutex/64     20000000                72.6 ns/op
BenchmarkRWMutex/64-4   10000000               137 ns/op
BenchmarkRWMutex/64-16   5000000               306 ns/op
BenchmarkRWMutex/64-64   3000000               517 ns/op
BenchmarkRWMutex/256                    20000000                72.4 ns/op
BenchmarkRWMutex/256-4                  20000000               137 ns/op
BenchmarkRWMutex/256-16                  5000000               280 ns/op
BenchmarkRWMutex/256-64                  3000000               602 ns/op
PASS

A "control" test, calling a no-op function instead of RWMutex methods, displays no such degradation: the problem does not appear to be due to runtime scheduling overhead.

@josharian
Copy link
Contributor

@josharian josharian commented Nov 18, 2016

@josharian
Copy link
Contributor

@josharian josharian commented Nov 18, 2016

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 18, 2016

It may be difficult to apply the algorithm described in that paper to our existing sync.RWMutex type. The algorithm requires an association between the read-lock operation and the read-unlock operation. It can be implemented by having read-lock/read-unlock always occur on the same thread or goroutine, or by having the read-lock operation return a pointer that is passed to the read-unlock operation. Basically the algorithm builds a tree to avoid contention, and requires each read-lock/read-unlock pair to operate on the same node of the tree.

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond. That is, it doesn't seem to require any special relationship with the runtime package.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 18, 2016

Locking per-P slot may be enough and is much simpler:
https://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/co/distributedrwmutex.go

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 18, 2016

What happens when a goroutine moves to a different P between read-lock and read-unlock?

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 18, 2016

RLock must return a proxy object to unlock, that object must hold the locked P index.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 18, 2016

The existing RWMutex API allows the RLock call to occur on a different goroutine from the RUnlock call. We can certainly assume that most RLock / RUnlock pairs occur on the same goroutine (and optimize for that case), but I think there needs to be a slow-path fallback for the general case.

(For example, you could envision an algorithm that attempts to unlock the slot for the current P, then falls back to a linear scan if the current P's slot wasn't already locked.)

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 18, 2016

At any rate: general application code can work around the problem (in part) by using per-goroutine or per-goroutine-pool caches rather than global caches shared throughout the process.

The bigger issue is that sync.RWMutex is used fairly extensively within the standard library for package-level locks (the various caches in reflect, http.statusMu, json.encoderCache, mime.mimeLock, etc.), so it's easy for programs to fall into contention traps and hard to apply workarounds without avoiding large portions of the standard library. For those use-cases, it might actually be feasible to switch to something with a different API (such as having RLock return an unlocker).

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 18, 2016

For these cases in std lib atomic.Value is much better fit. It is already used in json, gob and http. atomic.Value is perfectly scalable and virtually zero overhead for readers.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 18, 2016

I agree in general, but it's not obvious to me how one could use atomic.Value to guard lookups in a map acting as a cache. (It's perfect for maps which do not change, but how would you add new entries to the caches with that approach?)

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 18, 2016

What kind of cache do you mean?

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 18, 2016

E.g., the one in reflect.ptrTo.

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 18, 2016

See e.g. encoding/json/encode.go:cachedTypeFields

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 18, 2016

Hmm... that trades a higher allocation rate (and O(N) insertion cost) in exchange for getting the lock out of the reader path. (It basically pushes the "read lock" out to the garbage collector.)

And since most of these maps are insert-only (never deleted from), you can at least suspect that the O(N) insert won't be a huge drag: if there were many inserts, the maps would end up enormously large.

It would be interesting to see whether the latency tradeoff favors the RWMutex overhead or the O(N) insert overhead for more of the standard library.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 18, 2016

@dvyukov Thanks. For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

@quentinmit quentinmit added the NeedsFix label Nov 18, 2016
@quentinmit quentinmit added this to the Go1.8Maybe milestone Nov 18, 2016
@josharian
Copy link
Contributor

@josharian josharian commented Nov 19, 2016

It would be feasible to implement the algorithm as part of a new type, in which the read-lock operation returned a pointer to be passed to the read-unlock operation. I think that new type could be implemented entirely in terms of sync/atomic functions, sync.Mutex, and sync.Cond.

Indeed. Seems like it might be worth an experiment. If nothing else, might end up being useful at go4.org.

For these cases in std lib atomic.Value is much better fit.

Not always. atomic.Value can end up being a lot more code and complication. See CL 2641 for a worked example. For low level performance critical things like reflect, I'm all for atomic.Value, but much of the rest of the standard library, it'd be nice to fix the scalability of RWMutex (or have a comparably easy to use alternative).

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 21, 2016

Note that in most of these cases insertions happen very, very infrequently. Only during server warmup when it receives a first request of a new type or something. While reads happen all the time. Also, no matter how scalable RWMutex is, it still blocks all readers during updates increasing latency and causing large overheads for blocking/unblocking.

For the reflect.ptrTo case I wrote it up as https://golang.org/cl/33411. It needs some realistic benchmarks--microbenchmarks won't prove anything one way or another.

Just benchmarked it on realistic benchmarks in my head. It is good :)

@dvyukov
Copy link
Member

@dvyukov dvyukov commented Nov 21, 2016

See CL 2641 for a worked example.

I would not say that it is radically more code and complication. Provided that one does it right the first time, rather than do it non-scalable first and then refactor everything.

@rsc rsc modified the milestones: Go1.9Early, Go1.8Maybe Nov 21, 2016
@gopherbot
Copy link

@gopherbot gopherbot commented Nov 21, 2016

CL https://golang.org/cl/33411 mentions this issue.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 1, 2016

https://go-review.googlesource.com/#/c/33852/ has a draft for a more general API for maps of the sort used in the standard library; should I send that for review? (I'd put it in the x/sync repo for now so we can do some practical experiments.)

@davidlazar
Copy link
Member

@davidlazar davidlazar commented Dec 2, 2016

@jonhoo built a more scalable RWMutex here: https://github.com/jonhoo/drwmutex/

@minux
Copy link
Member

@minux minux commented Dec 3, 2016

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 3, 2016

@minux That would be easy enough to fix by using runtime.NumCPU() instead.

@minux
Copy link
Member

@minux minux commented Dec 3, 2016

@jonhoo
Copy link

@jonhoo jonhoo commented Dec 3, 2016

@minux I did at some point have a benchmark running on a modified version of Go that used P instead of CPUID. Unfortunately, I can't find that code any more, but from memory it got strictly worse performance than the CPUID-based solution. The situation could of course be different now though.

@baryluk
Copy link

@baryluk baryluk commented Mar 21, 2019

					for i := 0; i < quota; i++ {
						mu.RLock()
						mu.RUnlock()
					}

It is pointless to use RWMutex when you do nothing under the lock.

For RLock to be useful and performant/scalable, you really need to have some work between RLock and RUnlock.

@networkimprov
Copy link

@networkimprov networkimprov commented Nov 7, 2019

Maybe fixed here #33747?

@balasanjay
Copy link
Contributor

@balasanjay balasanjay commented Jan 6, 2020

I spent some time hacking on this over the holidays, I'll try and have some CLs ready sometime this week.

(@networkimprov seems unlikely, this issue is about a more fundamental problem with RWMutexes)

@balasanjay
Copy link
Contributor

@balasanjay balasanjay commented Jan 11, 2020

The good news is that the new implementation significantly helps the performance of this benchmark in the multi-threaded case.

name                      old time/op  new time/op   delta
RWMutexIssue17973/1-48    12.5ns ± 8%   24.2ns ± 8%    +92.82%  (p=0.008 n=5+5)
RWMutexIssue17973/4-48    34.3ns ± 6%   10.3ns ± 5%    -69.98%  (p=0.008 n=5+5)
RWMutexIssue17973/16-48   36.9ns ± 4%    2.8ns ± 1%    -92.55%  (p=0.016 n=5+4)
RWMutexIssue17973/64-48   35.9ns ± 3%    1.7ns ± 3%    -95.38%  (p=0.008 n=5+5)
RWMutexIssue17973/256-48  36.3ns ± 3%    1.7ns ± 3%    -95.41%  (p=0.008 n=5+5)

The bad (somewhat expected) news is that it penalizes write-heavy workloads significantly. An example is the current benchmark "RWMutexUncontended":

name                      old time/op  new time/op   delta
RWMutexUncontended-48     2.29ns ± 0%  74.92ns ± 2%  +3171.62%  (p=0.016 n=4+5)
RWMutexWrite100-48         145ns ± 7%     96ns ± 7%    -33.70%  (p=0.008 n=5+5)
RWMutexWrite10-48          101ns ± 2%    153ns ± 5%    +51.74%  (p=0.008 n=5+5)
RWMutexWorkWrite100-48     355ns ± 2%    211ns ±20%    -40.65%  (p=0.008 n=5+5)
RWMutexWorkWrite10-48      535ns ± 8%    463ns ±13%    -13.47%  (p=0.016 n=5+5)

Coincidentally, RWMutexUncontended is perfectly designed as a torture test for this implementation. It takes 2 concurrent read-locks (which puts the lock into contention mode), and then takes a write lock, which has to take the lock back out of contention mode by doing an O(n) operation (where n here is GOMAXPROCS).

I need to write some more torture mode benchmarks that tickle the worst cases for the new implementation, and then I'll try to figure out how to upload these CLs to Gerrit.

Also, the new implementation is somewhat more lax about detecting mutex misuse; e.g. if you RUnlock() an unlocked mutex, the current implementation will consistently crash your program. The new implementation might not immediately do so when in contention mode. It might require several more RUnlocks to be certain of the issue and throwing (or calling Lock() on that corrupted mutex will also immediately detect the issue and throw).

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215360 mentions this issue: sync: refactor RWMutex slightly to prepare for future changes.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215364 mentions this issue: sync: implement the RWMutex's lockTable type.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215361 mentions this issue: sync: Implement a version of RWMutex that can avoid cache contention.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215365 mentions this issue: sync: write a benchmark for RLocks with a full table.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215359 mentions this issue: sync: add benchmark for issue 17973.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2020

Change https://golang.org/cl/215362 mentions this issue: sync: Implement a procLocal abstraction.

@balasanjay
Copy link
Contributor

@balasanjay balasanjay commented Feb 13, 2020

Update: the two problems with RWMutex is that it has poor multi-core scalability (this issue) and that its fairly big (takes up 40% of a cache-line on its own). I originally decided to look at multi-core scalability first, but upon reflection, it makes more sense to tackle these problems in the other order. I intend to return to this issue once #37142 is resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
You can’t perform that action at this time.