cmd/cc: atomic intrinsics #4947
Currently the profile for ./runtime.test -test.run=none -test.bench="BenchmarkSyscall$|BenchmarkCreateGoroutines$|BenchmarkChanSync$" on linux/amd64 looks as follows:

Events: 6K cycles
25.52% runtime.test runtime.test [.] runtime.xchg
 7.56% runtime.test runtime.test [.] runtime.atomicstore
 5.50% runtime.test runtime.test [.] runtime.cas
 4.36% runtime.test runtime.test [.] runtime.exitsyscall
 3.73% runtime.test runtime.test [.] runtime.lock
 3.66% runtime.test runtime.test [.] runtime.entersyscall
 3.62% runtime.test runtime.test [.] runtime_test.func·009
 3.61% runtime.test runtime.test [.] runtime.newproc1
 3.42% runtime.test runtime.test [.] runtime.chansend
 3.22% runtime.test runtime.test [.] schedule
 3.11% runtime.test runtime.test [.] runtime.unlock
 2.89% runtime.test runtime.test [.] runtime.mcall
 2.76% runtime.test runtime.test [.] runtime.memclr
 2.13% runtime.test runtime.test [.] runtime.xadd64

The atomic ops are:

25.52% runtime.test runtime.test [.] runtime.xchg
 7.56% runtime.test runtime.test [.] runtime.atomicstore
 5.50% runtime.test runtime.test [.] runtime.cas
 2.13% runtime.test runtime.test [.] runtime.xadd64

That is, 40% of the time is spent in atomic ops. Note that this is with GOMAXPROCS=1, so there is no contention. Of course these are synthetic benchmarks, but I believe that a real program can spend up to 10% of its time in atomic ops, and inlining can save half of that. Moreover, atomic ops will be used more heavily in chans and the scheduler over time. We also have places in the runtime where we deliberately avoid atomic ops precisely because of the call cost, which jeopardizes correctness. Not a good situation.
I will send you a prototype if you LGTM CL 10909045.

Owner changed to @rsc.
Status changed to Accepted.
For x86 we only need separate STORE_RELEASE and STORE_SEQ_CST: the former uses a plain MOV, while the latter uses XCHG. All variations of LOAD just use MOV, and all LOCK-prefixed RMW operations are SEQ_CST anyway. For ARM we will need all variations of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST; there it makes an actual performance difference. But this is only if the prototype succeeds, and we can have only the SEQ_CST implementation for starters (all others forward to it).
MOV vs XCHG for STORE can make a significant performance difference on x86; currently we have to use XCHG always. RELAXED vs non-RELAXED can significantly affect register allocation: for RELAXED there is no need to dump/restore registers. There are lots of fast-path checks that only need a RELAXED load (e.g. is the global runq empty? does the chan have any recv waiters?). And that is without even getting into ARM, which needs dmb; op; dmb for SEQ_CST operations. Precise memory ordering constraints also improve code readability by making the intent clear, and will help with any formal verification.
T ATOMIC_FETCH_ADD_ORDER(T*, T);
T ATOMIC_EXCHANGE_ORDER(T*, T);
T ATOMIC_COMPARE_EXCHANGE_ORDER(T*, T, T); // ideally (T*, T*, T) where the second operand is a pointer to CMP and is updated on failure
T ATOMIC_LOAD_ORDER(T*);
void ATOMIC_STORE_ORDER(T*, T);

where ORDER is one of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST. It is portable (the same scheme is employed by C/C++). RELAXED operations do not need to dump/restore registers. On x86, a SEQ_CST STORE is XCHG and all other stores are MOV; all loads are MOV; all RMW operations are LOCK-prefixed and so SEQ_CST.
> Actually with int32 int64 and ptr it is really 60 or 90 new intrinsics. Are you sure that's all of them?

We may also need FETCH_OR for the GC to manipulate bitmasks, or maybe not. I would expect ORDER to be represented as an enum and the type as a Type (for codegen only size is relevant, so int32/uint32/int64/uint64/intptr/uintptr is only 2 cases), so it's only 4 intrinsics: LOAD, STORE, COMPARE_EXCHANGE and FETCH_OP.
> I'm sorry, but that's not an API I am smart enough to program against. We will never be able to write correct code if we have to reason through all of that.

There are 3 orthogonal parts: operation, type and order. Type and operation are simple. The only remaining part is order. Well, I agree, it's not trivial. The good news is that you can use SEQ_CST for all operations and it will give you the current semantics of simple interleaving of operations. The remaining 2 kinds of synchronization (RELAXED and ACQUIRE/RELEASE) are more complex, but they are what can provide the additional performance benefit. For x86, the only operations that would benefit from a non-SEQ_CST memory order are:

STORE_RELEASE -- allows replacing XCHG with a plain MOV
LOAD_RELAXED -- allows not disturbing the register allocator

I am fine if we do only these besides SEQ_CST (which may then be implied, i.e. ATOMIC_LOAD instead of ATOMIC_LOAD_SEQ_CST).
I am still uncomfortable with all this complexity. No one is going to be able to reason about it and write correct code. I thought the purpose of this issue was to avoid the function call overhead, and now instead we're talking about shaving nanoseconds by making the atomics API 10x bigger. I am still willing to avoid the function call overhead. I am much less enthusiastic about adding new semantics. I don't think we should at all, and certainly not before Go 1.2. Put another way, I don't see why we should support so many more operations here than we do in package sync/atomic.

Russ
On 2014/01/06 20:24:59, rsc wrote:

> I think we still need to figure out the API, no? We were having a long discussion that fizzled out.
>
> Russ

There are several things in the current API that I would like to change. But how much are you willing to change? My understanding was that you do not want to change anything. And, well, all else being equal, faster is still better.

Labels changed: added release-go1.3maybe, removed priority-later, release-none.
Has something changed elsewhere to make these not matter anymore? These are all heavily used by channels, which are the bread and butter of a lot of "real" Go programs. dvyukov's gut feeling that real programs might spend ~10% of their time in atomic ops is something I've observed in my own profiles (attached). If the current runtime atomics can be made intrinsics without expanding the API, and that actually does represent a significant advantage, it still sounds like it's worth it.
This issue was closed.