cmd/cc: atomic intrinsics #4947

Closed
dvyukov opened this Issue Mar 1, 2013 · 30 comments

Member

dvyukov commented Mar 1, 2013

Some critical pieces of runtime package extensively use atomic operations (e.g.
runtime.lock/unlock, semaphores, scheduler, memory allocator, GC). It can make sense to
add compiler intrinsics for atomic memory accesses -- ATOMIC_LOAD/STORE/XADD/XCHG/CAS
similar to the PREFETCH intrinsic. The intrinsics should generate inline code and not
disturb the register allocator as badly as function calls do.
The implementation details are to be discussed.
Member

minux commented Mar 1, 2013

Comment 1:

Agree.
Perhaps related: I proposed that we move part of the sync/atomic package
(the assembly part) into the runtime package, so that 64-bit atomic operations
on ARM could use native instructions where available.
Contributor

rsc commented Mar 4, 2013

Comment 2:

I'll put them in if someone can demonstrate that they make a significant
performance difference.
Contributor

rsc commented Mar 12, 2013

Comment 3:

[The time for maybe has passed.]
Member

dvyukov commented Jun 24, 2013

Comment 4:

Currently the profile for
./runtime.test -test.run=none
-test.bench="BenchmarkSyscall$|BenchmarkCreateGoroutines$|BenchmarkChanSync$"
on linux/amd64 looks as follows:
Events: 6K cycles
 25.52%  runtime.test  runtime.test       [.] runtime.xchg
  7.56%  runtime.test  runtime.test       [.] runtime.atomicstore
  5.50%  runtime.test  runtime.test       [.] runtime.cas
  4.36%  runtime.test  runtime.test       [.] runtime.exitsyscall
  3.73%  runtime.test  runtime.test       [.] runtime.lock
  3.66%  runtime.test  runtime.test       [.] runtime.entersyscall
  3.62%  runtime.test  runtime.test       [.] runtime_test.func·009
  3.61%  runtime.test  runtime.test       [.] runtime.newproc1
  3.42%  runtime.test  runtime.test       [.] runtime.chansend
  3.22%  runtime.test  runtime.test       [.] schedule
  3.11%  runtime.test  runtime.test       [.] runtime.unlock
  2.89%  runtime.test  runtime.test       [.] runtime.mcall
  2.76%  runtime.test  runtime.test       [.] runtime.memclr
  2.13%  runtime.test  runtime.test       [.] runtime.xadd64
atomic ops are:
 25.52%  runtime.test  runtime.test       [.] runtime.xchg
  7.56%  runtime.test  runtime.test       [.] runtime.atomicstore
  5.50%  runtime.test  runtime.test       [.] runtime.cas
  2.13%  runtime.test  runtime.test       [.] runtime.xadd64
That is, 40% is spent in atomic ops. Note that this is with GOMAXPROCS=1, so no
contention.
Of course this is a synthetic benchmark, but I believe that a real program can spend up
to 10% of its time in atomic ops. Inlining can save half of that.
Moreover, atomic ops will be used more heavily in chans and the scheduler over time.
Moreover, we have some places in the runtime where we deliberately avoid atomic ops
precisely because of the call cost. This jeopardizes correctness. Not a good situation.
Member

dvyukov commented Jun 25, 2013

Comment 5:

To move forward with this, we need a prototype (for amd64).
Contributor

rsc commented Jul 11, 2013

Comment 6:

I will send you a prototype if you LGTM CL 10909045.

Owner changed to @rsc.

Status changed to Accepted.

Member

dvyukov commented Jul 15, 2013

Comment 7:

For x86 we only need separate STORE_RELEASE and STORE_SEQ_CST; the former uses a plain
MOV, while the latter uses XCHG.
All variations of LOAD just use MOV.
All LOCK prefixed RMW operations are SEQ_CST anyway.
For ARM we will need all variations of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST;
this makes an actual performance difference. But that is only if the prototype succeeds,
and we can have only a SEQ_CST implementation for starters (all others forwarding to it).
Contributor

rsc commented Jul 15, 2013

Comment 8:

I propose not being so complex. Let the intrinsics be the same API we have
today (xchg64, load64, etc).
Russ
Member

dvyukov commented Jul 15, 2013

Comment 9:

MOV vs XCHG for STORE can make a significant performance difference on x86. Currently we
always have to use XCHG.
RELAXED vs non-RELAXED can significantly affect register allocation; for RELAXED there
is no need to dump/restore registers. There are lots of fast-path checks that only need
a RELAXED load (e.g. is the global runq empty? does the chan have any recv waiters?).
And I am not even talking about ARM, which needs dmb; op; dmb for SEQ_CST operations.
Precise memory ordering constraints also improve code readability by making the intent
clear and will help with any formal verification.
Contributor

rsc commented Jul 15, 2013

Comment 10:

Can you please explain exactly what operations you are asking for, both
their full names and their semantics?
The API needs to be portable.
It is a non-starter for the scheduler to use different source code on
different architectures.
Russ
Member

dvyukov commented Jul 15, 2013

Comment 11:

T ATOMIC_FETCH_ADD_ORDER(T*, T);
T ATOMIC_EXCHANGE_ORDER(T*, T);
T ATOMIC_COMPARE_EXCHANGE_ORDER(T*, T, T);  // ideally (T*, T*, T), where the second
                                            // operand is a pointer to CMP and is updated on failure
T ATOMIC_LOAD_ORDER(T*);
void ATOMIC_STORE_ORDER(T*, T);
Where ORDER is one of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST.
It is portable (this is the model employed by C/C++).
RELAXED operations do not need to dump/restore registers.
On x86, a SEQ_CST STORE is XCHG; other stores are MOV. All loads are MOV. All RMW
operations are LOCK prefixed and thus SEQ_CST.
Contributor

rsc commented Jul 15, 2013

Comment 12:

So you are asking for 30 new intrinsics?
Contributor

rsc commented Jul 15, 2013

Comment 13:

Actually, with int32, int64 and ptr it is really 60 or 90 new intrinsics. Are
you sure that's all of them?
I'm sorry, but that's not an API I am smart enough to program against. We
will never be able to write correct code if we have to reason through all
of that.
Member

dvyukov commented Jul 16, 2013

Comment 14:

>Actually with int32 int64 and ptr it is really 60 or 90 new intrinsics. Are
>you sure that's all of them?
We may also need FETCH_OR for the GC to manipulate bitmasks, or maybe not.
I would expect ORDER to be represented as an enum and the type as a Type (for codegen
only size is relevant, so int32/uint32/int64/uint64/intptr/uintptr are really only 2
cases), so it's only 4 intrinsics -- LOAD, STORE, COMPARE_EXCHANGE and FETCH_OP.
Member

dvyukov commented Jul 16, 2013

Comment 15:

> I'm sorry, but that's not an API I am smart enough to program against. We
> will never be able to write correct code if we have to reason through all
> of that.
There are 3 orthogonal parts -- operation, type and order. Type and operation are
simple. The only remaining part is order.
Well, I agree, it's not trivial. The good news is that you can use SEQ_CST for all
operations, and it will give you the current semantics of simple interleaving of
operations.
The remaining 2 types of synchronization (RELAXED and ACQUIRE-RELEASE) are more complex,
but they are what can provide the additional performance benefit.
For x86, the only operations that would benefit from non-SEQ_CST memory order are:
STORE_RELEASE -- allows replacing XCHG with MOV
LOAD_RELAXED -- allows leaving the register allocator undisturbed
I am fine if we do only these besides SEQ_CST (which may then be implied, i.e.
ATOMIC_LOAD instead of ATOMIC_LOAD_SEQ_CST).
Contributor

rsc commented Jul 16, 2013

Comment 16:

I am still uncomfortable with all this complexity. No one is going to be
able to reason about and write correct code. I thought the purpose of this
issue was to avoid the function call overhead, and now instead we're
talking about shaving nanoseconds by making the atomics API 10x bigger. I
am still willing to avoid the function call overhead. I am much less
enthusiastic about adding new semantics. I don't think we should at all,
and certainly not before Go 1.2.
Put another way, I don't see why we should support so many more operations
here than we do in package sync/atomic.
Russ
Contributor

rsc commented Nov 27, 2013

Comment 17:

Labels changed: added go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 19:

Labels changed: added repo-main.

Member

dvyukov commented Jan 8, 2014

Comment 20:

On 2014/01/06 20:24:59, rsc wrote:
> I think we still need to figure out the API, no? We were having a long
> discussion that fizzled out.
> 
> Russ
There are several things in the current API that I would like to change. But how much
are you willing to change? My understanding was that you do not want to change
anything. And, well, all else being equal, faster is still better.

Labels changed: added release-go1.3maybe, removed priority-later, release-none.

Contributor

rsc commented Jan 8, 2014

Comment 21:

I do not want to do something different from the package sync API. If that is
sufficient, great.
Member

dvyukov commented Jan 8, 2014

Comment 22:

Can we make a few additions to it?
The scheduler, poller, and channels are quite special and demanding in this regard. It's
unpleasant to lose performance there just because of an inflexible API...
Contributor

rsc commented May 9, 2014

Comment 23:

Maybe for 1.4. I am not sure these matter anymore.

Labels changed: added release-go1.4, removed release-go1.3maybe.

Comment 24 by jlmoiron:

Has something changed elsewhere to make these not matter anymore?
These are all heavily used by channels, which are the bread and butter of a lot of "real"
Go programs. dvyukov's gut feeling that real programs might spend ~10% of their time in
atomic ops is something I've observed in my own profiles (attached). If the current
runtime atomics can be made intrinsics without expanding the API, and this actually does
represent a significant advantage, it still sounds like it's worth it.

Attachments:

  1. jswn395.png (99908 bytes)
Contributor

rsc commented May 9, 2014

Comment 25:

I thought the main place these were needed was the garbage collector, and it doesn't use
the atomics anymore. Maybe we'll just get rid of all the C code instead.
Contributor

ianlancetaylor commented May 9, 2014

Comment 27:

Then Dmitriy will open a new issue asking for atomic intrinsics in cmd/gc.
Member

dvyukov commented May 10, 2014

Comment 28:

> Then Dmitriy will open a new issue asking for atomic intrinsics in cmd/gc.
Definitely I will :)
Member

minux commented May 10, 2014

Comment 29:

If the runtime were done in Go, I think a more general approach would be to support
inlining assembly routines.
Contributor

rsc commented Sep 15, 2014

Comment 30:

Re #29: everyone talks about inlining assembly routines, but it just doesn't work unless
you teach the compiler's optimizer about every possible assembly instruction. That's
madness.
The C code is going away.

Status changed to Invalid.

@rsc rsc added this to the Go1.4 milestone Apr 14, 2015

@rsc rsc removed the release-go1.4 label Apr 14, 2015

@gopherbot gopherbot locked and limited conversation to collaborators Jun 24, 2016

This issue was closed.
