cmd/cc: atomic intrinsics #4947

Closed
dvyukov opened this Issue Mar 1, 2013 · 30 comments

Member

dvyukov commented Mar 1, 2013

Some critical pieces of runtime package extensively use atomic operations (e.g.
runtime.lock/unlock, semaphores, scheduler, memory allocator, GC). It can make sense to
add compiler intrinsics for atomic memory accesses -- ATOMIC_LOAD/STORE/XADD/XCHG/CAS
similar to the PREFETCH intrinsic. The intrinsics should generate inline code and not
disturb the register allocator as badly as function calls do.
The implementation details are to be discussed.
Member

minux commented Mar 1, 2013

Comment 1:

Agree.
Perhaps related: I proposed that we move part of the sync/atomic package
(the assembly part) into the runtime package, so that 64-bit atomic operations
on ARM could use native instructions where available.
Contributor

rsc commented Mar 4, 2013

Comment 2:

I'll put them in if someone can demonstrate that they make a significant
performance difference.
Contributor

rsc commented Mar 12, 2013

Comment 3:

[The time for maybe has passed.]
Member

dvyukov commented Jun 24, 2013

Comment 4:

Currently the profile for
./runtime.test -test.run=none
-test.bench="BenchmarkSyscall$|BenchmarkCreateGoroutines$|BenchmarkChanSync$"
on linux/amd64 looks as follows:
Events: 6K cycles
 25.52%  runtime.test  runtime.test       [.] runtime.xchg
  7.56%  runtime.test  runtime.test       [.] runtime.atomicstore
  5.50%  runtime.test  runtime.test       [.] runtime.cas
  4.36%  runtime.test  runtime.test       [.] runtime.exitsyscall
  3.73%  runtime.test  runtime.test       [.] runtime.lock
  3.66%  runtime.test  runtime.test       [.] runtime.entersyscall
  3.62%  runtime.test  runtime.test       [.] runtime_test.func·009
  3.61%  runtime.test  runtime.test       [.] runtime.newproc1
  3.42%  runtime.test  runtime.test       [.] runtime.chansend
  3.22%  runtime.test  runtime.test       [.] schedule
  3.11%  runtime.test  runtime.test       [.] runtime.unlock
  2.89%  runtime.test  runtime.test       [.] runtime.mcall
  2.76%  runtime.test  runtime.test       [.] runtime.memclr
  2.13%  runtime.test  runtime.test       [.] runtime.xadd64
atomic ops are:
 25.52%  runtime.test  runtime.test       [.] runtime.xchg
  7.56%  runtime.test  runtime.test       [.] runtime.atomicstore
  5.50%  runtime.test  runtime.test       [.] runtime.cas
  2.13%  runtime.test  runtime.test       [.] runtime.xadd64
That is, 40% is spent in atomic ops. Note that this is with GOMAXPROCS=1, so no
contention.
Of course this is a synthetic benchmark, but I believe that a real program can spend up
to 10% of its time in atomic ops. Inlining can save half of that.
Moreover, atomic ops will be used more heavily in chans and the scheduler over time.
Moreover, we have some places in the runtime where we deliberately avoid atomic ops
precisely because of the call cost. This jeopardizes correctness. Not a good situation.
Member

dvyukov commented Jun 25, 2013

Comment 5:

To move forward with this, we need a prototype (for amd64).
Contributor

rsc commented Jul 11, 2013

Comment 6:

I will send you a prototype if you LGTM CL 10909045.

Owner changed to @rsc.

Status changed to Accepted.

Member

dvyukov commented Jul 15, 2013

Comment 7:

For x86 we only need separate STORE_RELEASE and STORE_SEQ_CST; the former uses a plain
MOV, while the latter uses XCHG.
All variations of LOAD just use MOV.
All LOCK prefixed RMW operations are SEQ_CST anyway.
For ARM we will need all variations of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST;
this makes an actual performance difference. But that is only if the prototype succeeds,
and we can have only a SEQ_CST implementation for starters (all others forwarding to it).
Contributor

rsc commented Jul 15, 2013

Comment 8:

I propose not being so complex. Let the intrinsics be the same API we have
today (xchg64, load64, etc).
Russ
Member

dvyukov commented Jul 15, 2013

Comment 9:

MOV vs XCHG for STORE can make a significant performance difference on x86. Currently we
always have to use XCHG.
RELAXED vs non-RELAXED can significantly affect register allocation; for RELAXED there
is no need to dump/restore registers. There are lots of fast-path checks that only need
a RELAXED load (e.g. is the global runq empty? does the chan have any recv waiters?).
And I am not even talking about ARM, which needs dmb; op; dmb for SEQ_CST operations.
Precise memory ordering constraints also improve code readability by making the intent
clear and will help with any formal verification.
Contributor

rsc commented Jul 15, 2013

Comment 10:

Can you please explain exactly what operations you are asking for, both
their full names and their semantics?
The API needs to be portable.
It is a non-starter for the scheduler to use different source code on
different architectures.
Russ
Member

dvyukov commented Jul 15, 2013

Comment 11:

T ATOMIC_FETCH_ADD_ORDER(T*, T);
T ATOMIC_EXCHANGE_ORDER(T*, T);
T ATOMIC_COMPARE_EXCHANGE_ORDER(T*, T, T);  // ideally (T*, T*, T), where the second
                                            // operand is a pointer to CMP and is updated on failure
T ATOMIC_LOAD_ORDER(T*);
void ATOMIC_STORE_ORDER(T*, T);
Where ORDER is one of RELAXED/CONSUME/ACQUIRE/RELEASE/ACQ_REL/SEQ_CST.
It is portable (this is the model employed by C/C++).
RELAXED operations do not need to dump/restore registers.
On x86, a SEQ_CST STORE is XCHG; other stores are MOV. All loads are MOV. All RMW
operations are LOCK prefixed and thus SEQ_CST.
Contributor

rsc commented Jul 15, 2013

Comment 12:

So you are asking for 30 new intrinsics?
Contributor

rsc commented Jul 15, 2013

Comment 13:

Actually, with int32, int64 and ptr it is really 60 or 90 new intrinsics. Are
you sure that's all of them?
I'm sorry, but that's not an API I am smart enough to program against. We
will never be able to write correct code if we have to reason through all
of that.
Member

dvyukov commented Jul 16, 2013

Comment 14:

>Actually with int32 int64 and ptr it is really 60 or 90 new intrinsics. Are
>you sure that's all of them?
We may also need FETCH_OR for the GC to manipulate bitmasks, or maybe not.
I would expect ORDER to be represented as an enum and the type as a Type (for codegen
only size is relevant, so int32/uint32/int64/uint64/intptr/uintptr are really only 2
cases), so it's only 4 intrinsics -- LOAD, STORE, COMPARE_EXCHANGE and FETCH_OP.
Member

dvyukov commented Jul 16, 2013

Comment 15:

> I'm sorry, but that's not an API I am smart enough to program against. We
> will never be able to write correct code if we have to reason through all
> of that.
There are 3 orthogonal parts -- operation, type and order. Type and operation are
simple. The only remaining part is order.
Well, I agree, it's not trivial. The good news is that you can use SEQ_CST for all
operations, and it will give you the current semantics of simple interleaving of
operations.
The remaining 2 types of synchronization (RELAXED and ACQUIRE-RELEASE) are more complex,
but they are what can provide the additional performance benefit.
For x86, the only operations that would benefit from non-SEQ_CST memory order are:
STORE_RELEASE -- allows replacing XCHG with MOV
LOAD_RELAXED -- allows leaving the register allocator undisturbed
I am fine if we do only these besides SEQ_CST (which may then be implied, i.e.
ATOMIC_LOAD instead of ATOMIC_LOAD_SEQ_CST).
Contributor

rsc commented Jul 16, 2013

Comment 16:

I am still uncomfortable with all this complexity. No one is going to be
able to reason about and write correct code. I thought the purpose of this
issue was to avoid the function call overhead, and now instead we're
talking about shaving nanoseconds by making the atomics API 10x bigger. I
am still willing to avoid the function call overhead. I am much less
enthusiastic about adding new semantics. I don't think we should at all,
and certainly not before Go 1.2.
Put another way, I don't see why we should support so many more operations
here than we do in package sync/atomic.
Russ
Contributor

rsc commented Nov 27, 2013

Comment 17:

Labels changed: added go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 18:

Labels changed: added release-none, removed go1.3maybe.

Contributor

rsc commented Dec 4, 2013

Comment 19:

Labels changed: added repo-main.

Member

dvyukov commented Jan 8, 2014

Comment 20:

On 2014/01/06 20:24:59, rsc wrote:
> I think we still need to figure out the API, no? We were having a long
> discussion that fizzled out.
> 
> Russ
There are several things in the current API that I would like to change. But how much
are you willing to change? My understanding was that you do not want to change
anything. And, well, all else being equal, faster is still better.

Labels changed: added release-go1.3maybe, removed priority-later, release-none.

Contributor

rsc commented Jan 8, 2014

Comment 21:

I do not want to do something different from the package sync API. If that is
sufficient, great.
Member

dvyukov commented Jan 8, 2014

Comment 22:

Can we make a few additions to it?
The scheduler, poller, and channels are quite special and demanding in this regard. It's
unpleasant to lose performance there just because of an inflexible API...
Contributor

rsc commented May 9, 2014

Comment 23:

Maybe for 1.4. I am not sure these matter anymore.

Labels changed: added release-go1.4, removed release-go1.3maybe.

Comment 24 by jlmoiron:

Has something changed elsewhere to make these not matter anymore?
These are all heavily used by channels, which are the bread and butter of a lot of "real"
Go programs. dvyukov's gut feeling that real programs might spend ~10% of their time in
atomic ops is something I've observed in my own profiles (attached). If the current
runtime atomics can be made intrinsics without expanding the API, and this actually does
represent a significant advantage, it still sounds like it's worth it.

Attachments:

  1. jswn395.png (99908 bytes)
Contributor

rsc commented May 9, 2014

Comment 25:

I thought the main place these were needed was the garbage collector, and it doesn't use
the atomics anymore. Maybe we'll just get rid of all the C code instead.
Contributor

ianlancetaylor commented May 9, 2014

Comment 27:

Then Dmitriy will open a new issue asking for atomic intrinsics in cmd/gc.
Member

dvyukov commented May 10, 2014

Comment 28:

> Then Dmitriy will open a new issue asking for atomic intrinsics in cmd/gc.
Definitely I will :)
Member

minux commented May 10, 2014

Comment 29:

If the runtime were done in Go, I think a more general approach would be to support
inlining assembly routines.
Contributor

rsc commented Sep 15, 2014

Comment 30:

Re #29: everyone talks about inlining assembly routines, but it just doesn't work unless
you teach the compiler's optimizer about every possible assembly instruction. That's
madness.
The C code is going away.

Status changed to Invalid.

@rsc rsc added this to the Go1.4 milestone Apr 14, 2015

@rsc rsc removed the release-go1.4 label Apr 14, 2015

@gopherbot gopherbot locked and limited conversation to collaborators Jun 24, 2016

This issue was closed.
