Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomics with Serial Backend - Default should be Disable? #549

Closed
nmhamster opened this issue Nov 21, 2016 · 25 comments
Closed

Atomics with Serial Backend - Default should be Disable? #549

nmhamster opened this issue Nov 21, 2016 · 25 comments
Assignees
Labels
Enhancement Improve existing capability; will potentially require voting
Milestone

Comments

@nmhamster
Copy link
Contributor

I understand some of the background to why we have atomics enabled for serial backend but I think the default should be that atomic operations are disabled and result in standard load/stores when serial backend is used. This would assume a Kokkos CMake/make option to re-enable if needed. Given that a good number of codes are using the serial backend on MPI-only platforms still this seems an overhead which is not needed or necessary. Removing this would enable more optimization and possibly vectorization for this case.

@nmhamster nmhamster added Enhancement Improve existing capability; will potentially require voting Question For Kokkos internal and external contributors and users labels Nov 21, 2016
@hcedwar
Copy link
Contributor

hcedwar commented Nov 21, 2016

This would only be viable for a Serial-only configuration. Otherwise detected whether within a parallel region or not would swamp the cost.

@nmhamster
Copy link
Contributor Author

@hcedwar I think you tell people if they are using serial backend we won't compile thread safe atomics etc in there unless they use the override. There are definitely Trilinos builds which are serial-only.

@mhoemmen
Copy link
Contributor

What if people use Kokkos::Serial in thread-parallel tasks?

@crtrott
Copy link
Member

crtrott commented Nov 21, 2016

What Carter meant was that if we enable both OpenMP and Serial, atomics are still enabled.
If Serial is the ONLY enabled backend, then we shortcut the atomics.

@mhoemmen
Copy link
Contributor

Would it make sense for Kokkos::atomic_* to be functions of the execution space? Users would write, for example,

execution_space::atomic_add (&x[i], 42.0);

or for an execution space instance instance,

instance.atomic_add (&x[i], 42.0);

@hcedwar
Copy link
Contributor

hcedwar commented Nov 22, 2016

This would deviate way to much from the ISO/C++ standard, and from where the C++20 standard is headed for allowing atomic operations on non- std::atomic types.

Best we can do is when Kokkos is configured with ONLY the Kokkos::Serial back-end that the atomic implementations decay to non-atomic operations. This is still high-risk because an application may be using pthreads or OpenMP unknown to Kokkos, then use Kokkos atomics, then get silent data corruption, and then blame Kokkos.

Would be best to have a Kokkos configuration option that says each program process is guaranteed to be internally serial (a.k.a., MPI only for parallelism) so atomic operations should decay to non-atomic operations. Perhaps, KOKKOS_USING_SERIAL_PROCESS
Naming must conform to how issue #150 is resolved.

@crtrott
Copy link
Member

crtrott commented Nov 22, 2016

Yeah this is what I was thinking as well. And I agree with Carter that the default should be safe. So the option is to turn of atomics if and only if Serial is the only configured backend.

@nmhamster
Copy link
Contributor Author

Can we make that the default if only serial is configured so that all the MPI only codes running at the lab (which is quite a few right now) benefit?

@crtrott
Copy link
Member

crtrott commented Nov 22, 2016

So you think the scenario Carter brought up and I talked about earlier is unlikely enough to make it worth the risk?

@mhoemmen
Copy link
Contributor

Internal apps might very well enable OpenMP (the "one configuration to rule them all" approach), so in practice, it might be harder than this to turn off atomic updates.

@nmhamster
Copy link
Contributor Author

In general I think we should try to remove the overheads for MPI only configurations as much as possible. We are paying thousands of cycles which are not needed at all in the common (MPI only) case. My gut feeling is to say that if you enable Serial we make no guarantees unless you provide an override to force us to compile thread aware operations.

@mhoemmen
Copy link
Contributor

@nmhamster wrote:

In general I think we should try to remove the overheads for MPI only configurations as much as possible.

I agree :-) The issue is that the internal apps that want MPI only to perform well, may actually turn on OpenMP or Pthreads in Kokkos, even though they may not be using it. I'm not sure if we need to worry about that case, though. If they bother to turn on OpenMP, it's their fault if they don't use it. I don't think Kokkos can reasonably figure out globally, at configuration or compile time, whether or not users actually use an enabled execution space.

@ibaned
Copy link
Contributor

ibaned commented Nov 25, 2016

Is there any way to disable atomics currently ? If not it would be nice to have the option, though I don't know about defaults. In my case, I get Valgrind false positives from Kokkos atomic assign operations in Tpetra in serial runs (it compares an uninitialized value to itself). Being able to compile without atomics would make debugging easier.

@mhoemmen
Copy link
Contributor

@ibaned wrote:

In my case, I get Valgrind false positives from Kokkos atomic assign operations in Tpetra in serial runs (it compares an uninitialized value to itself).

If Kokkos::Serial is Tpetra's execution space, Tpetra does not use atomic_assign by default. Thus, these warnings are truly spurious. If Kokkos::OpenMP is Tpetra's execution space, but OpenMP is only using 1 thread, then Tpetra does use atomic_assign by default.

@hcedwar
Copy link
Contributor

hcedwar commented Jan 11, 2017

One avenue for pursuing issue #607

@hcedwar hcedwar added this to the Backlog milestone Feb 15, 2017
@stanmoore1
Copy link
Contributor

stanmoore1 commented Oct 9, 2017

Right now in LAMMPS and SPARTA we check if there is only a single thread and no GPUs, and if so we template out all the Kokkos atomics manually, which is a huge pain, but has to be done for performance. I wish Kokkos provided a more convenient way to do this and atomics were automatically disabled for the Serial backend.

@ibaned ibaned modified the milestones: Backlog, 2017 December Oct 9, 2017
@ibaned
Copy link
Contributor

ibaned commented Oct 9, 2017

Moving this out of Backlog and onto an upcoming deadline will put in on our radar and get it considered again. October release is happening now, so its on the December list.

@ibaned
Copy link
Contributor

ibaned commented Oct 11, 2017

So, the VPIC code would also like this feature.

@mhoemmen
Copy link
Contributor

It still confuses me why atomic operations are not a function on / of the execution space, but whatever ;-P.

@ibaned
Copy link
Contributor

ibaned commented Oct 11, 2017

why atomic operations are not a function on / of the execution space

Since the functions don't take any parameters indicating the execution space, I think they don't have a fast way to determine what space they're in which would prevent this. I suppose if they were templated on the execution space this would be trivial to implement.

@mhoemmen
Copy link
Contributor

@ibaned There's also no way to implement an MPI 3 RDMA back-end without atomics being a function / templated on the execution space. Even if the standard doesn't go that way, I wish Kokkos would. The two least idiomatic parts of Kokkos for me are the KOKKOS_*FUNCTION macros and atomics, precisely because they have no customization hooks.

C++ right now doesn't have a way to customize function attributes (e.g., to make them a function of a template parameter), so we're stuck with the macros. However, we could make atomics take an execution or memory space instance argument (that would cover the MPI RDMA case, among others).

@ibaned
Copy link
Contributor

ibaned commented Oct 12, 2017

C++ right now doesn't have a way to customize function attributes (e.g., to make them a function of a template parameter), so we're stuck with the macros

Yep, this is unfortunate indeed.

However, we could make atomics take an execution or memory space instance argument

I see no issue with doing the following: change atomic functions to take execution space in some way, and create backwards compatibility wrappers that call DefaultExecutionSpace variants.

@mhoemmen
Copy link
Contributor

@ibaned I would be happy to add the atomics overloads myself, if OK with the Kokkos folks.

@ibaned
Copy link
Contributor

ibaned commented Oct 12, 2017

@mhoemmen I'd be wary of spending much time on it, esp. if @hcedwar or @crtrott want a different design.

@hcedwar hcedwar modified the milestones: 2017 December, 2018 February Nov 1, 2017
@hcedwar hcedwar modified the milestones: 2017 December, 2018 February Nov 29, 2017
@ibaned ibaned assigned ibaned and unassigned dsunder Feb 7, 2018
ibaned added a commit to ibaned/kokkos that referenced this issue Feb 20, 2018
[kokkos#607] [kokkos#549]
A few details:
 - Accepting volatile pointers was necessary
   for compatibility with existing calls which
   pass in volatile pointers, hence the const_cast
 - Special implementations of atomic_increment
   were needed to get equal performance in the
   one application I tested (it was doing its
   own serial special cases before).
 - Compilers have a harder time matching templates
   as opposed to overloads, so some call sites
   had to be modified to specify the scalar
   type explicitly
@ibaned
Copy link
Contributor

ibaned commented Feb 20, 2018

I've just implemented this, will open a PR shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Improve existing capability; will potentially require voting
Projects
None yet
Development

No branches or pull requests

8 participants