OpenMP performance with atomic updates #1874

@ojschumann


I ran some scaling tests (OpenMP, no MPI) on up to 50 cores and found that performance degrades when all cores are used: the runtime with all cores is even longer than with the optimal core count. I tracked the root cause down to the atomic updates of some variables. I used OpenMC version 0.12.2 for my tests, compiled with gcc 9.3.0 (Ubuntu 20.04 LTS), and my problem has no tallies (see below). I patched the code so that I could compare three different implementations (patch attached):

a) atomic update (omp atomic)
b) std::vector with a size of omp_get_max_threads() and an update operator+= { data[omp_get_thread_num()]+= v; }
c) an additional omp threadprivate variable that gets updated in the parallel section.
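
To make the comparison concrete, here is a minimal sketch of the three variants applied to a plain sum. It is not the code from the attached patch: the function names (sum_atomic, sum_per_thread_vector, sum_threadprivate) and the variable tp_accum are made up for illustration, and the serial fallbacks only exist so the sketch also builds without -fopenmp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption: only needed when built without -fopenmp).
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// (a) Every update is an atomic read-modify-write on one shared variable.
double sum_atomic(int n, double v)
{
  double total = 0.0;
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
#pragma omp atomic
    total += v;
  }
  return total;
}

// (b) One slot per thread in a shared std::vector, reduced after the loop.
// Neighboring slots can share a cache line, so false sharing is possible.
double sum_per_thread_vector(int n, double v)
{
  std::vector<double> data(omp_get_max_threads(), 0.0);
#pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    std::size_t tid = static_cast<std::size_t>(omp_get_thread_num());
    assert(tid < data.size());
    data[tid] += v;
  }
  double total = 0.0;
  for (double x : data) total += x;
  return total;
}

// (c) A threadprivate accumulator: each thread owns its own copy.
static double tp_accum = 0.0;
#pragma omp threadprivate(tp_accum)

double sum_threadprivate(int n, double v)
{
  double total = 0.0;
#pragma omp parallel
  {
    tp_accum = 0.0;  // reset this thread's private copy
#pragma omp for
    for (int i = 0; i < n; ++i) {
      tp_accum += v;
    }
    // A single atomic per thread merges the private results.
#pragma omp atomic
    total += tp_accum;
  }
  return total;
}
```

All three functions return the same sum; they differ only in how often threads touch shared memory, which is what the timings below compare.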

I benchmarked an eigenvalue and a fixed source calculation on a common, simple geometry. The results for the three implementations are:

Atomic
eigenvalue : 71.61s speedup: 1.00
fixed source : 287.96s speedup: 1.00

std::vector
eigenvalue : 43.46s speedup: 1.65
fixed source : 41.33s speedup: 6.92

omp threadprivate
eigenvalue : 43.75s speedup: 1.64
fixed source : 35.21s speedup: 8.13

  • I guess the threadprivate solution is the best: on NUMA architectures, OpenMP might allocate the variables in node-local memory, whereas the std::vector is allocated in one small memory block, which might even cause cache issues (for example, false sharing between neighboring per-thread entries).
  • The speedup for eigenvalue calculations is not as high as for fixed source because here the storage of source points in the thread-aware array also uses atomic operations, which serialize the threads. A solution could be a per-thread (temporary) fission_bank.
  • Atomic operations also occur in the tally updates, which were not relevant in the tests here. A per-thread accumulator might consume too much memory (I already have a preliminary implementation of this, and it works very well for "small" tallies). Another approach might be a per-thread tally event cache that is only applied to the tally accumulator every N tally scores.
  • Also note that the large speedup in my fixed source test is only achieved if I define the static variable in IndependentSource::sample as threadprivate. Otherwise, the speedup is only 1.70 / 1.14. This does in fact change the behavior of the code (aka a bug!!). My guess is that OpenMP (at least the gcc implementation) introduces an implicit atomic operation for the static variable. I did not find that behavior described in the OpenMP specs. Other compilers (icc, clang) might do it differently, and if somebody has info on that, I would like to hear about it.
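
The per-thread tally event cache mentioned above could look roughly like the following sketch. This is not the preliminary implementation referred to in this issue and not OpenMC code: the class TallyEventCache, the function score_all, the flat std::vector<double> tally, and the named critical section are all assumptions chosen to keep the example small. Each thread buffers (bin, value) events and takes a lock only once per flush instead of once per score.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-thread cache: buffers tally events and applies them to
// the shared accumulator only when `capacity` events have been collected.
class TallyEventCache {
public:
  TallyEventCache(std::vector<double>& shared, std::size_t capacity)
    : shared_(shared), capacity_(capacity)
  {
    events_.reserve(capacity);
  }

  // Apply whatever is still buffered when the cache goes out of scope.
  ~TallyEventCache() { flush(); }

  void score(std::size_t bin, double value)
  {
    assert(bin < shared_.size());
    events_.emplace_back(bin, value);
    if (events_.size() >= capacity_) flush();
  }

  void flush()
  {
    if (events_.empty()) return;
    // One synchronization per flush rather than one atomic per score.
#pragma omp critical(tally_flush)
    {
      for (const auto& e : events_) {
        shared_[e.first] += e.second;
      }
    }
    events_.clear();
  }

private:
  std::vector<double>& shared_;
  std::size_t capacity_;
  std::vector<std::pair<std::size_t, double>> events_;
};

// Scores n_events unit events round-robin into n_bins bins.
std::vector<double> score_all(std::size_t n_bins, int n_events,
                              std::size_t cache_size)
{
  std::vector<double> tally(n_bins, 0.0);
#pragma omp parallel
  {
    TallyEventCache cache(tally, cache_size);  // one cache per thread
#pragma omp for
    for (int i = 0; i < n_events; ++i) {
      cache.score(static_cast<std::size_t>(i) % n_bins, 1.0);
    }
  }  // each thread's cache flushes its remainder here
  return tally;
}
```

Whether this actually beats plain atomics depends on N and on how many threads flush at the same time, so it would need to be benchmarked in the same way as the three variants above. The memory cost is one buffer of N events per thread, independent of the tally size.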

Attached are the patch to OpenMC 0.12.2 that I used for experimenting (set THREAD_LOCAL_MODE in openmc/include/threadprivate.h and recompile to switch modes) and my benchmarking script.

openmc-0.12.2-atomic.patch.gz
sphere_base.py.gz
