OpenMP performance with atomic updates #1874

I ran some scaling tests (OpenMP only, no MPI) on up to 50 cores and found that performance degrades when all cores are used: the runtime is even longer than at the optimal core count. I tracked the root cause down to the atomic updates of some variables. I used OpenMC version 0.12.2 with gcc 9.3.0 on Ubuntu 20.04 LTS, and my problem has no tallies (see below). I patched the code so that I could compare three different implementations (patch attached):

a) atomic update (`omp atomic`)
b) a `std::vector<double>` of size `omp_get_max_threads()` with an update operator `operator+=(v) { data[omp_get_thread_num()] += v; }`
c) an additional `omp threadprivate` variable that is updated in the parallel section
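
To make the three variants concrete, here is a minimal, self-contained sketch of how they might look on a toy accumulation loop. This is not the code from the attached patch: the names `sum_atomic`, `PerThreadSum`, `sum_vector`, `tp_partial` and `sum_threadprivate` are made up for illustration, and the `#ifdef _OPENMP` fallback is only there so that the sketch also builds without `-fopenmp`.

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks so the sketch also compiles without -fopenmp.
inline int omp_get_max_threads() { return 1; }
inline int omp_get_thread_num() { return 0; }
#endif

// (a) One shared accumulator; every update is an atomic operation.
double sum_atomic(long n, double v) {
  double total = 0.0;
  #pragma omp parallel for
  for (long i = 0; i < n; ++i) {
    #pragma omp atomic
    total += v;
  }
  return total;
}

// (b) One slot per thread in a std::vector<double>; no atomics in the hot loop.
// Neighbouring slots share cache lines, so false sharing is still possible.
struct PerThreadSum {
  std::vector<double> data;
  PerThreadSum() : data(omp_get_max_threads(), 0.0) {}
  PerThreadSum& operator+=(double v) {
    const int tid = omp_get_thread_num();
    assert(tid < static_cast<int>(data.size()));  // one slot per thread
    data[tid] += v;
    return *this;
  }
  double reduce() const {
    double s = 0.0;
    for (double d : data) s += d;
    return s;
  }
};

double sum_vector(long n, double v) {
  PerThreadSum acc;
  #pragma omp parallel for
  for (long i = 0; i < n; ++i) {
    acc += v;
  }
  return acc.reduce();
}

// (c) A threadprivate partial sum, merged into the shared total once per thread.
static double tp_partial = 0.0;
#pragma omp threadprivate(tp_partial)

double sum_threadprivate(long n, double v) {
  double total = 0.0;
  #pragma omp parallel
  {
    tp_partial = 0.0;
    #pragma omp for
    for (long i = 0; i < n; ++i) {
      tp_partial += v;
    }
    #pragma omp atomic
    total += tp_partial;  // one atomic per thread instead of one per update
  }
  return total;
}
```

In (b) and (c) the hot loop touches only thread-owned storage, and the shared total is formed once at the end, which is the property the comparison above is about.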

I benchmarked an eigenvalue and a fixed source calculation on a common, simple geometry. The results for the three implementations are:


| Implementation | Eigenvalue time (s) | Eigenvalue speedup | Fixed source time (s) | Fixed source speedup |
| --- | --- | --- | --- | --- |
| Atomic | 71.61 | 1.00 | 287.96 | 1.00 |
| std::vector&lt;double&gt; | 43.46 | 1.65 | 41.33 | 6.92 |
| omp threadprivate | 43.75 | 1.64 | 35.21 | 8.13 |



- I think the threadprivate solution is the best. On NUMA architectures, OpenMP might allocate the threadprivate variables in node-local memory, whereas the `std::vector` is allocated as one small contiguous memory block, which might even cause cache issues such as false sharing.
- The speedup for eigenvalue calculations is lower than for fixed source calculations because storing the source points in the thread-aware array also uses atomic operations, which serialize the threads. A possible solution is a temporary per-thread fission_bank.
- Atomic operations also occur in the tally updates, which were not relevant in these tests. A per-thread accumulator might consume too much memory (I already have a preliminary implementation of this, and it works very well for "small" tallies). Another approach might be a per-thread tally event cache that is applied to the tally accumulator only every N tally scores.
- Also note that the large speedup for the fixed source problem in my test is only achieved if I declare the static variable in `IndependentSource::sample` as threadprivate. Otherwise, the speedup is only 1.70 / 1.14. This does in fact change the behavior of the code (in other words, a bug!). My guess is that OpenMP (at least the gcc implementation) introduces an implicit atomic operation for the static variable. I did not find that behavior described in the OpenMP specification. Other compilers (icc, clang) might handle it differently; if somebody has information on that, I would like to hear about it.
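
The per-thread tally event cache idea from the list above could look roughly like the following sketch. It is only an illustration under my own assumptions, not OpenMC code: `TallyEvent`, `BufferedTally`, `run_demo` and the `flush_every` parameter are hypothetical names, and I use a named `omp critical` section for the flush instead of atomics, which is just one possible choice.

```cpp
#include <cstddef>
#include <vector>

// A single tally contribution: which bin to score and by how much.
struct TallyEvent {
  std::size_t bin;
  double score;
};

// Shared accumulator that is only touched when a thread flushes its buffer.
class BufferedTally {
public:
  BufferedTally(std::size_t n_bins, std::size_t flush_every)
    : results_(n_bins, 0.0), flush_every_(flush_every) {}

  // Record an event in the calling thread's private buffer and flush
  // the buffer once it holds flush_every_ events.
  void score(std::vector<TallyEvent>& buffer, std::size_t bin, double value) {
    buffer.push_back({bin, value});
    if (buffer.size() >= flush_every_) flush(buffer);
  }

  // Apply all buffered events to the shared accumulator in one
  // synchronized step, then empty the buffer.
  void flush(std::vector<TallyEvent>& buffer) {
    #pragma omp critical(tally_flush)
    {
      for (const auto& e : buffer) results_[e.bin] += e.score;
    }
    buffer.clear();
  }

  const std::vector<double>& results() const { return results_; }

private:
  std::vector<double> results_;
  std::size_t flush_every_;
};

// Toy driver: every event scores 1.0 into bin (i mod n_bins).
std::vector<double> run_demo(long n_events, std::size_t n_bins,
                             std::size_t flush_every) {
  BufferedTally tally(n_bins, flush_every);
  #pragma omp parallel
  {
    std::vector<TallyEvent> buffer;  // private to each thread
    buffer.reserve(flush_every);
    #pragma omp for
    for (long i = 0; i < n_events; ++i) {
      tally.score(buffer, static_cast<std::size_t>(i) % n_bins, 1.0);
    }
    tally.flush(buffer);  // apply whatever is left in the buffer
  }
  return tally.results();
}
```

The extra memory per thread is bounded by `flush_every` events regardless of the tally size, in contrast to a full per-thread accumulator; whether the synchronized flush is cheap enough in practice would have to be measured.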

Attached are the patch against OpenMC 0.12.2 that I used for experimenting (set `THREAD_LOCAL_MODE` in openmc/include/threadprivate.h and recompile to switch modes) and my benchmarking script.

[openmc-0.12.2-atomic.patch.gz](https://github.com/openmc-dev/openmc/files/7031149/openmc-0.12.2-atomic.patch.gz)
[sphere_base.py.gz](https://github.com/openmc-dev/openmc/files/7031153/sphere_base.py.gz)









