Share code registry across MPI processes #72

hfp · 2016-03-31T05:56:58Z

Share the code registry across (MPI) processes to lower the total and per-process memory consumption, or to allow for a larger registry while still saving memory. This work does not introduce a dependency on the Message Passing Interface (MPI); instead, it uses OS primitives to achieve the effect for processes running under MPI.

Initial work to centralize and unify memory allocation, including the executable buffer under Microsoft Windows. This is also related to issue #72 (Share code registry across MPI processes).

jeffhammond · 2016-07-30T05:03:38Z

There are ways to use MPI for this that make it portable without OS dependencies, if you are willing to assume MPI as a dependency. MPI_Win_allocate_shared is the MPI wrapper around POSIX shm_open, and MPI_Comm_split_type with MPI_COMM_TYPE_SHARED is the way to identify the processes that can share the code registry. I'm not sure which POSIX function provides something equivalent, but perhaps gethostname works.

hfp · 2016-07-30T14:01:08Z

I was planning to use shm_open (and the "equivalent" under Windows). This way I get full control and, for example, the option to make the code registry persistent (by mapping an actual file). Since you mention gethostname, I want to clarify that "sharing the registry across MPI processes" was meant on a per-node basis, not as a single instance shared across nodes. The latter is perhaps too costly in terms of synchronization, and it yields no further savings in local memory consumption.

My main motivation for this issue was/is to lower the memory consumption on a per-node basis. However, due to internal restructuring, the size of the code registry has meanwhile dropped to only ~12 MB per process, including a "typical" number of kernels (the 12 MB include the actual code; btw. a typical kernel code size is perhaps around 4 KB). So this issue is not as urgent as it was a while ago (in the early days the registry was ~5x larger). Looking at KNL, I think it's fine there too, since the registry is sort of latency-bound and there is enough DDR4 memory (64 ranks on a single system would only[?] use 64 x 12 MB = ~768 MB).
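For illustration, the per-node sharing via shm_open discussed above could look roughly like the following sketch. It is not libxsmm code: the helper name `registry_map_shared`, the object name, and the sizes are hypothetical, and the mapping is read/write only (a registry holding JIT-generated code would additionally need executable permissions and proper synchronization between processes). On older glibc versions, linking with `-lrt` may be required.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of libxsmm): map a named POSIX shared-memory
 * object of `size` bytes. Processes on the same node that pass the same
 * `name` see the same memory. Returns NULL on failure. If `is_creator` is
 * non-NULL, it is set to 1 for the process that created the object, and
 * to 0 for a process that attached to an existing one. */
static void* registry_map_shared(const char* name, size_t size, int* is_creator)
{
  void* base = MAP_FAILED;
  int creator = 1;
  /* O_EXCL ensures that exactly one process wins the creation. */
  int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
  if (fd < 0 && EEXIST == errno) { /* object already exists: attach to it */
    creator = 0;
    fd = shm_open(name, O_RDWR, 0600);
  }
  if (fd < 0) return NULL;
  /* Only the creator sizes the object. NOTE: a real implementation must make
   * the other processes wait until this has happened (e.g., via a barrier),
   * since touching pages beyond the object's current size raises SIGBUS. */
  if (0 == creator || 0 == ftruncate(fd, (off_t)size)) {
    base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  }
  close(fd); /* the mapping stays valid after closing the descriptor */
  if (NULL != is_creator) *is_creator = creator;
  return (MAP_FAILED != base) ? base : NULL;
}
```

Each rank on a node would call `registry_map_shared("/some_registry_name", size, &creator)` with the same (assumed) name; the creator initializes the contents, and one process eventually calls `shm_unlink` on the name. Mapping a regular file opened with `open` instead of `shm_open` would give the persistence mentioned above.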

jeffhammond · 2016-07-30T22:56:42Z

@hfp Yeah, MPI_Comm_split_type with MPI_COMM_TYPE_SHARED is how you get a communicator for every node, and thus how you know what to pass to MPI_Win_allocate_shared. It is the portable MPI-3 way to get a shared-memory slab on every node. There are, of course, other ways to achieve this that do not require MPI.

hfp · 2017-03-15T15:22:27Z

Moved this issue to https://github.com/hfp/libxsmm/wiki/Development#longer-term-issues.

hfp self-assigned this Mar 31, 2016

hfp added the enhancement label Mar 31, 2016

hfp added this to the 1.5 milestone Mar 31, 2016

hfp removed this from the 1.5 milestone Sep 28, 2016

hfp closed this as completed Mar 15, 2017
