
Further squeeze memory consumption #2629

Closed · rhc54 wants to merge 1 commit into master from topic/footprint

Conversation

rhc54 (Contributor) commented Dec 22, 2016

Further squeeze memory consumption by removing global storage of the HWLOC topology tree, and by compressing the topology string while it sits in PMIx storage. The memory footprint of the HWLOC topology tree gets to be rather large on complex architectures such as KNL and Power9. Multiplied by hundreds of processes on a node, the aggregated footprint can reach into the gigabyte range.

Fortunately, we only use the full tree in three scenarios:

  • in the daemons for applying both processor and memory bindings. This happens only during job launch, and there is only one daemon per node. Thus, this PR retains the full-time copy of the topology tree in daemons. A further refinement for future work is to use the OPAL object nature of the orte_topology_t structure to release the topology tree when not needed.

  • in an MPI process during MPI_Init. Various BTL's use the topology during setup. This usage is momentary and outside any critical path. Thus, trading some time for memory footprint may be warranted.

  • in an MPI process when computing the topology graph in the ompi/mca/topo/treematch component. I don't fully understand this code, but it appears to construct trees during communicator creation in an attempt to optimize collective operations. If accurate, this happens outside of a critical path, and so trading time for footprint may be acceptable.

This PR represents a first cut at solving the memory footprint problem being seen at scale on Trinity by addressing the last two scenarios. The approach taken here is to recreate the topology tree upon demand, and release it upon completion of the immediate operation. This may not be an optimal solution - e.g., we may choose instead to retain a copy during MPI_Init, and then release it and recreate upon demand afterwards. I did look at reference counting the topology tree, but unfortunately the serial nature of the operations during MPI_Init caused the object to be destroyed and recreated each time. So retaining the copy would require that we obtain it at some appropriate point in MPI_Init, and then release it at the end of MPI_Init.
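As a rough illustration of the recreate-on-demand pattern described above, here is a minimal sketch using the public hwloc API; the XML buffer stands in for the topology string this PR keeps in PMIx storage, and the helper name is hypothetical:

```c
#include <hwloc.h>

/* Minimal sketch of the recreate-on-demand pattern: rebuild the
 * topology from its stored XML form, use it, and release it as soon
 * as the immediate operation completes. */
static int with_topology(const char *xml, int len,
                         void (*use_fn)(hwloc_topology_t))
{
    hwloc_topology_t topo;

    if (0 != hwloc_topology_init(&topo)) {
        return -1;
    }
    if (0 != hwloc_topology_set_xmlbuffer(topo, xml, len) ||
        0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    use_fn(topo);                  /* momentary use, e.g. BTL setup */
    hwloc_topology_destroy(topo);  /* give the footprint back */
    return 0;
}
```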

Before chasing optimization, it seemed prudent to first determine if this truly does resolve the memory problem. This PR is therefore intended as a testing platform - if it resolves the problem, we can use it as a starting point towards developing a more optimal solution.

From PMIx branch:
Add the ability to compress/decompress long strings per a suggestion by @hjelmn. We can't do it exactly as he suggested, since that approach only dealt with pack/unpack of buffers for communication. Our problem is more about storage size than message size. So we check the length of strings that are about to be stored and compress them if they exceed a limit. Such strings are marked with their decompressed size and a new PMIX_COMPRESSED_STRING data type so we know how to decompress them before delivery.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
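To make that scheme concrete, here is a hedged sketch of the store-side check using zlib; the threshold value, the length-prefix layout, and the function name are illustrative assumptions, not the actual PMIx code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical threshold; the limit actually used by PMIx may differ. */
#define COMPRESS_LIMIT 4096

/* Compress a string that is about to be stored, prefixing the blob
 * with its decompressed size so it can be restored before delivery.
 * Returns NULL when the string should be stored as a plain string. */
static uint8_t *maybe_compress(const char *str, size_t *outlen)
{
    size_t len = strlen(str) + 1;            /* include the NUL */
    if (len < COMPRESS_LIMIT) {
        return NULL;                         /* below the limit: store as-is */
    }

    uLongf zlen = compressBound((uLong)len);
    uint8_t *blob = malloc(sizeof(uint32_t) + zlen);
    if (NULL == blob) {
        return NULL;                         /* fall back to plain storage */
    }

    uint32_t orig = (uint32_t)len;           /* decompressed-size marker */
    memcpy(blob, &orig, sizeof(orig));

    if (Z_OK != compress(blob + sizeof(orig), &zlen,
                         (const Bytef *)str, (uLong)len)) {
        free(blob);
        return NULL;
    }
    *outlen = sizeof(orig) + (size_t)zlen;
    return blob;    /* caller marks the value PMIX_COMPRESSED_STRING */
}
```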

rhc54 (Contributor, Author) commented Dec 22, 2016

MTT results:

+-------------+-----------------+-------------+----------+------+------+----------+------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip |
+-------------+-----------------+-------------+----------+------+------+----------+------+
| MPI Install | my installation | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Build  | trivial         | 3.0.0a1     | 00:02    | 1    |      |          |      |
| Test Build  | ibm             | 3.0.0a1     | 00:43    | 1    |      |          |      |
| Test Build  | intel           | 3.0.0a1     | 01:22    | 1    |      |          |      |
| Test Build  | java            | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Build  | orte            | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Run    | trivial         | 3.0.0a1     | 00:11    | 8    |      |          |      |
| Test Run    | ibm             | 3.0.0a1     | 13:21    | 490  |      | 1        |      | 
| Test Run    | spawn           | 3.0.0a1     | 00:25    | 7    |      |          |      |
| Test Run    | loopspawn       | 3.0.0a1     | 08:13    | 1    |      |          |      |
| Test Run    | intel           | 3.0.0a1     | 20:39    | 473  |      | 1        | 4    |
| Test Run    | intel_skip      | 3.0.0a1     | 16:49    | 430  |      | 1        | 47   |
| Test Run    | java            | 3.0.0a1     | 00:02    | 1    |      |          |      |
| Test Run    | orte            | 3.0.0a1     | 00:51    | 19   |      |          |      |
+-------------+-----------------+-------------+----------+------+------+----------+------+

rhc54 (Contributor, Author) commented Dec 22, 2016

@hjelmn Could you please try this on KNL for us and determine the memory footprint impact?

rhc54 (Contributor, Author) commented Dec 22, 2016

I also determined that we were storing the local peer string on every node for every process. IIRC, this was done so that someone could retrieve the local peers for a given rank without first having to look up that rank's node, since the rank's node might not be easy to look up. I have removed this "feature" for the moment as I didn't see any place that obviously used it. I will look at options for providing it in a more memory-efficient manner.

rhc54 (Contributor, Author) commented Dec 22, 2016

One more observation: as stated previously, it is the BTL's that are utilizing the topology tree, not the MTL's. Large-scale systems generally rely on the MTL's for their communication, and so the memory impact may be mitigated in the scenario of most concern.

@hjelmn We are, however, seeing the BTL's being opened, and the openib BTL being active, even when the psm/2 and ofi MTL's are being used. IIRC, this was because you needed to ensure some one-sided support - true? If so, what would the MTL's need to provide in terms of support or a flag to indicate that the BTL's are not needed?

I also observed that the primary use of the topology tree was to calculate relative locality to other peers on the node. I believe there is a way for the daemon to provide that information - I will explore that option as a further optimization.
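For the record, here is a sketch of how relative locality can be derived from two peers' binding cpusets with stock hwloc calls (names are illustrative); a daemon-provided answer would replace exactly this computation:

```c
#include <hwloc.h>

/* Sketch: derive relative locality for two local peers by finding the
 * deepest topology object whose cpuset covers both of their bindings.
 * The object's type (package, NUMA node, cache, ...) is the shared
 * locality level. */
static hwloc_obj_t shared_locality(hwloc_topology_t topo,
                                   hwloc_const_cpuset_t peer_a,
                                   hwloc_const_cpuset_t peer_b)
{
    hwloc_bitmap_t both = hwloc_bitmap_alloc();
    hwloc_obj_t obj;

    hwloc_bitmap_or(both, peer_a, peer_b);
    /* lowest object covering the union of the two bindings */
    obj = hwloc_get_obj_covering_cpuset(topo, both);
    hwloc_bitmap_free(both);
    return obj;
}
```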

rhc54 (Contributor, Author) commented Dec 22, 2016

I see that the BTL selection issue is on the agenda for the Jan meeting, so I will add an item to discuss this overall topic.

bosilca (Member) commented Dec 23, 2016

The HWLOC information is also used by the collective and the topology support. It should also be in use for the non-MPI-conformant keys we support for MPI_Comm_split_type.
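For context, a usage sketch of the keys in question, assuming the OMPI_COMM_TYPE_NUMA extension that Open MPI exposes alongside the standard split types; evaluating such a key is what needs the HWLOC locality data:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm numa_comm;

    MPI_Init(&argc, &argv);
    /* Non-standard split type: groups the ranks that share a NUMA
     * domain, which cannot be evaluated without HWLOC locality data. */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA,
                        0 /* key */, MPI_INFO_NULL, &numa_comm);
    MPI_Comm_free(&numa_comm);
    MPI_Finalize();
    return 0;
}
```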

rhc54 (Contributor, Author) commented Dec 23, 2016

I'll check to see what exactly gets used - is it relative locality? As I noted, I found it in topo/treematch and in the coll/sm component, but not anywhere else.

rhc54 (Contributor, Author) commented Dec 23, 2016

I should note, as I'm sure you already know, that topo/treematch uses the actual topology tree, while coll/sm only uses relative locality. A couple of questions:

  • Is there a critical path hit if topo/treematch has to rebuild the tree each time it is called?

  • When/how does that component get called? Is it something the user does, or is it something we call in the background when performing some other operation?

ggouaillardet (Contributor) commented
@rhc54 I incidentally found another place where memory consumption could be reduced.
In orte_debugger_init_after_spawn(), MPIR_proctable is populated with a bunch of strdup()'d strings, and not all of them (if any) are needed.
Until when do we need a valid MPIR_proctable?
Depending on the answer, we might not need any strdup() at all (e.g., the table would end up pointing at de-allocated memory, but only once it is no longer needed).
If we do need a valid MPIR_proctable at all times, then we could add a bit of logic so strdup() is invoked only once per unique string (e.g., if running n tasks on 1 node, only strdup(<nodename>) once); a sketch of that dedup follows below.
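A minimal sketch of that suggestion, assuming the common case where MPIR_proctable entries are grouped by node so a one-slot cache suffices; the helper name is hypothetical:

```c
#include <stdlib.h>
#include <string.h>

/* Dedup sketch: strdup() fires once per unique hostname instead of
 * once per process. Earlier copies stay owned by the table entries
 * that reference them, so nothing is freed here. */
static char *proctable_strdup(const char *s)
{
    static char *last = NULL;

    if (NULL == last || 0 != strcmp(last, s)) {
        last = strdup(s);     /* new unique string: copy it once */
    }
    return last;              /* repeats share the same copy */
}
```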

…HWLOC topology tree, and compressing the topology string when it is in PMIx storage.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Convert remaining opal_hwloc_topology usage

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to master, add memory leak fix from PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to PMIx PR

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Fix permissions

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
hjelmn (Member) commented Dec 29, 2016

@rhc54 Things are looking better. The baselines are all down on knl. I am still seeing a massive spike in memory usage when running mpimemu. It looks like it is related to the number of processes on each node. Some graphs below.

mpimemu -s 10 -t 10 -w on KNL XC40 nodes. ppn is on the x axis, and average node (MemUsed - Cached) memory usage is on the y axis. Different lines are different numbers of nodes:

[graph: memu workload vs. ppn]

Same thing graphed with # MPI processes on the x axis:

[graph: memu workload vs. MPI process count]

There is a good correlation between the usage spike and ppn, but none against total process count. I don't yet know the cause, but I will continue to look into it.

rhc54 (Contributor, Author) commented Dec 29, 2016

I'll rebase this to master - will take a little while to sort it all out.

rhc54 (Contributor, Author) commented Dec 30, 2016

Replaced by #2649

rhc54 closed this Dec 30, 2016
rhc54 deleted the topic/footprint branch Dec 30, 2016