
Further squeeze memory consumption #2629

Closed · rhc54 wants to merge 1 commit into master from topic/footprint

Conversation

rhc54 (Contributor) commented Dec 22, 2016

Further squeeze memory consumption by removing global storage of the HWLOC topology tree, and by compressing the topology string while it sits in PMIx storage. The memory footprint of the HWLOC topology tree gets to be rather large on complex architectures such as KNL and Power9. Multiplied by hundreds of processes on a node, the aggregated footprint can reach into the gigabyte range.

Fortunately, we only use the full tree in three scenarios:

  • in the daemons for applying both processor and memory bindings. This happens only during job launch, and there is only one daemon per node. Thus, this PR retains the full-time copy of the topology tree in daemons. A further refinement for future work is to use the OPAL object nature of the orte_topology_t structure to release the topology tree when not needed.

  • in an MPI process during MPI_Init. Various BTL's use the topology during setup. This usage is momentary and outside any critical path. Thus, trading some time for memory footprint may be warranted.

  • in an MPI process when computing the topology graph in the ompi/mca/topo/treematch component. I don't fully understand this code, but it appears to construct trees during communicator creation in an attempt to optimize collective operations. If accurate, this happens outside of a critical path, and so trading time for footprint may be acceptable.

This PR represents a first cut at solving the memory footprint problem being seen at scale on Trinity by addressing the last two scenarios. The approach taken here is to recreate the topology tree upon demand, and release it upon completion of the immediate operation. This may not be an optimal solution - e.g., we may choose instead to retain a copy during MPI_Init, and then release it and recreate upon demand afterwards. I did look at reference counting the topology tree, but unfortunately the serial nature of the operations during MPI_Init caused the object to be destroyed and recreated each time. So retaining the copy would require that we obtain it at some appropriate point in MPI_Init, and then release it at the end of MPI_Init.
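As a rough illustration of the recreate-on-demand pattern described above, here is a minimal sketch using the public hwloc API; the XML buffer stands in for the topology string this PR keeps in PMIx storage, and the helper name is hypothetical:

```c
#include <hwloc.h>

/* Minimal sketch of the recreate-on-demand pattern: rebuild the
 * topology from its stored XML form, use it, and release it as soon
 * as the immediate operation completes. */
static int with_topology(const char *xml, int len,
                         void (*use_fn)(hwloc_topology_t))
{
    hwloc_topology_t topo;

    if (0 != hwloc_topology_init(&topo)) {
        return -1;
    }
    if (0 != hwloc_topology_set_xmlbuffer(topo, xml, len) ||
        0 != hwloc_topology_load(topo)) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    use_fn(topo);                  /* momentary use, e.g. BTL setup */
    hwloc_topology_destroy(topo);  /* give the footprint back */
    return 0;
}
```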

Before chasing optimization, it seemed prudent to first determine if this truly does resolve the memory problem. This PR is therefore intended as a testing platform - if it resolves the problem, we can use it as a starting point towards developing a more optimal solution.

From PMIx branch:
Add the ability to compress/decompress long strings per a suggestion by @hjelmn. We can't do it exactly as he suggested, since that approach only dealt with pack/unpack of buffers for communication. Our problem is more about storage size than message size. So we check the length of strings that are about to be stored and compress them if they exceed a limit. Such strings are marked with their decompressed size and a new PMIX_COMPRESSED_STRING data type so we know how to decompress them before delivery.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
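To make that scheme concrete, here is a hedged sketch of the store-side check using zlib; the threshold value, the length-prefix layout, and the function name are illustrative assumptions, not the actual PMIx code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Hypothetical threshold; the limit actually used by PMIx may differ. */
#define COMPRESS_LIMIT 4096

/* Compress a string that is about to be stored, prefixing the blob
 * with its decompressed size so it can be restored before delivery.
 * Returns NULL when the string should be stored as a plain string. */
static uint8_t *maybe_compress(const char *str, size_t *outlen)
{
    size_t len = strlen(str) + 1;            /* include the NUL */
    if (len < COMPRESS_LIMIT) {
        return NULL;                         /* below the limit: store as-is */
    }

    uLongf zlen = compressBound((uLong)len);
    uint8_t *blob = malloc(sizeof(uint32_t) + zlen);
    if (NULL == blob) {
        return NULL;                         /* fall back to plain storage */
    }

    uint32_t orig = (uint32_t)len;           /* decompressed-size marker */
    memcpy(blob, &orig, sizeof(orig));

    if (Z_OK != compress(blob + sizeof(orig), &zlen,
                         (const Bytef *)str, (uLong)len)) {
        free(blob);
        return NULL;
    }
    *outlen = sizeof(orig) + (size_t)zlen;
    return blob;    /* caller marks the value PMIX_COMPRESSED_STRING */
}
```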

rhc54 (Contributor, Author) commented Dec 22, 2016

MTT results:

+-------------+-----------------+-------------+----------+------+------+----------+------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip |
+-------------+-----------------+-------------+----------+------+------+----------+------+
| MPI Install | my installation | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Build  | trivial         | 3.0.0a1     | 00:02    | 1    |      |          |      |
| Test Build  | ibm             | 3.0.0a1     | 00:43    | 1    |      |          |      |
| Test Build  | intel           | 3.0.0a1     | 01:22    | 1    |      |          |      |
| Test Build  | java            | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Build  | orte            | 3.0.0a1     | 00:01    | 1    |      |          |      |
| Test Run    | trivial         | 3.0.0a1     | 00:11    | 8    |      |          |      |
| Test Run    | ibm             | 3.0.0a1     | 13:21    | 490  |      | 1        |      | 
| Test Run    | spawn           | 3.0.0a1     | 00:25    | 7    |      |          |      |
| Test Run    | loopspawn       | 3.0.0a1     | 08:13    | 1    |      |          |      |
| Test Run    | intel           | 3.0.0a1     | 20:39    | 473  |      | 1        | 4    |
| Test Run    | intel_skip      | 3.0.0a1     | 16:49    | 430  |      | 1        | 47   |
| Test Run    | java            | 3.0.0a1     | 00:02    | 1    |      |          |      |
| Test Run    | orte            | 3.0.0a1     | 00:51    | 19   |      |          |      |
+-------------+-----------------+-------------+----------+------+------+----------+------+

rhc54 (Contributor, Author) commented Dec 22, 2016

@hjelmn Could you please try this on KNL for us and determine the memory footprint impact?

rhc54 (Contributor, Author) commented Dec 22, 2016

I also determined that we were storing the local peer string on every node for every process. IIRC, this was done so that someone could retrieve the local peers for a given rank without first having to look up that rank's node, since the rank's node might not be easy to look up. I have removed this "feature" for the moment as I didn't see any place that obviously used it. I will look at options for providing it in a more memory-efficient manner.

rhc54 (Contributor, Author) commented Dec 22, 2016

One more observation: as stated previously, it is the BTL's that are utilizing the topology tree, not the MTL's. Large-scale systems generally rely on the MTL's for their communication, and so the memory impact may be mitigated in the scenario of most concern.

@hjelmn We are, however, seeing the BTL's being opened, and the openib BTL being active, even when the psm/2 and ofi MTL's are being used. IIRC, this was because you needed to ensure some one-sided support - true? If so, what would the MTL's need to provide in terms of support or a flag to indicate that the BTL's are not needed?

I also observed that the primary use of the topology tree was to calculate relative locality to other peers on the node. I believe there is a way for the daemon to provide that information - I will explore that option as a further optimization.
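For the record, here is a sketch of how relative locality can be derived from two peers' binding cpusets with stock hwloc calls (names are illustrative); a daemon-provided answer would replace exactly this computation:

```c
#include <hwloc.h>

/* Sketch: derive relative locality for two local peers by finding the
 * deepest topology object whose cpuset covers both of their bindings.
 * The object's type (package, NUMA node, cache, ...) is the shared
 * locality level. */
static hwloc_obj_t shared_locality(hwloc_topology_t topo,
                                   hwloc_const_cpuset_t peer_a,
                                   hwloc_const_cpuset_t peer_b)
{
    hwloc_bitmap_t both = hwloc_bitmap_alloc();
    hwloc_obj_t obj;

    hwloc_bitmap_or(both, peer_a, peer_b);
    /* lowest object covering the union of the two bindings */
    obj = hwloc_get_obj_covering_cpuset(topo, both);
    hwloc_bitmap_free(both);
    return obj;
}
```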

rhc54 (Contributor, Author) commented Dec 22, 2016

I see that the BTL selection issue is on the agenda for the Jan meeting, so I will add an item to discuss this overall topic.

bosilca (Member) commented Dec 23, 2016

The HWLOC information is also used by the collective and the topology support. It should also be in use for the non-MPI-conformant keys we support for MPI_Comm_split_type.
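For context, a usage sketch of the keys in question, assuming the OMPI_COMM_TYPE_NUMA extension that Open MPI exposes alongside the standard split types; evaluating such a key is what needs the HWLOC locality data:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm numa_comm;

    MPI_Init(&argc, &argv);
    /* Non-standard split type: groups the ranks that share a NUMA
     * domain, which cannot be evaluated without HWLOC locality data. */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA,
                        0 /* key */, MPI_INFO_NULL, &numa_comm);
    MPI_Comm_free(&numa_comm);
    MPI_Finalize();
    return 0;
}
```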

rhc54 (Contributor, Author) commented Dec 23, 2016

I'll check to see what exactly gets used - is it relative locality? As I noted, I found it in topo/treematch and in the coll/sm component, but not anywhere else.

rhc54 (Contributor, Author) commented Dec 23, 2016

I should note, as I'm sure you already know, that topo/treematch uses the actual topology tree, while coll/sm only uses relative locality. A couple of questions:

  • Is there a critical path hit if topo/treematch has to rebuild the tree each time it is called?

  • When/how does that component get called? Is it something the user does, or is it something we call in the background when performing some other operation?

ggouaillardet (Contributor) commented
@rhc54 I incidentally found another place where memory consumption could be reduced.
In orte_debugger_init_after_spawn(), MPIR_proctable is populated with a bunch of strdup()'d strings, and not all of them (if any) are needed.
Until when do we need a valid MPIR_proctable?
Depending on the answer, we might not need any strdup() at all (e.g., the table would end up pointing at de-allocated memory, but only once it is no longer needed).
If we do need a valid MPIR_proctable at all times, then we could add a bit of logic so strdup() is invoked only once per unique string (e.g., if running n tasks on 1 node, only strdup(<nodename>) once); a sketch of that dedup follows below.
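A minimal sketch of that suggestion, assuming the common case where MPIR_proctable entries are grouped by node so a one-slot cache suffices; the helper name is hypothetical:

```c
#include <stdlib.h>
#include <string.h>

/* Dedup sketch: strdup() fires once per unique hostname instead of
 * once per process. Earlier copies stay owned by the table entries
 * that reference them, so nothing is freed here. */
static char *proctable_strdup(const char *s)
{
    static char *last = NULL;

    if (NULL == last || 0 != strcmp(last, s)) {
        last = strdup(s);     /* new unique string: copy it once */
    }
    return last;              /* repeats share the same copy */
}
```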

…HWLOC topology tree, and compressing the topology string when it is in PMIx storage.

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Convert remaining opal_hwloc_topology usage

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to master, add memory leak fix from PMIx

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to PMIx PR

Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Fix permissions

Signed-off-by: Ralph Castain <rhc@open-mpi.org>
hjelmn (Member) commented Dec 29, 2016

@rhc54 Things are looking better. The baselines are all down on knl. I am still seeing a massive spike in memory usage when running mpimemu. It looks like it is related to the number of processes on each node. Some graphs below.

mpimemu -s 10 -t 10 -w on KNL XC40 nodes. ppn is on the x axis, and average node (MemUsed - Cached) memory usage is on the y axis. Different lines are different numbers of nodes:

[graph: memu workload vs. ppn]

Same thing graphed with # MPI processes on the x axis:

[graph: memu workload vs. MPI process count]

There is a good correlation between the usage spike and ppn, but none against total process count. I don't yet know the cause, but I will continue to look into it.

rhc54 (Contributor, Author) commented Dec 29, 2016

I'll rebase this to master - will take a little while to sort it all out.

rhc54 (Contributor, Author) commented Dec 30, 2016

Replaced by #2649

rhc54 closed this Dec 30, 2016
rhc54 deleted the topic/footprint branch Dec 30, 2016