Further squeeze memory consumption #2629
Conversation
MTT results:
+-------------+-----------------+-------------+----------+------+------+----------+------+
| Phase | Section | MPI Version | Duration | Pass | Fail | Time out | Skip |
+-------------+-----------------+-------------+----------+------+------+----------+------+
| MPI Install | my installation | 3.0.0a1 | 00:01 | 1 | | | |
| Test Build | trivial | 3.0.0a1 | 00:02 | 1 | | | |
| Test Build | ibm | 3.0.0a1 | 00:43 | 1 | | | |
| Test Build | intel | 3.0.0a1 | 01:22 | 1 | | | |
| Test Build | java | 3.0.0a1 | 00:01 | 1 | | | |
| Test Build | orte | 3.0.0a1 | 00:01 | 1 | | | |
| Test Run | trivial | 3.0.0a1 | 00:11 | 8 | | | |
| Test Run | ibm | 3.0.0a1 | 13:21 | 490 | | 1 | |
| Test Run | spawn | 3.0.0a1 | 00:25 | 7 | | | |
| Test Run | loopspawn | 3.0.0a1 | 08:13 | 1 | | | |
| Test Run | intel | 3.0.0a1 | 20:39 | 473 | | 1 | 4 |
| Test Run | intel_skip | 3.0.0a1 | 16:49 | 430 | | 1 | 47 |
| Test Run | java | 3.0.0a1 | 00:02 | 1 | | | |
| Test Run | orte | 3.0.0a1 | 00:51 | 19 | | | |
+-------------+-----------------+-------------+----------+------+------+----------+------+
@hjelmn Could you please try this on KNL for us and determine the memory footprint impact?
I also determined that we were storing the local peer string on every node for every process. IIRC, this was done so that someone could retrieve the local peers for a given rank without first having to look up that rank's node, since the rank's node might not be something easily looked up. I have removed this "feature" for the moment, as I didn't see any place that obviously used it. I will look at options for providing that feature in a more memory-efficient manner.
One more observation: as stated previously, it is the BTLs that are utilizing the topology tree, not the MTLs. Large-scale systems generally rely on the MTLs for their communication, so the memory impact may be mitigated in the scenario of most concern. @hjelmn We are, however, seeing the BTLs being opened, and the openib BTL being active, even when the psm/2 and ofi MTLs are being used. IIRC, this was because you needed to ensure some one-sided support - true? If so, what would the MTLs need to provide, in terms of support or a flag, to indicate that the BTLs are not needed? I also observed that the primary use of the topology tree was to calculate relative locality to other peers on the node. I believe there is a way for the daemon to provide that information - I will explore that option as a further optimization.
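To illustrate what "relative locality" means here, below is a hedged, minimal sketch: given each peer's placement (node, socket, core) as the daemon could report it, derive flags describing how closely two peers share hardware. The flag names and the tuple layout are illustrative assumptions in the spirit of OPAL's locality flags, not the actual Open MPI API.

```python
# Illustrative locality flags; the real OPAL flag set is richer.
ON_NODE, ON_SOCKET, ON_CORE = 0x1, 0x2, 0x4

def relative_locality(a, b):
    """a and b are hypothetical (node, socket, core) tuples for two peers.

    Each level of sharing implies the levels above it, so the flags nest.
    """
    flags = 0
    if a[0] == b[0]:
        flags |= ON_NODE
        if a[1] == b[1]:
            flags |= ON_SOCKET
            if a[2] == b[2]:
                flags |= ON_CORE
    return flags

# Two ranks on the same socket but different cores share node + socket.
assert relative_locality((0, 1, 3), (0, 1, 5)) == ON_NODE | ON_SOCKET
# Ranks on different nodes share nothing.
assert relative_locality((0, 1, 3), (1, 1, 3)) == 0
```

The point of the sketch is that this computation needs only a few integers per peer, not the full topology tree, which is why having the daemon supply it directly could save memory.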
I see that the BTL selection issue is on the agenda for the Jan meeting, so I will add an item to discuss this overall topic. |
The HWLOC information is also used by the collective and topology support. It should also be in use for the non-MPI-conformant keys we support for MPI_Comm_split_type.
I'll check to see what exactly gets used - is it relative locality? As I noted, I found it in topo/treematch and in the coll/sm component, but nowhere else.
I should note, as I'm sure you already know, that topo/treematch used the actual topology tree, while coll/sm only used relative locality. A couple of questions:
@rhc54 I incidentally found another place where memory consumption could be reduced.
…HWLOC topology tree, and compressing the topology string when it is in PMIx storage.

From PMIx branch: Add the ability to compress/decompress long strings per suggestion by @hjelmn. We can't do it exactly as he suggested, as that approach only dealt with pack/unpack of buffers for communication. Our problem is more the storage size, not the size of our messages. So check the length of strings that are about to be stored, and compress them if they are larger than a limit. Mark such strings with their decompressed size and a new PMIX_COMPRESSED_STRING data type so we know how to decompress them before delivery.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Convert remaining opal_hwloc_topology usage
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to master, add memory leak fix from pmix
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Sync to PMIx PR
Signed-off-by: Ralph Castain <rhc@open-mpi.org>

Fix permissions
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
@rhc54 Things are looking better. The baselines are all down on knl. I am still seeing a massive spike in memory usage when running mpimemu; it looks like it is related to the number of processes on each node. Some graphs below: mpimemu -s 10 -t 10 -w on KNL XC40 nodes, with ppn on the x axis and average node (MemUsed - Cached) memory usage on the y axis. Different lines are different numbers of nodes. The same data is also graphed with # MPI processes on the x axis. There is a good correlation in the usage spike on the graph vs ppn, but none on the graph vs process count. I don't yet know the cause, but I will continue to look into it.
I'll rebase this to master - will take a little while to sort it all out. |
Replaced by #2649 |
Further squeeze memory consumption by removing global storage of the HWLOC topology tree, and compressing the topology string when it is in PMIx storage. The memory footprint of the HWLOC topology tree gets to be rather large on complex architectures such as KNL and Power9. When multiplied by 100s of processes on a node, the aggregated memory footprint can reach into the gigabyte range.
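The aggregation claim above is easy to check with back-of-envelope arithmetic. The per-process tree size used here (8 MB) is an illustrative assumption for a complex KNL/Power9 topology, not a measured value from this PR.

```python
# Back-of-envelope check of the aggregate footprint claim.
tree_mb = 8            # assumed per-process HWLOC tree size (illustrative)
procs_per_node = 256   # e.g., one rank per hardware thread on a KNL node

total_gb = tree_mb * procs_per_node / 1024
assert total_gb == 2.0   # aggregate footprint reaches the gigabyte range
```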
Fortunately, we only use the full tree in three scenarios:
1. In the daemons, for applying both processor and memory bindings. This happens only during job launch, and there is only one daemon per node. Thus, this PR retains the full-time copy of the topology tree in daemons. A further refinement for future work is to use the OPAL object nature of the `orte_topology_t` structure to release the topology tree when not needed.
2. In an MPI process during MPI_Init. Various BTLs use the topology during setup. This usage is momentary and outside any critical path, so trading some time for memory footprint may be warranted.
3. In an MPI process when computing the topology graph in the ompi/mca/topo/treematch component. I don't fully understand this code, but it appears to construct trees during communicator creation in an attempt to optimize collective operations. If accurate, this happens outside of a critical path, and so trading time for footprint may be acceptable.
This PR represents a first cut at solving the memory footprint problem being seen at scale on Trinity by addressing the last two scenarios. The approach taken here is to recreate the topology tree upon demand, and release it upon completion of the immediate operation. This may not be an optimal solution - e.g., we may choose instead to retain a copy during MPI_Init, and then release it and recreate upon demand afterwards. I did look at reference counting the topology tree, but unfortunately the serial nature of the operations during MPI_Init caused the object to be destroyed and recreated each time. So retaining the copy would require that we obtain it at some appropriate point in MPI_Init, and then release it at the end of MPI_Init.
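The recreate-on-demand pattern described above can be sketched as follows. This is a hedged illustration, not the actual Open MPI or hwloc code: the `Topology` class and `with_topology` helper are hypothetical stand-ins for building the tree from the stored topology string and destroying it afterwards.

```python
class Topology:
    """Hypothetical stand-in for a parsed HWLOC topology tree."""
    def __init__(self, xml):
        self.xml = xml        # pretend-parse of the stored topology string
        self.loaded = True

    def destroy(self):
        self.loaded = False   # stands in for freeing the tree

def with_topology(topology_xml, operation):
    """Build the tree only for the duration of one operation, then
    release it, so the footprint is transient rather than permanent."""
    topo = Topology(topology_xml)
    try:
        return operation(topo)
    finally:
        topo.destroy()        # released immediately after use

# Usage: each caller pays the (re)build cost, but nothing is retained.
result = with_topology("<topology/>", lambda t: t.loaded)
assert result is True
```

This also shows why naive reference counting degenerates here: when uses are strictly serial, the count drops to zero between operations, so the tree is destroyed and rebuilt each time anyway, exactly as observed during MPI_Init.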
Before chasing optimization, it seemed prudent to first determine if this truly does resolve the memory problem. This PR is therefore intended as a testing platform - if it resolves the problem, we can use it as a starting point towards developing a more optimal solution.
From PMIx branch:
Add the ability to compress/decompress long strings per suggestion by @hjelmn. We can't do it exactly as he suggested as that approach only dealt with pack/unpack of buffers for communication. Our problem is more around the storage size, not the size of our messages. So check the length of strings that are about to be stored, and compress them if larger than a limit. Mark such strings with their decompressed size and a new PMIX_COMPRESSED_STRING data type so we know how to decompress them before delivery.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
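The compress-above-a-threshold scheme described in the commit message can be sketched as below. The threshold value, the type constants, and the wire layout (a 4-byte decompressed length followed by the deflated payload) are illustrative assumptions, not the actual PMIx format.

```python
import struct
import zlib

COMPRESS_LIMIT = 512          # assumed threshold, not the real PMIx value
TYPE_STRING = 0               # plain string, stored as-is
TYPE_COMPRESSED_STRING = 1    # analogous in spirit to PMIX_COMPRESSED_STRING

def store_string(s):
    """Compress only strings larger than the limit, marking them with
    their decompressed size so the reader knows how to restore them."""
    data = s.encode()
    if len(data) <= COMPRESS_LIMIT:
        return TYPE_STRING, data
    header = struct.pack("!I", len(data))   # decompressed size marker
    return TYPE_COMPRESSED_STRING, header + zlib.compress(data)

def load_string(dtype, blob):
    """Decompress marked strings before delivery; pass others through."""
    if dtype == TYPE_STRING:
        return blob.decode()
    size, = struct.unpack("!I", blob[:4])
    out = zlib.decompress(blob[4:])
    assert len(out) == size                 # sanity-check the marker
    return out.decode()

# A long, repetitive string (like a topology XML dump) compresses well
# and round-trips intact; short strings are stored untouched.
long_s = "<node socket='0'/>" * 200
dtype, blob = store_string(long_s)
assert dtype == TYPE_COMPRESSED_STRING and len(blob) < len(long_s)
assert load_string(dtype, blob) == long_s
```

Storing the decompressed size up front is the key design choice: it lets the reader allocate the output buffer once, and the separate data type keeps uncompressed storage paths completely unchanged.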