Fix segv in OMPI_Affinity_str(). #8472
ca25946 to f00bf65
bot:aws:retest |
bot:aws:retest |
I don't really grok what was done in the |
@jsquyres those changes protect the access of Technically I don't think we need to access userdata anymore since the bindings are propagated to via PMIX to |
Reviewing with @gpaulsen revealed some more gotchas. Gonna push up a new change shortly. |
fed8d6b to f4da47d
Fixes this segv:
|
f91fc71 to 0a06f12
@@ -389,12 +401,7 @@ int opal_hwloc_base_get_topology(void)
    return OPAL_ERROR;
    }
    free(val);
    /* filter the cpus thru any default cpu set */
We do need to pass thru "filter_cpus" in the case where we get the topology from other than shmem - was there some reason to remove it? I can see not doing so if you pull the topology from a file, and we obviously cannot do it if the topology is in shmem, but it is necessary otherwise.
This is the case where it gets the topology from the xml file created by PMIx, whose topology should be the same as the one out of shmem. If there's a reason it's needed here I can add it though.
The other case, when the user provided its own file, also didn't filter cpus.
The case where the topology is loaded manually is covered.
We do need it in this case (where the topology comes from the xml file) as the file is generally written by the daemon, which is not subject to the user's cpu specification. We don't do it in the case where the user provides the file because that is most commonly a developer debug use-case, not something we see done in practice, though one could make an argument that we should do it there too (so a developer could use one known topology and still test a variety of cgroup specifications against it).
We technically should do it even when we get the topology from shmem, but we can't due to the read-only nature of the shmem connection versus the way we chose to cache the results of the filter. This is something we'll have to address at some point (probably need a "shadow" tree that contains just the filtered information we used to store under "userdata").
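The rule described above can be sketched as a small decision function. This is only an illustration of the discussion, not code from the PR: the enum, its labels, and `should_filter_cpus` are hypothetical names, and the sketch reflects the current behavior described here (no filtering for shmem or user-provided files), not what it ideally should be.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for where the topology came from.
 * These are NOT Open MPI identifiers; they exist only for this sketch. */
typedef enum {
    TOPO_FROM_SHMEM,         /* read-only shared-memory mapping */
    TOPO_FROM_PMIX_XML,      /* xml file, generally written by the daemon */
    TOPO_FROM_USER_FILE,     /* file supplied by the user (debug use-case) */
    TOPO_DISCOVERED_LOCALLY  /* topology loaded manually by the process */
} topo_source_t;

/* Return true when the cpus should be filtered through the default cpu set. */
static bool should_filter_cpus(topo_source_t src)
{
    switch (src) {
    case TOPO_FROM_SHMEM:
        /* Should be filtered in principle, but the mapping is read-only
         * and the filter results are cached on the tree, so it is skipped. */
        return false;
    case TOPO_FROM_USER_FILE:
        /* Developer debug path; not filtered today. */
        return false;
    case TOPO_FROM_PMIX_XML:
        /* The daemon that wrote the file is not subject to the
         * user's cpu specification, so the filter is required. */
        return true;
    case TOPO_DISCOVERED_LOCALLY:
        return true;
    }
    assert(!"unknown topology source");
    return false;
}
```

A "shadow" tree holding only the filtered information would let the shmem case return true as well, since nothing would need to be written into the read-only topology.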
Understood. I updated the PR to put that block of code back. I might have time to take a stab at addressing the shmem issue in the coming week.
- 'userdata' is not available when the topology is distributed via shared memory. Fix some areas where it wasn't protected.
- Allow users to pass in topo files via --mca opal_hwloc_base_topo_file.
- Make sure the cache line is set wherever the topology is taken from.

Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
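As a rough illustration of what "protecting" a userdata access means, the sketch below checks the pointer before dereferencing it. The types and the helper `cached_npus` are hypothetical stand-ins, not the PR's code; the only detail taken from hwloc is that a topology object carries a `void *userdata` field, which is assumed here to be NULL when the topology arrives via shared memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-object cache that would normally hang off userdata. */
typedef struct {
    unsigned npus;
} obj_cache_t;

/* Minimal stand-in for a topology object; only the userdata field matters. */
typedef struct {
    void *userdata;
} topo_obj_t;

/* Read the cached PU count if it exists.
 * Returns 0 on success, -1 when no cache is attached (for example, a
 * shared-memory topology), so the caller can fall back to computing it. */
static int cached_npus(const topo_obj_t *obj, unsigned *npus)
{
    assert(NULL != npus);
    if (NULL == obj || NULL == obj->userdata) {
        return -1;  /* nothing cached: do not dereference */
    }
    *npus = ((const obj_cache_t *) obj->userdata)->npus;
    return 0;
}
```

Returning an error code instead of crashing leaves the fallback decision to the caller, which is the kind of check whose absence leads to a segv.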
0a06f12 to 17f72b1
bot:ibm:retest |
bot:ibm:retest |
The IBM CI (GNU/Scale) build failed! Please review the log, linked below. Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6 |
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6 |
The IBM CI (XL) build failed! Please review the log, linked below. Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6 |
bot:ibm:retest |
shared memory. Fix some areas where it wasn't protected.
filter cpus from PMIx either.
Signed-off-by: Austen Lauria awlauria@us.ibm.com