@jjhursey

The mca_component_show_load_errors MCA parameter prints an error to stddiag when a dlopen fails. It would be nicer to format that information and report it inline in ompi_info.

This makes it easy to see a failed component in the list next to the other components of the same framework. That is handy if you drop a module into a build on a system where one of the shared libraries it depends on is not present. Instead of seeing no errors (say, if the sysadmin turned off mca_component_show_load_errors in the default params file), users will see the component listed together with the reason it failed. For example, this is what you get after dropping the LSF modules into a build on a system that doesn't have the necessary supporting libraries yet:

shell$  ompi_info --show-failed | grep fail
                 MCA ess: lsf (failed to load) libbat.so: cannot open shared object file: No such file or directory
                 MCA plm: lsf (failed to load) libbat.so: cannot open shared object file: No such file or directory
                 MCA ras: lsf (failed to load) libbat.so: cannot open shared object file: No such file or directory
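
If you want to post-process that output, the lines are regular enough to reduce to one `framework/component` entry per failure. The sketch below is only an illustration: `list_failed_components` is a hypothetical helper, not part of Open MPI, and it assumes the line layout stays exactly as in the sample above (`MCA <framework>: <component> (failed to load) <reason>`).

```shell
# Hypothetical helper (not part of Open MPI): read `ompi_info --show-failed`
# output on stdin and print "framework/component<TAB>reason" for every
# component that failed to load. Assumes the line layout shown above.
list_failed_components() {
  awk '
    /\(failed to load\)/ {
      framework = $2
      sub(/:$/, "", framework)      # "ess:" -> "ess"
      component = $3
      reason = $0
      sub(/^.*\(failed to load\) */, "", reason)   # keep only the loader error
      printf "%s/%s\t%s\n", framework, component, reason
    }
  '
}
```

Usage would look like `ompi_info --show-failed | list_failed_components`. Components that loaded fine are skipped, so an empty result means nothing failed.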

 * Add a path for failed component load information to be reported up.
 * This allows ompi_info to display this information inline to make it
   easier for folks to see if the component is present but failed for
   some reason. Most likely a missing library, but could be a libnl
   conflict.
 * Add MCA parameter to enable this feature:
   - `mca_base_component_track_load_errors` takes a boolean
   - Default: `false`

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
 * `ompi_info --show-failed` will include the failed components along
   with information about why they failed.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
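
To try the new parameter from a shell, something like the following should work. It relies on the usual Open MPI convention that an MCA parameter can be set through an `OMPI_MCA_<name>` environment variable. Whether `ompi_info --show-failed` needs the parameter set explicitly, or turns tracking on by itself, is not stated here, so treat the export as an assumption that may be redundant.

```shell
# Assumption: tracking has to be enabled explicitly; `--show-failed` may
# already do this on its own, in which case the export is harmless.
export OMPI_MCA_mca_base_component_track_load_errors=true

# Only query ompi_info if it is actually installed on this machine.
if command -v ompi_info >/dev/null 2>&1; then
  ompi_info --show-failed | grep 'failed to load' \
    || echo "no failed components reported"
fi
```

The same `mca_base_component_track_load_errors = true` setting could instead go into an MCA params file if it should apply to every run.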
@jjhursey

jjhursey commented May 9, 2017

@jsquyres I think this is ready to commit. Since it navigates through the MCA data structures I wanted to get your 👀 on it before committing.

@jsquyres

bot:ompi:retest

@jsquyres jsquyres merged commit 23325c3 into open-mpi:master May 16, 2017
@jjhursey jjhursey deleted the topic/ompi_info_show_failed branch May 22, 2017 14:58