Description
Call hipInit(0) before MPI_Init.
Motivation and context
OLCF suggests this as a workaround for segmentation faults in MPI_Init: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#olcfdev-1655-occasional-seg-fault-during-mpi-init
How has this been tested?
I attempted to reproduce the failure. Prior to this change, I did observe it very occasionally: approximately 3 times out of hundreds of launches ranging from 10 nodes to 500 nodes. The error occurred in both CPU-only and GPU jobs.
In ~20 submissions after this change, I did not observe the error. With such a low rate of occurrence, we will only be certain of the fix after long-term testing with production runs. There is no harm in initializing the HIP runtime early (https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___driver.html#ga01baa652dda5815c594d047060496caa); it is normally initialized automatically on the first HIP call.
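For reviewers who want to see the ordering in isolation, here is a minimal sketch. It is not the code in this pull request. The helper `init_gpu_runtime_then_mpi` and the `StartupStatus` struct are hypothetical names, and the two initializers are passed in as callables so the snippet compiles without HIP or MPI installed. The sketch also assumes that MPI should still be started when the early GPU initialization reports a failure. That is a choice made for this illustration, not something stated above.

```cpp
#include <functional>

// Outcome of the two-step startup. Both field names are illustrative.
struct StartupStatus
{
    bool gpu_runtime_ok; // true if the early GPU runtime initialization succeeded
    bool mpi_ok;         // true if MPI initialization succeeded
};

// Hypothetical helper: initialize the GPU runtime first, then MPI.
// With the real libraries, the callables would wrap calls such as:
//   gpu_init: [] { return hipInit(0) == hipSuccess; }
//   mpi_init: [&] { return MPI_Init(&argc, &argv) == MPI_SUCCESS; }
inline StartupStatus init_gpu_runtime_then_mpi(const std::function<bool()>& gpu_init,
                                               const std::function<bool()>& mpi_init)
{
    StartupStatus status{false, false};

    // Step 1: initialize the GPU runtime before MPI starts.
    status.gpu_runtime_ok = gpu_init();

    // Step 2: start MPI regardless of the result of step 1
    // (an assumption of this sketch).
    status.mpi_ok = mpi_init();

    return status;
}
```

In an application, this helper would be called once at the top of `main`, before any other MPI or HIP call, and the returned status could be logged.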
Change log
Checklist:
sphinx-doc/credits.rst
) in the pull request source branch.