mca_io_ompio_file_write_at_all() failed during parallel write of PnetCDF lib #10297
This test needs to be run on a supercomputer or Linux cluster, as it requires 512 MPI tasks.
Load an openmpi module (if one is available) for testing. If the system does not have an openmpi module available, build Open MPI from source.
Build and install the PnetCDF lib with openmpi (if there is no available module), e.g. to /path/to/pnetcdf-installation.
Steps to build the test:
Submit a batch job or use an interactive job to execute the following command
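A rough sketch of the build-and-run flow, assuming a placeholder test program name (pnetcdf_write_test) and the installation prefix mentioned above:

```sh
# Load the MPI module if one is available; otherwise build Open MPI from source first
module load openmpi

# Build the test against the PnetCDF installation (program and source names are placeholders)
mpicc -o pnetcdf_write_test pnetcdf_write_test.c \
    -I/path/to/pnetcdf-installation/include \
    -L/path/to/pnetcdf-installation/lib -lpnetcdf

# Launch with 512 MPI tasks from a batch or interactive SLURM allocation
srun -n 512 ./pnetcdf_write_test
```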
@dqwu, is there any error like the following in the log?
@brminich Yes, the log file also contains this error message:
The failing nodes can vary from run to run, as this issue was reproduced with both the debug queue and the compute queue.
This sounds like it is turning into a UCX issue...? FYI @open-mpi/ucx
Hi, it seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighbor rank (on the same node) may try to communicate with the terminated rank via the SHM transport using CMA and then gets an error. Is it possible to get logs from the same run? Thank you.
@hoopoepg I have rerun the test, and the full log can be viewed at the following URL:
Hi @dqwu,
@hoopoepg test_openmpi.193176.out is the original output log file of the submitted slurm job to run this test. No changes have been made to this log file.
Hmmm, timestamps are missing... not good. Please also check whether the error is present there. Thank you.
Here is the latest log file:
@dqwu, can you please try the following (separate) experiments to understand the problem:
All 3 of the experiments failed; please see the log files.
- UCX_TLS=rc
- UCX_RC_MAX_RD_ATOMIC=1
- UCX_RC_PATH_MTU=1024
- FW version
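Since srun is used rather than mpirun, one way these UCX settings can be applied is by exporting them before the job step; a minimal sketch (one setting per run, test program name is a placeholder):

```sh
# Run each experiment separately with a single setting exported
export UCX_TLS=rc                  # experiment 1
# export UCX_RC_MAX_RD_ATOMIC=1    # experiment 2
# export UCX_RC_PATH_MTU=1024      # experiment 3

srun -n 512 ./pnetcdf_write_test
```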
|
@dqwu, can you please also try with:
These two tests also failed.
- UCX_TLS=sm,tcp
- UCX_TLS=tcp
- ibv_devinfo output
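For completeness, a sketch of how the two transport restrictions and the device query above could be run (test program name is a placeholder):

```sh
# Restrict UCX to shared memory + TCP, then to TCP only (two separate runs)
export UCX_TLS=sm,tcp
srun -n 512 ./pnetcdf_write_test

export UCX_TLS=tcp
srun -n 512 ./pnetcdf_write_test

# Report InfiniBand device details, including the firmware version
ibv_devinfo
```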
Since it fails on TCP, it does not seem to be related to the IB fabric.
The test is run with SLURM srun instead of mpirun. Should I use environment variables instead of the mpirun flags (-x UCX_TLS=tcp -mca coll ^hcoll)?
export OMPI_MCA_pml=ob1
export OMPI_MCA_coll=^hcoll
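In general, an MCA parameter passed to mpirun as -mca <name> <value> can instead be exported as an OMPI_MCA_<name> environment variable, and srun forwards the caller's environment to the ranks; a sketch of the equivalents for the flags discussed above (test program name is a placeholder):

```sh
export UCX_TLS=tcp            # equivalent of "mpirun -x UCX_TLS=tcp"
export OMPI_MCA_coll=^hcoll   # equivalent of "-mca coll ^hcoll"
export OMPI_MCA_pml=ob1       # equivalent of "-mca pml ob1"

srun -n 512 ./pnetcdf_write_test
```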
These two tests still failed.
- OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp
- OMPI_MCA_coll=^hcoll UCX_TLS=tcp
Can you try two things as well to change some settings for the parallel I/O part?
- OMPI_MCA_fcoll=dynamic
- OMPI_MCA_io=^ompio
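A sketch of how these two MCA settings can be applied under srun; excluding the ompio component makes Open MPI fall back to its ROMIO-based MPI-IO implementation (test program name is a placeholder):

```sh
# First experiment: switch the OMPIO collective file I/O component
export OMPI_MCA_fcoll=dynamic
srun -n 512 ./pnetcdf_write_test

# Second experiment: exclude ompio entirely so ROMIO handles MPI-IO
unset OMPI_MCA_fcoll
export OMPI_MCA_io=^ompio
srun -n 512 ./pnetcdf_write_test
```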
OK, thank you. Maybe you can use this last flag as a workaround for now. Is there a way to reproduce this issue with a smaller process count as well? It would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt that the problem is in the fcoll component itself; it is probably at another level (e.g. setting the file view or similar).
Just checking: has this issue been fixed? (If not, do you have an approximate timeline for when it will be fixed?)
I am really sorry, but I don't have a platform to reproduce the issue, and parallel I/O is at this point also not part of my work description at my new job, so my resources for resolving this issue are somewhat limited. I would recommend sticking with the workaround that was mentioned previously on this ticket.
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.3
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Spack installation
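For context, a Spack-based build along these lines is assumed; the exact spec and compiler are illustrative, based on the modules listed further below:

```sh
# Illustrative only; the actual Spack spec used on the machine may differ
spack install openmpi@4.1.3 %gcc@9.2.0
spack load openmpi@4.1.3
```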
Please describe the system on which you are running
Details of the problem
This issue occurs on a machine used by E3SM (e3sm.org):
https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl
Modules used: gcc/9.2.0-ugetvbp, openmpi/4.1.3-sxfyy4k, parallel-netcdf/1.11.0-mirrcz7
The test to reproduce this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). The issue is also reproducible with modules built with the Intel compiler.
With the same test, the issue is not reproducible with Intel MPI on the same machine.
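A sketch of the job layout used to reproduce the issue (the partition name is omitted and the test program name is a placeholder):

```sh
# 8 nodes x 64 tasks per node = 512 MPI tasks
srun -N 8 -n 512 --ntasks-per-node=64 ./pnetcdf_write_test
```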
Error messages and backtrace in the output log file:
A lock file was generated in addition to the expected output .nc file: