
mca_io_ompio_file_write_at_all() failed during parallel write of PnetCDF lib #10297

Open
dqwu opened this issue Apr 19, 2022 · 24 comments


dqwu commented Apr 19, 2022

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Spack installation

Please describe the system on which you are running

  • Operating system/version: CentOS 8
  • Computer hardware: AMD Epyc 7532 processors (32 cores per CPU, 2.4 GHz)
  • Network type: N.A.

Details of the problem

This issue occurs on a machine used by E3SM (e3sm.org):
https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

The test to reproduce this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). The issue is also reproducible with modules built with the Intel compiler.

With the same test, the issue is not reproducible with Intel MPI on the same machine.

Error messages and backtrace in the output log file:

[1650397594.556260] [chr-0496:1587709:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587708 length=524288) returned -1: Bad address
[1650397596.163612] [chr-0500:609796:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609795 length=524288) returned -1: No such process
...
[1650397606.028410] [chr-0496:1587764:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587763 length=524288) returned -1: No such process
srun: error: chr-0494: tasks 0,9,11,25: Killed
...
[1650397612.452233] [chr-0500:609772:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609771 length=333312) returned -1: No such process
==== backtrace (tid:3089585) ====
 0 0x0000000000055969 ucs_debug_print_backtrace()  ???:0
 1 0x00000000000200a9 uct_ib_mlx5_completion_with_err()  ???:0
 2 0x0000000000042972 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x0000000000024c1a ucp_worker_progress()  ???:0
 4 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 5 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 6 0x0000000000232b77 mca_pml_ucx_send()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:944
 7 0x00000000000d7c32 ompi_coll_base_sendrecv_actual()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.c:58
 8 0x00000000000d707b ompi_coll_base_sendrecv()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.h:133
 9 0x000000000010ced0 ompi_coll_tuned_allgatherv_intra_dec_fixed()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:1363
10 0x000000000016697a mca_fcoll_vulcan_file_write_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/fcoll/vulcan/fcoll_vulcan_file_write_all.c:418
11 0x00000000000c2b39 mca_common_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/common/ompio/common_ompio_file_write.c:452
12 0x00000000001aff57 mca_io_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/io/ompio/io_ompio_file_write.c:174
13 0x00000000000aaaae PMPI_File_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mpi/c/profile/pfile_write_at_all.c:75
14 0x000000000016fbc2 ncmpio_read_write()  ???:0
15 0x000000000016a7f6 mgetput()  ncmpio_wait.c:0
16 0x00000000001682bc req_aggregation()  ncmpio_wait.c:0
17 0x0000000000169e40 wait_getput()  ncmpio_wait.c:0
18 0x00000000001661a4 req_commit()  ncmpio_wait.c:0
19 0x0000000000166a0c ncmpio_wait()  ???:0
20 0x00000000000b727a ncmpi_wait_all()  ???:0
21 0x000000000046c5f9 flush_output_buffer()  ???:0
22 0x000000000042dc5e sync_file()  pio_file.c:0
23 0x000000000042df88 PIOc_closefile()  ???:0
24 0x0000000000414355 __piolib_mod_MOD_closefile()  ???:0
25 0x000000000040dd1c pioperformance_rearrtest.4019()  pioperformance_rearr.F90:0
26 0x000000000040ad15 MAIN__()  pioperformance_rearr.F90:0
27 0x0000000000411b49 main()  ???:0
28 0x00000000000237b3 __libc_start_main()  ???:0
29 0x000000000040a48e _start()  ???:0
=================================

A lock file was generated in addition to the expected output .nc file:

pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc
pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc-2399076352-875649.lock

dqwu commented Apr 19, 2022

This test needs to be run on a supercomputer or Linux cluster, as it requires 512 MPI tasks.

Load an Open MPI module for testing, if one is available. If the system does not have an Open MPI module, build one from source and then run:
export PATH=/path/to/openmpi-installation/bin:$PATH

Build and install the PnetCDF library with Open MPI (if no module is available), e.g. into /path/to/pnetcdf-installation.
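
A minimal sketch of such a build, assuming a PnetCDF 1.11.x source tree and the Open MPI compiler wrappers on PATH (the MPICC/MPICXX/MPIF77/MPIF90 configure variables are the usual PnetCDF options; all paths are placeholders):

cd pnetcdf-1.11.0
# Point configure at the Open MPI compiler wrappers
./configure --prefix=/path/to/pnetcdf-installation \
MPICC=mpicc MPICXX=mpicxx MPIF77=mpif77 MPIF90=mpif90
make -j8
make install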

Steps to build the test

git clone --branch dqwu/test_openmpi https://github.com/E3SM-Project/scorpio.git
cd scorpio

tar zxf ne30_F_case_48602x72_512p.tar.gz

mkdir build
cd build

CC=mpicc CXX=mpicxx FC=mpifort \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/path/to/pnetcdf-installation \
-DPIO_USE_MALLOC=ON \
-DPIO_ENABLE_TESTS=ON \
..

make
cd tests/performance
make pioperf_rearr

cp ../../../ne30_F_case_48602x72_512p.dat ./

Submit a batch job or use an interactive job to execute the following command:

mpirun -n 512 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

@brminich (Member) commented:

@dqwu, is there any error like Transport retry count exceeded on ...?
Also, are these errors seen on specific nodes, or can the failing nodes vary from run to run?
In general this looks like a networking issue (DC communication failed, which led to the process termination and CMA errors from its peers).


dqwu commented Apr 20, 2022


@brminich Yes, the log file also contains this error message:
[chr-0495:3089585:0:3089585] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)

The failing nodes can vary from run to run, as this issue was reproduced with both the debug queue and the compute queue.

@jsquyres (Member) commented:

This sounds like it is turning into a UCX issue...? FYI @open-mpi/ucx

@hoopoepg (Contributor) commented:

Hi, thank you for the bug report.

It seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighboring rank (on the same node) may still try to communicate with the terminated rank via the SHM transport using CMA and gets an error from the process_vm_readv call.

Is it possible to get logs from the same run in which both the Transport retry count exceeded and process_vm_readv messages appear?

Thank you.

bwbarrett modified the milestones: v4.1.4, v4.1.5 on May 25, 2022

dqwu commented Jun 23, 2022


@hoopoepg I have rerun the test and the full log can be viewed at the following URL:
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.193176.out

@hoopoepg (Contributor) commented:

Hi @dqwu,
we don't see timestamps on the error messages... was the log cut by a grep-like utility?


dqwu commented Jun 27, 2022


@hoopoepg test_openmpi.193176.out is the original output log of the Slurm job submitted to run this test. No changes have been made to this log file.

#!/bin/bash

#SBATCH -t 00:05:00
#SBATCH -A e3sm
#SBATCH --job-name=test_openmpi
#SBATCH --nodes=8
#SBATCH --output=test_openmpi.%j.out
#SBATCH --error=test_openmpi.%j.out
#SBATCH --partition=debug

cd $SLURM_SUBMIT_DIR

source /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/init/sh
module purge 
module load gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

@hoopoepg (Contributor) commented:

Hmm, timestamps are missing... not good.
OK, could you run the application with these variables:
UCX_TLS=ib
UCX_LOG_LEVEL=info

and check whether the Transport retry count exceeded error is present in the logs (and post the logs here)?

thank you
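
In the Slurm job script shown above, this could look like the following (a sketch; the srun command and file names are copied from that script):

export UCX_TLS=ib
export UCX_LOG_LEVEL=info

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

# Afterwards, search the job output for the transport error
grep "Transport retry count exceeded" test_openmpi.*.out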


dqwu commented Jun 28, 2022


Here is the latest log file:
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195239.out


yosefe commented Jun 29, 2022

@dqwu can you please try the following (separate) experiments to help understand the problem:

  1. Set UCX_TLS=rc to disable the DC transport, to see if the issue is DC-related
  2. Set UCX_RC_MAX_RD_ATOMIC=1 to see if it's related to exceeding the max number of RDMA_READ operations on the QP
  3. Set UCX_RC_PATH_MTU=1024 to see if it's related to different MTU settings
  4. Check the FW version with ibv_devinfo -v | grep fw_ver to see if a FW upgrade could help
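
One way to run these as separate Slurm jobs is to reuse the job script above and export exactly one of the variables per run (a sketch):

# Experiment 1: restrict UCX to the RC transport (disables DC)
export UCX_TLS=rc
# Experiment 2, in a separate run: limit outstanding RDMA_READ operations on the QP
# export UCX_RC_MAX_RD_ATOMIC=1
# Experiment 3, in a separate run: force a 1024-byte path MTU
# export UCX_RC_PATH_MTU=1024

# ...then launch with the same srun command as in the job script above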


dqwu commented Jun 29, 2022


All three of these experiments failed; please see the log files.

[UCX_TLS=rc]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195609.out

[UCX_RC_MAX_RD_ATOMIC=1]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195610.out

[UCX_RC_PATH_MTU=1024]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195611.out

FW version

ibv_devinfo -v | grep fw_ver
	fw_ver:				20.29.2002
	fw_ver:				16.27.2008
	fw_ver:				16.27.2008


yosefe commented Jun 29, 2022

@dqwu can you please also try with UCX_TLS=sm,tcp and UCX_TLS=tcp, and post the full output of ibv_devinfo -v?


yosefe commented Jun 29, 2022

Since it fails on TCP, it seems not related to the IB fabric.
Can you try these separate experiments:

  1. -mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx to run Open MPI without the UCX component
    In this case, you may need to add -mca btl_tcp_if_include <netif> to specify the Ethernet device to use
  2. -x UCX_TLS=tcp -mca coll ^hcoll to disable the hcoll collective component


dqwu commented Jun 29, 2022

The test is run with Slurm srun instead of mpirun. Should I use environment variables?
[-mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx]
export OMPI_MCA_pml=ob1,^ucx
export OMPI_MCA_btl=self,vader,tcp
Is this correct?

[-x UCX_TLS=tcp -mca coll ^hcoll]
export UCX_TLS=tcp
export OMPI_MCA_coll=^hcoll
Is this correct?

@yosefe
Copy link
Contributor

yosefe commented Jun 29, 2022

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp

export OMPI_MCA_coll=^hcoll
export UCX_TLS=tcp


dqwu commented Jun 29, 2022


These two tests still failed.

[OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195638.out

[OMPI_MCA_coll=^hcoll UCX_TLS=tcp]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195640.out

@edgargabriel (Member) commented:

Can you try two things as well to change some settings for the parallel I/O part?

  1. Force using a different fcoll component, e.g.
    export OMPI_MCA_fcoll=dynamic

  2. Force using a different io component, e.g.
    export OMPI_MCA_io=^ompio
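
With the Slurm setup used here, these can again be set as environment variables before srun, one experiment per run (a sketch; excluding ompio should make Open MPI fall back to its ROMIO-based io component):

# Experiment 1: keep ompio but select a different collective I/O (fcoll) algorithm
export OMPI_MCA_fcoll=dynamic

# Experiment 2, in a separate run: exclude the ompio component entirely
# export OMPI_MCA_io=^ompio

# ...then launch with the same srun command as in the job script above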


dqwu commented Jun 29, 2022

@edgargabriel
Copy link
Member

OK, thank you. For now you can use this last flag as a workaround. Is there a way to reproduce this issue with a smaller process count as well? That would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt the problem is in the fcoll component itself; it is more likely at another level (e.g. setting the file view or similar).


dqwu commented Jun 29, 2022

@edgargabriel I do not have a simpler reproducer so far. You can try to follow the detailed steps provided in the first comment.

edgargabriel self-assigned this on Jul 8, 2022
@jayeshkrishna commented:

Just checking: has this issue been fixed? (If not, do you have an approximate timeline for when it will be fixed?)


edgargabriel commented Nov 15, 2022

I am really sorry, but I don't have a platform to reproduce the issue, and parallel I/O is no longer part of my job description at my new position, so my resources for resolving this issue are somewhat limited. I would recommend sticking with the workaround mentioned previously in this ticket.
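
For reference, the workaround referred to here appears to be the io-component switch from the earlier experiments, set in the environment before launching the job:

# Bypass OMPIO; Open MPI should then fall back to its ROMIO-based io component
export OMPI_MCA_io=^ompio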
