
mca_io_ompio_file_write_at_all() failed during parallel write of PnetCDF lib #10297

Open
dqwu opened this issue Apr 19, 2022 · 24 comments


dqwu commented Apr 19, 2022

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.3

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Spack installation

Please describe the system on which you are running

  • Operating system/version: CentOS 8
  • Computer hardware: AMD Epyc 7532 processors (32 cores per CPU, 2.4 GHz)
  • Network type: N.A.

Details of the problem

This issue occurs on a machine used by E3SM (e3sm.org):
https://e3sm.org/model/running-e3sm/supported-machines/chrysalis-anl

modules used: gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

The test to reproduce this issue was run with 512 MPI tasks on 8 nodes (64 tasks per node). The issue is also reproducible with modules built with the Intel compiler.

With the same test, the issue is not reproducible with Intel MPI on the same machine.

Error messages and backtrace in the output log file:

[1650397594.556260] [chr-0496:1587709:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587708 length=524288) returned -1: Bad address
[1650397596.163612] [chr-0500:609796:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609795 length=524288) returned -1: No such process
...
[1650397606.028410] [chr-0496:1587764:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=1587763 length=524288) returned -1: No such process
srun: error: chr-0494: tasks 0,9,11,25: Killed
...
[1650397612.452233] [chr-0500:609772:0]         cma_ep.c:87   UCX  ERROR process_vm_readv(pid=609771 length=333312) returned -1: No such process
==== backtrace (tid:3089585) ====
 0 0x0000000000055969 ucs_debug_print_backtrace()  ???:0
 1 0x00000000000200a9 uct_ib_mlx5_completion_with_err()  ???:0
 2 0x0000000000042972 uct_dc_mlx5_ep_handle_failure()  ???:0
 3 0x0000000000024c1a ucp_worker_progress()  ???:0
 4 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 5 0x0000000000232b77 mca_pml_ucx_send_nbr()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:923
 6 0x0000000000232b77 mca_pml_ucx_send()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/pml/ucx/pml_ucx.c:944
 7 0x00000000000d7c32 ompi_coll_base_sendrecv_actual()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.c:58
 8 0x00000000000d707b ompi_coll_base_sendrecv()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/base/coll_base_util.h:133
 9 0x000000000010ced0 ompi_coll_tuned_allgatherv_intra_dec_fixed()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:1363
10 0x000000000016697a mca_fcoll_vulcan_file_write_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/fcoll/vulcan/fcoll_vulcan_file_write_all.c:418
11 0x00000000000c2b39 mca_common_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/common/ompio/common_ompio_file_write.c:452
12 0x00000000001aff57 mca_io_ompio_file_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mca/io/ompio/io_ompio_file_write.c:174
13 0x00000000000aaaae PMPI_File_write_at_all()  /tmp/svcbuilder/spack-stage-openmpi-4.1.3-sxfyy4knvddpewshfcc45heice7tzs7f/spack-src/ompi/mpi/c/profile/pfile_write_at_all.c:75
14 0x000000000016fbc2 ncmpio_read_write()  ???:0
15 0x000000000016a7f6 mgetput()  ncmpio_wait.c:0
16 0x00000000001682bc req_aggregation()  ncmpio_wait.c:0
17 0x0000000000169e40 wait_getput()  ncmpio_wait.c:0
18 0x00000000001661a4 req_commit()  ncmpio_wait.c:0
19 0x0000000000166a0c ncmpio_wait()  ???:0
20 0x00000000000b727a ncmpi_wait_all()  ???:0
21 0x000000000046c5f9 flush_output_buffer()  ???:0
22 0x000000000042dc5e sync_file()  pio_file.c:0
23 0x000000000042df88 PIOc_closefile()  ???:0
24 0x0000000000414355 __piolib_mod_MOD_closefile()  ???:0
25 0x000000000040dd1c pioperformance_rearrtest.4019()  pioperformance_rearr.F90:0
26 0x000000000040ad15 MAIN__()  pioperformance_rearr.F90:0
27 0x0000000000411b49 main()  ???:0
28 0x00000000000237b3 __libc_start_main()  ???:0
29 0x000000000040a48e _start()  ???:0
=================================

A lock file was generated in addition to the expected output .nc file:

pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc
pioperf-rearr-2-ncomptasks-00512-niotasks-00512-stride-00001-iotype-1-nframes-00001-nvars-00500.nc-2399076352-875649.lock

dqwu commented Apr 19, 2022

This test needs to be run on a supercomputer or Linux cluster, as it requires 512 MPI tasks.

Load an Open MPI module for testing, if one is available. If the system does not have an Open MPI module, build one from source and then run:
export PATH=/path/to/openmpi-installation/bin:$PATH

Build and install the PnetCDF library with Open MPI (if no module is available), e.g. into /path/to/pnetcdf-installation.
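
A minimal sketch of such a build, assuming a PnetCDF 1.11.x source tree and the Open MPI compiler wrappers on PATH (the MPICC/MPICXX/MPIF77/MPIF90 configure variables are the usual PnetCDF options; all paths are placeholders):

cd pnetcdf-1.11.0
# Point configure at the Open MPI compiler wrappers
./configure --prefix=/path/to/pnetcdf-installation \
MPICC=mpicc MPICXX=mpicxx MPIF77=mpif77 MPIF90=mpif90
make -j8
make install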

Steps to build the test

git clone --branch dqwu/test_openmpi https://github.com/E3SM-Project/scorpio.git
cd scorpio

tar zxf ne30_F_case_48602x72_512p.tar.gz

mkdir build
cd build

CC=mpicc CXX=mpicxx FC=mpifort \
cmake -Wno-dev \
-DWITH_NETCDF=OFF \
-DPnetCDF_PATH=/path/to/pnetcdf-installation \
-DPIO_USE_MALLOC=ON \
-DPIO_ENABLE_TESTS=ON \
..

make
cd tests/performance
make pioperf_rearr

cp ../../../ne30_F_case_48602x72_512p.dat ./

Submit a batch job or use an interactive job to execute the following command:

mpirun -n 512 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

@brminich (Member) commented:

@dqwu, is there any error like Transport retry count exceeded on ...?
Also, are these errors seen on specific nodes, or can the failing nodes vary from run to run?
In general this looks like a networking issue (DC communication failed, which led to the process termination and CMA errors from its peers).


dqwu commented Apr 20, 2022


@brminich Yes, the log file also contains this error message:
[chr-0495:3089585:0:3089585] ib_mlx5_log.c:143 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)

The failing nodes can vary from run to run, as this issue was reproduced with both the debug queue and the compute queue.

@jsquyres (Member) commented:

This sounds like it is turning into a UCX issue...? FYI @open-mpi/ucx

@hoopoepg (Contributor) commented:

Hi, thank you for the bug report.

It seems UCX detects an error on the network and reports it to OMPI. As a result, OMPI terminates the failed rank, but a neighboring rank (on the same node) may still try to communicate with the terminated rank via the SHM transport using CMA and gets an error from the process_vm_readv call.

Is it possible to get logs from the same run in which both the Transport retry count exceeded and process_vm_readv messages appear?

Thank you.

bwbarrett modified the milestones: v4.1.4, v4.1.5 on May 25, 2022

dqwu commented Jun 23, 2022


@hoopoepg I have rerun the test and the full log can be viewed at the following URL:
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.193176.out

@hoopoepg (Contributor) commented:

Hi @dqwu,
we don't see timestamps on the error messages... was the log cut by a grep-like utility?


dqwu commented Jun 27, 2022


@hoopoepg test_openmpi.193176.out is the original output log of the Slurm job submitted to run this test. No changes have been made to this log file.

#!/bin/bash

#SBATCH -t 00:05:00
#SBATCH -A e3sm
#SBATCH --job-name=test_openmpi
#SBATCH --nodes=8
#SBATCH --output=test_openmpi.%j.out
#SBATCH --error=test_openmpi.%j.out
#SBATCH --partition=debug

cd $SLURM_SUBMIT_DIR

source /gpfs/fs1/soft/chrysalis/spack/opt/spack/linux-centos8-x86_64/gcc-9.3.0/lmod-8.3-5be73rg/lmod/lmod/init/sh
module purge 
module load gcc/9.2.0-ugetvbp openmpi/4.1.3-sxfyy4k parallel-netcdf/1.11.0-mirrcz7

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

@hoopoepg (Contributor) commented:

Hmm, timestamps are missing... not good.
OK, could you run the application with these variables:
UCX_TLS=ib
UCX_LOG_LEVEL=info

and check whether the Transport retry count exceeded error is present in the logs (and post the logs here)?

thank you
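
In the Slurm job script shown above, this could look like the following (a sketch; the srun command and file names are copied from that script):

export UCX_TLS=ib
export UCX_LOG_LEVEL=info

srun -n 512 -N 8 ./pioperf_rearr --pio-niotasks=512 --pio-nframes=1 --pio-nvars=500 \
--pio-decompfiles=\"./ne30_F_case_48602x72_512p.dat\" --pio-types=pnetcdf --pio-rearrangers=2

# Afterwards, search the job output for the transport error
grep "Transport retry count exceeded" test_openmpi.*.out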


dqwu commented Jun 28, 2022


Here is the latest log file:
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195239.out


yosefe commented Jun 29, 2022

@dqwu can you please try the following (separate) experiments to help understand the problem:

  1. Set UCX_TLS=rc to disable the DC transport, to see if the issue is DC-related
  2. Set UCX_RC_MAX_RD_ATOMIC=1 to see if it's related to exceeding the max number of RDMA_READ operations on the QP
  3. Set UCX_RC_PATH_MTU=1024 to see if it's related to different MTU settings
  4. Check the FW version with ibv_devinfo -v | grep fw_ver to see if a FW upgrade could help
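
One way to run these as separate Slurm jobs is to reuse the job script above and export exactly one of the variables per run (a sketch):

# Experiment 1: restrict UCX to the RC transport (disables DC)
export UCX_TLS=rc
# Experiment 2, in a separate run: limit outstanding RDMA_READ operations on the QP
# export UCX_RC_MAX_RD_ATOMIC=1
# Experiment 3, in a separate run: force a 1024-byte path MTU
# export UCX_RC_PATH_MTU=1024

# ...then launch with the same srun command as in the job script above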


dqwu commented Jun 29, 2022


All three of these experiments failed; please see the log files.

[UCX_TLS=rc]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195609.out

[UCX_RC_MAX_RD_ATOMIC=1]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195610.out

[UCX_RC_PATH_MTU=1024]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195611.out

FW version

ibv_devinfo -v | grep fw_ver
	fw_ver:				20.29.2002
	fw_ver:				16.27.2008
	fw_ver:				16.27.2008


yosefe commented Jun 29, 2022

@dqwu can you please also try with UCX_TLS=sm,tcp and UCX_TLS=tcp, and post the full output of ibv_devinfo -v?


yosefe commented Jun 29, 2022

Since it fails on TCP, it seems not related to the IB fabric.
Can you try these separate experiments:

  1. -mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx to run Open MPI without the UCX component
    In this case, you may need to add -mca btl_tcp_if_include <netif> to specify the Ethernet device to use
  2. -x UCX_TLS=tcp -mca coll ^hcoll to disable the hcoll collective component


dqwu commented Jun 29, 2022

The test is run with Slurm srun instead of mpirun. Should I use environment variables?
[-mca pml ob1 -mca btl self,vader,tcp -mca pml ^ucx]
export OMPI_MCA_pml=ob1,^ucx
export OMPI_MCA_btl=self,vader,tcp
Is this correct?

[-x UCX_TLS=tcp -mca coll ^hcoll]
export UCX_TLS=tcp
export OMPI_MCA_coll=^hcoll
Is this correct?

@yosefe
Copy link
Contributor

yosefe commented Jun 29, 2022

export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=self,vader,tcp

export OMPI_MCA_coll=^hcoll
export UCX_TLS=tcp


dqwu commented Jun 29, 2022


These two tests still failed.

[OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,vader,tcp]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195638.out

[OMPI_MCA_coll=^hcoll UCX_TLS=tcp]
https://raw.githubusercontent.com/E3SM-Project/scorpio/dqwu/test_openmpi/test_openmpi.195640.out

@edgargabriel (Member) commented:

Can you try two things as well to change some settings for the parallel I/O part?

  1. Force using a different fcoll component, e.g.
    export OMPI_MCA_fcoll=dynamic

  2. Force using a different io component, e.g.
    export OMPI_MCA_io=^ompio
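
With the Slurm setup used here, these can again be set as environment variables before srun, one experiment per run (a sketch; excluding ompio should make Open MPI fall back to its ROMIO-based io component):

# Experiment 1: keep ompio but select a different collective I/O (fcoll) algorithm
export OMPI_MCA_fcoll=dynamic

# Experiment 2, in a separate run: exclude the ompio component entirely
# export OMPI_MCA_io=^ompio

# ...then launch with the same srun command as in the job script above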


dqwu commented Jun 29, 2022

@edgargabriel
Copy link
Member

OK, thank you. For now you can use this last flag as a workaround. Is there a way to reproduce this issue with a smaller process count as well? That would help me a lot in trying to identify the problem. Since two fcoll components (vulcan, dynamic) show this issue, I doubt the problem is in the fcoll component itself; it is more likely at another level (e.g. setting the file view or similar).


dqwu commented Jun 29, 2022

@edgargabriel I do not have a simpler reproducer so far. You can try to follow the detailed steps provided in the first comment.

edgargabriel self-assigned this on Jul 8, 2022
@jayeshkrishna commented:

Just checking: has this issue been fixed? (If not, do you have an approximate timeline for when it will be fixed?)


edgargabriel commented Nov 15, 2022

I am really sorry, but I don't have a platform to reproduce the issue, and parallel I/O is no longer part of my job description at my new position, so my resources for resolving this issue are somewhat limited. I would recommend sticking with the workaround mentioned previously in this ticket.
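
For reference, the workaround referred to here appears to be the io-component switch from the earlier experiments, set in the environment before launching the job:

# Bypass OMPIO; Open MPI should then fall back to its ROMIO-based io component
export OMPI_MCA_io=^ompio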
