Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

romio: corruption with a simple test using ADIOS2 BP5 engine #6182

Closed
dqwu opened this issue Sep 19, 2022 · 21 comments
Closed

romio: corruption with a simple test using ADIOS2 BP5 engine #6182

dqwu opened this issue Sep 19, 2022 · 21 comments
Assignees

Comments

@dqwu
Copy link

dqwu commented Sep 19, 2022

This issue is only reproducible with MPICH 4.x (not reproducible with MPICH 3.x or OpenMPI).
Below are detailed steps to reproduce it.

[Build and install MPICH 4.0.2]

wget https://www.mpich.org/static/downloads/4.0.2/mpich-4.0.2.tar.gz
tar zxf mpich-4.0.2.tar.gz
cd mpich-4.0.2
CFLAGS="-g" ./configure --prefix=/path/to/mpich/installation
make -j4
make install

[Add installed MPICH to $PATH]
export PATH=/path/to/mpich/installation/bin:$PATH

[Build and install ADIOS2 2.8.3]

wget https://github.com/ornladios/ADIOS2/archive/refs/tags/v2.8.3.tar.gz
tar zxf v2.8.3.tar.gz
cd ADIOS2-2.8.3
mkdir build
cd build
CC=mpicc CXX=mpicxx FC=mpif90 \
CFLAGS="-O2 -g" CPPFLAGS="-O2 -g" CXXFLAGS="-O2 -g" FCFLAGS="-O2 -g" \
cmake -DCMAKE_INSTALL_PREFIX=/path/to/adios2/installation \
-DBUILD_SHARED_LIBS=OFF -DADIOS2_BUILD_EXAMPLES=OFF -DBUILD_TESTING=OFF \
-DADIOS2_USE_Blosc=OFF \
-DADIOS2_USE_BZip2=OFF \
-DADIOS2_USE_ZFP=OFF \
-DADIOS2_USE_SZ=OFF \
-DADIOS2_USE_MGARD=OFF \
-DADIOS2_USE_PNG=OFF \
-DADIOS2_USE_DataMan=OFF \
-DADIOS2_USE_DataSpaces=OFF \
-DADIOS2_USE_MHS=OFF \
-DADIOS2_USE_ZeroMQ=OFF \
-DADIOS2_USE_HDF5=OFF \
-DADIOS2_USE_Python=OFF \
-DADIOS2_USE_Fortran=OFF \
-DADIOS2_USE_Profiling=OFF \
..
make -j4
make install

[Build and run the test program]

mpicxx test_adios_mpiio.c -O0 -g -Wall \
-DADIOS2_USE_MPI \
-I/path/to/adios2/installation/include \
`/path/to/adios2/installation/bin/adios2-config --c-libs`

mpiexec -n 4 ./a.out

[Example output]

[rank = 3], read unexpected value at index 0, expected: 30, actual: 3
[rank = 3], read unexpected value at index 1, expected: 30, actual: 0
[rank = 3], read unexpected value at index 2, expected: 30, actual: 0
[rank = 3], read unexpected value at index 3, expected: 30, actual: 0
[rank = 3], read unexpected value at index 4, expected: 30, actual: 0

Content of the test program (test_adios_mpiio.c)

#include <stdio.h>
#include <mpi.h>
#include <adios2_c.h>

int main(int argc, char* argv[])
{
  int rank;
  const int element_per_pe = 5;
  int write_buffer[element_per_pe];
  int read_buffer[element_per_pe];
  int i;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < element_per_pe; i++) {
    write_buffer[i] = rank * 10;
    read_buffer[i] = -1;
  }

  MPI_Comm dummy_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm1);

  adios2_adios *adiosH = adios2_init_mpi(MPI_COMM_WORLD);
  adios2_io *ioH = adios2_declare_io(adiosH, "test_adios.nc.bp0");
  adios2_set_engine(ioH, "BP5"); /* This issue is not reproducible if we change BP5 to BP4 */
  adios2_engine *engineH = adios2_open(ioH, "test_adios.nc.bp", adios2_mode_write);
  adios2_close(engineH);

  MPI_Comm_free(&dummy_comm1); /* This issue is not reproducible if we comment out this line */

  MPI_Comm dummy_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm2);

  MPI_Comm io_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm1);
  MPI_File fh1;
  MPI_File_open(io_comm1, "test_mpi_file1", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh1);

  MPI_Comm io_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm2);
  MPI_File fh2;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_File_open(io_comm2, "test_mpi_file2", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh2);

  /* This issue is only reproducible with MPICH 4.x (MPICH 3.x or OpenMPI works fine)
     In src/mpi/romio/adio/common/ad_write_coll.c, fd->comm (duplicated from io_comm2)
     seems to be corrupted for MPI_Isend/MPI_Irecv, such that fd->io_buf on rank 0 fails
     to receive expected data from the last rank. If we replace fd->comm with MPI_COMM_WORLD,
     all results are correct.
   */
  MPI_File_write_at_all(fh2, rank * element_per_pe * sizeof(int), write_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh2);
  MPI_Info_free(&info);
  MPI_Comm_free(&io_comm2);

  MPI_File_close(&fh1);
  MPI_Comm_free(&io_comm1);

  MPI_Comm_free(&dummy_comm2);

  MPI_File fh3;
  MPI_File_open(MPI_COMM_WORLD, "test_mpi_file2", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh3);
  MPI_File_read_at_all(fh3, rank * element_per_pe * sizeof(int), read_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  for (i = 0; i < element_per_pe; i++) {
    if (read_buffer[i] != write_buffer[i])
      printf("[rank = %d], read unexpected value at index %d, expected: %d, actual: %d\n", rank, i, write_buffer[i], read_buffer[i]);
  }
  MPI_File_close(&fh3);

  adios2_finalize(adiosH); /* This issue is not reproducible if we call adios2_finalize() immediately after adios2_close() */

  MPI_Finalize();

  return 0;
}
@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

@roblatham00 This is the simplest reproducer I have so far, which still requires linking to ADIOS2 lib. I will try to replace the ADIOS2 code in the test program with all-MPI code if I can.

  • It might be a bug with ADIOS2 BP5 engine code (not reprodubile with BP4) which creates and frees a lot of MPI communicators.
  • It might be a bug with MPICH 4.x (not reprodubile with MPICH 3.x or OpenMPI), assuming that ADIOS2 BP5 engine code calls MPI correctly.

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

Valgrind output on ANL GCE nodes (Ubuntu 20, GCC 9.4.0)

==662465== Syscall param pwrite64(buf) points to uninitialised byte(s)
==662465==    at 0x489CD1F: __libc_pwrite64 (pwrite64.c:29)
==662465==    by 0x489CD1F: pwrite (pwrite64.c:27)
==662465==    by 0x6DA550B: ADIOI_GEN_WriteContig (ad_write.c:75)
==662465==    by 0x6DA6C2F: ADIOI_Exch_and_write (ad_write_coll.c:477)
==662465==    by 0x6DA5D07: ADIOI_GEN_WriteStridedColl (ad_write_coll.c:185)
==662465==    by 0x6D7CD38: MPIOI_File_write_all (write_all.c:168)
==662465==    by 0x6D7D5F4: PMPI_File_write_at_all (write_atall.c:71)
==662465==    by 0x1733FF: main (test_adios_mpiio.c:55)
==662465==  Address 0x12135080 is 64 bytes inside a block of size 16,777,216 alloc'd
==662465==    at 0x483B7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==662465==    by 0x6DB4286: MPL_malloc (mpl_trmem.h:280)
==662465==    by 0x6DB4327: ADIOI_Malloc_fn (malloc.c:41)
==662465==    by 0x6D9CB5A: ADIO_Open (ad_open.c:135)
==662465==    by 0x6D774ED: PMPI_File_open (open.c:143)
==662465==    by 0x1733C8: main (test_adios_mpiio.c:47)

@roblatham00 roblatham00 self-assigned this Sep 19, 2022
@hzhou
Copy link
Contributor

hzhou commented Sep 19, 2022

@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the src/mpi/romio folder in the source tree. This should narrow it down to whether ROMIO or MPICH contains the breaking changes.

Is the creating of fh1 critical in the reproducer?

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

@dqwu Could you try reproducing this with 4.0.x romio + 3.4.x mpich or vice versa? I think you can do this by manually swapping the src/mpi/romio folder in the source tree. This should narrow it down to whether ROMIO or MPICH contains the breaking changes.

Is the creating of fh1 critical in the reproducer?

Thanks for the suggestion, and I will test it later. Yes, the empty file created with fh1 is required to reproduce this issue.

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

@hzhou
I tried to build MPICH 4.0.2 with ROMIO copied from MPICH 3.4.3, but there are some build errors:

...
make[3]: Entering directory '/scratch/wuda/source_install/mpich-4.0.2/src/mpi/romio'
  CC       mpi-io/close.lo
  CC       mpi-io/delete.lo
  CC       mpi-io/file_c2f.lo
  CC       mpi-io/file_f2c.lo
In file included from ./adio/include/adio.h:61:0,
                 from mpi-io/mpioimpl.h:13,
                 from mpi-io/delete.c:6:
./adio/include/romioconf.h:10:10: fatal error: mplconfig.h: No such file or directory
 #include <mplconfig.h>
          ^~~~~~~~~~~~~
compilation terminated.

mplconfig.h can be found in the src directory, though.

find . -name 'mplconfig.h'
./src/mpl/include/mplconfig.h

@hzhou
Copy link
Contributor

hzhou commented Sep 19, 2022

Let's try some hacking --

$ cd src/mpi/romio 
$ rm -rf mpl && cp -a ../../mpl .
$ make CPPFLAGS="-Impl/include -DMPL_FALLTHROUGH="
$ cd ../../..
$ make

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

Let's try some hacking --

$ cd src/mpi/romio 
$ rm -rf mpl && cp -a ../../mpl .
$ make CPPFLAGS="-Impl/include -DMPL_FALLTHROUGH="
$ cd ../../..
$ make

This workaround works, but there are some linking errors:

lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_c'
lib/.libs/libmpi.so: undefined reference to `MPI_File_iwrite_at_all_c'
...
lib/.libs/libmpi.so: undefined reference to `MPI_File_write_shared_c'
lib/.libs/libmpi.so: undefined reference toMakefile:14110: recipe for target 'src/env/mpichversion' failed
 `make[2]: *** [src/env/mpichversion] Error 1

@hzhou
Copy link
Contributor

hzhou commented Sep 19, 2022

More hacking:

$ pushd src/glue/romio
$  ./all_romio_symbols ../../mpi/romio/include/mpio.h.in
$ popd
$ make

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

$ pushd src/glue/romio
$ ./all_romio_symbols ../../mpi/romio/include/mpio.h.in
$ popd
$ make

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I will try MPICH 3.4.3 with ROMIO of MPICH 4.0.2 later.

@hzhou
Copy link
Contributor

hzhou commented Sep 19, 2022

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I guess that points the fault to the changes in ROMIO

@dqwu
Copy link
Author

dqwu commented Sep 19, 2022

Thanks, I was able to build MPICH 4.0.2 with ROMIO of MPICH 3.4.3, and this issue is not reproducible.

I guess that points the fault to the changes in ROMIO

I tried to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2, the build errors are:

make[3]: Entering directory '/scratch/wuda/source_install/mpich-3.4.3/src/mpi/romio'
  CC       mpi-io/delete.lo
  CC       mpi-io/close.lo
  CC       mpi-io/file_c2f.lo
  CC       mpi-io/file_f2c.lo
In file included from /scratch/wuda/source_install/mpich-3.4.3/src/include/mpi.h:2225:0,
                 from ./adio/include/adio.h:65,
                 from mpi-io/mpioimpl.h:13,
                 from mpi-io/delete.c:6:
/scratch/wuda/source_install/mpich-3.4.3/src/include/mpio.h:340:49: error: unknown type name ‘MPI_Datarep_conversion_function_c’; did you mean ‘MPI_Datarep_conversion_function’?
 int MPI_Register_datarep_c(const char *datarep, MPI_Datarep_conversion_function_c *read_conversion_fn,
                                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                 MPI_Datarep_conversion_function
...

Any hacking for this build issue?

@hzhou
Copy link
Contributor

hzhou commented Sep 19, 2022

Yeah, paste in src/include/mpi.h after the line #define MPI_CONVERSION_FN_NULL ((MPI_Datarep_conversion_function *)0)

typedef int (MPI_Datarep_conversion_function_c)(void *, MPI_Datatype, MPI_Count,
             void *, MPI_Offset, void *);
#define MPI_CONVERSION_FN_NULL_C ((MPI_Datarep_conversion_function_c *)0)

@dqwu
Copy link
Author

dqwu commented Sep 20, 2022

Yeah, paste in src/include/mpi.h after the line #define MPI_CONVERSION_FN_NULL ((MPI_Datarep_conversion_function *)0)

typedef int (MPI_Datarep_conversion_function_c)(void *, MPI_Datatype, MPI_Count,
             void *, MPI_Offset, void *);
#define MPI_CONVERSION_FN_NULL_C ((MPI_Datarep_conversion_function_c *)0)

Thanks, this works to build MPICH 3.4.3 with ROMIO of MPICH 4.0.2.

Unfortunately, MPICH 3.4.3 with ROMIO of MPICH 4.0.2 cannot reproduce this issue.

In summary (tested on my laptop with Ubuntu 18 and GCC 7.4.0):
MPICH 4.0.2 + default ROMIO: reproducible
MPICH 4.0.2 + ROMIO of MPICH 3.4.3: not reproducible
MPICH 3.4.3 + default ROMIO: not reproducible
MPICH 3.4.3 + ROMIO of MPICH 4.0.2: not reproducible

@dqwu
Copy link
Author

dqwu commented Sep 20, 2022

@hzhou @roblatham00
Actually MPICH 4.0b1 cannot reproduce this issue but MPICH 4.0rc1 can.

MPICH 4.0b1 + default ROMIO: not reproducible
MPICH 4.0rc1 + default ROMIO: reproducible
MPICH 4.0b1 + ROMIO of MPICH 4.0rc1: reproducible
MPICH 4.0rc1 + ROMIO of MPICH 4.0b1: not reproducible

This points the fault to the changes in ROMIO (between 4.0b1 and 4.0rc1, number of different files: 16).

It seems that this issue is caused by PR #5660. For MPICH 4.0.2, if I revert ADIOI_COLL_TAG macros (constant tag) back to myrank + i + 100 * iter or myrank + p + 100 * iter in ad_write_coll.c, this issue is not reproducible.

@dqwu
Copy link
Author

dqwu commented Sep 21, 2022

@hzhou @roblatham00
Finally I was able to find a potential bug inside ADIOS2 BP5 engine code

void MPIShmChain::HandshakeLinks_Start(helper::Comm &comm, HandshakeStruct &hs)
{
    ...
    int rank = comm.Rank();
    hs.sendToken = rank;
    ...
    if (comm.Rank() > 0) // receive from previous
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    else // rank 0 receives from last
    {
        hs.recvRequest = comm.Irecv(
            &hs.recvToken, 1, rank - 1, 0,
            "Irecv handshake with neighbor, MPIChain aggregator, at Open");
    }
    ...
}

For the 2nd comm.Irecv call, the argument "rank - 1" should be "comm.Size() - 1" (the source parameter of MPI_Irecv).

Due to this bug (maybe a typo), rank 0 will not receive the integer sendToken from rank 3 (the value is set to 3 on rank 3), and this pending data will be received later inside the ROMIO write function (on rank 3, the first int value written is always fixed as 3, which is the previous sendToken on rank 3 lost inside ADIOS2 BP5 engine code).

@dqwu
Copy link
Author

dqwu commented Sep 21, 2022

@hzhou @roblatham00
I have a simple all-MPI test program to reproduce this issue now (only reproducible with MPICH 4.0rc1 or later when a constant tag 0 is used for collective write, since ADIOS2 BP5 engine code or this test program also uses tag 0 for MPI_Irecv/MPI_Irecv).

I think this issue can be closed later, but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

#include <stdio.h>
#include <mpi.h>

void simulate_ADIOS2_BP5_engine_code()
{
  int rank;
  int size;
  MPI_Request send_request;
  MPI_Request recv_request;
  MPI_Status status1, status2;

  MPI_Comm dupcomm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm1);

  MPI_Comm dupcomm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm2);

  MPI_Comm dupcomm3;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm3);

  MPI_Comm dupcomm4;
  MPI_Comm_dup(MPI_COMM_WORLD, &dupcomm4);

  MPI_Comm_rank(dupcomm4, &rank);
  MPI_Comm_size(dupcomm4, &size);

  /* Simulate MPIShmChain::HandshakeLinks_Start() */
  int sendToken = rank * 1000; /* 3000 on rank 3 */
  int recvToken = -1;

  if (rank < size - 1) /* send to next */
    MPI_Isend(&sendToken, 1, MPI_INT, rank + 1, 0, dupcomm4, &send_request);
  else /* send to 0 to close the loop */
    MPI_Isend(&sendToken, 1, MPI_INT, 0, 0, dupcomm4, &send_request);

  if (rank > 0) /* receive from previous */
    MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);
  else /* rank 0 receives from last */
    /* Here is the bug, "rank - 1" should be "size - 1", maybe a typo?
       Since rank is 0, "rank - 1" points to an invalid source -1, such that
       sendToken from last rank will not be received in this function, which
       will be received inside MPI_File_write_at_all() below instead:
       [rank = 3], read unexpected value at index 0, expected: 30, actual: 3000
     */
    MPI_Irecv(&recvToken, 1, MPI_INT, rank - 1, 0, dupcomm4, &recv_request);

  /* Simulate MPIShmChain::HandshakeLinks_Complete() */
  MPI_Waitall(1, &send_request, &status1);
  MPI_Waitall(1, &recv_request, &status2);

  MPI_Comm_free(&dupcomm1);
  MPI_Comm_free(&dupcomm2);
  MPI_Comm_free(&dupcomm3);
  MPI_Comm_free(&dupcomm4);
}

int main(int argc, char* argv[])
{
  int rank;
  const int element_per_pe = 5;
  int write_buffer[element_per_pe];
  int read_buffer[element_per_pe];
  int i;

  MPI_Init(&argc, &argv);

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < element_per_pe; i++) {
    write_buffer[i] = rank * 10;
    read_buffer[i] = -1;
  }

  MPI_Comm dummy_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm1);

  /* Simulation of ADIOS2 BP5 engine code start */
  simulate_ADIOS2_BP5_engine_code();
  /* Simulation of ADIOS2 BP5 engine code end */

  MPI_Comm_free(&dummy_comm1); /* This issue is not reproducible if we comment out this line */

  MPI_Comm dummy_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &dummy_comm2);

  MPI_Comm io_comm1;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm1);
  MPI_File fh1;
  MPI_File_open(io_comm1, "test_mpi_file1", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh1);

  MPI_Comm io_comm2;
  MPI_Comm_dup(MPI_COMM_WORLD, &io_comm2);
  MPI_File fh2;
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "enable");
  MPI_File_open(io_comm2, "test_mpi_file2", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh2);

  MPI_File_write_at_all(fh2, rank * element_per_pe * sizeof(int), write_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  MPI_File_close(&fh2);
  MPI_Info_free(&info);
  MPI_Comm_free(&io_comm2);

  MPI_File_close(&fh1);
  MPI_Comm_free(&io_comm1);

  MPI_Comm_free(&dummy_comm2);

  MPI_File fh3;
  MPI_File_open(MPI_COMM_WORLD, "test_mpi_file2", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh3);
  MPI_File_read_at_all(fh3, rank * element_per_pe * sizeof(int), read_buffer, element_per_pe, MPI_INT, MPI_STATUS_IGNORE);
  for (i = 0; i < element_per_pe; i++) {
    if (read_buffer[i] != write_buffer[i])
      printf("[rank = %d], read unexpected value at index %d, expected: %d, actual: %d\n", rank, i, write_buffer[i], read_buffer[i]);
  }
  MPI_File_close(&fh3);

  MPI_Finalize();

  return 0;
}

@hzhou
Copy link
Contributor

hzhou commented Sep 21, 2022

@dqwu Good job locating the bug!

but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

Unfortunately -1 is MPI_PROC_NULL in MPICH, and MPI_PROC_NULL is valid use by MPI standard.

@dqwu
Copy link
Author

dqwu commented Sep 21, 2022

@dqwu Good job locating the bug!

but it will be helpful that newer versions of MPICH can detect an invalid source argument (e.g. -1) in MPI_Irecv and maybe give users a warning.

Unfortunately -1 is MPI_PROC_NULL in MPICH, and MPI_PROC_NULL is valid use by MPI standard.

Thanks for the information. As you can see, this bug is not easy to reproduce with MPICH 4.x. Actually, if we delete dupcomm1, dupcomm2, or dupcomm3 (any of them) in the latest test program, this bug cannot be reproduced (the ADIOS2 BP5 engine code creates a lot of MPI comms).

@dqwu
Copy link
Author

dqwu commented Sep 21, 2022

@hzhou @roblatham00
This issue can be closed since it turns out to be a bug of ADIOS2 BP5 engine code. PS, it has been fixed on ADIOS2 master branch, with added error checks in Comm send/recv wrapper functions to throw an error in case of bad source/dest, see ornladios/ADIOS2#3345

@hzhou
Copy link
Contributor

hzhou commented Sep 21, 2022

@dqwu Thanks for the all-MPI reproducer. It did reveal an issue in mpich. What happens is -- there is a pending send message in dupcomm4 from simulate_ADIOS2_BP5_engine_code. It lingers because there is currently no mechanism for canceling sent messages. The later io_comm2 reuses the same internal context_id since dupcomm4 was freed already. The extra communicators in the reproducer are to make that context_id collision happen.

I am not sure how to fix this yet, but ideally traffic from different communicators should never interfere.

@hzhou
Copy link
Contributor

hzhou commented Sep 21, 2022

Close this issue. The lingering send message issue is tracked in new issue #6184

@hzhou hzhou closed this as completed Sep 21, 2022
@hzhou hzhou changed the title [MPICH4] fd->comm in ADIOI_W_Exchange_data might be corrupted with a simple test using ADIOS2 BP5 engine romio: corruption with a simple test using ADIOS2 BP5 engine Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants