
MPI_File_read_at_all doesn't read all values when certain datatypes are used #10546

Closed

fortnern opened this issue Jul 7, 2022 · 10 comments
fortnern commented Jul 7, 2022

Background information

While testing a new feature in HDF5, we noticed that OpenMPI sometimes fails in a test that generates random I/O patterns in HDF5. I've tried to create a minimal pure MPI example that shows the failure.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested on 4.1.1, 4.1.2, 4.1.3, 4.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

Tested on multiple Linux machines, running on a single node.


Details of the problem

The test program below should be run with 3 ranks. It consists of 2 phases. First, all 3 ranks participate in a collective write: rank 0 writes 5 integers to the file, while ranks 1 and 2 write nothing. Next, the file is closed and reopened, and all 3 ranks participate in a collective read. Rank 0 reads all 5 integers without constructing a datatype, and rank 2 reads nothing. Rank 1 constructs a datatype using MPI_Type_contiguous, MPI_Type_vector, MPI_Type_create_hindexed, and MPI_Type_create_resized to select the middle 3 integers (this series of calls mimics how HDF5 constructs a datatype for hyperslab selections). Ranks 1 and 2 see the expected data, but rank 0 does not read the last integer. This test passes with MPICH.

I have verified that the file contains the correct data (check it with a hex editor if you want): the test also fails if you run it once to create the file, then comment out the write section and run it again. If the program is run with 2 ranks it passes, even though the excluded rank (rank 2) does nothing but participate in the collective calls. If rank 2 is modified to read the entire file like rank 0, then both ranks 0 and 2 fail to read the last integer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <mpi.h>

#define TEST_ERROR \
do { \
    printf("FAILED at line %d\n", __LINE__); \
    abort(); \
} while(0)

/* Writes data to a shared file, with the data in the file consisting of 5
 * contiguous integers, then reads it back collectively, and verifies that the
 * data read is correct */
/* Run with 3 ranks */

int main(int argc, char **argv) {
    MPI_Datatype elmt_type;
    MPI_Datatype new_type;
    MPI_Datatype tmp_type;
    MPI_File fh;
    MPI_Status mpi_stat;
    MPI_Count bytes_written;
    MPI_Count bytes_read;
    MPI_Count type_size;
    int buf[5];
    int rank;
    int count;
    int i;
    int ret_val = 0;

    MPI_Init(&argc, &argv);

    if(MPI_SUCCESS != MPI_Comm_rank(MPI_COMM_WORLD, &rank))
        TEST_ERROR;

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /* Create type for int */
    if (MPI_SUCCESS != MPI_Type_contiguous((int)sizeof(int), MPI_BYTE, &elmt_type))
        TEST_ERROR;

    /* Fill write buffer */
    for(i = 0; i < 5; i++)
        buf[i] = i + 1;

    /* Set up datatypes for write */
    if(rank == 0) {
        count = 5 * sizeof(int);
    }
    else {
        assert(rank == 1 || rank == 2);
        count = 0;
    }

    /* Perform collective write operation */
    if (MPI_SUCCESS != MPI_File_write_at_all(fh, 0, buf, count, MPI_BYTE, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes written */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, MPI_BYTE, &bytes_written))
        TEST_ERROR;
    if(bytes_written != count)
        TEST_ERROR;

    /*-------- CLOSE/REOPEN --------*/

    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Barrier(MPI_COMM_WORLD);

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /*-------- READ --------*/

    /* Set up datatypes for read */
    if(rank == 0) {
        new_type = MPI_BYTE;
        count = 5 * sizeof(int);
    }
    else if(rank == 1) {
        int      block_len = 1;
        MPI_Aint start_disp = 1 * sizeof(int);

        /* Single block of length 3 (x sizeof(int)) */
        if(MPI_SUCCESS != MPI_Type_vector(1,          /* count */
                                          3,          /* blocklength */
                                          1,          /* stride */
                                          elmt_type,  /* old type */
                                          &tmp_type)) /* new type */
            TEST_ERROR;

        /* Offset the block to the correct location (1 int to the right) */
        if(MPI_SUCCESS != MPI_Type_create_hindexed(1, &block_len, &start_disp, tmp_type, &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);
        tmp_type = new_type;

        /* Resize the type to cover the entire data block (5 ints) */
        if(MPI_SUCCESS != MPI_Type_create_resized(tmp_type, 0, 5 * sizeof(int), &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);

        if (MPI_SUCCESS != MPI_Type_commit(&new_type))
            TEST_ERROR;
        count = 1;
    }
    else {
        new_type = MPI_BYTE;
        count = 0 * sizeof(int);
    }

    /* Set file view */
    if (MPI_SUCCESS != MPI_File_set_view(fh, 0, MPI_BYTE, new_type,
                                                         "native", MPI_INFO_NULL))
        TEST_ERROR;

    /* Perform collective read operation */
    memset(buf, 0, sizeof(buf));
    if (MPI_SUCCESS != MPI_File_read_at_all(fh, 0, buf, count, new_type, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes read */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, new_type, &bytes_read))
        TEST_ERROR;
    if (MPI_SUCCESS != MPI_Type_size_x(new_type, &type_size))
        TEST_ERROR;
    if(bytes_read != count * type_size)
        TEST_ERROR;

    /* Free the derived datatype */
    if(rank == 1)
        MPI_Type_free(&new_type);

    /*-------- VERIFY --------*/

    if(rank == 0) {
        printf("rank 0 expected: {1, 2, 3, 4, 5} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 1 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 5) {
            printf("rank 0 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 0 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 1) {
        printf("rank 1 expected: {0, 2, 3, 4, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 0) {
            printf("rank 1 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 1 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 2) {
        printf("rank 2 expected: {0, 0, 0, 0, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 0 || buf[2] != 0 || buf[3] != 0 || buf[4] != 0) {
            printf("rank 2 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 2 verification SUCCEEDED\n");
    }

    /*-------- CLOSE --------*/

    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Type_free(&elmt_type);

    MPI_Finalize();
    return ret_val;
}
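
Assuming the reproducer is saved as `test.c` (filename hypothetical; any MPI compiler wrapper should work), it can be built and run as:

```shell
# Build the reproducer and run it with the 3 ranks it expects
mpicc test.c -o a.out
mpirun -n 3 ./a.out
```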

fortnern commented Jul 7, 2022

Forgot to add: If all ranks read the entire file then the program passes. This implies that the complicated datatype on rank 1 is interfering with I/O on other ranks (but not the rank with the complicated datatype).

jsquyres commented Jul 7, 2022

@edgargabriel Can you have a look?

ggouaillardet (Contributor) commented

@fortnern thanks for the report and the reproducer!

Meanwhile, you can force the ROMIO component to be used instead of the (native) ompio one:

mpirun --mca io ^ompio -n 3 a.out

edgargabriel self-assigned this Jul 8, 2022
edgargabriel (Member) commented

I will have a look, but it might take a couple of days until I get to it.


fortnern commented Jul 8, 2022

mpirun --mca io ^ompio -n 3 a.out

Test passes with ROMIO, looks like the problem is with OMPIO. Thanks!

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jul 12, 2022
This commit fixes the calculation of the buffer length that
needs to be read when using data sieving. The original code implicitly
assumed that the ub of an iov at index j+1 is larger than
the ub of the iov at index j. This is not necessarily the case
for read operations. Hence, the code needs to keep track of the maximum
ub found.

Fixes issue open-mpi#10546

Signed-off-by: Edgar Gabriel <edgar.gabriel1@outlook.com>
edgargabriel (Member) commented

@fortnern thank you for the bug report. I can confirm that it was a bug in ompio, introduced with the data sieving feature in the 4.1 release series. I will file a PR for the 4.1 and 5.0 releases once the code has been committed to the main repository. The 4.0 and earlier ompi releases did not have this feature.

Just for documentation purposes, it is possible to circumvent this issue in the 4.1 release by setting the fbtl_posix_read_datasieving MCA parameter to 0, e.g.

 mpirun --mca fbtl_posix_read_datasieving 0 -np 3 ./testcode
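
Equivalently (an alternative not mentioned in the thread), Open MPI MCA parameters can be set through environment variables using the OMPI_MCA_ prefix, which avoids changing the mpirun command line:

```shell
# Disable the posix fbtl component's read data sieving via the environment
export OMPI_MCA_fbtl_posix_read_datasieving=0
mpirun -np 3 ./testcode
```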

edgargabriel added further commits to edgargabriel/ompi referencing this issue on Jul 12, 13, and 18, 2022, with the same commit message (the Jul 18 commits are cherry-picks of commit 6891cee).
jsquyres (Member) commented

@fortnern Looks like the fix was merged into the v4.1.x branch -- could you test the latest v4.1.x snapshot to see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.1.x/

jhendersonHDF commented

Hi all, I just tested @fortnern's test program here, as well as some internal test programs, with the v4.1.x-202207280243-88bb627 snapshot. It runs successfully for me on a Slackware Linux system, whereas the programs failed for my 4.1.1 through 4.1.4 installations. We're planning to do some more internal testing but at first glance it seems @edgargabriel's PR fixed this for us. Thanks!

fortnern (Author) commented

We've confirmed the test program works on the latest snapshot. Thanks! We'll test the HDF5 branch in question with it soon to verify.

fortnern (Author) commented

The multi-dataset branch of HDF5 that was failing with 4.1.2 passes with the latest master. Thanks again!

5 participants