
MPI_File_read_at_all doesn't read all values when certain datatypes are used #10546

Closed

fortnern opened this issue Jul 7, 2022 · 10 comments
fortnern commented Jul 7, 2022

Background information

While testing a new feature in HDF5, we noticed that OpenMPI sometimes fails in a test that generates random I/O patterns in HDF5. I've tried to create a minimal pure MPI example that shows the failure.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Tested on 4.1.1, 4.1.2, 4.1.3, 4.1.4

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

N/A

Please describe the system on which you are running

Tested on multiple Linux machines, running on a single node.


Details of the problem

The test program below should be run with 3 ranks. It consists of 2 phases. First, all 3 ranks participate in a collective write: rank 0 writes 5 integers to the file, while ranks 1 and 2 write nothing. Next, the file is closed and reopened, and all 3 ranks participate in a collective read. Rank 0 reads all 5 integers without constructing a datatype, and rank 2 reads nothing. Rank 1 constructs a datatype using MPI_Type_contiguous, MPI_Type_vector, MPI_Type_create_hindexed, and MPI_Type_create_resized to select the middle 3 integers (this series of calls mimics how HDF5 constructs a datatype for hyperslab selections). Ranks 1 and 2 see the expected data, but rank 0 does not read the last integer. This test passes with MPICH.

I have verified that the file contains the correct data (check it with a hex editor if you want): the test also fails if you run it once to create the file, then comment out the write section and run it again. If the program is run with 2 ranks it passes, even though the excluded rank (rank 2) does nothing but participate in the collective calls. If rank 2 is modified to read the entire file like rank 0, then both ranks 0 and 2 fail to read the last integer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <mpi.h>

#define TEST_ERROR \
do { \
    printf("FAILED at line %d\n", __LINE__); \
    abort(); \
} while(0)

/* Writes data to a shared file, with the data in the file consisting of 5
 * contiguous integers, then reads it back collectively, and verifies that the
 * data read is correct */
/* Run with 3 ranks */

int main(int argc, char **argv) {
    MPI_Datatype elmt_type;
    MPI_Datatype new_type;
    MPI_Datatype tmp_type;
    MPI_File fh;
    MPI_Status mpi_stat;
    MPI_Count bytes_written;
    MPI_Count bytes_read;
    MPI_Count type_size;
    int buf[5];
    int rank;
    int count;
    int i;
    int ret_val = 0;

    MPI_Init(&argc, &argv);

    if(MPI_SUCCESS != MPI_Comm_rank(MPI_COMM_WORLD, &rank))
        TEST_ERROR;

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /* Create type for int */
    if (MPI_SUCCESS != MPI_Type_contiguous((int)sizeof(int), MPI_BYTE, &elmt_type))
        TEST_ERROR;

    /* Fill write buffer */
    for(i = 0; i < 5; i++)
        buf[i] = i + 1;

    /* Set up datatypes for write */
    if(rank == 0) {
        count = 5 * sizeof(int);
    }
    else {
        assert(rank == 1 || rank == 2);
        count = 0;
    }

    /* Perform collective write operation */
    if (MPI_SUCCESS != MPI_File_write_at_all(fh, 0, buf, count, MPI_BYTE, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes written */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, MPI_BYTE, &bytes_written))
        TEST_ERROR;
    if(bytes_written != count)
        TEST_ERROR;

    /*-------- CLOSE/REOPEN --------*/

    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Barrier(MPI_COMM_WORLD);

    if(MPI_SUCCESS != MPI_File_open(MPI_COMM_WORLD, "topenmpi.mpio", MPI_MODE_RDWR, MPI_INFO_NULL, &fh))
        TEST_ERROR;

    /*-------- READ --------*/

    /* Set up datatypes for read */
    if(rank == 0) {
        new_type = MPI_BYTE;
        count = 5 * sizeof(int);
    }
    else if(rank == 1) {
        int      block_len = 1;
        MPI_Aint start_disp = 1 * sizeof(int);

        /* Single block of length 3 (x sizeof(int)) */
        if(MPI_SUCCESS != MPI_Type_vector(1,          /* count */
                                          3,          /* blocklength */
                                          1,          /* stride */
                                          elmt_type,  /* old type */
                                          &tmp_type)) /* new type */
            TEST_ERROR;

        /* Offset the block to the correct location (1 int to the right) */
        if(MPI_SUCCESS != MPI_Type_create_hindexed(1, &block_len, &start_disp, tmp_type, &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);
        tmp_type = new_type;

        /* Resize the type to cover the entire data block (5 ints) */
        if(MPI_SUCCESS != MPI_Type_create_resized(tmp_type, 0, 5 * sizeof(int), &new_type))
            TEST_ERROR;
        MPI_Type_free(&tmp_type);

        if (MPI_SUCCESS != MPI_Type_commit(&new_type))
            TEST_ERROR;
        count = 1;
    }
    else {
        new_type = MPI_BYTE;
        count = 0 * sizeof(int);
    }

    /* Set file view */
    if (MPI_SUCCESS != MPI_File_set_view(fh, 0, MPI_BYTE, new_type,
                                                         "native", MPI_INFO_NULL))
        TEST_ERROR;

    /* Perform collective read operation */
    memset(buf, 0, sizeof(buf));
    if (MPI_SUCCESS != MPI_File_read_at_all(fh, 0, buf, count, new_type, &mpi_stat))
        TEST_ERROR;

    /* Check the number of bytes read */
    if (MPI_SUCCESS != MPI_Get_elements_x(&mpi_stat, new_type, &bytes_read))
        TEST_ERROR;
    if (MPI_SUCCESS != MPI_Type_size_x(new_type, &type_size))
        TEST_ERROR;
    if(bytes_read != count * type_size)
        TEST_ERROR;

    /* Free the derived datatype */
    if(rank == 1)
        MPI_Type_free(&new_type);

    /*-------- VERIFY --------*/

    if(rank == 0) {
        printf("rank 0 expected: {1, 2, 3, 4, 5} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 1 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 5) {
            printf("rank 0 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 0 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 1) {
        printf("rank 1 expected: {0, 2, 3, 4, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 2 || buf[2] != 3 || buf[3] != 4 || buf[4] != 0) {
            printf("rank 1 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 1 verification SUCCEEDED\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if(rank == 2) {
        printf("rank 2 expected: {0, 0, 0, 0, 0} read: {%d, %d, %d, %d, %d}\n", buf[0], buf[1], buf[2], buf[3], buf[4]);
        if(buf[0] != 0 || buf[1] != 0 || buf[2] != 0 || buf[3] != 0 || buf[4] != 0) {
            printf("rank 2 verification FAILED\n");
            ret_val = 1;
        }
        else
            printf("rank 2 verification SUCCEEDED\n");
    }

    /*-------- CLOSE --------*/

    if(MPI_SUCCESS != MPI_File_close(&fh))
        TEST_ERROR;

    MPI_Type_free(&elmt_type);

    MPI_Finalize();
    return ret_val;
}
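
Assuming the reproducer is saved as `test.c` (filename hypothetical; any MPI compiler wrapper should work), it can be built and run as:

```shell
# Build the reproducer and run it with the 3 ranks it expects
mpicc test.c -o a.out
mpirun -n 3 ./a.out
```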

fortnern commented Jul 7, 2022

Forgot to add: If all ranks read the entire file then the program passes. This implies that the complicated datatype on rank 1 is interfering with I/O on other ranks (but not the rank with the complicated datatype).

jsquyres commented Jul 7, 2022

@edgargabriel Can you have a look?

ggouaillardet (Contributor) commented

@fortnern thanks for the report and the reproducer!

Meanwhile, you can force the ROMIO component to be used instead of the (native) ompio one:

mpirun --mca io ^ompio -n 3 a.out

edgargabriel self-assigned this Jul 8, 2022
edgargabriel (Member) commented

I will have a look, but it might take a couple of days until I get to it.


fortnern commented Jul 8, 2022

mpirun --mca io ^ompio -n 3 a.out

Test passes with ROMIO, looks like the problem is with OMPIO. Thanks!

edgargabriel added a commit to edgargabriel/ompi that referenced this issue Jul 12, 2022
This commit fixes the calculation of the buffer length that
needs to be read when using data sieving. The original code implicitly
assumed that the ub of an iov at index j+1 is larger than
the ub of the iov at index j. This is not necessarily the case
for read operations. Hence, the code needs to keep track of the maximum
ub found.

Fixes issue open-mpi#10546

Signed-off-by: Edgar Gabriel <edgar.gabriel1@outlook.com>
edgargabriel (Member) commented

@fortnern thank you for the bug report. I can confirm that it was a bug in ompio, introduced with the data sieving feature in the 4.1 release series. I will file a PR for the 4.1 and 5.0 releases once the code has been committed to the main repository. The 4.0 and earlier ompi releases did not have this feature.

Just for documentation purposes, it is possible to circumvent this issue in the 4.1 release by setting the fbtl_posix_read_datasieving MCA parameter to 0, e.g.

 mpirun --mca fbtl_posix_read_datasieving 0 -np 3 ./testcode
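
Equivalently (an alternative not mentioned in the thread), Open MPI MCA parameters can be set through environment variables using the OMPI_MCA_ prefix, which avoids changing the mpirun command line:

```shell
# Disable the posix fbtl component's read data sieving via the environment
export OMPI_MCA_fbtl_posix_read_datasieving=0
mpirun -np 3 ./testcode
```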

edgargabriel added further commits to edgargabriel/ompi referencing this issue on Jul 12, 13, and 18, 2022, with the same commit message (the Jul 18 commits are cherry-picks of commit 6891cee).
jsquyres (Member) commented

@fortnern Looks like the fix was merged into the v4.1.x branch -- could you test the latest v4.1.x snapshot to see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.1.x/

jhendersonHDF commented

Hi all, I just tested @fortnern's test program here, as well as some internal test programs, with the v4.1.x-202207280243-88bb627 snapshot. It runs successfully for me on a Slackware Linux system, whereas the programs failed for my 4.1.1 through 4.1.4 installations. We're planning to do some more internal testing but at first glance it seems @edgargabriel's PR fixed this for us. Thanks!

fortnern (Author) commented

We've confirmed the test program works on the latest snapshot. Thanks! We'll test the HDF5 branch in question with it soon to verify.

fortnern (Author) commented

The multi-dataset branch of HDF5 that was failing with 4.1.2 passes with the latest master. Thanks again!

5 participants