Skip to content

Conversation

@ggouaillardet
Copy link
Contributor

No description provided.

@ggouaillardet
Copy link
Contributor Author

@edgargabriel could you please review this ?

here is a test file

#include <stdio.h>
#include <sys/types.h>
#include <fcntl.h>

#include <mpi.h>

int buf1[4] = {0, 1, 2, 3};
int buf2[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8 , 9};
int buf[4] = {-1, -1, -1, -1};

int main (int argc, char *argv[]) {
    int rank, size;
    int fd;
    MPI_File file;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    fd = open("f1.data", O_WRONLY|O_TRUNC|O_CREAT, 0600);
    write (fd, buf1, 4*sizeof(int));
    close(fd);
    fd = open("f2.data", O_WRONLY|O_TRUNC|O_CREAT, 0600);
    write (fd, buf2, 10*sizeof(int));
    close(fd);
    MPI_File_open(MPI_COMM_WORLD, "f1.data", MPI_MODE_RDONLY, MPI_INFO_NULL, &file);
    MPI_File_set_view(file, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read_ordered(file,buf,1, MPI_INT,MPI_STATUS_IGNORE);
    MPI_File_close(&file);
    if (rank != buf[0]) {
        fprintf(stderr, "%d/%d: got %d instead of %d\n", rank, size, buf[0], rank);
    }
    MPI_File_open(MPI_COMM_WORLD, "f2.data", MPI_MODE_RDONLY, MPI_INFO_NULL, &file);
    MPI_File_set_view(file, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
    MPI_File_read_ordered(file,buf,rank+1, MPI_INT,MPI_STATUS_IGNORE);
    MPI_File_close(&file);
    switch(rank) {
        case 0:
            if (0 != buf[0])
                fprintf(stderr, "%d/%d: got %d instead of %d\n", rank, size, buf[0], 0);
            break;
        case 1:
            if (1 != buf[0] || 2 != buf[1])
                fprintf(stderr, "%d/%d: got %d,%d instead of %d,%d\n", rank, size, buf[0], buf[1], 1, 2);
            break;
        case 2:
            if (3 != buf[0] || 4 != buf[1] || 5 != buf[2])
                fprintf(stderr, "%d/%d: got %d,%d,%d instead of %d,%d,%d\n", rank, size, buf[0], buf[1], buf[2], 3, 4, 5);
            break;
        case 3:
            if (6 != buf[0] || 7 != buf[1] || 8 != buf[2] || 9 != buf[3])
                fprintf(stderr, "%d/%d: got %d,%d,%d,%d instead of %d,%d,%d,%d\n", rank, size, buf[0], buf[1], buf[2], buf[3], 6, 7, 8, 9);
            break;
    }
    MPI_Finalize();
    return 0;
}

first commit fixes mpirun -np 1 a.out
second commit fixes mpirun -np 2 a.out

as a side note, ompio is failing several test from the mpi_test_suite (error or hang)
mpirun -np xxx mpi_test_suite -t io -d MPI_INT -c MPI_COMM_WORLD

@edgargabriel
Copy link
Member

@ggouaillardet can I ask what the motivation for both fixes are? I originally deliberatly set the individual component to also be selected for the size=2 case (after some performance evaluation done on two separate systems). And for the shared file pointer fix, was there a bug, i.e. the test shown above does not work? I will look into it. I will still look into them to see whether there is something new, we used to pass everything that was reasonable.

Regarding the mpi_test_suite bugs, I debugged them during the spring quite a bit, and unfortunately the tests themselves are buggy and cannot be used for correctness evaluation in many cases, e..g the test_suite sets a file view with gaps, and reads them back with the default file view (i.e. without gaps) and expects the result to be correct. That cannot work.

@edgargabriel
Copy link
Member

I see the problem. Your first fix is mostly correct, the only comment that I have is that the same change needs to be done to the sm and the other sharedfp components too.

The second fix does not truly solve the problem, it circumvents it only. There is a fundamental issue when using the individual fcoll component form shared file pointer operations. Let me think there about a solution before we commit something

@edgargabriel
Copy link
Member

ok, now I need the help of a git/gihub expert. I know that its easy to add another commit to the pr , e.g. the ones that are required for changing the other sharedfp components. We can also add a commit for the individual component (I found a simple solution). But the question is how to remove your second commit from this pr?

@jsquyres
Copy link
Member

jsquyres commented Aug 5, 2015

@edgargabriel and I are meeting via webex in a few hours to go through this stuff.

@edgargabriel
Copy link
Member

@ggouaillardet I created a new pr #779 please have a look at it, it contains your commit 3d1780f . If it works, I would suggest to close this one.

@ggouaillardet
Copy link
Contributor Author

@edgargabriel
the first commit sets a field shfileHandle->f_fh that is used by the individual fcoll module,
and that would be left uninitialized otherwise

i misunderstood about the second commit, and naively thought individual module was only good for communicators of size 1. so you are right, the second commit just hide the issue, and you fixed it in #779

yes, mpi_test_suite is broken for "non dense" datatypes (e.g. MPI_DOUBLE_INT, MPI_CHAR_INT, ...).
that is why i suggested you restrict to the MPI_INT type, (and MPI_COMM_WORLD at first) when running the io test suite

mpirun -np <n> mpi_test_suite -t io -d MPI_INT -c MPI_COMM_WORLD

with romio (e.g. --mca io_romio314_priority 100) this test completes (i tried np from 1 to 4)
with ompio, the test crashes with np=1 and hangs with np>1

from now, i will close this PR and merge #779

fwiw, here is how i use git to review PR

# git remote add edgargabriel https://github.com/edgargabriel/ompi.git
# git remote update
# git checkout topic/fcoll_fixes

in this case, it did not work because i already have topic/fcoll_fixes branches in my local and remote (ggouaillardet) repository.
so i had to do it manually ...

# git branch -D topic/fcoll_fixes
# git log edgargabriel/topic/fcoll_fixes -> you branched from master at commit 6265aaa35478652078dd837ce3e18f7118e74f44
# git checkout -b topic/fcoll_fixes 6265aaa35478652078dd837ce3e18f7118e74f44
# git merge --ff-only edgargabriel/topic/fcoll_fixes

@ggouaillardet ggouaillardet deleted the topic/fcoll_fixes branch August 7, 2015 02:06
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
…atus

yalla: fix valgrind error due to uninitialized status field.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants