
New MPI_SCATTER bug in OpenMPI5 #12383

Closed
dl1ycf opened this issue Feb 29, 2024 · 9 comments · Fixed by #12415

Comments

dl1ycf commented Feb 29, 2024

OpenMPI 5.0.2

An MPI_SCATTER fails if we have more than one node, each running more than one process, and the sendcount argument passed on the slave
side does not match the one passed on the root. The following FORTRAN program demonstrates the error: if run as follows

rank 0 on node1
rank 1 on node1
rank 2 on node2
rank 3 on node2
rank 4 on node3
rank 5 on node3

the result on ranks 2, 3, 4, and 5 is wrong. If run with one rank per node it is OK.
Changing the second parameter (sendcount) of the second (client) MPI_SCATTER call from 0 to 1000
makes it work in either case.

P.S.: this was not a problem with OpenMPI 4.

Yours,

Christoph van Wüllen

c=========================================================
      program  testmpi
      use mpi
      implicit none
c 
      integer rank, size
      integer status(MPI_STATUS_SIZE) 
      character(len=MPI_MAX_PROCESSOR_NAME) host
c
      real*8, dimension(:), allocatable :: source
      real*8, dimension(:), allocatable :: receive
      real*8  dum(1)
c
      integer i, k, len, ierr
cvw
cvw   Start-up
cvw
      rank=-1
      size=-1
cvw
cvw   initialize MPI, check how many workers there are (size)
cvw   and who we are (rank)
cvw   note rank may take the values 0, 1, 2, ..., size-1
cvw
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
cvw
cvw
cvw
      if (rank .eq. 0) then
cvw
cvw     we are the 'master'
cvw
        allocate(source(size*1000), receive(1000))
        do i=1,size*1000
          source(i)=i
        end do
      else
cvw
cvw     we are slave, allocate only the receive buffer
cvw
        allocate(receive(1000))
      endif
cvw
cvw   Perform and check MPI_SCATTER. Note "source" only exists on the master
cvw
      if (rank .eq. 0) then 
         call MPI_SCATTER(source, 1000, MPI_REAL8, receive, 1000,
     &                    MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
         if (ierr .ne. 0) stop 'Scatter:Master'
      else
cvw
cvw      my understanding is that the first three parameters here do not
cvw      matter!
cvw
         call MPI_SCATTER(dum,       0, MPI_REAL8, receive, 1000,
     &                    MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
         if (ierr .ne. 0) stop 'Scatter:Slave'
      endif
cvw
cvw   If the scatter succeeds, then rank #i holds in its receive buffer
cvw   receive(k) = (1000*i)+k
cvw
      ierr=0
      do k=1,1000
        if (abs(receive(k) - 1000*rank -k) .gt. 1d-8) ierr=1
      end do
      if (ierr .ne. 0) then
        write (6,120) rank
      else
        write (6,130) rank
      endif
120   format('SCATTER WRONG: ', i4)
130   format('SCATTER CORRECT: ', i4)
      call MPI_FINALIZE( ierr ) 
      end 
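
For reference, here is one way to build and launch the reproducer with the rank placement described above. This is only a sketch: it assumes the Open MPI compiler wrappers are installed, the source is saved as testmpi.f, the hosts are reachable as node1/node2/node3, and the binary sits on a shared filesystem; the mapping options may need adjusting for your configuration.

mpifort -o testmpi testmpi.f
mpirun -np 6 --host node1:2,node2:2,node3:2 --map-by slot ./testmpi

With two slots per host and slot mapping, ranks 0-1 land on node1, 2-3 on node2, and 4-5 on node3, matching the layout in the report.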

ggouaillardet (Contributor)

Thanks for the report!

Meanwhile, as a workaround you can run with mpirun --mca coll ^han ...
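
Spelled out against the reproducer above (hypothetical host names and executable, same rank layout as in the report):

mpirun --mca coll ^han -np 6 --host node1:2,node2:2,node3:2 --map-by slot ./testmpi

The ^ prefix tells the MCA framework to exclude the han component, so the scatter falls back to another coll implementation.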

dl1ycf (Author) commented Feb 29, 2024

Please drop a note here whether or not you can confirm the bug. My understanding is that for the non-root MPI procs,
the first two arguments of MPI_SCATTER should not even be looked at, but here they make a day-and-night difference!

ggouaillardet (Contributor)

The issue can be reproduced with 2 nodes and 4 MPI tasks.
Indeed, non-root ranks might access sendcount and sendtype when performing the scatter on sub-communicators.

The inline patch below is just enough to fix this issue, but similar problems likely occur in other parts of the module:

diff --git a/ompi/mca/coll/han/coll_han_scatter.c b/ompi/mca/coll/han/coll_han_scatter.c
index 6f65d7d427..ce180d6ff9 100644
--- a/ompi/mca/coll/han/coll_han_scatter.c
+++ b/ompi/mca/coll/han/coll_han_scatter.c
@@ -242,9 +242,16 @@ int mca_coll_han_scatter_ls_task(void *task_args)
                          t->w_rank));
     OBJ_RELEASE(t->cur_task);
 
-    t->low_comm->c_coll->coll_scatter((char *) t->sbuf, t->scount, t->sdtype, (char *) t->rbuf,
-                                      t->rcount, t->rdtype, t->root_low_rank, t->low_comm,
-                                      t->low_comm->c_coll->coll_scatter_module);
+    if (t->root == t->w_rank) {
+        t->low_comm->c_coll->coll_scatter((char *) t->sbuf, t->scount, t->sdtype, (char *) t->rbuf,
+                                          t->rcount, t->rdtype, t->root_low_rank, t->low_comm,
+                                          t->low_comm->c_coll->coll_scatter_module);
+    } else {
+        t->low_comm->c_coll->coll_scatter((char *) t->sbuf, t->rcount, t->rdtype, (char *) t->rbuf,
+                                          t->rcount, t->rdtype, t->root_low_rank, t->low_comm,
+                                          t->low_comm->c_coll->coll_scatter_module);
+    }
+
 
     if (t->sbuf_inter_free != NULL && t->noop != true) {
         free(t->sbuf_inter_free);

ggouaillardet (Contributor)

@dl1ycf yes, I was able to reproduce the issue. I share your analysis: non-root ranks should ignore send buffer/count/type, so this is a bug in the coll/han component.

dl1ycf (Author) commented Feb 29, 2024

Do you assume that recvcount on the non-root ranks equals sendcount on the root rank? I read the manual as allowing recvcount to be larger (if recvbuf is longer than needed).

But I can confirm that coll=^han saves me.

ggouaillardet (Contributor)

No, and that would be an incorrect assumption. The correct requirement is that the (count, type) signatures match; that is why I used rcount instead of scount and rdtype instead of sdtype.
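
To spell that out with a minimal C sketch (rank, root, nprocs and sendbuf are assumed to be set up elsewhere, mpi.h included): the root's (sendcount, sendtype) signature must equal every rank's (recvcount, recvtype) signature, while the send arguments on non-root ranks are not significant, so a count of 0 there is fine.

/* per-rank payload: 1000 doubles */
double recvbuf[1000];
if (rank == root) {
    /* sendbuf holds nprocs * 1000 doubles, but only on the root */
    MPI_Scatter(sendbuf, 1000, MPI_DOUBLE,
                recvbuf, 1000, MPI_DOUBLE, root, MPI_COMM_WORLD);
} else {
    /* sendbuf, sendcount and sendtype are not significant on non-root ranks */
    MPI_Scatter(NULL, 0, MPI_DOUBLE,
                recvbuf, 1000, MPI_DOUBLE, root, MPI_COMM_WORLD);
}

Unlike a point-to-point receive, the collective does not allow a non-root rank to specify a larger receive signature than what the root sends to it.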

wenduwan (Contributor) commented Mar 6, 2024

Related to MPI_Scatter, AWS has a pending implementation for MPI_Scatterv #12376

@jiaxiyan Could you please confirm that the scatterv impl does not have this issue?

wenduwan (Contributor) commented Mar 6, 2024

@ggouaillardet Since you have already figured out the issue, would it be possible for you to open a patch for this? That would be wonderful :)

bosilca (Member) commented Mar 6, 2024

@ggouaillardet nailed it. The issue here is that leaves promoted to leaders in the node-level communicators should receive the data as rdtype/rcount and then propagate it as rdtype/rcount, with the exception of the original root, which will use the original buffer and thus must use sdtype/scount.
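
For readers who want that rule spelled out next to the inline patch, here is a compilable paraphrase with hypothetical names (low_scatter_count_type is not part of Open MPI; the parameters mirror the han task fields used in the diff):

#include <mpi.h>

/* Pick the (count, datatype) that describes the buffer a node leader
 * hands to the intra-node (low_comm) scatter. */
static void low_scatter_count_type(int w_rank, int root,
                                   int scount, MPI_Datatype sdtype,
                                   int rcount, MPI_Datatype rdtype,
                                   int *count, MPI_Datatype *dtype)
{
    if (w_rank == root) {
        /* the original root still holds the user's send buffer,
         * described by the user's scount/sdtype */
        *count = scount;
        *dtype = sdtype;
    } else {
        /* a promoted leader holds data it already received as rcount x rdtype */
        *count = rcount;
        *dtype = rdtype;
    }
}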

wenduwan pushed a commit to wenduwan/ompi that referenced this issue Mar 18, 2024
Fixes open-mpi#12383

Thanks to Christoph van Wüllen for reporting the issue.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
wenduwan pushed a commit to wenduwan/ompi that referenced this issue Mar 20, 2024
Fixes open-mpi#12383

Thanks to Christoph van Wüllen for reporting the issue.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
wenduwan pushed a commit to wenduwan/ompi that referenced this issue Mar 22, 2024
Fixes open-mpi#12383

Thanks to Christoph van Wüllen for reporting the issue.

Signed-off-by: Wenduo Wang <wenduwan@amazon.com>
(cherry picked from commit a58e884)