
Conversation

FlorentGermain-Bull
Contributor

@FlorentGermain-Bull commented Jun 20, 2022

Coll/base/coll_base_alltoall.c:ompi_coll_base_alltoall_intra_bruck
FIX: switch tmpbuf from sdtype to rdtype representation to avoid illegal use of sdtype
on rbuf and wrong behaviour when sdtype and rdtype are different

OPTIM: use ompi_datatype_create_indexed instead of
ompi_datatype_create_indexed_block to reduce the number of temporary buffers used

Signed-off-by: Florent Germain florent.germain@atos.net

@ompiteam-bot

Can one of the admins verify this patch?

@ggouaillardet
Contributor

ok to test

@ggouaillardet
Contributor

@FlorentGermain-Bull Thanks for the patch!
Can you please share a test program that demonstrates the issue?
I will then add it to our test suite.

@FlorentGermain-Bull
Contributor Author

There is a bug in the modified Bruck alltoall algorithm in the coll/base module. It is algorithm 3 when using tuned dynamic rules, and tuned chooses it by default on some configurations.

Here are the symptoms when sdtype and rdtype have different type maps:

  • The result is wrong
  • Memory not covered by the datatype description of rbuf can be altered during the collective communication

Here is a reproducer that triggers this bug:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* For each destination i, store three ints: my_rank, -my_rank, i */
    int my_values[size*3];
    for(int i = 0; i < size; i++)
    {
        my_values[3*i] = my_rank;
        my_values[3*i+1] = -my_rank;
        my_values[3*i+2] = i;
    }
    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, SEND = [%d %d %d].\n", my_rank, my_values[3*i], my_values[3*i+1], my_values[3*i+2]);
        }
    }

    int buffer_recv[size*2];

    /* The send datatype picks 2 ints with a stride of 2, i.e. elements 0 and 2
     * of each 3-int block (my_rank and the destination rank). */
    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Send 1 vector per destination, receive 2 contiguous ints per source. */
    MPI_Alltoall(my_values, 1, vector, buffer_recv, 2, MPI_INT, MPI_COMM_WORLD);

    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, RECV = [%d %d].\n", my_rank, buffer_recv[2*i], buffer_recv[2*i+1]);
        }
    }

    MPI_Type_free(&vector);
    MPI_Finalize();

    return EXIT_SUCCESS;
}

For each destination, the send buffer holds three ints (sender, -sender, receiver); the vector send datatype selects the first and third of them, so each message carries (sender, receiver).
The receive buffer holds two contiguous ints per source: (sender, receiver).
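
As a side note (a standalone check, not part of the PR or of the reproducer above), the two datatypes carry the same payload, two MPI_INTs, but have different extents; this sdtype/rdtype mismatch is what the Bruck implementation mishandled:

/* Standalone check, not part of the PR: compare the size and extent of the
 * vector send type used in the reproducer with the contiguous receive side. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);   /* same sdtype as the reproducer */
    MPI_Type_commit(&vector);

    int vsize, isize;
    MPI_Aint lb, vextent, iextent;
    MPI_Type_size(vector, &vsize);                /* payload: 2 ints               */
    MPI_Type_get_extent(vector, &lb, &vextent);   /* extent: 3 int slots           */
    MPI_Type_size(MPI_INT, &isize);
    MPI_Type_get_extent(MPI_INT, &lb, &iextent);

    printf("sdtype: size=%d extent=%ld; rdtype side (2 x MPI_INT): size=%d extent=%ld\n",
           vsize, (long) vextent, 2 * isize, (long) (2 * iextent));

    MPI_Type_free(&vector);
    MPI_Finalize();
    return 0;
}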

Before the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [1 0].   
Process 1, RECV = [1 0].   
Process 1, RECV = [3 32765].
Process 1, RECV = [0 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

After the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1]. 

Detailed bug behaviour:
To save memory, the classic Bruck exchanges are done between tmpbuf and rbuf.
tmpbuf was laid out using sdtype, and the exchanges were done through an indexed datatype (new_ddt) derived from sdtype.
This new_ddt was used on both tmpbuf and rbuf, which is an illegal use of sdtype on rbuf (symptom 1).
At the end of the collective communication, the copy from tmpbuf to rbuf was done using rdtype on both sides, which is an illegal use of rdtype on tmpbuf (symptom 2).

Correction:
Use the rdtype representation for tmpbuf.
All communications are then based on rdtype.
Use ompi_datatype_sndrcv instead of ompi_datatype_copy_content_same_ddt to perform the copy from sbuf to tmpbuf.
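
As a standalone illustration (using only the public MPI API; this is not the PR code, the internal ompi_datatype_sndrcv plays the corresponding role inside the collective), a local MPI_Sendrecv converts data between a send description and a different receive description, which a plain same-datatype copy cannot do:

/* Standalone illustration, not the PR code: a sendrecv-style local copy
 * converts between two datatypes with the same signature but different
 * layouts, which is what ompi_datatype_sndrcv() provides internally. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int src[3] = { 7, -7, 3 };    /* one send block: sender, -sender, receiver */
    int dst[2] = { 0, 0 };

    MPI_Datatype vector;          /* selects src[0] and src[2] */
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Local copy with datatype conversion: send as the vector type,
     * receive as 2 contiguous ints. */
    MPI_Sendrecv(src, 1, vector, 0, 0,
                 dst, 2, MPI_INT, 0, 0,
                 MPI_COMM_SELF, MPI_STATUS_IGNORE);

    printf("dst = [%d %d]\n", dst[0], dst[1]);   /* expected: [7 3] */

    MPI_Type_free(&vector);
    MPI_Finalize();
    return 0;
}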

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from cd11939 to 0c82187 on June 20, 2022 10:23
@FlorentGermain-Bull marked this pull request as ready for review on June 20, 2022 10:32
@awlauria
Contributor

bot:aws:retest

@awlauria
Contributor

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

@FlorentGermain-Bull
Contributor Author

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

The bug is already present in the v2.0.4 release.
I think this fix can be applied to all supported versions.

@bosilca
Member

bosilca commented Jun 22, 2022

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. Correctness: the incorrect use of rdtype in Step 3. A simpler fix would have been to replace the use in Step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type for the temporary buffer).
  2. Optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, which reduces the number of temporary buffers used in the operation.

Please change the text of the PR and the commit log to reflect the proposed changes.

@FlorentGermain-Bull
Contributor Author

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. Correctness: the incorrect use of rdtype in Step 3. A simpler fix would have been to replace the use in Step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type for the temporary buffer).
  2. Optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, which reduces the number of temporary buffers used in the operation.

Please change the text of the PR and the commit log to reflect the proposed changes.

Regarding point 1, the use of an sdtype-based datatype on rbuf in Step 2 is also erroneous. If we keep tmpbuf in the sdtype representation, we also need to create a second new_ddt based on rdtype and replace ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv inside the loop, which would lead to a performance drop.

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from 0c82187 to 56ce62d on June 23, 2022 07:10
@bosilca
Member

bosilca commented Jun 24, 2022

I have another optimization that basically removes the need for the displs array. Instead of using MPI-level functions to create the intermediary datatypes, it uses OMPI's internal capabilities to create specialized datatypes with a gap at the beginning. Unfortunately, I do not have write access to your repo, so I cannot push.

Here is a quick manual patch: replace the create_indexed_block call with the following code and remove the displs array.

        int vlen = (distance * 2) < size ? distance : size - distance;
        struct ompi_datatype_t *ddt;
        ompi_datatype_create_vector( 1,
                                     vlen,
                                     vlen * 2,
                                     sdtype, &ddt);
        new_ddt = ompi_datatype_create( ddt->super.desc.used );
        ompi_datatype_add(new_ddt, ddt, (size + distance - 1) / (distance * 2), distance * sext, vlen * 2 * sext);

        /* Commit the new datatype */
        err = ompi_datatype_commit(&new_ddt);
        if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl;  }

Using these routines we can create datatypes in a more flexible way than with the MPI API, which means we can build the datatype representing the Bruck exchange buffer on the fly and skip Step 1 (the local rotation).

    FIX: switch tmpbuf from sdtype to rdtype representation to avoid illegal use of sdtype
         on rbuf and wrong behaviour when sdtype and rdtype are different

    OPTIM: build custom datatype through ompi_datatype_add instead of using
           ompi_datatype_create_indexed_block to reduce the number of temporary buffers used

Signed-off-by: Florent Germain <florent.germain@atos.net>

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from 56ce62d to 99ca572 on July 5, 2022 05:58
@awlauria
Contributor

awlauria commented Jul 5, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@awlauria
Contributor

awlauria commented Jul 5, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@awlauria
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jsquyres
Member

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

@FlorentGermain-Bull
Contributor Author

FlorentGermain-Bull commented Jul 28, 2022

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

Yes, it is OK for me.

@bosilca, is the change in the datatype construction OK for you?

@gpaulsen
Member

@bosilca ^^^

@markalle
Contributor

I like sndrcv() for the data copying, and I wish an analogue of ompi_datatype_add() were part of the MPI standard; that's a nice API option to have. It looks good to me, with one caveat: shouldn't new_ddt get freed (or deleted/destroyed, but I'm thinking freed in this context)?
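
If the temporary datatype does indeed need to be released, a minimal sketch following the error-handling pattern of the snippet above (internal OMPI API, shown as a fragment rather than runnable code; err, line, err_hndl and new_ddt come from the surrounding implementation) would be:

        /* Release the per-iteration temporary datatype once the exchange
         * that uses it has completed. */
        err = ompi_datatype_destroy(&new_ddt);
        if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }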

@bosilca merged commit 7cb3ec0 into open-mpi:main on Aug 24, 2022