
Conversation

FlorentGermain-Bull
Contributor

@FlorentGermain-Bull commented Jun 20, 2022

Coll/base/coll_base_alltoall.c:ompi_coll_base_alltoall_intra_bruck
FIX: switch tmpbuf from sdtype to rdtype representation to avoid illegal use of sdtype
on rbuf and wrong behaviour when sdtype and rdtype are different

OPTIM: use ompi_datatype_create_indexed instead of
ompi_datatype_create_indexed_block to reduce the number of temporary buffers used

Signed-off-by: Florent Germain florent.germain@atos.net

@ompiteam-bot

Can one of the admins verify this patch?

@ggouaillardet
Contributor

ok to test

@ggouaillardet
Contributor

@FlorentGermain-Bull Thanks for the patch!
Can you please share a test program that demonstrates the issue?
I will then add it to our test suite.

@FlorentGermain-Bull
Contributor Author

There is a bug in the modified Bruck alltoall algorithm in the coll/base module. It is algorithm 3 when using tuned dynamic rules, and tuned chooses it by default on some configurations.

Here are the symptoms when sdtype and rdtype have different type maps:

  • The result is wrong
  • Memory not covered by the datatype description of rbuf can be altered during the collective communication

Here is a reproducer that triggers this bug:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* For each destination i, store three ints: my_rank, -my_rank, i */
    int my_values[size*3];
    for(int i = 0; i < size; i++)
    {
        my_values[3*i] = my_rank;
        my_values[3*i+1] = -my_rank;
        my_values[3*i+2] = i;
    }
    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, SEND = [%d %d %d].\n", my_rank, my_values[3*i], my_values[3*i+1], my_values[3*i+2]);
        }
    }

    int buffer_recv[size*2];

    /* The send datatype picks 2 ints with a stride of 2, i.e. elements 0 and 2
     * of each 3-int block (my_rank and the destination rank). */
    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Send 1 vector per destination, receive 2 contiguous ints per source. */
    MPI_Alltoall(my_values, 1, vector, buffer_recv, 2, MPI_INT, MPI_COMM_WORLD);

    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, RECV = [%d %d].\n", my_rank, buffer_recv[2*i], buffer_recv[2*i+1]);
        }
    }

    MPI_Type_free(&vector);
    MPI_Finalize();

    return EXIT_SUCCESS;
}

For each destination, the send buffer holds three ints (sender, -sender, receiver); the vector send datatype selects the first and third of them, so each message carries (sender, receiver).
The receive buffer holds two contiguous ints per source: (sender, receiver).
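
As a side note (a standalone check, not part of the PR or of the reproducer above), the two datatypes carry the same payload, two MPI_INTs, but have different extents; this sdtype/rdtype mismatch is what the Bruck implementation mishandled:

/* Standalone check, not part of the PR: compare the size and extent of the
 * vector send type used in the reproducer with the contiguous receive side. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);   /* same sdtype as the reproducer */
    MPI_Type_commit(&vector);

    int vsize, isize;
    MPI_Aint lb, vextent, iextent;
    MPI_Type_size(vector, &vsize);                /* payload: 2 ints               */
    MPI_Type_get_extent(vector, &lb, &vextent);   /* extent: 3 int slots           */
    MPI_Type_size(MPI_INT, &isize);
    MPI_Type_get_extent(MPI_INT, &lb, &iextent);

    printf("sdtype: size=%d extent=%ld; rdtype side (2 x MPI_INT): size=%d extent=%ld\n",
           vsize, (long) vextent, 2 * isize, (long) (2 * iextent));

    MPI_Type_free(&vector);
    MPI_Finalize();
    return 0;
}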

Before the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [1 0].   
Process 1, RECV = [1 0].   
Process 1, RECV = [3 32765].
Process 1, RECV = [0 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

After the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1]. 

Detailed bug behaviour:
To save memory, the classic Bruck exchanges are done between tmpbuf and rbuf.
tmpbuf was laid out using sdtype, and the exchanges were done through an indexed datatype (new_ddt) derived from sdtype.
This new_ddt was used on both tmpbuf and rbuf, which is an illegal use of sdtype on rbuf (symptom 1).
At the end of the collective communication, the copy from tmpbuf to rbuf was done using rdtype on both sides, which is an illegal use of rdtype on tmpbuf (symptom 2).

Correction:
Use the rdtype representation for tmpbuf.
All communications are then based on rdtype.
Use ompi_datatype_sndrcv instead of ompi_datatype_copy_content_same_ddt to perform the copy from sbuf to tmpbuf.
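
As a standalone illustration (using only the public MPI API; this is not the PR code, the internal ompi_datatype_sndrcv plays the corresponding role inside the collective), a local MPI_Sendrecv converts data between a send description and a different receive description, which a plain same-datatype copy cannot do:

/* Standalone illustration, not the PR code: a sendrecv-style local copy
 * converts between two datatypes with the same signature but different
 * layouts, which is what ompi_datatype_sndrcv() provides internally. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int src[3] = { 7, -7, 3 };    /* one send block: sender, -sender, receiver */
    int dst[2] = { 0, 0 };

    MPI_Datatype vector;          /* selects src[0] and src[2] */
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Local copy with datatype conversion: send as the vector type,
     * receive as 2 contiguous ints. */
    MPI_Sendrecv(src, 1, vector, 0, 0,
                 dst, 2, MPI_INT, 0, 0,
                 MPI_COMM_SELF, MPI_STATUS_IGNORE);

    printf("dst = [%d %d]\n", dst[0], dst[1]);   /* expected: [7 3] */

    MPI_Type_free(&vector);
    MPI_Finalize();
    return 0;
}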

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from cd11939 to 0c82187 on June 20, 2022 10:23
@FlorentGermain-Bull marked this pull request as ready for review on June 20, 2022 10:32
@awlauria
Contributor

bot:aws:retest

@awlauria
Contributor

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

@FlorentGermain-Bull
Contributor Author

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

The bug is already present in the v2.0.4 release.
I think this fix can be applied to all supported versions.

@bosilca
Member

bosilca commented Jun 22, 2022

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. Correctness: the incorrect use of rdtype in Step 3. A simpler fix would have been to replace the use in Step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type for the temporary buffer).
  2. Optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, which reduces the number of temporary buffers used in the operation.

Please change the text of the PR and the commit log to reflect the proposed changes.

@FlorentGermain-Bull
Contributor Author

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. Correctness: the incorrect use of rdtype in Step 3. A simpler fix would have been to replace the use in Step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type for the temporary buffer).
  2. Optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, which reduces the number of temporary buffers used in the operation.

Please change the text of the PR and the commit log to reflect the proposed changes.

Regarding point 1, the use of an sdtype-based datatype on rbuf in Step 2 is also erroneous. If we keep tmpbuf in the sdtype representation, we also need to create a second new_ddt based on rdtype and replace ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv inside the loop, which would lead to a performance drop.

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from 0c82187 to 56ce62d on June 23, 2022 07:10
@bosilca
Member

bosilca commented Jun 24, 2022

I have another optimization that basically removes the need for the displs array. Instead of using MPI-level functions to create the intermediary datatypes, it uses OMPI's internal capabilities to create specialized datatypes with a gap at the beginning. Unfortunately, I do not have write access to your repo, so I cannot push.

Here is a quick manual patch: replace the create_indexed_block call with the following code and remove the displs array.

        int vlen = (distance * 2) < size ? distance : size - distance;
        struct ompi_datatype_t *ddt;
        ompi_datatype_create_vector( 1,
                                     vlen,
                                     vlen * 2,
                                     sdtype, &ddt);
        new_ddt = ompi_datatype_create( ddt->super.desc.used );
        ompi_datatype_add(new_ddt, ddt, (size + distance - 1) / (distance * 2), distance * sext, vlen * 2 * sext);

        /* Commit the new datatype */
        err = ompi_datatype_commit(&new_ddt);
        if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl;  }

Using these routines we can create datatypes in a more flexible way than with the MPI API, which means we can build the datatype representing the Bruck exchange buffer on the fly and skip Step 1 (the local rotation).

    FIX: switch tmpbuf from sdtype to rdtype representation to avoid illegal use of sdtype
         on rbuf and wrong behaviour when sdtype and rdtype are different

    OPTIM: build custom datatype through ompi_datatype_add instead of using
           ompi_datatype_create_indexed_block to reduce the number of temporary buffers used

Signed-off-by: Florent Germain <florent.germain@atos.net>

@FlorentGermain-Bull force-pushed the fix_coll_base_alltoall_bruck branch from 56ce62d to 99ca572 on July 5, 2022 05:58
@awlauria
Contributor

awlauria commented Jul 5, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@awlauria
Contributor

awlauria commented Jul 5, 2022

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@awlauria
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jsquyres
Member

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

@FlorentGermain-Bull
Contributor Author

FlorentGermain-Bull commented Jul 28, 2022

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

Yes, it is OK for me.

@bosilca, is the change in the datatype construction OK for you?

@gpaulsen
Member

@bosilca ^^^

@markalle
Contributor

I like sndrcv() for the data copying, and I wish an analogue of ompi_datatype_add() were part of the MPI standard; that's a nice API option to have. It looks good to me, with one caveat: shouldn't new_ddt get freed (or deleted/destroyed, but I'm thinking freed in this context)?
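
If the temporary datatype does indeed need to be released, a minimal sketch following the error-handling pattern of the snippet above (internal OMPI API, shown as a fragment rather than runnable code; err, line, err_hndl and new_ddt come from the surrounding implementation) would be:

        /* Release the per-iteration temporary datatype once the exchange
         * that uses it has completed. */
        err = ompi_datatype_destroy(&new_ddt);
        if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }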

@bosilca merged commit 7cb3ec0 into open-mpi:main on Aug 24, 2022