Closed
Hi all,
I'm a developer of the Intel Math Kernel Library, and for our cluster components we provide OpenMPI support for our customers.
Recently we ran into an issue with sending/receiving when the sender and the receiver use different data types.
I tried several OpenMPI versions: 1.6.1 (hang), 1.8.1 and 2.1.1 (crash).
All of these versions were downloaded from the OpenMPI site and built from source.
Information about system:
OS: RHEL 7.2
CPU: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Network type: N/A, within 1 node.
You can find the reproducer below. It works fine with other MPI implementations, such as Intel MPI and MPICH. When I run it on 2 processes, I get the following error message:
[mkl:147075] *** An error occurred in MPI_Bcast
[mkl:147075] *** reported by process [3517710337,1]
[mkl:147075] *** on communicator MPI COMMUNICATOR 6 SPLIT FROM 3
[mkl:147075] *** MPI_ERR_TRUNCATE: message truncated
[mkl:147075] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mkl:147075] *** and potentially your MPI job)
#include <mpi.h>
#include <stdlib.h>

/* Rank 1 describes the buffer as m * n contiguous elements of `type`;
   every other rank describes it as a single MPI_Type_vector element. */
MPI_Datatype get_type(int m, int n, int lda, MPI_Datatype type, int rank, int *count)
{
    if (rank == 1) {
        (*count) = m * n;
        return type;
    }
    MPI_Datatype new_type;
    MPI_Type_vector(n, m, lda, type, &new_type);
    MPI_Type_commit(&new_type);
    (*count) = 1;
    return new_type;
}

int main(int argc, char** argv)
{
    int m = 1000;
    int n = 1000;
    int lda = 1000;
    int size, rank;
    int count = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float* buffer = (float*) malloc(sizeof(float) * m * n);
    MPI_Datatype my_type = get_type(m, n, lda, MPI_FLOAT, rank, &count);

    /* Same amount of data on every rank, but described differently. */
    MPI_Bcast((void*)buffer, count, my_type, 0, MPI_COMM_WORLD);

    free(buffer);
    /* Only the derived type may be freed; rank 1 holds the predefined MPI_FLOAT. */
    if (rank != 1) {
        MPI_Type_free(&my_type);
    }
    MPI_Finalize();
    return 0;
}
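To make it easier to see why the two descriptions in the reproducer should be compatible, here is a small sketch in plain C (no MPI calls) that counts what `MPI_Type_vector(count, blocklength, stride, oldtype)` covers, measured in elements of `oldtype`. The struct `vector_layout` and the functions `describe_vector`, `same_signature` and `is_contiguous` are hypothetical helpers written only for this illustration; they are not part of any MPI library. The sketch assumes both sides use the same basic type (here `MPI_FLOAT`) and a positive stride.

```c
#include <stddef.h>

/* Hypothetical helper type (not part of MPI): what a vector type covers,
 * measured in elements of the old type. */
typedef struct {
    size_t elements; /* basic elements in the type signature */
    size_t extent;   /* slots spanned from the first to the last element */
} vector_layout;

/* Mirrors the arguments of MPI_Type_vector(count, blocklength, stride, ...).
 * Assumes count >= 1 and stride >= blocklength >= 1. */
vector_layout describe_vector(size_t count, size_t blocklength, size_t stride)
{
    vector_layout v;
    v.elements = count * blocklength;
    v.extent = (count - 1) * stride + blocklength;
    return v;
}

/* With a single basic type on both sides, two (count, datatype) pairs have
 * the same type signature exactly when they carry the same element count. */
int same_signature(size_t elems_a, size_t elems_b)
{
    return elems_a == elems_b;
}

/* A vector without gaps spans exactly as many slots as it has elements. */
int is_contiguous(vector_layout v)
{
    return v.extent == v.elements;
}
```

For the values in the reproducer, `describe_vector(n, m, lda)` with `n = m = lda = 1000` gives 1,000,000 elements with no gaps, and rank 1 passes `m * n` = 1,000,000 elements of `MPI_FLOAT`. So, as far as I understand the type-matching rule for collectives, both calls to `MPI_Bcast` carry the same type signature and only the way the data is described differs.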