Async p2p bugfix #165
Conversation
LGTM. I'm approving it in advance, but please see the minor suggestions below (mostly comments to add).
int size = 0;
MPI_Type_size(call->datatype, &size);
int worldRank = localRankToGlobalRank(status.MPI_SOURCE,
    call->comm);
This is minor, but I find that the code is more readable when this argument, call->comm, is indented under the first argument (under status), as it was before.
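That is, something like:

int worldRank = localRankToGlobalRank(status.MPI_SOURCE,
                                      call->comm);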
UPDATE_REQUEST_MAP(request, MPI_REQUEST_NULL);
int flag = 0;
MPI_Status status;
MPI_Test_internal(&request, &flag, &status, false);
Can we add a comment above this, similar to your comment in this PR? This will help others understand the code better. You can either insert your entire PR comment, which is very well written, or, if you're searching for something shorter, perhaps:
// This is needed if an MPI_Isend was called earlier. Without this,
// large messages within the same node will fail under MPICH and some
// other MPIs. A previous call to MPI_Irecv caused only the metadata to be
// exchanged. So, MPI_Iprobe succeeds and MPI_Irecv will later fail, unless
// we force the sending of data via MPI_Test.
for (i = 0; i < g_world_size; i++) {
  if (g_bytesSentToUsByRank[i] > g_recvBytesByRank[i]) {
    return false;
  }
}
// No pending MPI_Isend request
Let's have a more expressive comment. Perhaps something like:
// recvMsgIntoInternalBuffer returns false if no MPI_Isend request is found:
@@ -228,8 +239,8 @@ drainSendRecv()
   while (!allDrained()) {
     sleep(1);
     int numReceived = 0;
-    // If pending MPI_Irecvs, use MPI_Test in case msg was sent.
+    // If pending MPI_Irecvs or MPI_Isend, use MPI_Test to try to complete them.
     numReceived += completePendingIrecvs();
"pending MPI_Irecvs" -> "pending MPI_Irecv"
(I know this is a previous comment. But let's polish it further now.)
"to complete them" -> "to complete it"
("A or B" is singular, and "A and B" is plural.)
#include <unistd.h>

#define MSG_SIZE 64*1024 // large message
#define MSG_SIZE 64 // small message
#define MSG_SIZE 64*1024 // large message
#ifndef MSG_SIZE
# define MSG_SIZE 64 // small message
#endif

- Some compilers will complain about re-defining MSG_SIZE without this. With this, the default version uses a large message. But I think this is a good thing, since that's what caused the original bug.
- Should we raise the large message size to 256*1024, to make it more future-proof? In the future, maybe the optimized hardware considers even 64*1024 to be a small message that can use an EAGER policy.
This is a mistake. I was thinking of commenting out one of two definitions.
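That is, presumably something like:

// #define MSG_SIZE 64*1024 // large message (uncomment for the large-message test)
#define MSG_SIZE 64 // small message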
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
Let's add a comment like:
// For testing, checkpoint this during the 5-second sleep between
// MPI_Isend (rank 1) and MPI_Irecv (rank 0).
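For illustration, a hedged sketch of the test's presumable shape, based on that comment; the buffer size, tag, and overall structure are assumptions, not the repo's actual test:

#include <mpi.h>
#include <unistd.h>

#define MSG_SIZE 64*1024 // large message

int main(int argc, char **argv)
{
  int rank;
  static char buf[MSG_SIZE];
  MPI_Request req = MPI_REQUEST_NULL;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 1) {
    MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
  }
  // For testing, checkpoint this during the 5-second sleep between
  // MPI_Isend (rank 1) and MPI_Irecv (rank 0).
  sleep(5);
  if (rank == 0) {
    MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
  }
  MPI_Wait(&req, MPI_STATUS_IGNORE);  // MPI_REQUEST_NULL on any other rank
  MPI_Finalize();
  return 0;
}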
This is a fix for issue #161. When sharing large data with asynchronous point-to-point communication, MPICH requires a handshake between the sender and the receiver. If MPI_Irecv has not been called before checkpoint time, only the metadata of the message has been shared among the MPI processes, and the data is still in the sender's local buffer. As a result, MPI_Iprobe in the p2p draining algorithm can detect the incoming message, but MPI_Irecv can't receive it. In this PR, I also call MPI_Test on pending MPI_Isend requests to complete potential pending handshakes.
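To make the mechanism concrete, a hedged sketch of the draining loop after this fix, pieced together from the diff above; allDrained() and completePendingIrecvs() appear in the diff, while completePendingIsends() is the assumed analogue sketched earlier in this thread, and the prototypes here are illustrative:

#include <unistd.h>

extern int allDrained(void);             // from the diff above
extern int completePendingIrecvs(void);  // from the diff above
extern int completePendingIsends(void);  // assumed analogue, sketched above

void drainSendRecv(void)
{
  while (!allDrained()) {
    sleep(1);
    int numReceived = 0;
    // If pending MPI_Irecv or MPI_Isend, use MPI_Test to try to complete it.
    numReceived += completePendingIrecvs();
    // Testing pending MPI_Isend requests completes the rendezvous handshake,
    // so the data leaves the sender's local buffer and the receiver's
    // MPI_Iprobe/MPI_Irecv can then see the full message.
    completePendingIsends();
    (void)numReceived;  // the real loop presumably uses this further; elided
  }
}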